Machine Learning
Department of Computer Science
Iowa State University


Study Guide


Week 1 (January 12, 2004)

Overview of the course. Overview of machine learning. Why should machines learn? Operational definition of learning. Taxonomy of machine learning. Specification of a computational model of learning. Example: specification and analysis of a model for conjunctive concept learning. Role of representational and inferential bias in learning.

Inductive Learning of Pattern Classifiers. Decision Tree Induction from Examples. Occam's razor.

Brief digression on probability theory and information theory. Review of elements of probability. Probability spaces. Ontological and epistemological commitments of probabilistic representations of knowledge. Bayesian (subjective) view of probability -- probabilities as measures of belief conditioned on the agent's knowledge. Possible world interpretation of probability. Axioms of probability. Bayes theorem. Conditional probability. Random Variables. Discrete Random Variables as functions from event spaces to value sets. Possible world interpretation of random variables. Joint Probability Distributions. Conditional Probability Distributions. Conditional Independence of Random Variables. Pair-wise independence and independence. Entropy of random variables, Information, Mutual Information. Measures of distance between probability distributions -- Kullback-Leibler divergence.
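
The information-theoretic quantities above follow directly from their definitions. A minimal sketch (illustrative, not part of the course materials) computing entropy and Kullback-Leibler divergence in bits:

```python
import math

def entropy(p):
    """Shannon entropy H(P) = -sum_i p_i log2 p_i, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) = sum_i p_i log2(p_i / q_i).
    Non-negative; zero exactly when P and Q agree."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair))                 # a fair coin carries 1 bit
print(entropy(biased))               # a biased coin carries less
print(kl_divergence(biased, fair))   # positive, since the distributions differ
```

Note that entropy is maximized by the uniform distribution, which is why the fair coin yields exactly 1 bit.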

Required readings

Recommended Readings

Strongly Recommended Java Readings for those unfamiliar with Java.


Week 2 (January 19, 2004)

Learning Decision Tree Classifiers. Evaluation of classifiers -- estimation of performance measures; confidence interval calculation for estimates; cross-validation based estimates of hypothesis performance; leave-one-out and bootstrap estimates of performance; comparing two hypotheses; hypothesis testing and null hypothesis; comparing two learning algorithms.

Overfitting and methods to avoid overfitting -- dealing with small sample sizes; pre-pruning and post-pruning. Pitfalls of entropy as a splitting criterion for multi-valued splits. Alternative splitting strategies -- two-way versus multi-way splits. Alternative split criteria: Gini impurity, Entropy, etc. Cost-sensitive decision tree induction -- incorporating attribute measurement costs and misclassification costs into decision tree induction. Algorithms for Learning Decision Trees (continued). Dealing with categorical, numeric, and ordinal attributes. Dealing with missing attribute values during tree induction and instance classification.
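
The split criteria above can be compared on a toy label set. A small sketch (not course code) of entropy-based information gain and Gini impurity:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the empirical class distribution of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Gain of a split = impurity of parent - weighted impurity of children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# A two-way split that perfectly separates the classes achieves maximal gain.
parent = ['+', '+', '-', '-']
print(information_gain(parent, [['+', '+'], ['-', '-']]))  # 1.0 bit
print(gini(parent))                                        # 0.5
```

An uninformative split (each child keeps the parent's class mix) has gain 0, which is one way to see why gain-based criteria favor multi-valued attributes: many small, pure children are easy to manufacture.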

Required readings

Recommended Readings


Week 3 (Beginning January 26, 2004)

Bayesian Framework for Learning. Learning Maximum a-posteriori (MAP) and Maximum Likelihood (ML) hypotheses from data. The relationship between MAP hypothesis learning, the minimum description length principle (Occam's razor), and the role of priors. Equivalence of the ML hypothesis learner and the consistent learner for classification tasks. Equivalence of ML hypothesis learning and minimization of mean squared error for function approximation problems. ML estimation of probabilities. Bayes Optimal Classification (and how it differs from the Maximum A Posteriori (MAP) classifier). Gibbs Classifier.
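
The role of priors can be seen in the simplest possible setting: estimating a coin's bias. A sketch (illustrative only; the Beta(2, 2) prior is an arbitrary choice, not from the course notes):

```python
def ml_estimate(heads, tosses):
    """Maximum likelihood estimate of a coin's bias: the relative frequency."""
    return heads / tosses

def map_estimate(heads, tosses, a=2, b=2):
    """MAP estimate under a Beta(a, b) prior:
    (heads + a - 1) / (tosses + a + b - 2).
    The prior acts like (a - 1) + (b - 1) pseudo-observations, pulling
    small-sample estimates toward the prior mode."""
    return (heads + a - 1) / (tosses + a + b - 2)

print(ml_estimate(3, 3))    # 1.0 -- ML overfits three lucky tosses
print(map_estimate(3, 3))   # 0.8 -- the prior tempers the estimate
```

As the sample grows, the two estimates converge: the data swamps the prior, which is the usual Bayesian story behind Occam-style regularization.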

Required readings

Recommended Readings


Week 4 (Beginning February 2, 2004)

Introduction to Artificial Neural Networks. Threshold logic unit (perceptron) and the associated hypothesis space. Connection with Logic and Geometry. Weight space and pattern space representations of perceptrons. Linear separability and related concepts. Perceptron Learning algorithm and its variants. Convergence properties of perceptron algorithm. Winner-Take-All Networks. Multiplicative Update Algorithms (e.g., Winnow and Balanced Winnow).
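
The perceptron learning rule covered this week fits in a few lines. A minimal sketch (not course-provided code; the AND dataset is an illustrative choice):

```python
def perceptron_train(examples, epochs=100, lr=1.0):
    """Perceptron rule: on a mistake, add lr * y * x to the weights.
    examples: list of (x, y) with y in {+1, -1}.
    Converges in finitely many mistakes if the data are linearly separable
    (perceptron convergence theorem)."""
    d = len(examples[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                    # all examples correct: stop
            break
    return w, b

# Boolean AND is linearly separable, so the algorithm converges.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = perceptron_train(data)
```

In weight-space terms, each mistake moves the weight vector toward (away from) the misclassified positive (negative) example, which is exactly the geometric picture discussed above.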

Required Readings

Recommended Readings


Week 5 (Beginning February 9, 2004)

Introduction to neural networks as trainable function approximators. Function approximation from examples. Least Mean Squared (LMS) Error Criterion. Minimization of Error Functions. Review of Relevant Mathematics (Limits, Continuity and Differentiability of Functions, Local Minima and Maxima, Derivatives, Partial Derivatives, Taylor Series Approximation, Multi-Variate Taylor Series Approximation). Derivation of a Learning Rule for Minimizing the Mean Squared Error Function for a Simple Linear Neuron. Momentum modification for speeding up learning. Introduction to neural networks for nonlinear function approximation. Nonlinear function approximation using multi-layer neural networks. Universal function approximation theorem. Derivation of the generalized delta rule (GDR) (the backpropagation learning algorithm). Generalized delta rule (backpropagation algorithm) in practice -- avoiding overfitting, choosing neuron activation functions, choosing learning rate, choosing initial weights, speeding up learning, improving generalization, circumventing local minima, using domain-specific constraints (e.g., translation invariance in visual pattern recognition), exploiting hints, using neural networks for function approximation and pattern classification. Relationship between neural networks and Bayesian pattern classification. Variations -- Radial basis function networks. Learning nonlinear functions by searching the space of network topologies as well as weights.
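
The delta rule for a single linear neuron, derived this week by differentiating the squared error, is easy to see in code. A sketch (illustrative, assuming a noiseless 1-D target chosen for the example):

```python
def lms_train(examples, lr=0.05, epochs=500):
    """Stochastic gradient descent on E(w, b) = 1/2 * sum (y - w.x - b)^2.
    The delta rule: each weight moves by lr * error * input."""
    d = len(examples[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in examples:
            out = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = y - out                                  # negative gradient direction
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Fit y = 2x + 1 from a few noiseless samples.
samples = [((x,), 2 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
w, b = lms_train(samples)
print(round(w[0], 2), round(b, 2))  # close to 2.0 and 1.0
```

Backpropagation generalizes this same gradient computation through the layers of a multi-layer network via the chain rule.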

Required readings

Recommended Readings

Week 6 (Beginning February 16, 2004)

Bayesian Classifiers and Bayesian Networks. Minimal Error Bayes Classifier, Minimum Risk Bayes Classifier, Conditional Independence revisited. Naive Bayes Classifier and its relation to Linear Discriminant Functions. Estimation of probabilities from data -- Maximum Likelihood and Bayesian estimation of parameters, with a detailed treatment of parameter estimation for multinomial distributions using conjugate Dirichlet priors. Example -- Naive Bayes text classification. Bayesian Networks. d-separation, and compact representation of joint probability distribution functions in Bayes Networks.
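
Naive Bayes text classification with add-one smoothing (the Dirichlet(1,...,1) conjugate-prior estimate) can be sketched directly. The toy spam/ham documents below are invented for illustration, not from the course notes:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (word_list, label). Collects the class and word counts
    needed by a multinomial naive Bayes model."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, len(docs)

def classify(model, words):
    """Pick argmax over classes of log P(c) + sum_w log P(w | c),
    with Laplace-smoothed estimates (count + 1) / (total + |V|)."""
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, float('-inf')
    for label, cc in class_counts.items():
        total = sum(word_counts[label].values())
        lp = math.log(cc / n)                                   # log prior
        for w in words:
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("win cash prize".split(), "spam"),
        ("cheap cash offer".split(), "spam"),
        ("meeting agenda notes".split(), "ham"),
        ("project meeting today".split(), "ham")]
model = train_naive_bayes(docs)
print(classify(model, "cash prize offer".split()))  # spam
```

Working in log space avoids the underflow that multiplying many small probabilities would cause on real document collections.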

Required readings

Recommended Readings


Week 7 (Beginning February 23, 2004)

Bayesian Networks. Conditional Independence Revisited, d-separation, and compact representation of joint probability distribution functions in Bayes Networks. Reasoning with Bayes Networks, Some algorithms for Exact Inference, and Approximate Inference of relevant probabilities from a Bayesian network using stochastic simulation (Sampling).

Learning Bayesian Networks from Data. Learning of parameters (conditional probability tables) from fully specified instances (when no attribute values are missing) in a network of known structure -- Maximum Likelihood and Bayesian estimation of parameters from data.

Required readings

Recommended Readings


Week 8 (Beginning March 1, 2004)

Learning Bayesian networks with unknown structure -- searching the space of network topologies using suitable scoring functions to guide the search, structure learning in practice -- learning low order Bayesian networks (where each random variable directly depends on a small number of others), Bayesian approach to structure discovery, examples. Learning Bayesian network parameters in the presence of missing attribute values (using Expectation Maximization) when the structure is known; Learning networks of unknown structure in the presence of missing attribute values.

Required readings

Recommended Readings


Week 9 (Beginning March 8, 2004)

Mistake bound analysis of learning algorithms. Mistake bound analysis of online algorithms for learning Conjunctive Concepts. Optimal Mistake Bounds. Version Space Halving Algorithm. Randomized Halving Algorithm. Learning monotone disjunctions in the presence of irrelevant attributes -- the Winnow and Balanced Winnow Algorithms. Multiplicative Update Algorithms for concept learning and function approximation. Weighted majority algorithm. Applications.
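
Winnow's multiplicative updates are compact enough to sketch. The target disjunction, alpha = 2, and the sample data below are illustrative choices, not from the course notes:

```python
def winnow_train(examples, n, alpha=2.0, epochs=100):
    """Winnow for monotone disjunctions over n boolean attributes.
    Predict positive iff w.x >= n. On a false negative, multiply the weights
    of active attributes by alpha (promotion); on a false positive, divide
    them (demotion). The mistake bound is O(k log n) for a k-literal target,
    so irrelevant attributes cost only logarithmically."""
    w = [1.0] * n
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= n else 0
            if pred != y:
                mistakes += 1
                factor = alpha if y == 1 else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w

# Target concept: x0 OR x2 over n = 4 attributes (x1, x3 are irrelevant).
data = [((1, 0, 0, 0), 1), ((0, 0, 1, 1), 1), ((0, 1, 0, 1), 0),
        ((0, 0, 0, 0), 0), ((1, 1, 1, 0), 1), ((0, 1, 0, 0), 0)]
w = winnow_train(data, 4)
```

Contrast this with the perceptron's additive updates, whose mistake bound grows linearly with the number of irrelevant attributes.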

Probably Approximately Correct (PAC) Learning Model. Efficient PAC learnability. Sample Complexity of PAC Learning in terms of cardinality of hypothesis space (for finite hypothesis classes). Some Concept Classes that are easy to learn within the PAC setting.
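
The finite-hypothesis-class sample complexity bound for a consistent learner, m >= (1/epsilon)(ln|H| + ln(1/delta)), is simple to evaluate. A sketch (the conjunctions example is an illustrative instantiation):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """Number of examples sufficient for a consistent learner over a finite
    hypothesis class H to be probably (1 - delta) approximately (error
    <= epsilon) correct: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Conjunctions over n = 10 boolean attributes: |H| = 3^10, since each
# variable appears positively, negatively, or not at all.
print(pac_sample_bound(3 ** 10, epsilon=0.1, delta=0.05))  # 140
```

Note the benign dependence on |H|: the bound grows only logarithmically in the size of the hypothesis class, so exponentially large classes can still be sample-efficient.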

Required readings

Recommended Readings


Spring Break

Week 10 (Beginning March 22, 2004)

Efficiently PAC learnable concept classes. Sufficient conditions for efficient PAC learnability. Some concept classes that are not efficiently learnable in the PAC setting. Making hard-to-learn concept classes efficiently learnable -- transforming instance representation and hypothesis representation. Occam Learning Algorithms. PAC Learnability of infinite concept classes. Vapnik-Chervonenkis (VC) dimension. Properties of VC dimension, VC dimension and learnability, Learning from Noisy examples, Transforming weak learners into PAC learners through accuracy and confidence boosting, Learning under helpful distributions -- Kolmogorov Complexity, Conditional Kolmogorov Complexity, Universal distributions, Learning Simple Concepts, Learning from Simple Examples.

Required readings

Recommended Readings


Week 11 (Beginning March 29, 2004)

Support Vector Machines. Background: Dual representation of Perceptrons. A learning algorithm using the dual representation of perceptrons. Margin and geometric margin. Maximal Margin Separating Hyperplanes -- Why? Maximal Margin Separating Hyperplanes -- How? Introduction to Lagrange/Karush-Kuhn-Tucker Optimization Theory. Optimization problems. Linear, quadratic, and convex optimization problems. Primal and dual representations of optimization problems. Convex Quadratic programming formulation of the maximal margin separating hyperplane finding problem. Characteristics of the maximal margin separating hyperplane. Kernel functions for classification of non-linearly separable data. Soft margin SVM algorithms. Properties of Kernel functions. Implementation of SVM.
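
The dual representation and the kernel trick can be illustrated without any optimization machinery, using the dual-form perceptron. A sketch (illustrative; XOR and the degree-2 polynomial kernel are example choices, and a real SVM would instead solve the quadratic program above):

```python
def kernel_perceptron(examples, kernel, epochs=50):
    """Dual-form perceptron: the hypothesis is f(x) = sum_i alpha_i y_i K(x_i, x),
    where alpha_i counts the mistakes made on training example i.
    Only kernel evaluations are needed -- never explicit feature vectors."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(examples):
            s = sum(a * yi * kernel(xi, xj)
                    for a, (xi, yi) in zip(alpha, examples))
            if yj * s <= 0:          # mistake: increment this example's weight
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def poly_kernel(u, v, degree=2):
    """(1 + u.v)^d -- implicit feature map over all monomials up to degree d."""
    return (1 + sum(ui * vi for ui, vi in zip(u, v))) ** degree

# XOR is not linearly separable in input space, but it is separable in the
# feature space induced by the degree-2 polynomial kernel.
xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
alpha = kernel_perceptron(xor, poly_kernel)
```

SVMs use the same dual trick but choose the alphas by maximizing the margin, which is what the quadratic programming formulation above delivers.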

Required readings

Recommended readings


Week 12 (Beginning April 5, 2004)

Lazy Learning Algorithms. Instance-based Learning, K-nearest neighbor classifiers, distance functions, locally weighted regression, sample application to document classification using the TF-IDF representation. Relative advantages and disadvantages of lazy learning and eager learning.
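
The "lazy" character of k-nearest neighbor is visible in a sketch: training is just storing the examples, and all work happens at query time (illustrative code, with made-up 2-D data; requires Python 3.8+ for math.dist):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """k-nearest-neighbor classification: find the k training points closest
    to the query (Euclidean distance here) and return the majority label."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'), ((0.9, 1.1), 'a'),
         ((5.0, 5.0), 'b'), ((5.2, 4.8), 'b'), ((4.9, 5.1), 'b')]
print(knn_classify(train, (1.1, 1.0)))  # 'a'
print(knn_classify(train, (5.1, 5.0)))  # 'b'
```

The trade-off discussed above is visible here: no training cost, but each query scans the whole training set, and the choice of distance function does all the inductive work.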

Learning Sequence Classifiers. Learning Sequence Classifiers Using Feature-Based Representations of Sequences -- e.g., bag of words representations for text classification. N-grams for text analysis and statistical natural language processing (NLP). Markov models and Hidden Markov Models (HMM). Some special classes of HMMs (ergodic, left-to-right, etc.). Fundamental HMM problems -- computing the probability of a given observation sequence given a model; computing the most likely hidden state sequence given an observation sequence and a model; computing the most likely model given observation sequence(s). Forward, Backward, Forward-Backward and Viterbi Algorithms; Algorithms for Learning an HMM from data.
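
The Viterbi algorithm for the second HMM problem (most likely hidden state sequence) is a short dynamic program. A sketch with a toy weather HMM whose probabilities are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an HMM, by dynamic programming:
    V[t][s] = max over s' of V[t-1][s'] * trans(s', s) * emit(s, obs[t])."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = prob
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)      # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):  # trace the backpointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ('rain', 'sun')
start = {'rain': 0.5, 'sun': 0.5}
trans = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
emit = {'rain': {'umbrella': 0.9, 'none': 0.1},
        'sun': {'umbrella': 0.2, 'none': 0.8}}
print(viterbi(('umbrella', 'umbrella', 'none'), states, start, trans, emit))
```

The Forward algorithm has the same structure with max replaced by sum, which computes the observation probability (the first HMM problem) instead of the best path.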

Required readings

Software

Recommended Readings

Week 13 (April 12, 2004)

Unsupervised or self-supervised learning. Clustering. Learning Mixture Models from Data; Identifiability of Mixture Models; Maximum Likelihood approach to Mixture Model Learning -- Expectation Maximization (EM) algorithms. K-means clustering algorithm and variants. Adaptive Resonance Theory (ART) family of clustering algorithms. Distance measures, Clustering Criteria -- Intra-Cluster and Inter-Cluster distances. Hierarchical Agglomerative Clustering Algorithm. Distributional Clustering, Applications to Learning Attribute Value Taxonomies from Data, Phylogeny Construction. Latent Semantic Indexing, Principal Component Analysis and related methods.
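
The k-means algorithm (Lloyd's iteration) alternates the two steps described above: assign each point to its nearest center, then move each center to its cluster mean. A sketch with made-up 2-D data (requires Python 3.8+ for math.dist):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's k-means. Each iteration reduces the within-cluster squared
    distance, so the assignments converge (to a local optimum)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    assignment = None
    for _ in range(iters):
        new_assignment = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:     # assignments stable: converged
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # keep the old center if empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (9.0, 9.1), (9.2, 9.0), (9.1, 9.2)]
centers, assignment = kmeans(points, 2)
```

k-means is the hard-assignment limit of EM for a Gaussian mixture model, which connects it to the mixture-model learning covered this week.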

Required readings

Recommended Readings

Additional Information


Week 14

Reinforcement Learning. Agents that learn by exploration of environments, using environmental reward and punishment. Examples of reinforcement learning problems. Credit assignment problem and its implications. Exploration-exploitation dilemma. Some approaches to exploration-exploitation tradeoffs. Markov decision processes. Learning optimal policies from interaction with the environment -- deterministic, stochastic, stationary and non-stationary environments. Value functions and Action-Value functions. Bellman equations and dynamic programming approach to learning optimal policies when the transition function and reward functions are known (i.e., the agent has a model of the environment). Q learning algorithm for learning optimal policies when an accurate model of the environment is not known. Temporal difference methods. Scaling up reinforcement learning to large state spaces -- function approximation methods for compact representation of action-value functions; state abstraction methods for hierarchical reinforcement learning. Multi-agent reinforcement learning. Applications.
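
Tabular Q-learning is short enough to sketch end to end. The corridor environment below is a hypothetical toy example, and exploration here is uniformly random rather than epsilon-greedy (legitimate because Q-learning is off-policy):

```python
import random

def q_learning(n_states, n_actions, step, episodes=2000,
               alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning. After each transition (s, a, r, s'):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    `step(s, a)` is the environment: it returns (next_state, reward, done)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(n_actions)             # random exploration
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])    # temporal-difference update
            s = s2
    return Q

# Hypothetical corridor world: states 0..4; action 0 moves left, action 1
# moves right; reaching state 4 yields reward 1 and ends the episode.
def corridor(s, a):
    s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(5, 2, corridor)
# The greedy policy w.r.t. Q moves right in every non-terminal state.
```

Because the update bootstraps from max over next-state values, reward information propagates backward from the goal, which is one concrete answer to the credit assignment problem for delayed rewards.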

Required readings

Recommended readings


Week 15

Review, Summary of the Course, and Discussion of Some Current Research Problems


General Background Material of Interest



Copyright © 1999-2003, Vasant Honavar, Department of Computer Science, Iowa State University. All rights reserved.

Dr. Vasant Honavar
Professor Department of Computer Science
Iowa State University
Atanasoff Hall, Ames, IA 50011-1040 USA
phone: +1-515-294-4377, fax: +1-515-294-0258

