- Learning From Data MOOC - Course Textbook
- Abu-Mostafa et al. - 2012 - Learning from data a short course.pdf
- Learning From Data: A Short Course

The recommended textbook covers 14 out of the 18 lectures; the rest is covered by online material that is freely available to readers. This book, together with specially prepared online material freely accessible to our readers, provides a complete introduction to Machine Learning.


The book website AMLbook.com contains supporting material for instructors and readers. There is also a forum that covers additional topics in learning from data.

The book focuses on the mathematical theory of learning: why learning is feasible, how well one can learn in theory, and so on. Pretty hardcore math; a well-written and carefully presented book. FYI, Dr. Abu-Mostafa teaches a class based on this book, which is available on YouTube.

Title: Learning from Data
Authors: Abu-Mostafa, Yaser S.; Lin, Hsuan-Tien

Physical description: xii, p.
Subjects: Machine learning -- Textbooks.
Contents: 1. The learning problem; 2. Training versus testing; 3. The linear model; 4. Overfitting; 5. Three learning principles; Epilogue; Further reading; Appendix: Proof of the VC bound; Notation.
Notes: Includes bibliographic references.


In contrast to supervised learning, where the training examples were of the form (input, correct output), in reinforcement learning a training example contains an input, a possible output, and a grade for that output rather than the correct output.

This characterizes reinforcement learning. Unsupervised learning, in turn, can be viewed as the task of spontaneously finding patterns and structure in input data. You may wonder how we could possibly learn anything from mere inputs. Consider the coin classification problem that we discussed earlier in Figure 1: here we are just given the input examples xi, without their labels. They still fall into clusters, and the decision regions in unsupervised learning may be identical to those in supervised learning.

If you use unsupervised learning instead, the resulting rule may be somewhat ambiguous. (Figure: unsupervised learning of coin classification; the same data set of coins as in Figure 1, shown without labels.)

Suppose that we didn't know the denomination of any of the coins in the data set. This unlabeled data is shown in Figure 1. We still get similar clusters. Statistics shares the basic premise of learning from data.

This is the main difference between the statistical approach and the machine learning approach, as discussed below. The main field dedicated to the subject is called machine learning, but we briefly mention two other important fields that approach learning from data in their own ways. In other cases, a task can fit more than one type of learning. As an example of unsupervised learning in a human setting, imagine that you don't speak a word of Spanish but are about to move to Spain; your company will arrange for Spanish lessons once you are there. All you have access to in the meantime is a Spanish radio station, to which you listen for a full month. When you arrive in Spain, you will find that the language no longer sounds completely foreign, even though nobody gave you a single labeled example.

Because statistics is a mathematical field, emphasis is given to situations that can be analyzed with full rigor. As another example of learning from data, we could be looking at credit card spending patterns and trying to detect potential fraud.

A visual learning problem: do you get the pattern?

Your task is to learn from this data set what f is. How could a limited data set reveal enough information to pin down the entire target function? Data mining is a practical field that focuses on finding patterns in data.

The most important assertion about the target function is that it is unknown; we really mean unknown. The target function f is the object of learning. This raises a natural question. Because databases are usually huge, data mining emphasizes efficient ways of finding patterns; recommender systems are a typical application. We make less restrictive assumptions and deal with more general models than in statistics. The first two rows of the figure show the training examples: each input x is a 9-bit vector, represented visually as a 3 x 3 black-and-white array. If the answer is no, we conclude that learning is not possible.

Instead of going through a formal proof for the general case, consider a concrete example. We are given a data set V of five examples, represented in the table below. Try to learn what the function is, then apply it to the test input given. The chances are the answers were not unanimous: there is simply more than one function that fits the 6 training examples. We know what we have already seen; that is memorizing. To make matters worse, since we maintain that f is an unknown function, this does not bode well for the feasibility of learning.

Does the data set V tell us anything outside of V that we didn't know before? If the answer is yes, then we have learned something. Both functions agree with all the examples in the data set, but this doesn't mean that we have learned f. If we remain true to the notion of unknown target, the table shows the case where g is chosen to match f on these examples.

Regardless of what g predicts on the three points we haven't seen before (those outside of D), the figure also shows the data set D in blue and what the final hypothesis g may look like. Let us look at the problem of learning f. The table below shows all such functions fi. The learning algorithm picks the hypothesis that matches the data set the most. To measure the performance, the quality of the learning will be determined by how close our prediction is to the true value.

It is easy to verify that any 3 bits that replace the red question marks are as good as any other 3 bits, since f is unknown except inside D. The whole purpose of learning f is to be able to predict the value of f on points that we haven't seen before, but as long as f is an unknown function, knowing D cannot exclude any pattern of values for f outside of D.
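The counting argument can be checked by brute force. The sketch below assumes an illustrative 5-point data set on X = {0,1}^3 (the specific labels are made up, not the book's exact table): exactly 2^3 = 8 target functions agree perfectly with D, and between them they realize every possible 0/1 pattern on the three unseen points.

```python
from itertools import product

# A 5-point training set on X = {0,1}^3; the labels are illustrative
# assumptions, not the book's exact table.
D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}

X = list(product([0, 1], repeat=3))     # all 8 points of the input space
unseen = [x for x in X if x not in D]   # the 3 points outside D

# A target function assigns 0/1 to each of the 8 points: 2^8 = 256 in all.
consistent = []
for labels in product([0, 1], repeat=len(X)):
    f = dict(zip(X, labels))
    if all(f[x] == y for x, y in D.items()):
        consistent.append(tuple(f[x] for x in unseen))

print(len(consistent))           # 8 functions agree perfectly with D
print(sorted(set(consistent)))   # they realize every 0/1 pattern outside D
```

Since every pattern outside D is realized by some consistent target, the data alone cannot prefer one prediction over another on the unseen points.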

Does this mean that learning from data is doomed? If so, this would be a very short book; fortunately, a probabilistic view salvages the situation. Consider a bin that contains red and green marbles. Once we establish that we can say something beyond the sample in this simple setting, we will relate it back to learning.

We won't have to change our basic assumption to do that; what we infer may not be much compared to learning a full target function, yet the performance outside V is all that matters in learning! This dilemma is not restricted to Boolean functions. It doesn't matter what the algorithm does or what hypothesis set H is used, or whether H has a hypothesis that perfectly agrees with V (as depicted in the table) or not. The proportion of red and green marbles in the bin is such that if we pick a marble at random, there is some fixed probability of it being red.

The target function will continue to be unknown. Let's take the simplest case of picking a sample: a random sample is picked from a bin of red and green marbles. The bin can be large or small. We pick a random sample of N independent marbles, with replacement, from this bin. The only quantity that is random here is the composition of the sample. Although it is certainly possible for the sample to misrepresent the bin, the answer is that the probability of this is a very small number.

Notice that only the size N of the sample affects the bound; everything else is just a constant. A random sample from a population tends to agree with the views of the population at large. There is a subtle point here: we can get mostly green marbles in the sample while the bin has mostly red marbles. One answer is that the bound holds regardless of the colors of the N marbles that we picked. By contrast, an exact calculation would use the binomial distribution.

It states that the bound holds for any sample size N; the quantity in the exponent is just a constant once N and the tolerance are fixed. How does the bin model relate to the learning problem? The training examples play the role of a sample from the bin: x1, ..., xN in V are picked independently according to P. In real learning, the color that each point gets is not known to us.
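The bin model is easy to simulate. A minimal sketch, in which the bin's red fraction mu, the sample size N, the tolerance eps, and the number of trials are all illustrative choices: draw N marbles, record the sample fraction nu, and compare the empirical frequency of the bad event |nu - mu| > eps with the Hoeffding bound 2*exp(-2*eps^2*N).

```python
import math
import random

random.seed(0)
mu, N, eps, trials = 0.6, 100, 0.1, 10_000   # illustrative choices

# Empirical frequency of the "bad event" |nu - mu| > eps.
bad = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(N)) / N
    if abs(nu - mu) > eps:
        bad += 1

empirical = bad / trials
hoeffding = 2 * math.exp(-2 * eps**2 * N)    # the bound 2e^(-2*eps^2*N)
print(f"empirical deviation rate: {empirical:.4f}")
print(f"Hoeffding bound:          {hoeffding:.4f}")
```

Note that the bound is loose: the observed deviation rate is well below it, which is exactly what the inequality promises, since it holds for every bin regardless of mu.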

P can be unknown to us as well. Take any single hypothesis h ∈ H and compare it to f on each point x ∈ X. If the sample was not randomly selected but picked in a particular way, the bound would not apply. If v happens to be close to zero, and if we have only one hypothesis to begin with, we can be fairly confident that the out-of-sample error is small too.

The learning problem is now reduced to a bin problem; the two situations can be connected as follows. If the inputs x1, ..., xN are sampled in the same way, the error rate within the sample corresponds to the fraction of red marbles. The probability is based on the distribution P over X which is used to sample the data points x. With this equivalence, let us see if we can extend the bin equivalence to the case where we have multiple hypotheses, in order to capture real learning.

(Figure: probability added to the basic learning setup.) To do that, we have made explicit the dependency of Ein on the particular h that we are considering; the in-sample error Ein and the out-of-sample error Eout are defined accordingly. If you are allowed to change h after you generate the data set, the guarantee for a single hypothesis no longer applies. Why is that? Let us consider an entire hypothesis set H instead of just one hypothesis h. Each bin still represents the input space X.
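In symbols, the two error quantities being compared throughout this discussion are:

```latex
E_{\text{in}}(h) \;=\; \frac{1}{N}\sum_{n=1}^{N}
  [\![\, h(\mathbf{x}_n) \neq f(\mathbf{x}_n) \,]\!],
\qquad
E_{\text{out}}(h) \;=\; \mathbb{P}\big[\, h(\mathbf{x}) \neq f(\mathbf{x}) \,\big],
```

where the probability is taken with respect to the distribution P over X, and [[ . ]] is 1 when its argument is true and 0 otherwise. For a single, fixed h, the Hoeffding Inequality reads P[|Ein(h) - Eout(h)| > eps] <= 2 exp(-2 eps^2 N).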

The probability of red marbles in the mth bin is Eout(hm), and the fraction of red marbles in the mth sample is Ein(hm). With multiple hypotheses in H, the guarantee must hold simultaneously for all of them. Let v1, vrand, and vmin be the fractions of heads in the coin experiment below, and plot the histograms of the distributions of v1, vrand, and vmin; crand is a coin you choose at random.

The hypothesis g is not fixed ahead of time, before generating the data, since g has to be one of the hm's regardless of the algorithm and the sample. Let's focus on 3 coins as follows: let v1, vrand, and vmin be the fraction of heads you obtain for the respective three coins. Flip each coin independently. cmin is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie).

Run a computer simulation of the coin flips and repeat the experiment a large number of times. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm. There is a simple but crude way of bounding the joint event: "B1 ⊂ B2" means that event B1 implies event B2. We will improve on that in Chapter 2. The question of whether V tells us anything outside of V that we didn't know before has two different answers.
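The coin experiment can be simulated as follows. The sizes below (1,000 fair coins, 10 flips per coin, 200 repetitions) are assumptions chosen to keep the run fast; they are not prescribed by the text above.

```python
import random
from statistics import mean

random.seed(1)
COINS, FLIPS, RUNS = 1000, 10, 200    # assumed experiment sizes

v1s, vrands, vmins = [], [], []
for _ in range(RUNS):
    # fraction of heads for each fair coin
    fracs = [sum(random.random() < 0.5 for _ in range(FLIPS)) / FLIPS
             for _ in range(COINS)]
    v1s.append(fracs[0])                 # c_1: the first coin flipped
    vrands.append(random.choice(fracs))  # c_rand: a coin chosen at random
    vmins.append(min(fracs))             # c_min: minimum frequency of heads

# v1 and vrand concentrate around 0.5, but vmin does not:
# selecting a coin *after* seeing the flips biases the estimate,
# just as selecting g after seeing the data biases Ein(g).
print(mean(v1s), mean(vrands), mean(vmins))
```

The first two averages hover near the true heads probability of 0.5, while the third sits far below it. This is the coin-flipping analogue of choosing a hypothesis because it happens to look good on the sample.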

We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible. One argument says that we cannot learn anything outside of V. If we insist on a deterministic answer, the answer is no; if we accept a probabilistic answer, the answer is yes. We now apply two basic rules in probability, stated for any events B1, ..., BM.

Let us reconcile the two arguments. Putting the two rules together gives the bound we need. We don't insist on using any particular probability distribution; that's what makes the Hoeffding Inequality applicable. Let us pin down what we mean by the feasibility of learning. We still have to make Ein(g) ≈ 0 in order to conclude that Eout(g) ≈ 0.
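The two rules just mentioned, the union bound together with Hoeffding applied to each hypothesis separately, combine into the bound for the final hypothesis g chosen from a finite hypothesis set of size M:

```latex
\mathbb{P}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big]
\;\le\; \sum_{m=1}^{M}
\mathbb{P}\big[\,|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon\,\big]
\;\le\; 2M e^{-2\epsilon^2 N}.
```

The union bound is what pays for the freedom to choose g after seeing the data: since g could be any of the hm's, the bad event for g is contained in the union of the bad events for all M hypotheses.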

We cannot guarantee that we will find a hypothesis that achieves Ein(g) ≈ 0, and of course this ideal situation may not always happen in practice; what we get instead is Eout(g) ≈ Ein(g). What enabled this is the Hoeffding Inequality 1. Assume, in the probabilistic view, that there is a probability distribution on X. S (smart) and C (crazy).

We consider two learning algorithms. Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces? If learning is successful, we have thus traded the condition Eout(g) ≈ 0, which we cannot ascertain, for the condition Ein(g) ≈ 0, which we can ascertain, by adopting the probabilistic view; remember that Eout(g) is an unknown quantity. She is willing to pay you to solve her problem and produce for her a g which approximates f. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error.

Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play.

If you do return a hypothesis g, all we hope for is a forecast that gets it right more often than not. If the number of hypotheses M goes up, the bound becomes weaker. The feasibility of learning is thus split into two questions. This means that a hypothesis that has Ein(g) somewhat below 0.5 is already useful. Can we make sure that Eout(g) is close enough to Ein(g)? One such insight has to do with the 'complexity' of these components. The second question is answered after we run the learning algorithm on the actual data and see how small we can get Ein to be.

If we get that, can we make Ein(g) small enough? The Hoeffding Inequality 1 does not answer this; the answer comes from running the algorithm. What is the best that you can promise her among the following? Even when we cannot learn a particular f, in many situations we can still say something useful. This is obviously a practical observation. In the extreme case, this means that we will get a worse value for Ein(g) when f is complex. The second notion is about the nature of the target function.

The complexity of f.

A close look at Inequality 1. What are the ramifications of having such a 'noisy' target on the learning problem? The first notion is what approximation means when we say that our hypothesis approximates the target function well. If we want an affirmative answer to the first question. Let us examine if this can be inferred from the two questions above.

We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower Ein g.

If we fix the hypothesis set and the number of training examples, then, either way we look at it, a complex target function means the final hypothesis g is only an approximation of f.

If we define a pointwise error measure e(h(x), f(x)), the overall error E(h, f) averages it over the input space. One may view E(h, f) as the 'cost' of using h when you should use f. This cost depends on what h is used for. What are the criteria for choosing one error measure over another?

We address this question here. An error measure quantifies how well each hypothesis h in the model approximates the target function f. While E(h, f) is our ultimate yardstick, the choice of an error measure affects the outcome of the learning process, as Example 1 shows.

Here is a case in point. In an ideal world, E(h, f) should be user-specified: the same learning task in different contexts may warrant the use of different error measures, and different error measures may lead to different choices of the final hypothesis. Consider the problem of verifying that a fingerprint belongs to a particular person. First, the supermarket.

False rejects annoy legitimate customers, and all future revenue from an annoyed customer is lost. For authorized employees at a secure facility, by contrast, the inconvenience of retrying when rejected is just part of the job.

In the supermarket and CIA scenarios, the costs of the two error types differ sharply. For the supermarket, a false accept merely means you just gave away a discount to someone who didn't deserve it; for the CIA, a false accept means an unauthorized person will gain access to a highly sensitive facility. For our examples, the right values depend on the application.

The costs of the different types of errors can be tabulated in a matrix. The moral of this example is that the choice of the error measure depends on how the system is going to be used. For the CIA, which will use the system at the entrance to a secure facility to verify that you are authorized to enter that facility, unauthorized entry is so harmful that it should be reflected in a much higher cost for the false accept.

We need to specify the error values for a false accept and for a false reject; if the right person is accepted or an intruder is rejected, the error is zero. Consider, then, two potential clients of this fingerprint system: a supermarket and the CIA. In practice there are difficulties: one is that the user may not provide an error specification, and the other is that the weighted cost may be a difficult objective function for optimizers to work with. We have already seen an example of this with the simple binary error used in this chapter.
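A weighted in-sample error can be sketched directly from a cost matrix. The cost values and the batch of outcomes below are illustrative assumptions, not the book's exact numbers; the point is only that the same decisions score very differently under the two matrices.

```python
# Illustrative cost matrices for the fingerprint example; the specific
# numbers are assumptions, not the book's exact values.
# Key (true_identity, decision) -> cost; correct decisions cost 0.
SUPERMARKET = {("intruder", "accept"): 1,    ("you", "reject"): 10}
CIA         = {("intruder", "accept"): 1000, ("you", "reject"): 1}

def weighted_error(costs, outcomes):
    """Average cost over (true_identity, decision) outcomes."""
    return sum(costs.get(o, 0) for o in outcomes) / len(outcomes)

# A hypothetical batch of 100 decisions made by the same hypothesis h:
outcomes = ([("you", "accept")] * 90 + [("you", "reject")] * 5
            + [("intruder", "reject")] * 4 + [("intruder", "accept")] * 1)

print(weighted_error(SUPERMARKET, outcomes))  # false rejects dominate
print(weighted_error(CIA, outcomes))          # one false accept dominates
```

With identical decisions, the supermarket's error is driven by the 5 false rejects, while the CIA's error is driven almost entirely by the single false accept. The same h can be acceptable for one client and disastrous for the other.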

In the general supervised learning problem, a data point (x, y) need not be generated deterministically; this view suggests that a deterministic target function can be considered a special case of a noisy target. A particular realization of P(y | x) is effectively a target function, and if we use the same h to approximate a noisy version of f given by y = f(x) plus noise, our entire analysis of the feasibility of learning applies to noisy target functions as well. Without any structure, the noisy target will look completely random.

This situation can be readily modeled within the same framework that we have, for example when y is real-valued. One can think of a noisy target as a deterministic target plus added noise: assume we randomly picked all the y's according to the distribution P(y | x) over the entire input space X. While both distributions model probabilistic aspects of x and y, this does not mean that learning a noisy target is as easy as learning a deterministic one, although Eout may be as close to Ein in the noisy case as it is in the deterministic case.

Remember the two questions of learning? With the same learning model, both can be addressed in the noisy setting, as Chapter 2 will show.

In more than two dimensions the same argument holds; for simplicity, the following steps will guide you through the proof (use induction, and compare your results with part b).

Problem 1. One bag has 2 black balls and the other has a black and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball, it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black? Use Bayes' Theorem.
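The two-bag problem can be settled analytically: by Bayes' Theorem, P(two-black bag | first ball black) = (1 * 1/2) / (1 * 1/2 + 1/2 * 1/2) = 2/3, and the second ball is black exactly when we hold the two-black bag. A quick simulation confirms it:

```python
import random

random.seed(2)
trials, both_black, first_black = 100_000, 0, 0

for _ in range(trials):
    bag = random.choice([["black", "black"], ["black", "white"]])
    random.shuffle(bag)
    if bag[0] == "black":          # condition on the first ball being black
        first_black += 1
        if bag[1] == "black":
            both_black += 1

print(both_black / first_black)    # close to 2/3, matching Bayes' Theorem
```

The naive answer of 1/2 is wrong because seeing a black ball makes the two-black bag twice as likely as the mixed bag.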

Plot a histogram for the number of updates that the algorithm takes to converge, and comment on whether f is close to g. This problem leads you to explore the algorithm further with data sets of different sizes and dimensions. Compare your results with part b.

Be sure to mark the examples from different classes differently. In practice, PLA converges more quickly than the bound suggests, as the iterations of each experiment will show.

How many updates does the algorithm take to converge? Compare your results with part b, and report the number of updates that the algorithm takes before converging. To get g, run the algorithm to convergence. In this problem, the algorithm above is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning.

In each iteration t, plot the training data set and report the error on the test set. In each iteration, generate a test data set of the specified size; the entries are iid random variables. Assume we have a number of coins that generate different samples independently. For a single given coin with probability mu of heads, the probability of obtaining k heads in N tosses is given by the binomial distribution: P[k heads] = C(N, k) mu^k (1 - mu)^(N - k). On the same plot, show the bound that would be obtained using the Hoeffding Inequality.

One of the simplest forms of that law is the Chebyshev Inequality. In Problem 1, we focus on the simple case of flipping a fair coin. Evaluate U_s as a function of s. For a fixed V of size N, argue that for any two deterministic algorithms A1 and A2 the result is the same. You have now proved that, in a noiseless setting, averaged over all possible targets, no algorithm can do better than another. This in-sample error should weight the different types of errors based on the risk matrix.

For the two risk matrices in Example 1: you have N data points y1, ..., yN and wish to estimate a 'representative' value.

Similar results can be proved for more general settings. What happens to your two estimators hmean and hmed? We began the analysis of in-sample error in Chapter 1: the in-sample error Ein is based on the training data, while Eout is based on the performance over the entire input space X. We will also discuss the conceptual and practical implications of the contrast between training and testing. The goal is for you to learn the course material.

If the exam problems are known ahead of time. It expressly measures training performance. Doing well in the exam is not the goal in and of itself.

Chapter 2 Training versus Testing Before the final exam. They are the 'training set' in your learning. The same distinction between training and testing happens in learning from data. If the professor's goal is to help you do better in the exam. Although these problems are not the exact ones that will appear on the exam.

Such performance has the benefit of looking at the solutions and adjusting accordingly. The exam is merely a way to gauge how well you have learned the material. If H is an infinite set, this can be rephrased as follows. A word of warning: we will also make the contrast between a training set and a test set more precise.

Not only do we want to know that the hypothesis g that we choose say the one with the best training error will continue to do well out of sample i. We have already discussed how the value of Ein does not always generalize to a similar value of Eout. This is important for learning.

for all h ∈ H (see Figure 1 and Exercise 1). Whether or not this hope is justified remains to be seen. Later on, we will consider a number of refinements and variations to this basic setup as needed. However, the essence of the problem will remain the same. There is a target to be learned. It is unknown to us. We have a set of examples generated by the target. The learning algorithm uses these examples to look for a hypothesis that approximates the target.

Given a specific learning problem, the target function and training examples are dictated by the problem. However, the learning algorithm and hypothesis set are not. These are solution tools that we get to choose. The hypothesis set and learning algorithm are referred to informally as the learning model. Here is a simple model. In our credit example, different coordinates of the input vector x ∈ R^d correspond to salary, years in residence, outstanding debt, and the other data fields in a credit application.

The binary output y corresponds to approving or denying credit. The functional form h(x) that we choose here gives different weights to the different coordinates of x, reflecting their relative importance in the credit decision.

The weighted coordinates are then combined to form a 'credit score', and the result is compared to a threshold value. This formula can be written more compactly as h(x) = sign(w^T x), where an artificial coordinate x0 = 1 is added to the input vector so that the threshold is absorbed into the weight vector as w0.

Some of the weights w1, ..., wd may end up being negative, corresponding to an adverse effect on credit approval. For instance, the weight of the 'outstanding debt' field should come out negative, since more debt is not good for credit.

The bias value b may end up being large or small, reflecting how lenient or stringent the bank should be in extending credit.
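The credit-score hypothesis can be sketched in a few lines. The weights, bias, and applicant fields below are hypothetical values chosen only to illustrate the sign of each effect (salary helps, outstanding debt hurts, and a more negative bias means a more stringent bank):

```python
def credit_hypothesis(w, b, x):
    """Perceptron-style credit decision: approve (+1) if the weighted
    sum of the applicant's fields exceeds the threshold, else deny (-1)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical weights for (salary, years in residence, outstanding debt).
w = [0.4, 0.2, -0.5]
b = -1.0   # a stringent bank would make this more negative

print(credit_hypothesis(w, b, [5.0, 2.0, 1.0]))  # low debt: approve
print(credit_hypothesis(w, b, [5.0, 2.0, 5.0]))  # high debt: deny
```

Learning, of course, consists of choosing w and b from data rather than by hand, which is exactly what the perceptron learning algorithm below does.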

The optimal choices of weights and bias define the final hypothesis g ∈ H that the algorithm produces. Let's say that each email message is represented by the frequency of occurrence of keywords, and the output is +1 if the message is considered spam.

If the data set is linearly separable, there will be a choice for these parameters that classifies all the training examples correctly. The algorithm will determine what w should be, based on the data. Let us assume that the data set is linearly separable, which means that there is a vector w that makes 1. Our learning algorithm will find this w using a simple iterative method.

Here is how it works. The algorithm picks an example from (x1, y1), ..., (xN, yN) that is currently misclassified, call it (x(t), y(t)), and uses it to update w(t). Since the example is misclassified, we have y(t) ≠ sign(w^T(t) x(t)). The update rule is w(t+1) = w(t) + y(t) x(t).

The algorithm continues with further iterations until there are no longer misclassified examples in the data set. Although the update rule in 1 may appear arbitrary, it is guaranteed to converge on linearly separable data; the proof is the subject of Problem 1. The result holds regardless of which example we choose from among the misclassified examples in (x1, y1), ..., (xN, yN) at each iteration, and regardless of how we initialize the weight vector to start the algorithm. For simplicity, we can pick one of the misclassified examples at random (or cycle through the examples and always choose the first misclassified one), and we can initialize w(0) to the zero vector.

Within the infinite space of all weight vectors, the perceptron algorithm manages to find a weight vector that works, using a simple iterative process. This illustrates how a learning algorithm can effectively search an infinite hypothesis set using a finite number of simple steps. This feature is characteristic of many techniques that are used in learning, some of which are far more sophisticated than the perceptron learning algorithm.
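The iterative process described above can be sketched compactly. The data set below is made up for illustration (with x[0] = 1 absorbing the bias, as in the compact form of the hypothesis); on any linearly separable data the loop is guaranteed to terminate.

```python
import random

random.seed(3)

def sign(z):
    return 1 if z > 0 else -1

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm: start from the zero weight vector,
    repeatedly pick a misclassified example (x(t), y(t)) at random, and
    update w <- w + y(t) * x(t) until nothing is misclassified."""
    w = [0.0] * len(X[0])
    for _ in range(max_iters):
        mis = [(x, t) for x, t in zip(X, y)
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != t]
        if not mis:
            return w                 # converged: data fully separated
        x, t = random.choice(mis)
        w = [wi + t * xi for wi, xi in zip(w, x)]
    raise RuntimeError("no convergence; data may not be linearly separable")

# A tiny linearly separable data set, invented for illustration.
X = [[1, 2.0, 3.0], [1, 1.0, 1.5], [1, -1.0, -2.0], [1, -2.0, -1.0]]
y = [+1, +1, -1, -1]
w = pla(X, y)
print(w)   # a separating weight vector (not unique)
```

Note that PLA returns *some* separating hyperplane, not a canonical one: the answer depends on which misclassified example is picked at each step, which is exactly the freedom the convergence result allows.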

Choose the inputs xn of the data set as random points in the plane, and evaluate the target function on each xn to get the corresponding output yn. Now, generate a data set and try the perceptron learning algorithm on it; see how long it takes to converge and how well the final hypothesis g matches your target f. You can find other ways to play with this experiment in Problem 1. Does this mean that this hypothesis will also be successful in classifying new data points that are not in V?

This turns out to be the key question in the theory of learning, a question that will be thoroughly examined in this book. A new coin will be classified according to the region in the size-mass plane that it falls into. Now, we discuss what learning is not.

The goal is to distinguish between learning and a related approach that is used for similar problems. While learning is based on data, this other approach does not use data.

It is a 'design' approach based on specifications, and is often discussed alongside the learning approach in pattern recognition literature. Consider the problem of recognizing coins of different denominations, which is relevant to vending machines , for example.

We want the machine to recognize quarters, dimes, nickels, and pennies. We will contrast the 'learning from data' approach and the 'design from specifications' approach for this problem.

We assume that each coin will be represented by its size and mass, a two-dimensional input. In the learning approach, we are given a sample of coins from each of the four denominations and we use these coins as our data set. We treat the size and mass as the input vector, and the denomination as the output. There is some variation of size and mass within each class, but by and large coins of the same denomination cluster together.

The learning algorithm searches for a hypothesis that classifies the data set well.