
Exploring Machine and Human Learning: Principles and Processes
What allows us to learn as human beings? In general, we acquire a large part of our knowledge through our experience with objects and the world around us. This means that we learn from the information and data we gather about them, rather than relying on abstract mathematical definitions.
This ability to learn through observation and data analysis has been very useful throughout history, as there are many problems that cannot be approached analytically or theoretically. In these cases, data allows us to find empirical solutions, although they may not necessarily provide us with a deep understanding of why things work the way they do. However, these data-driven solutions can be very practical. For this reason, the ability to learn from data is fundamental in many professions and scientific disciplines.
In this post, we want to briefly cover the main aspects that make up the problem of learning from data. Then, we will delve into how machines can also learn using this approach.
The problem of learning
The ability to learn from data is a process that can be automated through the use of algorithms designed specifically for this purpose. These algorithms aim to find the most accurate solution for predicting outcomes, without necessarily seeking to understand the underlying reasons: they rely on the data to construct a formula that offers the best practical performance. It's important to note that such learning algorithms only seek to improve their accuracy as they gather more information; they don't always provide a deep understanding of the underlying phenomenon.
From a more mathematical perspective, the problem of learning can be formulated using three measurable spaces $X$, $Y$, and $Z$. The set $Z$ is a subset of $X \times Y$ and represents a relationship between the data from $X$ and $Y$. In this context, the learning task consists of trying to describe the relationship $Z$ from a data sample $S = (z_1, \dots, z_m) \in Z^m$ with $z_i = (x_i, y_i)$, and a loss function $L$ defined on the cartesian product between the set $\mathcal{F}$ of all measurable functions from $X$ to $Y$ and the set $Z$, with values in the real numbers:

$$L : \mathcal{F} \times Z \to \mathbb{R}.$$

The function $L$ is mainly used to evaluate the performance of the learning algorithm.
To address this task, it is necessary to select a hypothesis set $\mathcal{H} \subseteq \mathcal{F}$ and develop a learning algorithm, that is, to find a mapping:

$$\mathcal{A} : \bigcup_{m \in \mathbb{N}} Z^m \to \mathcal{H}.$$
The goal of a learning algorithm is, given a data sample $S$ of a certain size $m$, to find a model $h_S = \mathcal{A}(S)$ that performs well on the data sample and also has the ability to generalize that performance to unknown data in $Z$. The performance of the model is evaluated using the loss function $L$, and it is measured through the loss $L(h_S, z)$. The ability to generalize implies that the model will exhibit a similar behavior on unknown data $z \in Z$ as it does on the known data sample $S$.
At this point, we can agree that the terms "good performance" and "ability to generalize" are quite ambiguous. However, we can make them precise by examining the notions of real risk and empirical risk, defined below:
The real risk of a hypothesis $h \in \mathcal{F}$ with respect to a probability distribution $\mathcal{D}$ over $Z$ is defined as:

$$R(h) := \mathbb{E}_{z \sim \mathcal{D}}\left[L(h, z)\right]. \tag{1}$$
In this definition, the expectation of the loss function of $h$ is calculated over data $z$ randomly sampled according to the distribution $\mathcal{D}$. It is worth noting that, in practice, the distribution $\mathcal{D}$ is essentially unknown.
On the other hand, the empirical risk is the expected loss over a data sample $S = (z_1, \dots, z_m)$, that is:

$$\hat{R}_S(h) := \frac{1}{m} \sum_{i=1}^{m} L(h, z_i). \tag{2}$$
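As a minimal sketch of this definition, the empirical risk is just the average of the loss over the sample. The hypothesis, loss, and data below are made up for illustration:

```python
# Empirical risk: the average loss of a hypothesis h over a finite
# sample S = [(x_1, y_1), ..., (x_m, y_m)].
def empirical_risk(h, sample, loss):
    """Average of loss(h, z) over the data sample."""
    return sum(loss(h, z) for z in sample) / len(sample)

# Example: squared loss and a toy linear hypothesis h(x) = 2x.
squared_loss = lambda h, z: (h(z[0]) - z[1]) ** 2
h = lambda x: 2.0 * x
sample = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]

risk = empirical_risk(h, sample, squared_loss)
```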
It is desirable to find a model that has a real risk of zero, as it would mean that the model would make no errors in its prediction task. However, it is rare to find a model with these characteristics in practice. Therefore, instead, the focus is on finding a model $h_S$ that satisfies the following condition:

$$\hat{R}_S(h_S) = \min_{h \in \mathcal{H}} \hat{R}_S(h). \tag{3}$$
Although the mentioned condition guarantees good performance of the model on the training data set $S$, this approach carries the risk of overfitting. In practice, it is possible to find models that have an empirical risk of zero on the training data set $S$ but a significant loss on previously unseen data. This means that the model lacks the ability to generalize to new data and, therefore, lacks practical utility. To avoid overfitting, it is common to split the data set into two subsets: one for training the model and another for evaluating its performance. The subset used for training is called $S_{\text{train}}$ and the subset used for evaluation is called $S_{\text{test}}$. The goal is to find a model with similar performance on both subsets, $\hat{R}_{S_{\text{train}}}(h_S) \approx \hat{R}_{S_{\text{test}}}(h_S)$, indicating good generalization ability. If the model's performance is significantly worse on the test set than on the training set, it is likely that the model has overfit the training set.
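The split described above can be sketched in a few lines; the shuffling, split ratio, and toy data here are illustrative choices, not a prescription:

```python
import random

def train_test_split(sample, test_fraction=0.25, seed=0):
    """Shuffle the sample and split it into S_train and S_test."""
    rng = random.Random(seed)
    shuffled = sample[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data set of 20 (x, y) pairs.
sample = [(float(i), 2.0 * i) for i in range(20)]
s_train, s_test = train_test_split(sample)
```

After training on `s_train`, one compares the empirical risk on both subsets to detect overfitting.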
How can we ensure that a model has good generalization ability? This is a complex problem that involves first choosing the appropriate hypothesis set $\mathcal{H}$. In this way, for any value of $\varepsilon > 0$, we must find a training data set $S$ that guarantees:

$$\sup_{h \in \mathcal{H}} \left| R(h) - \hat{R}_S(h) \right| \le \varepsilon. \tag{4}$$
Once we have found the hypothesis set $\mathcal{H}$ that satisfies equation (4), we can proceed to find the hypothesis $h_S \in \mathcal{H}$ that satisfies equation (3). If we manage to find a hypothesis set and a model with these characteristics, we can say that our model has good generalization ability and, therefore, performs well.
Prediction and Classification Tasks
Here are some examples of data-driven learning problems:
Multiclass classification. Imagine you want to design a program to classify documents into different categories, such as news, sports, biology, and medicine. A learning algorithm for this task would have access to a set of correctly classified documents, denoted as $S$, and would use these examples to train a model that can classify new documents presented to it. In this example, the domain $X$ is the set of all possible documents. It is important to note that the documents should be represented using a set of features, such as the number of different words, document size, author, and origin. The labels $Y$ are the set of all possible topics (in this case, a finite set). Once we have defined the domain and the labels, we need to determine a suitable loss function $L$ to measure the performance of our algorithm.
For the multiclass classification problem, we can use a random variable $z = (x, y)$ in the domain $Z = X \times Y$ and a loss function as follows:

$$L(h, (x, y)) := \begin{cases} 0 & \text{if } h(x) = y, \\ 1 & \text{if } h(x) \neq y. \end{cases}$$
This function is generally used for binary or multiclass classification problems.
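To make the 0-1 loss concrete, here is a sketch on the document-classification example; the keyword-based classifier and the three labeled documents are invented purely for illustration:

```python
def zero_one_loss(h, z):
    """0-1 loss: 1 when the predicted label differs from the true one."""
    x, y = z
    return 0 if h(x) == y else 1

# Hypothetical classifier: label a document by a keyword it contains.
def classify(doc):
    return "sports" if "match" in doc else "news"

sample = [("final match tonight", "sports"),
          ("election results", "news"),
          ("stock market falls", "sports")]

# Empirical risk under the 0-1 loss = fraction of misclassified documents.
errors = sum(zero_one_loss(classify, z) for z in sample) / len(sample)
```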
In the regression task, we aim to find a simple functional relationship between the data $x$ and the labels $y$. For example, we could try to predict the birth weight of a baby based on ultrasound measurements of their head circumference, abdominal circumference, and femur length. In this case, the domain $X$ is a subset of $\mathbb{R}^3$ (the three ultrasound measurements) and the labels $Y$ are real numbers (the weight in grams). The training set $S$ is a subset of $X \times Y$. The quality of a hypothesis $h$ can be evaluated using the expected value of the squared difference between the correct labels and the predictions of $h$, i.e.:

$$R(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[(h(x) - y)^2\right].$$
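The squared loss on the birth-weight example can be sketched as follows; the linear model, its coefficients, and the measurements are entirely made up for illustration and have no clinical meaning:

```python
def squared_loss(h, z):
    """Squared difference between the prediction and the true label."""
    x, y = z
    return (h(x) - y) ** 2

# Hypothetical linear model on the three ultrasound measurements
# (head circumference, abdominal circumference, femur length);
# the coefficients below are invented for illustration only.
def h(x):
    head, abdomen, femur = x
    return 5.0 * head + 8.0 * abdomen + 20.0 * femur - 3000.0

# One labeled example: measurements and the true weight in grams.
z = ((330.0, 350.0, 70.0), 3250.0)
loss = squared_loss(h, z)
```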
How do machines learn?
As mentioned before, the learning problem involves selecting a hypothesis set $\mathcal{H}$ and finding the hypothesis $h_S \in \mathcal{H}$ that satisfies the following condition:

$$h_S \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \hat{R}_S(h).$$
In other words, the learning problem reduces to optimizing the empirical risk $\hat{R}_S$. There are a variety of optimization algorithms available to solve such problems, but one of the most popular ones is the gradient descent algorithm. If the hypotheses in $\mathcal{H}$ are parameterized by a vector of weights $w$, so that $h = h_w$, this algorithm is based on iterating the following operation:

$$w_{t+1} = w_t - \eta \, \nabla_w \hat{R}_S(h_{w_t}),$$

where $\eta > 0$ is the learning rate.
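As a rough sketch of this iteration, here is gradient descent on the empirical risk of a one-dimensional linear model $h_w(x) = wx$ under the squared loss; the toy data, learning rate, and step count are arbitrary choices for illustration:

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Iterate w_{t+1} = w_t - lr * grad(w_t)."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy sample drawn from the relationship y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def risk_grad(w):
    """Gradient of (1/m) * sum((w*x - y)^2) with respect to w."""
    m = len(data)
    return sum(2 * (w * x - y) * x for x, y in data) / m

w_star = gradient_descent(risk_grad, w0=0.0)  # converges toward w = 2
```

Each step moves the weights in the direction that most decreases the empirical risk, which is exactly the update rule above.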
I hope you enjoyed this post and found the information useful. See you in upcoming content. Goodbye!