We’ll start by defining some important concepts and attributes of machine learning systems that you need to understand before you start coding one. By the end of this article, you’ll have a notion of why you would use an ML solution and what you need to build it.
Defining Machine Learning
Machine Learning is the science of programming computers so they can learn from data. An ML-based system processes raw data and transforms it into training instances, which together make up the training set.
Raw data: pure, unprocessed data
Training instances: processed data samples, each described by attributes. E.g.: salary, purchased or not, nationality…
Training set: the collection of training instances that the system’s algorithm uses to learn autonomously.
Take a dataset with the columns Country, Age, Salary and Purchased as an example:
- Country, Age, Salary and Purchased are attributes (also loosely called data types)
- A feature is usually an attribute plus its value (“Country” = “France”)
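As a minimal sketch, the terms above can be expressed in a few lines of Python. The records and their values are made up for illustration, and the helper name `to_training_instance` is hypothetical; only the attribute names come from the example.

```python
# Hypothetical raw records; the values are invented for illustration.
raw_data = [
    "France,44,72000,No",
    "Spain,27,48000,Yes",
    "Germany,30,54000,No",
]

def to_training_instance(line):
    """Turn one raw text record into a processed training instance."""
    country, age, salary, purchased = line.split(",")
    return {
        "Country": country,
        "Age": int(age),
        "Salary": float(salary),
        "Purchased": purchased == "Yes",
    }

# The training set is simply the collection of training instances.
training_set = [to_training_instance(line) for line in raw_data]

# A feature is an attribute together with its value.
first_feature = ("Country", training_set[0]["Country"])
print(first_feature)  # ('Country', 'France')
```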
ML/AI systems vs. Mechanical Systems
Spam filters were one of the first practical and mainstream uses of machine learning, and they illustrate this difference well. A spam filter usually analyzes the words within the email itself, looking for red flags.
If you’re building a mechanical spam filter, you have to hardcode every spam red flag. Although that might be effective, it’s not efficient, since spam strategies are constantly changing.
In other words, a lot of human effort would be required to keep such a mechanical system up to date. In an ML scenario, the system learns incrementally by itself as it is fed training data (preferably through online learning).
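To make the contrast concrete, here is a rough sketch of both approaches. The word list, the example emails and the class name `LearnedSpamFilter` are all assumptions made up for this illustration, and the word-counting scheme is a deliberately naive stand-in for a real learning algorithm.

```python
from collections import Counter

# Mechanical approach: red flags are hardcoded and must be maintained by hand.
HARDCODED_RED_FLAGS = {"free", "winner", "prize"}

def mechanical_is_spam(email):
    words = set(email.lower().split())
    return len(words & HARDCODED_RED_FLAGS) > 0

# Learning approach: count how often each word shows up in spam vs. normal mail.
class LearnedSpamFilter:
    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def learn(self, email, is_spam):
        """Update word counts from one labeled email (can be called at any time)."""
        target = self.spam_counts if is_spam else self.ham_counts
        target.update(email.lower().split())

    def is_spam(self, email):
        words = email.lower().split()
        spam_score = sum(self.spam_counts[w] for w in words)
        ham_score = sum(self.ham_counts[w] for w in words)
        return spam_score > ham_score

learned = LearnedSpamFilter()
learned.learn("claim your free prize now", True)
learned.learn("meeting notes for monday", False)

# Spammers change strategy: new wording that the hardcoded list doesn't know.
new_spam = "exclusive crypto giveaway"
learned.learn("crypto giveaway inside", True)  # one new labeled example

print(mechanical_is_spam(new_spam))  # False: the rule list is out of date
print(learned.is_spam(new_spam))     # True: the filter adapted from data
```

The mechanical filter only changes when someone edits `HARDCODED_RED_FLAGS`, while the learned one changes whenever it sees a new labeled email.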
Training what we can’t (or don’t want to) code
Coding a speech recognizer or a personal assistant like Siri or Alexa became possible thanks to machine learning. In principle, they could be coded without any ML, but that’s the kind of work that becomes unnecessary when you have the powerful tools of ML.
Imagine having to hardcode every possible spoken variation of each word and map all of them to the corresponding letters: a huge chunk of work. A better idea is to write an algorithm that learns by itself from many examples of each word.
We can now conclude that ML and AI open up countless possibilities for innovation, since they shorten the time needed to build such systems.
So, when to use machine learning?
- Dynamic environments (ML can adapt to new data using online or batch learning)
- Getting insights about complex problems and large amounts of data
- Complex problems that are not so easy to code or would require a lot of human hours
- Huge amounts of data with no known algorithmic solution
Machine Learning Systems
There are three general ways to classify machine learning systems or algorithms:
- Whether or not they are trained under human supervision (supervised, unsupervised, semisupervised or reinforcement learning)
- Whether they can learn incrementally while running (online or batch learning)
- Whether they compare new data points to known data points, or detect patterns in the training data and build a predictive model (instance-based or model-based learning)
How systems are trained
In supervised learning, the training data fed to the algorithm includes the desired solutions, called labels. Every training instance therefore contains a label.
Classification and Regression are typical supervised learning tasks:
- Classification assigns instances to different groups (classes)
- Regression (or prediction) predicts a target value by learning from predictors and their labels
Most important supervised learning algorithms:
- Neural Networks (which can also be unsupervised)
- Decision Trees
- Random Forests
- Linear Regression
- Logistic Regression
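As an illustration of supervised learning, here is a minimal sketch of linear regression with a single predictor, fitted with the standard least-squares formulas. The data (years of experience and salary) is hypothetical, and the function names are made up for this example.

```python
def fit_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: predictor = years of experience,
# label = salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_linear_regression(years, salary)

def predict(x):
    """Use the learned parameters to predict the label of a new instance."""
    return slope * x + intercept

print(predict(6))  # about 60 for this perfectly linear data
```

The labels (`salary`) are what make this supervised: the algorithm is told the desired answer for every training instance.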
In unsupervised learning, the training data is unlabeled, so the system learns by itself by finding structure in the data. There are three general uses for unsupervised learning:
- Clustering
- Association rule learning
- Visualization and dimensionality reduction
Clustering will divide instances into clusters – which are groups that share traits in common.
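A small sketch of clustering, assuming the k-means algorithm (the text does not name a specific one) and one-dimensional data to keep it short. The ages and the starting centroids are made up for illustration.

```python
def k_means_1d(points, centroids, iterations=10):
    """Very small k-means for 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = sum(cluster) / len(cluster)
    return centroids, clusters

# Hypothetical, unlabeled data: customer ages.
ages = [18, 19, 21, 22, 45, 47, 50, 52]
centroids, clusters = k_means_1d(ages, centroids=[18, 52])

print(clusters)   # [[18, 19, 21, 22], [45, 47, 50, 52]]
print(centroids)  # [20.0, 48.5]
```

No labels are involved at any point: the two groups emerge only from how close the values are to each other.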
Dimensionality reduction aims to simplify the data without losing too much information. For instance, the price of a house might be strongly correlated with its location, so a dimensionality reduction algorithm can merge them into a single feature. This technique is called feature extraction. Working with fewer features can considerably speed up training and reduce memory use.
Anomaly detection, such as credit card fraud detection, is also a task for unsupervised learning.
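As a rough sketch of anomaly detection, the snippet below flags values that lie far from the mean, measured in standard deviations. This simple statistical rule is an assumption chosen for brevity, not how a production fraud detector works, and the transaction amounts are invented.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Hypothetical credit card transaction amounts, with one unusual purchase.
amounts = [12, 15, 14, 13, 16, 15, 14, 500]

print(find_anomalies(amounts))  # [500]
```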
Semisupervised learning combines supervised and unsupervised algorithms: a portion of the data is labeled and the rest is not. Usually, the system identifies patterns or clusters in the data, and then the programmer assigns a label to each pattern or cluster.
Deep Belief Networks (DBNs) are based on stacked Restricted Boltzmann Machines (RBMs), which are unsupervised learning components. The RBMs are trained through unsupervised learning, and then the whole system is fine-tuned using supervised learning techniques (with labeled data).
Facial recognition is a good example of semisupervised learning. The system identifies by itself that a person is present and, depending on the system, also identifies their physical attributes (hair and eye color, skin tone, shapes…). The instance is then labeled with the person’s name and any other necessary information.
In reinforcement learning, the learning system is called an agent. The agent observes the environment and performs actions, for which it receives rewards or penalties. It learns by itself the best strategy, called a policy, to maximize its reward over time. A policy defines which action the agent should choose in a given situation.
Reinforcement Learning is commonly used in robots with many degrees of freedom, for tasks like walking, picking up objects and opening doors!
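Here is a toy sketch of an agent learning a policy. It assumes tabular Q-learning (one common reinforcement learning algorithm, not named in the text) and a made-up environment: a corridor of five cells where the agent is rewarded for reaching the last one. All constants and names are hypothetical.

```python
import random

N_STATES = 5          # cells 0..4 in a corridor; cell 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
ALPHA, GAMMA = 0.5, 0.9  # step size of each update, discount factor

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reward for reaching the goal
    return next_state, 0.0, False

rng = random.Random(0)  # fixed seed so the run is reproducible
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(200):  # episodes
    state, done = 0, False
    while not done:
        action_index = rng.randrange(len(ACTIONS))  # explore with random actions
        next_state, reward, done = step(state, ACTIONS[action_index])
        # Q-learning update: move the estimate toward
        # reward + discounted value of the best next action.
        best_next = 0.0 if done else max(q_table[next_state])
        target = reward + GAMMA * best_next
        q_table[state][action_index] += ALPHA * (target - q_table[state][action_index])
        state = next_state

# The learned policy picks, in each cell, the action with the highest estimated value.
policy = ["left" if q[0] > q[1] else "right" for q in q_table[:-1]]
print(policy)  # ['right', 'right', 'right', 'right']
```

Nobody tells the agent that moving right is correct. It only sees rewards, and the policy falls out of the values it learned.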
Learning incrementally or not?
- Batch Learning: the system doesn’t learn incrementally, so it must be trained using all available data at once, which is typically done offline. Once the system is trained, it goes into action and doesn’t learn anymore. If it needs to learn from new data, it must be stopped and replaced with a new system trained on the full, updated dataset. Because retraining can take a long time, this train-and-replace cycle is usually automated, which lets a batch learning system keep up with a changing environment.
- Online/Incremental Learning: the system learns incrementally as it is fed a continuous stream of data instances, either one by one or grouped in small batches. It works great with data that changes a lot.
The learning rate is something that needs to be set when working with online learning systems. It defines how fast the algorithm should adapt to new data. A high learning rate makes the system adapt quickly to new data, but it also tends to forget the old data quickly. A lower learning rate gives the system more inertia, which makes it less sensitive to noise in the new data.
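The effect of the learning rate can be seen in a tiny sketch of online learning. Here the "model" is just a running estimate of a value, updated one instance at a time; the stream of numbers and the function name are assumptions for illustration.

```python
def online_estimate(stream, learning_rate, start=0.0):
    """Update a running estimate one instance at a time."""
    estimate = start
    history = []
    for value in stream:
        # Move the estimate toward the new instance by a fraction
        # set by the learning rate.
        estimate += learning_rate * (value - estimate)
        history.append(estimate)
    return history

# Hypothetical stream: the typical value is 10, with one noisy outlier (50).
stream = [10, 10, 10, 50, 10, 10]

fast = online_estimate(stream, learning_rate=0.9, start=10.0)
slow = online_estimate(stream, learning_rate=0.1, start=10.0)

print(fast[3])  # jumps to about 46 right after the outlier
print(slow[3])  # moves only to about 14
```

With the high learning rate the estimate almost forgets everything it saw before the outlier, while the low learning rate barely reacts to it.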
Instance-based or Model-based learning?
Generalization is an important task of machine learning systems. Algorithms must be able to generalize to new instances, which means handling incoming data they have not seen before.
- Instance-based learning: generalizes to new data using a similarity measure. The system compares incoming data with the instances it has already learned and assigns each new instance accordingly.
- Model-based learning: generalizes by building a model from a set of examples. The model is then used to predict where the incoming data will fit.
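The two approaches can be compared side by side in a short sketch. It assumes a 1-nearest-neighbor rule for the instance-based case and a nearest-centroid model for the model-based case; both are simple choices made for this example, and the data is invented.

```python
import math

# Hypothetical labeled instances: (age, salary in thousands) -> purchased?
training_set = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def predict_instance_based(new_point):
    """1-nearest-neighbor: keep all instances and copy the label of the most similar one."""
    nearest = min(training_set, key=lambda item: math.dist(item[0], new_point))
    return nearest[1]

def fit_model(instances):
    """Model-based: summarize the data as one centroid per label (the model's parameters)."""
    sums = {}
    for (age, salary), label in instances:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += age
        total[1] += salary
        total[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict_model_based(model, new_point):
    """Predict with the learned parameters only; the training data is no longer needed."""
    return min(model, key=lambda label: math.dist(model[label], new_point))

model = fit_model(training_set)
print(predict_instance_based((48, 85)))      # yes
print(predict_model_based(model, (28, 33)))  # no
```

The instance-based predictor needs the whole training set at prediction time, while the model-based one only needs the two centroids it learned.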
In order to build a machine learning system for your needs, the following points need to be specified:
- How is it going to be trained? Supervised, Unsupervised, Semisupervised or through Reinforcement Learning?
- How is it going to learn? Incrementally (online) or through batch learning (offline)?
- How is it going to generalize? Instance-based or model-based?
In Part 2, we’re going to build a machine learning system from scratch! Subscribe to my newsletter to keep updated.