A Hands-On Approach to Machine Learning (part 1)

We’ll start by defining some important concepts and attributes of machine learning systems, the things you need to understand before you start coding one. By the time you finish reading this article, you’ll have a notion of why you would use an ML solution and what you need to build it.

Defining Machine Learning

Machine Learning is the science of programming computers so they can learn from data. An ML-based system processes raw data and transforms it into training instances, which together make up the training set.

Summarizing

Raw data: pure data, unprocessed

Training instances: processed sample data, e.g. salary, purchased or not, nationality…

Training set: the collection of training instances that the learning algorithm uses to learn on its own.

[Image: example training set with the columns Country, Age, Salary and Purchased]
This is an example of a training set. Each row is a training instance. Note that the “Purchased” field is still unprocessed: a computer works with 0s and 1s, not with “Yes” and “No”.

From the example above:

  • Country, Age, Salary and Purchased are attributes (also called data types)
  • A feature is usually an attribute plus its value (“Country” = “France”)
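
As a rough sketch of the terms above, the snippet below turns a few raw records into training instances. The rows, the values and the helper name `to_training_instance` are made up for illustration; only the column names come from the example.

```python
# Hypothetical raw records; only the column names come from the example above.
raw_data = [
    {"Country": "France", "Age": "44", "Salary": "72000", "Purchased": "No"},
    {"Country": "Spain", "Age": "27", "Salary": "48000", "Purchased": "Yes"},
    {"Country": "Germany", "Age": "30", "Salary": "54000", "Purchased": "No"},
]


def to_training_instance(record):
    """Convert one raw record into a processed training instance."""
    return {
        # Country stays a string here; most algorithms would need it encoded too.
        "Country": record["Country"],
        "Age": int(record["Age"]),
        "Salary": float(record["Salary"]),
        # Encode the Yes/No answer as 1/0 so an algorithm can work with it.
        "Purchased": 1 if record["Purchased"] == "Yes" else 0,
    }


# The training set is simply the collection of processed instances.
training_set = [to_training_instance(r) for r in raw_data]

# A feature is an attribute plus its value.
first_feature = ("Country", training_set[0]["Country"])
print(training_set)
print(first_feature)
```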

ML/AI systems vs. Mechanical Systems

Spam filters were one of the first practical, mainstream uses of machine learning, and they illustrate the difference well. A spam filter usually analyzes the words within the email itself, looking for red flags.

If you’re building a mechanical spam filter, you have to hardcode every spam red flag. Although this can be effective, it is not efficient, since spam strategies are constantly changing.

What I’m saying is that a lot of human effort would be required to keep such a mechanical system up to date. In an ML scenario, the system would learn incrementally by itself by being fed training data (online learning, preferably).
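
To make the contrast concrete, here is a toy sketch of both approaches. It is not how a production spam filter works: the word lists, the example emails and the function names are all assumptions, and the "learning" is just counting words in labeled emails.

```python
from collections import Counter

# Mechanical approach: every red flag is hardcoded and must be maintained by hand.
HARDCODED_RED_FLAGS = {"free", "winner", "prize"}


def mechanical_is_spam(email):
    return any(word in HARDCODED_RED_FLAGS for word in email.lower().split())


# ML approach (toy version): learn which words are red flags from labeled examples.
def learn_red_flags(labeled_emails):
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in labeled_emails:
        target = spam_counts if is_spam else ham_counts
        target.update(set(text.lower().split()))
    # A word becomes a red flag when it shows up in spam more often than in normal mail.
    return {word for word, n in spam_counts.items() if n > ham_counts[word]}


def learned_is_spam(email, red_flags):
    return any(word in red_flags for word in email.lower().split())


# Hypothetical training data: spammers switched from "free" to "fr33".
training_data = [
    ("claim your fr33 gift now", True),
    ("fr33 gift inside", True),
    ("meeting notes for monday", False),
    ("your invoice for monday", False),
]
red_flags = learn_red_flags(training_data)

new_email = "fr33 gift waiting"
print(mechanical_is_spam(new_email))
print(learned_is_spam(new_email, red_flags))
```

The hardcoded filter misses the new spelling until someone edits the list, while the learned one picks it up from the labeled examples.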

Training what we can’t (or don’t want to) code

Coding a speech recognizer or a personal assistant like Siri or Alexa became practical thanks to machine learning. In principle they could be coded without any ML, but that is exactly the kind of work that becomes unnecessary once you have the powerful tools of ML.

Imagine if you had to hardcode all possible variations of each word and assign all of them to the corresponding letters… a huge chunk of work. Writing an algorithm that learns by itself is a better idea, given many examples for each word.

We can now conclude that ML and AI open up countless possibilities for innovation, since they shorten the time needed to build such systems.

So, when to use machine learning?

  • Dynamic environment (ML can adapt to new data using online/batch learning)
  • Gaining insight into complex problems and large amounts of data
  • Complex problems that are not so easy to code or would require a lot of human hours
  • Huge amounts of data and no known or developed algorithms

 

Machine Learning Systems

There are three ways to generally classify machine learning systems or algorithms:

  • Whether or not they are trained under human supervision (supervised, unsupervised, semisupervised or reinforcement learning)
  • Whether they can learn incrementally while running (online or batch learning)
  • Whether they compare new data points to known data points, or detect patterns in the training data and build a predictive model (instance-based or model-based learning)

How systems are trained

Supervised Learning

The training data fed to the algorithms includes the desired solutions, called labels. Therefore, every training instance will contain a label.

Classification and Regression are typical supervised learning tasks:

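  • Classification assigns each instance to one of a set of known classes (for example, spam or not spam)
  • Regression predicts a numeric target value from a set of features, called predictors, by learning from examples that include both the predictors and their labels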
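
As a minimal sketch of a regression task, the snippet below fits a straight line to a tiny labeled training set using the textbook least-squares formulas. The data (years of experience and salary) and the function name `fit_line` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical labeled training set:
# years of experience (predictor) and salary in thousands (label).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]

slope, intercept = fit_line(years, salary)

# Predict the label of a new, unseen instance.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```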

Most important supervised learning algorithms:

  • Neural Networks (which can also be unsupervised)
  • Decision Trees
  • Random Forests
  • Linear Regression
  • Logistic Regression

Unsupervised Learning

The training data is unlabeled, so the system has to find structure in the data by itself. There are three general uses for unsupervised learning:

  • Clustering
  • Association rule learning
  • Visualization and dimensionality reduction

Clustering will divide instances into clusters – which are groups that share traits in common.
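
One common clustering algorithm is k-means. Below is a stripped-down sketch for one-dimensional data; the ages, the starting centroids and the function name `k_means_1d` are assumptions, and a real project would more likely use a library implementation.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny k-means for one-dimensional data, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: customer ages.
ages = [18, 20, 22, 41, 43, 45]
centroids, clusters = k_means_1d(ages, centroids=[18.0, 45.0])
print(centroids)
print(clusters)
```

Note that no labels are involved: the two groups emerge from the values alone.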

Dimensionality reduction aims to simplify the data without losing too much information. For instance, the price of a house might be strongly correlated with its location, so a dimensionality reduction algorithm may merge the two into a single feature. This technique is called feature extraction. With fewer features, training usually runs faster and the data takes up less memory.
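
One way to merge two correlated attributes into a single feature is principal component analysis (PCA). The sketch below handles only the two-dimensional case, using the closed-form angle of the principal axis. I swapped in house size and number of rooms because they are easy to express as numbers; the data and the function name are assumptions.

```python
import math


def first_principal_component(points):
    """Project 2-D points onto the direction of greatest variance (1-D PCA)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the principal axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Each point is replaced by a single number: its position along that axis.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


# Hypothetical houses: (size in square meters, number of rooms).
houses = [(50, 2), (75, 3), (100, 4), (125, 5)]
direction, merged_feature = first_principal_component(houses)
print(direction)
print(merged_feature)
```

In this made-up data the two attributes are perfectly correlated, so the single merged feature keeps all of the variance. With real data some information is lost, and features on different scales are usually standardized first.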

Anomaly detection, as used in credit card fraud detection, is also a task for unsupervised learning.

Semisupervised Learning

A combination of supervised and unsupervised algorithms: a portion of the data is labeled and the rest is not. Usually, the system identifies patterns or clusters in the data, and then the programmer assigns a label to each pattern or cluster.

Deep Belief Networks (DBNs) are based on Restricted Boltzmann Machines (RBMs), which are unsupervised learning components stacked on top of one another. The RBMs are trained through unsupervised learning, and then the whole system is fine-tuned using supervised learning techniques (that is, using labels).

Facial recognition is a good example of semisupervised learning: on its own, the system detects that a person appears in a picture and, depending on the system, also picks up their physical attributes (hair and eye color, skin tone, shapes…). You then label that person with a name and any other necessary information.

Reinforcement Learning

The learning system is called an agent in reinforcement learning. The agent observes the environment and performs actions, for which it receives rewards or penalties. It learns by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose in a given situation.

Reinforcement Learning is commonly used in robots with many degrees of freedom, for tasks such as walking, picking up objects and opening doors!
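
The agent, reward and policy described above can be sketched with tabular Q-learning, one of several reinforcement learning algorithms. The environment here is an assumption: a made-up corridor of five cells where the agent is rewarded for reaching the last one. The constants and function names are illustrative too.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # step left or step right


def step(state, action):
    """Environment: move along the corridor, reward 1.0 for reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore at random; Q-learning can still learn the best policy from this.
            action = rng.choice(ACTIONS)
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted value of the next state.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q_table = train()

# The policy picks, for each state, the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

After training, the policy should point toward the goal from every cell, even though nobody told the agent which direction is correct.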

Learning incrementally or not?

  • Batch Learning: the system doesn’t learn incrementally, so it must be trained using all available data at once, which is typically done offline. Once the system is trained, it goes into action and doesn’t learn anymore. If it needs to learn from new data, a new version must be trained from scratch on the full dataset (old and new data), and then the old system is stopped and replaced. Training can take a long time, but the whole train-and-replace cycle can be automated, so even a batch learning system can adapt to change.
  • Online/Incremental Learning: the system learns incrementally as it is fed data instances sequentially, either one at a time or in small groups called mini-batches. It works great with data that changes a lot.

The learning rate is something that needs to be set when working with online learning systems. It defines how fast the algorithm should adapt to new data. A high learning rate makes the system adapt quickly to new data, but it also tends to forget the old data quickly. A low learning rate gives the system more inertia: it learns more slowly, but it is less sensitive to noise in the new data.
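
Here is a minimal sketch of an online learner, assuming a one-weight linear model updated by stochastic gradient descent. The class name, the learning rates and the data stream are all made up; the point is only to show how the learning rate changes the speed of adaptation.

```python
class OnlineLinearModel:
    """Learns y = weight * x (approximately) one instance at a time."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.weight = 0.0

    def predict(self, x):
        return self.weight * x

    def learn_one(self, x, y):
        error = y - self.predict(x)
        # Larger learning rates take bigger steps toward each new instance.
        self.weight += self.learning_rate * error * x


fast = OnlineLinearModel(learning_rate=0.2)
slow = OnlineLinearModel(learning_rate=0.02)

# Hypothetical data stream: y = 2x at first, then the relationship drifts to y = 3x.
stream = [(x, 2 * x) for x in (1, 2) * 100] + [(x, 3 * x) for x in (1, 2) * 2]
for x, y in stream:
    fast.learn_one(x, y)
    slow.learn_one(x, y)

print(round(fast.weight, 3), round(slow.weight, 3))
```

After only four instances of the new relationship, the model with the higher learning rate is already much closer to 3, while the slower one is still near 2. On noisy data the slower model would be the steadier of the two.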

Instance-based or Model-based learning?

Generalization is an important task of machine learning systems. Algorithms must be able to generalize to new instances, which means making good predictions on incoming data they have never seen before.

  • Instance-based learning: generalizes to new data using a similarity measure. The system compares incoming data with the instances it has already learned and labels each new instance accordingly
  • Model-based learning: generalizes by building a model from a set of examples. That model is then used to make predictions for the incoming data.
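
The two styles can be compared side by side in a small sketch. The instance-based learner is a k-nearest-neighbors vote; the model-based learner boils the data down to a single salary threshold. The data, the function names and the choice of a threshold model are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(training_set, new_point, k=3):
    """Instance-based: label a new point by a vote of its k nearest neighbors."""
    by_distance = sorted(
        training_set, key=lambda item: math.dist(item[0], new_point)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def fit_threshold(training_set):
    """Model-based: summarize the data as one salary threshold."""
    bought = [features[1] for features, label in training_set if label == 1]
    not_bought = [features[1] for features, label in training_set if label == 0]
    # Midpoint between the average salary of each class.
    return (sum(bought) / len(bought) + sum(not_bought) / len(not_bought)) / 2


# Hypothetical labeled instances: (age, salary in thousands) -> purchased (1) or not (0).
training_set = [
    ((25, 30), 0),
    ((27, 35), 0),
    ((30, 40), 0),
    ((45, 80), 1),
    ((50, 90), 1),
    ((48, 85), 1),
]

# Instance-based: the training set itself is needed at prediction time.
print(knn_predict(training_set, (47, 82)))

# Model-based: after fitting, only the learned parameter is needed.
threshold = fit_threshold(training_set)
print(1 if 82 > threshold else 0)
```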

Summarizing

In order to build a machine learning system for your needs, the following points need to be specified:

  • How is it going to be trained? Supervised, Unsupervised, Semisupervised or through Reinforcement Learning?
  • How is it going to learn? Incrementally (online) or through batch learning (offline)?
  • How is it going to generalize? Instance-based or model-based?

In Part 2, we’re going to build a machine learning system from scratch! Subscribe to my newsletter to keep updated.

Questions? Comment below or email them to brunocamposdev@gmail.com

Essential Python Machine Learning Libraries

These essential Python libraries will save you a lot of time when dealing with data analysis and machine learning. I’ve listed the most used libraries and their main uses.

Enjoy!

NumPy

  • Numerical Python, used for numerical computing
  • Fast multidimensional array object, ndarray
  • Operations between arrays
  • Reading and writing array-based datasets to disk
  • Linear algebra, Fourier transform, random numbers
  • C API to enable extensions and C or C++ code to access data structures and computational facilities
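
A short sketch of a few of these features, assuming NumPy is installed (the arrays are made-up values):

```python
import numpy as np

# ndarray: a fast multidimensional array.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Operations between arrays are element-wise.
total = a + b

# Linear algebra: solve a @ x = y for x.
y = np.array([5.0, 11.0])
x = np.linalg.solve(a, y)

# Random numbers from a seeded generator.
rng = np.random.default_rng(0)
samples = rng.normal(size=3)

print(total)
print(x)
print(samples.shape)
```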

 

 

Pandas

  • High-level data structures and functions that make working with structured or tabular data fast and easy
  • DataFrame, a tabular, column-oriented data structure with both row and column labels, and Series, a one-dimensional labeled array object
  • Combines the array-computing ideas of NumPy with the data manipulation capabilities of relational databases
  • Reshape, slice and dice, aggregations, subsets of data
  • Data structures with labeled axes supporting automatic or explicit data alignment
  • Integrated time series functionality
  • The same data structures handle both time series and non-time series data
  • Arithmetic operations and reductions that preserve metadata
  • Merges, joins and other relational operations found in SQL databases
  • Flexible handling of missing data
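
A small sketch of three of these points (missing data, aggregation and automatic alignment), assuming pandas is installed. The table contents are made up.

```python
import pandas as pd

# Hypothetical data with one missing salary.
df = pd.DataFrame(
    {
        "Country": ["France", "Spain", "France", "Germany"],
        "Age": [44, 27, 38, 30],
        "Salary": [72000.0, 48000.0, None, 54000.0],
    }
)

# Flexible handling of missing data: fill the gap with the column mean.
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Aggregation: average salary per country.
mean_salary = df.groupby("Country")["Salary"].mean()

# Automatic alignment: Series are matched on their labels, not their positions.
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "a"])
aligned_sum = s1 + s2

print(mean_salary)
print(aligned_sum)
```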

Matplotlib

  • Plots and other two-dimensional data visualizations.

SciPy

  • Collection of packages addressing a number of different standard problem domains
  • scipy.integrate: numerical integration routines and differential equation solvers
  • scipy.linalg: linear algebra routines and matrix decompositions
  • scipy.optimize: function optimizers (minimizers) and root-finding algorithms
  • scipy.signal: signal processing tools
  • scipy.sparse: sparse matrices and sparse linear system solvers
  • scipy.special: wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function
  • scipy.stats: continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests and more descriptive statistics

Scikit-learn

  • Classification: nearest neighbors, random forest, logistic regression, SVM…
  • Regression: Lasso, ridge regression…
  • Clustering: k-means, spectral clustering…
  • Dimensionality reduction: PCA, feature selection, matrix factorization…
  • Model selection: Grid search, cross validation…
  • Preprocessing: feature extraction and normalization
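
To give a feel for how these pieces fit together, here is a minimal sketch that chains preprocessing and classification, assuming scikit-learn is installed. The tiny dataset is made up, and a real project would need far more data plus proper model selection.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: (age, salary in thousands) -> purchased (1) or not (0).
X = [[25, 30], [27, 35], [30, 40], [32, 42], [45, 80], [50, 90], [48, 85], [52, 95]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out a quarter of the instances to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (normalization) and classification chained into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
print(model.predict([[26, 33], [49, 88]]))
```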

Statsmodels

  • Statistical analysis and econometrics
  • Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effect models…
  • Analysis of variance
  • Time series analysis
  • Nonparametric methods: Kernel density estimation and regression
  • Visualization
  • Statistical inference, uncertainty and p-values