Machine Learning???

If you read the first blog post So You Want to Start Coding… you know that I mentioned machine learning and artificial intelligence. But what exactly is machine learning? Before we dive into the inner workings of the field, let’s talk about some ways machine learning has changed the world we live in!

In this post, you will learn:

  • What Machine Learning/Artificial Intelligence is and Where it is Used
  • The Different Types of Machine Learning
  • The Machine Learning Workflow

1. What Machine Learning/Artificial Intelligence is and Where it is Used

Some of the most high-profile ways in which machine learning has made its impact are algorithms that have beaten humans at games such as Go and chess. The surge of interest and work in machine learning has also led to better medical diagnoses and subsequent treatments, as well as systems that monitor and predict stocks, forecast revenue, detect objects, and much more.

What you may not know is that almost every industry is weaving machine learning into its operations. Google Maps uses machine learning to predict things such as bus traffic delays and projected arrivals and departures in real time, DoorDash and Uber Eats use machine learning to estimate when your delivery will arrive, and the chat window on a website is often an artificial intelligence chatbot! There have even been instances where companies have used machine learning and artificial intelligence to write marketing copy, because their bots can produce such convincing ads.

Probably the best-known ML/AI project to date is the self-driving car. These cars use this technology to recognize the objects in front of and around the vehicle, allowing the car to drive itself safely.

While machine learning is making a huge positive impact on society, there are instances where these algorithms introduce bias that can harm groups of people. This includes the use of facial recognition software in judicial systems for prosecution and sentencing. The bias comes from the fact that machine learning models train themselves on existing data and then make predictions on new data, a concept we will dive deeper into in the following sections: if that data carries an inherent bias, the model can make unfair predictions about a defendant. Just like with everything in life, there are pros and cons to every technology.

Now that we have discussed some of the ways in which machine learning and artificial intelligence can be used, let’s actually figure out what the heck these terms mean…

Let’s first break down what Artificial Intelligence (or simply AI) is. Many times when people speak about AI today they are really referring to machine learning, but AI simply means the set of tools used to make computers behave intelligently. AI has many subfields, including robotics and… yes, you guessed it… machine learning. Machine learning has become the most prevalent subset, giving rise to a field of its own.

So, what is Machine Learning (or simply ML)??? The term does not come bottled with an easy definition, so let’s talk about how we can define it. The coolest part about ML is that it has many applications and draws on a core set of other fields. The simplest definition of ML is a set of tools for making predictions and inferences from data.

Let’s pause… we just introduced two new terms that we should discuss to better understand ML and what it does…

  • Predictions
  • Inferences
  1. When discussing ML, we want to predict future events. For example, will the train leave on time? Our prediction would be yes, with a 75% probability that it will leave on time and 25% probability that it will be delayed.
  2. We then use ML to infer the causes of these behaviors and events. Why might the train be delayed? We use the location, temperature, weather, time of year, traffic pattern, number of passengers, distance from the previous location, etc. to determine whether the train will leave at its designated time or possibly be delayed.
  3. We also infer patterns, such as patterns in the train schedule. This could include whether other trains have left on time or been delayed, weather conditions, travel peaks, traffic conditions, etc.

Inferences help the models make predictions based on the input data; however, they require different types of ML techniques.
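To make the idea of a prediction concrete, here is a minimal sketch in Python using scikit-learn. The train-trip features and numbers are entirely made up for illustration; the point is just to show a model returning a probability rather than a hard yes or no.

```python
# A minimal sketch of "prediction" in code, using made-up train data.
# The feature names and numbers are hypothetical, purely to show the idea.
from sklearn.linear_model import LogisticRegression

# Each row: [temperature_f, passengers, minutes_late_yesterday]
past_trips = [
    [70, 120, 0],
    [30, 300, 12],
    [65, 150, 2],
    [25, 280, 20],
]
left_on_time = [1, 0, 1, 0]  # 1 = departed on time, 0 = delayed

model = LogisticRegression()
model.fit(past_trips, left_on_time)

# Predict the probability that today's trip leaves on time.
today = [[68, 140, 1]]
print(model.predict_proba(today))  # e.g. [[0.25, 0.75]] -> roughly 75% chance of on time
```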

How does this all work…

As mentioned above, ML spans a multitude of fields, mixing computer science and statistics. The point of these ML models is to learn patterns and make inferences without being explicitly programmed. An algorithm does this by learning patterns from the existing data that the engineer feeds it, and then applying what it has learned to new data to predict an outcome. An algorithm is simply a set of instructions for a computer to follow.

Note: For any ML/AI algorithm and project to be successful, it needs high-quality, thorough data to learn from.

What is high-quality data?

It is important to have high-quality data in order to predict an outcome accurately. Data comes in all shapes and sizes and is used to make informed decisions. There are typically two types of data: structured and unstructured. Structured data is in tabular form, like the data in a spreadsheet such as Excel or in a CSV: rows represent the different examples and columns represent the features. Unstructured data is data that cannot be stored in a traditional tabular form. Some examples of unstructured data are sentences or images.
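Here is a quick, hypothetical illustration of the difference in Python. The column names and values are invented; structured data slots neatly into rows and columns, while unstructured data does not.

```python
# A quick illustration of structured vs. unstructured data (columns are hypothetical).
import pandas as pd

# Structured: tabular, like a spreadsheet or CSV -- one row per example, one column per feature.
structured = pd.DataFrame({
    "age": [54, 61, 47],
    "smokes": [True, False, True],
    "blood_pressure": [130, 145, 120],
})
print(structured)

# Unstructured: free-form text (or images, audio, ...) with no built-in rows and columns of features.
unstructured = [
    "Patient reports occasional chest pain after exercise.",
    "No shortness of breath; sleeps well.",
]
print(unstructured)
```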

Data preparation is a required step for all machine learning algorithms and projects. This topic will be discussed further in its own following blog post (stay tuned!).

Data this, data that… What about Data Science?!

As it turns out, ML is a very important tool for data scientists and data science work. Data science is all about making discoveries from data and surfacing valuable insights.

You may be wondering what all of this looks like in practice, and the answer is that it comes together in machine learning models. An ML model is a statistical representation of a real-world problem or process, built from data. You pass new input data into the model, and it returns an outcome based on that data.

Let’s look at a practical example. We can predict whether an email is spam or not by passing the email to the model, which then returns a probability that the email is spam or a value that represents yes or no.
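As a hedged sketch of what that could look like in code, here is a tiny spam model built with scikit-learn. The example emails and labels are invented, and a real spam filter would be far more involved.

```python
# A small sketch of a spam model; the emails and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "WIN a FREE prize, click now",
    "Meeting moved to 3pm tomorrow",
    "Cheap meds, limited offer",
    "Can you review my pull request?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

new_email = ["Claim your free prize today"]
print(model.predict(new_email))        # a yes/no style answer, e.g. [1]
print(model.predict_proba(new_email))  # a probability for each class
```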

Note: AI and Data Science do not share the same goals. Data science is about using data to generate insights, while AI is about making computers behave intelligently. They do overlap, and that overlap is usually machine learning.

So far you have learned that:

  • The applications for machine learning are not limited to a specific problem, and can span many fields and domains
  • In order to make predictions from different ML models, you need high-quality data
  • ML discovers patterns in existing data and then applies those patterns to new data

2. The Different Types of Machine Learning

There are typically three different types of machine learning:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Reinforcement learning is used to decide sequential actions – for example, choosing the next move in a chess game, or a robot deciding its next action. This type of learning involves concepts such as game theory and other complex mathematics that will be covered in a series of blog posts of their own.

Supervised learning and unsupervised learning are the two most common types of ML. The main difference between these two types is their training, or input, data. The training data is the existing data that the ML model learns from. When you pass data to a model in order for it to learn from that information, this is called training the model.

The training data for supervised models is labeled. This means that alongside the features – the different pieces of information, such as the columns of a dataframe or Excel sheet – the value of our target is known for every example in the training data. In other words, we know before training what the correct answer for the prediction should be. For supervised models, we input the labeled examples to the model for training and try to predict the output as accurately as possible, so we can compare the model's output to the known correct labels.

An example of this is determining if a patient has heart disease. Some features could be the patient's age, sex, whether they smoke, how many cigarettes they smoke daily, their type of chest pain, blood sugar, blood pressure, etc. The training data we use consists of previous patients' features and whether they actually had heart disease or not. The fact that we know whether the patients in our input data had heart disease before training is what makes this supervised learning. The data is labeled and known before training and predicting.

Once the training is done, we can pass new data to the model, and when it has finished evaluating the data it returns a prediction.
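Here is a minimal sketch of that supervised setup in Python, assuming invented patient values: the features X come with known labels y, the model is trained on both, and then it predicts a label for a new, unseen patient.

```python
# Supervised learning in miniature: features X plus known labels y.
# The patient values below are invented purely to show the shape of the data.
from sklearn.ensemble import RandomForestClassifier

# Each row: [age, smokes (1/0), cigarettes_per_day, resting_blood_pressure]
X_train = [
    [63, 1, 20, 150],
    [45, 0, 0, 120],
    [58, 1, 10, 140],
    [39, 0, 0, 118],
]
y_train = [1, 0, 1, 0]  # the label we already know: 1 = had heart disease, 0 = did not

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# A new, unlabeled patient -- the model returns its best guess at the label.
new_patient = [[52, 1, 15, 135]]
print(model.predict(new_patient))
```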

In unsupervised learning, the training data is not labeled; the correct answers are unknown. To reuse our earlier spam example, we can use input training data such as the email text, the sender's address, the email domain, the subject line, etc. What we do not know is whether each email in the training data is spam or not.

Unsupervised learning is useful for detecting anomalies, dividing data into groups (known as clustering), analyzing text, and so on.

Unsupervised learning is an important technique because more often than not, data does not come with labels. A majority of the time, if the labels are unknown, it takes too much manual time and labor to go through and label all of your data, so you can use unsupervised methods instead. The unsupervised models will find their own patterns within the data.
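As a small sketch of the unsupervised idea, here is a clustering example with scikit-learn's KMeans. The numbers are made up; notice that no labels are passed in, yet the model still splits the data into groups.

```python
# Unsupervised learning in miniature: no labels, the model groups the data itself.
# The numbers are invented; KMeans simply looks for clusters in the feature values.
from sklearn.cluster import KMeans

# Each row: [emails_sent_per_day, average_email_length]
data = [
    [5, 120], [7, 150], [6, 130],      # looks like one kind of sender
    [200, 15], [180, 20], [220, 18],   # looks like another kind of sender
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(data)
print(cluster_ids)  # e.g. [0 0 0 1 1 1] -- two groups found without any labels
```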

3. The Machine Learning Workflow

You have learned that historical data is used to train a machine learning model to learn and infer patterns and ultimately return a prediction. The workflow is the set of steps that takes you from determining the type of problem and data, through design and implementation, and everything in between.

The ML workflow consists of four main steps. Let's use the same heart disease example to walk through each step.

Step 1: Extract/Determine Features

It is usually the case that datasets will not come clean and pristine. Most of the time you will need to write code (often SQL plus various pre-processing techniques) to gather and clean the data yourself; if you work with a messy dataset as-is, you will most likely not achieve accurate or usable results. Step 1 of the ML workflow is to extract features from the raw data.

For our example, some features we may want to extract are age, sex, cholesterol levels, blood pressure, blood sugar, whether they smoke, how many cigarettes they smoke daily, their type of chest pain, etc.
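A rough sketch of what this step might look like with pandas is below. The file name and column names are hypothetical; the idea is simply selecting useful columns and deriving new features from messier raw ones.

```python
# A sketch of Step 1: pulling useful features out of raw data with pandas.
# The file name and column names below are hypothetical.
import pandas as pd

raw = pd.read_csv("patients_raw.csv")

# Keep the columns we believe are useful, numeric features
# (categorical columns such as chest pain type would first need to be encoded as numbers).
features = raw[["age", "cholesterol", "blood_pressure", "blood_sugar"]].copy()

# Derive a new feature from a messier raw column.
features["smokes"] = (raw["cigarettes_per_day"].fillna(0) > 0).astype(int)

# The known answer for each past patient -- this is what makes the problem supervised.
labels = raw["had_heart_disease"]
```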

Step 2: Split Dataset Into Training/Testing

After you have extracted your features and cleaned your dataset, it is time to move on to the second step: splitting the dataset into training and testing sets. Why this matters will become clear in the following steps, but keep in mind that splitting into a train set and a test set is a vital part of the ML workflow.
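A minimal sketch of this step, assuming the features and labels from the Step 1 sketch, might look like this with scikit-learn:

```python
# Step 2 sketch: hold some data back for testing, assuming `features` and `labels` from Step 1.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,    # keep 20% of the data unseen for evaluation
    random_state=42,  # fixed seed so the split is reproducible
)
```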

Step 3: Train the Model

The theory behind training the model is fairly simple. The data is passed into the ML model that you chose (or programmed from scratch, something else we will discuss in later posts!!); the model then outputs its findings.
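Continuing the same sketch, training could look like the snippet below. Logistic regression is just one reasonable choice for a yes/no question like heart disease, not the required algorithm.

```python
# Step 3 sketch: fit a chosen model on the training split from Step 2.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # the model learns patterns only from the training data
```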

Step 4: Evaluate the Model

Evaluating the model is incredibly important because we want to make sure that it is accurate and predicting well. We can evaluate this by passing in new data and checking the output predictions. The most important thing to remember is that you do not want to evaluate the model on any data that was used for training, because the model has already seen that data. Voilà! This is what the testing dataset is for! To evaluate, you pass the test set to the model and assess its predictions. The test set is typically called “unseen” data.

It’s important to evaluate the predictions on the test set to see how accurate the model is. This can be done by finding the average error of the predictions – determining how many data points the model predicted correctly versus incorrectly – or the percentage of predictions that are accurate within a certain margin.
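Continuing the sketch from the earlier steps, a simple accuracy check on the test set might look like this:

```python
# Step 4 sketch: evaluate on the unseen test set, continuing from the earlier steps.
from sklearn.metrics import accuracy_score

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)  # fraction of test patients predicted correctly
print(f"Test accuracy: {accuracy:.2%}")
```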

It is also important to establish a performance threshold: deciding whether the accuracy of the model is good enough for your use case. For example, if a model predicts with 85% accuracy, is that good enough for your project? If the accuracy is good enough, then your model is ready for use in production! If it is not, then you need to go back to training the model, but this time you can fine-tune it using different techniques. Tuning can mean a couple of different things: you could change the input features or input data fed to the model, or you could add more features. Tuning is time consuming, but a necessary and important step when building accurate machine learning models.
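As a rough sketch of one simple tuning loop, the snippet below tries a few regularization strengths and keeps the best one. (In practice you would usually judge tuning candidates on a separate validation set or with cross-validation rather than the test set, but the idea is the same.)

```python
# A sketch of a simple tuning loop: try a few model settings and keep the best.
# Note: ideally candidates are compared on a validation set, not the final test set.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

best_accuracy, best_c = 0.0, None
for c in [0.01, 0.1, 1.0, 10.0]:  # regularization strengths to try
    candidate = LogisticRegression(C=c, max_iter=1000)
    candidate.fit(X_train, y_train)
    acc = accuracy_score(y_test, candidate.predict(X_test))
    if acc > best_accuracy:
        best_accuracy, best_c = acc, c

print(f"Best setting C={best_c} with accuracy {best_accuracy:.2%}")
```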

In many cases, if you perform a great deal of different tuning techniques and the performance of the model does not improve, it could mean you do not have enough data.

Summary:

Let’s go through what you have learned thus far!

  1. Extracting features: the first step is to explore the dataset and manipulate it in such a way to find useful features for the model.
  2. Train, Test, Split: it is incredibly important to split the dataset into the train and test sets so that you can validate the accuracy of the model and evaluate the performance.
  3. Training: after splitting the dataset, you will train the model on the training set and choose the algorithm that best suits your project use case.
  4. Evaluate: after training and testing, it is time to evaluate the performance of the model. If the performance is desirable, the model is ready to go! If the desired performance has not yet been reached, you can use different techniques to tune the model and improve accuracy. Once tuning is completed, you go back to step 3 and train the model once more before evaluating performance.
