Investigating Machine Learning

Introduction

I have been reading articles, one or two books, watching presentations and demonstrations and following online courses on machine learning for the last two years. I have been thinking about how to apply this knowledge to data analysis, search and match for even longer.

The lofty goal of machine learning is artificial intelligence - the mimicking of human intelligence, the ability to make a machine appear to be human.

When you look at what has been achieved using machine learning - recognizing handwritten digits, identifying faces (and kittens) in images, voice recognition - it can seem daunting. However, after reading about neural nets, boosting algorithms, gradient descent and other machine learning algorithms, it becomes clear that what is being done is fitting data to a pattern recognition or decision tree engine. The power comes from building the decision tree or pattern recognizer from the data (for a programmer the result is similar to a massive set of "if, then, else" statements). As a programmer I would see the task of writing a program by hand to recognize digits as daunting - a massive set of conditions (if this pixel is this intensity then check this pixel, and so on to infinity).

Using machine learning has many barriers - the first of which is resources. Creating training data takes a lot of time and people. It is not a coincidence that music recognition was an early success - there is a massive collection of music with artist and title information available for companies like Apple, Spotify or anyone with a CD collection and a connection to an online CD database to train a system with. Applications such as handwriting and voice recognition needed two things before becoming truly possible: computing resources - GPUs and parallel computing - and a large body of people prepared to categorize data. Data centers, university projects and the internet have provided all of that.

A lot of machine learning is categorization. We have some data and a category for each record - for example loan data and whether each account has a failed payment associated with it. Can we find an algorithm (often referred to as a model) that can predict from new data whether there is a chance that the account holder will default on the loan?
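As a minimal sketch of what "finding a model" can mean, the snippet below learns a single "feature > threshold" rule (sometimes called a decision stump) from a handful of loan records. The records, the field names and the helper names fit_stump and predict are all invented for illustration, and a real system would use far more data and a proper library.

```python
# Toy loan records: every value here is invented for illustration.
# Each row is (missed_payments_last_year, debt_to_income_ratio, defaulted).
LOANS = [
    (0, 0.10, False),
    (0, 0.35, False),
    (1, 0.20, False),
    (1, 0.45, False),
    (2, 0.30, True),
    (3, 0.25, True),
    (3, 0.50, True),
    (4, 0.40, True),
]

# Hypothetical field names, in the same order as the columns above.
FEATURES = ["missed_payments_last_year", "debt_to_income_ratio"]


def fit_stump(rows):
    """Find the single 'feature > threshold' rule with the fewest training errors."""
    best = None
    for index in range(len(FEATURES)):
        # Candidate thresholds are the values actually seen in the data.
        for threshold in sorted({row[index] for row in rows}):
            errors = sum((row[index] > threshold) != row[-1] for row in rows)
            if best is None or errors < best[0]:
                best = (errors, index, threshold)
    return best


def predict(stump, row):
    """Apply the learned rule to a new record (features only)."""
    _, index, threshold = stump
    return row[index] > threshold


stump = fit_stump(LOANS)
errors, index, threshold = stump
print(f"learned rule: {FEATURES[index]} > {threshold} ({errors} training errors)")
print(predict(stump, (5, 0.15)))  # a new account with five missed payments
```

The point is only that the "if" statement is discovered from the categorized data rather than written by an analyst. With millions of fields, the same idea needs the heavier algorithms mentioned earlier.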

Challenges

Even with an understanding of what machine learning is and what it is capable of, it is still difficult to see how to use it. Many examples are trivial: predict how many bedrooms a house has based on its location, number of floors and selling price, or predict whether a cancerous growth is malignant based on its size. These examples can be solved with a program containing two or three if statements, and all involve numeric data. Real machine learning problems can have millions of variables (it may be easier to think of fields or columns instead of variables) and the task of the learning is to find the relationship between all these variables. The mathematics used to describe the different learning algorithms can be confusing.

How do we apply any of these techniques to real world problems?

Ask some questions

It can help to take a step back and consider the problems that you are trying to solve and ask a few questions.

What is the problem? Is the problem defined in an understandable way or is it very vague? If you don't understand the problem then start by defining it in a more formal way.

If you have a well-defined problem, do you have any solutions? How is it handled now? For example, picking out loan accounts that are in danger of defaulting would require an experienced analyst who knows what the danger signs are. This analyst would then design procedures to check for these danger signs. Can these procedures be learnt by a computer - either by machine learning or by designing an application? The example in the DataRobot presentation involves training using a data set that contains data about loans with a category of "is_bad". The learning involves using this data to create a model that can predict the value of "is_bad".

Do you have (or can get) the expertise and resources to build a machine learning system?

Have you broken down the problem into small enough pieces? Can machine learning be applied to any of these pieces?

Is it AI or Machine Learning?

This point gets argued incessantly. You think you have a great application for machine learning but it doesn't get met with enthusiasm when you discuss it. It doesn't have to be AI or ML to have value so don't be discouraged by this.

Data Analysis and Machine Learning

Data analysis is one area where a lot of the work is still done manually: examining data to search for patterns, to understand what is present and what is not, what it represents as an entity or domain, and what the quality of the data is. Much, if not all, of this work can be done by the computer - by learning from previous data analysis, comparing data to previously categorized data and examining the data for patterns. Creating rules is an expensive task and one which still requires a data analyst. Building a computer application is possible (and has been done) but the quality of the application depends on the quality of the rules developed by data analysts, business users and domain experts. Machine learning can help here. The machine learning system will not be a traditional model (neural net, decision tree) but rather an expert system. It will be able to suggest rules based on decisions being made by users.

The engine for data analysis may build on work done earlier. If we have a system where some matches are identified as wrong - either missed or over-matched - then an expert system can analyze these records to find anomalies and suggest rules to address them. The adjustments can be used to train the rules used (including by using a machine learning algorithm). For example, users can be asked to confirm whether changes are improving the system.
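One way to picture "suggesting rules from user decisions" is sketched below. It assumes a hypothetical review log in which each reviewed pair of records is tagged with the kinds of difference found between them and with the user's verdict. The tags, the thresholds and the function name suggest_rules are made up for the example. This is a rough sketch of the idea, not a description of any particular product.

```python
from collections import Counter

# Hypothetical review log: each entry is (differences seen between two records,
# whether the user confirmed the two records are the same entity).
REVIEWS = [
    ({"case", "punctuation"}, True),
    ({"punctuation"}, True),
    ({"punctuation", "abbreviation"}, True),
    ({"abbreviation"}, True),
    ({"abbreviation"}, False),
    ({"postcode"}, False),
    ({"postcode", "case"}, False),
    ({"case"}, True),
]


def suggest_rules(reviews, min_support=3, min_agreement=0.9):
    """Suggest differences that users consistently treat as unimportant."""
    seen = Counter()       # how often each kind of difference was reviewed
    confirmed = Counter()  # how often the pair was still confirmed as a match
    for differences, is_same in reviews:
        for difference in differences:
            seen[difference] += 1
            if is_same:
                confirmed[difference] += 1
    suggestions = []
    for difference in sorted(seen):
        agreement = confirmed[difference] / seen[difference]
        if seen[difference] >= min_support and agreement >= min_agreement:
            suggestions.append((difference, seen[difference], agreement))
    return suggestions


for difference, support, agreement in suggest_rules(REVIEWS):
    print(f"suggest: ignore '{difference}' differences "
          f"({support} reviews, {agreement:.0%} confirmed as matches)")
```

With this toy log only the "punctuation" difference is suggested, because it is the only one that users always accepted as a match. A suggestion like this would still go back to a user for confirmation before it becomes a rule.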

Conclusions

One expression that is popular is "Don't boil the ocean" - as in "if you want to make a cup of tea you don't boil the ocean, you boil just enough water for a cup or pot". If you want to solve a problem you have to reduce its size to a pot of water not an ocean! Break it down until you find a piece or step that can be solved.

One problem with machine learning is overfitting. You can take pairs of records, break them down into n-grams (or pairs of letters), convert this into a form that can be used for a neural net and then train the neural net using this information. Even after all that, you can end up with a model that gives poor results. It may work for the data used to train it but you will need the ocean before it works well. In other words, it may only be able to predict what it has already seen because the use case and data do not suit a neural net.
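The conversion step mentioned above can be sketched roughly as follows. The function names and the two sample names are invented for the example, and the Dice coefficient is included only as a simple comparison that needs no training at all.

```python
def bigrams(text):
    """Break a string into overlapping pairs of letters, ignoring case and spaces."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def to_vector(text, vocabulary):
    """Count how often each vocabulary bigram occurs: a fixed-length numeric input."""
    grams = bigrams(text)
    return [grams.count(gram) for gram in vocabulary]


def dice_similarity(a, b):
    """Dice coefficient on the sets of bigrams: 1.0 means identical sets."""
    set_a, set_b = set(bigrams(a)), set(bigrams(b))
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


left, right = "Jon Smith", "John Smith"  # made-up sample records
vocabulary = sorted(set(bigrams(left)) | set(bigrams(right)))
print(vocabulary)
print(to_vector(left, vocabulary))
print(to_vector(right, vocabulary))
print(round(dice_similarity(left, right), 2))
```

Here the vocabulary is built from just two records. With real data it grows with every new bigram seen, which is one reason a network trained this way may need a great deal of data before it generalizes.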
