Thursday, April 30, 2015

The Effect of Randomising the Order of Your Inputs

This post looks at the impact of randomising the order of your inputs (training examples) on one particular problem. I'll cover the effect of adding input shuffling to a poorly-performing network, and also the effect of removing it from a well-performing network.

Let's cut straight to the punch: always randomise the order. The network will start effective training much sooner, and failing to do so may jeopardise the final result. There appears to be no downside. (If anyone has an example where it actually makes things worse, please let me know.)


Here's a screenshot from my IPython notebook of the initial, poorly-performing network. Sorry for using an image format, but I think it adds a certain amount of flavour to show the results as they appear during development. The table shows that after 13 iterations (picked because it fits in my screenshot easily), the training loss is 1.44 and the accuracy is 11.82%. This network eventually trains to a loss of about 0.6, but only after a large number of iterations (800-odd).

It takes 100 iterations or so to feel like the network is really progressing anywhere, and it's slow, steady progress the whole way through. I haven't run this network out to >1k iterations to see where it eventually tops out, but that's just a curiosity for me. Alternative techniques provide more satisfying results much faster.


Here's the improved result. I added a single line to the code to achieve this effect:

train = shuffle(train)
Adding the shuffling step wasn't a big deal -- it's fast, and conceptually easy as well. It's so effective that I honestly think it would be a good thing for NN libraries to simply do by default rather than leave to the user. We see here that by iteration 13, the valid loss is 0.72, as opposed to 2.53 in the first example. That's pretty dramatic.
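
For anyone who wants to replicate the idea without a library helper, here's a minimal sketch, assuming the training examples and labels are parallel numpy arrays (X and y are just stand-in names, not my actual variables):
import numpy as np

# draw a random permutation of the row indices, then apply it to both
# the examples and the labels so they stay aligned
perm = np.random.permutation(len(X))
X = X[perm]
y = y[perm]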

If anyone knows of any examples where it is better not to randomise the input examples, please let me know!

For the time being, I think the benefits of this result are so clear that a deeper investigation isn't really called for. I'm just going to add it to my 'standard technique' for building NNs going forward, and consider changing this step only if I am struggling with a particular problem at hand. What's more likely is that more sophisticated approaches to input modification will become important, rather than avoiding the step entirely. I'm aware that many high-performing results have been achieved by transforming input variables to add synthetic examples to the training set. Examples of this include image modifications such as skew, colour filtering and other similar techniques. I would be very interested to learn more about other kinds of image modification preprocessing, like edge-finding algorithms, blurring and other standard algorithms.
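
As a rough illustration of what I mean by synthetic examples (not something I've actually applied to this problem), here's a sketch using Pillow to produce a few modified copies of an image; the filename is just a placeholder:
from PIL import Image, ImageFilter, ImageOps

img = Image.open('digit.png')
variants = [
    img.rotate(10),                           # small rotation
    ImageOps.mirror(img),                     # horizontal flip
    img.filter(ImageFilter.GaussianBlur(1)),  # mild blur
]
# each variant becomes an extra training example carrying the same label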

Interestingly, the final results for these two networks are not very different. Both of them seem to train up to around the maximum capability of the network architecture. Neither result, arising from this tweak alone, approaches the performance of the best known alternative approach that I have copied from.

I wondered whether shuffling the inputs was a truly necessary step, or just a convenient / beneficial one. If you don't care about letting the machine rip through more iterations, then is this step really relevant? It turns out the answer is a resounding "Yes", at least sometimes.

To address this I took the best-performing code (as discussed in the last post) and removed the shuffle step. The result was a network which trained far more slowly, and which did not reach the optimal solution, nor even approach it. The well-performing network achieved a valid loss of 0.5054, which looks pretty good compared to the random forest.


Here is the well-performing network with input shuffling removed. You can see here that the training starts off badly, and gets worse. Note that "valid loss" is the key characteristic to monitor; the plain "loss" improves. This shows the network is remembering prior examples well, but extrapolating badly.


After 19 rounds (the same number of iterations taken by the most successful network design), avoiding the shuffling step results in a network that just wanders further and further off base. We end up with a valid loss of 13+, which is just way off.

As an interesting aside, there is a halting problem here. How do we know, for sure, that after sufficient training this network isn't suddenly going to 'figure it out' and train up to the required standard? After all, we know for a fact that the network nodes are capable of storing the information of a well-performing predictive system. It's clearly not ridiculous to suggest that it's possible. How do we know that the network is permanently stuck? Obviously indications aren't good, and just as obviously we should just use the most promising approach. However, this heuristic is not actually knowledge.

Question for the audience -- does anyone know if a paper has already been published analysing the impacts of input randomisation across many examples (image and non-image) and many network architectures? What about alternative input handling techniques and synthetic input approaches?

Also, I haven't bothered uploading the code I used for these examples to the github site, since they are really quite simple and I think not of great community value. Let me know if you'd actually like to see them.

Wednesday, April 29, 2015

Neural Net Tuning: The Saga Begins

So, it has become abundantly clear that a totally naive approach to neural net building doesn't cut the mustard. A totally naive approach to random forests does cut a certain amount of mustard -- which is good to know!

First lesson: benchmark everything against the performance of a random forest approach.

Let's talk about levelling up, both in terms of your own skill, and in terms of NN performance. And when I say your skill, that's a thinly-veiled reference to my skill, put in the third person -- but maybe you'll benefit from watching me go through the pain of getting there :).

In this particular case, I discovered two things yesterday: a new favourite NN library, and a solution to the Otto challenge which performs better. I recommend them both!

http://keras.io/ is a machine learning library which is well-documented (!), utilises the GPU when available (!) and includes tutorials and worked examples (!). I'm very excited. It also matched my own programming and abstraction style very well, so I'll even call it Pythonic. (Note, that is a joke about subjective responses, the No True Scotsman fallacy and whether there is even such a thing as Pythonic style.) (Subnote -- I still think it's the best thing out there.)

https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13632/achieve-0-48-in-5-min-with-a-deep-net-feat-batchnorm-prelu is a Kaggle forum post on that author's NN implementation, based on the Keras library.

https://github.com/fchollet/keras/blob/master/examples/kaggle_otto_nn.py is the Python source code for their worked solution.

Their neural net is substantially different to the simple, naive three-layer solution I started with and talked about in the last blog post. I plan to look at the differences between our two networks, and comment on the impact and value of each change to the initial network design as I make it, including discussion of whether that change is generally applicable, or only offers specific advantages to the problem at hand. The overall goal is to come up with general best design practices, as well as an understanding of how much work is required to hand-tool an optimal network for a new problem.

My next post will be on the impacts of the first change: randomising the order of your input examples (I may add some other tweaks to the post). Spoiler alert! The result I got was faster network training, but not a significant improvement in the final performance. Down the track, I'll also experiment on the effects of removing this from a well-performing solution and discussing things from the other side. It may be that randomising order is useful, but not necessary, or it may be integral to the whole approach. We shall find out!


Tuesday, April 28, 2015

Neural Networks -- You Failed Me!

Kaggle have a competition open at the moment called the "Otto Group Product Classification Challenge". It's a good place to get started if for no other reason than the data files are quite small. I was finding that even on moderate sized data sets, I was struggling to make a lot of progress just because of the time taken to run a simple learning experiment.

I copied a script called "Benchmark" from the Kaggle site, ran it myself, and achieved a score of 0.57, which is low on the leaderboard (but not at the bottom). It took about 90 seconds. Great! Started!

I then tried to deploy a neural network to solve the problem (the benchmark script uses a random forest). The data consist of about 62 thousand examples, each example composed of 93 input features. The training goal is to divide these examples into 9 different classes. It's a basic categorisation problem.

I tried a simple network configuration -- 93 input nodes (by definition), 9 output nodes (by definition) and a hidden layer of 45 nodes. I gave the hidden layer a sigmoid activation function and the output layer a softmax activation function. I don't really know if that was the right thing to do -- I'm just going to confess that I picked them as much by hunch as by prior knowledge.
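
Purely for illustration (this is not the library or the exact code I used at the time), that architecture could be expressed in Keras syntax roughly like this:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(45, input_dim=93, activation='sigmoid'))  # hidden layer
model.add(Dense(9, activation='softmax'))                  # one output node per class
model.compile(loss='categorical_crossentropy', optimizer='sgd')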

It was a bit late at night at this stage, and I actually spent probably an hour just trying to make my technology stack work, despite the simple nature of the problem. The error messages coming back were all about the failure to broadcast array shapes, rather than something that immediately pointed out my silly mistake, so it took me a while. C'est la vie.

Eventually, I got it running. I got much, much worse results. What? Aren't neural networks the latest thing? Aren't they supposed to be better than everything else, and better yet don't they just work? Isn't the point of machine learning to remove the need to develop a careful solution to the problem? Apparently not...



I googled around, and found someone had put together a similar approach with just three nodes in the hidden layer. So I tried that.

I got similar results. Wait, what? That means that 42 of my hidden nodes were just dead weight! There must be three main abstractions, I guess, which explain a good chunk of the results.

I tried more nodes. More hidden layers. More training iterations. More iterations helped a bit, but not quite enough. I tried from 15 to 800 iterations. I found I needed at least 100 or so to start approaching diminishing returns, but I kept getting benefits all the way up to 800. But I never even came close to the basic random forest approach.

I have little doubt that the eventual winning result will use a neural network. The question is -- what else do I need to try? I will post follow-ups to this post as I wrangle with this data set. However, I really wanted to share this interesting intermediate result, because it speaks to the process of problem-solving. Most success comes only after repeated failure, if at all.


Monday, April 27, 2015

Interlude: Kaggle Scripting

I don't like to promote specific companies; in general I'm more interested in how people can be productive as independently and autonomously as possible.

However, Kaggle just did something pretty amazing.

They've launched a free online scripting environment for machine learning, with access to their competition data sets. That means that everything I've been writing about so far -- loading data, setting up your environment, using notebooks, can be easily sidestepped for instant results. The best thing -- it supports Python!

I haven't yet determined where the limitations are -- I've simply forked one of the sample scripts that I found, but it worked wonderfully. I found that the execution time of the script was similar to running it locally on my 2011 Macbook Air, but that's not bad for something totally free that runs on someone else's infrastructure. I was able to sidestep data downloading entirely, and "leap into the middle" by forking an existing script.

Here's a direct link to one:

http://www.kaggle.com/users/320654/cs822-am/otto-group-product-classification-challenge/benchmark

Note, I'm logged into Kaggle, so I'm not sure how this will go for people who aren't signed in.

My expectation is that for solutions which demand a lot of computational resources (or rely on the GPU for reasonable performance), or which use unusual libraries, a self-service approach to computing will still win out, but this is an amazing way to get started.

Thursday, April 9, 2015

STML: Loading and working with data

Loading Data Into Memory

I've had some Real Life Interruptions preventing me from taking the MNIST data further. Here's a generic post on working with data that I prepared earlier.

It may sound a little ridiculous, but sometimes even loading data into memory can be a challenge. I've come across a few challenges when loading data (and subsequently working with it).
  1. Picking an appropriate in-memory structure
  2. Dealing with very large data sets
  3. Dealing with a very large number of files
  4. Picking data structures that are appropriate for your software libraries
Some of these questions have obvious answers; some do not. Most of them will change depending on the technology stack you choose. Many of them will be different between Python 2.7 and Python 3.x. Let's take it step-by-step, in detail.

Update -- after beginning this post intending to address all four points, I found that I exceeded my length limit with even a brief statement on the various categories of input data with respect to choosing an appropriate in-memory data structure. I will deal with all these topics in detail later, when we actually focus on a specific machine learning problem. For now, this blog post is now just a brief statement on picking appropriate in-memory data structures. 

Picking an appropriate in-memory structure

Obviously, this will depend very much on your data set. Very common data formats are:
  1. Images, either grayscale or RGB, varying in size from icons to multi-megapixel photos
  2. Plain text (with many kinds of internal structure)
  3. Semantic text (xml, json, markdown, html...)
  4. Graphs (social graphs, networks, object-oriented databases)
  5. Relational databases
  6. Spatial data / geographically-aware data
Two additional complicating factors are dealing with time-series of data, and dealing with streaming data.

This isn't the place to exhaustively cover this topic. Simply covering how to load these files into memory could comfortably take a couple of posts per data type, and that's without going into cleaning data, scraping data or otherwise getting data into fit shape. Nonetheless, this is a topic often skipped over by machine learning tutorials, and it has been the cause of many a stubbed toe on my walk through a walkthrough.

Loading Images into Memory

Python has quite a few options when it comes to loading image data. When it comes to machine learning, the first step is typically to work with image intensity (e.g. grayscale) rather than utilising the colour values. Further, the data is typically treated as normalised values between zero and one. Image coordinates are typically based at the top left, array coordinates vary, graphs are bottom-left, screen coordinates vary, blah blah blah. Within a single data set, this probably doesn't matter too much, as the algorithms don't tend to care whether images are flipped upside down, so long as things are all internally consistent. Where it matters is when you bring together disparate data sources.

The super-super short version is:
from PIL import Image
import numpy

# open a random image small enough to fit in memory
img = Image.open('filename.jpg')
img = numpy.asarray(img, dtype='float32') / 256.
Note the use of float32, the division by 256, the decimal point on the divisor to force floating-point division. You'll need to install numpy and something called "Pillow". Pillow is the new PIL library and is compatible with any tutorials you're likely to find.

Loading Plain Text Into Memory

Plain text is usually pretty easy to read. There are some minor platform differences between Windows and the rest of the known universe, but by and large getting text into memory isn't a problem. The main issue is dealing with it after that point.

A simple
thestring = open('filename.txt').read()
will typically do this just fine for now. The challenge will come when you need to understand its contents. Breaking up the raw text is called tokenisation, and most language processing will use something called 'part-of-speech tagging' to start building up a representation of the underlying language concepts. A library called NLTK is the current best place to start with this, although Word2Vec is also a very interesting starting point for language processing. Don't mention unicode -- that's for later.
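
Here's roughly what those first steps look like with NLTK -- a sketch only; the download calls fetch the tokeniser and tagger models NLTK needs:
import nltk
nltk.download('punkt')                        # tokeniser models
nltk.download('averaged_perceptron_tagger')   # part-of-speech tagger model

tokens = nltk.word_tokenize(thestring)
tagged = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('cat', 'NN'), ...]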

Loading Semantic Text into Memory

Let's start with HTML data, because I have a gripe about this. There's a library out there called Beautiful Soup, which has the explicit goal of being able to process pretty much any HTML page you can throw at it. However, for some incomprehensible reason, I have found it is actually less tolerant than the lxml library, which is better able to handle nonconformant HTML input (when placed into compatibility mode). First up, I really don't work with HTML much. As a result, you're going to be mostly on your own when it comes to parsing the underlying document, but you'll be using one of these two libraries depending on your needs and your source data.
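
For what it's worth, here's the shape of the lxml approach (a sketch; raw_html stands in for whatever HTML string you've fetched):
from lxml import html

tree = html.fromstring(raw_html)    # fairly tolerant of messy markup
links = tree.xpath('//a/@href')     # e.g. pull out all the link targets
text = tree.text_content()          # or just the visible text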

Loading Graphs Into Memory

I also haven't worked all that much with graph structures. Figuring out how to load and work with this kind of data is a challenge. Firstly, there appear to be few commonly-accepted graph serialisation formats. Secondly, the term "graph" is massively overloaded, so you'll find a lot of search results for libraries that are actually pretty different. There are three broad categories of graph library: graphical tools for plotting charts (e.g. X vs Y plots a-la matplotlib and graphite); tools for plotting graph structures (e.g. dot, graphviz) and tools for actually working with graph structures in memory (like networkX). There is in fact some overlap between all these tools, plus there are sure to be more out there. It is also possible to represent graphs through Python dictionaries and custom data structures.
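
For the in-memory case, a tiny networkX example looks something like this (the node names are purely illustrative):
import networkx as nx

g = nx.Graph()
g.add_edge('alice', 'bob')
g.add_edge('bob', 'carol')
print(nx.shortest_path(g, 'alice', 'carol'))   # ['alice', 'bob', 'carol']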

Working with Relational Data

Relational database queries are absolutely fantastic tools. Don't underestimate the value of Ye Olde Database structure. They are also a wonderful way to work with data locally. The two magic libraries here are sqlite3 and pandas. Here's the short recipe for reading out of a local sqlite3 database:
import sqlite3
import pandas

conn = sqlite3.connect('../database/demo.db')
query = 'select * from TABLE'
df = pandas.read_sql(query, conn)
Connecting to a remote database simply involves using the appropriate library to create the connection object. Pandas is the best way to deal with the data once it's in memory, bar none. It is also the best way to read CSV files, and is readily usable with other scientific libraries thanks to the ability to retrieve data as numpy arrays. Many, many interesting machine learning problems can and should be approached with these tools.
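
For example (with hypothetical file and column names), going from a CSV to numpy arrays ready for scikit-learn is only a few lines:
df = pandas.read_csv('data.csv')
X = df[['feature_1', 'feature_2']].values   # plain numpy array of features
y = df['label'].values                      # numpy array of target labels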

More recently, I have found it more effective to use HDF5 as the local data storage format, even for relational data. Go figure.
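
In case that's useful to anyone, the pandas side of it is very simple (this needs the PyTables package installed; the file and table names here are placeholders):
df.to_hdf('local_store.h5', 'my_table')              # write the frame out
df2 = pandas.read_hdf('local_store.h5', 'my_table')  # read it back later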

Spatial Data / GIS Data

Abandon all hope, ye who enter here.

Spatial / GIS data is a minefield of complexity seemingly designed to entrap, confuse and beguile even the experts. I'm not going to even try to talk about it in a single paragraph. Working with spatial data is incredibly powerful, but it will require an enormous rant-length blog post all of its own to actually treat properly.

Dealing with Time Series

Machine learning problems frequently ignore the time dimension of our reality. This can be a very useful simplification, but feels unsatisfying. I haven't seen any good libraries which really nail this. Let's talk about some common approaches:

Markov Modelling. The time dimension is treated as a sequence of causally-linked events. Frequency counting is used to infer the probability of an outcome given a small number of antecedent conditions. The key concept here is the n-gram, which is an n-length sequence which is counted.
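
A bare-bones version of that frequency counting might look like this (a sketch with made-up event names):
from collections import Counter

def ngram_counts(events, n=2):
    # count every length-n window in the sequence of events
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

counts = ngram_counts(['sun', 'rain', 'rain', 'sun', 'rain'])
# counts[('rain', 'rain')] relative to all bigrams starting with 'rain'
# gives a crude estimate of P(rain | rain)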

Four-Dimensional Arrays. Time can be treated as a sequence of events at either an equal or an irregular interval. Arrays can be sparse or dense.

Date-Time Indexing. This is where, for example, a 2d matrix or SQL table has date and time references recorded with the entries.

Animation (including video). Video and sound formats both include time as a very significant concept. While most operating systems don't provide true realtime guarantees around operation, their components (sound cards and video cards) provide the ability to buffer data and then write it out in pretty-true realtime. These will in general be based on an implicit time interval between data values.

Dealing with Streaming Data

Algorithms which deal with streaming data are typically referred to as "online" algorithms. In development mode, your data will mostly be stored on a disk or somewhere you can re-process it easily and at will. However, not all data sources can be so readily handled. Examples where you might use an online algorithm include video processing and data firehose scenarios (satellite data, twitter, sound processing). Such data sources are often only of ephemeral value and not worth permanently storing; you may be working on a lower-capability compute platform; or you just might not have that much disk free.

You can think of these data streams as something where you store the output of your algorithm, but throw out the source data. An easy example might include feature detection on a network router. You might monitor the network stream for suspicious packets, but not store the full network data of every packet going across your network. Another might be small robots or drones which have limited on-board capacity for storage, but can run relevant algorithms on the data in realtime.
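
The defining trick is that the algorithm keeps a small summary and throws each value away after updating it. Here's a toy online statistic as a sketch of the pattern:
class RunningMean:
    # online mean: update from each value as it arrives, never store the stream
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

stat = RunningMean()
for value in (3.0, 5.0, 7.0):   # stand-in for a stream we can't afford to keep
    stat.update(value)
print(stat.mean)                # 5.0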

Next Steps

I am not sure exactly what to make of this blog post. The commentary is only slightly useful for people with a specific problem to solve, but perhaps it serves to explain just how complex an area machine learning can be. The only practical way forward that I can think of is to get into actual problem-solving on non-trivial problems.

I think the main thing to take away from this is: Don't Feel Stupid If You Find It Hard. There really really are a lot of factors which go into solving a machine learning problem, and you're going to have to contend with them all. These problems haven't all been solved, and they don't have universal solutions. Your libraries will do much of the heavy lifting, but then leave you without a safety net when you suddenly find yourself leaving the beaten path.

Wednesday, April 1, 2015

Quick Look: MNIST Data

Project Theme: STML-MNIST

I like to do a lot of different things at once. I like exploring things from both a theoretical perspective and a practical perspective in an interleaved way. I'm thinking that I will present the blog in similar fashion: multiple interleaved themes. I'll have practical projects which readers can choose to repeat themselves, and which will push me to exploring new territory firsthand. At the same time, I will share general thoughts and observations. If I can manage it, I will release one of each per week, and I will expect each project post should not take a reader more than a week to comfortably work through. By that I mean there should be somewhere between 30 minutes and 3 hours work required to repeat a project activity.

Readers, please let me know if this works for you.

Right, so carrying on... a couple of posts ago, prior to being derailed into a discussion on analytics, I ran through how to take a pretty robust algorithm called a Random Forest and apply it to a Kaggle competition data set. The training time was about 5 minutes, which is fast enough to walk through, but a bit slow for really rapid investigation. I'm going to turn now to another data set called the MNIST database. It contains images of handwritten digits between 0 and 9. It has been done to death "in the literature", but it's still pretty interesting if you haven't come across it before.

This dataset will let us apply the same algorithm we used before (Random Forest) and see how it goes on another dataset. Consider this a bit of a practice or warmup exercise for improving our ability to apply random forests and other techniques to images. I'm going to point you at another blog post for the actual walkthrough (why re-invent the wheel), but I'll still supply my working notebook for those who want a leg up or a shortcut. But be warned: there are no real shortcuts to learning! (Why do you think I'm writing this blog, for example?)

Theme: Image Categorisation

The MNIST database is an incredibly common "first dataset", but that doesn't mean it's not significant. The great thing about this database is that it is actually very meaningful. The task definition is to categorise handwritten digits correctly. The problem is essentially solved, and the technique for best solving it is a neural network. 

It is not, however, what is known as a 'toy problem'. Categorising digits is how automation of mail distribution based on postcodes works. Categorising handwriting in general is used for optical character recognition (OCR, digitisation of records) and security (captchas). Categorisation of images in general has very wide application and has many unsolved aspects.

Working on a problem with a "known good" solution allows us to focus on the software engineering and problem-solving aspects without the concern that we might just be wandering down a path that goes absolutely nowhere.

The scale of this task is also appropriate for the level of computing resources available to large numbers of people. 

The STML-MNIST series of posts will involve the application of multiple machine learning strategies to this particular task. This will allow us to cross-compare the various strategies in terms of performance, complexity and challenge. While I have worked through tutorials on the neural network solution before, I have not done a deep dive. This series of posts will be a deep dive.

A fantastic tutorial to work through, solving this very problem, is available at http://neuralnetworksanddeeplearning.com/chap1.html. The linked tutorial uses a neural network approach, rather than a random forest approach. You could go down that path if you liked, or follow along with my random forest approach.

Whether you choose to replicate the result using a neural net or a random forest, I would recommend completing that on your own before proceeding much further in this post. For the moment, just try to get your own code working, rather than understanding each step in depth. If you have replicated my setup, you can use my IPython Notebook as a short-cut to a working version. The notebook can be read online at http://nbviewer.ipython.org/github/tleeuwenburg/stml/blob/master/mnist/MNIST%20Random%20Forest.ipynb or accessed via the git repository https://github.com/tleeuwenburg/stml

... time passes ...

Okay, welcome back! I assume that it's been a couple of days since you diverted into the linked tutorial, and that you managed to successfully train a neural network to solve the problem to a high standard of quality. Great work! The neural net really solves the problem amazingly well, whereas my random forest approach reached about 90% success (still not bad, given there was no attempt to adapt the algorithm to the problem). I deliberately chose a set of RF parameters similar to those chosen for the Kaggle ocean dataset.
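
If you want the gist without opening the notebook, the scikit-learn version is roughly this (the parameter values and variable names here are placeholders; the real ones are in the notebook):
from sklearn.ensemble import RandomForestClassifier

# each image is flattened into a 1-D vector of pixel intensities
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(train_images, train_labels)
print(clf.score(test_images, test_labels))   # fraction of digits classified correctly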

The mnist dataset was faster to process -- it took around a minute to train on, rather than 5 minutes. This makes it slightly more amenable to a tighter experimentation loop, which is great. Later on we can get into the mind-set of batch-mode experiments, setting off dozens at once to run overnight or while we do something else for a while. 

What we've found so far shows two things: a Random Forest approach is mind-bogglingly flexible, and it's probably not going to get beaten as a general-purpose, proof-of-concept algorithm. However, it also appears that it only goes so far, and that the next level up is probably a neural net.

Each new problem will require its own investigative process to arrive at a good solution, and it is that investigative process that is of interest here.

I propose the following process for exploring this dataset further:
  • Comparisons of alternative neural network designs
  • Visualisations of intermediate results
  • Comparison against alternative machine learning methodologies
A list of techniques which could be applied here include:
  • Random forest
  • Decision tree optimisation
  • Comparison against reference images (error/energy minimisation)
Now, I would like each of my posts to be pretty easy to consume. For that reason, I'm not going to do all of that right now. Also, I don't think I'm a fast enough worker to get all the way through that process in just one week. Rather, following the referenced tutorial is the first week's activity for this project theme. I will undertake to apply one new technique to the dataset, on a roughly weekly schedule, and report back...

In the meantime, happy coding!

Cheers,
-Tennessee