Thursday, May 28, 2015

An environment to share...

I'm a terrible blogger. I just ground to a halt and got overwhelmed by real life. I ran out of good ideas. I hadn't finished things I was in the middle of and had no results. AAAAAGH. Here's something vague about repeatable data science environments for tutorials...

I am working on the background material to support tutorial sessions. If there's one hard thing about giving a tutorial, it's getting everyone on the same page without anyone being left behind. All of a sudden, you need to think about *everyone's* environment. Not just yours -- not even just 'most people's', but everyone's.

There are a few technologies for setting up environments, plus some entirely different approaches. My goal is to present people with multiple paths to success, without having to think of everything myself.

I'll be looking at:
  -- Virtualenv and Python packages
  -- Virtual machine images
  -- Vagrant automatic VM provisioning
  -- Alternative Python distributions
  -- Using web-based environments rather than your own installation

Why is this all so complicated? Well, without pointing any fingers, Python packaging alone won't get the job done for scientific packages. There's no getting around the fact that you will need to install some packages into the base operating system, and there is no good, well-supported path to make that easy -- particularly if you would prefer to do it without modifying the base system. Then, there's the layer of being able to help a room full of people all get the job done in about twenty minutes. Even with a room full of developers, it's going to be a challenge.

Let's take a tour of the options.

One -- Virtualenv and Python Packages

This option is the most 'pythonic' but also, by far, the least likely to get the job done. The reason is the dependency on complex scientific libraries, which the user then has to hand-install by following a list of instructions. It's doable, but I won't know up front what the differences between, say, yum and apt are going to be, let alone what the differences between operating system versions might be. Then there will be some users on OSX (hopefully using either macports or brew) and potentially some on Windows. In my experience, package names differ across those systems, and at times there may be critical gaps or versioning differences. Furthermore, the relevant packages may differ between Python 3 and 2.7. It is basically just too hard to use Python's inbuilt packaging mechanism to handle a whole room full of individual development platform differences.

Two -- Virtual Machine Images

This approach is fairly reliable, but feels kind of clumsy to me, and isn't necessarily very repeatable. Not everyone is going to have Virtualbox (or another major virtualiser) installed already, but most people will be able to run it on their systems. There may be a few who will need to install Virtualbox first, but from there it really should 'just work'.

VM files can be shared via USB keys or over the network. So long as you bring along a good number of keys it should be mostly okay. A good tip though -- bring along keys of a couple of different brands. I have had firsthand experience of specific brands of USB key and computer just not getting along.

The downside is that while this will work in a tutorial setting, virtual machines can be slow, and don't necessarily set up the attendees with the technology they should be using going forward. They may find themselves left short of being able to work effectively in their own environments later.

Three -- Vagrant Automatic VM Provisioning

The level up from supplying a base VM is using Vagrant (www.vagrantup.com). It allows you to specify the configuration of the base machine and its packages through a configuration file, so rather than having to share large virtual machine files, which are hard to version, the only thing you need to share with people is a simple file. That's something that can be easily versioned, and is lightweight to send around. The only downside is that each attendee will need to download the base VM image through the Vagrant system, which will hit the local network. Running a tutorial is an exercise in digital survivalism: it's best not to rely on any aspect of the supporting technology.

I have also had a lot of trouble trying to install Vagrant boxes. I'm not really sure what the source of the issues was, and I'm not really sure why it started working either. I just know I'm not going to trust it in a live tutorial environment. Crossing it off for now.

Four -- Alternative Python Distributions

This could be a really good option. The two main distributions that I'm aware of are Python(x,y) and Anaconda. Both seem solid, but Anaconda probably has more mindshare, particularly for scientific packages. For the purposes of machine learning and data science, that is going to be very relevant. Many people support using the Anaconda distribution by default, but that's not my first option.

I would recommend Anaconda in corporate environments, where it's useful to have a single, self-contained installation that can be installed onto enterprise Linux distributions and still provides all the relevant scientific libraries. I would recommend against it on your own machine, because despite its general excellence, I have found occasional issues when trying to install niche packages.

Anaconda would probably also be good in the data centre or in a cloud environment, where installing software can get a little wild. It's probably a good choice for system administrators as well. Firstly, installed packages won't interfere with the system Python. Secondly, it allows users to create user-space deployments with their own libraries, isolated from the central ones. This helps keep things robust. Standard Python will do the same with the 'virtualenv' package, so there are multiple ways to achieve this goal.

Five -- Using Web-Based Environments Rather Than Your Own Installation

This is really about side-stepping the issue rather than fixing it as such. It's not free from pitfalls, because there are still browser incompatibilities to consider -- although if your audience can't manage to have an up-to-date version of either Firefox or Chrome installed, things were always going to be tricky. Up-to-date versions of Internet Explorer are also likely to work, though I haven't tested that to any degree. You'll also need to understand the local networking environment to make sure you can run a server and that attendees are going to be able to access it. You could host an ad-hoc network on your own hardware, but I'm a bit nervous about that approach.

Perhaps, if I have some spare hardware, I'll expose something through a web server. Another alternative is to demonstrate (for example) the Kaggle scripting environment.

Conclusions

I think I have talked myself around to providing a virtual machine image via USB keys. I can build the environment on my own machine, verify exactly how it is set up, then provide something controlled to participants. 

In addition, I'll supply the list of packages that are in use, so that people can install them directly onto their own system if desired. This will be particularly relevant to those looking to exploit their GPUs to the maximum. 

Finally, I'll include a demo of the Kaggle scripting environment for those who don't really have an appropriate platform themselves.

I'd appreciate any comments from anyone who has run or attended tutorials who has any opinion about how best to get everyone up and running...

Thursday, May 7, 2015

On Reading and Understanding Academic Papers: Batch Normalisation

So, keras.io has a link in its documentation. In the section on the "Batch Normalisation" layer, there is a hyperlink to a PDF of an academic paper on the use and effectiveness of this approach. Tim Berners-Lee would be proud.

I followed that link. I am soooo ignorant when it comes to understanding maths properly. Don't get me wrong, it's not completely foreign to me -- I work with numerical data all the time, and have a printout of the unit circle and associated trig functions on my desk. I use it multiple times per year. I remember nothing of it between those times.

Reading a paper requires me to have a degree of easy familiarity with mathematical concepts which I just don't have. Let me quote from the introduction of the paper. I needed to take an image snapshot to deal with the mathematical notation (sorry Tim!)


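For readers without the PDF to hand, the expression in question is, as best I can transcribe it into LaTeX:

    \Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \Theta)
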
Now, as far as I can tell, this paper is a top piece of research and a wonderful presentation of a highly useful and relevant concept. Parts of this I can understand, parts I don't. SGD was already defined as stochastic gradient descent, by the way.

The first symbol is a theta. I know from prior experience, plus the description in the text, that it refers not to a single number but to some kind of collection of numbers: the parameters of the network. I'm not sure if it's a matrix, or several numbers, or exactly which parameters it means. Arg min, I think, means a magical function which returns "the value of theta that minimises the result of the following expression". I'm reading this like a programmer, see.

Okay, then the 1/N multiplied by the sum of a function between 1 and N. This is otherwise known as "the average". I think N refers to the number of training examples, and i is the index of the current example.

I have no clue what the 'l' function is. I'm going to guess from the text it means 'loss function'.

So, unpacked, this means "the network parameters which minimise the mean loss across the training data".

What's unclear to me is how the mathematical notation actually helps here. Surely the statement "stochastic gradient descent minimises the mean loss over the training data" is more instructive to both the mathematical and the casual reader than this function notation?
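
Reading it as a programmer, here's a rough sketch of what I think the expression computes. The names (loss_fn, training_data, theta) are placeholders of my own, not anything from the paper:

    import numpy as np

    def mean_training_loss(theta, training_data, loss_fn):
        # the 1/N sum over i of l(x_i, theta): the average loss across all training examples
        return np.mean([loss_fn(x_i, theta) for x_i in training_data])

    # SGD's job, on this reading, is to find the theta that makes
    # mean_training_loss as small as possible.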

Now, I can eventually unpack most parts of the paper, slowly, and one-by-one. Writing this post genuinely helped me grok what was going on better. I haven't actually gotten to the section on batch normalisation yet, of course. I'll read on ... casual readers can tune out now as the rest of this post is going to be an extended exposition of my confusion, rather than anything of general interest.

The next paragraph refers to the principle behind mini-batching. There is something slightly non-obvious being said here. They state that "The mini-batch is used to approximate the gradient of the loss function with respect to the parameters...". What they are saying is that the mini-batch, if it's a representative sample, approximates the whole training set. Calculating the loss of the mini-batch is an estimator of the loss over the whole training set. It's smaller and easier to work with than the entire training set, but more representative than looking at just one example at a time. The larger the mini-batch, the more representative it is of the whole, but at the same time the more unwieldy it becomes. Picking the optimal batch size isn't covered, although they do mention that its efficiency also relates to being able to parallelise the calculation of the loss for each example.
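
In code terms, my understanding of the estimator idea is something like the sketch below. Again, grad_fn and the data array X are invented placeholders, and I'm glossing over how the gradient itself is computed:

    import numpy as np

    def full_gradient(theta, X, grad_fn):
        # gradient of the mean loss over the whole training set -- exact but expensive
        return np.mean([grad_fn(x_i, theta) for x_i in X], axis=0)

    def minibatch_gradient(theta, X, grad_fn, batch_size=128):
        # gradient of the mean loss over a random sample -- a cheap estimate of the above
        idx = np.random.choice(len(X), size=batch_size, replace=False)
        return np.mean([grad_fn(x_i, theta) for x_i in X[idx]], axis=0)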

The reason I think it's mentioned is that the purpose of mini-batching is similar to the purpose of batch normalisation. They are saying that mini-batching improves generalisation, because the learning is related to the average loss across the mini-batch, rather than to learning each specific example. That is to say -- it makes the network less sensitive to spurious details.

As I understand it, batch normalisation also achieves that end, as well as reducing the sensitivity of the network to the tuning of its meta-parameters (the latter being the prime purpose of batch normalisation).

They make the point that in a deep network, the effect of tuning parameters is magnified. For example, if picking a learning rate of 0.1 has some effect on a single-layer network, it could have double that effect on a two-layer network. Potentially. I think this point is a little shaky myself, because having multiple layers could allow each layer to cancel out overshoots at previous layers. However, this might be a strong case for a more intricate layer design based on capturing effects at different scales. For example, having an architecture with a 'fine detail' and a 'coarse detail' layer might be better than two fine-scale layers. Another approach (for example on images) might be to train off smoothed data plus the detail data. Food for thought.

They then move on to what I think is the main game: reducing competition between the layers. As I interpret it, the learning step affects all layers at once. As a result, in a sense, layers towards the top of the network are chasing moving goalposts while the layers underneath them learn and respond differently to the input data. This is basically where my understanding starts to get patchy. They refer to the shifting nature of layer outputs as "internal covariate shift". I interpret this as meaning that higher layers need to re-adjust to the new outputs of lower layers. I think of it as being like input scaling, except applied at the layer level, and updated through mini-batch training.
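
To make that "input scaling at the layer level" picture concrete, here's a rough numpy sketch of what I understand the core normalisation step to be -- gamma and beta being learned scale and shift parameters, and the statistics coming from the current mini-batch:

    import numpy as np

    def batch_normalise(x, gamma, beta, eps=1e-5):
        # x: activations for one layer over a mini-batch, shape (batch_size, n_units)
        # normalise each unit using the mini-batch mean and variance,
        # then re-scale and shift with the learned parameters
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta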

They then point to their big result: a reduction in training time and improvements to accuracy. They took a well-performing network and matched its results in 7% of the iterations. That's a big reduction. They also state that they can improve final accuracy, but do not say by how much.

Now for the details. The paper now talks about their detailed methodology and approach. I'm literally just 1.5 pages into an 8-page document, and my mind is experiencing that pressure you get when you just don't think you can grok any more. I don't think I can reasonably burden my readers any more with my thoughts, nor can I process much more today.

I'm going to have to break up understanding this paper into more sessions, I think, coming back after internalising every page or two. It's probably worth my getting there, because the end of the paper does mention alternative means to the same ends and talks about the limits of the technique. Perhaps I will make further posts on the topic, perhaps I won't. We shall see.

If any readers here are more advanced in their understanding than I am, I would very much appreciate if you could point out anything I've gotten wrong!!!

Tuesday, May 5, 2015

Network Configuration Exploration

Let's take stock. We have a primitive model (A), and a best-performing model (B). We are in the process of breaking down the differences to understand how each difference between the two models contributes to the observed performance gains. The hope is to learn a standard best practice, and thereby start from a higher base in future. However, we are also hoping to learn whether there is anything we could do to take model (B) and extend its performance further.

We have dealt with the input-side differences -- shuffling and scaling. We now move on to the network's internals -- the network configuration. This means its layers, activation functions and connections. Let's call the 'upgraded' version Model A2.

Model (A) has a single hidden layer. Model (B) is far deeper, with three sets of three layers, plus the input layer, plus the output layer. Model A uses a non-linear activation function for both the hidden and the output layer. Model B uses a non-linear activation function for the output layer, but uses mostly linear processes to get its work done.
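
For the record, here is roughly what I mean by a single-hidden-layer model, written as a keras-style sketch. The layer sizes, activations and compile settings here are invented for illustration -- this is not the actual Model A:

    from keras.models import Sequential
    from keras.layers import Dense

    model_a_like = Sequential()
    # one hidden layer with a non-linear activation
    model_a_like.add(Dense(64, activation='tanh', input_dim=50))
    # non-linear (softmax) output layer
    model_a_like.add(Dense(10, activation='softmax'))
    model_a_like.compile(optimizer='sgd', loss='categorical_crossentropy')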

I got a bit less scientific at this point ... not knowing where to start, I took Model B and started fiddling with its internals, based on nothing more than curiosity and interest. I removed one of the layer-sets entirely, and replaced the activation function of another with 'softmax'. That network began its training more quickly, but finished with identical performance in due course.

So I removed that softmax layer set, leaving a simpler configuration with just the input layer, a rectified linear layer, and the final softmax activation layer. This was worse, but also interesting. For the first time, the training loss was worse than the validation loss. To my mind, this is the network 'memorising' rather than 'generalising'. The final loss was 0.53, which was better than Model A, better than a Random Forest, but much worse than Model B. This maybe gives us some more to go on. Let's call this new model B1.

There are still some key differences between Model A2 and Model B1. B1 actually uses simpler activations, but includes both Batch Normalisation and Dropout stages in its processing, which we haven't talked about before. Which of those differences are important to the improvement in performance?
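
To make "Batch Normalisation and Dropout stages" concrete, here's a keras-flavoured sketch of the kind of layer block I mean. Again, the sizes and rates are made up for illustration, not taken from Model B1:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, BatchNormalization

    model_b1_like = Sequential()
    # a 'layer set': dense layer, then batch normalisation, then dropout
    model_b1_like.add(Dense(512, activation='relu', input_dim=50))
    model_b1_like.add(BatchNormalization())
    model_b1_like.add(Dropout(0.5))
    # softmax output layer
    model_b1_like.add(Dense(10, activation='softmax'))
    model_b1_like.compile(optimizer='sgd', loss='categorical_crossentropy')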

This gives us the direction for the next steps. My feeling is that both batch normalisation and dropout are worth examining independently. The next posts will focus on what those things do, and what their impacts are when added to a more basic model or removed from a more sophisticated one.

Saturday, May 2, 2015

What is the impact of scaling your inputs?

Last post we looked at the "input shuffling" technique. This time we're looking at input scaling. Input scaling is just one of several input modification strategies that can be put into place. Unlike input shuffling, which makes intuitive sense to me, input scaling does not. Absolute values can in fact be important, and it feels like input scaling is actually removing information.
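
By 'input scaling' I mean something like the standardisation below. This is a sketch of one common form of scaling (zero mean, unit variance per column), not necessarily the exact transform used in these experiments:

    import numpy as np

    def scale_inputs(X_train, X_valid):
        # compute per-column statistics from the training data only,
        # then apply the same transform to both training and validation sets
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0) + 1e-8   # avoid division by zero
        return (X_train - mean) / std, (X_valid - mean) / std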

Let's take the same approach as last time: add it to a basic network, and remove it from a well-performing one. This time round we also have an extra question -- are the benefits of input scaling independent of the benefits from input shuffling?


The first thing I did was add input scaling to the network design as we had it at the end of the last post. I ran it for 20 iterations. This is a shuffled, scaled, three-layer architecture. The performance here is much, much better. After 20 iterations, we reach a valid loss of 0.59, and a train loss of 0.56. That's not as good as our best network, but it's a big step up. If I run it out to 150 iterations, I get a valid loss of 0.558. As a reminder, our current best network hits 0.5054 after 20 iterations.

Let's review:
      Design Zero achieved valid loss of about 0.88 -- much worse than Random Forest.
      Design Zero eventually cranked its way down to 0.6 or so after 800 cycles.
      Design Zero + shuffling hits 0.72 after 13 iterations
      Design Zero + shuffling + scaling hits 0.59 after 20 iterations (slightly worse than RF)
      Design Zero + shuffling + scaling wanders off base and degrades to 0.94 after 800 iterations

Interesting. Performance has improved greatly, but we have introduced an early-stopping problem, where the optimum result appears part-way through the training cycle. Early stopping is basically the inverse of the halting problem -- not knowing whether you should stop. For the time being, I don't want to go down that rabbit hole, so I'll just stop at 20 iterations and concentrate on comparing performance there.
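
If I did want to go down that rabbit hole, the simple version isn't complicated -- something like this patience-based check (a sketch only, not anything I've wired into the training loop yet):

    def should_stop(valid_losses, patience=10):
        # stop once the best validation loss is more than `patience` epochs old
        best_epoch = valid_losses.index(min(valid_losses))
        return (len(valid_losses) - 1) - best_epoch >= patience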

Our current design is performing at 0.59, vs 0.50 for the best network. The performance of Random Forest was 0.57, so we're getting close to reaching the benchmark level now using a pretty basic network design and the most basic input techniques.

Let's see what happens when we pull scaling out of our best design. After 20 iterations, without scaling, the performance is at 0.5058. That's a tiny bit worse than its best performance, but you'd be hard-pressed to call it significant. Whatever benefit is provided by scaling has, presumably, largely been accounted for in the network design, perhaps implicitly being reflected into the weights and biases of the first layer of the network. In fact, the performance at each epoch is virtually identical as well -- there is no change to early training results either.
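
One way to see why that's plausible (my own back-of-envelope reasoning, not something from these experiments): standardising an input is an affine transformation, and a first layer can absorb an affine transformation of its inputs exactly into its weights and biases, since

    w \cdot \frac{x - \mu}{\sigma} + b \;=\; \frac{w}{\sigma} \cdot x + \left( b - \frac{w \mu}{\sigma} \right)

So, in principle, the network can learn the rescaling itself; how quickly it does so during training is another matter.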

The big relief is that it didn't cost us anything. The verdict on scaling is that it can be very beneficial in some circumstances, and doesn't seem to count against us. I still have a nagging philosophical feeling that I'm 'hiding' something from the NN by pre-scaling, but until I can think of a better idea, it looks like scaling is another one of those "every time" techniques.