Thursday, June 25, 2015

Setup for PyCon AU Tutorial

This is an attempt to provide attendees of PyCon AU 2015 with a guide to getting set up ahead of the tutorial. Getting set up in advance will help greatly in getting the most out of the tutorial. It will let attendees focus on the slides and the problem examples rather than on struggling through an installation process.

What it's like installing software during a tutorial session
There will be USB keys on the day with the data sets and some of the software libraries included, in case the network breaks. However, things will go more smoothly for everyone if some of these hurdles can be cleared out of the way in advance.

The Software You Will Need

  1. Python 3.4, with Numpy, Scipy, Scikit-Learn, Pandas, Xray, pillow -- install via anaconda
  2. IPython Notebook, Matplotlib, Seaborn -- install via anaconda
  3. Theano, Keras -- install via pip
  4. Word2Vec (https://github.com/danielfrg/word2vec) -- avoid pip, install from source
  5. https://github.com/danieldiekmeier/memegenerator -- just drop in the notebook folder
  6. https://github.com/tweepy/tweepy -- install via pip
I recommend using Anaconda, as it ships with prebuilt binaries for the O/S dependencies on a variety of platforms. It's possible to get all of this working with pip and your O/S package manager instead. Windows should be fine, but OSX or Linux are likely to be easier. Because IPython Notebook is the primary environment, the choice of operating system is unlikely to be a major limiting factor in this case.

I have only had success installing word2vec by cloning the repository and installing locally. I went with the old-school 'python setup.py install'. For whatever reason, what's in PyPI doesn't work for me.

I've noted the easiest path for installing each package in the list above.
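To make that concrete, the whole setup might look roughly like the following. The environment name is just an example, and a package or two may need pip rather than conda depending on your platform and channel:

conda create -n pyconau python=3.4 numpy scipy scikit-learn pandas xray pillow ipython-notebook matplotlib seaborn
source activate pyconau
pip install theano keras tweepy

git clone https://github.com/danielfrg/word2vec.git
cd word2vec
python setup.py install

The memegenerator code isn't a package at all -- as noted in the list, just drop it into your notebook folder.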

The Data You Will Need 

  1. MNIST: https://github.com/tleeuwenburg/stml/blob/master/mnist/mnist.pkl.gz 
  2. Kaggle Otto competition data: https://www.kaggle.com/c/otto-group-product-classification-challenge
  3. "Text8": http://mattmahoney.net/dc/text8.zip
  4. For a stretch, try the larger data sets from http://mattmahoney.net/dc/textdata

An Overview of the Tutorial

The tutorial will include an introduction, a mini-installfest, and then three problem walkthroughs. There will be some general tips, plus time for discussion. 

Entree: Problem Walkthrough One: MNIST Digit Recognition

Compute Time: Around 3 to 5 minutes for a random forest approach

Digit recognition is most obviously used when decoding postcode numbers on envelopes. It's also relevant to general handwriting recognition, and also non-handwritten recognition such as OCR of scanned documents or license plate recognition.

Attendees will be able to run the supplied, worked solution on the spot. We'll step through the implementation stages to talk about how to apply similar solutions to other problems. If time is available, we will include alternative machine learning techniques and other data sets.

Data for this problem will be available on USB.
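To give a feel for what the worked solution looks like, here is a minimal random forest sketch -- not the tutorial solution itself -- assuming mnist.pkl.gz is the standard pickled tuple of training, validation and test splits:

import gzip
import pickle
from sklearn.ensemble import RandomForestClassifier

# Assumption: the file holds ((train_x, train_y), (valid_x, valid_y), (test_x, test_y)),
# pickled under Python 2, hence the latin1 encoding.
with gzip.open("mnist.pkl.gz", "rb") as f:
    (train_x, train_y), (valid_x, valid_y), (test_x, test_y) = pickle.load(f, encoding="latin1")

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # n_jobs=-1 uses all cores
clf.fit(train_x, train_y)
print("Validation accuracy:", clf.score(valid_x, valid_y))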

Main: Otto Shopping Category Challenge

Compute time: 1 minute for random forest
Compute time: 7 minutes for deep learning
Data for this problem can be downloaded only through the Kaggle site due to the terms of use.

This is a real-world, commercial problem. The Otto Group sell stuff, and for this problem that stuff is divided into nine classes. Each thing they sell has 93 features. The sample data set has around 200k individual products which have each been somehow scored against these 93 features. The problem definition is to go from 93 input numbers to a category id between 1 and 9.

{ 93 features } --> some kind of machine learning --> { number between 1 and 9 }
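As a sketch (again, not the tutorial solution), a random forest baseline on the Kaggle CSV might look like this, assuming the usual id / feat_1..feat_93 / target column layout:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumption: train.csv from the Kaggle download, with an 'id' column,
# 93 feature columns and a 'target' column of labels like "Class_1" .. "Class_9".
train = pd.read_csv("train.csv")
X = train.drop(["id", "target"], axis=1)
y = train["target"]

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)
print(clf.predict(X[:5]))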

Dessert: A Twitter Memebot in Word2Vec

Compute time: around 4 minutes for Word2Vec training, plus around 2 minutes for meme generation

This is something fun based on Word2Vec. We'll scrape twitter for some text to process, then use Word2Vec to look at some of the word relationships in the timelines.

Visualisation, Plotting and Results Analysis

No data science tutorial would be complete without data visualisation and plotting of results. Rather than have a separate problem for this, we will include them in each problem. We will also be considering how to determine whether your model is 'good', and how to convince both yourself and your customers / managers of that fact!
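As one example of the kind of results analysis we'll do, a confusion matrix plot takes only a few lines. The labels here are made up purely for illustration -- substitute the predictions from any model above:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, interpolation="nearest", cmap="Blues")
plt.colorbar()
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.show()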

Bring Your Own Data

If you have a data problem of your own, you can bring it along to the tutorial and work on that instead. As time allows, I'll endeavour to assist with any questions you might have about working with your own data. Alternatively, you can just come up to me during the conference and we can take a look! There's nothing more interesting than looking at data that inherently matters to you.

I hope to see you at the conference!!

 

Monday, June 15, 2015

A Twitter Memebot in Word2Vec

I wanted to explore some ideas with Word2Vec to see how it could potentially be applied in practice. I thought that I would take a run at a Twitter bot that would try to do something semantically interesting and create new content. New ideas, from code.


Here's an example. Word2Vec is all about finding some kind of underlying representation of the semantics of words, and allowing some kind of traversal of that semantic space in a reliable fashion. It's about other things too, but what gets me really excited is the fact that it's an approach which seems to actually mirror the way that we humans tend to form word relationships.

Let's just say I was partially successful. The meme I've chosen above is one of the better results from the work, but there were many somewhat-interesting outputs. I refrained from making the twitter bot autonomous, as it had an unfortunate tendency to lock in on the most controversial tweets in my timeline, then make some hilarious but often unfortunate inferences from them, then meme them. Yeah, I'm not flicking that particular switch, thanks very much!

The libraries in use for this tutorial can be found at:

  • https://github.com/danieldiekmeier/memegenerator
  • https://github.com/danielfrg/word2vec
  • https://github.com/tweepy/tweepy
I recommend tweepy over other Twitter API libraries, at least for Python 3.4, as it was the only one which worked for me on the first try. I didn't go back to the others for a second try, because I already had a working solution.

You'll need to go and get some Twitter API keys. I don't remember all the steps for this -- I just kind of did it on instinct. There's a Stack Overflow question on the topic that may help (http://stackoverflow.com/questions/1808855/getting-new-twitter-api-consumer-and-secret-keys), although it's not what I used. Good luck :)
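Once you have the keys, the tweepy authentication step looks roughly like this (the key strings are placeholders):

import tweepy

# Placeholder strings -- substitute the keys from your Twitter app settings.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)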

This particular Twitter bot will select a random tweet from your timeline, then comment on it in the form of a meme. The relevance of those tweets is a bit hit-and-miss to be honest. This could probably be solved by using topic-modelling rather than random selection to find the most relevant keywords from the tweet.
public_tweets = api.home_timeline() 
will fetch the most recent tweets from the current user's timeline. The code then chooses a random tweet, and focuses on words that are longer than 3 characters (a proxy for 'interesting' or 'relevant'). From this, we extract four words (if available). The goal is to produce a meme of the form "A is to B as C is to D". A, B and C are random words chosen from the tweet. D is a word found using word2vec. The fourth word is used to choose the image background by doing a Flickr search.
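As a rough sketch of that selection step (the variable names here are my own, not necessarily those in the final code):

import random

tweet = random.choice(public_tweets)                   # pick one tweet at random
words = [w for w in tweet.text.split() if len(w) > 3]  # keep the 'interesting' words
random.shuffle(words)
ws = words[:4]   # up to four words: three for the analogy, one for the image search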
indexes, metrics = model.analogy(pos=[ws[0], ws[1]], neg=[ws[2]], n=10)
ws.append(model.vocab[indexes][0])
The first line there is getting a list of candidate words for the end of our analogy. The second line is just picking the first one.
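The model object comes from the danielfrg word2vec library, trained ahead of time. A minimal sketch of that step, assuming the text8 corpus from the setup post (file names here are illustrative):

import word2vec

# Train a model on the text8 corpus (a few minutes) and write it to text8.bin.
word2vec.word2vec("text8", "text8.bin", size=100, verbose=True)

# Load the trained model for analogy queries like the one above.
model = word2vec.load("text8.bin")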

For example, a human might approach this as follows. Suppose the tweet is:

"I really love unit testing. It makes my job so much easier when doing deployments."

The word selection might result in "Testing, Easier, Deployments, Job". The goal would be to come up with a word for "Testing is to easier as Deployments is to X" (over an image of a job). I might come up with the word "automatic". Who knows -- it's kind of hard to relate all of those things. 

Here's an image showing another dubious set of relationships.


There's some sense in it -- it certainly seems that breaking things can be trivial, that flicking a switch is easy, and that questioning is a bit like testing. The background evokes both randomness and goal-seeking. However, getting any more precise than that is drawing a long bow, and a lot of those relationships came up pretty randomly.

I could imagine this approach being used to suggest memes to a human reviewer, in a supervised-system approach. It's not really ready for autonomous use, though, being inadequate in both semantic meaning and sensitivity to content. However, I do think it shows that it's pretty easy to pick up these technologies and put them to basic use.

There are a bunch of potential improvements I can think of which should result in better output. Focusing the word search towards the topic of the tweet is one. Selecting words for the analogy which are reasonably closely related would be another, and quite doable using the word2vec approach.

Understanding every step of this system requires a more involved explanation of what's going on, so I think the next few posts might be targeted at the intermediate steps and how they were arrived at, plus a walkthrough of each part of the code (which will be made available at that point).

Until next time, happy coding!


Tuesday, June 2, 2015

Getting deep learning going with Python 3.4 and Anaconda

I wanted to test out how hard (or easy) it would be to re-create prior results using two technologies I've been itching to try -- Python 3.4 and Anaconda. Python 3.4 is, obviously, where things are headed. Until now, I had never succeeded in getting all the relevant packages I'd like to use installed for it.

Anaconda is an alternative Python distribution produced by Continuum Analytics. They provide various commercial products, but that's okay. They make something free and super-useful for developers, and their commercial products solve enterprise-relevant problems.

The 'big sell' of Anaconda as opposed to using the standard distribution is the ease of installation of scientific packages on a variety of platforms. Spending a day trudging through installing the relevant base OS packages and getting the Python libraries to actually use them is pretty dull work.

I set out to install Keras and IPython Notebook. That is pretty much the end-game, so if that works, there's a valid path. Short answer: it worked out well, with only a few stumbles.

There are two operating system packages to install: Anaconda itself, obviously, and OpenBLAS (or, I think, any other BLAS implementation). There were still some imperfections, but everything went far, far better than the same process with the standard Python approach.

Achieving success depended, somewhat strangely, on the order of installation of the packages. My end game was to have the Keras library up and running. That's not in the Anaconda world, so you need to use pip to get the job done. A simple 'pip install keras' didn't work for me -- there were various complaints; I think it said there was no cython. Let's Redo From Start:


Take One
conda create -p ./new numpy
source activate ./new 
cd
python setup.py install (yes I know I should use pip but :P )
... much compiling ...
warning: no previously-included files matching '*.pyo' found anywhere in distribution
Could not locate executable gfortran
Could not locate executable f95
Could not locate executable f90
Could not locate executable f77
Could not locate executable xlf90
Could not locate executable xlf
Could not locate executable ifort
Could not locate executable ifc
Could not locate executable g77
Could not locate executable g95
Could not locate executable pgfortran
don't know how to compile Fortran code on platform 'posix'
error: Setup script exited with error: library dfftpack has Fortran sources but no Fortran compiler found
 

Take Two

pip install theano
... much compiling ...
"object of type 'type' has no len()" in evaluating 'len(list)' (available names: [])
    error: library dfftpack has Fortran sources but no Fortran compiler found 
 
Take Three 

conda create -p ./new scipy    <-- finally I realised scipy was the stumbling block
source activate ./new 
cd
python setup.py install 
... much compiling ... 
SUCCESS!!! W00t!
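At that point, a quick smoke test (my own addition, not part of the steps above) is just importing everything and seeing that it loads:

import numpy
import scipy
import theano
import keras
print("all good")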

The same route on pure Python, last time I tried it, was much more involved. When I tried it, I found that scipy didn't necessarily install with the relevant Fortran support, which a lot of science packages depend on. Getting the base libraries up and running, and finding out what they even were, was a real mission.

Now, I'm not 100% sure anyone is to blame here. There will be reasons for each of the various packaging decisions along the way, and also I haven't necessarily taken the time to understand my own environment properly. I'm just doing what every practically-minded person does: try to just do an install of the thing and see what pops.

Fewer things pop with Anaconda. I now have a functional Python 3.4 environment with all the latest and greatest machine learning tech that Python has to offer, and that is awesome.

Also, I haven't included the bit where I discovered I had to install OpenBLAS through macports rather than through Anaconda. I've saved the reader from that.

Happy hacking!