
### The Software You Will Need

- Python 3.4, with NumPy, SciPy, scikit-learn, pandas, xray, Pillow -- install via Anaconda
- IPython Notebook, Matplotlib, Seaborn -- install via Anaconda
- Theano, Keras -- install via pip
- Word2Vec (https://github.com/danielfrg/word2vec) -- avoid pip, install from source
- https://github.com/danieldiekmeier/memegenerator -- just drop in the notebook folder
- https://github.com/tweepy/tweepy -- install via pip

I have **only** had success installing word2vec by cloning the repository and installing locally. I went with the old-school 'python setup.py install'. For whatever reason, what's in PyPI doesn't work for me.

I've noted the easiest path for installing each package in the list above.
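Once everything is installed, a quick import check can save time during the installfest. The following is a minimal sketch rather than an official checker: the function name `check_imports` is made up for illustration, and the import names in the list are my assumptions (several differ from the package names above, for example scikit-learn imports as `sklearn` and Pillow as `PIL`).

```python
import importlib.util

# Assumed import names for the packages listed above. These are not
# always the same as the install names (scikit-learn -> sklearn,
# Pillow -> PIL), so adjust the list if your environment differs.
REQUIRED_MODULES = [
    "numpy", "scipy", "sklearn", "pandas", "xray", "PIL",
    "IPython", "matplotlib", "seaborn", "theano", "keras",
    "word2vec", "tweepy",
]


def check_imports(module_names):
    """Return a dict mapping each top-level module name to True/False.

    A module counts as available if Python can locate it, without
    actually importing it (so slow imports are not triggered).
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, found in check_imports(REQUIRED_MODULES).items():
        print("{:<12} {}".format(name, "ok" if found else "MISSING"))
```

Anything reported as missing can then be installed using the path noted in the list above.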

### The Data You Will Need

- MNIST: https://github.com/tleeuwenburg/stml/blob/master/mnist/mnist.pkl.gz
- Kaggle Otto competition data: https://www.kaggle.com/c/otto-group-product-classification-challenge
- "Text8": http://mattmahoney.net/dc/text8.zip
- For a stretch, try the larger data sets from http://mattmahoney.net/dc/textdata
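The MNIST file is a gzip-compressed pickle, so it can be read with the standard library alone. Here is a small sketch of a loader; the helper name `load_gzipped_pickle` is hypothetical, and the `latin1` encoding is an assumption that is typically needed when a pickle written under Python 2 is read under Python 3.

```python
import gzip
import pickle


def load_gzipped_pickle(path, encoding="latin1"):
    """Open a gzip-compressed pickle file and return the unpickled object.

    The encoding argument only matters for pickles written by Python 2;
    "latin1" is a common choice for those, but it is an assumption here.
    """
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)


# Example usage, assuming the file has been downloaded next to the notebook.
# The structure of the returned object is not described in this post, so
# inspect it before unpacking:
#
#   data = load_gzipped_pickle("mnist.pkl.gz")
#   print(type(data))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.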

### An Overview of the Tutorial

The tutorial will include an introduction, a mini-installfest, and then three problem walkthroughs. There will be some general tips, plus time for discussion.

#### Entree: Problem Walkthrough One: MNIST Digit Recognition

Compute Time: Around 3 to 5 minutes for a random forest approach

Digit recognition is most obviously used when decoding postcode numbers on envelopes. It's also relevant to general handwriting recognition, and to non-handwritten recognition such as OCR of scanned documents or license plate recognition.

Attendees will be able to run the supplied, worked solution on the spot. We'll step through the implementation stages to talk about how to apply similar solutions to other problems. If time is available, we will include alternative machine learning techniques and other data sets.
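To give a feel for the random forest approach before the session, here is a rough sketch. It is not the supplied worked solution: it swaps in scikit-learn's small bundled 8x8 digits data set as a stand-in for MNIST so that it runs without any download, and it assumes a reasonably recent scikit-learn (one that provides `sklearn.model_selection`). The function name and parameter values are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_digit_forest(n_estimators=100, seed=0):
    """Train a random forest on the bundled digits data.

    Returns the fitted model and its accuracy on a held-out 25% split.
    The bundled data set is a stand-in for MNIST, not MNIST itself.
    """
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    model, accuracy = train_digit_forest()
    print("Held-out accuracy: {:.3f}".format(accuracy))
```

The same fit-then-score pattern should carry over to the full MNIST arrays, which are larger and will take correspondingly longer to train on.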

Data for this problem will be available on USB.

#### Main: Otto Shopping Category Challenge

Compute time: 1 minute for random forest

Compute time: 7 minutes for deep learning

Data for this problem can be downloaded only through the Kaggle site due to the terms of use.

This is a real-world, commercial problem. The "Otto Group" sell stuff, and they put that stuff into nine classes for this problem. Each thing they sell has 93 features. The sample data set has 200k individual products, each of which has somehow been scored against these 93 features. The problem is to go from 93 input numbers to a category id between 1 and 9.

{ 93 features } --> some kind of machine learning --> { number between 1 and 9 }
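The shape of that pipeline can be sketched in a few lines. Because the real data can only be downloaded from Kaggle, the example below uses randomly generated stand-in data with the same shape (93 numeric features, labels 1 to 9). The data generator, function names, and parameter values are all assumptions for illustration, and since the labels are random the predictions carry no real signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

N_FEATURES = 93  # each product has 93 features
N_CLASSES = 9    # category ids run from 1 to 9


def make_fake_otto(n_rows=500, seed=0):
    """Generate random stand-in data shaped like the problem.

    The values are arbitrary non-negative integers; they are NOT the
    real Otto data and the labels are unrelated to the features.
    """
    rng = np.random.RandomState(seed)
    X = rng.poisson(lam=1.0, size=(n_rows, N_FEATURES))
    y = rng.randint(1, N_CLASSES + 1, size=n_rows)
    return X, y


def fit_and_predict(X_train, y_train, X_new, seed=0):
    """Fit a random forest and return (predicted ids, class probabilities)."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_new), model.predict_proba(X_new)


if __name__ == "__main__":
    X, y = make_fake_otto()
    labels, probabilities = fit_and_predict(X[:400], y[:400], X[400:])
    print(labels[:10])
    print(probabilities.shape)
```

With the real data, the stand-in generator would be replaced by code that reads the Kaggle files into a feature array and a label array of the same shapes.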

#### Dessert: A Twitter Memebot in Word2Vec

Compute Time: 4 minutes of Word2Vec training plus 2 minutes of meme generation

This is something fun based on Word2Vec. We'll scrape Twitter for some text to process, then use Word2Vec to look at some of the word relationships in the timelines.
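Word relationships in a vector model are commonly measured with cosine similarity between word vectors. The sketch below illustrates that idea with NumPy on a tiny hand-made vocabulary. The vectors, the helper names, and the choice of cosine similarity are my assumptions for illustration; this is not the word2vec package's API, and real vectors come from training on text.

```python
import numpy as np

# Hand-made toy vectors purely for illustration. Real Word2Vec vectors
# are learned from text and usually have a hundred or more dimensions.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(word, vectors, top_n=3):
    """Rank the other words in `vectors` by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for other, score in most_similar("cat", TOY_VECTORS):
        print("{:<5} {:.3f}".format(other, score))
```

In this toy vocabulary "dog" ranks above "car" for the query "cat", because its vector points in nearly the same direction.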

#### Visualisation, Plotting and Results Analysis

No data science tutorial would be complete without data visualisation and plotting of results. Rather than have a separate problem for this, we will include them in each problem. We will also be considering how to determine whether your model is 'good', and how to convince both yourself and your customers / managers of that fact!

#### Bring Your Own Data

If you have a data problem of your own, you can bring it along to the tutorial and work on that instead. As time allows, I'll endeavour to assist with any questions you might have about working with your own data. Alternatively, you can just come up to me during the conference and we can take a look! There's nothing more interesting than looking at data that inherently matters to you.

I hope to see you at the conference!