Friday, July 17, 2015

Shareable Datasets -- A Functional Design

So I was inspired by reading a blog post on "Truly Open Data" by Bill Mills:

We engaged in a bit of back-and-forth, and he encouraged me to set my ideas out a bit more clearly. Suitably motivated, I created an imaginary tool called "odit" -- the Open Data Integration Tool. I mocked out a functional design and wrote a user guide outlining the intended functionality.

Here's the result:

I'd appreciate comments and feedback. Is this something you'd consider valuable enough to use? Is it worth building?

Here's a teaser of what you'll find over at readthedocs....

Command Summary

odit fetch -- Create a new project
odit share -- Share the dataset online
odit append -- Append to a local dataset
odit update -- Revise the content of a dataset
odit set-licence -- Specify the licence for a dataset

Wednesday, July 8, 2015

Can you make a BitTorrent 'channel' for just some files?

I have a problem I'd like to solve with BitTorrent -- I think. BT is great for two things: moving large files around quickly, and distributing storage capacity. Those are two things which data scientists badly, badly need. The only real alternative is for large data storage to be bankrolled by a large company acting in the public good. That happens sometimes, but there's something satisfying about the concept of a truly public infrastructure.

The downside, for me, with BT is twofold. One is the 'negative branding' -- the association (merited or otherwise) with various kinds of dubious content, including piracy. The second is the difficulty of providing a high-quality set of data which isn't swamped by irrelevant content. You don't want to get a whole pile of television shows when you're actually trying to get some engineering-quality data.

There's a third niggle, which is how to handle realtime or streaming data.

So here's my question for the internet: how do I create a channel for scientific data using BitTorrent? I'm happy to get my hands dirty -- I'm a software dev after all.
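One way to think about a 'channel' is that a torrent's infohash already identifies exactly one dataset, so a channel could be nothing more than a curated, published list of infohashes. As a minimal sketch of how that identifier is derived -- using a toy, incomplete 'info' dictionary, not a real torrent -- here is pure-stdlib Python:

```python
# Sketch: how a BitTorrent infohash pins down exactly one dataset.
# A curated 'channel' could then simply be a published list of infohashes.
# The info dict below is a placeholder; real torrents carry more fields.
import hashlib

def bencode(obj):
    """Minimal bencoding for ints, bytes, lists and dicts."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        items = sorted(obj.items())  # the spec requires keys in sorted order
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

# A toy single-file 'info' dictionary for a dataset
info = {
    b"name": b"sensor-readings.csv",
    b"length": 1048576,
    b"piece length": 262144,
    b"pieces": b"\x00" * 20,  # placeholder for the per-piece SHA-1 hashes
}

# The infohash is the SHA-1 of the bencoded info dict
infohash = hashlib.sha1(bencode(info)).hexdigest()
print(infohash)
```

Since the hash covers the dataset's content, two channels listing the same infohash are pointing at -- and seeding -- exactly the same data.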

Sunday, July 5, 2015

Quick example: A heat map of pedestrian counts

It might not look like it, but I have been super-busy lately working on data science and machine learning tech. I've been going on a bit of a vision quest trying to wrap my head around the whole thing. You know what -- I'm pretty lost. I've learned a lot of things, but I can also see how much deeper the rabbit-hole goes.

While that will bear fruit in time, I've decided to add a series of 'shorts' to the blog. Things which I can genuinely do more easily, and never mind if that risks being too simple to be of wider interest. The point here isn't to blaze a trail, but rather to keep up my exercise.

The plot above was generated by this code (link goes to a notebook).

The City of Melbourne provides quite fine-grained pedestrian count information for major locations in my home town. I really applaud this effort. I'm very excited about anything which reflects the physical world into the digital. This data updates in near-real-time as well, which is just wonderful.

Down the road I hope to use this to do some interesting prediction software, but for now I just want to explore the data. I'm also learning how to plot things.

Python has a number of libraries for this. My favourite in terms of API design is without doubt Seaborn, but it's slowwww. For speed, I recommend Bokeh, though I find it much clumsier to use. I'm also not a fan of its interactive JavaScript tools: it's too easy to accidentally scroll away from the data entirely or otherwise misnavigate the chart. Please share your views on plotting -- I'd really like to build up a better sense of the range of opinions on these tools.
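For anyone wanting to try the heat-map approach themselves, here's a minimal sketch in Seaborn. The column names and counts are made up for illustration -- they are not the City of Melbourne schema -- but the pivot-then-heatmap pattern is the same one you'd use on the real data:

```python
# Minimal sketch: an hour-by-location heat map of pedestrian counts.
# Column names and values are invented for illustration only.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import seaborn as sns

# Toy stand-in for the pedestrian count records
records = pd.DataFrame({
    "sensor": ["Town Hall", "Town Hall", "Flinders St", "Flinders St"],
    "hour":   [8, 9, 8, 9],
    "count":  [1200, 2500, 3100, 4200],
})

# Pivot the long-form records into a sensor x hour matrix,
# which is the shape sns.heatmap expects
matrix = records.pivot(index="sensor", columns="hour", values="count")

# One call draws the heat map; annot writes the counts into each cell
ax = sns.heatmap(matrix, annot=True, fmt="d", cmap="viridis")
ax.set_title("Pedestrian counts by sensor and hour")
```

The pivot step is where most of the work happens: once the data is a plain matrix, swapping Seaborn for another plotting library is a one-line change.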