Category Archives: computers

Booting out the Warmest 100

(Beware – this article includes a link to some probable spoilers for tomorrow’s Hottest 100 count. You can read this article without reading those spoilers.)
 

You’re probably familiar with Triple J’s Hottest 100. It’s the world’s largest write-in music poll. Last year, Triple J gave voters an easy, shareable link for posting their votes to Twitter and Facebook. Alas, these links were easily scraped from the web, and the Warmest 100 (link to 2012 count) was born. It revealed the top 10 (though not its order), and guessed the top three perfectly.

This year, voters weren’t given a shareable link, but a few thousand people took photos of their confirmation e-mails and posted them to Instagram. With a tiny bit of OCR work, the Warmest 100 guys posted their predictions for this year. They found about half the number of votes that they did last year through the scraping method, which is no mean feat, given the lack of indexing.

So the question is — how useful are these votes in predicting the Hottest 100? What songs can we be sure will be in the Hottest 100? How certain is that top 10?

Both years, Justin Warren independently replicated their countdown (spoilers), and has written up his methodology for collecting this year’s votes. I asked him for his data so I could do some analysis; happily, he obliged.

He’s since updated his method, and his counts, and written those up, too (spoilers).

Update: he’s updated his method *again* based on some feedback I offered, and has also written that up (spoilers). This is the data my final visualisation runs off.

So, what have I done with the data?

Bootstrap Analysis

When you have a sample — a small number of votes — from the entire count, you can’t really be certain where each song will actually appear in the count.

In this case, Justin’s data comprises about 17,000 votes, out of an estimated 1.5 million votes in the whole count: a little over 1% of the total estimated vote. It’s a sample, but we have no idea how well it reflects the actual count.

If we think that the votes we have are a representative sample of all of the votes, then what we’d like to know is what would happen if we scaled this sample up to the entire count. Where would songs end up if there’s a slight inaccuracy in our sample?

The good news is that computers give us a way to figure that out!

Bootstrap analysis (due to Efron) is a statistical technique that takes a sample of votes from the whole set of votes, and randomly generates a new set of votes with about as many votes as the original sample. The trick is that you weight each song by the number of votes it received in the sample, so songs are picked in roughly the same proportion as they appear in the sample. The randomness of this weighted sampling is what adds the noise.

You can think of this sample as a “noisy” version of the original sample. That is, it will be a version of the original sample, but with slight variation.
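Concretely, a single weighted resample only takes a few lines. Here’s a toy sketch (the song names and counts are made up, and this isn’t the exact code Justin or I ran):

```python
import numpy as np

# Hypothetical sample: votes per song, as collected from the confirmation e-mails.
sample_votes = {"Song A": 300, "Song B": 280, "Song C": 120}

songs = list(sample_votes)
counts = np.array([sample_votes[s] for s in songs], dtype=float)
weights = counts / counts.sum()      # each song's share of the sample
n_votes = int(counts.sum())          # draw about as many votes as we collected

# One bootstrap resample: random votes, weighted by the original sample.
resample = np.random.choice(songs, size=n_votes, p=weights)
```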

If you repeat this sampling process several thousand times, and rank the songs each time, you can get a feel for where each song could appear in the rankings.

How do you do that? Well, you can look at all of the rankings a given song gets across the randomised sets. Sort this list, and pick the middle 98% of them. Based on that middle 98% of rankings, you can be 98% certain that the song will land at one of those positions. In statistics, this middle 98% is called a 98% bootstrap confidence interval.
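Putting the whole loop together (resample, rank, record, then read off the middle 98% of each song’s ranks) is also short. Again, this is a simplified sketch on toy data, with np.percentile standing in for picking the middle 98% by hand:

```python
from collections import Counter
import numpy as np

# Toy sample of votes per song, as in the sketch above.
sample_votes = {"Song A": 300, "Song B": 280, "Song C": 120}
songs = list(sample_votes)
counts = np.array([sample_votes[s] for s in songs], dtype=float)
weights = counts / counts.sum()
n_votes = int(counts.sum())

n_trials = 10_000
ranks = {song: [] for song in songs}   # every rank each song receives across the trials

for _ in range(n_trials):
    resample = np.random.choice(songs, size=n_votes, p=weights)
    tally = Counter(resample)
    ordered = sorted(songs, key=lambda s: tally[s], reverse=True)  # rank 1 = most votes
    for position, song in enumerate(ordered, start=1):
        ranks[song].append(position)

# The middle 98% of a song's ranks is its 98% bootstrap confidence interval.
for song in songs:
    low, high = np.percentile(ranks[song], [1, 99])
    print(f"{song}: 98% CI covers ranks {int(low)} to {int(high)}")
```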

You can repeat this for different confidence levels by picking a different proportion of the rankings around the middle.

I’ve used Google Spreadsheets to visualise these confidence intervals. The lightest blues are the 99% confidence intervals; the darkest blue intervals are the 70% confidence intervals. The darkest single cell is the median — i.e. the middle of all of the rankings that we collected for that song in the bootstrap process.

The visualisation is up on Google Docs. (spoilers, etc).

I’ve also run the same visualisation on Justin’s 2012 data; it’s less of a spoiler than the 2013 version, if you care about that, and it can inform the rest of the article for you.

Notes

First up, a bit on my methodology: Justin’s data didn’t separate votes into their original ballots, so I had to pick songs individually. To improve accuracy, I selected songs in blocks of 10, with every song in a block different from the others — this vaguely resembles the actual voting process.
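Roughly, that block sampling looks like the following sketch: draw ballots of ten distinct songs, weighted as before, until we have about as many votes as the original sample. It’s an illustration of the idea rather than the exact code I ran; songs and weights here stand for the full weighted song list built from the collected votes.

```python
import numpy as np

def sample_ballot(songs, weights, block_size=10):
    """Draw one 'ballot': block_size distinct songs, weighted by the sample counts."""
    return list(np.random.choice(songs, size=block_size, replace=False, p=weights))

def bootstrap_resample(songs, weights, n_votes, block_size=10):
    """Draw ballots until we have roughly as many votes as the original sample."""
    votes = []
    while len(votes) < n_votes:
        votes.extend(sample_ballot(songs, weights, block_size))
    return votes
```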

In my experiments, I ran the sampling and ranking process 10,000 times.

You’ll notice some interesting trends in this visualisation. The first one is that the higher the song is in the countdown, the narrower its blue interval is. Why is this so?

Well, as songs get more popular, the gap in votes between neighbouring songs grows. In Justin’s sample of the votes, #100 and #73 were separated by just 15 votes, so if one or two votes changed between #73 and #100, that ordering could change spectacularly. Given Justin’s sample is 17,000 votes, 15 votes represents a 0.1% change in the vote.

So at those low rankings, a tiny change in votes can make for a massive difference in ranking.

At the other end of the count, #1 and #2 are separated by 16 votes, #3 and #4 by 22 votes, and #4 and #5 by 51 votes. Down the bottom of the list, 16 votes could move a song 33 places in our count; up here, those same 16 votes would only be enough to swap positions 1 and 2.

What this means from a statistical perspective is that the closer to the top you are, the more work you need to do to change your position in the count.

You’ll also see this phenomenon in the right-hand side of the intervals: for a given colour, the part of the interval to the right of the median will generally be longer than the part to the left. Once again, this is because lower ranks swap around more easily than higher ranks.

Update: Since writing this article, I ran one more test – how many of the songs in the top 100 of Justin’s raw sample of votes will make it into the actual Hottest 100? Well, bootstrapping helps us here too. For each bootstrap trial, I take the top 100 songs, and see how many of those are in the raw top 100. I reckon, with 98% confidence, that 91 of the songs in the raw top 100 will make it into the actual Hottest 100. Thanks to David Quach for the suggestion.
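The check itself is simple: for each trial, count how many of that trial’s top 100 also appear in the raw top 100. A sketch of it, assuming raw_tally is a Counter of the votes actually collected and trial_tallies is a list of the per-trial vote Counters saved from the bootstrap loop (both names are just placeholders):

```python
def top_n(tally, n=100):
    """The set of the n most-voted songs in a tally (a collections.Counter)."""
    return {song for song, _ in tally.most_common(n)}

def overlap_counts(raw_tally, trial_tallies, n=100):
    """For each bootstrap trial, how many of its top n songs are also in the raw top n."""
    raw_top = top_n(raw_tally, n)
    return [len(top_n(tally, n) & raw_top) for tally in trial_tallies]

# The middle 98% of overlap_counts(raw_tally, trial_tallies) gives the confidence
# interval on the overlap, e.g. via np.percentile(..., [1, 99]).
```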

In summary: the Warmest 100 approach is statistically a very good indicator of the top 4 songs. The top 4 is almost certainly correct (except that 1&2 and 3&4 might swap around between themselves). Everything up to #7 will probably be in the top 10.

The sampling approach is less accurate at the bottom, but I’m pretty confident everything in the top 70 will be in the actual top 100.

I’m also pretty confident that 91 of the songs in the raw top 100 will appear in the actual top 100.

End

I’ll be making some notes on how these confidence intervals are borne out by the actual count on Monday. I’m very interested to see how this analysis gives us a better idea of how accurate the Warmest 100 actually is.

Talk: Making Mobile Web Services That Don’t Suck

The second of my DroidCon India talks introduces developers of mobile apps to the difficulties of designing for mobile networks. It also contains a series of design ideas that developers can take back to their back-end development teams, so that the APIs they produce for accessing their services are less difficult to use in a mobile context.

Announcing the LCA2014 Open Programming Miniconf

Very pleased to say that I’ll, once again, be running an Open Programming Miniconf at Linux.conf.au in January. This time around, the conference will be at the University of Western Australia in Perth.

I’m especially pleased because, after initially being rejected by the conference team and with limited time to assemble a line-up, I’ve put together what I think is the best Programming Miniconf line-up in the five years I’ve been running it.

One of the goals of the Open Programming Miniconf is to be a forum for developers to share their craft: ideas for improving the way people code, and topics that are of benefit to people who develop using many open source programming languages. This year, for the first time, I think we’ve filled that remit.

This year’s talks cover everything from low-level mobile programming and driver development, to deployment of web applications, as well as talks about packaging, deployment, and development tools.

We also don’t have a single state-of-the-language talk. Everything’s about topics that can be transferred to any number of languages.

I’m excited! If you’re interested in the miniconf, check out our schedule and all of our abstracts at the conference wiki. See you in Perth!

Announcing the LCA2013 Open Programming Miniconf!

TL;DR — submit a proposal at http://tinyurl.com/opm2013-cfp before the first round closes on Monday 29 October 2012.

***

I’m pleased to announce that the Open Programming Miniconf — a fixture for application developers attending Linux.conf.au since 2010 — is returning as part of Linux.conf.au 2013, to be held in January at the Australian National University in Canberra. The Miniconf is an opportunity for presenters of all experience levels to share their experiences in application development using free and open source development tools.

The 2013 Open Programming Miniconf invites proposals for 25-minute presentations on topics relating to the development of excellent Free and Open Source Software applications. In particular, the Miniconf invites presentations that focus on sharing techniques, best practices and values which are applicable to developers of all Open Source programming languages.

In the past, topics have included:

  • Recent developments in Open Source programming languages (“State of the language”-type talks)
  • Tools that support application development
  • Coding applications with cool new libraries, languages, and frameworks
  • Demonstrating the use of novel programming

If you want an idea of what sort of presentations we have included in the past, take a look at our past programmes.

To submit a proposal, visit http://tinyurl.com/opm2013-cfp and fill out the form as required. The CFP will remain open indefinitely, but the first round of acceptances will not be sent until Monday 29 October 2012.

OPM2013 is part of Linux.conf.au 2013, being held at the Australian National University, Canberra in January 2013. Further enquiries can be directed to Christopher Neugebauer via e-mail ( chris@neugebauer.id.au ) or via twitter ( @chrisjrn ).

Vale John Hunter, author of Matplotlib

In my BSc(Hons) thesis, which I submitted in 2010, I commenced the acknowledgements as follows:

“First, a hearty thanks to people whom I do not know: The developers of Python, Numpy, Scipy, the Python Imaging Library, Matplotlib, Weka, and OpenCV; you have collectively saved me much boring work throughout this past year, for which I am truly grateful.”

So to hear of the sudden death of John Hunter, creator and maintainer of Matplotlib, was truly saddening. Matplotlib is one of those pieces of software that has been absolutely instrumental in Python’s uptake as a language in maths, the sciences, and engineering. When I was a student, I found myself using Matplotlib very often — it was the best there was.

Tragically, John Hunter was in his mid-forties, and left behind a wife and three young daughters. Numfocus has created a memorial fund to care for and educate his daughters. I’ll be contributing to this fund as a way of thanking the Hunter family for John’s contribution to my own work.

Fernando Perez of IPython fame has written up a substantial post about John’s contribution to the community. PSF member, and PyCon US chair, Jesse Noller has also written a tribute to John.

It’s a somewhat strange feeling — coming to realise the contribution of one person only after he has died. Such is the way of Open Source — the impact of the tools we use and develop becomes more important than the people who develop them. And sometimes, developers are just happy to let things be that way.

Memoirs of a PyCon Australia organiser: Part 1 (of no idea how many)

This past weekend saw the staging of the third PyCon Australia conference. It’s been a very long time coming, and the subject of countless hours of hard work by myself (chasing sponsors, arranging to fill a programme, and ensuring delegates attended the conference), not to mention my amazing co-organisers, Joshua Hesketh, Matthew D’Orazio, and Josh Deprez.

PyCon Australia 2012

We held the conference in Hobart, my home city, and the capital city of Tasmania – this follows two successful conferences in Sydney. Despite a lot of scepticism about Hobart as a venue for a conference, we managed to attract 240 signups (placing us somewhere in the middle of the first two Sydney conferences in terms of attendance (woo!)).

CodeWars at PyCon Australia 2012

The first conference activity, the CodeWars programming tournament, started on Friday evening, with teams of up to 4 competing to solve programming problems against each other on projectors. This was a great event, which let delegates meet and greet each other before the conference started, and we’re very thankful to our event sponsor, Kogan, for helping us to make it happen.

This year, we were graced by the presence of two overseas keynote speakers – Mark Ramm, the current engineering manager on Canonical’s Juju project, and Kenneth Reitz, the chief Python guy at Heroku.

PyCon Australia 2012 - Opening

Mark’s passionate and entertaining keynote delved into the murky waters of product management, and showed that applying the tools of testing and scientific process to product development and evaluation was something well in the reach of everyday engineers, even those with small projects. A smattering of war stories from his days leading product management at SourceForge rounded the talk off. It was a great way to start the conference, and it really helped set the informal, enthusiastic tone of the event.

Kenneth Reitz at PyCon Australia 2012

Kenneth’s talk dwelled on his philosophies of designing libraries in Python. He’s the developer of the python-requests HTTP library – a library that has taken its rightful place as the obvious way to do HTTP in Python. His keynote gave us some strong insights into places where Python can make itself more accessible to newcomers, and easier to stay involved with for developers who use Python in their day-to-day lives. Kenneth’s presence was a great asset to the conference – through his keynote, and also through making himself readily available to chat with delegates in the hallway track. Hopefully we’ll be seeing him back at PyCon Australia in future years, with more of his Heroku colleagues.

PyCon Australia 2012

Our conference dinner was held at the beautiful Peppermint Bay restaurant near Woodbridge (some 30km South of Hobart); delegates were delivered there by the fast catamaran, the MV Marana. We saw some excellent views of Hobart at twilight – the silhouettes of Mt Wellington and the Hobart Hills were quite spectacular. Unfortunately, the river got a bit choppy near the entrance to the D’Entrecasteaux channel, which left a few of our delegates feeling a bit worse for wear. Luckily for us, the dinner itself was a fantastic evening of socialising, and finding out about other delegates’ interest in Python. It was a great event, with great food, and we’re going to have a lot of difficulty topping it.

PyCon Australia 2012 Sprints

There are countless people who made an amazing effort to help improve our conference, including our volunteers, our speakers (some of whom stepped in at the very last minute), Ritual Coffee (who produced their own custom blend for the conference, named “African Swallow”, no less!), the venue staff at Wrest Point (especially Kelly Glass, who’s put up with my worrying about conference rooms for several months now), our sponsors (who helped to keep the conference itself affordable), and many, many more. Their efforts helped make my life as an organiser so much more tolerable.

Anyway, that’s it for now. I expect that I’ll have a follow-up to this post, dwelling on what we did right as an organising team, and how we can improve for next year. Incidentally, the conference will be run in Hobart again next year – if you’re in a position to help out with sponsorship, shoot me an e-mail at sponsorship@pycon-au.org, and I’ll get a prospectus to you as soon as possible!

PyCon Australia early bird registrations now open!

For fear of spamming EVERYWHERE with the news, I include just the tl;dr:

tl;dr: PyCon Australia early bird registrations are now open! Find out more at http://2012.pycon-au.org/register/prices, including details of our accommodation programme.

The full media release on the opening of registration can be found at http://2012.pycon-au.org/media/news/15

Hope we see you all registered soon!

Talk — Android: The year of Linux on the palmtop?

Here’s my talk from the Hobart TasLUG meeting yesterday (18 April 2012) on the features of Android from the point of view of a Linux user — both from a technical perspective, and in terms of the issues arising from Android’s unique status as an Open Source OS for cellphones. If you want the video, you can download it, or watch it embedded later in this post. Enjoy!