Tag Archives: honours

Booting out the Warmest 100

(Beware – this article includes a link to some probable spoilers for tomorrow’s Hottest 100 count. You can read this article without reading those spoilers.)
 

You’re probably familiar with Triple J’s Hottest 100. It’s the world’s largest write-in music poll. Last year, Triple J gave voters an easy, shareable link for posting their votes to Twitter and Facebook. Alas, those links were easily scraped from the web, and the Warmest 100 (link to 2012 count) was born. It revealed the top 10 (though not its order), and guessed the top three perfectly.

This year, voters weren’t given a shareable link, but a few thousand people took photos of their confirmation e-mails and posted them to Instagram. With a little OCR work, the Warmest 100 guys posted their predictions for this year. They found about half as many votes as last year’s scraping turned up, which is no mean feat, given the lack of indexing.

So the question is — how useful are these votes in predicting the Hottest 100? What songs can we be sure will be in the Hottest 100? How certain is that top 10?

Both years, Justin Warren independently replicated their countdown (spoilers), and has written up his methodology for collecting the votes this year. I asked him for his data so I could do some analysis; happily, he obliged.

He’s since updated his method, and his counts, and written those up, too (spoilers).

Update: he’s updated his method *again* based on some feedback I offered, and has also written that up (spoilers). This is the data my final visualisation runs off.

So, what have I done with the data?

Bootstrap Analysis

When you have a sample — a small number of votes — from the entire count, you can’t really be certain where each song will actually appear in the count.

In this case, Justin collected 17,000 votes out of an estimated 1.5 million. That’s a sample of roughly 1% of the total estimated vote. It’s a sample, but we have no idea how well it matches the actual count.

If we think that the votes we have are a representative sample of all of the votes, then what we’d like to know is what would happen if we scaled this sample up to the entire count. Where will songs end up if there’s a slight inaccuracy in our sample?

The good news is that computers give us a way to figure that out!

Bootstrap analysis (due to Efron) is a statistical technique that takes a sample of votes from the whole set, and randomly generates a new set of votes of about the same size, drawn with replacement. The trick is that you weight each song by the number of votes it received in the sample, so songs are picked in roughly the same proportion as they appear in the sample. The randomness of the draw adds noise.

You can think of this sample as a “noisy” version of the original sample. That is, it will be a version of the original sample, but with slight variation.

If you repeat this sampling process several thousand times, and rank the songs each time, you can get a feel for where each song could appear in the rankings.
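That resample-and-rank loop is straightforward to sketch in Python. This is a minimal sketch, not Justin’s or my actual pipeline; the function name and the dict-of-counts data layout are my own:

```python
import random
from collections import Counter

def bootstrap_ranks(vote_counts, trials=10_000, rng=None):
    """Bootstrap the rankings: resample the votes (with replacement),
    weighted by each song's share of the sample, then re-rank.

    vote_counts: dict mapping song -> votes in the observed sample.
    Returns a dict mapping song -> list of ranks (1 = top) across trials.
    """
    rng = rng or random.Random()
    songs = list(vote_counts)
    weights = [vote_counts[s] for s in songs]
    n = sum(weights)  # keep each resample the same size as the sample
    ranks = {s: [] for s in songs}
    for _ in range(trials):
        # Draw n weighted votes with replacement: the "noisy" sample
        resample = Counter(rng.choices(songs, weights=weights, k=n))
        ordering = sorted(songs, key=lambda s: -resample[s])
        for rank, song in enumerate(ordering, start=1):
            ranks[song].append(rank)
    return ranks
```

A well-separated song will get the same rank in nearly every trial; songs bunched close together in votes will bounce around, which is exactly the spread the confidence intervals capture.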

How do you do that? Well, look at all of the rankings a given song received across the randomised sets. Sort that list, and take the middle 98% of it. Based on that middle 98% of rankings, you can be 98% confident that the song will land at one of those positions. In statistics, this middle 98% is called the 98% bootstrap confidence interval.

You can repeat this for different confidence levels, by picking a different amount of rankings around the middle.
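The percentile step is just index arithmetic on the sorted list of ranks. A sketch (the helper name is my own; it assumes you already have one song’s list of bootstrapped rankings):

```python
def confidence_interval(rank_list, level=0.98):
    """Middle `level` fraction of one song's bootstrapped ranks.

    rank_list: the rank the song achieved in each bootstrap trial.
    Returns (best, worst) rank bounds at the given confidence level.
    """
    ranks = sorted(rank_list)
    tail = (1.0 - level) / 2.0          # e.g. chop 1% off each end for 98%
    lo = int(tail * (len(ranks) - 1))
    hi = int((1.0 - tail) * (len(ranks) - 1))
    return ranks[lo], ranks[hi]
```

Changing `level` is all it takes to produce the nested 70%, 98% and 99% bands.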

I’ve used Google Spreadsheets to visualise these confidence intervals. The lightest blues are the 99% confidence intervals; the darkest blue intervals are the 70% confidence intervals. The darkest single cell is the median — i.e. the middle of all of the rankings that we collected for that song in the bootstrap process.

The visualisation is up on Google Docs. (spoilers, etc).

I’ve run the same visualisation on Justin’s 2012 data; it’s less of a spoiler than the 2013 version, if you care about that, and it can inform the rest of the article for you.

Notes

First up, a bit on my methodology: Justin’s data didn’t separate votes into their original ballots, so I had to pick songs individually. To improve accuracy, I selected songs in blocks of 10, with every song in a block distinct — this vaguely resembles the actual voting process.

In my experiments, I ran the sampling and ranking process 10,000 times.
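Concretely, each simulated ballot of 10 distinct songs can be drawn one song at a time, removing each pick from the pool so it can’t repeat. This is a sketch with a made-up helper name, where the weights are each song’s vote count in the scraped sample:

```python
import random

def sample_ballot(vote_counts, k=10, rng=None):
    """Draw one simulated ballot of k distinct songs, weighted by each
    song's share of the observed votes (no repeats within a ballot)."""
    rng = rng or random.Random()
    pool = dict(vote_counts)
    ballot = []
    while len(ballot) < k:
        songs = list(pool)
        pick = rng.choices(songs, weights=[pool[s] for s in songs], k=1)[0]
        ballot.append(pick)
        del pool[pick]  # a song can only appear once per ballot
    return ballot
```

Sampling in distinct blocks like this slightly dampens the noise for very popular songs compared to fully independent draws, which is the point: real voters can’t vote for the same song ten times.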

You’ll notice some interesting trends in this visualisation. The first one is that the higher the song is in the countdown, the narrower its blue interval is. Why is this so?

Well, as songs get more popular, the gap in votes between adjacent songs grows. In Justin’s sample of the votes, #100 and #73 were separated by just 15 votes. So if one or two votes changed between #73 and #100, that ordering could change spectacularly. Given Justin’s sample is of 17,000 votes, 15 votes represents a 0.1% change in the vote.

So at those low rankings, a tiny change in votes can make for a massive difference in ranking.

At the other end of the count, #1 and #2 are separated by 16 votes. #3 and #4 are separated by 22 votes. #4 and #5 are separated by 51 votes. Down the bottom of the list, where 15 votes could move a song 27 places in our count, you’d need 16 votes to change just to swap positions 1 and 2.

What this means from a statistical perspective is that the closer to the top you are, the more work you need to do to change your position in the count.

You’ll also see this phenomenon on the right-hand side of the intervals: a band of a given colour will generally stretch further to the right of the median than the same colour does to the left. Once again, this is because lower ranks swap around more easily than higher ranks.

Update: Since writing this article, I ran one more test: how many of the songs in the top 100 of Justin’s raw sample of votes will make it into the actual Hottest 100? Well, bootstrapping helps us here too. For each bootstrap trial, I take the trial’s top 100 songs, and count how many of those are in the raw top 100. I reckon, with 98% confidence, that we’ll get 91 songs in the actual Hottest 100. Thanks to David Quach for the suggestion.
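That per-trial test boils down to a set intersection. A sketch (the function name is my own, not from the original analysis):

```python
def top_n_overlap(raw_counts, trial_counts, n=100):
    """Count how many of the raw sample's top-n songs also land in a
    single bootstrap trial's top n."""
    raw_top = set(sorted(raw_counts, key=raw_counts.get, reverse=True)[:n])
    trial_top = set(sorted(trial_counts, key=trial_counts.get, reverse=True)[:n])
    return len(raw_top & trial_top)
```

Collect this number across all the trials, take the middle 98% of the results, and you have a confidence interval on how many of the raw top 100 should survive into the real countdown.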

In summary: the Warmest 100 approach is statistically a very good indicator of the top 4 songs. The top 4 is almost certainly correct (except that 1&2 and 3&4 might swap around between themselves). Everything up to #7 will probably be in the top 10.

The sampling approach is less accurate at the bottom, but I’m pretty confident everything in the top 70 will be in the actual top 100.

I’m also pretty confident that 91 of the songs in the raw top 100 will appear in the actual top 100.

End

I’ll be making some notes on how these confidence intervals are borne out in the actual count on Monday. I’m very interested to see how this analysis gives us a better idea of how accurate the Warmest 100 actually is.

Vale John Hunter, author of Matplotlib

In my BSc(Hons) thesis, which I submitted in 2010, I commenced the acknowledgements as follows:

First, a hearty thanks to people whom I do not know: The developers of Python, Numpy, Scipy, the Python Imaging Library, Matplotlib, Weka, and OpenCV; you have collectively saved me much boring work throughout this past year, for which I am truly grateful.”

So to hear of the sudden death of John Hunter, creator and maintainer of Matplotlib, was truly saddening. Matplotlib is one of those pieces of software absolutely instrumental in Python’s uptake as a language in the fields of maths, the sciences and engineering. When I was a student, I’d find myself using Matplotlib very often — it was the best there was.

Tragically, John Hunter was only in his mid-forties, and left behind a wife and three young daughters. NumFOCUS has created a memorial fund to care for and educate his daughters. I’ll be contributing to this fund as a way of thanking the Hunter family for John’s contribution to my own work.

Fernando Perez of IPython fame has written up a substantial post about John’s contribution to the community. PSF member, and PyCon US chair, Jesse Noller has also written a tribute to John.

It’s a somewhat strange feeling — coming to realise the contribution of one person only after he died. Such is the way of Open Source — the impact of the tools we use and develop become more important than the people who develop them. And sometimes, developers are just happy to let things be that way.

Academia, gogo!

In today’s exciting post I describe a rather amusing series of events and the end result of it:

The events:

  • In August I submitted a paper to a Computer Vision conference being held in New Zealand in November. This is entirely sensible because my honours research received a first-class grade and was in the field of computer vision.
  • In September, a large earthquake occurred in the Christchurch region, causing much pandemonium amongst organisers of said conference.
  • On Tuesday this week, my paper got accepted. Naturally, the conference was organised by people in Christchurch, and they were set back several weeks by the earthquake.

So the conference is on November 8 and 9 in Queenstown, New Zealand; this leaves me just over two weeks to:

  1. Arrange travel
  2. Revise the paper based upon reviewers’ comments
  3. Prepare a poster to present at the conference
  4. Get there

Hrrrnght.

Honours Etc

Oops, I appear to have forgotten to update my blog (as usual), and forgot to mention anything at all about my Thesis or my Honours work for the past four months. I truly can’t be bothered writing about it at the moment, so I’ll just mention that I submitted it a couple of weeks ago, and that I received a mark of First Class for it. I’m pretty happy about that.

More news at 11!

Summer of Etc!

Once again, I’ve left this site for faaaaar too long without letting you all know what I’ve been up to of late (oops). Needless to say, a fair bit has happened in the past few weeks, and it’s probably worth telling you all about this.

Honours, Semester 1 (during semester 2)

Uni study’s been going quite swimmingly of late: both my units of study went pretty well (insofar as I got HDs in them); my thesis, on the other hand, has only really just started to take off. My research is into the computer vision task of object detection (for example, finding faces in images); in particular, I’m working on improving the scheme built into the Intel OpenCV Library (Haar Classifier Cascades, if you’re at all interested) by having it consider colour.

One of the deficiencies I’ve discovered during my research is the lack of sufficiently large real-world colour face datasets to perform detection upon: whilst I need on the order of 2000 faces (1000 to train upon, 1000 to test upon), the largest useful academic set is an order of magnitude smaller. For this reason I’m developing my own set. My current intention is to assemble the data set entirely from Creative Commons-licensed data (e.g. from Flickr and Wikipedia) and to release the resultant set under CC licenses too. I expect I’ll give a lightning talk at LCA on this; I’ll also dump a blog post here somewhere about what sort of data I’d like donated.

Summer of Google

One thing that’s looking like it will derail my Honours work slightly happened not too long ago. I applied for a Software Engineering Internship at Google Sydney back in July, and didn’t hear much about it. In late October, however, I very suddenly got contacted about it, and interviewed for the position, and quite happily, I was offered a job. This, amongst other things, involved dropping (almost*) everything for the summer, and moving to Sydney within two weeks, which I guess I’ve done somewhat successfully.

So I’m now working at Google until sometime during the first two weeks of semester (!). My current project involves working on [redacted], to make [redacted] do [redacted]; in related news, the new Sydney offices are pretty damn cool, the food is excellent, and the work is fun. I’m really looking forward to the rest of my time here.

*I guess the most important thing to mention here is that I’m still spending my week-and-a-bit in Wellington for Linux.conf.au 2010, and that I’ll still be running the Open Programming Languages Miniconf there. I can hardly wait!

3 birds…

Let’s kill two birds with one picture, as it were…

Point 1? I got an Honours scholarship. Yay me! Secondly? I got my final mark today (for Functional Analysis), a very satisfying 95 (better than my previous marks for the semester by a long way). This means that I now officially have sufficient credit to graduate with a Bachelor of Science (though this is mostly a formality; I’ve been doing Honours study for two weeks now), and I’ll do so in two weeks’ time. Awesome!

The third bird? My Honours thesis topic has been allocated. Put as vaguely as possible, it’s about augmenting a machine learning-based object detection system (for images) to use colour images instead of black and white. My supervisor is Mike (my ACM-ICPC coach, as it were). I’ll _try_ and explain it better once I’ve done a bit more reading than I have so far…

Normal service to be resumed later,

–Chris

More Thingies!

Time for another status report on things that have happened recently!

More Uni!

First up, I’ve started on my Honours year! Isn’t that exciting? As I’ve learnt this week, the next 12 months for me will consist of 4 coursework units and a research thesis. This semester, it looks like I’ll be studying Embedded Systems (yay! I get to program some microprocessors! Whoo!), Computing in Context (a research-intensive unit in HCI), and possibly one other, depending on what the unit outline for it looks like. My thesis I’m not so sure about, given that the process by which we get assigned supervisors hasn’t occurred yet. Currently, I have a pile of 12 project areas for my perusal, from which I must rank 6 proposals by order of how much I want to study them. At the moment, there are some interesting-looking proposals relating to Machine Learning, and some interesting ones relating to web monitoring; I find out what I’ve been assigned by Friday (very exciting, no?).

Linux.conf.au 2010

Linux.conf.au 2010 is being held in Wellington, New Zealand. One of the things that makes LCA a truly wonderful conference is the first two days, devoted to single-day “miniconfs” on topic areas of interest to the Free and Open Source Software communities. I’m currently involved with two proposals: I’m primary proposer of a developers’ miniconf (called “Open Languages”) aimed at uniting the developer communities of open source programming languages, and I’m secondary proposer of an education-flavoured miniconf. I’d be equally happy if either of these proposals gets up, but with 30 other awesome proposals competing for 12 miniconf openings, there’s going to be some very stiff competition.

Blackjack?

Hey, turns out I turned 21 on Wednesday. How did I manage that?