Evented Github Adventure - Who writes the sweariest commit messages

Published on 2013-5-8

Okay, so now I've run all my projections over all that crazy data and have some results to show!

We now have a stream for the commits inside Github, and we have information about the repos associated with those commits. Now how about asking a question about those commits?

For reference, there are about 20 million commit messages in my event store, so I have more than enough data for this to be statistically relevant!

"Oh Github Github, in the cloud, who is the sweariest developer out loud?"

Well, this is the kind of thing we might do outside the store (after re-partitioning per-language inside the store), but I haven't got a secondary store, so I'm just going to build up a view model for my charting library directly inside the event store (using the commit events I made).

var swearwords = [ "poop", "arse", "sugarlumps" ] // Changed to protect the innocent

fromStream('github-commits')
  .when({
    "$init": function(state, ev) {
      // Start with an empty map of language -> { total, curses }
      return { }
    },
    "Commit": function(state, ev) {
      var language = ev.body.repo.language

      if(!state[language])
        state[language] = { total: 0, curses: 0 }

      var languageState = state[language]
      languageState.total += 1

      // Count each commit message at most once, however many swearwords it contains
      for(var i = 0 ; i < swearwords.length; i++) {
        var curse = swearwords[i]
        if(ev.body.commit.message.indexOf(curse) >= 0) {
          languageState.curses += 1
          break
        }
      }
      return state
    }
  })

And the results?

Well, I can go to

/projection/curses/state

And get a big pile of JSON, which looks a bit like this

{
  "ASP": { "total": 1, "curses": 200 },
  "OpenEdge ABL": { "total": 2, "curses": 0 },
  "Julia": { "total": 11, "curses": 0 }
}

Plugging this into d3, and filtering out the languages without enough entries (fewer than 5000 events), we get

Actually, let's normalise this for the lols and see who is really the sweariest. The results range from about 0% to 7% (the majority of developers are quite clean about things ;) )
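As a rough sketch of that filtering and normalising step (not the actual chart code), something like the following would turn the projection state into rows for a charting library. The names sampleState and toChartRows are made up for illustration, and the sample numbers are invented rather than real results; the only things taken from above are the total and curses fields and the 5000 event cut-off.

```javascript
// Hypothetical sample in the shape of the projection state shown above.
// The language names and numbers are invented, not real results.
var sampleState = {
  "LanguageA": { total: 10000, curses: 250 },
  "LanguageB": { total: 6000, curses: 30 },
  "LanguageC": { total: 11, curses: 0 }
}

// Turn the per-language state into chart rows: drop languages with too few
// commits, work out the percentage of sweary commits, sort sweariest first.
function toChartRows(state, minCommits) {
  return Object.keys(state)
    .filter(function(language) {
      return state[language].total >= minCommits
    })
    .map(function(language) {
      var entry = state[language]
      return {
        language: language,
        percent: 100 * entry.curses / entry.total
      }
    })
    .sort(function(a, b) {
      return b.percent - a.percent
    })
}

var rows = toChartRows(sampleState, 5000)
console.log(rows)
```

The resulting array of { language, percent } rows is the sort of thing that can be bound to a d3 selection to draw one bar per language.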

[Chart: % of commit messages containing curse words]

I'll leave you to draw your own conclusions about this chart, but I can't say that it comes as a huge surprise judging from the various developers on my Twitter feed ;-)

Scala developers are ducking filthy, but the Lisp programmers probably save their curse words for Emacs rather than the language they're using. Seems legit.

Projections are a great way to analyse streams to generate knowledge about what is going on. Of course, simply doing aggregations over data over time is something we can achieve in most systems, so in the next entry we'll look at something more interesting.

2015 © Rob Ashton. ALL Rights Reserved.