Impatiently learning Cascalog - Part 2

Published on 2013-9-10

I'm skimming through Cascalog for the Impatient and documenting my questions/learnings as I go for my future benefit.

Part 2

Apparently in this part we're going to update our first code so that it counts the words in our document, and that's the first step towards a tf-idf implementation - cool story bro, I have no idea what one of those is but moving on.

I quite like the explanation given at the beginning of the article for why it's important that we be able to copy data from one place to another, and why we'd use Cascalog for this. Basically we're talking about being able to make guarantees about this operation and that's going to be important when we're trying to write logic on top of this process.

So anyway, we're given the following code

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
    (s/split line #"[\[\]\\(\),.)\s]+"))

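I guess that 'defmapcatop' is a macro for defining mapcat-style operations (one line in, many words out), and this one appears to be splitting a line on whitespace plus a handful of punctuation characters (brackets, parentheses, backslashes, commas and full stops).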

Googling this doesn't turn up the documentation, which is a bit unfortunate - but a bit of rummaging around finds a handy page for "which def should I use", which suggests that I'm on the right track with that line of thought.
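To convince myself what that regex actually does, here's a minimal sketch in plain Clojure at the REPL, with no Cascalog involved. The function name split-line and the sample sentence are mine, not from the guide.

```clojure
(require '[clojure.string :as s])

;; Hypothetical helper: the body of the guide's `split` op,
;; written as an ordinary Clojure function.
(defn split-line [line]
  (s/split line #"[\[\]\\(\),.)\s]+"))

;; Made-up sample sentence, not taken from rain.txt.
(split-line "A rain shadow, (dry) area.")
;; => ["A" "rain" "shadow" "dry" "area"]
```

So runs of punctuation and whitespace are treated as a single separator, and the words come back as a vector.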

Okay, so we can move on from this pretty swiftly and see how we're going to use this

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

Well, I have to say I can barely read this - my Clojure-fu is not strong when mixed with the Cascalog.

But, we can see

I am mega-confused reading this because I can't actually tell how it maps to what I know about Clojure/Lisp.

The way I understand the documentation is that Cascalog looks at the dependencies of each predicate and only runs them when they have been fulfilled. I guess the 'sink' relies on ?word and ?count being available and isn't run until they are, or something like that.

This is neatly explained by the Cascalog for the Impatient guide in terms of the "logic programming" paradigm so I'll accept that for now.
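For my own sanity, here's roughly what I think the query computes, written as ordinary Clojure over an in-memory list of lines. It's a sketch under assumptions: split-line and word-count are names I made up, the input lines are invented, and the real thing runs as a Hadoop job rather than over a vector.

```clojure
(require '[clojure.string :as s])

;; Hypothetical plain-function version of the guide's `split` op.
(defn split-line [line]
  (s/split line #"[\[\]\\(\),.)\s]+"))

;; Hypothetical in-memory equivalent of the query:
;; mapcat     - every line emits many words (my reading of defmapcatop)
;; frequencies - group by word and count each group, which is what I
;;               assume ?word plus (c/count ?count) amounts to
(defn word-count [lines]
  (frequencies (mapcat split-line lines)))

(word-count ["A rain shadow" "A dry area"])
;; => {"A" 2, "rain" 1, "shadow" 1, "dry" 1, "area" 1}
```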

I suspect that the (?<- thingy is actually a macro of some sort that rewrites this into something more sane, but who knows, right?
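As far as I can tell from the Cascalog API, ?<- is shorthand for building a query with <- and then executing it against a sink with ?-, so the main function could be written in two steps. Sketch only: the ns form is my guess at what the guide's project uses, and it needs the Cascalog and Hadoop dependencies on the classpath to run.

```clojure
;; Assumed namespace declaration; the post doesn't show the real one.
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  (s/split line #"[\[\]\\(\),.)\s]+"))

(defn -main [in out & args]
  ;; Step 1: <- only defines the query, nothing runs yet.
  (let [query (<- [?word ?count]
                  ((hfs-delimited in :skip-header? true) _ ?line)
                  (split ?line :> ?word)
                  (c/count ?count))]
    ;; Step 2: ?- executes the query and writes the results to the sink.
    (?- (hfs-delimited out) query)))
```

Splitting it like this at least makes it obvious which bit is "the query" and which bit is "go and run it".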

Running this with

lein uberjar
hadoop jar ./target/impatient.jar data/rain.txt output/wc

Gives me a wonderful "out of memory exception", so I post a dump on the mailing list and have a look at my environment.

Things I tried while I waited for a response

Side note: The project pages for Hadoop are awful. I had to go through a dozen links before I got to download anything - it felt like it was trying to make me feel stupid, but oh well - carrying on.

The real output?

A    3
Australia    1
Broken    1
California's    1
DVD    1
Death    1
Land    1
Secrets    1
This    2
Two    1
Valley    1

Etc - so I'm happy enough with that.

I'm still not that happy with the crazy syntax of the Clojure, I'm grabbing at it and going with the rolling assumption that the logic-like-system is just a bunch of macros on top of vanilla Clojure and "just works", so "shut up and carry on Rob".

Onto part 3 then...

2020 © Rob Ashton. ALL Rights Reserved.