How to process (small) dataset with Cascalog

My learning at university involved processing Twitter dataset using Hadoop cluster (as part of a very useful module called High Performance Computing). All done using pure Java. These are not good memories. Jobs failing, jobs queuing, cluster experiencing downtime, students panicking, staff complaining. Not to mention the need to write my own Mapper, Reducer, Combiner and a job configuration. And the job was really tiny! Oh, and the need to create a new jar and copy it to hdfs each time I changed the code. If only I could just write a simple query and run it the way SQL is run..

One word: Cascalog.

If the job does not take longer than 20 minutes to run locally, then you really don’t need a Hadoop cluster for that. With Cascalog you can write your job in a form of a query, and you can write Clojure functions to do, well, whatever you want with your data. Read below to find out how to set up your environment to work with Clojure Cascalog (there is a pure Java Cascalog too, JCascalog, but NO.)

Hadoop installation

OS X: For the most useful instructions on how to get Hadoop on your Mac, using homebrew, head to Denny Lee’s article.
Other operating systems: Apache’s Wiki seems like a good source.

Clojure installation

OS X: You can do it quickly and pain-free using homebrew (and if you don’t have homebrew, shame on you!). Instructions here.

Pay attention to Step 3, where you learn how to set up a Clojure project. Your project.clj is a one-stop shop for downloading all dependencies, so this is where you’ll add your hadoop and cascalog jars.

(defproject project-name "0.1.1-SNAPSHOT"
  :description "Description"
  :url "https://www.url.com"

  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]
                 [org.clojure/math.numeric-tower "0.0.2"]]
  :profiles {:dev {:dependencies [[midje "1.5.1"
                                   :exclusions [org.apache.httpcomponents/httpcore
                                                commons-io]]
                                  [org.apache.hadoop/hadoop-core "1.0.4"
                                   :exclusions [org.slf4j/slf4j-api
                                                commons-logging
                                                commons-codec
                                                org.slf4j/slf4j-log4j12
                                                log4j]]]}
             :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}}
  :main your-main
  :uberjar-name "your-jar.jar"
  :exclusions [org.apache.hadoop/hadoop-core
               org.clojure/clojure
               midje])

 

Emacs

I’m new to Emacs, it’s been a steep learning curve, but an enjoyable one. Emacs is not as memory hungry as your typical IDE. And using nrepl in Emacs proved to be easier than using it in IntelliJ (haven’t tried other IDEs though).

To write and run your code, get yourself Emacs-live. It’s a complete setup so everything you need is already there. If you haven’t used Emacs before it’s best if you go through the built-in tutorial. You can open it from inside Emacs Live using M-h t (should be Alt+h followed by t). In Emacs the two commonly-used modifier keys are <Control> (usually labeled <Ctrl> and referred to as C), and <Meta> (usually labeled <Alt> and referred to as M).

Open your project.clj (or any other clojure file) in Emacs, and connect to it nrepl by running M-x nrepl-jack-in. This will load all your dependencies and set up your classpath.
For more commands, use Emacs cheatsheet.

Write queries

Let’s say we have a series of CSV files, each looking like this: store_id,store_name,city,sales_total
and containing sales from a single month. We want to go through all the files and get the total sales (12 months worth) from all stores per each city.

First, let’s create our own namespace, and import Cascalog and some other namespaces:
– cascalog.ops to use sum function
– cascalog.more-taps to parse delimited files, e.g. CSV

(ns my.project.sales
  (:require [cascalog.api :refer :all]
            [cascalog.ops :as ops]
            [cascalog.more-taps :refer [hfs-delimited]]))

Next, let’s create some helper functions. We’ll be dealing with numbers so it’s important to check if the value in a given column is actually a number, and then parse it:

(defn numbers-as-strings? [& strings]
  (every? #(re-find #"^-?\d+(?:\.\d+)?$" %) strings))

(defn parse-double [txt]
  (Double/parseDouble txt))

Now we can construct our query (and write it as a function of course):

(defn total-sales-per-city [input]
  (<- [?city ?total]
      (input :> ?store-id ?store-name ?city ?sales-string)

      (numbers-as-strings? ?sales-string)
      (parse-double ?sales-string :> ?sales)

      (ops/sum ?sales :> ?total)))

Query takes a single line from the file as an input, bounds each coma-separated value to a variable (preceded with ?), checks and parses numbers, and then uses sum function to add up all sales. Since we declared ?city as a result variable of the query, Cascalog will partition the records by ?city and apply the ops/sum aggregator within each partition. Quite nifty, isn’t it?

Last step is to use the above function:

#_ (let [data-in "./input/sales/"
         data-out "./output/total-sales-per-city/"]
     (?- (hfs-delimited data-out :sinkmode :replace :delimiter ",")
         (total-sales-per-city
          (hfs-delimited data-in :delimiter ","))))

?- executes the query and emits the results to the specified tap. In the example above, I assigned input path to a variable data-in, and output path to data-out. Taps I’m using are both hfs-delimited, as I’m reading from a CSV, but I also want to write to a CSV. In order to overwrite the resulting file (if we don’t want to delete the file each time manually) we can specify :sinkmode as :replace.

To run:
1. Set the namespace to the current buffer C-c M-n
2. Compile: C-c C-k
3. Move the cursor over to the query and execute: C-M-x

That’s it. It’s really that simple. For more examples, head to Nathan’s article.

Enjoy writing queries, because Cascalog is fun!

3 thoughts on “How to process (small) dataset with Cascalog

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>