Jun 14

Code snippets for Keynote

I’ve been looking for an easy way to format code snippets for my Keynote presentation and everything seemed quite awkward to use (especially taking screenshots!). I could just add a link to the example and make it open a local webpage, but I’d rather not do that for very short snippets. After some quick research I’ve settled on highlight.

You can install it with homebrew:

brew install highlight

I put my snippets into a file (snippets.clj in the example below), and then run this to copy the formatted snippet to clipboard:

highlight --syntax=clojure -O rtf snippets.clj | pbcopy

-O rtf sets the output format (RTF is what Keynote needs), and --syntax=clojure picks the language. Highlight supports lots of languages and the customisation possibilities are endless. The end effect looks like this:

Sample slide with code snippet formatted using highlight

Aug 13

The end of summer. The end of internship. The end of awesome.

This has been a whirlwind of a summer, and I can’t believe it’s come to an end. Working at Mastodon C has been exactly what I wanted it to be: an awesome experience. I can honestly say I enjoyed every single day. Whether I was working my way out of JavaScript’s scoping labyrinths or learning the power of Cascalog, I felt a sense of pride knowing that I had directly participated as part of the team. With only very, very rare coffee requests ;-)


With the recent rise of big data and open data I feel very lucky to have been in the middle of it. During my first week of the internship I was introduced to the CDEC Open Health Data Platform idea, given a bunch of files, some examples of maps and Cascalog queries, and off I went.

It took me a couple of hours to work through Paul Lam’s “Cascalog for the Impatient” - enough to start writing my own queries. Cascalog, and Clojure for that matter, are amazing. You get to do stuff fast, clean and in a functional way. I used it to scrub the data and calculate diabetes prevalence stats. If a recent graduate whose only experience is in Java is able to do that in a few days, then everyone who dismisses the power of Clojure should go and ponder their technology choices ;-) Feel free to have a look at the GitHub repo; all the code is open source.

I have been a frequent user of online maps, but I had never created any maps myself. This was about to change. I was going to take the data processed with Cascalog and visualise it on interactive maps and charts. Two brilliant JavaScript libraries, leaflet and dimple, made this process much easier. And although my JavaScript was rusty, it all soon started to come together. The (still ongoing) results can be seen on the alpha site. The feedback it received on the day of the launch was the best reward I could imagine.

Since then I have created more maps, more charts and more queries. And during that time I have always been myself: I came to code, and Mastodon C allowed me to do just that. It has been like working in this unreal world where I wake up in the morning wanting to go to work, and where every day helps me to live my life in a more rewarding way. Being able to use my skills to help someone and see their positive reaction – priceless.

Thank you to all Mastodonians for guiding the way for me throughout the entire summer.

Fran (@fhr), for being a friendly, approachable boss, for introducing me to Women in Data, for inspiring me to overcome my fears of public speaking and balancing out the sci-fi references.

Bruce (@otfrom), for always having a positive attitude and constantly helping to point me in the right direction, for introducing me to the world of Clojure and for the long list of all sci-fi films/books I’m yet to watch/read.

Neale (@sw1nn), for the Iron Maiden talks, for always being willing to help out, for inspiring me through your passion for technology and for keeping lunch breaks interesting thanks to your amazing skill of remembering facts.

Steve (@sf105) and Merici (@merici), it’s been a pleasure to work with you both, even if only briefly. Thank you for the career advice and for the froyo!

It has been a wonderful experience to be part of a company that does meaningful things. If I were asked again what I wanted to be when I grow up, my answer would be: a Giant Robot Mastodonian: professional, expert and geeky. You guys rock!

Jul 13

How to process a (small) dataset with Cascalog

My learning at university involved processing a Twitter dataset using a Hadoop cluster (as part of a very useful module called High Performance Computing). All done using pure Java. These are not good memories. Jobs failing, jobs queuing, the cluster experiencing downtime, students panicking, staff complaining. Not to mention the need to write my own Mapper, Reducer, Combiner and job configuration. And the job was really tiny! Oh, and the need to create a new jar and copy it to HDFS each time I changed the code. If only I could just write a simple query and run it the way SQL is run...

One word: Cascalog.

If the job takes no longer than 20 minutes to run locally, then you really don’t need a Hadoop cluster for it. With Cascalog you can write your job in the form of a query, and you can write Clojure functions to do, well, whatever you want with your data. Read on to find out how to set up your environment to work with Clojure Cascalog (there is a pure Java Cascalog too, JCascalog, but NO.)

Hadoop installation

OS X: For the most useful instructions on how to get Hadoop on your Mac, using homebrew, head to Denny Lee’s article.
Other operating systems: Apache’s Wiki seems like a good source.

Clojure installation

OS X: You can do it quickly and pain-free using homebrew (and if you don’t have homebrew, shame on you!). Instructions here.

Pay attention to Step 3, where you learn how to set up a Clojure project. Your project.clj is a one-stop shop for downloading all dependencies, so this is where you’ll add your hadoop and cascalog jars.

(defproject project-name "0.1.1-SNAPSHOT"
  :description "Description"
  :url "https://www.url.com"

  ;; Runtime dependencies: Clojure itself, Cascalog and numeric helpers.
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]
                 [org.clojure/math.numeric-tower "0.0.2"]]
  ;; :dev adds midje and a Hadoop jar for working locally;
  ;; :provided lists the Hadoop jar that is not bundled into the uberjar.
  :profiles {:dev {:dependencies [[midje "1.5.1"
                                   :exclusions [org.apache.httpcomponents/httpcore]]
                                  [org.apache.hadoop/hadoop-core "1.0.4"
                                   :exclusions [org.slf4j/slf4j-api]]]}
             :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}}
  :main your-main
  :uberjar-name "your-jar.jar"
  :exclusions [org.apache.hadoop/hadoop-core])



I’m new to Emacs. It’s been a steep learning curve, but an enjoyable one. Emacs is not as memory hungry as your typical IDE. And using nrepl in Emacs proved to be easier than using it in IntelliJ (haven’t tried other IDEs though).

To write and run your code, get yourself Emacs Live. It’s a complete setup, so everything you need is already there. If you haven’t used Emacs before, it’s best to go through the built-in tutorial first. You can open it from inside Emacs Live using M-h t (should be Alt+h followed by t). In Emacs the two commonly used modifier keys are <Control> (usually labeled <Ctrl> and referred to as C) and <Meta> (usually labeled <Alt> and referred to as M).

Open your project.clj (or any other Clojure file) in Emacs, and connect to nrepl by running M-x nrepl-jack-in. This will load all your dependencies and set up your classpath.
For more commands, see the Emacs cheatsheet.

Write queries

Let’s say we have a series of CSV files, each looking like this: store_id,store_name,city,sales_total
and containing sales from a single month. We want to go through all the files and get the total sales (12 months’ worth) across all stores for each city.

First, let’s create our own namespace and require Cascalog along with a couple of other namespaces:
- cascalog.ops for the sum function
- cascalog.more-taps to parse delimited files, e.g. CSV

(ns my.project.sales
  (:require [cascalog.api :refer :all]
            [cascalog.ops :as ops]
            [cascalog.more-taps :refer [hfs-delimited]]))

Next, let’s create some helper functions. We’ll be dealing with numbers so it’s important to check if the value in a given column is actually a number, and then parse it:

(defn numbers-as-strings? [& strings]
  (every? #(re-find #"^-?\d+(?:\.\d+)?$" %) strings))

(defn parse-double [txt]
  (Double/parseDouble txt))
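A quick REPL check of the two helpers shows what will and won't get through the filter. The block below repeats the definitions so it can be pasted on its own. The namespace name sketch.helpers and the sample values are made up for illustration, and the :refer-clojure line is only there because Clojure 1.11 and later ship their own parse-double.

```clojure
(ns sketch.helpers
  ;; Clojure 1.11+ defines clojure.core/parse-double; exclude it so our own
  ;; definition below does not trigger a "being replaced" warning.
  (:refer-clojure :exclude [parse-double]))

(defn numbers-as-strings? [& strings]
  (every? #(re-find #"^-?\d+(?:\.\d+)?$" %) strings))

(defn parse-double [txt]
  (Double/parseDouble txt))

(numbers-as-strings? "12.5" "-3")   ; true
(numbers-as-strings? "12.5" "n/a")  ; false
(numbers-as-strings? "1,200")       ; false: thousands separators are rejected
(parse-double "12.5")               ; 12.5
```

Rows that fail the check are simply dropped by the query, so it's worth knowing up front which formats the regex rejects.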

Now we can construct our query (and write it as a function of course):

(defn total-sales-per-city [input]
  (<- [?city ?total]
      (input :> ?store-id ?store-name ?city ?sales-string)

      (numbers-as-strings? ?sales-string)
      (parse-double ?sales-string :> ?sales)

      (ops/sum ?sales :> ?total)))

The query takes a single line from the file as input, binds each comma-separated value to a variable (prefixed with ?), checks and parses the numbers, and then uses the sum function to add up all sales. Since we declared ?city as a result variable of the query, Cascalog will partition the records by ?city and apply the ops/sum aggregator within each partition. Quite nifty, isn’t it?
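To see what the query computes without touching Hadoop at all, here is a rough plain-Clojure equivalent of the same filter, parse and sum-per-city steps. This is only a sketch for building intuition, not how Cascalog works internally; the sample rows and the names numeric-string? and total-sales-per-city-in-memory are hypothetical.

```clojure
;; Hypothetical rows in the store_id,store_name,city,sales_total layout.
(def sample-rows
  [["1" "North"   "Leeds" "100.5"]
   ["2" "South"   "Leeds" "200"]
   ["3" "Central" "York"  "50"]
   ["4" "East"    "York"  "n/a"]])   ; not a number, so it gets filtered out

(defn numeric-string? [s]
  (boolean (re-find #"^-?\d+(?:\.\d+)?$" s)))

(defn total-sales-per-city-in-memory [rows]
  (->> rows
       ;; keep only rows whose sales column is a well-formed number
       (filter (fn [[_ _ _ sales]] (numeric-string? sales)))
       ;; turn each row into a one-entry map {city sales}
       (map (fn [[_ _ city sales]] {city (Double/parseDouble sales)}))
       ;; merging with + plays the role of grouping by ?city and ops/sum
       (apply merge-with +)))

(total-sales-per-city-in-memory sample-rows)
;; => {"Leeds" 300.5, "York" 50.0}
```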

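The last step is to use the above function:

;; #_ makes the reader skip this form when the file is loaded;
;; evaluate the inner (let ...) by hand to run the job.
#_ (let [data-in "./input/sales/"
         data-out "./output/total-sales-per-city/"]
     (?- (hfs-delimited data-out :sinkmode :replace :delimiter ",")
         (total-sales-per-city
          (hfs-delimited data-in :delimiter ","))))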

?- executes the query and emits the results to the specified tap. In the example above, the input path is bound to data-in and the output path to data-out. Both taps are hfs-delimited, since I’m reading from CSV files and want to write CSV as well. To overwrite the output on each run (instead of deleting it manually every time) we can specify :sinkmode as :replace.
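Once the job has run, the results can be pulled back into the REPL for a quick look. The sketch below assumes the usual Hadoop layout, where the sink path is a directory holding one or more part-* files; read-results is a hypothetical helper, not part of Cascalog.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn read-results
  "Reads every part-* file under dir and returns the rows as vectors of strings."
  [dir]
  (->> (file-seq (io/file dir))
       ;; only the data files; this skips the directory itself and markers such as _SUCCESS
       (filter (fn [^java.io.File f] (.startsWith (.getName f) "part-")))
       (sort-by (fn [^java.io.File f] (.getName f)))
       (mapcat (fn [f] (str/split-lines (slurp f))))
       (remove str/blank?)
       ;; same delimiter as the one given to the sink tap
       (map (fn [line] (str/split line #",")))))

;; Example call, using the output path from the query above:
;; (read-results "./output/total-sales-per-city/")
```

The values come back as strings, so run the sales column through parse-double if you want to do arithmetic on it.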

To run:
1. Set the namespace to the current buffer: C-c M-n
2. Compile: C-c C-k
3. Move the cursor over to the query and execute: C-M-x

That’s it. It’s really that simple. For more examples, head to Nathan’s article.

Enjoy writing queries, because Cascalog is fun!