This post was inspired by a great article on Bostata.com called “Client-side instrumentation for under $1 per month. No servers necessary.”

Sounds pretty good, right?

In brief, the author, Jake, describes how you can build a data warehouse that collects events from your website using the Snowplow event tracker on the Amazon Web Services stack:

  • CloudFront to record events
  • S3 to store raw and processed data
  • Lambda to enrich raw data
  • Athena to process it and provide access to the data via SQL

The end result is data stored in S3, which is essentially just text storage, and processed on the fly with AWS Athena. So you pretty much own your business data, which is very much in line with the philosophy of this blog.

Plus it should be cheap: there are no always-running servers, and S3, Lambda and Athena are inexpensive, especially for the comparatively small volumes of data that small and medium businesses generate.

So I couldn’t miss the opportunity to give this approach a try, with help from my colleague at Magenable.

I will not repeat what Jake wrote; you should read the original post. I will just stress several things that were not so obvious.

Snowplow

Snowplow is a platform for collecting event data. Think of it as your own Google Analytics or, more precisely, its tracking part.

Snowplow architecture, source https://github.com/snowplow/snowplow

We didn’t have much experience with Snowplow, so we had to learn a few things. Probably the most important is the tracker itself, the JavaScript code that you call from your website pages.

It can be either self-hosted or loaded from the public copy hosted by Snowplow. We used the public one for the experiment, but setting up your own isn’t that hard: you can host it on S3, so there is no need for a real server.

Data enrichment (Lambda)

Without enrichment you get very basic data from your tracker, and you need to clean it to make further processing easier. Hence we need data enrichment. In the original Snowplow architecture, enrichment is done using one of 3 applications. They are open source and can be hosted on AWS EC2. However, we wanted to do it more cheaply and use a serverless Lambda function. Unfortunately, the details of this process were omitted from Jake’s post, so that was the hardest part of our experiment.

Lambda supports a number of programming languages, including Python, which we decided to use for our simple enrichment script. It took some time, but eventually the Lambda function was ready, deployed and working. It takes raw data from one S3 folder, processes it and puts the result into another S3 folder.
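To give a feel for the shape of such a function, here is a minimal sketch. It is not the code from our repository. It assumes the raw files are gzipped CloudFront access logs in the standard tab-separated layout, that the tracker payload sits in the query-string column, and that the folders are called raw/ and enriched/ (both names are made up for the example). CloudFront may percent-encode some values a second time, which this sketch ignores.

```python
import gzip
import json
from urllib.parse import parse_qs, unquote_plus

RAW_PREFIX = "raw/"            # hypothetical input folder
ENRICHED_PREFIX = "enriched/"  # hypothetical output folder
QUERY_COLUMN = 11              # cs-uri-query in the standard CloudFront log layout


def enrich(raw_gz):
    """Convert one gzipped CloudFront log file into newline-delimited JSON bytes."""
    events = []
    for line in gzip.decompress(raw_gz).decode("utf-8").splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#Version' / '#Fields' header
        columns = line.split("\t")
        if len(columns) <= QUERY_COLUMN:
            continue  # malformed line
        # The tracker sends each event as query-string parameters (e, aid, p, ...).
        params = {k: v[0] for k, v in parse_qs(columns[QUERY_COLUMN]).items()}
        if params:
            events.append(json.dumps(params, sort_keys=True))
    return "".join(e + "\n" for e in events).encode("utf-8")


def lambda_handler(event, context):
    """Entry point for an S3 'object created' notification."""
    import boto3  # imported lazily so enrich() can be tested without AWS libraries

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if not key.startswith(RAW_PREFIX):
            continue  # ignore objects outside the raw folder
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        enriched = enrich(body)
        if enriched:
            out_key = ENRICHED_PREFIX + key[len(RAW_PREFIX):] + ".json"
            s3.put_object(Bucket=bucket, Key=out_key, Body=enriched)
```

Keeping the transformation in a plain function (enrich) separate from the S3 plumbing makes it easy to test locally with a sample log file before deploying anything.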

You can find the function on GitHub: https://github.com/ownyourbusinessdata/snowplow-s3-enrich

It contains a number of files, mainly because we need to work with the MaxMind GeoLite database of GeoIP data.

Another source of enrichment is the CloudFront logs.
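As a rough illustration of what the logs can contribute, the sketch below pulls the timestamp, visitor IP and user agent out of a single log line. It assumes the standard CloudFront access log layout, where the first twelve tab-separated columns are the ones listed in the code and times are in UTC. The output keys borrow names from Snowplow's event model, but the mapping itself is our assumption for the example, not a description of the real function.

```python
from datetime import datetime, timezone
from urllib.parse import unquote

# First columns of a standard CloudFront access log line, in order.
CLOUDFRONT_COLUMNS = [
    "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
    "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)", "cs(User-Agent)",
    "cs-uri-query",
]


def cloudfront_context(line):
    """Extract request metadata from one tab-separated CloudFront log line."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(CLOUDFRONT_COLUMNS):
        raise ValueError("not a complete CloudFront log line")
    row = dict(zip(CLOUDFRONT_COLUMNS, values))
    # CloudFront writes date and time as separate UTC columns.
    tstamp = datetime.strptime(
        row["date"] + " " + row["time"], "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)
    return {
        "collector_tstamp": tstamp.isoformat(),
        "user_ipaddress": row["c-ip"],
        # The user agent is percent-encoded in the log, so decode it once.
        "useragent": unquote(row["cs(User-Agent)"]),
    }
```

The visitor IP recovered this way is also what a GeoIP lookup would take as input.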

At the moment the function covers around 60 attributes, but we have just started.

Plugging in analytic tools

Once the data is collected and Athena is configured, you can access it with SQL and plug in almost any decent analytical tool. We’ve tried AWS’s own QuickSight, Mode Analytics and R, but we will probably cover how that was done in another blog post.

For now, here are a couple of charts quickly produced from the collected data.

Hourly sessions
Browser statistics, R/ggplot2
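The numbers behind a chart like the hourly sessions one could come from an Athena query along these lines. This is only a sketch: the database name, table name and results bucket are placeholders, and it assumes the enriched table has a timestamp column collector_tstamp and a session column domain_sessionid, named as in Snowplow's event model.

```python
HOURLY_SESSIONS_SQL = """
SELECT date_trunc('hour', collector_tstamp) AS hour,
       count(DISTINCT domain_sessionid) AS sessions
FROM {table}
GROUP BY 1
ORDER BY 1
"""


def build_query_request(database, table, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": HOURLY_SESSIONS_SQL.format(table=table).strip(),
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results as files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Placeholder names, replace with your own database, table and bucket.
request = build_query_request("snowplow", "events", "s3://my-bucket/athena-results/")
```

Calling run_query(request) only starts the query. The results land in the S3 output location, where a BI tool or a script can pick them up once the query has finished.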

Update: Terraform script added for quick and easy deployment

We’ve added a Terraform script that automates creation of the AWS infrastructure, so you can deploy the tracking described above quickly! Check it out in our GitHub repository.

Published by Alex Levashov

Alex lives in Melbourne, Australia, where he runs eCommerce consultancy Magenable.
