Prophecy: Teamwork’s Data Lake

Joe Minichino
Teamwork Engine Room
Jul 17, 2020 · 10 min read


You need a Data Lake.

The Context

Teamwork has been around for more than 10 years. It started out as a project management and work collaboration platform and later expanded into other areas, such as help desk, chat, document management and CRM software. As the company has grown and evolved, its data has grown, changed, expanded, diversified, fragmented, then changed again. Analytics in this landscape are not for the faint of heart.

The Problem

The issue that led us to re-evaluate our analytics approach at Teamwork was that we have far too many data sources to make sense of all the data we collect. Much of the data ends up being “dark”: underutilized, unleveraged. We have multiple database shards for each product, along with platform databases containing global customer information; and then we have 3rd parties such as Stripe, Marketo, Chart Mogul and Google Analytics, to name the most important ones. We realized that making a connection between sales leads in Marketo, their usage of our products (GA and our DBs) and their revenue (our DBs, Stripe, Chart Mogul) was impossible. Well, it was possible, but incredibly painful and poorly automated, making it slow and error prone.

And by addressing this problem we are getting a large number of benefits back, practically for free.

You need Analytics

Let’s cut to the chase: you need analytics. No matter how big or small your business or enterprise is, you need analytics to make more informed business decisions. I do not doubt there are individuals with great gut-driven decision-making skills, but by and large, it’s better to look at data when making decisions. That’s why you need a Data Lake. It could be a really small lake, it could be a pond or a puddle. But you need it for better decision making.

Data Lakes, Big Data and other buzzwords

Let’s clear the air on a couple of misconceptions. The terms “Big Data” and “Data Lake” absolutely scream of corporate, of enterprise, of governance, compliance, regulations and everything else that any passionate engineer out there probably wants to stay well away from. At least I do. Big Data does not mean “so much data your puny mind cannot possibly fathom” and Data Lake does not mean “so much unstructured data your unfathomable-Big-Data puny mind cannot possibly comprehend”. Yes, volume is one aspect of it, and variety is another (for the lovers of the 3/5/7 Vs of Big Data), but to me Big Data Engineering is the enabler of Analytics; the size of the data has very little to do with it, the variety and velocity much more. Really, if you find yourself with one database table, a bunch of CSV, XML and JSON files and some server logs, you are already knee-deep in a big data situation, no matter how small your data set is. And the most effective way to analyze all of the above is through classic Big Data Engineering tools and methodologies. To explain this concept further I did a demo for our folks at Teamwork in which I analysed a spreadsheet containing only about 22k rows of information using a (Big) Data processing pipeline.

Enter Prophecy

Our vision for our Data Lake is that it will give us a clear picture of the past (BI) and hopefully predict the future (predictive analytics). Hence the codename “Prophecy”.

How we treat Prophecy data

From the get-go, the mantra was absolutely clear. Automate-all-the-things. When you deal with the amount of data that we can potentially generate at Teamwork, there is no choice but for Prophecy to become a self-maintained entity that takes care of ingesting data, cleaning it, extracting the interesting parts and running analytics on those. At Teamwork we are AWS-based, so our tools are AWS tools, but don’t stop reading at this point. AWS tools very often wrap other software (e.g. Presto being the engine under the hood of Athena), or you could replace one part of our architecture with equivalent software that performs the same function (e.g. Kinesis and MSK can be replaced with Kafka, EMR with Hadoop).

We haven’t gone as far as automating the provisioning of the entire stack (yet), but Terraform-ing everything is on the roadmap.

Data Storage

S3. Plain as that. Yes, data might end up in Redshift or Elasticsearch for some specific analytics jobs, but first all the raw data is parked and stored in S3. For an AWS-based solution there is really no alternative and, by the way, as simple as S3 is, it is a great piece of software; we’ll see in future posts which not-so-well-known features can be really helpful.

Data Sources

So what are the tributaries of Prophecy? There are four at the moment:

  1. DMS (Database Migration Service): we replicate our DBs into S3, turning them into CSV files. A DMS task performs an initial load and subsequently captures all the changes and stores those in S3 as well. Now the history of every record in your DB is stored and kept forever. Free benefit: AUDIT!
    On a technical level, DMS captures changes by “pretending” to be a replication node: it listens to the binlog of the RDS MariaDB instance we have in place and translates the changes into CSV, prepending a column to each row that indicates whether the record was an Insert (I), Update (U) or a Delete (D). Really nice to have a snapshot of when a record was deleted. (A small sketch of reading these change files follows the list.)
  2. Kinesis Firehose: we have multiple services publishing to Firehose, which in turn delivers to S3. Note that if you want to use Firehose to deliver to Redshift or Elasticsearch, the data will be parked in S3 as an intermediary step anyway, so you might as well just deliver to S3 and only worry about other analytics engines later. Our Kinesis Firehose delivery streams actually use Kinesis Data Streams as their source, onto which data is published by a few services that capture Teamwork events coming from RabbitMQ. We could have opted for the RabbitMQ consumers to publish straight to Firehose, but we would have given up the opportunity to have Kinesis Analytics streams in place, which generate reports/stats that feed the lake. Our usage of Kinesis deserves a post of its own.
  3. AppFlow: this is a very recent addition to the AWS toolset; it allows scheduled imports from 3rd parties such as Marketo, Google Analytics or Datadog into S3. Very handy.
  4. Prophecy itself: by running Athena queries from Lambda, or by running ETL jobs, more data is generated and stored in S3.
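
To make the DMS output a little more concrete, here is a minimal sketch of reading one of those change files back out of S3. The bucket, key and column layout are hypothetical placeholders for illustration; the only point being made is that the operation flag (I/U/D) arrives as the first column of each row.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")

# Hypothetical location of a DMS change file; real buckets and keys will differ.
BUCKET = "prophecy-example-raw"
KEY = "dms/projects/20200713-120000123.csv"


def read_cdc_rows(bucket: str, key: str):
    """Yield (operation, record) tuples from a DMS change-data-capture CSV.

    The first column is the operation flag: I (insert), U (update), D (delete).
    The remaining columns follow the source table's layout.
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    for row in csv.reader(io.StringIO(body)):
        yield row[0], row[1:]


if __name__ == "__main__":
    for op, record in read_cdc_rows(BUCKET, KEY):
        if op == "D":
            print("deleted:", record)
```

In practice Athena reads these files directly, so a script like this is mostly useful for spot checks.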

Since storing every single bit of data in our DB would be overkill, we use a consumer service that listens to RabbitMQ events and publishes them on Kinesis Data Streams. Firehose streams, using these Data Streams as sources, pick up the events and deliver them to S3. This way we have control over what we really need for analytics purposes and we can do some manipulation before storing the data, reducing the need for ETL / cleanup jobs. A minimal sketch of the publishing side is shown below.
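
For illustration, the publishing side looks roughly like this: a function that takes an event (in reality handed to us by a RabbitMQ consumer) and puts it on a Kinesis Data Stream, from which Firehose delivers to S3. The stream name and event shape are made up for the example.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; the real streams are fed by our RabbitMQ consumers.
STREAM_NAME = "prophecy-events-example"


def publish_event(event: dict) -> None:
    """Publish a single event to a Kinesis Data Stream.

    Firehose, configured with this stream as its source, batches the records
    and delivers them to S3. The partition key only needs to spread load;
    using the event type here is an illustrative choice, not our actual one.
    """
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("type", "unknown"),
    )


if __name__ == "__main__":
    publish_event({"type": "task.created", "projectId": 123, "installationId": 42})
```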

Cataloging

Cataloging the data is important to automate the creation of metadata, which is later used by query engines to run analytics. This might seem like a secondary aspect but, on the contrary, it is a vital, crucial aspect of maintaining a big data pipeline.

On the security and permissions side of things you can use Lake Formation, which allows you to restrict users’ access and privileges.

The main tool for cataloging is AWS Glue, and Glue Crawlers in particular are the backbone of the automated cataloging of our data. In short, Crawlers scan your S3 buckets, detect structure in your data and infer data types. The output is a table definition for Athena, our query engine of choice. Once cataloged, the schema can be edited so that you can rename columns/fields/partitions or change data types.

Crawlers can be run on demand or on a schedule. I personally have a nightly run of the crawlers on the S3 buckets hosting our internal DBs’ data, because schema changes are infrequent. The crawler for Stripe data is purely on demand, as the Stripe API practically never changes (should Stripe notify developers of such a change, I would run the crawler again). A sketch of how a crawler can be created and scheduled follows.
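
As a sketch of what that setup can look like in code (the crawler name, IAM role, database and S3 path below are all invented placeholders, and you could just as easily do this through the console), a crawler is created with a cron schedule for the nightly run and started explicitly for the on-demand case:

```python
import boto3

glue = boto3.client("glue")

# All names, ARNs and paths below are hypothetical examples.
CRAWLER_NAME = "prophecy-internal-dbs-example"


def create_nightly_crawler() -> None:
    """Create a Glue Crawler that scans a raw-data prefix every night at midnight UTC."""
    glue.create_crawler(
        Name=CRAWLER_NAME,
        Role="arn:aws:iam::123456789012:role/example-glue-crawler-role",
        DatabaseName="prophecy_example",
        Targets={"S3Targets": [{"Path": "s3://prophecy-example-raw/dms/"}]},
        Schedule="cron(0 0 * * ? *)",  # every night at midnight UTC
    )


def run_crawler_now() -> None:
    """Kick off the crawler on demand, e.g. after a known schema change."""
    glue.start_crawler(Name=CRAWLER_NAME)
```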

Lastly, the most impressive feature of crawlers is the automatic detection of partitions. Since S3 uses prefixes (which represent folders but are not actually folders), Athena tables can be configured to be partitioned so that only the files with the relevant prefix will be queried, and not the entire bucket indiscriminately. The default way AWS tools partition data is to prefix your files with a YYYY/MM/DD/HH path, and Glue Crawlers will automatically detect that. If you have an “orders” table in Athena with the structure id bigint | sku string | quantity int | created timestamp, the fastest way to query orders placed on July 13th 2020 would be to do something like SELECT * FROM orders WHERE YEAR = '2020' AND MONTH = '07' AND DAY = '13' rather than doing a date comparison on the created field. It’s weird because YEAR, MONTH, DAY and HOUR are not fields of the orders table, they are partition fields, but this is a great part of the power of Athena*.

*thanks Aodh O’Mahony for revealing that fact to me.
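
For completeness, here is a minimal sketch of running that kind of partition-pruned query programmatically with boto3 (this is also roughly how “Prophecy itself” generates more data from Lambda). The database name and results bucket are illustrative placeholders.

```python
import time

import boto3

athena = boto3.client("athena")

# The query mirrors the partition-pruned example above; the database and
# output location are hypothetical.
QUERY = """
SELECT *
FROM orders
WHERE year = '2020' AND month = '07' AND day = '13'
"""


def run_query(sql: str) -> list:
    """Run an Athena query, wait for it to finish and return the raw result rows."""
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "prophecy_example"},
        ResultConfiguration={"OutputLocation": "s3://prophecy-example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Simple polling; fine for a sketch, use proper backoff in real code.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```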

In short, data arrives in S3 and Crawlers wake up every night to perform their routine scans, keeping the data lake up to date with changes in schemas. They are our vampiric minions: at midnight they awake, at dawn they sleep.

Glue can also run ETL jobs, but that’s such an extensive topic I will write a separate post about that.

Querying pt I — JOINing varied data

We query our data with Athena. As mentioned, we catered for scenarios in which we might use ingest nodes to load data into Elasticsearch from S3, or the COPY command to load it into Redshift for specific types of analysis, but other than that, Athena is how we query our data. Athena utilizes Presto under the hood and uses SQL for queries. There are a few diversions from standard SQL, especially for specific functions, but by and large, if you are versed in SQL you can use Athena right away. The real difference is in how you create/define tables. In Athena you need to specify the location of your data, namely the S3 bucket where your raw data resides, and you can provide parsing options to correctly interpret the underlying data, so that Athena knows whether to parse JSON or CSV for a particular table. It is highly recommended that you use Crawlers for your table creations, as they will specify a “classification” attribute for your table, something that is of vital importance if you want to run AWS Glue ETL jobs, for example to convert from one format to another, or to extract and manipulate some columns into a cleaner and smaller file.
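
To give a feel for what such a table definition involves (Crawlers generate the equivalent for you), here is a hedged, hand-written example of Athena DDL for a CSV-backed, partitioned table. The database, bucket and serde choices are assumptions for illustration, not our actual definitions.

```python
import boto3

athena = boto3.client("athena")

# Illustrative DDL only; in practice the Glue Crawler writes this for us.
CREATE_ORDERS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS orders (
  id bigint,
  sku string,
  quantity int,
  created timestamp
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://prophecy-example-raw/orders/'
TBLPROPERTIES ('classification' = 'csv')
"""

athena.start_query_execution(
    QueryString=CREATE_ORDERS_TABLE,
    QueryExecutionContext={"Database": "prophecy_example"},
    ResultConfiguration={"OutputLocation": "s3://prophecy-example-athena-results/"},
)
```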

Once the tables are defined you can join them as normal. This might not appear to be a big deal at this point, but it is. For example, Stripe data flows into Prophecy as 6–7 levels deep nested JSON, and global DB information as plain CSV files. Once the crawlers have finished their job, I can — in Athena — create a query that will join the Stripe data table with our global customers table, which under the hood means JSON data is being joined to CSV data, and that is pretty impressive.
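
To make that a little more tangible, here is a hedged example of what such a query can look like: a nested-JSON-backed table joined to a CSV-backed one in plain SQL. All table and column names are invented for the sketch; it would be run through Athena exactly like any other query.

```python
# Hypothetical join between a Stripe-like nested JSON table and a CSV-backed
# customers table; run it with start_query_execution as in the earlier sketch.
JOIN_STRIPE_TO_CUSTOMERS = """
SELECT c.id,
       c.name,
       s.plan.nickname        AS plan_name,      -- field of a nested JSON struct
       s.plan.amount / 100.0  AS monthly_price
FROM customers_current AS c                      -- backed by CSV from DMS
JOIN stripe_subscriptions AS s                   -- backed by nested JSON
  ON s.metadata.customer_id = CAST(c.id AS varchar)
"""
```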

Querying pt II — Views

If the above wasn’t impressive enough, with Athena you can create views which will produce subsets of data more suited to your use-cases. For example, I have many *_current views which represent the current state of a record, theoretically mirroring what is currently stored in our live DBs. If you think about it, every record transferred with DMS contains its history as well, so querying for a customer ID won’t return one record but as many records as there were changes to that customer record in the lake. However, more often than not I will want the current state of affairs rather than the entire history. So I create a view which includes a RANK() or a ROW_NUMBER() clause that will only return the most recently updated version of the record. These views are especially useful for visualization in Quicksight.
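
As a concrete, simplified illustration of one of those *_current views, assuming a DMS-fed table with an op flag and an updated_at column (both names invented for the example), the view keeps only the most recent version of each record and drops records whose latest change was a deletion:

```python
# Simplified sketch of a *_current view; table and column names are assumptions.
# Run it with start_query_execution as in the earlier sketch.
CREATE_CUSTOMERS_CURRENT_VIEW = """
CREATE OR REPLACE VIEW customers_current AS
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM customers AS c
) AS ranked
WHERE ranked.rn = 1
  AND ranked.op <> 'D'   -- exclude records whose last change was a deletion
"""
```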

Visualizing — Quicksight

Visualization is done with Quicksight. This is an excellent product by AWS which allows you to create graphs in seconds. If you understand your data and the insights you want to generate, then you can create graphs and charts quicker than it takes to describe them. Quicksight can do all sorts of things; its real strength is the good amount of ML/AI applied under the hood, which simplifies a lot of the work for you and gives you predictive analytics out of the box.

In the pipeline

I’m currently working on EMR / Zeppelin notebooks as the next step in terms of processing, and Sagemaker functions (callable from Athena) as the next step in the realm of analytics. When these are ready I will possibly post an overview of how we employed them.

This is just scratching the surface

Each of the components of Teamwork’s Big Data pipeline deserves a post of its own. Hopefully this post illustrates what technologies can be used to set up a Data Lake and Analytics pipeline that’s entirely AWS based. In the future I will dive more deeply into the individual technologies we currently use.
