The “Go Serverless” Data Engineering Revolution at Teamwork (golang + AWS = ❤)

Joe Minichino
Teamwork Engine Room
6 min read · Oct 18, 2021


A real data lake.

Traditional Data Engineering relies on products such as Airflow, Hadoop, Spark and Spark-based architectures, or similar technologies.

These are still viable solutions for a number of reasons, not least the fact that Data Engineers are few and far between, and the vast majority of them will be familiar with the above technologies or similar products/frameworks.

Go Serverless

I wrote an article about our tech stack, which includes S3, Athena, Glue, Quicksight, Kinesis etc.

What is not immediately apparent is the serverless nature of our stack, which was a deliberate choice made in the context of a Data Analytics department that started as an experiment and had to pick its battles very wisely. Sysops, cluster/server management and CI/CD were not at the top of our list. Creating dashboards was.

Also, before Teamwork I had a nearly 2-year run with a company that was entirely serverless in both setup AND mentality, and it was a career-changing experience. We worked predominantly with AWS Lambda when Lambda was a new toy, and we loved it.

Finally, even though the “buzz” around Functional Programming seems to have died down, I am an ardent fanatic of it and of what it represents philosophically: a way to express each problem in terms of an input, some transformation, an output and, where possible, no side effects.

NOTE: some of the tools I mention are not strictly serverless so much as fully managed (e.g. Kinesis or GitHub Actions), but they still involve little or no sysops/devops.

Serverless Processing: AWS Lambda

Once upon a time you would have to settle for Node.js or Python to write Lambda code, but nowadays not only can you choose from a whole lot of runtimes, you can simply supply your own Docker image et voilà, you’re ready to go. In our case we are ready to Go, since at Teamwork Go(lang) is our programming language of choice. It’s fast, easy to use, very safe and reliable, and given that it can complete the same operation faster than most other programming languages it also translates into a cost saving, since Lambda pricing is a function of the duration of each invocation.
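To give a flavour of what this looks like, here is a minimal sketch of a Go Lambda handler built on the aws-lambda-go library. The Event and Result shapes are invented for illustration; what matters is the input → transformation → output structure, which is exactly the functional style mentioned above.

```go
package main

import (
	"context"
	"strings"

	"github.com/aws/aws-lambda-go/lambda"
)

// Event and Result are illustrative shapes; a real pipeline would use
// the event types that match the configured trigger.
type Event struct {
	Payload string `json:"payload"`
}

type Result struct {
	Transformed string `json:"transformed"`
}

// handler is a pure input -> transformation -> output function,
// which is what makes Lambda a good fit for a functional mindset.
func handler(ctx context.Context, e Event) (Result, error) {
	return Result{Transformed: strings.ToUpper(e.Payload)}, nil
}

func main() {
	lambda.Start(handler)
}
```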

Serverless Deployment: Github Actions + Terraform

Terraform keeps us aligned with the internal practice for infrastructure maintenance, and GitHub Actions lets us perform a Terraform apply on PR merge, so we have effectively cut out the need for older setups such as Travis or Jenkins. It does not cut out the reliance on a third party (GitHub), but we’re OK with the eggs being in that basket: if GitHub goes down we can’t even push our code, so there would be no releases anyway. Note that we build Docker images and store them in ECR (container-based Lambdas have to live there), but we are happier with this than with having our images on Docker Hub since, again, without AWS our system is down, so we’d rather reduce the number of service providers we depend on.

Serverless Events: AWS EventBridge

One particularly elegant aspect of building serverless pipelines in AWS is EventBridge, which allows several kinds of triggers to invoke pipelines, making them event driven. Your Lambdas can be invoked by SNS topics, S3 events and so on. A frequent pattern at Teamwork is to set the s3:putObject event as a trigger, so that any time an object is added to a particular bucket, a Lambda is invoked to process that object.
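As a rough sketch (the actual processing is elided), an S3-triggered handler in Go, using the event types shipped with aws-lambda-go, looks something like this:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// handler receives the S3 notification produced when an object is
// written to the bucket the trigger is configured on.
func handler(ctx context.Context, evt events.S3Event) error {
	for _, rec := range evt.Records {
		bucket := rec.S3.Bucket.Name
		key := rec.S3.Object.Key
		log.Printf("processing s3://%s/%s (event: %s)", bucket, key, rec.EventName)
		// fetch and process the object here
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```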

Serverless Analytics: Athena

Athena is a great analytics engine whose pricing model is based on the amount of data scanned in S3 and, if you use federated queries, the underlying cost of the Lambda invocations. Partitions are of vital importance when querying large datasets: structuring your queries so that only subsets of your data are scanned, and using compressed, optimized columnar formats (such as Parquet), translates directly into lower query costs.
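As a hedged illustration of partition pruning in practice, the snippet below starts an Athena query through the AWS SDK for Go and filters on the partition columns, so only a single day of Parquet files gets scanned. The database, table and result bucket names are placeholders, not our actual setup.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/athena"
)

func main() {
	svc := athena.New(session.Must(session.NewSession()))

	// Filtering on the partition columns (year/month/day here) means
	// Athena only scans the matching Parquet files, not the whole table.
	query := `
		SELECT integration, count(*) AS requests
		FROM integration_events
		WHERE year = '2021' AND month = '10' AND day = '18'
		GROUP BY integration`

	out, err := svc.StartQueryExecution(&athena.StartQueryExecutionInput{
		QueryString:           aws.String(query),
		QueryExecutionContext: &athena.QueryExecutionContext{Database: aws.String("analytics")},
		ResultConfiguration: &athena.ResultConfiguration{
			OutputLocation: aws.String("s3://example-athena-results/"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("started query execution %s", aws.StringValue(out.QueryExecutionId))
}
```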

Fully Managed Components: Kinesis and Quicksight

Kinesis plays a vital role in that it carries the streams of data traffic that are destined for processing and storage. Kinesis Firehose in particular does a great job of automatically partitioning your data by time units, down to the hour, into your desired destination. We mostly store in S3, but you can also target Redshift, Elasticsearch and Splunk.
Quicksight is an extremely powerful visualization tool. It probably lags a bit behind competitors such as PowerBI or Tableau in terms of UX, but its seamless integration with AWS makes otherwise involved features, such as augmenting your data with ML predictions from SageMaker, an incredibly trivial task. You can also expose your dashboards to the internet if you are on the correct pricing plan; we use this feature to democratize our data internally by granting access to all our dashboards to all our employees.
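On the producer side, pushing a record onto a Firehose delivery stream from Go is a small call against the SDK. This is only a sketch (the stream name and payload are invented), but it shows the shape of the call; Firehose then buffers and delivers the records to the configured destination.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/firehose"
)

func main() {
	svc := firehose.New(session.Must(session.NewSession()))

	payload, _ := json.Marshal(map[string]string{"event": "integration_request"})

	// Firehose buffers records and delivers them to the configured S3
	// destination, partitioned into time-based prefixes.
	_, err := svc.PutRecord(&firehose.PutRecordInput{
		DeliveryStreamName: aws.String("example-integration-events"),
		Record:             &firehose.Record{Data: append(payload, '\n')},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```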

Example Use Case and Architecture

The problem at hand was exploring the usage of Teamwork integrations, with a particular focus on the internally maintained ones to prioritize feature development.

The raw data source is HAProxy logs in common log format. These logs are funnelled into CloudWatch, and that was our starting point. We designed the architecture from there; here is the diagram:

Integrations Analytics architecture.

Step 1: Subscription Filter (Production Account)

The way we extract logs from CloudWatch is through a little-known feature called Subscription Filters. You can have two filters per log group, and all they do is take log entries (you can specify one of several available formats), apply a filter that you write (a regex/parsing kind of thing) and extract the matching entries.

Step 2: Kinesis Firehose (Production)

Subscription Filters can be configured to publish their data to a Kinesis stream or a Firehose stream; the latter is the option we chose to “park” the gzipped entries in an S3 bucket.
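We manage this wiring through Terraform, but as a hedged illustration of how Steps 1 and 2 fit together, here is roughly what creating the subscription filter looks like through the CloudWatch Logs API in Go. The log group name, filter pattern and ARNs are all placeholders.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

func main() {
	svc := cloudwatchlogs.New(session.Must(session.NewSession()))

	// The filter pattern decides which log entries get forwarded; the
	// destination here is a Firehose delivery stream ARN (see Step 2).
	_, err := svc.PutSubscriptionFilter(&cloudwatchlogs.PutSubscriptionFilterInput{
		LogGroupName:   aws.String("/haproxy/access-logs"),
		FilterName:     aws.String("integration-requests"),
		FilterPattern:  aws.String("\"/integrations/\""),
		DestinationArn: aws.String("arn:aws:firehose:eu-west-1:123456789012:deliverystream/example-stream"),
		RoleArn:        aws.String("arn:aws:iam::123456789012:role/cwl-to-firehose"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("subscription filter configured")
}
```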

Step 3: S3 replication (Production -> DA)

We replicate data out of our production account into our Data Analytics (DA) account. This is good for a multitude of reasons, but mainly:
1. Don’t do Analytics in Production
2. Avoid having to set up a lot of policies allowing cross-account interaction between components, which is annoying.

Step 4: Processing (DA)

Processing happens in a Lambda, which is triggered by the s3:putObject event in EventBridge. Whenever a new object lands in the bucket (in this case by means of replication), the Lambda picks it up and processes it.
This step takes the entries that have been transformed into gzipped JSON by Firehose and, for each entry, looks up details of the request so that we can augment the record with account and integration information. We use a Federated Query in Athena to join several data sources, and we store the augmented records in Prophecy, our data lake.
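As a sketch of the decode step only (the Athena lookup and the write to Prophecy are elided, and the LogEntry fields and the exact on-disk layout are assumptions that depend on the Firehose configuration), the Lambda reads the gzipped, newline-delimited JSON roughly like this:

```go
// Package processor sketches the decode step of the processing Lambda.
package processor

import (
	"bufio"
	"compress/gzip"
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// LogEntry is an assumed shape for the parsed HAProxy entries.
type LogEntry struct {
	Path      string `json:"path"`
	Status    int    `json:"status"`
	Timestamp string `json:"timestamp"`
}

// processObject would be called from the S3-triggered handler with the
// bucket and key of the newly replicated object.
func processObject(ctx context.Context, bucket, key string) error {
	svc := s3.New(session.Must(session.NewSession()))

	obj, err := svc.GetObjectWithContext(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return err
	}
	defer obj.Body.Close()

	// Firehose delivered gzipped, newline-delimited JSON.
	gz, err := gzip.NewReader(obj.Body)
	if err != nil {
		return err
	}
	defer gz.Close()

	scanner := bufio.NewScanner(gz)
	for scanner.Scan() {
		var entry LogEntry
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			log.Printf("skipping malformed entry: %v", err)
			continue
		}
		// augment the entry with account / integration details via an
		// Athena federated query and write the result to the data lake
	}
	return scanner.Err()
}
```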

Step 5: Cataloging

Every hour, an AWS Glue Crawler scans the data lake bucket for new data and updates the schemas in our AwsDataCatalog, the metadata catalog for Prophecy, detecting any new partitions along the way.
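Our crawler runs on a schedule, but the same crawler can also be kicked off on demand through the Glue API. A minimal sketch in Go (the crawler name is a placeholder):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/glue"
)

func main() {
	svc := glue.New(session.Must(session.NewSession()))

	// Kick off the crawler that keeps the AwsDataCatalog schemas and
	// partitions in sync with the data lake bucket.
	_, err := svc.StartCrawler(&glue.StartCrawlerInput{
		Name: aws.String("example-prophecy-crawler"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("crawler started")
}
```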

Step 6: Analytics (DA)

At this point the data is ready to be analyzed. Our normal process is to write SQL queries and create an Athena view for them, so we wrote a single SQL view that retrieves all the integration usage data we need to visualize.

Step 7: Visualization

Finally, we create a dataset in Quicksight based on the aforementioned view and display the integration usage, delegating to Quicksight details such as computing distinct accounts/users and the number of requests per integration.

Conclusion

Data Engineering is evolving just as fast as other fields in the Data realm (Data Analytics and Data Science, for example), and while traditional methods won’t go away any time soon, Serverless Data Engineering means data can be collected, transformed, manipulated and made available to data analysts and data scientists while cutting overhead costs and increasing the elegance and stability of your pipelines.
