Simple AWS Analytics architecture with Glue Catalog, Athena, S3 (automated with Terraform)

High-level overview

This post shows an example of a simple analytics architecture that allows querying JSON data stored in an S3 bucket using SQL and AWS Athena. The process of creating all the required resources will be automated with Terraform scripts.

Terraform

Before this article, Terraform was terra incognita to me - I'd usually "click" all the required resources in the console. Keeping infrastructure as code is far more maintainable, so I hope it will become my new habit.

If this is the first time you've heard about Terraform, then, citing their official website:

Terraform is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.

Architecture

Let's take a look at the architecture diagram (click for full resolution):

Architecture diagram

The solution consists of 3 distinct layers:

  • Metadata layer (AWS Glue Data Catalog with database & table definitions)
  • Storage layer (S3 buckets - one for storing actual data that business users will query; one for storing AWS Athena results)
  • Analytics layer (AWS Athena with Athena workgroup).

Repository structure

I'm going to keep the solution's code in a simple git repository with the following structure:

repo_root
  data
    |__ 1.json
    |__ 2.json
      ...
  src
    |__ __init__.py
    |__ generate_data.py
  terraform
    |__ athena.tf
    |__ glue.tf
    |__ main.tf
    |__ s3.tf
  .flake8
  .gitignore
  .pre-commit-config.yaml
  requirements.txt

Dotfiles

Let's quickly go over dotfiles I keep at the repository root level.

The # file /path/to/file comment at the top of each file below indicates its location in the repo.

  • .gitignore is generated with gitignore.io by combining presets for Pycharm+all, Python, venv, Terraform.
  • .flake8 is used only for aligning the max line length to 88 (which is the default value for the black formatting package)
# file .flake8
[flake8]
max-line-length=88
  • .pre-commit-config.yaml - I'm using pre-commit for setting up a pipeline triggered on every commit (the hooks are installed with pre-commit install). It includes isort for sorting imports, black for formatting code, flake8 for PEP-8 compliance, and checkov for AWS configuration security checks.
# file .pre-commit-config.yaml
repos:
  - repo: https://github.com/pycqa/isort
    rev: 5.5.2
    hooks:
      - id: isort
        files: './*'
      - id: isort
        files: 'tests/.*'
  - repo: https://github.com/ambv/black
    rev: stable
    hooks:
      - id: black
        language_version: python3.8
  - repo: https://gitlab.com/pycqa/flake8
    rev: 3.8.3
    hooks:
      - id: flake8
  - repo: https://github.com/bridgecrewio/checkov.git
    rev: 'master'
    hooks:
      - id: checkov
  • requirements.txt is generated by installing 2 external libraries: pre-commit (the pre-commit pipeline) and faker (used for generating fake data)
# file requirements.txt
appdirs==1.4.4
cfgv==3.2.0
distlib==0.3.1
Faker==8.1.0
filelock==3.0.12
identify==2.2.3
nodeenv==1.6.0
pre-commit==2.12.0
python-dateutil==2.8.1
PyYAML==5.4.1
six==1.15.0
text-unidecode==1.3
toml==0.10.2
virtualenv==20.4.3

src

This folder keeps Python code that generates fake data.

Fake data structure

I decided to make the fake data very simple, so don't expect anything fancy here. Each row is kept in a separate JSON file and has the following schema:

  • id (type: int)
  • date (type: string following ISO date format - YYYY-MM-DD)
  • product (type: string; possible values: "A", "B", "C")

Generating fake data

I wrote a simple Python script that generates n JSON files, where n is specified by the user as a command-line argument (e.g. python generate_data.py 100 run from the src directory). It saves the fake data to the data directory, where each JSON file is named after its id value (data/{id}.json).

Script's code:

# file src/generate_data.py
"""
Generate Data

Script allows the user to generate fake data that could be used
for populating Athena tables (JSON files living inside S3 bucket).
"""

import argparse
import datetime
import itertools
import json
import pathlib
import random
from typing import Dict

import faker

id_sequence = itertools.count(0)
prod_choices = "ABC"


def _generate_row_dict() -> Dict:
    """
    Generate a JSON row with random data.
    JSON schema:
    - id (int)
    - date (date in str format YYYY-MM-DD)
    - product (string)
    :return: Dict
    """
    return {
        "id": next(id_sequence),
        "date": faker.Faker()
        .date_between_dates(
            datetime.date(2020, 1, 1),
            datetime.date(2020, 12, 31),
        )
        .strftime("%Y-%m-%d"),
        "product": random.choice(prod_choices),
    }


def main(n: int, p: pathlib.Path) -> None:
    for _ in range(n):
        row = _generate_row_dict()
        row_path = p / f"{row['id']}.json"
        with open(row_path, "wt") as f:
            json.dump(row, f)


if __name__ == "__main__":
    data_path = pathlib.Path("../data")

    if not data_path.exists():
        data_path.mkdir()

    parser = argparse.ArgumentParser(description="Generate fake data for Athena.")
    parser.add_argument("n", type=int, help="Amount of rows that will be generated.")

    args = parser.parse_args()
    main(n=args.n, p=data_path)

2 things worth mentioning:

  • I needed to simulate a database sequence object to get consecutive numbers for the id field and decided to go with an itertools.count object. itertools is a very interesting module and I definitely need to get more familiar with it. If you'd like to simulate it with your own code, Python's generators are the way to go (that's essentially what itertools.count wraps, so there's no need to reinvent the wheel); see the short generator sketch after the date snippet below.
  • faker is used for generating a fake date - which might seem a bit excessive for picking a random date between January 1st and December 31st, 2020. When I started the article, I had a more complicated row structure in mind, hence the faker dependency. If you'd like to avoid adding an external dependency to the project, you can achieve the same result with just the standard library. A snippet for that:
"""
Generating random date from 2020 year using only built-in
datetime and random modules.
"""

import datetime
import random


date_values = [
    datetime.date(2020, 1, 1) + datetime.timedelta(days=n) for n in range(366)
]

# pick a random date
print(random.choice(date_values))
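
Going back to the first point above, here's a minimal sketch of a generator-based counter - roughly what itertools.count gives you out of the box:

"""
A generator-based counter that yields consecutive integers,
simulating a database sequence (what itertools.count already does).
"""


def counter(start: int = 0):
    n = start
    while True:
        yield n
        n += 1


id_sequence = counter()
print(next(id_sequence))  # 0
print(next(id_sequence))  # 1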

Terraform

It's time to take care of creating the actual AWS infrastructure that will power the whole solution.

Provider & region

Let's start by adding the AWS provider in Terraform and setting the default region to eu-north-1. Once all the files below are in place, running terraform init followed by terraform apply will provision everything.

# file terraform/main.tf
provider "aws" {
  region = "eu-north-1"
}

Glue Data Catalog

We'll need to create a database and a table inside the Glue Data Catalog.

Instead of clicking them by hand in the AWS console, we can use a Terraform script to spin up resources according to our specification. This script creates the example_db database containing the products table. products is an external table that points to the S3 location where our previously generated JSON files with product data live.

Note that inside storage_descriptor -> location we're referring to a resource (an S3 bucket) that comes from another Terraform file (terraform/s3.tf).

Besides serialization/deserialization info, storage_descriptor contains the definition of the products table columns (the columns {} blocks). A quick boto3 sanity check of the deployed catalog follows the snippet.

# file terraform/glue.tf
resource "aws_glue_catalog_database" "example_db" {
  name = "example_db"
}

resource "aws_glue_catalog_table" "products" {
  database_name = "example_db"
  name          = "products"
  description   = "Table for keeping data about products."

  table_type = "EXTERNAL"

  parameters = {
    EXTERNAL = "TRUE"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.input_data_bucket.bucket}/input"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "s3-stream"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    columns {
      name = "id"
      type = "bigint"
    }

    columns {
      name = "date"
      type = "date"
    }

    columns {
      name = "product"
      type = "string"
    }
  }
}
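
After running terraform apply, an optional way to verify the catalog (my own addition, assuming boto3 and locally configured AWS credentials) is to read the table definition back from the Glue Data Catalog:

"""
Optional check: read the products table definition back from the
Glue Data Catalog to confirm its location and columns.
"""

import boto3

glue = boto3.client("glue", region_name="eu-north-1")

table = glue.get_table(DatabaseName="example_db", Name="products")["Table"]

# Storage location and column names defined in terraform/glue.tf.
print(table["StorageDescriptor"]["Location"])
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])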

S3

We'll need 2 buckets: jsz-tf-input-data-bucket for keeping the JSON data that will feed the products table, and jsz-tf-athena-results-bucket for storing the results generated by running SQL queries through Athena. A very basic configuration for these resources could look like this (a short sketch for uploading the generated JSON files follows the snippet):

# file terraform/s3.tf
resource "aws_s3_bucket" "input_data_bucket" {
  bucket = "jsz-tf-input-data-bucket"
  acl    = "private"

  tags = {
    Project = "AWS Athena Glue Terraform example"
  }

}

resource "aws_s3_bucket" "athena_results_bucket" {
  bucket = "jsz-tf-athena-results-bucket"
  acl    = "private"

  tags = {
    Project = "AWS Athena Glue Terraform example"
  }
}
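
The post doesn't automate getting the generated JSON files into the bucket, so here's a minimal sketch (my own addition, assuming boto3 and configured AWS credentials) of uploading them. The files have to land under the input/ prefix to match the Glue table's location:

"""
Upload the locally generated JSON files to the input/ prefix of the
data bucket so they match the Glue table's storage location.
"""

import pathlib

import boto3

s3 = boto3.client("s3")
bucket = "jsz-tf-input-data-bucket"

for json_file in pathlib.Path("data").glob("*.json"):
    s3.upload_file(str(json_file), bucket, f"input/{json_file.name}")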

Athena

The last part of our solution is Athena - we'll need to create an Athena workgroup that defines the S3 bucket for SQL query results. A short example of querying through the workgroup follows the snippet.

# file terraform/athena.tf
resource "aws_athena_workgroup" "athena_example_workgroup" {
  name = "athena_example_workgroup"

  configuration {
    enforce_workgroup_configuration    = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.athena_results_bucket.bucket}/output/"
    }
  }
}
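
With the data uploaded, the table can be queried from the Athena console or programmatically. Below is a minimal sketch (my own addition, assuming boto3 and configured AWS credentials) of running a query through the workgroup; because the workgroup enforces its configuration, results land in jsz-tf-athena-results-bucket:

"""
Run an example SQL query against the products table through the
athena_example_workgroup workgroup and print the results.
"""

import time

import boto3

athena = boto3.client("athena", region_name="eu-north-1")

execution = athena.start_query_execution(
    QueryString="SELECT product, COUNT(*) AS cnt FROM products GROUP BY product",
    QueryExecutionContext={"Database": "example_db"},
    WorkGroup="athena_example_workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    # The first row contains column headers, the rest are data rows.
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])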

Closing remarks

  1. Terraform is awesome! It's so much better than manually clicking stuff in the console.
  2. To make this solution fully operational, I plan to add a Terraform script for creating a read-only IAM role that could be assumed by any business user who wants to query the data.
  3. S3 security could be improved. I purposely ignored some of the checkov warnings since this isn't a real production solution. When doing production work, go carefully through all the warnings and either remediate them or make sure you fully understand the consequences of ignoring them.
  4. Resource naming could follow some convention and input variables could be used for dynamic resource name construction.
  5. You can check the code yourself here: github link.
  6. I used diagrams.net tool for creating architecture diagram.

Thanks for reading and see you in the next post,
Kuba