Terraform and AWS Glue Catalog Tables gotcha

In this short article I'd like to share a pitfall I've came across recently while working with AWS Glue tables automation using Terraform.

TL;DR?

Beware of spaces when defining a struct data type for AWS Glue Table inside Terraform scripts!

This one works:

columns {
  name = "reviews"
  type = "array<struct<name:string,rating:tinyint,created_at:timestamp,text:string>>"
}

This one does not (mind the space separating struct fields):

columns {
  name = "reviews"
  type = "array<struct<name:string, rating:tinyint, created_at:timestamp, text:string>>"
}

More details

Spaces inside struct data type causing issues seemed pretty weird to me so for the purpose of this article I decided to perform a little test - I'll create 4 tables containing same columns; the only difference will be in struct definition (space or no space between struct fields) and the way of creating the table (Hive DDL query or Terraform script). Combining this alltogether gives us 4 different combinations:

  • Hive DDL query with spaces (table books_hive_ddl_without_spaces)
  • Hive DDL query no spaces (table books_hive_ddl_with_spaces)
  • Terraform script with spaces (table books_tf_with_spaces)
  • Terraform script no spaces (table books_tf_without_spaces)

This way we'll find out whether this is only Terraform issue or maybe we should avoid them in Hive DDL queries as well.

Exemplary data

I created a simple JSON file and uploaded it to S3 bucket - it will serve as data for all tables we'll create. It contains some information about Lord of the Rings book (isbn, title, author, edition's published date). Since 2 reviews of book are included as well, we'll use array of struct to model this nested data structure in Glue Data Catalog Table.

JSON's content:

{
  "isbn": "9780544003415",
  "title": "The Lord of the Rings",
  "author": "J.R.R. Tolkien",
  "published": "2012-08-14",
  "reviews": [
    {
      "name": "John Doe",
      "rating": 5,
      "created_at": "2014-09-15 03:02:05.324",
      "text": "Masterpiece! I recommend this book to everyone!"
    },
    {
      "name": "Mark Smith",
      "rating": 4,
      "created_at": "2014-10-11 14:01:02.123",
      "text": "Great book!"
    }
  ]
}

Case 1: Hive DDL query without spaces

Let's create first table using Hive DDL command and make sure there are no spaces separating struct fields.

Hive DDL command:

CREATE EXTERNAL TABLE IF NOT EXISTS books_hive_ddl_without_spaces (
         isbn STRING,
         title STRING,
         author STRING,
         published DATE,
         reviews ARRAY<STRUCT<name:STRING,rating:TINYINT,created_at:TIMESTAMP,text:STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://athena-jszafran-test-data/books/'

Executing above in Athena's console results in Query successful. message.

Let's try to run simple query to make sure that table's row count is what we're expecting (1 row).

SELECT COUNT(*) row_cnt
FROM books_hive_ddl_without_spaces; 

We're getting rows count equal to 1. So far so good, no issues encountered.

Case 2: Hive DDL query with spaces

Time to add spaces to separate struct fields (for the reviews column).

Hive DDL command:

CREATE EXTERNAL TABLE IF NOT EXISTS books_hive_ddl_with_spaces (
         isbn STRING,
         title STRING,
         author STRING,
         published DATE,
         reviews ARRAY<STRUCT<name:STRING, rating:TINYINT, created_at:TIMESTAMP, text:STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 's3://athena-jszafran-test-data/books/'

Query execution succeeds and new table is created. Let's try to test it with similar query as previously:

SELECT COUNT(*) row_cnt
FROM books_hive_ddl_with_spaces; 

Row count == 1 and no errors - looks like spaces do not cause any issues to Athena's/Glue parser and everything works properly.

Case 3: Terraform script with spaces

It's Terraform's turn!

We'll create AWS Glue Catalog Table resource with below script (I'm assuming that example_db already exists and do not include its definition in the script):

resource "aws_glue_catalog_table" "books_tf_with_spaces" {
  database_name = "example_db"
  name          = "books_tf_with_spaces"
  description   = "Table for keeping books info & reviews data."

  table_type = "EXTERNAL"

  parameters = {
    EXTERNAL = "TRUE"
  }

  storage_descriptor {
    location      = "s3://athena-jszafran-test-data/books/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    columns {
      name = "isbn"
      type = "string"
    }

    columns {
      name = "title"
      type = "string"
    }

    columns {
      name = "author"
      type = "string"
    }

    columns {
      name = "published"
      type = "date"
    }

    columns {
      name = "reviews"
      type = "array<struct<name:string, rating:tinyint, created_at:timestamp, text:string>>"
    }
  }
}

terraform apply output:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_glue_catalog_table.books_tf_with_spaces will be created
  + resource "aws_glue_catalog_table" "books_tf_with_spaces" {
      + arn           = (known after apply)
      + catalog_id    = (known after apply)
      + database_name = "example_db"
      + description   = "Table for keeping books info & reviews data."
      + id            = (known after apply)
      + name          = "books_tf_with_spaces"
      + parameters    = {
          + "EXTERNAL" = "TRUE"
        }
      + table_type    = "EXTERNAL"

      + storage_descriptor {
          + input_format  = "org.apache.hadoop.mapred.TextInputFormat"
          + location      = "s3://athena-jszafran-test-data/books/"
          + output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

          + columns {
              + name = "isbn"
              + type = "string"
            }
          + columns {
              + name = "title"
              + type = "string"
            }
          + columns {
              + name = "author"
              + type = "string"
            }
          + columns {
              + name = "published"
              + type = "date"
            }
          + columns {
              + name = "reviews"
              + type = "array<struct<name:string, rating:tinyint, created_at:timestamp, text:string>>"
            }

          + ser_de_info {
              + parameters            = {
                  + "serialization.format" = "1"
                }
              + serialization_library = "org.openx.data.jsonserde.JsonSerDe"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_glue_catalog_table.books_tf_with_spaces: Creating...
aws_glue_catalog_table.books_tf_with_spaces: Creation complete after 1s [id=584229380385:example_db:books_tf_with_spaces]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Table was created successfully. Let's try to query it with:

SELECT COUNT(*) row_cnt
FROM books_tf_with_spaces; 

No luck this time, after 10 seconds Athena complains with below error:

Your query has the following error(s):

HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: name expected at the position 25 of 'array<struct<name:string, rating:tinyint, created_at:timestamp, text:string>>' but ' ' is found. (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

Taking a closer look at the error - it looks like spaces inside struct fields are treated as column names, hence the exception raised.

Interestingly, when checking table's definition in AWS Glue Data Catalog console, it seems that the column's fine and struct fields were parsed correctly:

organizational tree visualization

Just to make sure that spaces are the offenders, let's try out the last case.

Case 4: Terraform script without spaces

We'll create another table with Terraform - the only difference will be that we'll not add any spaces inside reviews column definition.

Terraform script for the resource:

resource "aws_glue_catalog_table" "books_tf_without_spaces" {
  database_name = "example_db"
  name          = "books_tf_without_spaces"
  description   = "Table for keeping books info & reviews data."

  table_type = "EXTERNAL"

  parameters = {
    EXTERNAL = "TRUE"
  }

  storage_descriptor {
    location      = "s3://athena-jszafran-test-data/books/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    ser_de_info {
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }

    columns {
      name = "isbn"
      type = "string"
    }

    columns {
      name = "title"
      type = "string"
    }

    columns {
      name = "author"
      type = "string"
    }

    columns {
      name = "published"
      type = "date"
    }

    columns {
      name = "reviews"
      type = "array<struct<name:string,rating:tinyint,created_at:timestamp,text:string>>"
    }
  }
}

terraform apply output:

  # aws_glue_catalog_table.books_tf_without_spaces will be created
  + resource "aws_glue_catalog_table" "books_tf_without_spaces" {
      + arn           = (known after apply)
      + catalog_id    = (known after apply)
      + database_name = "example_db"
      + description   = "Table for keeping books info & reviews data."
      + id            = (known after apply)
      + name          = "books_tf_without_spaces"
      + parameters    = {
          + "EXTERNAL" = "TRUE"
        }
      + table_type    = "EXTERNAL"

      + storage_descriptor {
          + input_format  = "org.apache.hadoop.mapred.TextInputFormat"
          + location      = "s3://athena-jszafran-test-data/books/"
          + output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

          + columns {
              + name = "isbn"
              + type = "string"
            }
          + columns {
              + name = "title"
              + type = "string"
            }
          + columns {
              + name = "author"
              + type = "string"
            }
          + columns {
              + name = "published"
              + type = "date"
            }
          + columns {
              + name = "reviews"
              + type = "array<struct<name:string,rating:tinyint,created_at:timestamp,text:string>>"
            }

          + ser_de_info {
              + parameters            = {
                  + "serialization.format" = "1"
                }
              + serialization_library = "org.openx.data.jsonserde.JsonSerDe"
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_glue_catalog_table.books_tf_without_spaces: Creating...
aws_glue_catalog_table.books_tf_without_spaces: Creation complete after 1s [id=584229380385:example_db:books_tf_without_spaces]

books_tf_without_spaces table was created. Let's try querying it.

SELECT COUNT(*) row_cnt
FROM books_tf_without_spaces; 

Succeess! We got expected results and no exceptions from Presto engine.

Summary

To sum up our little test:

  • Hive DDL query with spaces - OK
  • Hive DDL query no spaces - OK
  • Terraform script with spaces - NOT OK - spaces inside struct definition will be treated as columns names and Athena will raise HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException exception
  • Terraform script no spaces - OK

And that's it for today. Hope this will save you some time when working with Terraform and Glue/Athena.

Best Regards,
Kuba