I've been getting more and more into analytics and ETL tools at work and have spent some time getting my head around how AWS S3, Glue and Athena all integrate to provide a serverless ETL and analytics process. I think it's really cool and so wanted to write about it in the hope that it helps someone else get started.
Disclaimer: If you follow the steps precisely in this tutorial you'll be within the free-tier usage and so shouldn't incur costs. If you're unsure please look over the pricing for Glue and Athena and be careful how often you crawl and query data!
First things first, it would be good to be clear what AWS S3, Glue and Athena are!
| Service | Description |
| --- | --- |
| AWS S3 | AWS Simple Storage Service (S3) is an object storage service that provides features such as lifecycle management, storage tiers, versioning, etc. |
| AWS Glue | AWS Glue is a managed, serverless (pay-as-you-go) ETL (Extract, Transform, Load) tool that crawls data sources and enables us to transform data in preparation for analytics. |
| AWS Athena | AWS Athena is an interactive query service that analyses a data source and generates insights on it using standard SQL. |
Now to discuss what we're trying to achieve. I'm using an open dataset of New York taxi data (it's from the AWS Open Data registry). This data is available in S3 in an account that I don't own but that has been made public.
To access the data in S3 log into the AWS console and go to this location:
In here I'm going to be working with specific datasets, the first contains a list of zone locations and the second contains all the trip data for yellow, green and FHV (for-hire-vehicles) taxis:
- https://s3.amazonaws.com/nyc-tlc/trip data
Now, for the objective...
I want to see the top 3 boroughs for number of FHV taxi pick ups in 2015 with a breakdown per month of the year.
I'm going to achieve this in three steps:
1. Analyse the dataset and determine how best to use the data to achieve the objective.
2. Transform the data into a more structured format to improve query performance and reduce costs.
3. Query the data to generate our required insight.
Here's what you need to do to get this done.
Step 1: Review the Dataset
If you look at the data source in S3 you'll see that there are loads of different trip data files for each month of the year and in some cases back to 2009! All the data is in CSV format, and to understand the structure of the data I could download one of the smaller files and have a look at it. But where's the fun in that? Let's set up AWS Glue to crawl the dataset for us and we can have a look at the dataset that way.
Step 2: Security Set-Up
First things first... We need to create an IAM Role to give Glue the permissions it requires to crawl and transform the data.
Step 2.1: Go to IAM in the AWS Console and click on Roles.
Step 2.2: Click on Create Role.
Step 2.3: On the create role screen ensure AWS Service is selected and choose Glue from the list of services (see below):
Step 2.4: Hit Next: Permissions to continue.
Step 2.5: From the Filter Policies list enter "glue" and select the AWSGlueServiceRole policy.
Step 2.6: Next enter "s3" into the Filter Policies list and select the AmazonS3FullAccess policy. Please Note: This is pretty bad practice and not suitable for production scenarios. Ideally we would just use a policy that provides read access to the bucket or even folder/key required.
Step 2.7: Click the Next: Tags button.
Step 2.8: Add any tags you wish (or not) and click the Next: Review button.
Step 2.9: Give your role a name (I called mine "TaxiGlueServiceRole") and click the Create Role button.
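If you'd rather script this than click through the console, the same role can be created with boto3. Here's a minimal sketch, assuming the role name used above; the actual AWS calls are commented out so you can review the trust policy first:

```python
import json

# The trust policy that allows the Glue service to assume the role.
# This is what the console generates for you when you pick Glue as the service.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Uncomment to actually create the role and attach the Glue service policy:
# import boto3
# iam = boto3.client("iam")
# iam.create_role(
#     RoleName="TaxiGlueServiceRole",
#     AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
# )
# iam.attach_role_policy(
#     RoleName="TaxiGlueServiceRole",
#     PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
# )

print(json.dumps(glue_trust_policy, indent=2))
```

You'd still want to attach a tighter S3 policy than AmazonS3FullAccess, as noted above.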
Now we're all set to get cracking with Glue!
Step 3: Create the Glue Data Catalog
Now we're going to create a Glue Crawler, which we'll use to target the taxi dataset. The crawler will head off and scan the dataset for us and populate the Glue Data Catalog. In effect, this will create a database and tables in the Data Catalog that will show us the structure of the data. Let's get started.
Step 3.1: Head to AWS Glue in the AWS Management Console.
Step 3.2: When in the Glue dashboard click on Crawlers on the left menu.
Step 3.3: Now click Add Crawler.
Step 3.4: In the add crawler screen enter a name for your crawler (I called mine "TaxiCrawler"). Also, open the other options on this screen and have a look at what else we could set. We're not going to use any of these other options here, but it's still useful to know that they're there!
Step 3.5: Scroll to the bottom of the page and click Next.
Step 3.6: Now we're on the Add a Data Store page. In here choose S3 as your data store and under the Crawl data in entry select Specified path in another account. In here enter the following S3 path.
- s3://nyc-tlc/trip data/
Step 3.7: Next we need to enter the Exclude patterns to make sure Glue only crawls the CSV files we're interested in. Add the exclusions shown below and click the Next button.
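I haven't reproduced my exact patterns here, but Glue exclude patterns are glob expressions relative to the include path. Purely as an illustration (check the actual file names in the bucket before relying on these), patterns like the following would skip the yellow and green trip files so that only the FHV CSVs get crawled:

```
yellow*
green*
```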
Step 3.8: When prompted with the Add another data store question select Yes and click Next.
Step 3.9: Here we're going to add another data store for the taxi zone data. Perform the same steps as above, but using the following information, then click the Next button. It's also worth reviewing the source dataset to understand the exclude patterns I've created.
Step 3.10: When prompted to add another data store select No and click the Next button.
Step 3.11: Now we're on the Choose an IAM Role screen. Select Choose an existing IAM role and pick the TaxiGlueServiceRole role we created in the previous step, then click the Next button.
Step 3.12: When asked to select the schedule to run the crawler, select Run on demand and click the Next button.
Step 3.13: Now we should be on the Configure the crawler's output screen. Click Add database and enter a database name of your choice (I used "TaxiData"). There's no need to enter a location, but you can add a description if you wish. Finally, click the Create button.
Step 3.14: In the Prefix added to tables (optional) input enter a prefix of your choice (I used "raw_"). This is added to the start of all the table names the crawler creates in Glue.
Step 3.15: Leave the configuration options as they are, though it's worth reviewing them to understand what's available here. Then click the Next button.
Step 3.16: Review the settings you've added and click the Finish button.
Step 3.17: Back on the crawlers list, select the TaxiCrawler and click the Run Crawler button. This should take less than a minute and will crawl the CSV files you've not excluded in the taxi S3 bucket. Once it's completed, click on the Database -> Tables option in the left menu to review the tables.
Step 4: Verify the Raw Data
At this point we're going to have a look at the data that's been crawled by Glue. We'll view this data using AWS Athena.
Step 4.1: In Glue click on Database -> Tables to view the tables created by the crawler, which should be as follows:
Step 4.2: To view the data in Athena select one of the tables and click the Action button, then select View Data. This will open Athena, where you can click the Run Query button to view the data (see below).
Now at this point we can run some queries on the data to get more of an idea of how large the dataset is. I'd advise you don't do this yourself, to save money (Athena charges $5 per TB of data scanned, so these queries shouldn't cost much, but just in case). I've scanned the data and present the results below:
| Dataset | Measure | Value |
| --- | --- | --- |
| Trip Data | Record Count | 2,061,783,285 |
| Trip Data | Null Location Ids | 1,923,234,549 |
| Trip Data | Null Pickup Date | 149 |
| Zone Data | Record Count | 107,003,212 |
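Athena's pricing model makes this kind of cost easy to estimate up front. Here's a quick sketch, assuming the $5/TB on-demand price above, decimal (not binary) units, and Athena's 10 MB per-query minimum; the 250 GB scan size is purely illustrative:

```python
PRICE_PER_TB = 5.00  # Athena's on-demand price per TB of data scanned

def athena_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB):
    """Estimate the cost of a single Athena query from bytes scanned.

    Athena rounds each query up to a 10 MB minimum, so even tiny
    queries incur a small charge.
    """
    mb = max(bytes_scanned / 10**6, 10)  # apply the 10 MB minimum
    return (mb / 10**6) * price_per_tb   # MB -> TB, then multiply by price

# A full scan of an illustrative 250 GB of raw CSV:
print(athena_query_cost(250 * 10**9))  # → 1.25 (dollars)
```

This is why the transformation step below matters: scanning less data directly means paying less per query.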
Step 5: Transforming the Data
At this point we can see we're dealing with a reasonable number of records (small by most big data standards, but more than large enough for our example). But we can also see that almost 2 billion of the trip data records have a null locationid. We also know that the trip data contains a date field (pickup_date) which currently has a data type of string. To answer the question we set in the objective at the top of this page, we'll need to convert that value to a date so that we can generate insights around months of the year. So we're going to transform our data to Parquet, and as part of that transformation process we'll perform the following steps on the trip data:
- Remove the dispatching_base_num field - we don't need this.
- Convert pickup_date to a date type field - needed to query on date parts (e.g. month).
- Remove the records with a null locationid - these records don't add any value (we're joining the trip and zone data on the location id), so they'd just increase the size of the dataset and therefore the cost of queries in Athena.
Transforming our raw CSV data into Parquet also compresses the data (around 3 to 1 compression) so it makes our dataset even smaller and (again) keeps down the cost of queries in Athena.
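To make the three steps concrete, here's what they do to each record, sketched in plain Python (the real job runs as PySpark in Glue, which we set up below; the sample rows here are made up):

```python
import csv
import io
from datetime import datetime

# Made-up sample of raw trip-data rows, mirroring the CSV structure:
# dispatching_base_num, pickup_date, locationid
raw_csv = """dispatching_base_num,pickup_date,locationid
B00001,2015-01-04 10:00:00,140
B00002,2015-01-05 11:30:00,
B00003,2015-02-01 09:15:00,7
"""

def transform(rows):
    """Apply the three transformation steps described above."""
    for row in rows:
        if not row["locationid"]:  # drop records with a null locationid
            continue
        yield {
            # dispatching_base_num is simply not carried across
            "pickup_date": datetime.strptime(  # string -> date
                row["pickup_date"], "%Y-%m-%d %H:%M:%S"
            ).date(),
            "locationid": int(row["locationid"]),  # string -> number
        }

result = list(transform(csv.DictReader(io.StringIO(raw_csv))))
print(result)  # 2 rows survive; the null-locationid row is gone
```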
Before we can transform our data we need somewhere to put it! So let's create an S3 bucket to store the transformed data. Then we'll set up the transformation job in Glue.
Step 5.1: Create an S3 bucket in the same region you're using for Glue.
Step 5.2: In the bucket create the following folders:
| Folder | Purpose |
| --- | --- |
| data/trip-data | Holds the transformed trip data. |
| data/zones | Holds the transformed zone lookup data. |
| scripts | Stores the transformation scripts generated by Glue. |
| temp | Stores temporary intermediate results during the transformation process. It won't be used much in this situation, but it's worth pointing at your own bucket to keep control of your data. |
Step 5.3: In the Glue page, under the ETL menu option on the left click on Jobs.
Step 5.4: In the jobs list click on Add Job. We're going to set up separate jobs for the trip data and the zone lookup data.
Step 5.5: We'll start with the trip-data. In the job creation window add a name and select the IAM role we created earlier (as below).
Step 5.6: Then, further down the page, set the S3 path where the script is stored input to the scripts folder of the S3 bucket you created in Step 5.1.
Step 5.7: Next, in the Temporary directory input, enter the temp folder of the same bucket.
Step 5.8: Now click the Next button.
Step 5.9: Now on the Choose a data source screen select the taxidata/raw_trip_data table and click the Next button.
Step 5.10: On the Choose a data target screen, select S3 from the Data store dropdown. Set the format to Parquet and set the Target path to the data/trip-data folder of the S3 bucket you created in Step 5.1. Then click the Next button.
Step 5.11: On the next screen review the mappings and delete the dispatching_base_num field in the Target table (we're not using this). Then update the pickup_date to have a Data type of date. So your mappings should look as follows:
Step 5.12: Now click the Save job and edit script button.
Step 5.13: On this next page we can see the Python script generated for us by Glue. Clicking on the blocks on the left of the screen highlights the relevant area of code and is useful to understand what's going on. For example, if you click Transform Name ApplyMapping on the left it highlights the code that converts the pickup_date and sets the type of the locationid field to long.
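For reference, the mappings ApplyMapping works from are just a list of (source field, source type, target field, target type) tuples. For our trip data they'd look roughly like this; I'm assuming the source types here, so check them against your own generated script:

```python
# Mappings as passed to ApplyMapping.apply(frame=..., mappings=...).
# Note that dispatching_base_num is simply absent from the list,
# which is how a field gets dropped.
mappings = [
    ("pickup_date", "string", "pickup_date", "date"),
    ("locationid", "string", "locationid", "long"),
]

for src, src_type, dst, dst_type in mappings:
    print(f"{src}: {src_type} -> {dst_type}")
```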
Here we want to remove all the records that have a null/empty locationid. To do this we have a number of options but I'm going to use some Spark commands in PySpark.
In the generated script find the following line (should be around line 35):
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
Then replace the line with the following (you'll also need to add `from awsglue.dynamicframe import DynamicFrame` to the imports at the top of the script, as the generated code doesn't import DynamicFrame):
result = resolvechoice2.toDF().filter("locationid IS NOT NULL")
dropnullfields3 = DynamicFrame.fromDF(result, glueContext, "dropnullfields3")
The above code converts the DynamicFrame to an Apache Spark DataFrame, filters the records to only those where locationid is not null, then converts back to a DynamicFrame so the rest of the Glue script can continue as before.
Step 5.14: Now you can review the script and then click the Save button and the Run Job button (clicking it again in the pop-up).
Step 5.15: Finally, we need to repeat the steps for the zone lookup data (excluding Step 5.13 - we don't need to customise that script). Call the job something meaningful (I called mine TaxiData-ZoneData-CSV-To-Parquet) and remember to set the timeout as before. Point the scripts and temp settings at your S3 bucket again, choose the taxidata/raw_misc table as the source, set the target to the data/zones folder of your S3 bucket, and remember to set the format to Parquet. Once all of that is done, run the job.
Step 6: Crawl the Transformed Data
At this point if you have a look in the data/trip-data and data/zones folders in your S3 bucket you should see all the lovely parquet files. Now we need to add the newly transformed data into the Glue Data Catalog so that we can query it using Athena.
Please Note: I'm not going to go through all the steps to create a crawler for your transformed data again, as that's all covered above in the Create the Glue Data Catalog section. I'll just highlight the steps that differ this time.
Step 6.1: First we need a different name for the crawler (I've called mine "TaxiCrawler_Parquet").
Step 6.2: I've added a data store for the data/trip-data files and for the data/zones files in our S3 bucket (both without exclude patterns this time because we've isolated our dataset).
Step 6.3: When asked about an IAM role I've used the TaxiGlueServiceRole role again and I set it to run on demand.
Step 6.4: When asked to configure the crawler's output we could create a new database to store the transformed data. However, in this case I'm going to reuse our existing taxidata database and just change the prefix of the tables to "str_" to show that this is structured data (see below).
Step 6.5: Once you've created the crawler follow the steps outlined before to run it.
Step 7: Query the Data
Up until now we've been completing the ETL steps of this process. We've crawled the raw dataset and reviewed the data, then we've transformed the data into a format that's more efficient to query. Now we're going to query the data to generate our insight.
At this point we should have the following tables in our taxidata database in Glue.
Now we need to query the data in Athena!
Step 7.1: Open the Athena dashboard in the AWS Console and ensure the taxidata database is selected on the left (under database).
Step 7.2: To view the newly transformed data click the three dots to the right of the str_trip_data table and select Preview Table. This runs a query (limited to 10 records). You should see something similar to the following:
Step 7.3: To generate the insight as per our objective, we need to perform 2 steps. First we need to get a list of the top 3 boroughs for pickups. Then we use the list of top 3 boroughs for pickups to get a breakdown per month. So first, we need to use the following query to get the list of top 3 boroughs:
SELECT zones.borough, COUNT(trips.pickup_date) AS trips
FROM "taxidata"."str_zones" zones
LEFT JOIN "taxidata"."str_trip_data" trips ON zones.locationid = trips.locationid
GROUP BY zones.borough
ORDER BY trips DESC
LIMIT 3;
Step 7.4: Now we can save the above query as a view so we can easily use it in another query. To do this click the Create button and select Create view from query. Give the view a name (I've called mine "taxidata-top-boroughs").
Step 7.5: At this point we need to use the view to get the list of top 3 borough names and then get a list of locations within each borough. After that we can join on the trip data and group the results by month of the year. This is what I'm doing in the following query:
SELECT top.borough, MONTH(trips.pickup_date) AS months, COUNT(trips.pickup_date) AS trips
FROM "taxidata"."taxidata-top-boroughs" top
LEFT JOIN "taxidata"."str_zones" zones ON top.borough = zones.borough
LEFT JOIN "taxidata"."str_trip_data" trips ON zones.locationid = trips.locationid
GROUP BY top.borough, MONTH(trips.pickup_date)
ORDER BY top.borough, months ASC;
Step 7.6: If you click the History tab in the Athena window you can download a CSV of the results. Then you can reformat the data to make it more presentable (as follows):
Query performance in Athena is a topic deserving of its own post, and I'm sure that if I'd spent more time I could have come up with more efficient queries (the joins in the second query would be the first thing to tackle). But for now it's worth pointing out that, because of the transformation step we performed (converting the data to Parquet and removing records we wouldn't use), we see a big increase in the performance of a simple select * from [table] where locationid = 1 query:
| Measure | Raw Data | Structured Data | Decrease (%) |
| --- | --- | --- | --- |
| Run Time | 79 secs | 2.25 secs | 97.15% |
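The decrease figure is just the relative drop in run time, which you can verify from the two timings in the table above:

```python
raw_secs = 79.0         # query time against the raw CSV data
structured_secs = 2.25  # query time against the Parquet data

# Percentage decrease in run time after the transformation
decrease = (raw_secs - structured_secs) / raw_secs * 100
print(f"{decrease:.2f}%")  # → 97.15%
```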
So the lesson here is to always analyse, target and transform your data. Even just changing the format of the data helps massively!
Thanks to those of you who made it this far into what was a long post! I've really enjoyed digging into AWS Glue and Athena and I hope others have found some use for what I've written here.