Project - QuickBase Data Pipeline
Overview
This project is a consolidated Extract, Transform, Load (ETL) data pipeline that supports a program building maps for the aviation industry. It extracts data from a specific source on a schedule, performs extensive cleaning, feature engineering, and transformation, and then imports the result into a QuickBase environment. This ensures the data is ready and usable for operations, relieving users of manual data preparation. In addition, the pipeline is designed for customization - updates to the data or the associated logic are easy to make and can be deployed immediately.
Features
- Operates in a secure, isolated Amazon Web Services (AWS) account. All identity and access management is customized, and the account is pre-configured with all permissions, policies, and safeguards in place. If the account is transferred to a new owner, the pipeline can be operational immediately.
- Multi-step, modularized process with data integrity checks and stop-gap measures at every stage, ensuring that any issues with the data are addressed immediately - users are notified when the data schema changes or the source data is unavailable (see the schema-check sketch after this list).
- Processing runs only when the source data changes - a major enhancement over previous versions that delivers significant cost savings.
- Easy-to-understand logs that allow management and engineering to track processing and the status of each run.
- Serverless operation with AWS Step Functions and AWS Lambda for highly scalable, flexible operations.
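
As an illustration of the integrity checks and notifications described above, here is a minimal sketch of a schema check that halts a run and alerts users over Amazon SNS. The topic ARN, expected column names, and helper name are hypothetical; the real pipeline's checks and alert channels may differ.

```python
import json

import boto3

sns = boto3.client("sns")

# Hypothetical alert topic and expected schema -- adjust to the real pipeline.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
EXPECTED_COLUMNS = {"airport_id", "runway_id", "latitude", "longitude"}


def check_schema(columns, step_name):
    """Halt the run and notify users if the source schema has drifted."""
    missing = EXPECTED_COLUMNS - set(columns)
    unexpected = set(columns) - EXPECTED_COLUMNS
    if missing or unexpected:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject=f"Data pipeline halted at step: {step_name}",
            Message=json.dumps({"missing": sorted(missing),
                                "unexpected": sorted(unexpected)}),
        )
        # Raising fails the Lambda, so the Step Functions execution stops
        # instead of pushing bad data downstream.
        raise ValueError(f"Schema mismatch detected at step '{step_name}'")
```
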
Technical Overview
There are three stages to the data pipeline, orchestrated with AWS EventBridge, Step Functions, and Lambda.
In the first stage, an EventBridge Scheduler call triggers a Lambda function that downloads data from a set of source databases and places it into an S3 bucket with versioning enabled, so a copy of each dataset is retained.
In the second stage, we query the new dataset to determine whether anything has changed. If it has, a sequence of Lambda (Python) scripts coordinated by AWS Step Functions runs the cleaning and transformation modules. If nothing has changed, we log that fact and skip the cleaning and transformation modules, saving significant compute costs.
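
The sketch below illustrates this change-gated trigger: a Lambda handler, invoked by EventBridge Scheduler, that hashes the new extract, writes it to the versioned bucket, and starts the Step Functions execution only when the content differs from the previous version. The bucket, key, state machine ARN, and download_source helper are placeholders, and hashing is just one way to detect changes; the production pipeline's comparison logic may differ.

```python
import hashlib
import json
import os

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# Hypothetical names -- substitute the real bucket, key, and state machine ARN.
BUCKET = os.environ.get("RAW_BUCKET", "example-raw-data-bucket")
KEY = "source/latest_extract.csv"
STATE_MACHINE_ARN = os.environ.get(
    "STATE_MACHINE_ARN",
    "arn:aws:states:us-east-1:123456789012:stateMachine:example-cleaning-pipeline",
)


def lambda_handler(event, context):
    """Triggered by EventBridge Scheduler: store the new extract and start the
    cleaning/transformation state machine only if the content changed."""
    new_bytes = download_source()  # placeholder for the real database extract
    new_hash = hashlib.sha256(new_bytes).hexdigest()

    previous_hash = None
    try:
        head = s3.head_object(Bucket=BUCKET, Key=KEY)
        previous_hash = head.get("Metadata", {}).get("sha256")
    except ClientError:
        pass  # first run: no previous object exists yet

    if new_hash == previous_hash:
        print(json.dumps({"status": "skipped", "reason": "no change in source data"}))
        return {"changed": False}

    # The bucket has versioning enabled, so the prior copy is kept automatically.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=new_bytes,
                  Metadata={"sha256": new_hash})
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"bucket": BUCKET, "key": KEY}),
    )
    return {"changed": True}


def download_source() -> bytes:
    """Placeholder for the real extract from the source databases."""
    raise NotImplementedError
```
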
In the third and final stage, we format this dataset specifically for transfer via the QuickBase API. This breaks the data into its fundamental components, updates the data within QuickBase, and tracks which records are appended, updated, or deleted.
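
Below is a minimal sketch of the kind of upsert call this stage makes against the QuickBase Records API (POST /v1/records). The realm, table id, token, and field ids are placeholders; the response metadata (createdRecordIds, updatedRecordIds, unchangedRecordIds) is what allows appends to be distinguished from updates. Deletions go through a separate endpoint and are not shown here.

```python
import requests

# Hypothetical identifiers -- replace with the real realm, table, and token.
QB_REALM = "example.quickbase.com"
QB_TABLE_ID = "bxxxxxxxx"
QB_USER_TOKEN = "replace-with-user-token"

HEADERS = {
    "QB-Realm-Hostname": QB_REALM,
    "Authorization": f"QB-USER-TOKEN {QB_USER_TOKEN}",
    "Content-Type": "application/json",
}


def upsert_records(rows, key_field_id=3):
    """Upsert rows into QuickBase and report which records were created,
    updated, or left unchanged."""
    payload = {
        "to": QB_TABLE_ID,
        # Each row maps QuickBase field ids to values, e.g. {"6": {"value": "KJFK"}}.
        "data": rows,
        "mergeFieldId": key_field_id,  # record key deciding append vs. update
        "fieldsToReturn": [key_field_id],
    }
    resp = requests.post("https://api.quickbase.com/v1/records",
                         headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    meta = resp.json().get("metadata", {})
    return {
        "created": meta.get("createdRecordIds", []),
        "updated": meta.get("updatedRecordIds", []),
        "unchanged": meta.get("unchangedRecordIds", []),
    }
```
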
All of these processes are tracked with AWS CloudWatch, which provides detailed application and process monitoring. Customized logging, tailored for both engineering and management, distills the vast amount of information contained within CloudWatch.
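
One simple way to keep CloudWatch logs readable for both audiences is to emit one structured JSON line per pipeline step, which CloudWatch Logs Insights can then filter and aggregate. The sketch below is illustrative only; the field names and values are assumptions, not the pipeline's actual log schema.

```python
import json
import logging

# In Lambda the root logger already has a handler; basicConfig makes this
# sketch runnable locally as well.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def log_step(step: str, status: str, **details):
    """Emit one structured JSON line per pipeline step so CloudWatch Logs
    Insights can filter and aggregate on these fields."""
    logger.info(json.dumps({"step": step, "status": status, **details}))


# Example usage inside a cleaning Lambda (values are illustrative):
log_step("transform", "success", rows_in=12840, rows_out=12795, duration_s=42.3)
```
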
Cost Efficiency
Cost optimization is a core pillar of AWS's Well-Architected Framework, and our data pipeline adheres to its best practices. Initially, we tested a custom Virtual Machine (VM), Amazon Managed Workflows for Apache Airflow (MWAA), and EventBridge/Lambda. VMs proved difficult to maintain, especially during debugging and refinement. While MWAA worked well initially, it had two significant drawbacks. First, occasional memory bursts required adding more worker nodes, driving costs to nearly $2,000 per month for a single pipeline. Second, MWAA lacked a simple on/off switch (possible, but only with significant effort and custom containers), causing unnecessary costs during off hours. After substantial testing, we transitioned to EventBridge, Step Functions, and Lambda, reducing our monthly costs roughly 20-fold, from nearly $2,000 with MWAA to under $100 with Lambda.
Future Planning
The data pipeline we developed is fully complete, with no future enhancements planned. However, the core concepts and architecture used in this pipeline are being applied to other ongoing data pipeline projects across different domains. This allows us to leverage the lessons learned and optimize cost efficiency, scalability, and automation in new implementations.