The objective of this project is to automatically retrieve data from an open-source database every day for subsequent use in data visualization and analysis. This repository covers the data engineering architecture and setup, using Apache Airflow to orchestrate a batch data processing pipeline. All data from each run is stored in Google Cloud Storage, which serves as a data lake, and is subsequently loaded into BigQuery, which serves as a data warehouse for data visualization.
- Data Extraction: Download a batch of .xml.gz files and extract vehicle speed, traffic volume, and vehicle occupancy data from the .xml files contained within (a rough extraction sketch follows this list).
- Data Cleaning: Apply a set of rules to filter out unqualified data points and impute them using a moving average. A bad-data ratio is also calculated to assess data quality; batches that fail to meet the threshold undergo further review (see the cleaning/transformation sketch below).
- Data Transformation: Aggregate the data from a 1-minute interval to a 5-minute interval, since aggregating over a longer interval reduces data fluctuations.
- Data Loading: Load the data into Google Cloud Storage, branching it into two directories: one for transformed and cleaned data, the other for unqualified data (see the loading sketch below).
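As a rough illustration of the extraction step, the sketch below downloads one .xml.gz file and pulls speed, volume, and occupancy values out of it. The URL, tag names, and attribute names are placeholders, not the actual schema of the open-source database.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Placeholder URL; the real open-source database endpoint differs.
SOURCE_URL = "https://example.com/traffic/VDLive_0000.xml.gz"

def extract_records(url: str = SOURCE_URL) -> list[dict]:
    """Download one .xml.gz file and pull speed/volume/occupancy per detector."""
    with urllib.request.urlopen(url) as resp:
        xml_bytes = gzip.decompress(resp.read())

    root = ET.fromstring(xml_bytes)
    records = []
    # Tag and attribute names below are assumptions, not the actual schema.
    for vd in root.iter("VDLive"):
        records.append(
            {
                "vd_id": vd.findtext("VDID"),
                "speed": float(vd.findtext("Speed", default="nan")),
                "volume": int(vd.findtext("Volume", default="0")),
                "occupancy": float(vd.findtext("Occupancy", default="nan")),
            }
        )
    return records
```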
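The cleaning and transformation steps might look roughly like the pandas sketch below: filter out values outside a plausible range, impute them with a moving average, compute the bad-data ratio, and resample from 1-minute to 5-minute intervals. Column names, the filtering rule, and the threshold are illustrative only, not the project's actual values.

```python
import pandas as pd

BAD_RATIO_THRESHOLD = 0.2  # illustrative threshold, not the project's actual value

def clean_and_transform(df: pd.DataFrame) -> tuple[pd.DataFrame, float]:
    """Clean 1-minute data and aggregate it to 5-minute intervals.

    Assumes df has a DatetimeIndex at 1-minute frequency and columns
    ['speed', 'volume', 'occupancy'] (column names are assumptions).
    """
    # Example filtering rule: speeds outside 0-200 km/h count as bad data.
    bad_mask = (df["speed"] < 0) | (df["speed"] > 200)
    bad_ratio = float(bad_mask.mean())

    cleaned = df.copy()
    # Drop the bad points, then impute them with a centered moving average.
    cleaned["speed"] = cleaned["speed"].mask(bad_mask)
    cleaned["speed"] = cleaned["speed"].fillna(
        cleaned["speed"].rolling(window=5, center=True, min_periods=1).mean()
    )

    # Aggregate from 1-minute to 5-minute intervals to smooth fluctuations.
    resampled = cleaned.resample("5min").agg(
        {"speed": "mean", "volume": "sum", "occupancy": "mean"}
    )
    return resampled, bad_ratio
```

Whether `bad_ratio` exceeds the threshold would then decide which Google Cloud Storage directory the batch lands in, as in the loading sketch below.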
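For the loading step, a minimal sketch with the `google-cloud-storage` client is shown below; the bucket name and the `cleaned`/`unqualified` directory names are placeholders.

```python
import os

from google.cloud import storage

def load_to_gcs(local_path: str, passed_quality_check: bool,
                bucket_name: str = "my-traffic-bucket") -> str:
    """Upload a local file into one of two directories in the GCS data lake."""
    # 'cleaned' vs 'unqualified' are illustrative directory names.
    prefix = "cleaned" if passed_quality_check else "unqualified"
    blob_name = f"{prefix}/{os.path.basename(local_path)}"

    client = storage.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"
```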
To summarize, here is a snapshot from the Airflow web UI showing how these jobs are arranged.

- `/dags`: Contains the following Python scripts to implement the batch data ETL workflow.
  - `/tasks`: All of the task functions are included in this folder.
    - `variables.py`: Stores global variables imported by the other scripts.
    - `extract.py`: Functions to extract target data from the open-source database.
    - `clean_transform.py`: Functions to clean and transform data for loading into a Google Cloud Storage bucket.
    - `read_load.py`: Functions to load data to Google Cloud Storage and BigQuery.
    - `task_functions.py`: Functions used in the Airflow DAG.
  - `batch_data_etl.py`: Airflow DAG setup for orchestrating the ETL process.
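As a rough sketch of how `batch_data_etl.py` might wire these steps together (task IDs and callables below are placeholders standing in for the functions in `/dags/tasks`, written in Airflow 2.x style):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for functions from task_functions.py.
def extract(**context): ...
def clean_transform(**context): ...
def load(**context): ...

with DAG(
    dag_id="batch_data_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # retrieve data from the open-source database daily
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    clean_transform_task = PythonOperator(task_id="clean_transform",
                                          python_callable=clean_transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> clean_transform_task >> load_task
```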
To run the project, follow these steps:
- Build the Docker image using the provided `dockerfile`, which copies the project files into the appropriate directory within the container: `docker-compose build`
- Initialize the database: `docker-compose up airflow-init`
- Run Airflow: `docker-compose up -d`
For more details on how to quickly deploy Airflow with Docker, please refer to https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html.
- Have Docker installed on your system.
- Create these folders: `logs`, `plugins`, `config`, `data`, `google/credentials`.
- Have all the required packages listed in `requirements.txt`.
- Have your own Google Cloud Platform account, and create a new project and storage bucket.
- Get credentials to store or access files in Google Cloud Storage, and save `credentials.json` under `./google/credentials`.
- Set the necessary Airflow Variables in `.env` (see the example after this list).
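For reference, Airflow exposes environment variables of the form `AIRFLOW_VAR_<NAME>` as Variables, which the DAG code can then read. The variable names below are examples only, not the ones this project actually defines in `.env`.

```python
from airflow.models import Variable

# Example only: the actual Variable names come from this project's .env,
# where entries such as AIRFLOW_VAR_<NAME> are surfaced as Airflow Variables.
gcs_bucket = Variable.get("gcs_bucket_name", default_var="my-traffic-bucket")
bq_dataset = Variable.get("bq_dataset", default_var="traffic_data")
```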
- Python
- Pandas
- Google Cloud Storage
- Google Cloud BigQuery
