A workflow is an orchestrated sequence of steps that make up a business process. Workflows help define, implement and automate these processes, improving efficiency and coordination among their components. An ETL workflow extracts data from several sources, processes it to extract value from it, and stores the results in a data warehouse, where they can later be read by others. ETL processes offer a competitive advantage to the companies that use them, since they make it easier to collect, store, analyze and exploit data in order to improve business intelligence.
Apache Airflow is an open-source tool to programmatically author, schedule and monitor workflows. Developed back in 2014 by Airbnb, and later released as open source, Airflow has become a very popular solution, with more than 16 000 stars on GitHub. It’s a scalable, flexible, extensible and elegant workflow orchestrator, where workflows are designed in Python and monitored, scheduled and managed from a web UI. Airflow integrates easily with data sources like HTTP APIs, databases (MySQL, SQLite, Postgres…) and more. If you want to learn more about this tool and everything you can accomplish with it, check out this great tutorial in Towards Data Science.
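To give an idea of how workflows are written, here is a minimal sketch of what a DAG file looks like (the DAG and task names are made up for illustration, and the import paths assume an Airflow 1.10-era installation like the one used in this post):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A DAG (Directed Acyclic Graph) groups tasks and defines when they run
dag = DAG(
    dag_id='hello_airflow',
    start_date=datetime(2020, 1, 1),
    schedule_interval='@daily',
)

# Two simple tasks: the second one runs only after the first succeeds
extract = BashOperator(task_id='extract', bash_command='echo "extracting data"', dag=dag)
load = BashOperator(task_id='load', bash_command='echo "loading data"', dag=dag)

extract >> load  # set the task order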
Despite being such a great tool, there are a few things about Airflow we don’t like that much:
• By default, Airflow uses a SQLite database as its backend, which offers poor performance.
• Sensitive data, such as credentials, is stored in the database as plain text, without encryption.
• XCom, the Airflow message queue, has an object size limit of 65KB. This can be a problem when working with large amounts of data.
In this post, we’ll learn how to easily create our own Airflow Docker image and use Docker Compose to automate the deployment of this tool together with a MySQL backend, in order to improve performance. We’ll also set up a cryptographic system to store credentials securely, and increase the XCom size limit to 4GB.
As a spoiler, if you’d rather go straight to the point without following this extensive tutorial, there’s a link to a GitHub repo at the end of the post with all the files you need.
Hands-on!
First of all, we’ll start by creating a Docker image for Airflow. We could use the official one on DockerHub, but by creating it ourselves we’ll learn how to install Airflow in any environment. Starting from the official Python 3.7 image (3.8 seems to produce some compatibility issues with Airflow), we’ll install this tool with the pip package manager and set it up. Our Dockerfile looks like this:
FROM python:3.7

RUN pip3 install 'apache-airflow'
RUN airflow initdb

CMD (airflow scheduler &) && airflow webserver
To create a container, we type the following two commands in the terminal: the first builds the image from the Dockerfile, and the second runs a container from that image, mapping port 8080 and mounting a volume to persist Airflow’s data:
docker build . -t airflow
docker run -it -p 8080:8080 -v :/root/airflow airflow
However, as we saw before, Airflow is here using a SQLite database as its backend, whose performance is considerably lower than that of a MySQL server, for instance. Again, using Docker, we can very easily set up a MySQL container with a database called “airflowdb” and a user with full permissions on that database:
docker run -d -p 3306:3306 -v :/var/lib/mysql --env-file mysql.env mysql:latest
This requires the corresponding mysql.env file, where the database name, user and password are defined (feel free to change them to whatever you want):
MYSQL_ROOT_PASSWORD=sOmErAnDoMsTuFF
MYSQL_DATABASE=airflowdb
MYSQL_USER=airflower
MYSQL_PASSWORD=eirfloub!*
At this point, it makes sense to use Docker Compose to orchestrate the deployment of these two containers. The following docker-compose.yml file will deploy both and interconnect them with a bridge network called airflow-backend. It also maps the needed volumes for both containers, exposes port 8080 for Airflow, and sets “mysql_native_password” as MySQL’s authentication plugin, since the default “caching_sha2_password” doesn’t get along well with Apache Airflow:
version: "3.7"
services:
  airflow-backend:
    image: mysql:latest
    command: --default-authentication-plugin=mysql_native_password
    networks:
      - airflow-backend
    volumes:
      - "airflow-backend:/var/lib/mysql"
    env_file:
      - mysql.env
  airflow-engine:
    build: ./airflow-engine
    depends_on:
      - airflow-backend
    init: true
    networks:
      - airflow-backend
    volumes:
      - "airflow-engine:/root/airflow/"
    ports:
      - "8080:8080"
    env_file:
      - mysql.env
networks:
  airflow-backend:
volumes:
  airflow-engine:
  airflow-backend:
Together with this file, we have to create a new folder, airflow-engine, containing the Dockerfile for Airflow. Still, there’s some work left to do… Wouldn’t it be nice if the connection to the MySQL container were set up automatically? Don’t worry, it’s very easy. Airflow has a file called airflow.cfg where it stores key-value configurations, including the URL of the backend. In this case, the MySQL container name is airflow-backend, and the complete URL of the database is mysql://airflower:eirfloub!*@airflow-backend/airflowdb (if you used the same names as here). These key-value files can be easily modified with the Python library configparser. The following Python script, mysqlconnect.py, placed in the airflow-engine folder, will do the job for us:
from os import environ as env
import configparser

# Open the Airflow config file
config = configparser.ConfigParser()
config.read('/root/airflow/airflow.cfg')

# Store the URL of the MySQL database
config['core']['sql_alchemy_conn'] = 'mysql://{user}:{password}@airflow-backend/{db}'.format(
    user=env.get('MYSQL_USER'),
    password=env.get('MYSQL_PASSWORD'),
    db=env.get('MYSQL_DATABASE'))
config['core']['executor'] = 'LocalExecutor'

with open('/root/airflow/airflow.cfg', 'w') as f:
    config.write(f)
With this change, the MySQL container becomes the backend, but the credentials are still stored in this database as plain text. To fix this, we’ll modify the airflow.cfg file again with another Python script, called fernet.py, which generates a random Fernet key that Airflow will use to encrypt all sensitive data:
from cryptography.fernet import Fernet
import configparser

# Generate a random Fernet key
fernet_key = Fernet.generate_key().decode()

# Store the key
config = configparser.ConfigParser()
config.read('/root/airflow/airflow.cfg')
config['core']['fernet_key'] = fernet_key

with open('/root/airflow/airflow.cfg', 'w') as f:
    config.write(f)
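As an aside, here’s a small illustration (not part of the setup) of what Airflow does with that key. Fernet provides symmetric encryption: anything encrypted with the key can only be read back by someone holding the same key. The plaintext below is just the sample password from mysql.env:

from cryptography.fernet import Fernet

# Generate a key and build a Fernet cipher with it
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a password; only someone holding the key can decrypt it
token = cipher.encrypt(b'eirfloub!*')
print(token)                  # unreadable ciphertext, safe to store in the DB
print(cipher.decrypt(token))  # b'eirfloub!*'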
The question now is: how do we execute these scripts? Bash to the rescue! We’ll write a Bash script that is executed every time the airflow-engine container runs. It will be called airflow.sh and saved in the airflow-engine folder. This Bash script checks whether it’s the first time the container is run; if so, it does the initial Airflow setup and calls the two Python scripts above. Then, it automatically runs the Airflow scheduler and webserver.
INIT_FILE=.airflowinitialized
if [ ! -f "$INIT_FILE" ]; then
    # Create all Airflow configuration files
    airflow initdb
    rm /root/airflow/airflow.db

    # Secure the storage of connections' passwords
    python fernet.py

    # Wait until the DB is ready
    apt update && apt install -y netcat
    while ! nc -z airflow-backend 3306; do
        sleep 1
    done
    apt remove -y netcat

    # Setup the DB
    python mysqlconnect.py
    airflow initdb

    # This configuration is done only the first time
    touch "$INIT_FILE"
fi

# Run the Airflow webserver and scheduler
airflow scheduler &
airflow webserver &
wait
Almost done! There’s still something we could improve… By default, XCom, the Airflow message queue system, has a limit of 65KB per object, which can be insufficient when working with large amounts of data. This happens because XCom stores its objects in the MySQL database, where the value of each object is modeled by a BLOB column. By changing the column type to LONGBLOB, the limit is increased to 4GB. This tweak can be automated inside the Bash script above by adding these lines right before the touch "$INIT_FILE" command:
# Allow XComs to store objects bigger than 65KB
apt update && apt install -y default-mysql-client
mysql --host=airflow-backend --user=root --password="$MYSQL_ROOT_PASSWORD" --database="$MYSQL_DATABASE" --execute="ALTER TABLE xcom MODIFY value LONGBLOB;"
apt remove -y default-mysql-client
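With the column changed to LONGBLOB, tasks can exchange much larger objects through XCom. As a rough sketch of how that looks in a DAG (the names here are made up and the import paths assume Airflow 1.10.x), a downstream task can pull whatever an upstream one returned:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id='xcom_example', start_date=datetime(2020, 1, 1), schedule_interval=None)

def extract(**context):
    # The returned value is pushed to XCom; after the tweak it can be up to 4GB
    return ['some', 'large', 'payload']

def load(**context):
    # Pull the value pushed by the "extract" task
    data = context['ti'].xcom_pull(task_ids='extract')
    print(len(data))

extract_task = PythonOperator(task_id='extract', python_callable=extract, provide_context=True, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load, provide_context=True, dag=dag)

extract_task >> load_task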
If you followed all the steps, you should have the following file structure:
airflow-engine:
    - airflow.sh
    - Dockerfile
    - fernet.py
    - mysqlconnect.py
docker-compose.yml
mysql.env
In order to run all the scripts above, we need to do one last thing. Some changes must be made in airflow-engine/Dockerfile:
# Airflow seems to crash with Python 3.8, it's important to use version 3.7 instead
FROM python:3.7

# Install and setup Airflow
RUN pip3 install 'apache-airflow[mysql,crypto]' mysql-connector-python

# Configure Airflow: connect to backend
WORKDIR /root/airflow/
COPY airflow.sh airflow.sh
RUN chmod +x airflow.sh
COPY fernet.py fernet.py
COPY mysqlconnect.py mysqlconnect.py

CMD ./airflow.sh
Now everything is ready for deploying your Apache Airflow on steroids! In the root folder, just type:
docker-compose build
docker-compose up
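Once both containers are up, the web UI should be reachable at http://localhost:8080. If you want to verify this from a script, one quick option (just a suggestion, assuming a recent 1.10.x release of Airflow, which exposes a /health endpoint on the webserver) is:

import requests

# Query the Airflow webserver's health endpoint on the mapped port
response = requests.get('http://localhost:8080/health')
print(response.status_code)  # 200 when the webserver is up
print(response.json())       # reports the status of the metadatabase and scheduler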
Now you’re ready to start using this powerful tool and orchestrating workflows like a boss. If you couldn’t follow all the steps in the tutorial, don’t worry, everything is uploaded in this GitHub repo.
Authors: Rafael P. Martínez Álvarez, technical manager of Big Data Analytics, & Guillermo Barreiro Fernández, intern at the Intelligent Systems Department