Configuration Guide
This guide describes what can be configured and collects a few useful commands to keep in mind.
What can be configured
The following options can be configured:
- Quick Options (see Quick Start and Selecting base configuration):
  - Whether to install Conda
  - Whether to use an existing PostgreSQL server or install a new container (an existing PostgreSQL server requires custom configuration)
- Custom configuration:
  - How Airflow connects to PostgreSQL
  - What prerequisites and requirements are installed into the runtime workflow execution environment
  - What user projects are installed into the runtime workflow execution environment
  - The username and password used by the Airflow administrator
Selecting base configuration
You may or may not need Conda for your workflows. Your host system might also run other applications that use PostgreSQL and thus already have PostgreSQL running directly on the host or in an existing Docker container. The combination of these options yields four possible base configurations.
Configurations | Existing PostgreSQL | New Container with PostgreSQL
---|---|---
With Conda | Need to install Conda and configure PostgreSQL connections | Need to install Conda and PostgreSQL (default). Connections are automatically configured
Without Conda | Need to configure PostgreSQL connections | Need to install PostgreSQL. Connections are automatically configured
Configuration is mostly defined by setting environment variables that can be set manually in the shell or, for simplicity and repeatability, listed in a special file named .env. This package includes four template environment files, one for each of the base configurations above; the template for the default configuration (Conda with a new PostgreSQL container) is .env_example_postgres_conda.
The first step is always to select the appropriate configuration and copy the corresponding environment file to .env, e.g.:

```
cp .env_example_postgres_conda .env
```
The configuration is controlled by the two lines at the top of each file: COMPOSE_PROFILES is either empty (use an existing PostgreSQL server) or postgres (start a new PostgreSQL container), and AIRFLOW_CONDA_ENV is either empty (no Conda) or conda_default (install Conda):

```
###
COMPOSE_PROFILES=[/postgres]
AIRFLOW_CONDA_ENV=[/conda_default]
###
```
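For example, in the default configuration (Conda plus a new PostgreSQL container), these two lines would presumably read:

```
COMPOSE_PROFILES=postgres
AIRFLOW_CONDA_ENV="conda_default"
```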
You can then edit the remaining settings in the .env file, which you will most likely want to do in a production environment.
Controlling Conda environments
Setting the Conda environment used during workflow execution
For more details about managing Conda environments, see the Conda documentation.
First, export the environment you need into a YAML file and put it into the project folder under the source tree. Then edit the AIRFLOW_CONDA_ENV variable in the .env file:

```
# AIRFLOW_CONDA_ENV="conda_default"
AIRFLOW_CONDA_ENV="mycondaenv"
```
Alternatively, though less preferable, you can replace conda_default.yml with your own environment file.
Managing multiple Conda environments
If you have more than one YAML file in the project folder, your containers will be built with all of these environments. This gives you the option to switch between Conda environments without rebuilding the containers. The default environment will be the one specified by the AIRFLOW_CONDA_ENV environment variable. You will also be able to select a Conda environment inside a container when running command-line tools (e.g., using cwl-runner) or batch executions.
However, only one Conda environment, the one specified by the AIRFLOW_CONDA_ENV environment variable, will be active inside Airflow. To change the environment, shut down the webserver container, set the new value of AIRFLOW_CONDA_ENV, and restart the webserver without rebuilding it.
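A minimal sketch of that switch, assuming the service is named webserver in docker-compose.yaml:

```
# Stop the webserver container (the service name "webserver" is an assumption)
docker compose stop webserver
# ... edit AIRFLOW_CONDA_ENV in .env ...
docker compose up -d --no-build webserver
```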
Configuring installation of third-party requirements
Python requirements
Python requirements should be placed in the requirements.txt file.
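For example, a requirements.txt pinning a few hypothetical dependencies might look like:

```
numpy>=1.24
pandas==2.0.3
```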
R libraries
We are using Conda as an execution environment for R scripts, therefore R requirements should be part of your Conda environment.
If any R packages have to be installed from GitHub, they should be listed in r-github-packages.txt. These packages are installed directly from GitHub by the install_conda script. Make sure that the file ends with a newline.
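A sketch of r-github-packages.txt, assuming the common owner/repo format used by install_github-style tools (the entries are placeholders; check the install_conda script for the exact format it expects):

```
myorg/my-r-package
r-lib/devtools
```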
Configuring user projects
Besides installing third-party requirements, in many cases you will want to install your own code inside the workflow execution environment. This deployment supports user code written in Python and R.
Python Projects
Python projects can be installed inside CWL-Airflow and hence can be used by workflows. The automatic configuration assumes that all Python projects are placed in the project folder under the source tree. This can be done either by using Git submodules or simply by copying the project content into the project folder. Each Python project must contain a setup.py file. The included example, project/python_sample_project, shows how this can be done.
Please make sure that the install_requires argument in your setup.py file includes all required dependencies:

```
install_requires = [
    ...
]
```
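A minimal setup.py sketch for a project under the project folder (the project name and the commented-out dependency are placeholders):

```
# Minimal setup.py sketch; all names are placeholders
from setuptools import setup, find_packages

setup(
    name="my_project",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # list all runtime dependencies here, e.g.:
        # "requests>=2.28",
    ],
)
```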
Enforcing order for installation of user Python Projects
If the projects depend on each other, it is important to install them in a specific order. To enforce the order, create a file called projects.lst and place it in the project folder, listing a single subfolder of a Python project on each line, as in the example below. If there is no projects.lst file, the projects will be installed in an arbitrary order. See install_projects.sh for details.
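For example, a projects.lst that installs a hypothetical base_utils project before python_sample_project would contain:

```
base_utils
python_sample_project
```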
R Projects
R scripts can be placed under the project folder in the source tree. See the included example, project/r_sample_project.
Configure Git submodules
This step is especially important if you are working inside an environment with limited Internet access. It works around the problem that Docker containers running CWL and Airflow might have no access to the Internet. Most likely, you will need to install your own projects inside the CWL-Airflow environment. These projects can be installed using the Git submodules functionality.
1. Clone this repo and go to the repo directory.
2. If you need to install additional projects with custom code, add them as submodules inside the project subdirectory (see the sketch after this list). You can also simply copy the content into a subdirectory of project. Please note that one submodule (CWL-Airflow) is already included.
3. Execute the command:

```
git submodule update --init --recursive
```
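Adding a custom project as a submodule might look like this (the URL and path are placeholders):

```
git submodule add https://github.com/myorg/my_project.git project/my_project
git submodule update --init --recursive
```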
Overriding BASE_URL
In most cases, you will use a proxy server to connect to Airflow in a production environment. The connection can go through nginx or the Apache HTTP Server. Airflow itself uses redirection, so you will need to tell Airflow that it is behind a proxy. This is done by enabling a proxy fix (enable_proxy_fix = True) and setting the value of BASE_URL in your .env file:

```
export BASE_URL=http://my_host/myorg/airflow
```
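If the proxy fix is not already enabled in the image, it can presumably be set through Airflow's standard environment-variable convention; whether this deployment passes the variable through to the container is an assumption worth verifying in docker-compose.yaml:

```
# Equivalent of enable_proxy_fix = True in the [webserver] section of airflow.cfg
export AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX=True
```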
Airflow admin username and password
For security reasons, you will most likely want to change the username and password for Airflow and for the database authentication used by Airflow:

```
export _AIRFLOW_WWW_USER_USERNAME=airflow
export _AIRFLOW_WWW_USER_PASSWORD=airflow
```
Overriding default parameters
If you want to override other parameters, see the environment section in docker-compose.yaml.
Full list of available environment variables
The following variables can be exported in the shell or updated in the .env file to override their default values:

```
### Available options and default values
## Postgres
# POSTGRE_USER=airflow
# POSTGRE_PASS=airflow
# POSTGRE_DB=airflow
# POSTGRES_PORT=5432
#
## Airflow parameters
# POSTGRE_SERVER=postgres
# WEB_SERVER_PORT=8080
# AIRFLOW__CORE__LOAD_EXAMPLES="False"
# AIRFLOW__WEBSERVER__EXPOSE_CONFIG="True"
## DAGS_FOLDER -- environment variable inside the container. Do not override if you set the DAGS_DIR variable
# DAGS_FOLDER="/opt/airflow/dags"
# _AIRFLOW_WWW_USER_USERNAME="airflow"
# _AIRFLOW_WWW_USER_PASSWORD="airflow"
# BASE_URL="http://localhost:8080"
#
### Mapped volumes
# PROJECT_DIR="./project"
## DAGS_DIR -- environment variable on the host! Do not override if you set the DAGS_FOLDER variable
# DAGS_DIR="./dags"
# LOGS_DIR="./airflow-logs"
# CWL_TMP_FOLDER="./cwl_tmp_folder"
# CWL_INPUTS_FOLDER="./cwl_inputs_folder"
# CWL_OUTPUTS_FOLDER="./cwl_outputs_folder"
# CWL_PICKLE_FOLDER="./cwl_pickle_folder"
```
Example of a .env file, ready to run containers
NB: values might differ in your environment.

```
COMPOSE_PROFILES=
AIRFLOW_CONDA_ENV="conda_default"
POSTGRE_SERVER="172.16.238.1"
POSTGRE_DB=airflow
POSTGRE_USER=airflow
POSTGRE_PASS=airflow
PROJECT_DIR=./project
DAGS_DIR=./dags
LOGS_DIR=./airflow-logs
CWL_TMP_FOLDER=./cwl_tmp_folder
CWL_INPUTS_FOLDER=./cwl_inputs_folder
CWL_OUTPUTS_FOLDER=./cwl_outputs_folder
CWL_PICKLE_FOLDER=./cwl_pickle_folder
```