NSAPH Data Platform Deployment

Deployment of the NSAPH Data Platform is based on the CWL-Airflow Docker Deployment developed by Harvard FAS RC in collaboration with the Forome Association.

Essentially, this is a fork of Apache Airflow + CWL in Docker with Optional Conda and R.

It follows the Infrastructure as Code (IaC) approach.

Prerequisites

NB: The docker-compose.yaml in this project uses profiles and therefore requires docker-compose version 1.29 or newer.
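You can verify the installed version with:

docker-compose --version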

Installation

The Deployment Guide provides detailed information about deployment options and custom configurations.

The Howto provides a list of required and optional steps to perform during deployment.

Installation of CWL-Airflow on a dedicated host is relatively simple and is by and large covered by the Quick Start Deployment section below.

Advanced options are described in the Configuration Guide.

If the host where you are installing CWL-Airflow is shared with other applications, especially those using PostgreSQL, you should carefully read the Howto and the Configuration Guide.

After you have deployed CWL-Airflow, test it with the included examples.

You should be aware of some useful commands.
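For example, assuming the standard docker-compose setup from this repository, the following commands (run from the deployment directory) are typically handy:

docker-compose ps          # show the status of all platform containers
docker-compose logs -f     # follow the logs of all running services
docker-compose down        # stop and remove the containers
docker-compose up -d       # start (or restart) the platform in the background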

Quick Start Deployment

This quick start is specific to the NSAPH project. For testing general platform capabilities, please refer to the original CWL-Airflow deployment README.

The full sequence of commands to copy and paste on a clean VM:

git clone https://github.com/NSAPH-Data-Platform/nsaph-platform-deployment.git
cd nsaph-platform-deployment
git submodule update --init --recursive
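# Build the images, logging the full build output to a timestamped file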
export log=build-`date +%Y-%m-%d-%H-%M`.log && date > $log && cat .env >> $log && DOCKER_BUILDKIT=1 BUILDKIT_PROGRESS=plain docker-compose --env-file ./.env build --no-cache 2>&1 | tee -a $log && date >> $log
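# Copy the example workflows into the DAGs folder picked up by Airflow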
mkdir -p ./dags && cp -rf ./project/examples/* ./dags
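# Start all services in detached (background) mode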
docker-compose --env-file ./.env up -d

The whole process, given a stable Internet connection, should take from 20 minutes to a few hours, depending on your connection speed.

You can test the installation as described in the Testing the installation section. The first two examples should run both in command-line mode and in the Airflow UI. The third example requires Conda.

Testing

Basic testing is described in the Test Guide, which explains how to test both the command-line interface and the Airflow UI.
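As a quick smoke test (assuming the Airflow web server is published on its default port 8080; adjust if your .env file overrides it), you can check that the services are up and the UI responds:

docker-compose ps                # all services should be in the "Up" state
curl -sI http://localhost:8080   # the Airflow UI should answer with an HTTP status line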

Updating project packages

What are project packages?

The code that performs the actual data processing lives in the project subdirectory. From there, it is installed into all Docker containers used by the platform. In this documentation we also refer to it as ‘user code’, meaning that it is not part of the infrastructure but rather code developed by researchers and engineers for their specific projects.

From time to time, the runtime environment needs to be updated with the latest version of this user code. This section describes how a system administrator can do that.

There are three options to update user code in the runtime environment:

  • Rebuild all docker containers

  • Install updates inside docker containers

  • Map packages from container to the host

Option 1: Rebuild all docker containers

This is the most straightforward and proper option; it follows best-practice guidelines.

Executing this option is equivalent to following the instructions in the Quick Start. There is also a helper script, hardreset.sh.
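In essence, the procedure boils down to a sequence along these lines (a simplified sketch; consult hardreset.sh itself for the authoritative steps):

docker-compose --env-file ./.env down               # stop and remove the current containers
docker-compose --env-file ./.env build --no-cache   # rebuild all images from scratch
docker-compose --env-file ./.env up -d              # bring the platform back up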

There are, however, a few caveats associated with this option:

  1. The process might take several hours, depending on the Internet speed and hardware.

  2. If the build fails for some reason (e.g., some third-party packages have been updated and some dependencies are broken), it will take time and effort even to get back to a working version.

Option 2: Install updates inside docker containers

This is a quick and easy option that is also relatively safe. It can be performed by running the refresh.sh script or by executing similar commands.
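For illustration, such commands might look like the sketch below; the service name and package path here are hypothetical, so take the real ones from refresh.sh and your docker-compose.yaml:

# Hypothetical sketch: reinstall the project package inside a running container.
# "scheduler" and /opt/airflow/project are placeholders, not the actual names.
docker-compose exec scheduler pip install --upgrade /opt/airflow/project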

The main downside of this option is that the changes affect containers only while they are running. If any of the containers are restarted, all changes will be lost. However, this is not as bad as it sounds; just remember:

Do not forget to rerun the refresh.sh script every time you restart the containers!

Option 3: Map packages from host

We can map packages on the host machine to the library path inside the containers. The file docker-compose.mapped-packages.yaml illustrates how to do it (see lines 64-70).
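For a general idea of what such a mapping looks like, here is a hypothetical override; the service name and paths are illustrative, not copied from the actual file:

# Write a hypothetical compose override that bind-mounts the host packages into
# a container; the real service names and paths are in
# docker-compose.mapped-packages.yaml (lines 64-70).
cat > docker-compose.override.yaml <<'EOF'
services:
  scheduler:
    volumes:
      - ./project:/opt/airflow/project
EOF
docker-compose --env-file ./.env -f docker-compose.yaml -f docker-compose.override.yaml up -d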

If this option is used, simply refreshing the packages on the host (e.g., by executing git pull) will automatically update the packages inside the containers.

However, keep in mind that this bypasses the normal installation process, which may have unpredictable consequences.