# NSAPH Data Platform Deployment

[Documentation Home](https://nsaph-data-platform.github.io/nsaph-platform-docs/home.html)

```{toctree}
---
maxdepth: 2
hidden:
---
Introduction
Guide
Howto
Configuration
Testing
Glossary
UsefulCommands
```

```{contents}
---
local:
---
```

Deployment of the NSAPH Data Platform is based on the CWL-Airflow Docker deployment developed by Harvard FAS RC in collaboration with the Forome Association. Essentially, this is a fork of [Apache Airflow + CWL in Docker with Optional Conda and R](https://github.com/ForomePlatform/airflow-cwl-docker). It follows the [Infrastructure as Code (IaC)](https://en.wikipedia.org/wiki/Infrastructure_as_code) approach.

## Prerequisites

> **NB**: The docker-compose.yaml in this project uses profiles and therefore
> requires **docker-compose utility version 1.29+**.

## Installation

The [Deployment Guide](Guide) provides detailed information about deployment options and custom configurations. [Howto](Howto.md) lists the required and optional steps to perform during deployment.

Installation of CWL-Airflow on a dedicated host is relatively simple and is largely covered by the [](#quick-start-deployment) section below. Advanced options are described in the [Configuration Guide](Configuration.md).

> If the host where you are installing CWL-Airflow is shared with other
> applications, especially ones that use PostgreSQL, you should carefully read
> the [Howto](Howto.md) and the [Configuration Guide](Configuration.md).

After you have deployed CWL-Airflow, [test it](Testing.md) with the included examples. You should also be aware of some [useful commands](UsefulCommands).

## Quick Start Deployment

This quick start is specific to the NSAPH project. To test the general platform capabilities, please refer to the original [CWL-Airflow deployment README](https://github.com/ForomePlatform/airflow-cwl-docker#quick-start).

The full sequence of commands to copy and paste on a clean VM:

```shell
# Clone the repository and its submodules
git clone https://github.com/NSAPH-Data-Platform/nsaph-platform-deployment.git
cd nsaph-platform-deployment
git submodule update --init --recursive

# Build the images, logging the output to a timestamped file
export log=build-`date +%Y-%m-%d-%H-%M`.log && date > $log && cat .env >> $log && \
  DOCKER_BUILDKIT=1 BUILDKIT_PROGRESS=plain \
  docker-compose --env-file ./.env build --no-cache 2>&1 | tee -a $log && date >> $log

# Copy the example pipelines into the DAGs folder and start the services
mkdir -p ./dags && cp -rf ./project/examples/* ./dags
docker-compose --env-file ./.env up -d
```

With a stable Internet connection, the whole process should take from 20 minutes to a few hours, depending on your connection speed.

You can test the installation as described in the [Testing the installation](Testing.md) section. The first two examples should run both in command-line mode and in the Airflow UI. The third example requires Conda.

## Testing

Basic testing is described in the [Test Guide](Testing.md). It describes how to test both the command-line tools and the Airflow UI.

## Updating project packages

### What are project packages?

The code that performs the actual data processing lives in the `project` subdirectory. From there it is installed into all Docker containers used by the platform. In this documentation we also refer to it as "user code", meaning that it is not part of the infrastructure but code developed by researchers and engineers for their specific projects. From time to time the runtime environment needs to be updated with the latest version of this user code. This section describes how a system administrator can do that.
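Before choosing an update strategy, it can be useful to check which version of the user code is currently installed in a running container. Below is a minimal sketch; the service name `airflow-webserver` and the package name `nsaph` are illustrative assumptions, so check the output of `docker-compose ps` and the contents of `project` for the actual names in your deployment.

```shell
# List the running services to find the actual service names
docker-compose --env-file ./.env ps

# Show the installed version of a project package inside a container;
# "airflow-webserver" and "nsaph" are assumed names, adjust as needed
docker-compose --env-file ./.env exec airflow-webserver pip show nsaph
```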
There are three options to update the user code in the runtime environment:

* Rebuild all Docker containers
* Install updates inside the running Docker containers
* Map packages from the host into the containers

### Option 1: Rebuild all docker containers

This is the most straightforward and proper option, and the one that follows the best-practice guidelines. Executing it is equivalent to following the instructions in [Quick Start](#quick-start-deployment). There is also a helper script, [hardreset.sh](members/rebuild.md).

There are, however, a few caveats associated with this option:

1. The process might take several hours, depending on Internet speed and hardware.
2. If the build fails for some reason (e.g., some third-party packages have been updated and some dependencies are broken), it will take time and effort even to get back to a working version.

### Option 2: Install updates inside docker containers

This is a quick and easy option that is also relatively safe. It can be performed by running the [refresh.sh](members/refresh.md) script or by executing similar commands.

The main downside of this option is that the changes affect the containers only while they are running. If any of the containers are restarted, all changes will be lost. However, this is not as bad as it sounds:

> Do not forget to rerun the [refresh.sh](members/refresh.md) script every
> time you restart the containers!

### Option 3: Map packages from host

We can map packages on the host machine to the library path inside the containers. The file [docker-compose.mapped-packages.yaml](members/docker-compose-mapped-packages.yaml.md) illustrates how to do it; see lines 64-70. If this option is used, simply refreshing the packages on the host (e.g., by executing `git pull`) automatically updates the packages inside the containers. Keep in mind, however, that you are bypassing the normal installation process, with potentially unpredictable consequences.
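To activate this option, the mapped-packages Compose file has to be used when starting the platform. A hedged sketch is shown below; it assumes the file is written as an override of the default `docker-compose.yaml` (if it is a complete standalone Compose file instead, pass it as the only `-f` argument).

```shell
# Start the platform with host-to-container package mapping enabled;
# assumes docker-compose.mapped-packages.yaml overrides the default file
docker-compose --env-file ./.env \
  -f docker-compose.yaml \
  -f docker-compose.mapped-packages.yaml \
  up -d
```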