# Generator of pipelines executing containerized apps

```{contents}
---
local:
---
```

## Introduction

The [National Studies on Air Pollution and Health](https://www.hsph.harvard.edu/nsaph/) (NSAPH) organization publishes containerized applications that produce certain types of data. These applications are published on the [NSAPH Data Production GitHub](https://github.com/NSAPH-Data-Processing). The Pipeline Generator generates a [CWL](https://www.commonwl.org/) pipeline that executes an app and ingests the data it produces into the Dorieh data warehouse.

The process of data ingestion consists of two steps:

1. Generation of the pipeline for data ingestion
2. Execution of the pipeline

## Prerequisites

### Docker or Python virtual environment

> You need either Option 1 or Option 2, not both!

#### Option 1: Docker

The first step, generation of the pipeline, has minimal requirements. The easiest way to generate the pipeline is to use a Docker container, which only requires Docker to be installed on the host system where the step is executed. See the [Docker installation instructions](https://docs.docker.com/engine/install/) for details.

#### Option 2: Python virtual environment

Alternatively, *instead* of Docker, one can set up a [Python virtual environment](https://docs.python.org/3/library/venv.html). Once the virtual environment is set up, install the Dorieh packages in it with the following command:

    pip install git+https://github.com/NSAPH-Data-Platform/nsaph-core-platform.git@develop

### Set up the DBMS server

Dorieh uses the PostgreSQL DBMS to manage its data warehouse. The data warehouse is assumed to be set up and operational in order to ingest data. Generating the pipeline does not require the data warehouse.

### Define connection

Dorieh uses a `database.ini`-type file to manage connections to the data warehouse. The format is described in the [documentation](SampleQuery.md#create-connection-definition-file). If a file with database connections does not exist, you need to create one, for example named `database.ini`, somewhere on your local file system. A sketch of such a file is shown at the end of the next section.

## Using the pipeline generator

### Generate pipeline and metadata

The generator takes 3 command line parameters:

1. GitHub URL or a local path for the containerized app. In the root directory of the path, the generator will look for a file named `app.config.yaml`.
2. Name of the output file for the generated pipeline.
3. Branch of the app repository to use.

If you use a local Python virtual environment, then run:

    python -m dorieh.platform.apprunner.app_run_generator $GitHubURL $outputfile $branch

Example:

    python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git pipeline.cwl master

This generates a pipeline executing the [ZIP to ZCTA Crosswalk Producer app](https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk) using the `master` branch and outputs the result into the current directory in a file named `pipeline.cwl`.

Alternatively, to do the same using a Docker container, execute:

    docker run -v $(pwd):/tmp/work forome/dorieh python -m dorieh.platform.apprunner.app_run_generator https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk.git /tmp/work/pipeline.cwl master

In both cases, the generator will produce 3 files:

* `pipeline.cwl`: the main workflow file
* `ingest.cwl`: a subworkflow used for data ingestion
* `common.yaml`: metadata required for ingestion. The name `common` is derived from the `domain` key in the [app.config.yaml](https://github.com/NSAPH-Data-Processing/zip2zcta_master_xwalk/blob/app-config-1/app.config.yaml) file in the app repository.
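Before executing the generated pipeline you will need the connection definition file described in the "Define connection" section above. As a minimal sketch, assuming a local PostgreSQL server and a connection named `dorieh` (every value here is a placeholder, not an actual NSAPH setting; see the linked documentation for the authoritative format), such a `database.ini` file might look like:

    ; Hypothetical connection definition; all values are placeholders
    [dorieh]
    host=localhost
    database=dorieh
    user=postgres
    password=mypassword

The section name (`dorieh` in this sketch) is the value passed as `--connection_name` when the pipeline is executed.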
## Execute generated pipeline

If you installed the Dorieh packages in your local Python virtual environment, you can execute the pipeline with the following command in the working directory, for example using the CWL reference implementation built into Dorieh (cwl-runner):

    cwl-runner pipeline.cwl --database $path_to_your_connection_def_file --connection_name $connection_name

For example:

    cwl-runner pipeline.cwl --database ../../database.ini --connection_name dorieh

A better way would be to use a production-grade CWL implementation such as [Toil](https://toil.readthedocs.io/en/latest/running/cwl.html). To do this you need to [install Toil](https://toil.readthedocs.io/en/latest/gettingStarted/install.html) on your local system.

> You do not need to install the Dorieh packages to execute the pipeline.
> The runtime engine will use the Dorieh container, where all requirements
> are preinstalled.

For Toil, it is good practice to first create a working directory, e.g. one named `work`. Otherwise, Toil will create a default directory somewhere in your temporary space. The command to execute the pipeline with Toil would be:

    toil-cwl-runner --retryCount 0 --cleanWorkDir never --jobStore j1 --outdir results --workDir work pipeline.cwl --database ../../database.ini --connection_name nsaph-docker

Specifying `jobStore` will let you restart the pipeline from the point of failure if pipeline execution fails for any reason.

## Appendix 1: Metadata description

### File app.config.yaml

Keys:

* `metadata`: a relative path to the app metadata file
* `dorieh-metadata`: a relative path to the metadata required to create a database table
* `docker`: information about the Docker container, including:
  * `image`: the tag of the container image that executes the app
  * `run`: the command to be run within the container. This is an optional field
* `outputdir`: the directory where to look for the results of the execution of the app

A hypothetical example is sketched at the end of this appendix.

### File metadata.yml

This is the file referenced from `app.config.yaml` by the `metadata` key. It should contain the following keys:

* `dataset_name`
* `description`
* `fields`:
  * `table`
  * `columns`

Each column should have `name`, `type` and `description` keys. A hypothetical example is also sketched at the end of this appendix.

### File dorieh-metadata.yaml

This is a header for the knowledge domain that will be created. A detailed description is provided in the [Data modeling section](Datamodels.md#domain). It is important to define correct values for `quoting`, `schema` and `primary_key` for each table.
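To make the key descriptions above concrete, here is a sketch of a hypothetical `app.config.yaml`. All values are illustrative placeholders assumed for this example, not taken from an actual app repository. Note the `domain` key, which determines the name of the generated metadata file (e.g. `common.yaml`):

    # Hypothetical app.config.yaml; all values are placeholders
    domain: common
    metadata: metadata.yml
    dorieh-metadata: dorieh-metadata.yaml
    docker:
      image: example-org/example-app:latest
      run: python produce_data.py     # optional
    outputdir: data/output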
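Similarly, a minimal `metadata.yml` matching the key list above might look like the following sketch. The table and column names are hypothetical, loosely modeled on the ZIP to ZCTA crosswalk example:

    # Hypothetical metadata.yml; names and types are placeholders
    dataset_name: zip2zcta_xwalk
    description: Example crosswalk between ZIP codes and ZCTAs
    fields:
      table: zip2zcta
      columns:
        - name: zip
          type: str
          description: United States Postal Service ZIP code
        - name: zcta
          type: str
          description: Census ZIP Code Tabulation Area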