NSAPH Data Platform: Documentation Home

User and Development Documentation

Index

Introduction to Data Platform

This data platform is intended for the development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen the Common Workflow Language (CWL). For deployment, we have selected CWL-Airflow to take advantage of its user interface for monitoring and controlling the execution process. The data is eventually stored in a PostgreSQL DBMS; many processing steps in the included data processing pipelines are run inside the database itself.

The data platform is based on a combination of an Infrastructure as Code (IaC) approach and CWL. Besides tools written in widely used languages such as Python, C/C++, and Java, the platform also supports tools written in R and PL/pgSQL. The data platform consists of several Python packages, a package to deploy the platform using CWL-Airflow, and a number of data ingestion pipelines. Data ingestion pipelines process data from external sources and load it into the database.

A discussion of the aims of this data platform and of how reproducible research can benefit from such a product is provided in the What is Data Platform section.

The data platform is deployed as a set of Docker containers orchestrated by Docker Compose. Conda environment files and Python requirements files are used to build Docker containers that satisfy the dependencies. Specific parameters can be customized via environment files and shell script callbacks.

Building Blocks

The building blocks of the data platform are packaged in several repositories:

  • The NSAPH utilities repository https://github.com/NSAPH-Data-Platform/nsaph-utils

  • The core platform repository https://github.com/NSAPH-Data-Platform/nsaph-core-platform

  • The GIS utilities repository https://github.com/NSAPH-Data-Platform/nsaph-gis

  • The pipeline repositories. Four pipelines, each focused on a different data domain, have been implemented:

    • The cms repository https://github.com/NSAPH-Data-Platform/nsaph-cms

    • The EPA repository https://github.com/NSAPH-Data-Platform/nsaph-epa

    • The gridmet repository https://github.com/NSAPH-Data-Platform/nsaph-gridmet

    • The census repository https://github.com/NSAPH-Data-Platform/nsaph-census

  • The deployment repository https://github.com/NSAPH-Data-Platform/nsaph-platform-deployment

General details of the building blocks are provided next.

NSAPH Utilities

The nsaph_utils package is intended to hold Python code that is useful across multiple portions of the NSAPH pipelines.

The included utilities are developed to be as independent of the specific infrastructure and execution environment as possible.

Included utilities (a brief usage sketch follows this list):

  • Interpolation code

  • Reading FST files from Python

  • Various I/O wrappers

  • An API and CLI framework

  • QC Framework

  • Documentation utilities to simplify creation of consistent documentation for the NSAPH platform
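As an illustration of the kind of entry point the API and CLI framework is meant to standardize, here is a minimal, generic sketch. It uses plain argparse with hypothetical argument names; it is not the actual nsaph_utils API, for which see the package documentation.

```python
# Generic illustration of a command-line entry point of the kind the
# nsaph_utils API/CLI framework standardizes. This is NOT the actual
# nsaph_utils API; function and argument names here are hypothetical.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical ingestion step")
    parser.add_argument("--input", required=True, help="Path to the input file")
    parser.add_argument("--table", required=True, help="Target database table")
    parser.add_argument("--dry-run", action="store_true",
                        help="Validate the input without loading it")
    return parser


def main() -> None:
    args = build_parser().parse_args()
    # A real pipeline step would read args.input and load it into args.table.
    print(f"Would load {args.input} into {args.table} (dry run: {args.dry_run})")


if __name__ == "__main__":
    main()
```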

Core Platform

The Core Platform provides generic infrastructure for the NSAPH Data Platform. It depends on the nsaph_utils package but augments it with APIs and command-line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.

Some mapping (or crosswalk) tables are also included in the Core Platform module. These tables include mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See the Mapping between different territorial codes section for more information.
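Conceptually, a crosswalk table is joined to the data being processed on the shared code, so that records keyed by one territorial code can be aggregated or linked by another. The sketch below illustrates the idea with pandas; the column names and values are made up and do not reflect the actual Core Platform table definitions.

```python
# Conceptual illustration of applying a ZIP-to-ZCTA crosswalk.
# Column names and values are hypothetical, not the Core Platform schema.
import pandas as pd

# Hypothetical measurements keyed by USPS ZIP code.
measurements = pd.DataFrame({
    "zip": ["02115", "02134", "10001"],
    "pm25": [7.1, 6.8, 9.3],
})

# Hypothetical crosswalk mapping ZIP codes to Census ZCTA codes.
zip_to_zcta = pd.DataFrame({
    "zip": ["02115", "02134", "10001"],
    "zcta": ["02115", "02134", "10001"],
})

# Join on the shared ZIP code so downstream steps can work with ZCTAs.
by_zcta = measurements.merge(zip_to_zcta, on="zip", how="left")
print(by_zcta)
```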

GIS Utilities

This library contains several packages for working with census shapefiles.

Data Processing and Loading Pipelines

See dedicated Pipelines page for additional details.

Fully tested and supported pipelines are listed on the Pipelines page. At the moment, we have published processing pipelines for all Data Domains except Demographics. However, it is not possible to test health data processing pipelines without access to the same health data that was used for their development.

To include additional data in a deployed data platform instance, see the Adding more data section.

Pipelines can be tested with the DBT Pipeline Testing Framework.

Deployment

The deployment repository is based on the CWL-Airflow Docker Deployment developed by Harvard FAS RC in collaboration with the Forome Association. Essentially, it is a fork of Apache Airflow + CWL in Docker with Optional Conda and R. It follows an Infrastructure as Code (IaC) approach.

The Harvard FAS RC Superset repository is a fork of Apache Superset customized for the Harvard FAS RC environment.

A detailed description of this deployment is provided in the NSAPH Data Platform Deployment subsection of the Data Platform Internals section.

Using the Database

For a sample query against the database, please look at the Sample Query section.
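As a minimal sketch of connecting to the platform's PostgreSQL database from Python, the snippet below uses psycopg2; the connection parameters and the table name are placeholders, and real queries against the platform schema are shown in the Sample Query section.

```python
# Minimal sketch of querying the platform's PostgreSQL database from Python.
# Connection parameters and the table name below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    dbname="nsaph",      # placeholder database name
    user="analyst",      # placeholder credentials
    password="secret",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM example_schema.example_table;")
        print(cur.fetchone()[0])
finally:
    conn.close()
```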

A discussion of querying health data can be found in this document.

Terms and Acronyms

The included Glossary provides information about acronyms and other terms used throughout this documentation.

Additionally, the General Index and the Python Module Index bring all the pieces of the Data Platform together.

Building Platform documentation

The documentation contains general documentation pages in Markdown format and a build script that goes over all other platform repositories and creates a combined GitHub Pages site. The script supports links between repositories. The README.md file contains build instructions.

To integrate Markdown with Sphinx processing, we use the MyST Parser.
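For reference, enabling the MyST Parser in a Sphinx project amounts to adding it to the extensions list in conf.py; the excerpt below is a generic example, and the platform's actual configuration may enable additional options.

```python
# Excerpt from a Sphinx conf.py enabling Markdown sources via MyST Parser.
# The platform's actual configuration may differ and enable more options.
extensions = [
    "myst_parser",
]

# Let Sphinx pick up both reStructuredText and Markdown sources.
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
```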

Documentation utilities are contained in the docutils package of nsaph-utils.