NSAPH Data Platform: Documentation Home
User and Development Documentation
Introduction to Data Platform
This data platform is intended for the development and deployment of ETL/ELT pipelines that include complex data processing and data cleansing workflows. Complex workflows require a workflow language, and we have chosen Common Workflow Language (CWL). For deployment, we have selected CWL-Airflow to take advantage of its excellent user interface, which allows users to monitor and control the actual execution process. The data is eventually stored in a PostgreSQL DBMS; many processing steps in the included data processing pipelines run inside the database itself.
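Because the pipelines are plain CWL, they can, in principle, also be executed locally by any conformant CWL runner. As a minimal sketch (the workflow and job file names are placeholders), a workflow could be launched through the reference runner, cwltool, from Python:

```python
import subprocess

# Placeholder file names: a CWL workflow definition and its input job file
workflow = "pipeline.cwl"
job = "inputs.yml"

# cwltool is the CWL reference runner; the deployed platform uses CWL-Airflow,
# but a conformant runner can execute the same workflow definition
subprocess.run(["cwltool", workflow, job], check=True)
```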
The data platform is based on a combination of an Infrastructure as Code (IaC) approach and CWL. Besides tools written in widely used languages such as Python, C/C++, and Java, the platform also supports tools written in R and PL/pgSQL. The data platform consists of several Python packages, a package to deploy the platform using CWL-Airflow, and a number of data ingestion pipelines. Data ingestion pipelines process data from external sources and load it into the database.
A discussion of the aims of this data platform and of how reproducible research can benefit from such a product is provided in the What is Data Platform section.
The data platform is deployed as a set of Docker containers orchestrated by Docker-Compose. Conda (package manager) environment files and Python requirements files are used to build Docker containers that satisfy the dependencies. Specific parameters can be customized via environment files and shell script callbacks.
Building Blocks
The building blocks of the data platform are packaged in several repositories:
The NSAPH utilities repository https://github.com/NSAPH-Data-Platform/nsaph-utils
The core platform repository https://github.com/NSAPH-Data-Platform/nsaph-core-platform
The GIS utilities repository https://github.com/NSAPH-Data-Platform/nsaph-gis
The pipeline repositories. Five pipelines, each focused on a different data domain, have been implemented:
The cms repository https://github.com/NSAPH-Data-Platform/nsaph-cms
The EPA repository https://github.com/NSAPH-Data-Platform/nsaph-epa
The gridmet repository https://github.com/NSAPH-Data-Platform/nsaph-gridmet
The census repository https://github.com/NSAPH-Data-Platform/nsaph-census
The deployment repository https://github.com/NSAPH-Data-Platform/nsaph-platform-deployment
General details of the building blocks are provided next.
NSAPH Utilities
The nsaph_utils package is intended to hold Python code that is useful across multiple portions of the NSAPH pipelines.
The included utilities are developed to be as independent of specific infrastructure and execution environment as possible.
Included utilities:
Interpolation code
Reading FST files from Python
Various I/O wrappers
An API and CLI framework
QC Framework
Documentation utilities to simplify the creation of consistent documentation for the NSAPH platform
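For instance, the interpolation utilities address gaps in time series. The following is a minimal conceptual sketch of such a task using pandas, with purely illustrative data; it demonstrates the idea rather than the nsaph_utils API:

```python
import pandas as pd

# Purely illustrative daily series with two missing values
series = pd.Series(
    [1.0, None, None, 4.0, 5.0],
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Linear interpolation fills the interior gaps with 2.0 and 3.0
print(series.interpolate(method="linear"))
```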
Core Platform
The Core Platform provides generic infrastructure for the NSAPH Data Platform. It depends on the nsaph_utils package, but augments it with APIs and command line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.
Some mapping (or crosswalk) tables are also included in the Core Platform module. These tables include mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See the Mapping between different territorial codes section for more information.
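A typical use of a crosswalk table is joining it to a data table to translate one coding system into another. The sketch below illustrates this with psycopg2; the table and column names (exposures, zip_to_zcta) and the connection parameters are hypothetical, so consult the actual schema before adapting it:

```python
import psycopg2

# Hypothetical table and column names, for illustration only
QUERY = """
    SELECT e.zip, m.zcta, e.value
    FROM exposures AS e
    JOIN zip_to_zcta AS m ON m.zip = e.zip
    WHERE e.year = %s
"""

# Placeholder connection parameters; substitute your deployment's values
conn = psycopg2.connect(host="localhost", dbname="nsaph", user="postgres")
with conn.cursor() as cur:
    cur.execute(QUERY, (2015,))
    for row in cur.fetchmany(10):
        print(row)
conn.close()
```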
GIS Utilities
This library contains several packages aimed at working with census shapefiles.
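As a generic illustration of the kind of task these packages address (this uses geopandas directly, not the nsaph-gis API, and the shapefile path is a placeholder), one can read a census shapefile and find the polygon containing a point:

```python
import geopandas as gpd
from shapely.geometry import Point

# Placeholder path to a census TIGER/Line county shapefile
counties = gpd.read_file("tl_2020_us_county.shp")

# Find the county containing a given point (longitude, latitude)
point = Point(-71.1, 42.37)
match = counties[counties.contains(point)]
print(match[["GEOID", "NAME"]])
```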
Data Processing and Loading Pipelines
See the dedicated Pipelines page for additional details.
Fully tested and supported pipelines are listed on that page. At the moment, we have published processing pipelines for all Data Domains except Demographics. Note, however, that it is not possible to test health data processing pipelines without access to the same health data that was used for their development.
To include additional data in a deployed data platform instance, see the Adding more data section.
Pipelines can be tested with the DBT Pipeline Testing Framework.
Deployment
The deployment repository is based on the CWL-Airflow Docker Deployment developed by Harvard FAS RC in collaboration with the Forome Association. Essentially, it is a fork of Apache Airflow + CWL in Docker with Optional Conda and R. It follows the Infrastructure as Code (IaC) approach.
The [Harvard FAS RC Superset] repository is a fork of Apache Superset customized for the Harvard FAS RC environment.
A detailed description of this deployment is provided in the NSAPH Data Platform Deployment subsection of the Data Platform Internals section.
Using the Database
For a sample query against the database, please look at Sample Query.
A discussion of querying health data can be found in this document.
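As a minimal sketch of querying the database from Python (assuming psycopg2 and pandas; the connection parameters and table name are placeholders, so see Sample Query for queries against the actual schema):

```python
import pandas as pd
import psycopg2

# Placeholder connection parameters; substitute your deployment's values
conn = psycopg2.connect(host="localhost", dbname="nsaph", user="postgres")

# Hypothetical table name, for illustration only
df = pd.read_sql("SELECT * FROM us_states LIMIT 10", conn)
print(df)
conn.close()
```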
Terms and Acronyms
The included Glossary provides information about acronyms and other terms used throughout this documentation.
Additionally, the General Index and the Python Module Index bring all the pieces of the Data Platform together.
Building Platform documentation
The documentation combines general documentation pages in Markdown format with a build script that goes over all other platform repositories and creates a combined GitHub Pages site. The script supports links between repositories. The README.md file contains Build instructions.
To integrate Markdown with Sphinx processing, we use the MyST Parser.
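A minimal Sphinx configuration that enables MyST looks like the following; this is a generic sketch, not the platform's actual conf.py:

```python
# conf.py -- a generic sketch, not the platform's actual configuration
project = "NSAPH Data Platform"

# Enable the MyST parser so Sphinx can process Markdown sources
extensions = ["myst_parser"]

# Treat both reStructuredText and Markdown files as sources
source_suffix = {
    ".rst": "restructuredtext",
    ".md": "markdown",
}
```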
Documentation utilities are contained in the docutils package of nsaph-utils.