NSAPH Core Data Platform
Tool Examples
Examples of tools included in this package are:
- A utility to monitor progress of long-running database processes like indexing
- A utility to infer database schema and generate DDL from a CSV file
- A utility to link a table to GIS from a CSV file
- A utility to import/export JSONLines files into/from PostgreSQL
Project Structure
The package is under intensive development, and the project structure is in flux.
Top level directories are:
- doc
- resources
- src
The doc directory contains documentation.
The resources directory contains resources that must be loaded into the data platform for its normal functioning. For example, it contains mappings between US states, counties, FIPS codes, and ZIP codes. See details in the Resources section.
The src directory contains software source code. See details in the Software Sources section.
Software Sources
The directories under src are:
- airflow
- commonwl
- html
- plpgsql
- python
- r
- superset
- yml
They are described in more detail in the corresponding sections. Here is a brief overview:
airflow contains code and configuration for Airflow. Most of its content is deprecated, as it has been transferred to the deployment package or to specific pipelines. However, this directory is intended to contain Airflow plugins that are generic to all NSAPH pipelines.
commonwl contains reusable workflows, packaged as tools that can and should be used by all NSAPH pipelines. Examples of such tools are: introspection of CSV files, indexing tables, linking tables with GIS information for easy mapping, creation of a Superset datasource.
html is a deprecated directory for HTML documents.
plpgsql contains PostgreSQL procedures and functions implemented in the PL/pgSQL language.
python contains Python code. See more details in the Python packages section.
r contains R utilities; these will probably be deprecated.
superset contains definitions of reusable Superset datasets and dashboards.
yml contains various YAML files used by the platform.
Python packages
NSAPH Package
This is the main package containing the majority of the code.
Modules and subpackages included in the nsaph package are described below.
Subpackage for Data Modelling
nsaph.data_model
Implements version 2 of the data modelling toolkit.
Version 1 focused on loading already processed data, saved as flat files, into the database. It inferred the data model from the structure of the data files and the accompanying README files. The inferred data model was then converted to a database schema by generating the appropriate DDL.
Version 2 focuses on generating the code required to do the actual processing. The main concept is a knowledge domain, or just a domain. A domain model is defined in a YAML file as described in the documentation. The main module that processes the YAML definition of the domain is domain.py. Another module, inserter, handles parallel insertion of data into domain tables.
Auxiliary modules perform various maintenance tasks. Module index_builder builds indices for a given table or for all tables within a domain. Module utils provides convenience function wrappers and defines the class DataReader, which abstracts reading CSV and FST files. In other words, DataReader provides a uniform interface for reading columnar files in two (and potentially more) different formats.
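To illustrate the idea (this is a simplified sketch, not the actual DataReader API), such a reader can dispatch on the file extension and hide the details of opening the file; only plain and gzipped CSV are handled here, while the real DataReader also understands FST files:

```python
import csv
import gzip
from typing import Iterator, List


class ColumnarReader:
    """Simplified stand-in for DataReader: yields rows from a columnar file,
    dispatching on the file extension. Only .csv and .csv.gz are shown here;
    the real DataReader also handles FST files."""

    def __init__(self, path: str):
        self.path = path
        self.columns: List[str] = []

    def rows(self) -> Iterator[List[str]]:
        if self.path.endswith(".csv.gz"):
            stream = gzip.open(self.path, "rt", newline="")
        elif self.path.endswith(".csv"):
            stream = open(self.path, "rt", newline="")
        else:
            raise ValueError("Unsupported file format: " + self.path)
        with stream:
            reader = csv.reader(stream)
            self.columns = next(reader)  # header row
            yield from reader
```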
Module Database Connection Wrapper
nsaph.db
Module db is a PostgreSQL connection wrapper. It reads connection parameters from an ini file and connects to the database. It can transparently connect over an SSH tunnel when required.
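A minimal sketch of the idea, assuming an ini file with a [database] section (the exact section and option names expected by nsaph.db may differ, and SSH tunneling is omitted):

```python
from configparser import ConfigParser

import psycopg2


def connect(ini_path: str, section: str = "database"):
    """Read connection parameters from an ini file and open a connection."""
    config = ConfigParser()
    config.read(ini_path)
    params = dict(config[section])
    return psycopg2.connect(
        host=params.get("host", "localhost"),
        port=int(params.get("port", 5432)),
        dbname=params["database"],
        user=params["user"],
        password=params.get("password"),
    )


# Illustrative ini file:
# [database]
# host = localhost
# port = 5432
# database = nsaph
# user = postgres
# password = secret
```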
Loader Subpackage
nsaph.loader
A set of utilities to manipulate data.
Module data_loader implements parallel loading of data into a PostgreSQL database. It is also responsible for loading DDL and for creating views, both virtual and materialized.
Module index_builder is a utility to build indices and monitor the build progress.
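On PostgreSQL 12 and later, the progress of an index build can be observed through the pg_stat_progress_create_index system view. The following polling loop is an illustration of the technique, not the actual index_builder implementation:

```python
import time


def monitor_index_builds(connection, interval: int = 30):
    """Poll pg_stat_progress_create_index and report completion per index build."""
    query = """
        SELECT pid, relid::regclass AS table_name, phase, blocks_done, blocks_total
        FROM pg_stat_progress_create_index
    """
    while True:
        with connection.cursor() as cursor:
            cursor.execute(query)
            rows = cursor.fetchall()
        if not rows:
            break  # no index builds are currently in progress
        for pid, table_name, phase, done, total in rows:
            pct = 100.0 * done / total if total else 0.0
            print(f"[{pid}] {table_name}: {phase}, {pct:.1f}% of blocks processed")
        time.sleep(interval)
```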
Subpackage to describe and implement user requests [Incomplete]
nsaph.requests
Package nsaph.requests contains code that is intended to be used for fulfilling user requests. Its development is currently on hold.
Module hdf5_export exports the result of an SQL query as an HDF5 file. The structure of the HDF5 file is described by a YAML request definition.
Module query generates an SQL query from a YAML request definition.
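As an illustration of the general approach (the YAML request definition format is project-specific and not reproduced here), a query result can be written column by column into an HDF5 group:

```python
import h5py
import numpy as np


def export_query(connection, sql: str, path: str, group_name: str):
    """Run a query and store each result column as a dataset in an HDF5 group."""
    with connection.cursor() as cursor:
        cursor.execute(sql)
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
    with h5py.File(path, "w") as f:
        group = f.create_group(group_name)
        for i, name in enumerate(columns):
            data = np.array([row[i] for row in rows])
            if data.dtype.kind in ("U", "O"):
                data = data.astype("S")  # HDF5 stores strings as bytes
            group.create_dataset(name, data=data)
```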
Subpackage with miscellaneous utilities
nsaph.util
Package nsaph.util contains:
- Support for packaging resources, in two modules: resources and pg_json_dump. The latter imports and exports PostgreSQL (pg) tables in JSONLines format.
- Module net, containing one method that resolves a host to localhost. This method is required by Airflow.
- Module executors, implementing a ThreadPoolExecutor with a bounded queue (see the sketch below). It is used to prevent out-of-memory (OOM) errors when processing huge files, by not loading a whole file into memory before dispatching it for processing.
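The bounded-queue technique can be sketched as follows: a semaphore limits how many tasks may be queued or running at once, so submit() blocks instead of buffering an unbounded backlog. This is an illustration of the technique, not the exact executors implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore


class BoundedExecutor:
    """ThreadPoolExecutor wrapper whose pending-task backlog is bounded."""

    def __init__(self, max_workers: int, queue_size: int):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._semaphore = Semaphore(max_workers + queue_size)

    def submit(self, fn, *args, **kwargs):
        self._semaphore.acquire()  # blocks the producer when the backlog is full
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except Exception:
            self._semaphore.release()
            raise
        future.add_done_callback(lambda _: self._semaphore.release())
        return future

    def shutdown(self, wait: bool = True):
        self._executor.shutdown(wait=wait)
```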
YAML files
The majority of the files are data model definitions. For now, they are included in the nsaph package because they are used by different utilities and are thus expected to be stored in a specific location.
Besides data model files, there are YAML files for:
- Conda environments required for NSAPH pipelines. Unless we are able to create a single environment that accommodates all pipelines, these files will probably be deprecated and moved into the corresponding pipeline repositories.
- Sample user requests for future downstream pipelines that create user workspaces from the database. The file example_request.yml is used by the sample request handler.
Resources
Resources are organized in the following way:
- ${database schema}/
  - DDL file for ${resource1}
  - content of ${resource1} in JSONLines format (*.json.gz)
  - DDL file for ${resource2}
  - content of ${resource2} in JSONLines format (*.json.gz)
Resources can be packaged when a wheel is built. Support for packaging resources during development and after a package is deployed is provided by the resources module.
Another module, pg_json_dump, provides support for packaging tables as resources in JSONLines format. This format is used natively by some DBMSs.
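A minimal sketch of dumping a table as gzipped JSONLines (illustrative only; the real pg_json_dump module may use a different layout and options):

```python
import gzip
import json


def dump_table(connection, table: str, path: str, batch_size: int = 10000):
    """Write every row of a table as one JSON object per line into *.json.gz."""
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM " + table)  # table name assumed trusted
        columns = [d[0] for d in cursor.description]
        with gzip.open(path, "wt") as out:
            while True:
                rows = cursor.fetchmany(batch_size)
                if not rows:
                    break
                for row in rows:
                    out.write(json.dumps(dict(zip(columns, row)), default=str))
                    out.write("\n")
```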
SQL Utilities
Utilities implementing the following:
- Counting rows in tables
- Finding the name of the column that contains the year in most tables used in the data platform
- Creating a hash for HLL aggregations
- A procedure granting SELECT privileges to a user on all NSAPH tables
- A set of SQL statements to map tables from another database. This can be used to map public tables, available to anybody, into a more secure database containing health data.
- Tables and functions to map between different territorial codes, including USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties.
Territorial Codes Mappings
An important part of the data platform is the set of mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See more information on the Mapping between different territorial codes page.