# NSAPH Core Data Platform

[Documentation Home](https://nsaph-data-platform.github.io/nsaph-platform-docs/home.html)

```{toctree}
---
maxdepth: 4
hidden:
---
Datamodels
DataLoader
ProjectLoader
TerritorialCodes
SampleQuery
UserRequests
SQLDocumentation
```

```{contents}
---
local:
---
```

## Tool Examples

Examples of tools included in this package are:

* [Universal Data Loader](members/data_loader)
* A [utility to monitor progress of long-running database](members/monitor) processes like indexing
* A [utility to infer database schema and generate DDL](members/introspector) from a CSV file
* A [utility to link a table to GIS](members/link_gis) from a CSV file
* A [wrapper around database connection to PostgreSQL](#module-database-connection-wrapper)
* A [utility to import/export JSONLines](members/pg_json_dump) files into/from PostgreSQL
* An [Executor with a bounded queue](members/executors)

(core-prj-struct)=
## Project Structure

**The package is under intensive development; the project structure is in flux.**

The top-level directories are:

- doc
- resources
- src

The `doc` directory contains documentation. The `resources` directory contains resources that must be loaded into the data platform for its normal functioning, for example, mappings between US states, counties, FIPS and ZIP codes. See details in the [Resources](#resources) section. The `src` directory contains the software source code. See details in the {ref}`core-software-sources` section.

(core-software-sources)=
### Software Sources

The directories under `src` are:

- airflow
- commonwl
- html
- plpgsql
- python
- r
- superset
- yml

They are described in more detail in the corresponding sections. Here is a brief overview:

* **airflow** contains code and configuration for Airflow. Most of the content is deprecated, as it has been transferred to the deployment package or to specific pipelines. However, this directory is intended to contain Airflow plugins that are generic for all NSAPH pipelines.
* **commonwl** contains reusable workflows, packaged as tools that can and should be used by all NSAPH pipelines. Examples of such tools are: introspection of CSV files, indexing tables, linking tables with GIS information for easy mapping, and creation of a Superset datasource.
* **html** is a deprecated directory for HTML documents
* **plpgsql** contains PostgreSQL procedures and functions implemented in the PL/pgSQL language
* **python** contains Python code. See [more details](#python-packages).
* **r** contains R utilities; these will probably be deprecated
* **superset** contains definitions of reusable Superset datasets and dashboards
* **yml** contains various YAML files used by the platform

### Python packages

#### NSAPH Package

This is the main package, containing the majority of the code. The modules and subpackages included in the `nsaph` package are described below.

##### Subpackage for Data Modelling

* `nsaph.data_model`

Implements version 2 of the data modelling toolkit. Version 1 focused on loading already processed data, saved as flat files, into the database. It inferred the data model from the structure of the data files and the accompanying README files, and converted the inferred data model to a database schema by generating the appropriate DDL.

Version 2 focuses on generating the code required to do the actual processing. The main concept is a knowledge domain, or just a domain. The domain model is defined in a YAML file, as described in the [documentation](Datamodels). The main module that processes the YAML definition of the domain is [domain.py](members/domain).
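To make the domain concept concrete, the sketch below shows the general idea of turning a declarative YAML definition into DDL. It is a toy illustration only: the YAML keys (`schema`, `tables`, `columns`) and the `toy_ddl` function are hypothetical and do not reflect the actual domain definition language or the API of [domain.py](members/domain); see [Datamodels](Datamodels) for the real format.

```python
# Toy illustration of the "domain YAML -> DDL" idea.
# The YAML keys below are hypothetical and are NOT the actual
# nsaph domain definition format (see the Datamodels documentation).
import yaml  # PyYAML

TOY_DOMAIN = """
climate:
  schema: climate
  tables:
    temperature:
      columns:
        - zip: VARCHAR(5)
        - date: DATE
        - tmax: NUMERIC
"""


def toy_ddl(domain_yaml: str) -> str:
    """Generate CREATE TABLE statements from the toy domain definition."""
    domain = yaml.safe_load(domain_yaml)
    statements = []
    for spec in domain.values():
        schema = spec["schema"]
        for table, table_def in spec["tables"].items():
            columns = ",\n    ".join(
                f"{name} {dtype}"
                for column in table_def["columns"]
                for name, dtype in column.items()
            )
            statements.append(f"CREATE TABLE {schema}.{table} (\n    {columns}\n);")
    return "\n\n".join(statements)


if __name__ == "__main__":
    print(toy_ddl(TOY_DOMAIN))
```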
Another module, [inserter](members/inserter), handles parallel insertion of the data into domain tables.

Auxiliary modules perform various maintenance tasks. Module [index_builder](members/index_builder) builds indices for a given table or for all tables within a domain. Module [utils](members/utils) provides convenience function wrappers and defines the class DataReader, which abstracts reading CSV and FST files. In other words, DataReader provides a uniform interface for reading columnar files in two (potentially more) different formats.

##### Module Database Connection Wrapper

* `nsaph.db`

Module [db](members/db) is a PostgreSQL connection wrapper. It reads connection parameters from an `ini` file and connects to the database. It can transparently connect over an **ssh tunnel** when required.

##### Loader Subpackage

* `nsaph.loader`

A set of utilities to manipulate data.

Module [data_loader](members/data_loader) implements parallel loading of data into a PostgreSQL database. It is also responsible for loading DDL and for the creation of views, both virtual and materialized.

Module [index_builder](members/index_builder) is a utility to build indices and monitor the build progress.

##### Subpackage to describe and implement user requests [Incomplete]

* `nsaph.requests`

Package `nsaph.requests` contains code that is intended to be used for fulfilling user requests. Its development is currently on hold.

Module [hdf5_export](members/hdf5_export) exports the result of an SQL query as an HDF5 file. The structure of the HDF5 file is described by a YAML request definition.

Module [query](members/query) generates an SQL query from a YAML request definition.

##### Subpackage with miscellaneous utilities

* `nsaph.util`

Package `nsaph.util` contains:

* Support for packaging [resources](#resources) in two modules: [resources](members/resources) and [pg_json_dump](members/pg_json_dump). The latter imports and exports PostgreSQL (pg) tables in JSONLines format.
* Module [net](members/net), which contains one method resolving a host to `localhost`. This method is required by Airflow.
* Module [executors](members/executors), which implements a [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) with a bounded queue. It is used to prevent out-of-memory (OOM) errors when processing huge files, by not loading a whole file into memory before dispatching it for processing.

### YAML files

The majority of these files are data model definitions. For now, they are included in the **nsaph** package because they are used by different utilities and are therefore expected to be stored in a specific location.

Besides data model files, there are YAML files for:

* Conda environments required for NSAPH pipelines. Unless we are able to create a single environment that accommodates all pipelines, we will probably deprecate these files and move them into the corresponding pipeline repositories.
* Sample user requests for future downstream pipelines that create user workspaces from the database. File [example_request.yml](members/example_request.yaml) is used by the [sample request handler](members/hdf5_export).

### Resources

Resources are organized in the following way:

```
- ${database schema}/
  - ddl file for ${resource1}
  - content of ${resource1} in JSON Lines format (*.json.gz)
  - ddl file for ${resource2}
  - content of ${resource2} in JSON Lines format (*.json.gz)
```

Resources can be packaged when a [wheel](https://pythonwheels.com/) is built.
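As a rough illustration of what consuming such a packaged resource involves, the sketch below reads a gzipped JSON Lines file shipped as package data, using only the Python standard library. The package path `nsaph.util.resources_data` and the file name `us_states.json.gz` are hypothetical; the platform's own [resources](members/resources) module, described next, is the supported mechanism.

```python
# Minimal sketch: reading a gzipped JSON Lines resource shipped as package
# data. The package path and file name are hypothetical examples; the
# platform's resources module is the supported way to access resources.
import gzip
import json
from importlib.resources import files


def read_jsonlines_resource(package: str, name: str):
    """Yield one record (dict) per line of a gzipped JSON Lines resource."""
    resource = files(package).joinpath(name)
    with resource.open("rb") as raw:
        with gzip.open(raw, "rt", encoding="utf-8") as text:
            for line in text:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    for record in read_jsonlines_resource("nsaph.util.resources_data", "us_states.json.gz"):
        print(record)
```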
Support for packaging resources during development and after a package is deployed is provided by the [resources](members/resources) module. Another module, [pg_json_dump](members/pg_json_dump), provides support for packaging tables as resources in JSONLines format. This format is used natively by some DBMSs.

### SQL Utilities

Utilities implementing the following:

* [Functions](members/utils.sql):
  * Counting rows in tables
  * Finding the name of the column that contains the year in most tables used in the data platform
  * Creating a hash for [HLL aggregations](https://en.wikipedia.org/wiki/HyperLogLog)
* Procedures:
  * [A procedure](members/utils.sql) granting `SELECT` privileges to a user on all NSAPH tables
  * [A procedure to rename indices](members/rename_indices.sql)
* A set of SQL statements [to map tables from another database](members/map_to_foreign_database.ddl). This can be used to map public tables, available to anybody, into a more secure database containing health data.
* [Tables and functions](members/zip2fips.sql) to [map between different territorial codes](#territorial-codes-mappings), including USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties.

## Territorial Codes Mappings

An important part of the data platform is the set of mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See more information on the [Mapping between different territorial codes](TerritorialCodes) page.

(core-soft-idx)=
## Documentation Indices

* [genindex](genindex)
* [modindex](modindex)