How to add data to the database
What data are you adding?
There are many ways to add data to the database. We review the following options:
Creating a new data domain with its own pipelines and, optionally, software tools written in a programming language such as Python, Java, R, or PL/pgSQL
Adding a new table:
from a file on the file system
from a remote data source
Adding data to an existing table
Bulk ingesting multiple CSV-like files (many formats are supported) from the local file system to create a lightweight data domain
When creating new tables in the database, one can either manually create a data model together with the required data conversions and transformations, or automatically infer the data structure by sampling the data.
Data modelling vs data introspection
Tools for data modelling are discussed in Data Modelling for NSAPH Data Platform.
Examples of manually created data models are the data models for the Medicare and Medicaid domains. The actual models are defined in Medicare.yaml and Medicaid.yaml, respectively.
To automatically infer the data structure by analyzing sample data and to generate a data model corresponding to the existing structure, one can use the Introspector tool. It can be run as a standalone command-line tool or used via its Python API. Examples of using the Introspector via the API can be found in the EPA pipeline. The Project Loader Tool also uses the Introspector.
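The idea of inferring a data model from sample data can be illustrated with a minimal sketch. This is not the Introspector's implementation or API: the functions `infer_type` and `introspect`, the sample columns, and the SQL type names are all hypothetical and use only the Python standard library. The sketch reads a sample of rows from a CSV stream and picks the narrowest type that fits every non-empty value in each column.

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Return the narrowest SQL type name that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "VARCHAR"
    # Try candidate types from narrowest to widest (hypothetical type names).
    candidates = (("INT", int), ("FLOAT", float), ("DATE", date.fromisoformat))
    for sql_type, parse in candidates:
        try:
            for v in non_empty:
                parse(v)
            return sql_type
        except ValueError:
            continue
    return "VARCHAR"


def introspect(stream, sample_size=1000):
    """Infer a column -> type mapping from the first sample_size rows."""
    reader = csv.DictReader(stream)
    columns = {name: [] for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_size:
            break
        for name in columns:
            columns[name].append(row[name] or "")
    return {name: infer_type(vals) for name, vals in columns.items()}


# Illustrative data only; the column names are made up for this sketch.
sample = io.StringIO(
    "station_id,pm25,obs_date,county\n"
    "101,8.5,2020-01-01,Autauga\n"
    "102,,2020-01-02,Baldwin\n"
)
print(introspect(sample))
```

A naive approach like this has known pitfalls: for example, identifiers with leading zeros would be classified as integers. This is one reason a manually created data model can be preferable for important domains.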
Adding a new data domain
To add a new data domain, one creates a new repository on GitHub or another source control system
Adding data to an existing table
The process of adding data to an existing table is described in NSAPH Data Loader.
Creating a new single table
In many cases, creating a new single table means running a pipeline that first introspects the data in a file (CSV, JSON, FST, and some other formats) and then runs the Data Loader. However, for simple cases one can use the Project Loader Tool either to ingest the data or just to introspect it (introspection alone can be done by using the --dryrun argument).
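The two-step flow described here (introspect first, then load, with an option to stop after introspection) can be sketched as follows. This is an illustration under stated assumptions, not the Data Loader or the Project Loader Tool: SQLite stands in for the platform database, and `guess_type`, `load_table`, and the `dry_run` parameter are hypothetical names chosen to mirror the idea behind the --dryrun argument.

```python
import csv
import io
import sqlite3


def guess_type(values):
    """Pick a SQLite column type that fits all non-empty sample values."""
    for sql_type, parse in (("INTEGER", int), ("REAL", float)):
        try:
            for v in values:
                if v != "":
                    parse(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"


def load_table(connection, table, stream, dry_run=False):
    """Introspect CSV data, then create and fill a table.

    With dry_run=True only the generated DDL is returned and the
    database is left untouched. Table and column names are interpolated
    into SQL, so they are assumed to come from a trusted source.
    """
    rows = list(csv.DictReader(stream))
    if not rows:
        raise ValueError("no data rows to introspect")
    names = list(rows[0].keys())
    types = {n: guess_type([(r[n] or "") for r in rows]) for n in names}
    column_defs = ", ".join(f'"{n}" {types[n]}' for n in names)
    ddl = f'CREATE TABLE "{table}" ({column_defs})'
    if dry_run:
        return ddl
    connection.execute(ddl)
    placeholders = ", ".join("?" for _ in names)
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # Empty strings are stored as NULL.
        [[r[n] if r[n] else None for n in names] for r in rows],
    )
    connection.commit()
    return ddl


# Illustrative data only; the table and column names are made up.
data = "station_id,pm25,county\n101,8.5,Autauga\n102,,Baldwin\n"
conn = sqlite3.connect(":memory:")
print(load_table(conn, "air_quality", io.StringIO(data), dry_run=True))
load_table(conn, "air_quality", io.StringIO(data))
print(conn.execute('SELECT * FROM "air_quality"').fetchall())
```

Running the dry run first makes it possible to review the inferred structure before anything is written, which is the main reason to separate introspection from ingestion.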