Dorieh Core Data Platform
Core platform overview
The Core Platform provides generic functionality for the Dorieh Data Platform: APIs and command-line utilities that depend on the infrastructure and the environment. For instance, its components assume the presence of a PostgreSQL DBMS (version 13 or later) and a CWL runtime environment.
Some mapping (or crosswalk) tables are also included in the Core Platform module. These tables contain mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See more information in the Mapping between different territorial codes page.
See also: Managing database connections.
Tool Examples
Examples of tools included in this package are:
- A utility to monitor the progress of long-running database processes, such as indexing
- A utility to infer a database schema and generate DDL from a CSV file
- A utility to link a table to GIS from a CSV file
- A utility to import JSON Lines files into PostgreSQL and export them from it
- A utility to export Parquet files from PostgreSQL
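To make the idea of inferring a schema from a CSV file more concrete, here is a minimal sketch using only the Python standard library. It is not the Dorieh utility and does not reflect its API or its type rules: the function names `infer_type` and `csv_to_ddl`, the table name `example_table`, the sample data, and the three-type mapping are all assumptions made for illustration. Values with leading zeros (such as ZIP codes) are deliberately kept as text so that the zeros are not lost.

```python
import csv
import io
import re

# Hypothetical, simplified type rules; a real tool would cover many more types.
INT_RE = re.compile(r"-?(0|[1-9]\d*)")
NUM_RE = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?")


def infer_type(values):
    """Pick a PostgreSQL type wide enough for all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if non_empty and all(INT_RE.fullmatch(v) for v in non_empty):
        return "BIGINT"
    if non_empty and all(NUM_RE.fullmatch(v) for v in non_empty):
        return "NUMERIC"
    # Anything else, including codes with leading zeros, stays textual.
    return "VARCHAR"


def csv_to_ddl(text, table):
    """Build a CREATE TABLE statement from CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        pg_type = infer_type([row[i] for row in data])
        columns.append(f'    "{name}" {pg_type}')
    return f"CREATE TABLE {table} (\n" + ",\n".join(columns) + "\n);"


# Made-up sample rows, used only to exercise the sketch.
sample = "zip,state,year,pm25\n02138,MA,2015,7.4\n10001,NY,2016,8.1\n"
ddl = csv_to_ddl(sample, "example_table")
print(ddl)
```

A real tool would also need to sample large files rather than read them fully, handle dates and booleans, and sanitize column names.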
Project Structure
The package is under intensive development, so the project structure is in flux.
Top level directories are:
- doc
- resources
- src
- examples
- docker
The doc directory contains documentation.
The resources directory contains resources that must be loaded into the data platform for it to function normally. For example, it contains mappings between US states, counties, FIPS codes and ZIP codes. See details in the Resources section.
The src directory contains the software source code. See details in the Software Sources section.
Software Sources
The directories under src are:
- cwl
- python
- sql
They are described in more detail in the corresponding sections. Here is a brief overview:
cwl contains reusable workflows, packaged as tools that can and should be used by Dorieh pipelines. Examples of such tools are: introspection of CSV files, indexing tables, linking tables with GIS information for easy mapping, creation of a Superset datasource.
sql contains PostgreSQL procedures and functions implemented in the PostgreSQL dialect of SQL/DDL and in the PL/pgSQL language.
python contains Python code. See more details.
Python packages
Modules and subpackages included in the dorieh.platform package are described here.
Resources
Resources are organized in the following way:
- ${database schema}/
  - ddl file for ${resource1}
  - content of ${resource1} in JSON Lines format (*.json.gz)
  - ddl file for ${resource2}
  - content of ${resource2} in JSON Lines format (*.json.gz)
Resources can be packaged when a wheel is built. The resources module provides support for packaging resources both during development and after a package is deployed.
Another module, pg_json_dump, provides support for packaging tables as resources in JSON Lines format. This format is used natively by some DBMSs.
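For readers unfamiliar with the format, the following sketch shows what a gzipped JSON Lines resource file looks like in practice: one JSON object per line, compressed with gzip. It uses only the Python standard library and is not the implementation of pg_json_dump; the helper names `dump_jsonl` and `load_jsonl`, the file name `example_resource.json.gz`, and the column names are hypothetical.

```python
import gzip
import json
import os
import tempfile


def dump_jsonl(rows, path):
    """Write an iterable of dicts as gzipped JSON Lines, one object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")


def load_jsonl(path):
    """Read a gzipped JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


# Illustrative rows with hypothetical column names.
rows = [
    {"state": "MA", "fips2": "25"},
    {"state": "NY", "fips2": "36"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_resource.json.gz")
    dump_jsonl(rows, path)
    restored = load_jsonl(path)

print(restored == rows)
```

Because each line is an independent JSON document, such files can be streamed row by row without loading the whole table into memory.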
SQL Utilities
The SQL utilities implement the following:
- Functions:
  - Counting rows in tables
  - Finding the name of the column that contains the year in most tables used in the data platform
  - Creating a hash for HLL aggregations
- Procedure:
  - A procedure granting SELECT privileges to a user on all tables created or managed by the Dorieh platform
- A set of SQL statements to map tables from another database. This can be used to map public tables, available to anybody, into a more secure database containing health data.
- Tables and functions to map between different territorial codes, including USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties.
Territorial Codes Mappings
An important part of the data platform is the set of mappings between different territorial codes, such as USPS ZIP codes, Census ZCTA codes, FIPS codes for US states and counties, and SSA codes for US states and counties. See more information in the Mapping between different territorial codes page.
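As a rough illustration of what such a crosswalk does, the sketch below converts between USPS state abbreviations and FIPS codes. It relies on the fact that a five-digit county FIPS code is the two-digit state FIPS code followed by a three-digit county code. The in-memory dictionary stands in for the platform's mapping tables, and the function names `county_fips` and `split_county_fips` are hypothetical; they are not the platform's SQL functions.

```python
# Tiny stand-in for a crosswalk table: USPS state abbreviation -> state FIPS.
STATE_FIPS = {"CA": "06", "MA": "25", "NY": "36"}
FIPS_STATE = {fips: usps for usps, fips in STATE_FIPS.items()}


def county_fips(state, county_code):
    """Compose a 5-digit county FIPS code from a state and a county code."""
    return STATE_FIPS[state] + f"{int(county_code):03d}"


def split_county_fips(fips5):
    """Split a 5-digit county FIPS code into (USPS state, 3-digit county)."""
    if len(fips5) != 5 or not fips5.isdigit():
        raise ValueError(f"not a 5-digit FIPS code: {fips5!r}")
    return FIPS_STATE[fips5[:2]], fips5[2:]


# Suffolk County, MA has county code 025; Los Angeles County, CA has 037.
print(county_fips("MA", 25))
print(split_county_fips("06037"))
```

Codes are kept as zero-padded strings rather than integers, since leading zeros (as in California's "06") are significant.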