The data_loader Module

Implements parallel loading of data into a PostgreSQL database. It is also responsible for loading DDL and creating views, both virtual and materialized.

API

Domain Data Loader

Provides a command-line interface for loading data from a single file or a set of column-formatted files into the NSAPH PostgreSQL database.

Input (aka source) files can be either in FST or in CSV format.

class DataLoader(context: Optional[LoaderConfig] = None)[source]

Class for data loader

set_table(table: Optional[str] = None)[source]
print_ddl()[source]
print_table_ddl(table: str)[source]
static execute_sql(sql: str, connxn)[source]
insert_from_select()[source]
is_parallel() → bool[source]
get_connections() → List[connection][source]
get_connection()[source]
get_files() → List[Tuple[Any, Callable]][source]
has_been_ingested(file: str, table)[source]
reset()[source]
drop()[source]
run()[source]
commit()[source]
rollback()[source]
close()[source]
load()[source]
import_data_from_file(data_file)[source]

Configuration

Common options for data manipulation

class DBConnectionConfig(subclass, doc)[source]

Configuration class for connection to a database

Creates a new object

Parameters
  • subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.

  • description – Optional text to use as description. If not specified, then it is extracted from subclass documentation
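The naming convention described above can be illustrated with a minimal sketch. The Argument dataclass, the ExampleOptions class and the collect_arguments helper below are hypothetical stand-ins written for this example only; they are not the library's own classes, and the real configurator may discover options differently.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Argument:
    """Hypothetical stand-in for the library's Argument class."""
    name: str
    help: str
    default: Any = None


class ExampleOptions:
    """Hypothetical concrete class holding option definitions."""
    _table = Argument("table", help="Name of the table to manipulate")
    _verbose = Argument("verbose", help="Generate verbose output", default=False)
    unrelated = "not an option"  # ignored: the name lacks the leading underscore


def collect_arguments(subclass: type) -> Dict[str, Argument]:
    """Return members named with one leading '_' whose value is an Argument."""
    found = {}
    for attr, value in vars(subclass).items():
        # Accept "_name" but skip dunder members such as "__module__"
        single_underscore = attr.startswith("_") and not attr.startswith("__")
        if single_underscore and isinstance(value, Argument):
            found[attr[1:]] = value
    return found


options = collect_arguments(ExampleOptions)
print(sorted(options))
```

Stripping the underscore gives the public option name, which matches how the options are listed in this reference (table, verbose and so on).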

autocommit

Use autocommit

db

Path to a database connection parameters file

connection

Section in the database connection parameters file
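As a rough illustration of how the db and connection options relate, the sketch below reads one named section from a parameters file. It assumes the file uses INI syntax, which this reference does not state; the file name, section name and keys are made up for the example.

```python
import configparser
import tempfile
from pathlib import Path
from typing import Dict


def read_connection_section(db_file: str, section: str) -> Dict[str, str]:
    """Return the key/value pairs of one section of an INI-style file."""
    parser = configparser.ConfigParser()
    if not parser.read(db_file):
        raise FileNotFoundError(db_file)
    if not parser.has_section(section):
        raise KeyError(f"No section [{section}] in {db_file}")
    return dict(parser.items(section))


# Hypothetical file content, written to a temporary directory for the demo
with tempfile.TemporaryDirectory() as tmp:
    ini = Path(tmp) / "database.ini"
    ini.write_text("[example]\nhost = localhost\ndatabase = demo\nuser = loader\n")
    params = read_connection_section(str(ini), "example")

print(params)
```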

verbose

Generate verbose output

dryrun

Dry run: do not modify the database

class DBTableConfig(subclass, doc)[source]

Creates a new object

Parameters
  • subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.

  • description – Optional text to use as description. If not specified, then it is extracted from subclass documentation

table

Name of the table to manipulate

class CommonConfig(subclass, doc)[source]

Abstract base class for configurators used for data loading

Creates a new object

Parameters
  • subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.

  • description – Optional text to use as description. If not specified, then it is extracted from subclass documentation

domain

Name of the domain

registry

Path to the domain registry. The registry is a directory or an archive containing YAML files with domain definitions. The default is to use the built-in registry

Domain Loader Configurator

Intended to configure loading of a single file or a set of column-formatted files into the NSAPH PostgreSQL database. Input (aka source) files can be in either FST or CSV format.

The configurator assumes that the database schema is defined in a YAML or JSON file. A separate tool is available to introspect source files and infer a possible database schema.

class Parallelization(value)[source]

An enumeration.

class DataLoaderAction(value)[source]

An enumeration.

class LoaderConfig(doc)[source]

Configurator class for data loader

Creates a new object

Parameters
  • subclass – A concrete class containing configuration information. Configuration options must be defined as class members whose names start with a single ‘_’ character and whose values are instances of the Argument class.

  • description – Optional text to use as description. If not specified, then it is extracted from subclass documentation

action: Optional[DataLoaderAction]

If this option is given, then the whole domain schema will be dropped

data

Path to a data file or directory. Can be a single CSV, gzipped CSV or FST file or a directory recursively containing CSV files. Can also be a tar, tar.gz (or tgz) or zip archive containing CSV files
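The accepted kinds of input listed above could be told apart by a helper along these lines. This is only a sketch: classify_source is a hypothetical function, and detecting the kind from the file name extension is an assumption rather than the loader's documented behavior.

```python
from pathlib import Path


def classify_source(path: str) -> str:
    """Guess the kind of data source from the path (hypothetical helper)."""
    p = Path(path)
    if p.is_dir():
        return "directory"
    name = p.name.lower()
    if name.endswith((".tar", ".tar.gz", ".tgz")):
        return "tar archive"
    if name.endswith(".zip"):
        return "zip archive"
    if name.endswith(".csv.gz"):
        return "gzipped csv"
    if name.endswith(".csv"):
        return "csv"
    if name.endswith(".fst"):
        return "fst"
    raise ValueError(f"Unsupported data source: {path}")


for example in ["data/a.csv", "data/a.csv.gz", "data/a.fst", "data/all.tgz", "data/all.zip"]:
    print(example, "->", classify_source(example))
```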

reset

Force recreating table(s) if it/they already exist

page

Explicit page size for the database

log

Explicit interval for logging

limit

Load at most specified number of records
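The effect of a record limit can be sketched with the standard library alone. The read_records function is hypothetical and only shows the idea of stopping after a fixed number of rows; it says nothing about how the loader applies the limit internally.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, Optional, TextIO


def read_records(source: TextIO, limit: Optional[int] = None) -> Iterator[Dict[str, str]]:
    """Yield CSV rows as dictionaries, stopping after `limit` rows if given."""
    reader = csv.DictReader(source)
    # islice with a stop of None passes every row through unchanged
    return islice(reader, limit)


sample = "id,state\n1,NY\n2,CA\n3,TX\n"
limited = list(read_records(io.StringIO(sample), limit=2))
everything = list(read_records(io.StringIO(sample)))
print(len(limited), len(everything))
```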

buffer

Buffer size for converting FST files

threads

Number of threads writing into the database

parallelization

Type of parallelization, if any

pattern

Pattern for files in a directory or an archive, e.g., “**/maxdata_*_ps_*.csv”
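To show what the example pattern selects, the sketch below applies it to a small temporary directory tree using pathlib. It assumes the loader follows the usual recursive glob semantics, where "**" matches any number of directory levels including none; the file names are invented for the demonstration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_matching(root: Path, pattern: str) -> List[str]:
    """Return files under root matching the glob pattern, as relative paths."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2010").mkdir()
    for rel in [
        "2010/maxdata_ny_ps_2010.csv",  # matches
        "2010/maxdata_ny_ip_2010.csv",  # no "_ps_" part, so it is skipped
        "2010/other.csv",               # does not start with "maxdata_"
        "maxdata_ca_ps_2011.csv",       # matches at the top level
    ]:
        (root / rel).touch()
    matches = find_matching(root, "**/maxdata_*_ps_*.csv")

print(matches)
```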

incremental

Commit every file and skip over files that have already been ingested
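The incremental behavior described above (commit after each file, skip files already ingested) can be sketched as follows. To keep the example runnable without a server it uses an in-memory SQLite database in place of PostgreSQL, and the ingested and records tables are hypothetical; the loader's actual bookkeeping is not documented here.

```python
import sqlite3
from typing import Dict, List


def load_incrementally(conn: sqlite3.Connection, files: Dict[str, List[int]]) -> List[str]:
    """Load each file once, committing after every file; return files loaded."""
    conn.execute("CREATE TABLE IF NOT EXISTS ingested (file TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS records (value INTEGER)")
    loaded = []
    for name, rows in files.items():
        seen = conn.execute(
            "SELECT 1 FROM ingested WHERE file = ?", (name,)
        ).fetchone()
        if seen:
            continue  # already ingested in an earlier run
        conn.executemany("INSERT INTO records (value) VALUES (?)", [(r,) for r in rows])
        conn.execute("INSERT INTO ingested (file) VALUES (?)", (name,))
        conn.commit()  # one commit per file, so an interrupted run keeps finished files
        loaded.append(name)
    return loaded


conn = sqlite3.connect(":memory:")
first = load_incrementally(conn, {"a.csv": [1, 2], "b.csv": [3]})
second = load_incrementally(conn, {"a.csv": [1, 2], "b.csv": [3], "c.csv": [4, 5]})
total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
conn.close()
print(first, second, total)
```

Recording the file name in the same transaction as its rows means a file is either fully loaded and marked, or not marked at all.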

sloppy

Do not update existing tables and views

validate(attr, value)[source]

Subclasses can override this method to implement custom handling of command line arguments

Parameters
  • attr – Command line argument name

  • value – Value returned by argparse

Returns

value to use
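A subclass overriding validate might look like the sketch below. BaseConfig is a hypothetical stand-in for the library's configurator base class, and the rule applied to the threads option (default of 1, rejection of values below 1) is invented for illustration.

```python
from typing import Any


class BaseConfig:
    """Hypothetical stand-in for the library's configurator base class."""

    def validate(self, attr: str, value: Any) -> Any:
        # Default behavior in this sketch: accept the parsed value unchanged
        return value


class ExampleLoaderConfig(BaseConfig):
    """Hypothetical subclass adding a custom rule for one argument."""

    def validate(self, attr: str, value: Any) -> Any:
        value = super().validate(attr, value)
        if attr == "threads":
            if value is None:
                return 1  # assumed default for this example only
            if value < 1:
                raise ValueError("threads must be a positive integer")
        return value


config = ExampleLoaderConfig()
print(config.validate("threads", None))
print(config.validate("threads", 4))
print(config.validate("table", "demo"))
```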