# Project (Directory) Loading Utility

```{contents}
---
local:
---
```

(project-loader-overview)=
## Overview

[Project Loader](members/project_loader) is a command line tool that introspects a directory containing CSV (or CSV-like, e.g., FST, JSON, SAS) files and ingests its contents into a database. The directory can be structured, i.e., it can have nested subdirectories. All files matching a certain name pattern, at any nested subdirectory level, are included in the data set. The tool can also load a single file if a file rather than a directory is given as the `--data` argument.

In the database, a schema is created based on the given project name. For each file in the data set a table is created. The name of the table is constructed from the relative path of the incoming data file, with OS path separators (e.g., '/') replaced by underscores ('_').

Before actually ingesting data into the database, it might be a good idea to do a [dry run](#dry-runs-introspect-only) and visually examine the database schema created by the introspection utility.

Loading into the database is performed using [Data Loader](members/data_loader) functionality.

## Configuration options

Configuration options are provided by a [LoaderConfig](members/loader_config) object. Usually they are given as command line arguments, but they can also be provided via an API call. Some configuration options can be provided in the registry YAML file. By default, if the registry does not exist, a new YAML file is created with the following parameters:

* `header: True` (CSV files are expected to have a header line)
* `quoting: QUOTE_MINIMAL` (only strings containing whitespace are expected to be quoted)
* `index: "unless excluded"` (indices are built for every column unless the column is explicitly excluded)

See [Domain options](Datamodels.md#domain) for descriptions of these parameters. Once a registry file is created, it can be manually edited by the user; manual modifications are preserved for subsequent runs. A sketch of these defaults as they would appear in a registry file is shown below.
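For illustration only, the defaults above correspond to a YAML fragment like the following. This is a hypothetical sketch: the actual auto-created registry also contains the table definitions produced by introspection, and its exact layout follows the [domain schema specification](Datamodels).

```yaml
# Hypothetical sketch of the default options in an auto-created registry.
# The real file also contains the introspected table definitions.
header: True              # CSV files have a header line
quoting: QUOTE_MINIMAL    # only strings containing whitespace are quoted
index: "unless excluded"  # index every column unless explicitly excluded
```

## Usage from command line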
```
python -u -m nsaph.loader.project_loader [-h] [--drop]
       [--data DATA [DATA ...]]
       [--pattern PATTERN [PATTERN ...]] [--reset] [--incremental]
       [--sloppy] [--page PAGE] [--log LOG] [--limit LIMIT]
       [--buffer BUFFER] [--threads THREADS]
       [--parallelization {lines,files,none}] [--dryrun]
       [--autocommit] [--db DB] [--connection CONNECTION]
       [--verbose] [--table TABLE] --domain DOMAIN
       [--registry REGISTRY]

optional arguments:
  -h, --help            show this help message and exit
  --drop                Drops domain schema, default: False
  --data DATA [DATA ...]
                        Path to a data file or directory. Can be a single
                        CSV, gzipped CSV or FST file or a directory
                        recursively containing CSV files. Can also be a tar,
                        tar.gz (or tgz) or zip archive containing CSV files,
                        default: None
  --pattern PATTERN [PATTERN ...]
                        pattern for files in a directory or an archive, e.g.
                        `**/maxdata_*_ps_*.csv`, default: None
  --reset               Force recreating table(s) if they already exist,
                        default: False
  --incremental         Commit every file and skip over files that have
                        already been ingested, default: False
  --sloppy              Do not update existing tables, default: False
  --page PAGE           Explicit page size for the database, default: None
  --log LOG             Explicit interval for logging, default: None
  --limit LIMIT         Load at most the specified number of records,
                        default: None
  --buffer BUFFER       Buffer size for converting FST files, default: None
  --threads THREADS     Number of threads writing into the database,
                        default: 1
  --parallelization {lines,files,none}
                        Type of parallelization, if any, default: lines
  --dryrun              Dry run: do not load any data, default: False
  --autocommit          Use autocommit, default: False
  --db DB               Path to a database connection parameters file,
                        default: database.ini
  --connection CONNECTION
                        Section in the database connection parameters file,
                        default: nsaph2
  --verbose             Verbose output, default: False
  --table TABLE, -t TABLE
                        Name of the table to manipulate, default: None
  --domain DOMAIN       Name of the domain
  --registry REGISTRY   Path to domain registry. Registry is a directory or
                        an archive containing YAML files with domain
                        definitions. Default is to use the built-in
                        registry, default: None
```

## Sample command

The following command creates a schema named `my_schema` and loads tables from all files with extension `.csv` found recursively under the directory `/data/incoming/valuable/data/`:

```
python -u -m nsaph.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --registry my_temp_schema.yaml --reset --pattern *.csv --db database.ini --connection postgres
```

It uses the `database.ini` file in the current directory (where the program is started) and a section named `postgres` inside it. It creates the temporary file `my_temp_schema.yaml`, also in the current directory. If such a file already exists, it is loaded and the settings found in it override the defaults. The `--reset` option deletes any existing tables with the same names and recreates them.

The following is the same command, but with parallel execution using four threads writing into the database and with an increased page size for database writes. This variant is better suited for hosts with more RAM:

```
python -u -m nsaph.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --reset --registry my_temp_schema.yaml --pattern *.csv --db database.ini --connection postgres --threads 4 --page 10000
```

To load a single file, one can use a command like this:

```
python -u -m nsaph.loader.project_loader --domain my_schema --data /data/incoming/valuable/test_file.csv --registry my_temp_schema.yaml --reset --db database.ini --connection postgres
```

## Dry runs (introspect only)

To just introspect the files in a directory and generate a YAML schema for the project (see the [domain schema specification](Datamodels) for a description of the format) without any modifications to the database, use a dry run: on the command line, simply give the `--dryrun` option. A dry run creates a "registry" file that can be manually examined and modified.

The following variant of the command described [above](#sample-command) performs a dry run:

```
python -u -m nsaph.loader.project_loader --domain my_schema --data /data/incoming/valuable/data/ --registry my_temp_schema.yaml --dryrun --pattern *.csv
```

This command creates a file named `my_temp_schema.yaml`. A sketch of how it can be inspected programmatically follows.
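After a dry run, the generated registry can be examined programmatically as well as visually. A minimal sketch, assuming PyYAML is installed and the dry run above produced `my_temp_schema.yaml` in the current directory; the top-level layout assumed here (a domain name mapping to a definition with a `tables` section) follows the domain schema specification:

```python
import yaml

# Load the registry produced by the dry run and list the tables that
# Project Loader would create. Sketch only: the structure of the
# generated YAML is an assumption based on the domain schema spec.
with open("my_temp_schema.yaml") as f:
    registry = yaml.safe_load(f)

for domain, definition in registry.items():
    print(f"Domain: {domain}")
    for table in (definition.get("tables") or {}):
        print(f"  table: {table}")
```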
## API Usage

A basic example of API usage, with the configuration taken from command line arguments:

```python
from nsaph.loader.project_loader import ProjectLoader

loader = ProjectLoader()
loader.run()
```

More advanced usage:

```python
from nsaph.loader.loader_config import LoaderConfig
from nsaph.loader.project_loader import ProjectLoader

# Parse command line arguments, then override the file pattern
config = LoaderConfig(__doc__).instantiate()
config.pattern = "**/*.csv.gz"
loader = ProjectLoader(config)
loader.run()
```
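A dry run can similarly be requested through the API. The following is a minimal sketch that assumes `LoaderConfig` mirrors the command line options as attributes (the `dryrun` attribute standing in for the `--dryrun` flag is an assumption, by analogy with `pattern` above):

```python
from nsaph.loader.loader_config import LoaderConfig
from nsaph.loader.project_loader import ProjectLoader

# Sketch: programmatic dry run. Assumes LoaderConfig exposes CLI
# options as attributes (e.g. `dryrun`), as `pattern` does above.
config = LoaderConfig(__doc__).instantiate()
config.dryrun = True        # introspect only, do not load any data
config.pattern = "**/*.csv"
loader = ProjectLoader(config)
loader.run()
```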