Health Data
Pipelines to process CMS data: Medicaid and Medicare
Overview of health data (Medicare and Medicaid)
We use health data provided by the Centers for Medicare & Medicaid Services (CMS).
The data processing pipelines included in this package create a data warehouse of health data (Medicare and Medicaid). They ingest raw data into the database, cleanse it and, when possible, deduplicate it, analyze data quality, and optimize the tables for efficient queries.
Please see the following documents for details:
Data model and processing of Medicaid data
Data model and processing of Medicare data
Tips on querying of Medicaid data
Medicare processing now includes a pipeline that automatically creates QC tables. These tables are used by an Apache Superset dashboard that visualizes the QC results.
Project Structure
Top level directories are:
- doc
- src
The doc directory contains documentation.
The src directory contains the software source code. The directories under src are:
- cwl
- python
CWL
The cwl folder contains reusable workflows, packaged as tools that can and should be used by all NSAPH pipelines.
Each processing step of CMS data is packaged as a standalone tool that can be run individually. Each tool is individually documented. The tools are combined into workflows represented by the medicaid.cwl and medicare.cwl files.
Python
Python packages and modules are described in the Python Package Description document.
Included are utilities to:
Parse FTS format and generate database schema
Data Model for health data
The data model, in YAML format, is used to generate the database schema and
the processing code that ingests data into the database. Read more about
modeling in the Data Modeling document.
The model for raw data is automatically generated by parsing FTS files or analyzing SAS data.
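To illustrate how a declarative model can drive schema generation, here is a minimal sketch. The model layout, the table name `example_enrollments`, and the function `create_table_ddl` are hypothetical and are not taken from this package; the real YAML structure may differ. A plain dictionary stands in for the parsed YAML so that the example needs no third-party libraries.

```python
from typing import Dict, List

# Hypothetical, simplified model; the package's real YAML layout may differ.
MODEL: Dict[str, dict] = {
    "example_enrollments": {
        "columns": {"bene_id": "VARCHAR", "year": "INT", "state": "VARCHAR"},
        "primary_key": ["bene_id", "year"],
    }
}


def create_table_ddl(name: str, table: dict) -> str:
    """Render a CREATE TABLE statement from one table definition."""
    lines: List[str] = [
        f"    {column} {sql_type}" for column, sql_type in table["columns"].items()
    ]
    primary_key = table.get("primary_key")
    if primary_key:
        lines.append(f"    PRIMARY KEY ({', '.join(primary_key)})")
    return f"CREATE TABLE {name} (\n" + ",\n".join(lines) + "\n);"


ddl = create_table_ddl("example_enrollments", MODEL["example_enrollments"])
print(ddl)
```

Keeping the schema in one declarative model means the DDL and the ingestion code can both be derived from the same source, which is the idea the paragraph above describes.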
The following models are defined here:
Medicare processed data. See also Medicare Files Handling
Tables
SQL Views, used internally for data processing
medicare.ps
Combined raw data for patient summaries: medicare._ps
medicare._beneficiaries
medicare._enrollments
SQL
The procedures file addresses the problem that creating the Medicaid eligibility table in a single transaction requires too much time and memory. The stored procedures in this file split the population of this table either by beneficiary or by year and state. Splitting by beneficiary (i.e., using one database transaction per beneficiary) works best.
The functions file contains helper functions that parse dates in the non-standard formats encountered in the raw Medicare files that we have.
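The per-beneficiary splitting can be sketched as follows. This is only an illustration of the idea, not the package's stored procedures: it uses an in-memory SQLite database as a stand-in for the real database, and the table names `raw` and `eligibility`, their columns, and the function `populate_by_beneficiary` are all hypothetical.

```python
import sqlite3


def populate_by_beneficiary(conn: sqlite3.Connection) -> int:
    """Copy rows from raw to eligibility, committing once per beneficiary."""
    ids = [
        row[0]
        for row in conn.execute("SELECT DISTINCT bene_id FROM raw ORDER BY bene_id")
    ]
    for bene_id in ids:
        # One transaction per beneficiary: commit on success, roll back on error.
        with conn:
            conn.execute(
                "INSERT INTO eligibility (bene_id, year, state) "
                "SELECT bene_id, year, state FROM raw WHERE bene_id = ?",
                (bene_id,),
            )
    return len(ids)


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (bene_id TEXT, year INTEGER, state TEXT)")
conn.execute("CREATE TABLE eligibility (bene_id TEXT, year INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO raw VALUES (?, ?, ?)",
    [("A1", 2010, "MA"), ("A1", 2011, "MA"), ("B2", 2010, "NY")],
)
conn.commit()

processed = populate_by_beneficiary(conn)
copied = conn.execute("SELECT COUNT(*) FROM eligibility").fetchone()[0]
print(processed, copied)
```

Because each transaction touches only one beneficiary's rows, the amount of uncommitted work held at any moment stays small, and a failure affects only the beneficiary being processed.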
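As an illustration of this kind of date parsing, the sketch below tries a list of candidate formats in turn. It is written in Python rather than SQL, and the function name `parse_raw_date` and the list of formats are assumptions made for the example; the formats actually handled by the helper functions are not listed here.

```python
from datetime import date, datetime
from typing import Optional

# Assumed candidate formats, for illustration only.
CANDIDATE_FORMATS = ("%Y%m%d", "%d%b%Y", "%m/%d/%Y", "%Y-%m-%d")


def parse_raw_date(value: Optional[str]) -> Optional[date]:
    """Return the first successful parse, or None if no format matches."""
    if value is None:
        return None
    text = value.strip()
    if not text:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    return None


print(parse_raw_date("20100131"))
print(parse_raw_date("31JAN2010"))
```

Returning `None` for unparseable values, instead of raising an error, lets a bulk load continue and leaves the bad values to be counted later during quality checks.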