# Medicare Files Handling ```{toctree} --- maxdepth: 4 hidden: --- members/mcr_sas2yaml.rst members/mcr_sas2db.rst members/fts2yaml.rst members/medicare_yaml.md Medicaid.md ``` ```{contents} --- local: --- ``` ## Processing pipeline ### Medicare Pipeline Steps Current pipeline consists of 5 steps, each represented as a sub-workflow: 1. [Ingest raw data](pipeline/load_raw_medicare) 2. [Process beneficiaries and their enrollment in Medicare](pipeline/medicare_beneficiaries) 3. [Process Admissions](pipeline/medicare_admissions) 4. [Create QC Tables](pipeline/medicare_qc) 5. [Grant `SELECT`](pipeline/grant) privileges (i.e., read access) to all newly created tables (this step is actually a command line tool, not a workflow) ### Ingestion of raw data Ingestion of raw data is incremental, i.e. tables that are already in the database will not be dropped (deleted). However, any table with the name found in input path will be replaced. Please note, that every raw record is identified by the tuple consisting of the original file name and the line number in that file. ```{note} If no raw data is given or `--input` parameters points to a non-existent or empty directory, the pipeline will skip ingestion step and will process the raw data that is already in the database. ``` Ingestion as a part of the data pipeline is only implemented for data in the format as it comes from ResDac. Metadata for ingestion is taken from [FTS](../../../fts) files that accompany ResDac deliverables. ```{important} In case of Medicare data in posession of NSAPH organization, we only have original ResDac data for years 2011-2014 and 2016-2018. Therefore, the pipeline is unable to ingest the data for other years (1999-2010 and 2015). ``` See [](#files-for-1999-to-2010) for more information. See [](#ingesting-raw-files) for processing details ### Processing Data in the Database During in-database processing all tables, views and materialized views are completely replaced. Old tables are dropped and new ones are created from scratch. See [](#combining-raw-files-into-a-single-view) for processing details. ### Medicare Pipeline References See [](pipeline/medicare) for the pipeline code See [Medicare data model definition](members/medicare_yaml.md) for formal data model definition. ## Ingesting Raw Files [Pipeline](pipeline/load_raw_medicare) ### Overview of Ingesting Raw Medicare Files There are two types of tables: * Patient summary, aka enrollment, aka denominator * Inpatient admissions Unfortunately, the structure of medicare files is different for almost every year. Summary files for some years come in pairs: * `mbsf_ab_summary` * `mbsf_d_cmpnts` For other years we have a single file: * `mbsf_abcd_summary` Inpatient admissions files always follow `medpar_all_file` pattern. Columns vary from year to year even for similarly named files, new columns are being added and column names are sometimes changed. To add insult to injury, for years prior to 2011 (1999-2010) we do not have original files, but preprocessed files with patient summary (called denominators) and admissions. They are in SAS 7BDAT format, however columns are also different for different years. ### Storing raw data in the Database Given the difference in file structures we create a separate table for every file. However, to make it easier to join these tables we: * Add a column containing original file name to every table * Add generated columns with uniform names for: * Year * State * Bene_Id * Zip code Originally, these data is stored in columns with the following possible names: "bene_id": (None, ["bene_id", "intbid", "qid", "bid_5333*"]), "state": (None, ["state", "ssa_state", "state_code", "bene_rsdnc_ssa_state_cd", "state_cd", "medpar_bene_rsdnc_ssa_state_cd"]), "zip": (None, ["zip", "zipcode", "bene_zip_cd", "bene_zip", "bene_mlg_cntct_zip_cd", "medpar_bene_mlg_cntct_zip_cd"]), "year": (None, ["year", "enrolyr", "bene_enrollmt_ref_yr", "rfrnc_yr"]) When a table has no natural primary key (admission tables) we add a record number column. This column ha sno meaning but allows to trace a record to the original data. ### Files for 1999 to 2010 These files are in SAS7BDAT format. They have been stored on RCE in two directories: * denominator * inpatient One file per year. SAS7BDAT format contains metadata with column names and types. We use this metadata to generate appropriate database schema. See the code for handling these files: * [Metadata and data model](members/mcr_sas2yaml.rst) * [Ingesting data](members/mcr_sas2db.rst) ### Files for Years 2011 and later These files are original files from ResDac. They come in Fixed Width Format (FWF). For each file the structure is described in File Transfer Summary (FTS) file. Unfortunately these files are intended for reading by a human and is difficult to parse automatically. A [partial parser](members/fts2yaml.rst) that relies on a known file type is implemented in Python. The information extarcted by the parser is used to: * Generate data model (database schema) * Generate metadata for the FWF Reader ## Combining raw files into a single view [Pipeline](pipeline/medicare_beneficiaries) ### Eventual database schema Once all raw files are ingested into the database they are combined into two views: 1. Patient summary (aka MBSF, aka Beneficiary summary) 2. Inpatient Admissions (aka hospitalizations, aka medpar) The figure below visualizes the database schema. ```{image} medicare-db.png --- width: 600 --- ``` The tables above are defined in [Medicare data model definition](members/medicare_yaml.md). This file uses [](DataModellingExtensions.md). ### CWL workflows The [full pipeline](pipeline/medicare) consists of two steps: 1. Creating [beneficiary federated summary and enrollments table](pipeline/medicare_beneficiaries) 2. Creating [inpatient admissions table](pipeline/medicare_admissions) ### Creating Federated Patient Summary The federated patient summary view is created in two steps, though the division into steps is purely technical. The reasons are given some limitations of readability in SQL. This step uses data modelling extensions described in [](DataModellingExtensions.md). These steps are part of [](pipeline/medicare_beneficiaries) #### First step: Initial in-database data conditioning The first step creates a view called `medicare.ps`. This step technically combines all `cms.mbsf_ab*` and `cms.mcr_bene_*` tables into a single view using `CREATE VIEW` SQL statement. It also cleanses and conditions data from the following columns: * `year` * If it is a string in original file, it is converted to integer * If it is two-digit, it is converted to 4 digit * `dob`: converted to SQL `DATE` type, from either character or SAS numeric form * `dod` (date of death): converted to SQL `DATE` type, from either character or SAS numeric form * `age` as recorded in the raw data. It is the beneficiary's age on the last day of the prior year * `sex` * `race` * `rti_race` Research Triangle Institute race code * `hmo_indicators` Monthly Medicare Advantage (MA) enrollment indicator * `hmo_cvg_count` Number of months the beneficiary was enrolled * `yob` year of birth calculated with the age variable (year - age) * `state`: added a column with text state id * `ssa2`: Social Security Administration (SSA) two digit code for state * `ssa3`: Social Security Administration (SSA) three digit code for county * `fips2`: added a column with two digit state FIPS code * `zip`: if original file uses 9-digit zip code, it is split into two separate columns, 5 digit `zip` and 4-digit `zip4`. The value is also converted to integer value. * `zip4`: added, when available - the last four digits of 9-digit zip code The following [CWL tool](pipeline/medicare_combine_tables) is responsible to perform it. #### Second step: Mapping to county FIPS codes At the second step, a view called `medicare._ps` is created. The only difference between `medicare.ps` and `medicare._ps` is that the latter has county FIPS code (`fips3` column) inferred either SSA county code (`ssa3` column), if it is available or from the zip code (`zip` column) if SSA county code is absent. The reason this has to happen in a separate second step is that both `ssa3` and `zip` are being cleansed in the first step. The second step is performed by a general loader utility based on the [Medicare data model definition](members/medicare_yaml.md). ### Creating Beneficiaries table This is also part of [](pipeline/medicare_beneficiaries) See also [creating Medicaid Beneficiaries table](Medicaid.md#beneficiaries) This is also a two steps operation. The first step creates an SQL view and the second step stores the data as a real table. Essentially it is a `medicare.ps` view grouped by beneficiary id (`bene_id` column). This step also takes care of documenting any discrepancies in the data related to: * dob * dod * race * race_rti * sex If there is any discrepancy for a given `bene_id`, then: * The earliest _**DOB**_ is selected as `dob` * The latest _**DOD**_ (date of death) is selected as `dod` * A comma-separated string containing all race codes is used for `race` * A comma-separated string containing all race codes is used for `race_rti` * comma-separated string containing all sex codes is used for `sex` The following columns are added: * `duplciates`: a numeric column showing the number of inconsistent values for this beneficiary. If it is greater than 1, it means there is a discrepancy in the data for this beneficiary * `dob_latest`: the latest DOB found in the records for this beneficiary. The value of this column is NULL for consistent records * `dod_earliest`: the earliest DOD found in the records for this beneficiary. The value of this column is NULL for consistent records * Beneficiary id HLL hash (`bene` column), to be used for `approximate count distinct` queries. [See more](UsingHLL.md) This topic is discussed in more details in the [Medicaid documentation](Medicaid.md#deduplication-and-data-cleansing) ### Creating Enrollments table This is also part of [](pipeline/medicare_beneficiaries) #### Enrollments overview Enrollments table contains information about yearly beneficiaries enrollments in different states and tracks changes in eligibility (i.e. beginning of the eligibility and beneficiaries death) and changes in states and addresses. See also [Medicaid Enrollments](Medicaid.md#enrollments) and [Medicaid Eligibility](Medicaid.md#eligibility) tables. Please note, that since Medicare eligibility is not as volatile as Medicaid eligibility, i.e. it does not usually change month to month, there is no direct analog to [Medicaid Eligibility](Medicaid.md#eligibility) table. As most of the other tables, **Enrollments** table is created in two steps. The first step creates an SQL view and the second step stores the data as a real table, adds primary key and builds indices to make queries more efficient. #### Enrollments Primary key (unique identifier) - bene_id - year - state In other words, a record in the table describes a given beneficiary leaving in a given state during a given year. If beneficiary has moved from one state to another during the year, more than one record for such a beneficiary will be created in the table. This is consistent with [Medicaid Enrollments](Medicaid.md#enrollments), though, arguably, makes less sense for Medicare. #### Enrollments data cleansing Beneficiaries can move during a year therefore address columns can have multiple values. These columns are: * `fips2`: state FIPS code * `fips3`: county FIPS code * `ssa2`: SSA state code * `ssa3`: SSA county code * `zip`: beneficiary address zip code The policy for all of this columns is the following: * For corresponding column in the enrollments table, an arbitrary but deterministic value is selected * An additional column is added, containing the list of all encountered values The additional columns are: * `ssa2_list` * `ssa3_list` * `residence_counties` * `zips` Additional columns reflecting data quality and cleansing are added to the **Enrollments** tables: * `state_count`: Number of states associated with the given beneficiary in a given year * `fips3_is_approximated`: A boolean column, indicating whether the value was taken from original record as is or approximated. If true, it means that there was no valid county code in the original ResDac record, hence, the county code was inferred from other data (in most cases, zip code) * `fips3_valdiated`: A boolean column indicating that the value of county code is consistent with the values of state code and zip code. #### Enrollments columns definitions The following columns are created for Enrollments: * `ssa2`: SSA state code * `ssa3`: SSA county code * `ssa2_list`: list of all SSA county codes * `state_iso`: ISO code of the state, used for mapping * `residence_county`: one of the "latest" residence counties where the beneficiary was registered, latest in alphabetical order * `residence_counties`: comma separated list of all "latest" residence counties, where a beneficiary was registered during the year * `fips5`: 5 digit FIPS code of the `residence_county` * `zip`: one of the "latest" zip codes where the beneficiary was registered, latest in numerical order * `zips`: comma separated list of all "latest" zip codes, where a beneficiary was registered during the year * `state_count`: number of states, where the beneficiary was enrolled in medicaid during the year. Note, this is also the number of records for this beneficiary and this year in the Enrollments` table. * `died`: a boolean flag indicating that the beneficiary has died during this year while being registered for medicaid in this state. * `hmo_indicators`: the maximum value of all the monthly hmo indicators * `hmo_cvg_count`: the number of months the beneficiary was enrolled in a Medicare Advantage (MA) * `fips3_is_approximated`: A boolean column, indicating whether the value was taken from original record as is or approximated. If true, it means that there was no valid county code in the original ResDac record, hence, the county code was inferred from other data (in most cases, zip code) * `fips3_valdiated`: A boolean column indicating that the value of county code is consistent with the values of state code and zip code. * Beneficiary id HLL hash (`bene` column), to be used for `approximate count distinct` queries. [See more](UsingHLL.md) ### Creating Federated Admissions view This step is part of [](pipeline/medicare_admissions) This step technically combines all `cms.medpar*` and `cms.mcr_ip_*` tables into a single view using `CREATE VIEW` SQL statement. It also cleanses and conditions data from the following columns: * `year` * If it is a string in original file, it is converted to integer * If it is two-digit, it is converted to 4 digit * `state`: added a column with text state id * `fips2`: added a column with two digit state FIPS code * `zip`: if original file uses 9-digit zip code, it is split into two separate columns, 5 digit `zip` and 4-digit `zip4`. The value is also converted to integer value. * `zip4`: added, when available - the last four digits of 9-digit zip code * `admission_date`: converted to SQL `DATE` type, from either character or SAS numeric form * `discharge_date`: converted to SQL `DATE` type, from either character or SAS numeric form * `adm_day_of_week`: converted to `integer` * Diagnoses: separate columns combined into a single `ARRAY` column (read more about [PostgreSQL Arrays](https://www.postgresql.org/docs/current/arrays.html)) ### Creating Inpatient Admissions table This step is also part of [](pipeline/medicare_admissions) Table with all inpatient admissions billed to Medicare with admission and discharge dates and ICD codes. During this step the following major operations are performed: * Added the following columns: * Admission year, extracted from admission date * Added [HLL hashes](UsingHLL.md) for: * Beneficiary id (`bene` column) * Primary diagnosis at admission (`pd_hll_hash`) * All diagnoses, used for admission (`icd_hll`) * Performed validation of admission data. Records that failed validation are excluded from the resulting `Admissions` table but are stored in a special `medicare_audit.admissions` table, together with the reason for validation failure. We distinguish three reasons for validation failure: * `Primary key`: this indicates missing data, for example: * Missing beneficiary id * Missing admission or discharge date * Missing US State, where the admission happened * `Foreign key`: means that the beneficiary referred in the admission record was not eligible for Medicare in the given year * `Duplicate`: a duplicate record was found. Only one record out of several duplicates is stored in the `admissions` table, others are copied to `medicare_audit.admissions` table. See more information about handling records that have failed validation in: [Data Modeling](../../core-platform/doc/Datamodels.md#invalid-record) ## Creating QC Tables [Pipeline](pipeline/medicare_qc) ### Medicare QC approach QC tables (materialized views to be precise) are created by [Medicare QC Pipeline](pipeline/medicare_qc) Two tables are created: * Enrollments QC * Admissions QC These are aggregate tables, defined in [Medicare data model definition](members/medicare_yaml.md) (qc_enrollments and qc_admissions). In these tables we define dimensions and measures, including count measures and percent measures. ### Enrollments QC Table #### Enrollments QC Table Definition Enrollment QC is roughly defined by the following SQL: ```sql SELECT year, state, zip, fips3, CASE WHEN ( beneficiaries.dob IS NULL) THEN 'MISSING'::TEXT WHEN ( beneficiaries.dob_latest IS NOT NULL) THEN 'AMBIGUOUS':: TEXT ELSE 'CONSISTENT'::TEXT END AS consistent_dob, CASE WHEN ( beneficiaries.dod IS NULL) THEN 'NONE'::TEXT WHEN ( beneficiaries.dod_earliest IS NOT NULL) THEN 'AMBIGUOUS':: TEXT ELSE 'CONSISTENT'::TEXT END AS consistent_dod, CASE WHEN ( beneficiaries.sex ~~ '%,%'::TEXT) THEN 'AMBIGUOUS':: TEXT ELSE 'CONSISTENT'::TEXT END AS consistent_sex, CASE WHEN ( beneficiaries.race ~~ '%,%'::TEXT) THEN 'AMBIGUOUS':: TEXT ELSE 'CONSISTENT'::TEXT END AS consistent_race fips3_is_approximated, fips3_valdiated, state_iso, COUNT(*) AS numrecords, ((# hll_add_agg(bene)))::bigint AS numdistinctbeneficaries, hll_add_agg(bene) AS bene_hll FROM medicare.enrollments natural join medicare.beneficiaries GROUP BY year, state, zip, fips3, consistent_dob, consistent_dod, consistent_sex, consistent_race, fips3_is_approximated, fips3_valdiated; ``` #### Enrollments QC Table Dimensions Therefore, the following QC dimensions are defined: * year, * state, * zip, * fips3, * consistent_dob, * consistent_dod, * consistent_sex, * consistent_race, * fips3_is_approximated, * fips3_valdiated #### Enrollments QC Table Measures In Apache Superset, the following metrics are defined for this table: * Number of consistent beneficiaries: ```sql (#(hll_union_agg(bene_hll) FILTER ( WHERE consistent_dob = 'CONSISTENT' AND consistent_dod <> 'AMBIGUOUS' AND consistent_race = 'CONSISTENT' AND consistent_sex = 'CONSISTENT' )) ) * 100.0 / (#(hll_union_agg(bene_hll))) ``` * Number of distinct beneficiaries ```sql (#(hll_union_agg(bene_hll)))::INT ``` * Number of enrollment records ```sql SUM(numrecords) ``` ### Admissions QC Table #### Admissions QC Table Definition Enrollment QC is roughly defined by the following SQL: ```sql SELECT year, state, zip, reason, state_iso, COUNT(*) AS numrecords, ((# hll_add_agg(bene)))::bigint AS numdistinctbeneficaries, hll_add_agg(bene) AS bene_hll FROM medicare.admissions UNION ALL medicare_audit.admissions GROUP BY year, state, zip, reason; ``` #### Admissions QC Table Dimensions Therefore, the following QC dimensions are defined: * year, * state, * zip, * reason (reason is either literal 'OK' or a reason why a record failed validation) #### Admissions QC Table Measures In Apache Superset, the following metrics are defined for this table: Count metrics: * Number of admission records ```sql SUM(numrecords) ``` * Number of distinct beneficiaries ```sql (#(hll_union_agg(bene_hll)))::INT ``` Percent metrics: * Percent of valid records (passed validation) ```sql (SUM(numrecords) FILTER (WHERE reason = 'OK'))*100.0/SUM(numrecords) ``` * Percent of admission records, for which corresponding enrollment data was not found (failed validation) ```sql (SUM(numrecords) FILTER (WHERE reason = 'FOREIGN KEY'))*100.0/SUM(numrecords) ``` * Percent of duplicate records (failed validation, one apparent admission recorded more than once) ```sql (SUM(numrecords) FILTER (WHERE reason = 'DUPLICATE'))*100.0/SUM(numrecords) ``` * Percent of valid records with missing data (failed validation) ```sql (SUM(numrecords) FILTER (WHERE reason = 'PRIMARY KEY'))*100.0/SUM(numrecords) ```