Medicare Files Handling

Processing pipeline

Medicare Pipeline Steps

Current pipeline consists of 5 steps, each represented as a sub-workflow:

  1. Ingest raw data

  2. Process beneficiaries and their enrollment in Medicare

  3. Process Admissions

  4. Create QC Tables

  5. Grant SELECT privileges (i.e., read access) to all newly created tables (this step is actually a command line tool, not a workflow)

Ingestion of raw data

Ingestion of raw data is incremental, i.e. tables that are already in the database will not be dropped (deleted). However, any table with the name found in input path will be replaced. Please note, that every raw record is identified by the tuple consisting of the original file name and the line number in that file.

Note

If no raw data is given or --input parameters points to a non-existent or empty directory, the pipeline will skip ingestion step and will process the raw data that is already in the database.

Ingestion as a part of the data pipeline is only implemented for data in the format as it comes from ResDac. Metadata for ingestion is taken from FTS files that accompany ResDac deliverables.

Important

In case of Medicare data in posession of NSAPH organization, we only have original ResDac data for years 2011-2014 and 2016-2018. Therefore, the pipeline is unable to ingest the data for other years (1999-2010 and 2015).

See Files for 1999 to 2010 for more information.

See Ingesting Raw Files for processing details

Processing Data in the Database

During in-database processing all tables, views and materialized views are completely replaced. Old tables are dropped and new ones are created from scratch.

See Combining raw files into a single view for processing details.

Ingesting Raw Files

Pipeline

Overview of Ingesting Raw Medicare Files

There are two types of tables:

  • Patient summary, aka enrollment, aka denominator

  • Inpatient admissions

Unfortunately, the structure of medicare files is different for almost every year.

Summary files for some years come in pairs:

  • mbsf_ab_summary

  • mbsf_d_cmpnts

For other years we have a single file:

  • mbsf_abcd_summary

Inpatient admissions files always follow medpar_all_file pattern.

Columns vary from year to year even for similarly named files, new columns are being added and column names are sometimes changed.

To add insult to injury, for years prior to 2011 (1999-2010) we do not have original files, but preprocessed files with patient summary (called denominators) and admissions. They are in SAS 7BDAT format, however columns are also different for different years.

Storing raw data in the Database

Given the difference in file structures we create a separate table for every file. However, to make it easier to join these tables we:

  • Add a column containing original file name to every table

  • Add generated columns with uniform names for:

    • Year

    • State

    • Bene_Id

    • Zip code

Originally, these data is stored in columns with the following possible names:

"bene_id":  (None, ["bene_id", "intbid", "qid", "bid_5333*"]),
"state":  (None, ["state", "ssa_state", "state_code",
                  "bene_rsdnc_ssa_state_cd", "state_cd",
                  "medpar_bene_rsdnc_ssa_state_cd"]),
"zip": (None, ["zip", "zipcode", "bene_zip_cd", "bene_zip",
               "bene_mlg_cntct_zip_cd",
               "medpar_bene_mlg_cntct_zip_cd"]),
"year": (None, ["year", "enrolyr", "bene_enrollmt_ref_yr",
                "rfrnc_yr"])

When a table has no natural primary key (admission tables) we add a record number column. This column ha sno meaning but allows to trace a record to the original data.

Files for 1999 to 2010

These files are in SAS7BDAT format. They have been stored on RCE in two directories:

  • denominator

  • inpatient

One file per year. SAS7BDAT format contains metadata with column names and types. We use this metadata to generate appropriate database schema.

See the code for handling these files:

Files for Years 2011 and later

These files are original files from ResDac. They come in Fixed Width Format (FWF). For each file the structure is described in File Transfer Summary (FTS) file. Unfortunately these files are intended for reading by a human and is difficult to parse automatically. A partial parser that relies on a known file type is implemented in Python. The information extarcted by the parser is used to:

  • Generate data model (database schema)

  • Generate metadata for the FWF Reader

Combining raw files into a single view

Pipeline

Eventual database schema

Once all raw files are ingested into the database they are combined into two views:

  1. Patient summary (aka MBSF, aka Beneficiary summary)

  2. Inpatient Admissions (aka hospitalizations, aka medpar)

The figure below visualizes the database schema.

../../../_images/medicare-db.png

The tables above are defined in Medicare data model definition. This file uses Extensions used for creating federated view of different years.

Creating Federated Patient Summary

The federated patient summary view is created in two steps, though the division into steps is purely technical. The reasons are given some limitations of readability in SQL.

This step uses data modelling extensions described in Extensions used for creating federated view of different years.

These steps are part of Medicare Beneficiaries data in-database processing pipeline

First step: Initial in-database data conditioning

The first step creates a view called medicare.ps.

This step technically combines all cms.mbsf_ab* and cms.mcr_bene_* tables into a single view using CREATE VIEW SQL statement.

It also cleanses and conditions data from the following columns:

  • year

    • If it is a string in original file, it is converted to integer

    • If it is two-digit, it is converted to 4 digit

  • dob: converted to SQL DATE type, from either character or SAS numeric form

  • dod (date of death): converted to SQL DATE type, from either character or SAS numeric form

  • age as recorded in the raw data. It is the beneficiary’s age on the last day of the prior year

  • sex

  • race

  • rti_race Research Triangle Institute race code

  • hmo_indicators Monthly Medicare Advantage (MA) enrollment indicator

  • hmo_cvg_count Number of months the beneficiary was enrolled

  • yob year of birth calculated with the age variable (year - age)

  • state: added a column with text state id

  • ssa2: Social Security Administration (SSA) two digit code for state

  • ssa3: Social Security Administration (SSA) three digit code for county

  • fips2: added a column with two digit state FIPS code

  • zip: if original file uses 9-digit zip code, it is split into two separate columns, 5 digit zip and 4-digit zip4. The value is also converted to integer value.

  • zip4: added, when available - the last four digits of 9-digit zip code

The following CWL tool is responsible to perform it.

Second step: Mapping to county FIPS codes

At the second step, a view called medicare._ps is created.
The only difference between medicare.ps and medicare._ps is that the latter has county FIPS code (fips3 column) inferred either SSA county code (ssa3 column), if it is available or from the zip code (zip column) if SSA county code is absent. The reason this has to happen in a separate second step is that both ssa3 and zip are being cleansed in the first step.

The second step is performed by a general loader utility based on the Medicare data model definition.

Creating Beneficiaries table

This is also part of Medicare Beneficiaries data in-database processing pipeline

See also creating Medicaid Beneficiaries table

This is also a two steps operation. The first step creates an SQL view and the second step stores the data as a real table.

Essentially it is a medicare.ps view grouped by beneficiary id (bene_id column). This step also takes care of documenting any discrepancies in the data related to:

  • dob

  • dod

  • race

  • race_rti

  • sex

If there is any discrepancy for a given bene_id, then:

  • The earliest DOB is selected as dob

  • The latest DOD (date of death) is selected as dod

  • A comma-separated string containing all race codes is used for race

  • A comma-separated string containing all race codes is used for race_rti

  • comma-separated string containing all sex codes is used for sex

The following columns are added:

  • duplciates: a numeric column showing the number of inconsistent values for this beneficiary. If it is greater than 1, it means there is a discrepancy in the data for this beneficiary

  • dob_latest: the latest DOB found in the records for this beneficiary. The value of this column is NULL for consistent records

  • dod_earliest: the earliest DOD found in the records for this beneficiary. The value of this column is NULL for consistent records

  • Beneficiary id HLL hash (bene column), to be used for approximate count distinct queries. See more

This topic is discussed in more details in the Medicaid documentation

Creating Enrollments table

This is also part of Medicare Beneficiaries data in-database processing pipeline

Enrollments overview

Enrollments table contains information about yearly beneficiaries enrollments in different states and tracks changes in eligibility (i.e. beginning of the eligibility and beneficiaries death) and changes in states and addresses.

See also Medicaid Enrollments and Medicaid Eligibility tables. Please note, that since Medicare eligibility is not as volatile as Medicaid eligibility, i.e. it does not usually change month to month, there is no direct analog to Medicaid Eligibility table.

As most of the other tables, Enrollments table is created in two steps. The first step creates an SQL view and the second step stores the data as a real table, adds primary key and builds indices to make queries more efficient.

Enrollments Primary key (unique identifier)

  • bene_id

  • year

  • state

In other words, a record in the table describes a given beneficiary leaving in a given state during a given year. If beneficiary has moved from one state to another during the year, more than one record for such a beneficiary will be created in the table. This is consistent with Medicaid Enrollments, though, arguably, makes less sense for Medicare.

Enrollments data cleansing

Beneficiaries can move during a year therefore address columns can have multiple values. These columns are:

  • fips2: state FIPS code

  • fips3: county FIPS code

  • ssa2: SSA state code

  • ssa3: SSA county code

  • zip: beneficiary address zip code

The policy for all of this columns is the following:

  • For corresponding column in the enrollments table, an arbitrary but deterministic value is selected

  • An additional column is added, containing the list of all encountered values

The additional columns are:

  • ssa2_list

  • ssa3_list

  • residence_counties

  • zips

Additional columns reflecting data quality and cleansing are added to the Enrollments tables:

  • state_count: Number of states associated with the given beneficiary in a given year

  • fips3_is_approximated: A boolean column, indicating whether the value was taken from original record as is or approximated. If true, it means that there was no valid county code in the original ResDac record, hence, the county code was inferred from other data (in most cases, zip code)

  • fips3_valdiated: A boolean column indicating that the value of county code is consistent with the values of state code and zip code.

Enrollments columns definitions

The following columns are created for Enrollments:

  • ssa2: SSA state code

  • ssa3: SSA county code

  • ssa2_list: list of all SSA county codes

  • state_iso: ISO code of the state, used for mapping

  • residence_county: one of the “latest” residence counties where the beneficiary was registered, latest in alphabetical order

  • residence_counties: comma separated list of all “latest” residence counties, where a beneficiary was registered during the year

  • fips5: 5 digit FIPS code of the residence_county

  • zip: one of the “latest” zip codes where the beneficiary was registered, latest in numerical order

  • zips: comma separated list of all “latest” zip codes, where a beneficiary was registered during the year

  • state_count: number of states, where the beneficiary was enrolled in medicaid during the year. Note, this is also the number of records for this beneficiary and this year in the Enrollments` table.

  • died: a boolean flag indicating that the beneficiary has died during this year while being registered for medicaid in this state.

  • hmo_indicators: the maximum value of all the monthly hmo indicators

  • hmo_cvg_count: the number of months the beneficiary was enrolled in a Medicare Advantage (MA)

  • fips3_is_approximated: A boolean column, indicating whether the value was taken from original record as is or approximated. If true, it means that there was no valid county code in the original ResDac record, hence, the county code was inferred from other data (in most cases, zip code)

  • fips3_valdiated: A boolean column indicating that the value of county code is consistent with the values of state code and zip code.

  • Beneficiary id HLL hash (bene column), to be used for approximate count distinct queries. See more

Creating Federated Admissions view

This step is part of Process Medicare inpatient admissions data inside the database

This step technically combines all cms.medpar* and cms.mcr_ip_* tables into a single view using CREATE VIEW SQL statement.

It also cleanses and conditions data from the following columns:

  • year

    • If it is a string in original file, it is converted to integer

    • If it is two-digit, it is converted to 4 digit

  • state: added a column with text state id

  • fips2: added a column with two digit state FIPS code

  • zip: if original file uses 9-digit zip code, it is split into two separate columns, 5 digit zip and 4-digit zip4. The value is also converted to integer value.

  • zip4: added, when available - the last four digits of 9-digit zip code

  • admission_date: converted to SQL DATE type, from either character or SAS numeric form

  • discharge_date: converted to SQL DATE type, from either character or SAS numeric form

  • adm_day_of_week: converted to integer

  • Diagnoses: separate columns combined into a single ARRAY column (read more about PostgreSQL Arrays)

Creating Inpatient Admissions table

This step is also part of Process Medicare inpatient admissions data inside the database

Table with all inpatient admissions billed to Medicare with admission and discharge dates and ICD codes.

During this step the following major operations are performed:

  • Added the following columns:

    • Admission year, extracted from admission date

    • Added HLL hashes for:

      • Beneficiary id (bene column)

      • Primary diagnosis at admission (pd_hll_hash)

      • All diagnoses, used for admission (icd_hll)

  • Performed validation of admission data. Records that failed validation are excluded from the resulting Admissions table but are stored in a special medicare_audit.admissions table, together with the reason for validation failure. We distinguish three reasons for validation failure:

    • Primary key: this indicates missing data, for example:

      • Missing beneficiary id

      • Missing admission or discharge date

      • Missing US State, where the admission happened

    • Foreign key: means that the beneficiary referred in the admission record was not eligible for Medicare in the given year

    • Duplicate: a duplicate record was found. Only one record out of several duplicates is stored in the admissions table, others are copied to medicare_audit.admissions table.

See more information about handling records that have failed validation in: Data Modeling

Creating QC Tables

Pipeline

Medicare QC approach

QC tables (materialized views to be precise) are created by Medicare QC Pipeline

Two tables are created:

  • Enrollments QC

  • Admissions QC

These are aggregate tables, defined in Medicare data model definition (qc_enrollments and qc_admissions).

In these tables we define dimensions and measures, including count measures and percent measures.

Enrollments QC Table

Enrollments QC Table Definition

Enrollment QC is roughly defined by the following SQL:

SELECT 
    year,
    state,
    zip,
    fips3,
    CASE
        WHEN ( 
                beneficiaries.dob IS NULL) 
        THEN 'MISSING'::TEXT
        WHEN ( 
                beneficiaries.dob_latest IS NOT NULL) 
        THEN 'AMBIGUOUS':: TEXT
        ELSE 'CONSISTENT'::TEXT
    END AS consistent_dob,
    CASE
        WHEN ( 
                beneficiaries.dod IS NULL) 
        THEN 'NONE'::TEXT
        WHEN ( 
                beneficiaries.dod_earliest IS NOT NULL) 
        THEN 'AMBIGUOUS':: TEXT
        ELSE 'CONSISTENT'::TEXT
    END AS consistent_dod,
    CASE
        WHEN ( 
                beneficiaries.sex ~~ '%,%'::TEXT) 
        THEN 'AMBIGUOUS':: TEXT
        ELSE 'CONSISTENT'::TEXT
    END AS consistent_sex,
    CASE
        WHEN ( 
                beneficiaries.race ~~ '%,%'::TEXT) 
        THEN 'AMBIGUOUS':: TEXT
        ELSE 'CONSISTENT'::TEXT
    END AS consistent_race
    fips3_is_approximated,
    fips3_valdiated,
    state_iso,
    COUNT(*)                        AS numrecords,
    ((# hll_add_agg(bene)))::bigint AS numdistinctbeneficaries,
    hll_add_agg(bene)               AS bene_hll
FROM 
    medicare.enrollments natural join medicare.beneficiaries
GROUP BY 
    year, 
    state, 
    zip, 
    fips3, 
    consistent_dob, 
    consistent_dod, 
    consistent_sex, 
    consistent_race, 
    fips3_is_approximated, 
    fips3_valdiated;

Enrollments QC Table Dimensions

Therefore, the following QC dimensions are defined:

  • year,

  • state,

  • zip,

  • fips3,

  • consistent_dob,

  • consistent_dod,

  • consistent_sex,

  • consistent_race,

  • fips3_is_approximated,

  • fips3_valdiated

Enrollments QC Table Measures

In Apache Superset, the following metrics are defined for this table:

  • Number of consistent beneficiaries:

    (#(hll_union_agg(bene_hll) FILTER (
            WHERE consistent_dob = 'CONSISTENT'
            AND consistent_dod <> 'AMBIGUOUS'
            AND consistent_race = 'CONSISTENT'
            AND consistent_sex = 'CONSISTENT'
            ))
            ) * 100.0 / (#(hll_union_agg(bene_hll)))
    
  • Number of distinct beneficiaries

    (#(hll_union_agg(bene_hll)))::INT
    
  • Number of enrollment records

    SUM(numrecords)
    

Admissions QC Table

Admissions QC Table Definition

Enrollment QC is roughly defined by the following SQL:

SELECT 
    year,
    state,
    zip,
    reason,
    state_iso,
    COUNT(*)                        AS numrecords,
    ((# hll_add_agg(bene)))::bigint AS numdistinctbeneficaries,
    hll_add_agg(bene)               AS bene_hll
FROM 
    medicare.admissions UNION ALL medicare_audit.admissions
GROUP BY 
    year, 
    state, 
    zip, 
    reason;

Admissions QC Table Dimensions

Therefore, the following QC dimensions are defined:

  • year,

  • state,

  • zip,

  • reason (reason is either literal ‘OK’ or a reason why a record failed validation)

Admissions QC Table Measures

In Apache Superset, the following metrics are defined for this table:

Count metrics:

  • Number of admission records

    SUM(numrecords)
    
  • Number of distinct beneficiaries

    (#(hll_union_agg(bene_hll)))::INT
    

Percent metrics:

  • Percent of valid records (passed validation)

    (SUM(numrecords) FILTER (WHERE reason = 'OK'))*100.0/SUM(numrecords)
    
  • Percent of admission records, for which corresponding enrollment data was not found (failed validation)

    (SUM(numrecords) FILTER (WHERE reason = 'FOREIGN KEY'))*100.0/SUM(numrecords)
    
  • Percent of duplicate records (failed validation, one apparent admission recorded more than once)

    (SUM(numrecords) FILTER (WHERE reason = 'DUPLICATE'))*100.0/SUM(numrecords)
    
  • Percent of valid records with missing data (failed validation)

    (SUM(numrecords) FILTER (WHERE reason = 'PRIMARY KEY'))*100.0/SUM(numrecords)