The duplicates Module
In this module we attempt to analyze duplicate records in CMS Medicaid data.
Original R function to remove duplicates: [de_duplicate](https://github.com/NSAPH/NSAPHplatform/blob/master/R/intake.R#L29-L45)
The function selects a random record from each set of duplicate records.
Original R code that was used to create the Demographics files: https://github.com/NSAPH/data_model/blob/master/scripts/medicaid_scripts/processed_data/1_create_demographics_data.R
It calls the de_duplicate function:
de_duplicate(demographics, "BENE_ID", seed = 987987)
If I understand this code correctly, it:
- Removes all beneficiaries that have multiple records with inconsistent death dates (el_dod): https://github.com/NSAPH/data_model/blob/master/scripts/medicaid_scripts/processed_data/1_create_demographics_data.R#L50. See also https://github.com/NSAPH/data_requests/tree/master/request_projects/dec2019_medicaid_platform_cvd#varying-death-dates
- For the remaining beneficiaries with multiple records, randomly selects a single record: https://github.com/NSAPH/data_model/blob/master/scripts/medicaid_scripts/processed_data/1_create_demographics_data.R#L44, as stated here: https://github.com/NSAPH/data_requests/tree/master/request_projects/dec2019_medicaid_platform_cvd#variation-in-demographic-information-across-years
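
The two steps can be illustrated with a minimal Python sketch. This is not the original R code: the function name `de_duplicate_sketch`, the dictionary-based record layout, and the `state` field are hypothetical, and the sketch assumes that a missing death date counts as a distinct value when checking for inconsistency. Python's random generator also differs from R's, so the same seed will not reproduce the selections made by the R script.

```python
import random
from collections import defaultdict


def de_duplicate_sketch(records, key="BENE_ID", dod="el_dod", seed=987987):
    """Return one record per beneficiary (hypothetical helper, not the R original).

    Beneficiaries whose records disagree on the death date are dropped entirely;
    for every other beneficiary one record is chosen at random.
    """
    # Group records by beneficiary identifier.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    rng = random.Random(seed)  # seeded so repeated runs give the same selection
    result = []
    for bene_id in sorted(groups):  # fixed iteration order for reproducibility
        group = groups[bene_id]
        # Assumption: None (missing) is treated as its own death-date value.
        death_dates = {rec[dod] for rec in group}
        if len(death_dates) > 1:
            continue  # inconsistent death dates: remove the beneficiary
        result.append(rng.choice(group))
    return result


records = [
    {"BENE_ID": "A", "el_dod": None, "state": "MA"},
    {"BENE_ID": "A", "el_dod": None, "state": "NY"},
    {"BENE_ID": "B", "el_dod": "2010-01-01", "state": "TX"},
    {"BENE_ID": "B", "el_dod": "2011-05-05", "state": "TX"},
    {"BENE_ID": "C", "el_dod": None, "state": "CA"},
]
unique = de_duplicate_sketch(records)
kept_ids = [rec["BENE_ID"] for rec in unique]
print(kept_ids)
```

In this toy input, beneficiary B is removed because its two records carry different death dates, while A keeps one of its two records and C keeps its only record.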
More details about duplicates:
- Within states: https://github.com/NSAPH/data_requests/tree/master/request_projects/dec2019_medicaid_platform_cvd#duplicates-within-states
- Across states: https://github.com/NSAPH/data_requests/tree/master/request_projects/dec2019_medicaid_platform_cvd#duplicates-across-states
Official documentation: https://www2.ccwdata.org/documents/10280/19002246/ccw-max-user-guide.pdf
Section "Assignment of a Beneficiary Identifier":
To construct the CCW BENE_ID, the CMS CCW team developed an internal cross-reference file consisting of historical Medicaid and Medicare enrollment information using CMS data sources such as the Enterprise Cross Reference (ECR) file. When a new MAX PS file is received, the MSIS_ID, STATE_CD, SSN, DOB, Gender and other beneficiary identifying information is compared against the historical enrollment file. If there is a single record in the historical enrollment file that “best matches” the information in the MAX PS record, then the BENE_ID on that historical record is assigned to the MAX PS record. If there is no match or no “best match” after CCW has exhausted a stringent matching process, a null (or missing) BENE_ID is assigned to the MAX PS record. For any given year, approximately 7% to 8% of MAX PS records have a BENE_ID that is null. Once a BENE_ID is assigned to a MAX PS record for a particular year (with the exception of those assigned to a null value), it will not change. When a new MAX PS file is received, CCW attempts to reassign those with missing BENE_IDs.
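
The assignment rule described in this passage (a unique best match yields a BENE_ID, anything else yields a null) can be sketched as follows. The actual CCW matching criteria are not given here, so the scoring used below (a count of identifying fields that agree), the function name `assign_bene_id`, and the field name `GENDER` are all assumptions made purely for illustration.

```python
def assign_bene_id(ps_record, historical,
                   fields=("MSIS_ID", "STATE_CD", "SSN", "DOB", "GENDER")):
    """Return the BENE_ID of the unique best match, or None (hypothetical sketch)."""

    def score(hist):
        # Assumed scoring: number of identifying fields that agree exactly.
        return sum(
            1
            for f in fields
            if ps_record.get(f) is not None and ps_record.get(f) == hist.get(f)
        )

    scored = [(score(h), h) for h in historical]
    best = max((s for s, _ in scored), default=0)
    if best == 0:
        return None  # no match at all
    top = [h for s, h in scored if s == best]
    if len(top) != 1:
        return None  # several equally good candidates: no single "best match"
    return top[0]["BENE_ID"]


historical = [
    {"BENE_ID": "X1", "SSN": "111", "DOB": "1950-01-01", "GENDER": "F"},
    {"BENE_ID": "X2", "SSN": "222", "DOB": "1950-01-01", "GENDER": "F"},
]
matched = assign_bene_id({"SSN": "111", "DOB": "1950-01-01", "GENDER": "F"}, historical)
tied = assign_bene_id({"SSN": "999", "DOB": "1950-01-01", "GENDER": "F"}, historical)
unmatched = assign_bene_id({"SSN": "999", "DOB": "1900-01-01", "GENDER": "M"}, historical)
print(matched, tied, unmatched)
```

The tied and unmatched cases both return `None`, which corresponds to the null BENE_ID mentioned in the quoted section.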
Also, see: https://resdac.org/cms-data/variables/encrypted-723-ccw-beneficiary-id