The duplicates Module

In this module, we attempt to analyze duplicate records in CMS Medicaid data.

Original R function to remove duplicates: [de_duplicate](https://github.com/NSAPH/NSAPHplatform/blob/master/R/intake.R#L29-L45)

This function selects a random record from each set of duplicate records.
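To make the behavior concrete, here is a minimal Python sketch of random de-duplication by key. It is not a port of the R implementation: the function name `pick_random_per_key` and the sample records are hypothetical, and because Python's random number generator differs from R's, the same seed will not reproduce the selections made by the R code.

```python
import random
from collections import defaultdict


def pick_random_per_key(records, key, seed=None):
    """Keep one randomly chosen record for each distinct value of `key`.

    Hypothetical helper for illustration only; not the NSAPH implementation.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # Dicts preserve insertion order, so results follow first appearance of each key.
    return [rng.choice(group) for group in groups.values()]


# Made-up sample data: two records share the same BENE_ID.
demographics = [
    {"BENE_ID": "A", "STATE_CD": "MA"},
    {"BENE_ID": "A", "STATE_CD": "NY"},
    {"BENE_ID": "B", "STATE_CD": "TX"},
]
unique = pick_random_per_key(demographics, "BENE_ID", seed=987987)
```

With a fixed seed the selection is repeatable across runs of this sketch, which is presumably the reason the R call passes an explicit seed.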

Original R code that has been used to create Demographics files: https://github.com/NSAPH/data_model/blob/master/scripts/medicaid_scripts/processed_data/1_create_demographics_data.R

This script calls the de_duplicate function:

de_duplicate(demographics, "BENE_ID", seed = 987987)

If I understand this code correctly, it:

More details about duplicates:

Official documentation: https://www2.ccwdata.org/documents/10280/19002246/ccw-max-user-guide.pdf

See the section "Assignment of a Beneficiary Identifier":

To construct the CCW BENE_ID, the CMS CCW team developed an internal cross-reference file consisting of historical Medicaid and Medicare enrollment information using CMS data sources such as the Enterprise Cross Reference (ECR) file. When a new MAX PS file is received, the MSIS_ID, STATE_CD, SSN, DOB, Gender and other beneficiary identifying information is compared against the historical enrollment file. If there is a single record in the historical enrollment file that “best matches” the information in the MAX PS record, then the BENE_ID on that historical record is assigned to the MAX PS record. If there is no match or no “best match” after CCW has exhausted a stringent matching process, a null (or missing) BENE_ID is assigned to the MAX PS record. For any given year, approximately 7% to 8% of MAX PS records have a BENE_ID that is null. Once a BENE_ID is assigned to a MAX PS record for a particular year (with the exception of those assigned to a null value), it will not change. When a new MAX PS file is received, CCW attempts to reassign those with missing BENE_IDs.

Also, see: https://resdac.org/cms-data/variables/encrypted-723-ccw-beneficiary-id
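Since a null BENE_ID only means that no match was found, records with a null BENE_ID cannot be assumed to belong to the same beneficiary. The following is a minimal sketch of counting duplicates among non-null identifiers only. It assumes records are plain dictionaries and that a null appears as `None` or an empty string; the function name `find_duplicate_ids` and the sample data are hypothetical and do not describe how DuplicatesExplorer works internally.

```python
from collections import Counter


def find_duplicate_ids(records, key="BENE_ID"):
    """Return {id: count} for non-null ids that occur more than once.

    Hypothetical helper for illustration; null ids (None or "") are skipped
    because they do not identify a single beneficiary.
    """
    counts = Counter(
        rec[key] for rec in records if rec.get(key) not in (None, "")
    )
    return {value: n for value, n in counts.items() if n > 1}


# Made-up sample data: "A" is duplicated, two records have no BENE_ID.
records = [
    {"BENE_ID": "A", "MSIS_ID": "1"},
    {"BENE_ID": "A", "MSIS_ID": "2"},
    {"BENE_ID": None, "MSIS_ID": "3"},
    {"BENE_ID": None, "MSIS_ID": "4"},
    {"BENE_ID": "B", "MSIS_ID": "5"},
]
duplicates = find_duplicate_ids(records)
```

In this sample only "A" is reported; the two records with a null BENE_ID are not grouped together.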

class DuplicatesExplorer(arguments)
init()
explore_one(id, cursor)
explore_all()
is_loaded()
load()
save()
report()
find_duplicate_dates(date_type) -> Dict
analyze_inconsistent_age()
run()
args()