Linking IPUMS-DHS Data to DHS Files
Users may want to link additional variables from the original Demographic and Health Surveys (DHS) files (that are not yet in the IPUMS-Demographic and Health Surveys system) to an IPUMS-DHS data extract. To ensure correct linkage, a unique identifier that is identical in name and in character length, sometimes known as a linking key, must be used. This user note describes how to create unique linking keys and merge an IPUMS-DHS individual file to original DHS individual-level data and how to merge individual-level IPUMS-DHS data to original DHS household-level data.
Single sample
To merge data for a single sample, CASEID can be used to link DHS files (except for the Household Recode File) to IPUMS-DHS Individual (Women's) data. When merging DHS Household files to the IPUMS-DHS Individual data, most samples can be merged using HHID, except for the Special Cases noted below. This is a many-to-one merge: multiple women may come from the same household and thus share the same household record.
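The many-to-one logic can be illustrated with a minimal Python sketch. The records, the `hh_size` variable, and the helper `merge_many_to_one` are hypothetical and are not part of any DHS or IPUMS-DHS tool; the sketch only shows how several women pick up the same household record through HHID.

```python
# Hypothetical records; the values are illustrative, not real DHS data.
women = [
    {"caseid": "1 12  2", "hhid": "1 12", "age": 31},
    {"caseid": "1 12  4", "hhid": "1 12", "age": 19},
    {"caseid": "1 15  1", "hhid": "1 15", "age": 44},
]
households = [
    {"hhid": "1 12", "hh_size": 6},
    {"hhid": "1 15", "hh_size": 3},
]


def merge_many_to_one(individuals, household_rows, key="hhid"):
    """Attach the single matching household row to every individual row."""
    lookup = {}
    for row in household_rows:
        # The household side must be unique on the key for a many-to-one merge.
        if row[key] in lookup:
            raise ValueError(f"duplicate household key: {row[key]!r}")
        lookup[row[key]] = row
    merged = []
    for person in individuals:
        household = lookup.get(person[key], {})  # empty if no match is found
        merged.append({**household, **person})
    return merged


merged = merge_many_to_one(women, households)
for row in merged:
    print(row["caseid"], row["hhid"], row.get("hh_size"))
```

The first two women share household "1 12", so both rows receive the same household values.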
Multiple samples
When using multiple samples (e.g., pooled data analysis), CASEID is not sufficient. Rather, users must use a linking key that incorporates both the unique sample and case identifier (SAMPLE and CASEID).
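A short Python sketch of why CASEID alone is not sufficient in pooled data. The rows are hypothetical and reuse the Ethiopia sample codes from this note; the point is only that the same CASEID can occur in two samples, while the pair (SAMPLE, CASEID) stays unique.

```python
from collections import Counter

# Hypothetical pooled rows: the same CASEID appears in two different samples.
pooled = [
    {"sample": 2311, "caseid": "1 12 35"},
    {"sample": 2312, "caseid": "1 12 35"},
    {"sample": 2312, "caseid": "5 15 26"},
]

# Count how often each candidate key occurs in the pooled data.
caseid_counts = Counter(row["caseid"] for row in pooled)
composite_counts = Counter((row["sample"], row["caseid"]) for row in pooled)

caseid_is_unique = all(n == 1 for n in caseid_counts.values())
composite_is_unique = all(n == 1 for n in composite_counts.values())

print("CASEID unique:", caseid_is_unique)
print("(SAMPLE, CASEID) unique:", composite_is_unique)
```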
Users who are using multiple samples are strongly encouraged to read the section on creating Unique STRATA and PSU variables. Unique STRATA and PSU identifiers must be created in order to have the correct standard errors for each sample.
Creating Unique Identifiers in DHS Public Use Files
To make DHS data files ready for merging with IPUMS-DHS extracts, users must first create the IPUMS-DHS equivalent of the variable SAMPLE within each DHS data file. SAMPLE is a unique ID number for each country and year of survey. Below is a list of the sample ID numbers. Users should create the variable SAMPLE in each original data file that is to be merged.
SAMPLE was created by combining the country's International Organization for Standardization (ISO) code and the number of the survey in the country. In the example below, the ISO code for Ethiopia is 231 and the 2000 survey was the first survey in the country, so the SAMPLE is 2311. The SAMPLE IDs are listed on the codes tab for the variable SAMPLE.
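The arithmetic behind the Ethiopia example can be sketched in Python as follows. The function `make_sample_id` is hypothetical and assumes a single-digit survey number appended to the ISO numeric code, as in the example above; the codes tab for the variable SAMPLE remains the authoritative source for the actual values.

```python
def make_sample_id(iso_code: int, survey_number: int) -> int:
    """Append the survey's sequence number to the country's ISO numeric code."""
    # Assumption: one digit is reserved for the survey number.
    if not 1 <= survey_number <= 9:
        raise ValueError("this sketch assumes a single-digit survey number")
    return iso_code * 10 + survey_number


ethiopia_2000 = make_sample_id(231, 1)  # first Ethiopian survey
ethiopia_2005 = make_sample_id(231, 2)  # second Ethiopian survey
print(ethiopia_2000, ethiopia_2005)
```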
Unique Identifiers in IDHS Files
There are two linking keys in IPUMS-DHS. They are generated by concatenating SAMPLE and either CASEID or HHID (except for several Phase 1 surveys). The first linking key is IDHSPID, which is a unique identifier for the person records. The second linking key is IDHSHID, which is a unique identifier for household records. These IPUMS-DHS linking keys uniquely identify women and households, respectively, across all samples. Refer to tables 1 and 2 for detailed examples of the sequence of linking variables.
Table 1. Person-level linking key (IDHSPID)

Country/Year | SAMPLE ID | CASEID | IDHSPID |
---|---|---|---|
Ethiopia 2000 | 2311 | 1 12 35 | 2311 1 12 35 |
Ethiopia 2005 | 2312 | 5 15 26 | 2312 5 15 26 |

Table 2. Household-level linking key (IDHSHID)

Country/Year | SAMPLE ID | HHID | IDHSHID |
---|---|---|---|
Ethiopia 2000 | 2311 | 1 11 15 | 2311 1 11 15 |
Ethiopia 2005 | 2312 | 3 20 50 | 2312 3 20 50 |
Note that HHID may not be available for Phase 1 surveys. These include: Egypt 1988, Ghana 1988, Kenya 1989, Mali 1987, and Zimbabwe 1989.
For these surveys, create IDHSHID by concatenating SAMPLE, CLUSTERNO, and HHNUM.
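The concatenations described above can be sketched in Python. The three helper functions are hypothetical, and the sketch assumes the identifiers are already strings that carry their own leading blanks, which is how the values in the tables are displayed. The exact padding used for the Phase 1 keys is not stated in this note, so compare the result against IDHSHID in your extract before merging.

```python
def make_idhspid(sample: int, caseid: str) -> str:
    """Person-level key: SAMPLE followed directly by CASEID."""
    return f"{sample}{caseid}"


def make_idhshid(sample: int, hhid: str) -> str:
    """Household-level key: SAMPLE followed directly by HHID."""
    return f"{sample}{hhid}"


def make_idhshid_phase1(sample: int, clusterno: str, hhnum: str) -> str:
    """Phase 1 household key: SAMPLE, then CLUSTERNO, then HHNUM."""
    # Assumption: clusterno and hhnum are passed in already padded as strings.
    return f"{sample}{clusterno}{hhnum}"


# Values taken from tables 1 and 2; the leading blank is assumed to be part
# of the original identifier.
print(make_idhspid(2311, " 1 12 35"))
print(make_idhshid(2312, " 3 20 50"))
```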
Merging Variables from Original DHS Data Files to IPUMS-DHS Data Files
- Obtain original DHS Files
Original DHS files can be downloaded from the DHS Program website (link).
- Create IDHSPID and IDHSHID in Original DHS Files Before Merging to IPUMS-DHS Data
When using multiple original DHS files, it is important to create the unique ID before appending or pooling the data sets, because CASEID values may be duplicated across countries and over time. The IPUMS-DHS staff has provided example syntax for creating IDHSPID and IDHSHID in Stata, SAS, and SPSS.
Note that the R function merge() can merge data frames on multiple variables, so R users do not need to create IDHSHID and IDHSPID.
Stata
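* Replace 2311 (Ethiopia 2000) with the 4-digit code from the codes tab for the variable SAMPLE
gen sample = string(2311)
* Concatenate the sample code and the case identifier into the person-level key
gen idhspid = sample + caseid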
SAS
/* Inside a DATA step; replace 2311 (Ethiopia 2000) with the 4-digit code from the codes tab for the variable SAMPLE */
sample_id = put(2311, 4.);
idhspid = sample_id || caseid;
run;
SPSS
* Replace 2311 (Ethiopia 2000) with the 4-digit code from the codes tab for the variable SAMPLE.
STRING sample_s (A4).
COMPUTE sample_s = STRING(2311, F4).
STRING idhspid (A19).
COMPUTE idhspid = CONCAT(sample_s, caseid).
EXECUTE.
Once you have created IDHSPID or IDHSHID in your DHS data, you can append your data sets and merge them to your extract.
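The order of operations (create the key in each file, then append, then merge) can be sketched in Python. All records, the variable `v_extra`, and the helper `add_idhspid` are hypothetical; the sample codes are the Ethiopia codes used in this note.

```python
def add_idhspid(rows, sample):
    """Create the person-level key in one original file before pooling."""
    return [{**row, "idhspid": f"{sample}{row['caseid']}"} for row in rows]


# Two hypothetical original DHS files that share a CASEID value.
ethiopia_2000 = [{"caseid": " 1 12 35", "v_extra": 7}]
ethiopia_2005 = [
    {"caseid": " 1 12 35", "v_extra": 9},
    {"caseid": " 5 15 26", "v_extra": 4},
]

# Step 1: create the key in each file. Step 2: append the files.
pooled_dhs = add_idhspid(ethiopia_2000, 2311) + add_idhspid(ethiopia_2005, 2312)

# Hypothetical IPUMS-DHS extract that already contains IDHSPID.
extract = [
    {"idhspid": "2311 1 12 35", "age": 28},
    {"idhspid": "2312 1 12 35", "age": 35},
    {"idhspid": "2312 5 15 26", "age": 22},
]

# Step 3: merge one-to-one on IDHSPID, after checking the key is unique.
dhs_by_id = {row["idhspid"]: row for row in pooled_dhs}
if len(dhs_by_id) != len(pooled_dhs):
    raise ValueError("IDHSPID is not unique in the pooled DHS data")

merged = []
for row in extract:
    match = dhs_by_id.get(row["idhspid"])
    merged.append({**row, "v_extra": match["v_extra"] if match else None})

for row in merged:
    print(row)
```

Had the files been appended before the key was created, the two rows with CASEID " 1 12 35" could no longer be told apart.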
Unique STRATA and PSU variables
The DHS uses a complex survey design in which the sample is selected in multiple stages, rather than a simple random sample. Because of this design, users need to incorporate information on the sampling design in their analyses. The survey design variables are STRATA and PSU (along with PERWEIGHT). SPSS, Stata, and SAS have specific commands to account for stratification and sampling probabilities. The survey design variables are specific to each sample. Thus, when using multiple samples, unique, sample-specific values need to be created before the data are pooled. The code to do this is provided below:
Stata
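* Stata is case sensitive: match the variable names below to the names in your data set
egen newpsu = group(sample PSU)
egen newstrata = group(sample DOMAIN)
* Alternatively, define strata from region and urban/rural status: egen newstrata = group(sample REGION URBAN)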
SAS
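/* Inside a DATA step; concatenate rather than add, because numeric addition can give the same value for different SAMPLE and PSU pairs */
newpsu = catx("_", sample, PSU);
newstrata = catx("_", sample, DOMAIN);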
SPSS
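* Concatenate rather than add so that each SAMPLE and PSU pair gets a distinct value (assumes numeric variables).
STRING newpsu newstrata (A25).
COMPUTE newpsu = CONCAT(LTRIM(STRING(sample, F12)), "_", LTRIM(STRING(PSU, F12))).
COMPUTE newstrata = CONCAT(LTRIM(STRING(sample, F12)), "_", LTRIM(STRING(DOMAIN, F12))).
EXECUTE.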
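As a language-neutral illustration, the following Python sketch numbers each distinct (SAMPLE, PSU) and (SAMPLE, DOMAIN) combination, in the spirit of Stata's egen group(). The rows and the helper `group_ids` are hypothetical. The sketch also shows that simply adding SAMPLE and PSU numerically does not guarantee unique values.

```python
def group_ids(rows, *keys):
    """Number each distinct combination of `keys` 1, 2, 3, ... in sorted order."""
    combos = sorted({tuple(row[k] for k in keys) for row in rows})
    numbering = {combo: i for i, combo in enumerate(combos, start=1)}
    return [numbering[tuple(row[k] for k in keys)] for row in rows]


# Hypothetical pooled rows from two samples that reuse PSU and DOMAIN codes.
pooled = [
    {"sample": 2311, "psu": 1, "domain": 1},
    {"sample": 2311, "psu": 2, "domain": 1},
    {"sample": 2312, "psu": 1, "domain": 1},
    {"sample": 2312, "psu": 2, "domain": 2},
]

newpsu = group_ids(pooled, "sample", "psu")
newstrata = group_ids(pooled, "sample", "domain")

# Numeric addition collides: 2311 + 2 and 2312 + 1 both equal 2313.
naive_sum = [row["sample"] + row["psu"] for row in pooled]

print(newpsu)
print(newstrata)
print(naive_sum)
```

Four distinct PSUs receive four distinct identifiers from `group_ids`, while the numeric sum yields only three distinct values.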
Special Cases
There are a few samples in IPUMS-DHS where CLUSTERNO and HHNUM do not uniquely identify households. These special cases are described below. In addition, users may occasionally encounter a few duplicate records in some DHS files, particularly in the oldest files. In these rare circumstances, merging data files may be problematic unless the duplicate records are identified and removed. For example, if there are two households with the same CLUSTERNO and HHNUM, a one-to-many merge to the women's file will fail. Users should identify these records and decide whether to remove them.
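One way to locate such duplicates before merging is sketched below in Python. The household rows and the helper `find_duplicates` are hypothetical; the function only reports which rows share a key and leaves the decision about removal to the user.

```python
from collections import defaultdict


def find_duplicates(rows, keys):
    """Return {key values: [row positions]} for keys that occur more than once."""
    seen = defaultdict(list)
    for index, row in enumerate(rows):
        seen[tuple(row[k] for k in keys)].append(index)
    return {key: idx for key, idx in seen.items() if len(idx) > 1}


# Hypothetical household file in which one CLUSTERNO/HHNUM pair appears twice.
households = [
    {"clusterno": 10, "hhnum": 1},
    {"clusterno": 10, "hhnum": 2},
    {"clusterno": 10, "hhnum": 1},
]

duplicates = find_duplicates(households, ("clusterno", "hhnum"))
print(duplicates)
```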
Kenya 1989:
CLUSTERNO and HHNUM alone do not uniquely identify households. Households can be uniquely identified by also matching on urban/rural status, in addition to CLUSTERNO and HHNUM: use URBAN in the IDHS women's file and HURBRUR in the original DHS household data file.
Mali 1987:
In the household file, CCODESE is the cluster number and CNOMEN is the household number but, unlike in standard merges, these two variables do not uniquely identify records. CNOCON (concession number) must also be used to link to the women's file. In the IDHS women's file, this variable is MLCONUM (SCNOCON in the original DHS data file).
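A three-variable link of this kind can be sketched in Python by first renaming the household-file variables to their women's-file counterparts. The records and `hh_size` are hypothetical, variable names are lowercased for the sketch, and the sketch assumes that the women's file carries the cluster and household numbers as CLUSTERNO and HHNUM.

```python
# Household-file name -> women's-file name, following the description above.
KEY_MAP = {"ccodese": "clusterno", "cnomen": "hhnum", "cnocon": "mlconum"}
LINK_KEYS = ("clusterno", "hhnum", "mlconum")


def rename_keys(row, mapping):
    """Return a copy of `row` with variable names replaced per `mapping`."""
    return {mapping.get(name, name): value for name, value in row.items()}


# Hypothetical rows: two concessions share a cluster and household number.
households = [
    {"ccodese": 4, "cnomen": 1, "cnocon": 1, "hh_size": 5},
    {"ccodese": 4, "cnomen": 1, "cnocon": 2, "hh_size": 8},
]
women = [
    {"clusterno": 4, "hhnum": 1, "mlconum": 2, "age": 30},
    {"clusterno": 4, "hhnum": 1, "mlconum": 1, "age": 25},
]

# Index the renamed household rows by the three-part key.
lookup = {}
for household in households:
    renamed = rename_keys(household, KEY_MAP)
    lookup[tuple(renamed[k] for k in LINK_KEYS)] = renamed

# Many-to-one merge of women to households on the three-part key.
merged = [{**lookup.get(tuple(w[k] for k in LINK_KEYS), {}), **w} for w in women]
for row in merged:
    print(row)
```

Without the concession number, both women would match both household rows.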