A Unified Approach to Cross-Sectional and Longitudinal Imputation with Applications to the Survey of Doctorate Recipients

National Center for Science and Engineering Statistics (NCSES)

Disclaimer

Working papers are intended to report exploratory results of research and analysis undertaken by the National Center for Science and Engineering Statistics (NCSES) at the U.S. National Science Foundation (NSF). Any opinions, findings, conclusions, or recommendations expressed in this working paper do not necessarily reflect the views of NSF. This working paper has been released to inform interested parties of ongoing research or activities and to encourage further discussion of the topic.

NCSES has reviewed this product for unauthorized disclosure of confidential information and approved its release (NCSES-DRN25-016).

Abstract

This working paper describes imputation research with applications to the Survey of Doctorate Recipients (SDR), which provides information about the educational and occupational achievements and career movement of U.S.-trained doctoral scientists and engineers in the United States and abroad. The paper explores a new imputation approach for the SDR that systematically utilizes and maintains reported longitudinal patterns. The paper also summarizes the evaluation of the new cross-sectional imputation for the 2019 SDR cross-sectional data and the 2015–19 longitudinal SDR file.

1. Introduction

Missing data in surveys occur when a respondent does not answer a particular question intentionally (the respondent does not know or refuses to answer), unintentionally (an oversight or misinterpreted skip pattern instructions), or because the respondent completes a short version of the questionnaire that includes only critical items. Imputation of missing data from surveys aims to mitigate the risk of nonresponse bias associated with item nonresponse and to provide data users with a complete data set with no missing values to facilitate data analyses. In addition, the imputation procedures seek to support multivariate inferences by reflecting the relationships among survey variables.

In imputation for longitudinal surveys in particular, it is crucial to maintain longitudinal relationships in the data by preserving temporal patterns and accurately predicting missing data at a given time. The Survey of Doctorate Recipients (SDR) provides information about the educational and occupational achievements and career movement of U.S.-trained doctoral scientists and engineers in the United States and abroad and produces both cross-sectional and longitudinal data. The SDR currently imputes the biennial cross-sectional item missing data through a hot deck method. The first release of longitudinal SDR data in 2022 (LSDR 2015–19) added hot deck imputation of missing years of data for longitudinal cohort cases using longitudinal patterns^{For sample cases that were included in the LSDR 2015–19 and did not complete the 2017 SDR form or the 2019 SDR form, hot deck imputation was conducted to impute all variables in a selected set of 32 critical items.} while taking the cross-sectional imputations as fixed, meaning the cross-sectional data were not changed. Although each cross-sectional imputation utilizes reported data from prior cycles, the current method does not incorporate longitudinal patterns consistently.

The research described in this working paper investigates the effects of the systematic usage of information from past cycles in the cross-sectional imputation and in the longitudinal imputation, which already incorporates longitudinal information. The key goal is to improve donor matching in the cross-sectional imputation and, in turn, enhance the data quality of (1) the cross-sectional data sets by strengthening the imputation accuracy and (2) the longitudinal data sets by reducing the chance of creating unobserved and unverifiable longitudinal patterns. The following steps are involved in this investigation.

Develop a cross-sectional imputation approach that uses longitudinal patterns and test it on identified items using data from the 2019 SDR (section 2.2).
Evaluate the new cross-sectional imputation results relative to the original cross-sectional imputation for the 2019 SDR (section 2.3).
Develop an imputation approach for a longitudinal 2015–19 SDR data set that makes use of the newly imputed cross-sectional data set (section 3.1).
Evaluate the new longitudinal imputation results relative to the original longitudinal imputation that was done for the 2015–19 SDR (section 3.2).
Make recommendations for future implementation of the longitudinal imputation for SDR.

The results of this research can be used to evaluate both the cross-sectional and longitudinal data sets and to provide specific recommendations on extending and modifying the methodology for use in future SDR cycles.

The two methods sections (sections 2 and 3) describe the cross-sectional imputation and longitudinal imputation, respectively. Section 4 concludes with a summary and recommendations.

2. Methods: Cross-Sectional Imputation

The SDR is conducted approximately biennially and provides demographic, education, and career history information from individuals with a U.S. research doctoral degree in a science, engineering, or health field. The majority of data items have low nonresponse. For example, item nonresponse in the SDR 2019 for key employment and demographic items ranged from 0% to 3%. Nonresponse to questions about income was higher and ranged from 13% to 15%.^{Item response statistics are summarized in the SDR 2019 technical notes at https://ncses.nsf.gov/pubs/nsf21320#technical-notes.} For the SDR cross-sectional files, the cross-sectional imputation has been utilizing information available within a cycle while incorporating data from previous cycles where possible, primarily aiming to enhance the accuracy of the imputation within the cross-sectional file. The research described in this working paper investigated the effect of a new approach that uses data from previous cycles systematically. The new approach resembles the processes used for the original cross-sectional imputation but with some modifications, including (1) conducting a separate imputation step for cases grouped by the historic response patterns of a given item and (2) identifying additional class and sort variables to be used in the imputation of each of the items. These modifications were intended to preserve longitudinal patterns observed in the 2015, 2017, and 2019 SDR data reported by respondents.

The exploration focuses on items from questionnaire section A (employment situation) of the 2019 survey, which are of particular interest longitudinally. The original 2019 cross-sectional imputations for questionnaire sections B (past employment), C (other work-related experiences), D (recent educational experiences), and E (demographic) were retained.

Section 2.1 outlines the existing hot deck imputation. Section 2.2 describes the new cross-sectional imputation approach, including the modifications and their implementation using the 2019 SDR data. Section 2.3 compares the data re-imputed using the new approach with the published 2019 SDR data produced using the original approach; the comparisons evaluate item response status patterns and variable distributions before and after imputation (marginal and conditional on the longitudinal information).

2.1 Existing Imputation Process: Hot Deck Imputation

The existing SDR imputation process involved filling in missing values for all survey items for which a response was expected based on the survey questionnaire logic. Imputation was conducted after editing, which involved both filling in values that could be logically inferred and setting some values to missing if there was inconsistency among a sample member’s responses. Imputation was conducted sequentially to account for logic and skip patterns in the questionnaire. Most survey items were imputed using hot deck imputation. Hot deck imputation is a statistical technique that, for each respondent with a missing survey item, identifies another respondent with complete data for that item. The respondent with complete information is a donor, and the complete data are used to impute the missing data. The respondent with the missing survey item is referred to as a recipient. For each imputed item, donors were selected to match recipients based on specific class variables so that the respondent who provided the information had the same responses as the recipient for those variables. As a further step, all cases were sorted by sort variables, so that a recipient’s donor was chosen from a respondent nearby in the sort order (hence, with similar values). The set of cases with the same values in the full set of class variables and the same or similar values in the full set of sort variables is referred to as a cell.
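The donor and recipient matching described above can be illustrated with a minimal sketch. It is not the SDR production code: the function name `hot_deck`, the field names (`id`, `employed`, `years_since_phd`, `sector`), and the rule of taking the positionally nearest donor in the sort order are all assumptions made for illustration.

```python
from collections import defaultdict


def hot_deck(cases, item, class_vars, sort_vars):
    """Return {recipient id: imputed value} using the nearest donor in the same cell."""
    # Group cases into cells: every case in a cell shares all class variable values.
    cells = defaultdict(list)
    for case in cases:
        cells[tuple(case[v] for v in class_vars)].append(case)

    imputed = {}
    for members in cells.values():
        # Order the cell by the sort variables so that neighbors are similar.
        members.sort(key=lambda c: tuple(c[v] for v in sort_vars))
        donor_positions = [i for i, c in enumerate(members) if c[item] is not None]
        for i, case in enumerate(members):
            if case[item] is None and donor_positions:
                nearest = min(donor_positions, key=lambda j: abs(j - i))
                imputed[case["id"]] = members[nearest][item]
    return imputed


# Hypothetical data: case 2 is the recipient; case 4 falls in a different cell.
cases = [
    {"id": 1, "employed": "Y", "years_since_phd": 4, "sector": "academia"},
    {"id": 2, "employed": "Y", "years_since_phd": 3, "sector": None},
    {"id": 3, "employed": "Y", "years_since_phd": 20, "sector": "industry"},
    {"id": 4, "employed": "N", "years_since_phd": 3, "sector": "government"},
]
result = hot_deck(cases, "sector", ["employed"], ["years_since_phd"])
```

In this toy example the recipient takes its value from case 1, the adjacent case in the sort order within the same cell, rather than from case 4, which differs on the class variable.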

Additional details of the original 2019 cross-sectional imputation procedures can be found in the 2019 Survey of Doctorate Recipients Methodology Report, available upon request from the National Center for Science and Engineering Statistics (NCSES). The remainder of this section focuses on the changes made to the cross-sectional imputation for the questionnaire section A items.

2.2 Methods for Questionnaire Section A Survey Item Re-Imputation

All questionnaire items in section A, regarding employment, were re-imputed with the exception of the following:

Critical complete questions (which were required to have a response for a case to be considered “complete”),
Variables that were never missing after editing, and
Verbatim texts (where the respondent provided answers to open-ended questions).

Appendix A lists each of the 108 variables from section A of the questionnaire that were re-imputed using the new approach for hot deck imputation (see section 2.2.1 Hot Deck Imputation) and two variables from section A of the questionnaire that were re-imputed using the new approach for random imputation (see section 2.2.2 Random Imputation). Random imputation was retained for the two variables related to the most and second-most important reasons for taking a postdoc position and the two variables related to the most and second-most important reasons for working outside the field of highest degree. For re-imputation, a hot deck imputation procedure was used in all but a few instances. Details of the exceptions are provided in section 2.2.2 Random Imputation.

2.2.1 Hot Deck Imputation

As described earlier, the primary statistical imputation technique used in the existing SDR imputation process was a hot deck procedure, which involves classifying individual cases into cells with similar cases using class and sort variables. The hot deck imputation was still the primary technique in the new approach. Imputation was performed within cells, guaranteeing that the donor and recipient had the same value for all class variables.

As with the original imputation, filter variables were used to determine the skip patterns for the interview and thus whether a question was asked at all. For example, reasons for not working are asked only if the respondent reports being unemployed, so employment status is the filter variable for retirement status. No changes were made to the filter variables.

Class variables were used to maintain consistency among related variables. For example, working status was used as a class variable for whether or not the respondent had retired. In general, class variables from the original 2019 cross-sectional imputation were retained, but additional variables were included for many of the items in the section A questionnaire to preserve longitudinal relationships.

Within a particular cell, all individuals were ordered using sort variables. Sort variables were chosen based on being related to the variable to be imputed. Again, the 2019 re-imputation retained the sort variables from the original 2019 cross-sectional imputation but introduced additional sort variables.

Once all of the cases were grouped into cells and sorted by the relevant sort variables, a case with a missing value was given the same response as the observation nearest to it in the ordered sort without violating the donor use limit^{Imputation donors that reached the use limit were excluded from the list of available donors so that the response data from a donor were not used more than three times.} (the donor limit was three within the cells). Within cells, serpentine sorting^{Serpentine sorting for hot deck imputation is described in detail in https://support.sas.com/resources/papers/proceedings-archive/SUGI95/Sugi-95-182%20Carlson%20Cox%20Bandeh.pdf.} was implemented using the sort variables. This approach to sorting helped ensure that cases near each other in the sort order had similar values for the sort variables.
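The following sketch illustrates serpentine sorting and a donor use limit under stated assumptions. The functions `serpentine_sort` and `pick_donor`, the field names, and the tie-breaking rule are hypothetical and are not taken from the SDR implementation; only the limit of three uses per donor comes from the text.

```python
from itertools import groupby


def serpentine_sort(rows, keys):
    """Order rows so that each key reverses direction when a new group of the preceding keys starts."""
    descending = [False] * len(keys)  # current sort direction at each key level

    def recurse(subset, level):
        if level == len(keys):
            return list(subset)
        key = keys[level]
        ordered = sorted(subset, key=lambda r: r[key], reverse=descending[level])
        descending[level] = not descending[level]  # the next group at this level runs the other way
        out = []
        for _, group in groupby(ordered, key=lambda r: r[key]):
            out.extend(recurse(list(group), level + 1))
        return out

    return recurse(rows, 0)


def pick_donor(ordered, recipient_index, item, use_count, limit=3):
    """Return the nearest donor in sort order that has not yet reached the use limit."""
    candidates = [
        i for i, row in enumerate(ordered)
        if row[item] is not None and use_count.get(row["id"], 0) < limit
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda i: abs(i - recipient_index))
    donor = ordered[best]
    use_count[donor["id"]] = use_count.get(donor["id"], 0) + 1
    return donor


# Hypothetical rows with two sort variables, a and b.
rows = [{"a": 1, "b": 2}, {"a": 2, "b": 1}, {"a": 1, "b": 1}, {"a": 2, "b": 2}]
order = [(r["a"], r["b"]) for r in serpentine_sort(rows, ["a", "b"])]
```

With two sort variables, the second variable runs in ascending order within the first group and in descending order within the next, so cases on either side of a group boundary remain close in both variables.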

The new imputation approach deviated from the existing approach to preserve reported longitudinal patterns whenever possible in two ways:

First, by running the imputation separately by item response status history. For a given item, (1) recipients (cases with the item missing) were divided into three groups based on their item response status from each of the three SDR cycles (2015, 2017, 2019), (2) a set of donors was identified for each group of recipients, and (3) the item in each group was imputed separately using hot deck imputation with the associated set of donors.

Table 1 specifies how the recipients were grouped and how the donor pool was determined for each recipient group.

For a given item,

Recipient group 1 included item nonrespondents in 2019 who responded to the item in 2015 and 2017. The item for these recipients was imputed using values reported by item respondents in 2019 who responded to the item in 2015 and 2017 as well (referred to as donor pool 1).
Recipient group 2 included item nonrespondents in 2019 who did not respond to the item in 2015 (due to item nonresponse or having yet to be sampled into the survey) but responded in 2017. The item for these recipients was imputed using values reported by item respondents in 2019 who responded to the item in 2017 as well, regardless of their item response status in 2015 (referred to as donor pool 2). It should be noted that donor pool 2 contains all donors in donor pool 1.
Recipient group 3 included all the other item nonrespondents in 2019. They did not respond to the item in 2017 (due to item nonresponse or having yet to be sampled into the survey), irrespective of their item response status in 2015. The item for these recipients was imputed using values reported by any item respondents in 2019, regardless of their item response status in either 2017 or 2015 (referred to as donor pool 3). It should be noted that donor pool 3 contains all donors in donor pool 2.
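The grouping rules above can be written compactly as follows. This is an illustrative sketch only; the function names and the Boolean encoding of item response status (True for reported, False for not reported) are assumptions.

```python
def recipient_group(resp_2015, resp_2017, resp_2019):
    """Return recipient group 1, 2, or 3 for a 2019 item nonrespondent, else None."""
    if resp_2019:
        return None  # item respondents in 2019 are potential donors, not recipients
    if resp_2017:
        return 1 if resp_2015 else 2
    return 3  # did not respond in 2017, whatever the 2015 status


def donor_pools(resp_2015, resp_2017, resp_2019):
    """Return the set of donor pools that a 2019 item respondent belongs to."""
    if not resp_2019:
        return set()
    pools = {3}  # any 2019 item respondent
    if resp_2017:
        pools.add(2)  # also responded in 2017
        if resp_2015:
            pools.add(1)  # responded in all three cycles
    return pools
```

The nesting of the pools is visible in `donor_pools`: membership in pool 1 implies membership in pool 2, which in turn implies membership in pool 3.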

Table 1. Inclusion conditions for the three recipient groups and their associated donor pools

(Imputation conditions)

N = did not report; Y = reported; X = included.

^a Recipient group 1 included item nonrespondents in 2019 who responded to the item in 2015 and 2017.
^b Recipient group 2 included item nonrespondents in 2019 who did not respond to the item in 2015 but responded in 2017.
^c Recipient group 3 included all the other item nonrespondents in 2019. They did not respond to the item in 2017, irrespective of their item response status in 2015.

Note(s):

Item nonresponse in 2015 and 2017 was due to item nonresponse or having yet to be sampled into the survey.

Second, by updating the class and sort variables. A variable for the time of doctorate award (prior to July 2013, July 2013 to June 2015, and July 2015 to June 2017) was introduced. Also, for most of the items for recipient groups 1 and 2, location in or outside of the United States from 2015, 2017, or both years; working for pay or profit during the reference week from 2015, 2017, or both years; and the variable to be imputed from 2015, 2017, or both years were used in the imputation process systematically. These new variables were introduced as class variables whenever possible or placed as early as possible in the list of sort variables to respect the longitudinal pattern. For recipient group 3, the class and sort variables from the original imputation were used.

2.2.2 Random Imputation

Most of the missing values in questionnaire section A were re-imputed using hot deck imputation. However, the random imputation^{For these missing items, the most and second-most important reasons were imputed by a simple random selection from the subset of reported reasons.} was retained for the most and second-most important reasons for taking a postdoc position and for working outside the field of highest degree since these items required significant manual work and are not key longitudinal survey items.

Since primary and secondary work activities are items of particular interest longitudinally, an alternative approach was implemented for their re-imputation using two steps of the hot deck method:

First, the primary and secondary work activities were imputed using hot deck with the indicators for individual work activities from the 2019 cycle and primary and secondary work activities from 2017 or 2015 as sort variables. The first step of this imputation imputed the majority of missing items appropriately. However, a fraction of the imputed values were invalid (e.g., a primary activity that is not an activity the individual performs was imputed).
Second, for those cases with invalid imputed values, the primary and secondary work activities were imputed one case at a time using hot deck from a smaller pool of donors. Given the recipient's individual work activity items, the donor pool included only cases with plausible values for primary and secondary work activities. The number of matched work activity items between each recipient and each of its potential donors was used as a sort variable to find a similar case. For example, suppose the recipient's work activities are basic research, computer programming, and managing or supervising people or projects (i.e., the response is “Yes” to those three individual work activity items). In this case, the donor pool includes only respondents whose primary work activity is one of the three activities, and the donor is selected from among cases with the largest number of matched individual work activity items.

As a result, longitudinal relationships were preserved where possible, and the variable set was imputed more efficiently.
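The second step can be sketched as follows. The function `second_step_donor`, the dictionary layout, and the activity names are hypothetical. Treating a donor as plausible only when its secondary activity (if any) is also among the recipient's activities is an assumption, as is breaking ties by taking the first eligible donor.

```python
def second_step_donor(recipient_activities, donors):
    """Pick the plausible donor with the most matching Yes/No work activity items."""
    performed = {name for name, yes in recipient_activities.items() if yes}

    def is_plausible(donor):
        # The donor's primary (and secondary, if any) activity must be one the recipient performs.
        secondary_ok = donor["secondary"] is None or donor["secondary"] in performed
        return donor["primary"] in performed and secondary_ok

    def matched_items(donor):
        # Count the activity items on which donor and recipient gave the same answer.
        return sum(
            donor["activities"].get(name, False) == yes
            for name, yes in recipient_activities.items()
        )

    eligible = [d for d in donors if is_plausible(d)]
    return max(eligible, key=matched_items) if eligible else None


# Hypothetical recipient who performs three activities.
recipient = {"basic_research": True, "computer_programming": True,
             "managing": True, "teaching": False}
donors = [
    {"id": "d1", "primary": "teaching", "secondary": None,
     "activities": {"basic_research": True, "computer_programming": True,
                    "managing": True, "teaching": True}},
    {"id": "d2", "primary": "basic_research", "secondary": None,
     "activities": {"basic_research": True, "computer_programming": False,
                    "managing": False, "teaching": True}},
    {"id": "d3", "primary": "managing", "secondary": "basic_research",
     "activities": {"basic_research": True, "computer_programming": True,
                    "managing": True, "teaching": False}},
]
chosen = second_step_donor(recipient, donors)
```

Here donor d1 is excluded because its primary activity is one the recipient does not perform, and d3 is preferred over d2 because it matches on all four activity items.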

2.2.3 Creating and Assembling Variables for the 2019 SDR Re-Imputation

In order to redo the 2019 SDR imputation for questionnaire section A, several variables were accumulated to define the new class and sort variables. In general, to be used as a class or sort variable for a different survey item, a variable had to have been imputed prior to the imputation of the item considered. The complete list of SDR SAS variable names corresponding to survey questions is available in the annotated survey questionnaire.

The SDR variables from past cycles were also used as class and sort variables for the continuing cohort. A prefix of “L_” was added to the variable name to indicate 2017 past SDR variables, and a prefix of “L2_” was added to the variable name to indicate 2015 past SDR variables. As mentioned earlier, they, along with some other variables, were used as class variables whenever possible or placed as early as possible in the list of sort variables to maintain the longitudinal pattern.
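The prefix convention can be illustrated with a short sketch. The helper `attach_prior_cycles` and the variable name `ITEM_A` are hypothetical placeholders, not actual SDR names.

```python
def attach_prior_cycles(current, prior_2017, prior_2015):
    """Merge a case's prior-cycle values into its 2019 record using the L_ and L2_ prefixes."""
    merged = dict(current)
    merged.update({f"L_{name}": value for name, value in prior_2017.items()})
    merged.update({f"L2_{name}": value for name, value in prior_2015.items()})
    return merged


# Hypothetical case: the item is missing in 2019 but was reported in both earlier cycles.
record = attach_prior_cycles({"ITEM_A": None}, {"ITEM_A": "Y"}, {"ITEM_A": "N"})
```

The prefixed values can then serve as class or sort variables when the 2019 item is imputed.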

Otherwise, the imputation implementation (e.g., imputation order, imputation flag creation) largely followed the original 2019 imputation implementation.

2.3 Evaluation of New Imputation Method in Cross-Sectional File

Re-imputed data using the new approach were compared to the published 2019 SDR data generated using the original approach. The comparison involved an examination of the post-imputation distributions of variables (both marginal and conditional on the longitudinal information) and estimates, focusing on a subset of the data tables. This section provides a summary of the results from these comparisons. A comprehensive set of results from the analyses will be presented in the final report.

2.3.1 Marginal Distribution

To evaluate whether the new approach changes the marginal distribution in the cross-sectional file, the marginal distribution of each of the 108 newly imputed variables was examined with two sets of tests.

Goodness-of-fit chi-square tests (unweighted and weighted), treating the distribution using the existing approach as fixed, were used to detect differences in the post-imputation marginal distribution between two versions of each variable: one using the existing approach and the other using the proposed approach, both overall and by frame source (2015, 2017, 2019). Continuous variables were grouped into quintiles. It should be noted that these tests are more liberal than those that account for the uncertainty of the estimated distribution based on the existing approach. Any significant differences detected by these tests might not be significant if more conservative tests were performed.
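A minimal unweighted version of such a test is sketched below. The counts and proportions are invented for illustration, the function name is hypothetical, and the weighted tests mentioned above are not covered. The critical value 9.488 is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom, which applies to a variable with five categories such as quintile groups.

```python
def gof_chi_square(new_counts, reference_proportions):
    """Pearson goodness-of-fit statistic, treating the reference distribution as fixed."""
    total = sum(new_counts)
    statistic = 0.0
    for observed, proportion in zip(new_counts, reference_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic


CRITICAL_VALUE_DF4_ALPHA05 = 9.488  # chi-square 0.95 quantile with 4 degrees of freedom

# Hypothetical counts under the new approach and proportions under the existing approach.
new_counts = [210, 195, 200, 205, 190]
existing_proportions = [0.2, 0.2, 0.2, 0.2, 0.2]
statistic = gof_chi_square(new_counts, existing_proportions)
significant = statistic > CRITICAL_VALUE_DF4_ALPHA05
```

Because the reference proportions are treated as known rather than estimated, the statistic ignores their sampling variability, which is why the tests are described as liberal.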

Out of the 108 newly imputed variables, the newly imputed primary and secondary work activities exhibited statistically significant differences when compared with variables imputed using the current approach. Although these differences are relatively small, the updated imputation method used for the primary and secondary variables aligns their joint distribution more closely with individual work activity variables, as noted in section 2.3.2 Conditional Distribution on Reported Longitudinal Information. Specifically, given that an individual is engaged in a particular work activity (reported or imputed), the probability of the primary and secondary work activities being imputed to that same activity is now closer to the probabilities derived from reported values (e.g., proportion of respondents with primary work activities in managing or supervising people or projects given the respondent answered Yes to that individual work activity item). Table 2 presents these probabilities for each of the 14 individual work activities, demonstrating the closer alignment in the newly imputed primary work activities, specifically for applied research, managing or supervising people or projects, professional services, and other work activities.

Table 2. Probability of the primary work activity being the same as a specific work activity, given that the individual is performing the specific activity

(Percent)

Note(s):

Survey respondents mark Yes or No for each work activity. Among the reported work activities, one is reported as the primary work activity. When an individual is engaged in a particular work activity (reported or imputed), the probability of the primary work activities being imputed to that same activity can be estimated and presented in the table as the conditional probability.

Source(s):

National Center for Science and Engineering Statistics, Survey of Doctorate Recipients: Longitudinal Data: 2015–19.
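The conditional probability summarized in table 2 can be computed along the following lines. This sketch is unweighted, and the function name and record layout are assumptions. Whether the published figures use survey weights is not stated here.

```python
def primary_given_activity(cases, activity):
    """Percent of cases whose primary activity is `activity`, among cases engaged in it."""
    engaged = [c for c in cases if c["activities"].get(activity)]
    if not engaged:
        return None
    hits = sum(c["primary"] == activity for c in engaged)
    return 100.0 * hits / len(engaged)


# Hypothetical cases: four are engaged in applied research, one has it as the primary activity.
example_cases = [
    {"primary": "applied_research", "activities": {"applied_research": True, "teaching": True}},
    {"primary": "teaching", "activities": {"applied_research": True, "teaching": True}},
    {"primary": "teaching", "activities": {"applied_research": True, "teaching": True}},
    {"primary": "managing", "activities": {"applied_research": True, "managing": True}},
    {"primary": "teaching", "activities": {"teaching": True}},
]
pct = primary_given_activity(example_cases, "applied_research")
```

Computing this quantity separately for reported cases and for each imputation approach allows the kind of comparison described above.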

For the other variables, utilizing the proposed imputation approach based on data from previous cycles did not result in significant differences in marginal distribution for the 2019 data. This suggests that the proposed imputation approach produces data that are comparable to the published 2019 SDR data in terms of marginal distribution.

2.3.2 Conditional Distribution on Reported Longitudinal Information

The distributions of the newly imputed variables were further examined by conditioning on the reported longitudinal information. This comparison includes only individuals who responded to a particular item in the previous two cycles, 2015 and 2017.

Similar to the tests in section 2.3.1 Marginal Distribution, goodness-of-fit chi-square tests were performed to detect any differences in post-imputation distributions. This resulted in no significant differences, largely because not many cases responded to a particular item in past cycles and did not respond to it in 2019. In addition, any new patterns introduced by imputation were examined. Over the 108 variables, on average, the percentage of new patterns not observed in reported data but introduced by imputation decreased from 5% to less than 1%.
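One way to quantify new patterns introduced by imputation is sketched below. The denominator used here (all imputed cases) is an assumption, since the exact definition behind the reported percentages is not given in this section; the function name and pattern encoding are also hypothetical.

```python
def pct_new_patterns(reported_patterns, imputed_patterns):
    """Percent of imputed cases whose longitudinal pattern never occurs among reported cases."""
    observed = set(reported_patterns)
    if not imputed_patterns:
        return 0.0
    new = sum(pattern not in observed for pattern in imputed_patterns)
    return 100.0 * new / len(imputed_patterns)


# Each pattern is a tuple of an item's values in 2015, 2017, and 2019 (hypothetical values).
reported = [("A", "A", "A"), ("A", "B", "B")]
imputed = [("A", "A", "A"), ("A", "B", "A"), ("A", "B", "B"), ("B", "A", "B")]
share = pct_new_patterns(reported, imputed)
```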

3. Methods: Longitudinal Imputation

For the 2015–19 LSDR data file, the longitudinal imputation was limited to imputing missing years of data and incorporated the data from previous cycles in a systematic manner (see section 3.1 Overview of Imputation Approach for details). However, the cross-sectional SDR data, both the respondent-provided values and the imputed values, were used regardless of the source of the data, and the cross-sectional imputation did not systematically utilize the longitudinal information for the 2017 and 2019 cycles.

For the longitudinal imputation research, the 2019 SDR cross-sectional file created through the new imputation approach was used while the general longitudinal imputation approach remained the same. Thus, the imputation approaches in the cross-sectional and longitudinal imputation processes were consistent for the 2019 cycle.

Section 3.1 Overview of Imputation Approach provides an overview of the longitudinal imputation approach. In section 3.2, the newly re-imputed variables from section A of the SDR 2019 questionnaire are compared with the existing 2015–19 LSDR data, which were created based on the published 2019 SDR cross-sectional data.

3.1 Overview of Imputation Approach

Out of the 40,148 cases in the 2015–25 LSDR panel, 36,399 individuals who responded at least once after 2015 or were found to be ineligible in both 2017 and 2019 were included in the LSDR 2015–19 file. Table 3 shows the composition of the sample cases by the response statuses for three SDR cycles (2015, 2017, and 2019), along with the number of cases that need imputation for the last two cycles (2017 and 2019). In general, the LSDR 2015–19 data file used the 2017 cross-sectional SDR data for the 2017 respondents regardless of the source of the data (the respondent-provided values or the imputed values). Similarly, the 2019 cross-sectional SDR data were used for the 2019 respondents. However, as shown in table 3, a total of 8,438 cases required imputation for all the variables for a missing survey cycle on the LSDR files: 6,187 cases for 2019 data and 2,251 cases for 2017 data.

Table 3. Sample cases, by response status in the longitudinal sample of the Survey of Doctorate Recipients: 2015–19

(Number)

IE = ineligible for inclusion; NR = did not respond; R = responded; UE = unknown eligibility.

Note(s):

For each survey year, response status of each sample case is determined and recorded.

Source(s):

National Center for Science and Engineering Statistics, Survey of Doctorate Recipients: Longitudinal Data: 2015–19.

The donor pool consisted of 45,798 individuals who responded in 2017 and 2019 after responding in 2015. Their respondent-provided or imputed values were used in the longitudinal imputation.

Some of the 32 variables in the LSDR file (see appendix B for the list of variables) were grouped to be imputed together. Variables in a block shared the same set of class or sort variables, and one donor was used to impute all the variables in a block for a cycle missing case. Table 4 lists the imputation blocks. It should be noted that the imputation is conducted in the presented order in a given cycle.

Table 4. Imputation blocks of variables to be imputed together in the Survey of Doctorate Recipients

(Variable)

Source(s):

2019 Survey of Doctorate Recipients questionnaire.
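The block-level imputation described before table 4, in which one donor supplies every variable in a block, can be sketched as follows. The function and variable names are hypothetical.

```python
def impute_block(recipient, donor, block_vars):
    """Copy every variable in the block from a single donor so the values stay mutually consistent."""
    filled = dict(recipient)
    for var in block_vars:
        filled[var] = donor[var]
    return filled


# Hypothetical block of two related 2017 variables for a case missing the 2017 cycle.
block = ["occupation_minor_2017", "occupation_major_2017"]
recipient_case = {"id": 10, "occupation_minor_2017": None, "occupation_major_2017": None}
donor_case = {"id": 77, "occupation_minor_2017": "biologist", "occupation_major_2017": "life scientist"}
completed = impute_block(recipient_case, donor_case, block)
```

Taking all values in a block from the same donor avoids combinations, such as a minor occupation group that does not belong to the imputed major group, that could arise if each variable were imputed from a different donor.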

Similar to the cross-sectional imputation, hot deck imputation was used. Ratio imputation was implemented for three continuous variables: salary (annualized), total earned income before deductions in previous year, and tenure year.

Each imputation block included the following variables (or their recoded variables) as class or sort variables in the presented order:

Key outcome variables: Location, labor force status, employer sector, and broad primary occupation code of the two responding cycles before the variables are imputed or of all three cycles after the variables are imputed; and
Target domain variables: Years since the PhD award in 2015, sex, broad field of degree, and race and ethnicity from the 2015 SDR data set.

Class variables included as many key outcome variables as possible, and the rest of the key outcome variables were used as sort variables. In most of the blocks, the variables in the other two cycles corresponding to the variables to be imputed (e.g., the corresponding 2015 and 2019 variables for imputing a 2017 variable) along with other related variables were placed between the key outcome variables and the target domain variables. This was done to account for the variables’ patterns across cycles. For example,

First, the SDR 2017 critical key outcome variables, from 2017 location (inside or outside of the United States) through 2017 employer sector (detailed codes), will be imputed with the following class and sort variables:

Class variable

Sort variable

Location in 2015

Location in 2019

Labor force status in 2015

Labor force status in 2019

Employer sector in 2015 (categorized)

Employer sector in 2019 (categorized)

Job occupation group in 2015 (major)

Job occupation group in 2019 (major)

Years since PhD in 2015

Sex in 2015

U.S. doctorate major group in 2015

Multi-race in 2015 (categorized)

Then, the 2017 occupation codes—job code for principal job (minor group and major group)—will be imputed with the following class and sort variables:

Note: Items with asterisks are the newly added class or sort variables. Items in italics are related to location.

Class variable

Sort variable

Location in 2015

Location in 2017*

Location in 2019

Labor force status in 2015

Labor force status in 2017*

Labor force status in 2019

Employer sector in 2015 (categorized)

Employer sector in 2017 (categorized)*

Employer sector in 2019 (categorized)

Job occupation group in 2015 (major)

Job occupation group in 2019 (major)

Years since PhD in 2015

Sex in 2015

U.S. doctorate major group in 2015

Multi-race in 2015 (categorized)

Afterwards, the number of weeks working in a year and full-time and part-time status for the SDR 2017 will be imputed with the following class and sort variables:

Note: Items with asterisks are the newly added class or sort variables. Items in italics are related to location, and items in bold are related to jobs.

Class variable

Sort variable

Location in 2015

Location in 2017*

Location in 2019

Labor force status in 2015

Labor force status in 2017*

Labor force status in 2019

Employer sector in 2015 (categorized)

Employer sector in 2017 (categorized)*

Employer sector in 2019 (categorized)

Job occupation group in 2015 (major)

Job occupation group in 2017 (major)*

Job occupation group in 2019 (major)

Principal job full time or part time in 2015*

Principal job full time or part time in 2019*

Weeks worked at principal job in 2015*

Weeks worked at principal job in 2019*

Years since PhD in 2015

Sex in 2015

U.S. doctorate major group in 2015

Multi-race in 2015 (categorized)

3.2 Evaluation of New Cross-Sectional Imputation Method in Longitudinal File

The longitudinal estimates calculated using the newly imputed longitudinal data were examined and compared with estimates using the existing longitudinal data. For example, for the employment sector variables, the proportions of doctorate recipients who have not changed their employment sector in the last three cycles were compared by various domains (data table 1-B^{See NCSES 2022: data table 1-B at https://ncses.nsf.gov/pubs/nsf22326/table/1-B.}), based on all cases using the existing imputation approach, cases using the new imputation approach, and only reported cases, respectively. Although the changes are subtle, the estimates under the new imputation approach were closer on average to the estimates based on only reported cases than those under the existing imputation approach. The newly imputed longitudinal data had fewer combinations that were not observed in the reported data.
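A longitudinal estimate of this kind can be sketched as below. The computation is unweighted and the field names are hypothetical; the published estimates in data table 1-B are not reproduced here.

```python
def pct_unchanged(cases, variables):
    """Percent of cases whose value is identical across all listed cycle variables."""
    if not cases:
        return None
    same = sum(len({case[v] for v in variables}) == 1 for case in cases)
    return 100.0 * same / len(cases)


# Hypothetical employment sector values for four cases across three cycles.
sector_vars = ["sector_2015", "sector_2017", "sector_2019"]
panel = [
    {"sector_2015": "academia", "sector_2017": "academia", "sector_2019": "academia"},
    {"sector_2015": "industry", "sector_2017": "industry", "sector_2019": "industry"},
    {"sector_2015": "academia", "sector_2017": "industry", "sector_2019": "industry"},
    {"sector_2015": "government", "sector_2017": "government", "sector_2019": "government"},
]
unchanged = pct_unchanged(panel, sector_vars)
```

Running the same calculation on reported cases only, on the existing imputed file, and on the newly imputed file gives the three sets of estimates that are compared in this section.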

4. Summary and Recommendations

Longitudinal surveys offer insights that cross-sectional surveys cannot, such as the ability to track outcomes over time and assess the potential impacts of policy changes. However, longitudinal data are more complex by design. Particularly, in imputation for longitudinal surveys, it is critical to preserve temporal patterns in addition to accurately predicting missing data at a given time. Through NCSES’s SDR, which produces both cross-sectional and longitudinal data and imputes missing data through two processes—one for cross-sectional files and the other for longitudinal files—the research summarized in this working paper studied the effects of the systematic usage of information from past cycles in both processes.

Systematically incorporating longitudinal information in imputation for the cross-sectional files did not affect the overall distributions of the variables marginally or conditionally in the 2019 cross-sectional file when comparing them to the variable distributions using the existing imputation approach. However, the research findings indicate that the proposed approach slightly reduced the chance of creating new longitudinal patterns when compared with the existing approach. In the longitudinal file, the distribution of changes using the new approach was slightly closer to the distribution using the reported values than the distribution using the existing approach.

This research implemented an alternative imputation approach for primary and secondary work activities to enhance the efficiency and effectiveness of the process. Instead of a random imputation method, the alternative is a two-step approach based on hot deck imputation. The first step uses each of the individual work activity variables, together with the primary and secondary work activities reported in past cycles (if available), as sort variables; this step imputes most of the missing cases. The second step iterates individually through the small number of cases that the first step could not successfully impute and imputes the variables by forming a donor pool of valid cases that share the combination of individual work activities the respondent performs. This modification shifted the distributions of these variables among imputed cases closer to the distributions among reported cases.
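The two-step logic can be sketched as follows. This is a minimal illustration under stated assumptions rather than the production implementation: the function and field names are hypothetical, step 1 borrows from the nearest preceding reported case in sort order, and a step 1 imputation is treated as unsuccessful when the donor's primary activity is not one the recipient performs. The paper does not specify the failure criterion, so that rule is an assumption.

```python
import random

def two_step_impute(cases, rng):
    """Return {case index: (imputed primary activity, step label)}.

    Each case is a dict with 'activities' (set of activities performed),
    'prior' (primary activity reported in a past cycle, or None), and
    'primary' (current primary activity, or None when missing).
    """
    reported = [c for c in cases if c["primary"] is not None]
    results = {}

    # Step 1: sort on the activity indicators and the past-cycle value,
    # then borrow from the nearest preceding reported case.
    order = sorted(
        range(len(cases)),
        key=lambda i: (sorted(cases[i]["activities"]), cases[i]["prior"] or ""),
    )
    last_donor = None
    unresolved = []
    for i in order:
        case = cases[i]
        if case["primary"] is not None:
            last_donor = case
        elif last_donor is not None and last_donor["primary"] in case["activities"]:
            results[i] = (last_donor["primary"], "step 1")
        else:
            unresolved.append(i)

    # Step 2: for each leftover case, draw from reported cases that
    # perform exactly the same combination of activities.
    for i in unresolved:
        pool = [d["primary"] for d in reported
                if d["activities"] == cases[i]["activities"]]
        if pool:
            results[i] = (rng.choice(pool), "step 2")
    return results

cases = [
    {"activities": {"research", "teaching"}, "prior": "research", "primary": "research"},
    {"activities": {"research", "teaching"}, "prior": "research", "primary": None},
    {"activities": {"development"}, "prior": "development", "primary": "development"},
    {"activities": {"management", "teaching"}, "prior": None, "primary": None},
    {"activities": {"management", "teaching"}, "prior": "teaching", "primary": "teaching"},
]
results = two_step_impute(cases, random.Random(0))
```

Here case 1 is resolved in step 1 from an adjacent donor, while case 3 falls through to step 2 because its nearest preceding donor reports an activity that case 3 does not perform.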

Although implementing the new approaches for both the cross-sectional and longitudinal imputation required some additional effort, the processes are similar to the original approaches and were not overly burdensome to execute. Given this, although differences in overall distributions of the imputed variables between the existing and new imputation approaches were small, future SDR data processing cycles should consider:

Replacing the random imputation approach with the hot deck approach for primary and secondary variables, such as reasons for taking a postdoc position, reasons for working in an area outside the field of first U.S. doctoral degree, working activities, and reasons for taking work-related training; and
Adopting the unified approach presented in this research (consistently maintaining longitudinal information when imputing both cross-sectionally and longitudinally) to ensure that temporal patterns are preserved to the extent possible among key outcomes in the SDR.

This research demonstrated potential gains in data quality and in the efficiency of survey data processing when applying a unified approach that fully utilizes reported longitudinal data patterns for both cross-sectional and longitudinal imputation. Before full implementation, future work is recommended that includes a comprehensive evaluation of results and variance estimation that accounts for imputation.

Reference

National Center for Science and Engineering Statistics (NCSES). 2022. Survey of Doctorate Recipients, Longitudinal Data: 2015–19. NSF 22-326. Alexandria, VA: U.S. National Science Foundation. Available at https://ncses.nsf.gov/pubs/nsf22326/.

Notes

1 For sample cases who were included in the LSDR 2015–19 and did not complete the 2017 SDR form or the 2019 SDR form, hot deck imputation was conducted to impute all variables in a selected set of 32 critical items.

2 Item response statistics are summarized in the SDR 2019 technical notes at https://ncses.nsf.gov/pubs/nsf21320#technical-notes.

3 Imputation donors that reach the use limit are excluded from the list of available donors so that the response data from any one donor are not used more than three times.

4 Serpentine sorting for hot deck imputation is described in detail at https://support.sas.com/resources/papers/proceedings-archive/SUGI95/Sugi-95-182%20Carlson%20Cox%20Bandeh.pdf.

5 For these missing items, the most and second-most important reasons were imputed by a simple random selection from the subset of reported reasons.

6 See NCSES 2022: data table 1-B at https://ncses.nsf.gov/pubs/nsf22326/table/1-B.

Suggested Citation

National Center for Science and Engineering Statistics (NCSES). 2025. A Unified Approach to Cross-Sectional and Longitudinal Imputation with Applications to the Survey of Doctorate Recipients. Working Paper NCSES 25-222. Alexandria, VA: U.S. National Science Foundation. Available at https://ncses.nsf.gov/pubs/ncses25222.

Appendix A: List of Re-Imputed Questionnaire Section A Items

Table Appendix-A. List of re-imputed questionnaire section A items

Source(s): 2019 Survey of Doctorate Recipients questionnaire.

Appendix B: List of Variables on LSDR 2015–19

Table Appendix-B. List of variables on the Survey of Doctorate Recipients, Longitudinal Data: 2015–19

Note(s): Survey item ID is not applicable for recode variables.

Source(s): National Center for Science and Engineering Statistics, Survey of Doctorate Recipients: Longitudinal Data: 2015–19.