Ethics statement
This study was approved by the institutional review board of the VA St. Louis Health Care System, which granted a waiver of informed consent (protocol number 1606333).
Study design and setting
This study was conducted using the electronic health databases of the US Department of Veterans Affairs. The VA operates the largest nationwide integrated healthcare system in the United States, including 1,321 healthcare facilities (172 medical centers and 1,138 outpatient sites) serving more than 9 million US veterans each year. Veterans enrolled in the VA healthcare system have access to a comprehensive array of medical services, including outpatient care, inpatient care, prescriptions, mental care, home healthcare, primary care, specialty care, geriatric and extended care, rehabilitation services, medical equipment and prosthetics.
Data sources
The electronic medical databases of the VA house comprehensive information on outpatient and inpatient encounters, laboratory test results and medications during routine healthcare encounters and are updated daily
25. Data sources also included the VA Beneficiary Identification Record Locator System, the Medicare Vital Status File, the Social Security Administration Master File and the National Cemetery Administration. We used the inpatient and outpatient domains of the VA Corporate Data Warehouse databases to obtain information on diagnoses and procedures
34,35,36. The outpatient pharmacy and barcode medication administration domains were used to obtain data on pharmacy records. The laboratory results domain was used to obtain data on laboratory measurements. Test results on SARS-CoV-2 infection were obtained from VA COVID-19 Shared Data Resource, which consisted of results from polymerase chain reaction tests, antigen tests conducted in the VA or tests reported to the VA
37. Medicare inpatient and outpatient data were from the VA Centers for Medicare and Medicaid Services (CMS). The area deprivation index (ADI) was used as a summary metric of contextual socioeconomic disadvantage (income, education, employment and housing quality)
38.
Cohort
A flowchart showing the cohort construction is presented in Extended Data Fig.
5. We first identified the exposure group with a first SARS-CoV-2 infection between 1 March and 31 December 2020 (
n = 149,459). We then selected those individuals who are VA users, defined as having at least two healthcare encounters separated by at least 180 d in the 2 years before the infection (
n = 143,034). To examine the risk of PASC, we selected individuals who were alive at 30 d after the infection, yielding an analytic cohort of 135,161 individuals in the COVID-19 group. Hospitalization within the acute phase was defined as inpatient admission date within 7 d before or within 30 d after the positive test. The COVID-19 group was then further divided by the care setting during the acute phase of infection into non-hospitalized (
n = 114,864) and hospitalized (
n = 20,297) COVID-19 groups. The date of positive SARS-CoV-2 test was defined as
T0, and the follow-up started from 30 d after
T0. Participants were followed until death, repeated SARS-CoV-2 infection, 1,080 d after the first infection or 31 December 2023.
To construct a control group without SARS-CoV-2 infection, we first identified 6,231,638 individuals who were alive on 1 March 2020 and did not have a positive SARS-CoV-2 test between 1 March 2020 and 30 March 2021. We then randomly assigned T0 for the control group based on the distribution of T0 in the overall COVID-19 group to ensure that the proportion of participants started at a specific date was the same between the COVID-19 group and the control group; 6,194,973 participants were alive at the randomly assigned T0. Similar to the COVID-19 group, we also required the control group to have encountered the VA healthcare system on at least two occasions separated by at least 180 d in the 2 years before the assigned T0, yielding a final analytical cohort of 5,206,835 participants in the control group without infection. The follow-up started from 30 d after T0, and participants were followed until death, a SARS-CoV-2 infection, 1,080 d after T0 or 31 December 2023.
Outcomes
We pre-specified a list of 80 individual outcomes that are well-characterized sequelae of SARS-CoV-2 infection based on previous evidence
1,4,5,6,8,9,11,12,13,20,25,37. These outcomes were defined using International Classification of Diseases 10th Revision (ICD-10) diagnosis codes, medical prescriptions and laboratory measurements
20,25. Incident outcomes were identified when the outcomes did not occur in the 2 years before
T0 and they were the first occurrences from 30 d after
T0 to the end of follow-up. The individual outcomes were then grouped into 10 organ systems: cardiovascular, coagulation and hematological, fatigue, gastrointestinal, kidney, mental health, metabolic, musculoskeletal, neurological and pulmonary. For overall PASC and outcomes at organ system level, we estimated the number of sequelae as the sum of the occurrence of individual outcomes included in a composite outcome. We additionally used the GBD methodologies to estimate DALYs, which represent a measure of disease burden that accounts for the number of sequelae included and their influence on overall health
20,26,39. For each individual outcome, a health burden coefficient was assigned
20,25,26. DALYs were then estimated by the weighted sum of all individual outcomes included in a composite outcome, where weight is the health burden coefficient for each individual outcome
20,25.
Covariates
A set of pre-defined covariates was selected based on prior knowledge of potential confounders that may bias the relationship between SARS-CoV-2 infection and PASC
1,4,6,7,8,12,13,20,25. The demographic covariates included age, self-reported sex, self-reported race (White, Black and Other), ADI at residential address and smoking status (never, former and current smokers). Additional covariates included estimated glomerular filtration rate (eGFR), systolic and diastolic blood pressure and body mass index measured before and closest to
T0. A set of variables defining healthcare utilization included the use of long-term care in the year before the pandemic, receipt of seasonal influenza vaccination each year for up to 5 years before
T0, the number of inpatient and outpatient Medicare visits in the year before the pandemic, the number of inpatient and outpatient unique medical prescriptions and and the number of inpatient and outpatient laboratory panels in the VA medical system separated by 180-d intervals. Comorbidities included anxiety, cardiovascular disease, cerebrovascular disease, chronic kidney disease, chronic lung disease, dementia, depression, diabetes, immunocompromised status (history of organ transplantation, end stage kidney disease, cancer, HIV or prescriptions of corticosteroids or immunosuppressants) and peripheral artery diseases. To account for spatiotemporal variations, we accounted for the calendar week of SARS-CoV-2 infection for the COVID-19 groups or the assigned
T0 for the control group as well as the geographic location of medical service. Missing values included 9.2% for eGFR, 4.9% for systolic and diastolic blood pressure and 10.4% for body mass index. We imputed the missing data using multivariate imputation by chained equations and matching method with predictive mean conditional on all covariates in COVID-19 groups and the control group separately
25. All the covariates were measured using a look-back period of 2 years before
T0 unless otherwise specified.
Statistical analyses
The COVID-19 group was separated by care setting during the acute phase into two mutually exclusive groups: non-hospitalized and hospitalized COVID-19 groups. Baseline characteristics of the COVID-19 groups and the control group without infection were reported. Continuous variables were reported as means (standard deviations), and categorical variables were reported as frequencies (percentages). Standardized mean differences were computed to evaluate covariate balance between COVID-19 groups and the control group without infection, where a value of less than 0.1 was considered evidence of good covariate balance. An analytic flowchart is presented in Extended Data Fig.
6.
Inverse probability weighting was used to balance baseline differences between the two COVID-19 groups and the control group without infection
20. Logistic regression models were constructed to estimate the probability of being assigned to the target group given the pre-specified covariates (propensity score). To provide a representative risk assessment, we selected the overall population (COVID-19 groups and the control group) as the target population. The inverse probability weights for all three groups were then computed as the propensity score divided by (1 − propensity score). We truncated the propensity score weights at 99.9% percentiles in each group (the control, non-hospitalized COVID-19 and hospitalized COVID-19 groups) to reduce the influence of excessively large weights on the analytical results. We estimated the risk of death and the risk of sequelae at the levels of overall PASC, organ systems and individual outcomes in the weighted cohorts during three time periods: 30–360 d (first year), 361–720 d (second year) and 721–1,080 d (third year) after
T0. To estimate the risk of incident outcome in each period, participants were considered at risk if the examined outcome did not occur in the previous period. We estimated the propensity score weights independently within each period and applied the weights from different periods into one outcome model to estimate the risks and cumulative burden. Participants were censored at the time of death or SARS-CoV-2 infection during follow-up for both COVID-19 groups (non-hospitalized and hospitalized) and the control group.
IRRs, absolute rates, absolute rate differences, cumulative rates, cumulative rate differences for death, the number of sequelae and DALYs were estimated from weighted generalized estimating equations using a log link and a Poisson distribution. The rate differences in death, the number of sequelae and DALYs overall and by organ system between COVID-19 groups and the control group without infection were considered as outcomes due to COVID-19. The percentage contribution of number of sequelae and DALYs in each year during the follow-up were estimated for overall PASC and by organ system. The 95% CIs were generated from the 2.5th and 97.5th percentiles of parametric bootstrapping of 1,000 times based on the point estimates and covariance matrix of the generalized estimating equations
20. The number of sequelae and DALYs are reported as rates per 1,000 persons.
In all analyses, a 95% CI of IRR that excluded unity or the number of sequelae and DALYs that excluded zero was considered evidence of statistical significance. Analyses were conducted using SAS Enterprise Guide version 8.3 (SAS Institute), and results were visualized using R version 4.3.2.
Sensitivity analyses
We performed several sensitivity analyses. (1) We estimated doubly robust adjustment models in which the covariates were used in both exposure and outcome models, instead of the primary approach where the covariates were applied only in the exposure model. (2) Instead of the Poisson models in the primary approach, we constructed zero-inflated Poisson models to evaluate how a large number of zero outcomes influences the model fit. (3) Instead of the primary approach where participants were censored at SARS-CoV-2 infection during follow-up, we did not censor participants in the COVID-19 groups upon reinfection and consider reinfection as a natural outcome of the first infection. (4) Instead of using only the pre-defined set of covariates in our primary approach, we additionally adjusted for 100 algorithmically selected high-dimensional covariates
40. (5) Instead of defining hospitalization during the acute phase as inpatient admission date within 7 d before or within 30 d after the positive test in the main analyses, we used an alternative definition of hospitalization as inpatient admission date on the day of positive test or within 30 d after the positive test. (6) We truncated propensity score weights at 99.5% percentiles rather than the 99.9% percentiles in the main analyses. (7) We estimated the results among a subsample with complete data on all covariates (
n = 4,432,414, 83.0% of the full sample) to test the consistency of the results with those obtained using multiple imputation for missing data. (8) We estimated the risks based on Fine–Gray models where death and SARS-CoV-2 infection during follow-up were considered as competing risks
41. (9) We additionally applied inverse probability of censoring weight to account for non-random censoring due to death or SARS-CoV-2 infection during follow-up across the three groups (the control group without infection and the non-hospitalized COVID-19 and hospitalized COVID-19 groups)
42. (10) We alternatively used a narrower definition of PASC that included 73 outcomes instead of the 80 outcomes included in the primary analyses.
Negative outcome control
We used the same analytic approach (outlined above) to examine the association between COVID-19 and incident neoplasms as a negative outcome control in each year during the 3 years of follow-up
43. There is no mechanistic support for or clinical evidence of a causal relationship between SARS-CoV-2 infection and the risk of incident neoplasms. Reproducing the a priori expected null association between COVID-19 and the negative outcome control may reduce concerns about possible spurious biases
43.
Reporting summary