May 18, 2022
A study shows how the National COVID Cohort Collaborative used XGBoost machine learning models to better define long COVID and identify potential long COVID patients with a high degree of accuracy.
Clinical scientists used machine learning (ML) models to explore de-identified electronic health record (EHR) data in the National COVID Cohort Collaborative (N3C), a national clinical database funded by the National Institutes of Health, to help discern the characteristics of people with long COVID and the factors that may help identify such patients from their medical records.
The findings, published in The Lancet Digital Health, have the potential to improve clinical research on long COVID and to support a more standardized system of care for the condition.
First author Emily R. Pfaff, MD, assistant professor in the division of endocrinology and metabolic medicine at the University of North Carolina School of Medicine, said:
“We needed to gain a better understanding of the complexities of long COVID, and for this, it made sense to take advantage of modern data analytics tools and a unique big data resource such as N3C, where many of the long COVID features are represented.”
Sponsored by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health, the N3C Data Enclave currently includes information representing more than 13 million people from 72 sites nationwide, including nearly five million positive cases of COVID-19. The resource enables rapid research on emerging questions about vaccines, treatments, risk factors, and health outcomes for COVID-19.
This new research is part of the National Institutes of Health’s Researching COVID to Enhance Recovery (RECOVER) Initiative, which has recruited thousands of participants across the country to answer critical research questions about the syndrome, including who has long COVID, what their risk factors are, and which interventions and treatments might help.
Using N3C, researchers developed XGBoost machine learning (ML) models to understand patient characteristics and better identify potential long COVID patients.
Researchers examined the demographics, healthcare use, diagnoses, and medications of 97,995 adult COVID-19 patients. They used these features, together with data on nearly 600 long COVID patients from three long COVID specialty clinics, to train and test three ML models. Each model focused on identifying long COVID patients in one of three groups: all COVID-19 patients, patients hospitalized with COVID-19, and patients who had COVID-19 but were not hospitalized.
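The three-group setup described above can be illustrated with a minimal Python sketch. It only shows how one patient list might be split into the three modeling cohorts and how a label could be derived from specialty-clinic attendance. The `Patient` fields, the function names, and the sample records are all hypothetical, chosen for illustration; they are not the N3C schema or the study's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # All field names are illustrative assumptions, not N3C column names.
    patient_id: str
    hospitalized: bool       # hospitalized with COVID-19
    long_covid_clinic: bool  # seen at a long COVID specialty clinic (the label)


def build_cohorts(patients):
    """Split patients into the three groups named in the article."""
    return {
        "all": list(patients),
        "hospitalized": [p for p in patients if p.hospitalized],
        "non_hospitalized": [p for p in patients if not p.hospitalized],
    }


def labels(cohort):
    """1 for patients seen at a long COVID clinic, 0 otherwise."""
    return [int(p.long_covid_clinic) for p in cohort]


# Made-up example records, for demonstration only.
patients = [
    Patient("p1", hospitalized=True, long_covid_clinic=True),
    Patient("p2", hospitalized=False, long_covid_clinic=False),
    Patient("p3", hospitalized=False, long_covid_clinic=True),
    Patient("p4", hospitalized=True, long_covid_clinic=False),
]

cohorts = build_cohorts(patients)
for name, cohort in cohorts.items():
    print(name, len(cohort), labels(cohort))
```

In a setup like this, each cohort would get its own model, trained on features such as the demographics, healthcare use, diagnoses, and medications mentioned above.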
The models proved accurate in identifying potential long COVID patients. Patients flagged by the models could be interpreted as “patients who required care at a specialty clinic for long COVID”.
The models also highlighted several important features that distinguish potential long COVID patients from patients without long COVID.
The models focused on patients with a positive COVID diagnosis who were at least 90 days past acute infection. Features most commonly identified among potential long COVID patients include post-COVID respiratory symptoms and related treatments; non-respiratory symptoms widely reported as part of long COVID (e.g., sleep disturbances, anxiety, malaise, chest pain, constipation); pre-existing risk factors for greater acute COVID severity (e.g., chronic lung disease, diabetes, chronic kidney disease); and proxies for hospitalization, which indicate greater severity of acute COVID.
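The 90-day criterion mentioned above can be expressed as a simple date check. This is a sketch under assumptions: the function name, the use of a single infection index date per patient, and the reference date are all hypothetical, and the study's actual cohort logic may differ.

```python
from datetime import date, timedelta

# Threshold stated in the article: at least 90 days after acute infection.
MIN_DAYS_SINCE_INFECTION = 90


def is_eligible(covid_index_date, as_of):
    """True if at least 90 days separate the (assumed) infection index
    date from the reference date `as_of`."""
    return (as_of - covid_index_date) >= timedelta(days=MIN_DAYS_SINCE_INFECTION)


# Hypothetical reference date, for demonstration only.
as_of = date(2022, 1, 1)
print(is_eligible(as_of - timedelta(days=120), as_of))
print(is_eligible(as_of - timedelta(days=30), as_of))
```

A filter like this would be applied before training, so that only patients far enough past acute infection are considered.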
The study also suggests that long COVID may ultimately not have a single definition, and could be better described as a group of related conditions, each with its own symptoms, trajectories, and treatments.
Josh Fessel, MD, PhD, senior clinical advisor at NCATS and a scientific program lead in RECOVER, added, “Once you can identify who has long COVID in a large database of people, you can start asking questions about those people. Was there something different about these people before they had long COVID? Do they have certain risk factors? Was there something about how they were treated during acute COVID-19 that might increase or decrease the risk of long COVID?”
The study also noted that EHR data skew toward patients who use health care systems the most. Pfaff says it’s important to recognize who is less likely to be represented in the data: patients who are uninsured, patients who have limited access to care or limited ability to pay for it, and patients who seek care at small practices or community hospitals with limited data exchange capabilities.
“Electronic health records (EHRs) contain information only for people who go to the doctor,” said Pfaff, who is also co-director of the NC TraCS Informatics and Data Science (IDSci) program. “They also have more information about people who go to the doctor a lot. So, people who don’t have good access to care or people who don’t go to the doctor, we won’t get information about them. So that’s a warning I give with every study that I do based on health records. We need to identify who is not in the data set.”
The N3C team continues to improve its models as more real-world data emerge. Its longitudinal data on COVID-19 patients could provide a comprehensive basis for developing ML models to identify potential long COVID patients.
As larger groups of long COVID patients are established, future work will include research to identify subtypes of long COVID, making the condition easier to study and treat.