Introduction

On October 2nd, 2023, the Nobel Assembly at Karolinska Institute awarded the 2023 Nobel Prize in Physiology or Medicine to Professors Katalin Karikó and Drew Weissman for their discovery that modifying the uridine nucleoside to pseudouridine blocks the inflammatory response that follows delivery of messenger RNA (mRNA) molecules into cells, thereby increasing the production of proteins encoded by the mRNA1. This discovery, made 15 years ago, revolutionized the therapeutic potential of mRNA and allowed the rapid development of mRNA vaccines against SARS-CoV-2. RNAs have come of age, not only for vaccines, but also for diagnosing and treating disease2.

In March 2020, partners of the EU-CardioRNA COST Action network3,4 joined forces to develop an RNA-based diagnostic test based on artificial intelligence (AI) to predict clinical outcomes after COVID-195. The rationale for this endeavor was that leveraging the power of non-coding RNAs may help reduce the devastating consequences of the COVID-19 pandemic6. Indeed, risk prediction models could inform the clinical management of patients. Non-coding RNAs, which, unlike the better-known mRNAs, do not encode proteins, are regulated in virtually all pathological conditions and, since they are detectable in the blood, they have emerged in recent years as a new reservoir of non-invasive candidate biomarkers and therapeutic targets. Our consortium previously characterized a panel of 2906 cardiac-enriched or heart failure-associated long non-coding RNAs (lncRNAs) (FIMICS panel)7 which, together with an in-house developed bioinformatics pipeline to maximize the benefit of targeted sequencing (Firalink pipeline8), provides a new tool to discover disease-associated lncRNAs with potential to help in diagnosis and risk stratification. Since the FIMICS panel contains many inflammation-related lncRNAs and inflammation is a hallmark of the host response to infection by SARS-CoV-2, we reasoned that it might be useful for identifying predictors of COVID-19 outcome.

In the H2020-funded FastTrack COVIRNA project, we aimed to apply the FIMICS panel to identify lncRNAs predictive of COVID-19 outcome. We used blood samples and clinical data from four cohorts of COVID-19 patients totaling 1286 patients. Three cohorts with 804 patients were merged as a discovery cohort for feature selection and choice of best performing machine learning (ML) models. The fourth cohort of 482 patients was used for validation purposes. Here, we have built a model based on one lncRNA and age able to predict in-hospital mortality with an area under the receiver operating characteristic curve (AUC) of 0.83 (0.82–0.84).

Results

Study design

The study design is illustrated in Fig. 1. The study population consisted of a total of 1329 patients with COVID-19, divided between a discovery cohort (n = 818) and a validation cohort (n = 511) used for ML model selection and evaluation, respectively. Three European cohorts were included in the discovery cohort (PrediCOVID from Luxembourg, n = 141; NAPKON from Germany, n = 557; and ISARIC4C from United Kingdom, n = 120) and one cohort from Canada constituted the validation cohort (BQC19, n = 511). Whole blood samples collected in PAXgene RNA tubes at baseline in all patients were centrally stored at −80 °C in a NF S96-900 certified biobank. RNA extraction, quality check, library preparation and RNA sequencing using the FIMICS panel were performed in our core lab. Raw sequencing data were normalized and merged with clinical data of patients in our central database. Data were curated and made available for analysis using ML/AI. Patients with RNAseq datasets that did not meet the quality criteria described in the Materials and Methods section, or with blood samples not collected at the time of enrolment in the study, or for whom survival data were not available, were excluded from the analysis. After curation and quality checks, combined RNAseq datasets and clinical data from 136 PrediCOVID, 556 NAPKON, 112 ISARIC4C (804 patients for the discovery cohort) and 482 BQC19 patients for the validation cohort were available for ML analysis. Overall, a total of 1286 full datasets, each representing a unique patient, were available for analysis. After lncRNA selection by ML, a translational study was conducted by qPCR in a subgroup of 84 patients from the NAPKON cohort for whom leftover RNA was available.
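
The cohort sizes quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch using only the numbers given in the text; the dictionary and function names are illustrative and not part of the study's pipeline.

```python
# Patient counts per cohort before and after curation (numbers from the text).
enrolled = {"PrediCOVID": 141, "NAPKON": 557, "ISARIC4C": 120, "BQC19": 511}
analysed = {"PrediCOVID": 136, "NAPKON": 556, "ISARIC4C": 112, "BQC19": 482}
discovery_cohorts = ("PrediCOVID", "NAPKON", "ISARIC4C")


def cohort_totals(counts):
    """Return (discovery, validation, overall) patient totals."""
    discovery = sum(counts[name] for name in discovery_cohorts)
    validation = counts["BQC19"]
    return discovery, validation, discovery + validation


# Number of patients excluded by curation and quality checks, per cohort.
excluded = {name: enrolled[name] - analysed[name] for name in enrolled}
```

With these inputs, `cohort_totals(enrolled)` gives the enrolled totals and `cohort_totals(analysed)` the totals retained for ML analysis.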

Fig. 1: Study design.

Baseline characteristics of patients in the analysis are reported in Table 1, in which the three merged European cohorts used for discovery are compared to the Canadian cohort used for validation of the selected features and ML models. Missing data are indicated and were imputed using missForest. The median number of days in hospital was 9 (Q1 = 5, Q3 = 19) and 8 (Q1 = 4, Q3 = 19) for the ISARIC4C and BQC19 cohorts, respectively. In all cohorts, patients who died in hospital were older than survivors, more often had cardiovascular disease, and more often received oxygen therapy. Male sex was associated with a higher risk of death in the merged European cohorts. Diabetes and chronic lung disease were also risk factors in these cohorts. Patients in the Canadian cohort were older, were more often female and were less often smokers than patients in the merged European cohorts. Supplementary Table 1 shows the characteristics of the three European cohorts individually, together with the nature of common COVID-19 symptoms across cohorts. The PrediCOVID cohort had younger patients than the two other cohorts and none of them died during the follow-up period. There were more smokers at the time of enrolment in the PrediCOVID cohort than in the NAPKON and ISARIC4C cohorts. Common baseline symptoms across the European cohorts included fever, headache, cough and dyspnea, which were less frequent in survivors (Supplementary Table 1). Ethnicity data were available in the NAPKON cohort, in which most patients were Caucasian and no apparent association between ethnicity and survival was found (Supplementary Table 1). In this cohort, vaccinated people had a lower risk of death than non-vaccinated people (Supplementary Table 1). Vaccination data were unavailable in the other cohorts.

Table 1 Baseline characteristics of patients in the discovery and validation cohorts

Machine learning model building and characterization

We performed feature selection on the training set and evaluated six different ML classifiers (RF, kNN, Logit, MLP, SVM, XGB) on the discovery cohort derived from the 3 combined European cohorts (n = 804) in each of the 100 iterations as described in the Materials and Methods section and in Supplementary Fig. 1. The median number of features selected in each iteration was 21 (Q1 = 16, Q3 = 25). The performance of each model to predict in-hospital mortality is shown in Table 2. The logistic regression model (Logit) with the selected features in each iteration provided the most accurate prediction of in-hospital mortality with an AUC of 0.83 (95% CI 0.81–0.84), an accuracy of 0.74 (95% CI 0.73–0.76), a sensitivity of 0.77 (95% CI 0.74–0.79), and a specificity of 0.72 (95% CI 0.69–0.75).
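
Performance metrics are reported as a mean with a 95% CI across iterations. The exact CI method is not stated in this section, so the sketch below assumes a normal approximation (mean ± 1.96 standard errors); the function name `mean_ci` and the example values are hypothetical.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Mean and normal-approximation 95% CI of a list of per-iteration metrics.

    Assumption: the CI is mean +/- z * standard error; the study may have
    used a different method (e.g. a t-distribution or percentiles).
    """
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * std_err, mean + z * std_err


# Hypothetical AUC values from five iterations (illustration only).
example_aucs = [0.80, 0.82, 0.84, 0.86, 0.88]
auc_mean, auc_low, auc_high = mean_ci(example_aucs)
```

With 100 iterations, as in the study, the standard error shrinks with the square root of the number of iterations, which is one reason the reported intervals are narrow.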

Table 2 Performance of different classifiers to predict in-hospital mortality in the discovery cohort (n = 804) using the features from each iteration

The analysis yielded the selection of two features, age and the lncRNA LEF1-AS1, which appeared in 82 and 63 iterations out of the 100 iterations performed, respectively (Fig. 2A). LEF1-AS1 is a lncRNA of 3,360 nucleotides transcribed from the lymphoid enhancer binding factor 1 (LEF1) locus located on chromosome 4. In the merged European cohorts (discovery cohort, n = 804), patients who survived were younger and had higher expression levels of LEF1-AS1 than patients who died (Fig. 2B, C). There was a significant albeit moderate negative correlation between age and LEF1-AS1 in this cohort (Fig. 2D), as well as in the Validation cohort (r = −0.35, p < 0.01). Also, LEF1-AS1 was differentially expressed between males and females in the Discovery (Fig. 2E) and in the Validation cohort (p < 0.01 and p = 0.02, respectively). The expression of LEF1-AS1 was associated with cancer diagnosis with an odds ratio of 0.71 [0.55–0.90] and 0.66 [0.52–0.84] in the NAPKON and BQC19 cohorts, respectively. The Shapley beeswarm plots shown in Supplementary Fig. 2 attest that higher age and lower expression of LEF1-AS1 led to positive SHAP values and thus had positive impacts on model output.

Fig. 2: Feature selection on the discovery cohort (n = 804 patients).

A Line plot of the number of times each of the 10 most frequently selected features was selected. X-axis: the name of the features. SEQXXXX are the codes of the probes of the FIMICS panel. The SEQ0235 probe recognizes the lncRNA LEF1-AS1. Y-axis: the number of times a feature appeared in the 100 iterations of the feature selection process. B, C Box/violin plots of age and LEF1-AS1 expression, which were significantly increased and decreased in the non-survivors group (n = 62 patients) of the European cohorts, respectively. P-value is from a two-sided Student’s t test. FDR (false discovery rate) is from the DESeq2 algorithm. D Correlation between age and LEF1-AS1. A Pearson correlation coefficient and a two-sided t-test p-value are indicated. E Comparison between expression levels of LEF1-AS1 in males (n = 480 patients) and females (n = 324 patients). P-value is from a two-sided Student’s t test. In B, C and E, the box is drawn from Q1 (25th percentile) to Q3 (75th percentile) with a horizontal line inside it to denote the median. The length of the whiskers indicates 1.5 times the IQR (interquartile range, Q3–Q1).
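
The box plot elements described in this legend (Q1, median, Q3 and whiskers at 1.5 times the IQR) can be computed as in the following minimal sketch. The quantile interpolation method is an assumption (the plotting software used for the figures may differ), and `box_stats` is an illustrative name.

```python
import statistics


def box_stats(values):
    """Quartiles, IQR and 1.5 * IQR whisker limits for a list of numbers.

    Assumption: quartiles use the 'inclusive' interpolation method of the
    Python standard library; other software may interpolate differently.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": q1 - 1.5 * iqr,
        "upper_limit": q3 + 1.5 * iqr,
    }


# Toy data (not study data) to illustrate the calculation.
toy = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9])
```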

The six ML classifiers with the two selected features (age and LEF1-AS1) were then evaluated on the discovery cohort in 100 iterations, using the same data splits as for feature selection. The MLP model exhibited the highest AUC of 0.82 (95% CI: 0.80–0.84) (Table 3 and Supplementary Fig. 3). There was no significant difference in performance between the models with the features from each iteration and the model with age and LEF1-AS1 (Tables 2 and 3). Adding the third best predictor selected during the feature selection step (oxygen therapy) did not improve the performance of the prediction model in the balanced (AUC 0.84 [0.82–0.86], p = 0.11 for comparison with the model without oxygen therapy) and imbalanced (AUC = 0.83 [0.82–0.84], p = 0.91 for comparison with the model without oxygen therapy) discovery dataset.

Table 3 Performance of different classifiers to predict in-hospital mortality in the discovery cohort using the two selected features (age and LEF1-AS1)

When predicting in-hospital mortality for the balanced datasets from the validation cohort (i.e., same number of survivors and deceased patients, Supplementary Fig. 1), the MLP model achieved an AUC of 0.84 (95% CI 0.82–0.86), an accuracy of 0.76 (95% CI 0.74–0.78), a sensitivity of 0.77 (95% CI 0.75–0.79), and a specificity of 0.75 (95% CI 0.72–0.78) (Table 4). We extended the testing to the original imbalanced datasets using the 2 selected features, yielding the following metrics for the discovery cohort: AUC 0.83 (95% CI 0.82–0.84), balanced accuracy 0.78 (95% CI 0.77–0.79), sensitivity 0.86 (95% CI 0.84–0.88), and specificity 0.71 (95% CI 0.70–0.71); for the validation cohort, the metrics were AUC 0.83 (95% CI 0.82–0.84), balanced accuracy 0.75 (95% CI 0.74–0.77), sensitivity 0.79 (95% CI 0.76–0.82), and specificity 0.72 (95% CI 0.71–0.73). The model with age alone yielded AUC 0.78 (95% CI 0.77–0.80), 0.79 (95% CI 0.78–0.80), 0.78 (95% CI 0.76–0.79), and 0.78 (95% CI 0.76–0.79) in balanced/imbalanced discovery and balanced/imbalanced validation cohort, respectively. Adding LEF1-AS1 significantly improved the model performance (Fig. 3, Supplementary Fig. 4). Adding sex and/or the other 2 features which were selected more than 40 times in the feature selection iterations (oxygen therapy and SEQ0986, Fig. 2A) did not significantly improve the model performance (Supplementary Fig. 5). Missing data imputation did not significantly influence results since the MLP model run without prior imputation of missing age data (concerning only 4 patients of the Discovery cohort, Table 1) reached an AUC of 0.83 (95% CI 0.81–0.85) and a balanced accuracy of 0.78 (95% CI 0.76–0.80) (Supplementary Table 3).

Table 4 Performance of the MLP model to predict in-hospital mortality in the balanced/imbalanced discovery and validation cohorts

Fig. 3: Comparison of the performance of the models with age alone, LEF1-AS1 alone and the two features using the discovery (n = 804 patients) and the validation cohort (n = 482 patients), respectively.

The evaluation was performed on the imbalanced data with 20-times repeated 5-fold cross-validation. The error bars display the confidence interval. Significant differences (p < 0.05) compared to the model with 2 features, assessed using a two-sided Student’s t test, are indicated.
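
A 20-times repeated 5-fold cross-validation yields 100 train/test splits per cohort. The sketch below generates such splits with the standard library only. Stratifying folds by outcome is an assumption made here so that each fold contains deceased patients; the study used scikit-learn, whose exact splitter settings are not given in this section.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=20, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified k-fold CV.

    Assumption: folds are stratified by class label, so every fold keeps
    roughly the same proportion of deceased (1) and surviving (0) patients.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            # Deal indices of each class round-robin into the folds.
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test


# Discovery cohort layout from the text: 62 non-survivors among 804 patients.
discovery_labels = [1] * 62 + [0] * 742
cv_splits = list(repeated_stratified_kfold(discovery_labels))
```

Each test fold then holds 12 or 13 of the 62 non-survivors, which keeps the per-fold sensitivity estimate from being undefined.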

We compared the predictive performance of the MLP model with age and LEF1-AS1 to a previously published model involving age, sex, C-reactive protein (CRP) and lactate dehydrogenase (LDH). As shown in Supplementary Fig. 6, our MLP model and the four-parameter model had similar capacity to predict mortality in the BQC19 cohort (AUC 0.83 [0.82–0.84] vs 0.85 [0.84–0.86], respectively). Our MLP model outperformed the four-parameter model in the NAPKON cohort (AUC 0.82 [0.81–0.83] vs 0.78 [0.76–0.79], respectively). Brier score analysis was used to assess the calibration of our MLP model, where a lower Brier score indicates a better-calibrated model. This analysis revealed a similar score (for BQC19 data) and a lower score (for NAPKON data) for our MLP model compared to the previously published four-parameter model (Supplementary Fig. 6).
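
For reference, the Brier score is the mean squared difference between the predicted probability and the observed binary outcome. A minimal sketch follows; the function name and the toy inputs are illustrative, not study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    0 is a perfect score; a constant prediction of 0.5 scores 0.25.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example: two deaths (1) and two survivors (0) with predicted risks.
toy_brier = brier_score([1, 0, 1, 0], [0.9, 0.1, 0.6, 0.4])
```

Because the score depends on both calibration and discrimination, it is best read alongside the AUC rather than in isolation.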

Survival analysis

We then evaluated the association between the lncRNA LEF1-AS1 and in-hospital mortality using survival analysis. Patients with high levels of LEF1-AS1 were at low risk of death (age-adjusted HR 0.59, 95% CI 0.36–0.96) in the ISARIC4C subgroup of the discovery cohort (Fig. 4A). In the validation cohort, the HR was 0.54 (95% CI 0.40–0.74) (Fig. 5A). Kaplan–Meier curves using different cut-offs for LEF1-AS1 expression demonstrate the observed association of high expression levels of LEF1-AS1 with low risk of death (Figs. 4B and 5B).
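
The Kaplan–Meier curves referred to above estimate the probability of remaining alive over time in hospital. A minimal sketch of the product-limit estimator is given below. It assumes that patients discharged alive are treated as censored at their last day in hospital, which is not stated explicitly in this section; the function name and toy data are illustrative.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time per patient (e.g. days in hospital)
    events: 1 if the patient died at that time, 0 if censored
    Returns a list of (time, survival_probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        at_risk -= leaving
    return curve


# Toy data: five patients, three deaths, two censored observations.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Applying the function separately to patients above and below a LEF1-AS1 cut-off gives the two curves compared in each panel.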

Fig. 4: Survival analysis in the ISARIC4C cohort (n = 112 patients).

A Forest plot of the Hazard Ratio (HR) from Cox regression analysis shows a higher risk of death for older patients and a lower risk for patients with higher LEF1-AS1 expression level. The dots and the error bars display the HR and the confidence interval, respectively. The p values are from a two-sided Wald test. B Kaplan–Meier curves using the stratified LEF1-AS1 expression with the first quartile (Q1), the median and the third quartile (Q3), respectively. Patients with LEF1-AS1 expression levels below or equal to the first quartile (Q1) are at a high risk of death. The p values are from a two-sided log-rank test.

Fig. 5: Survival analysis in the BQC19 cohort (n = 438 patients).

Note that 44 patients of the 482 patients of the BQC19 cohort did not have information on the number of hospitalized days, yet had information on in-hospital mortality. A Forest plot of the Hazard Ratio (HR) from Cox regression analysis shows a higher risk of death for older patients and a lower risk for patients with higher LEF1-AS1 expression level. The dots and the error bars display the HR and the confidence interval, respectively. The p values are from a two-sided Wald test. B Kaplan–Meier curves using the stratified LEF1-AS1 expression with the first quartile (Q1), the median and the third quartile (Q3), respectively. Patients with LEF1-AS1 expression levels below or equal to the first quartile (Q1) are at a high risk of death. The p values are from a two-sided log-rank test.

Translational perspective

To gain further insights into the feasibility of LEF1-AS1 testing in the hospital environment, e.g., for the development of a molecular diagnostic assay, we set up a quantitative PCR protocol to measure blood levels of LEF1-AS1 in a subgroup of 84 patients of the NAPKON cohort. Patient characteristics are shown in Supplementary Table 2. Forty-one patients survived and 43 died in hospital. The two groups were age-matched, sex-balanced and had similar average body mass index (BMI). We first validated that expression levels of LEF1-AS1 as assessed by quantitative PCR were correlated with the levels obtained by RNAseq using the FIMICS panel (Fig. 6A). Moreover, as shown in Fig. 6B, patients who died during their hospital stay had a lower expression of LEF1-AS1 compared to survivors (p = 0.003). The odds of surviving to hospital discharge increased ~1.4-fold for every 1-unit increase in log2-transformed LEF1-AS1 expression (OR 1.39, 95% CI 1.10–1.76). When we dichotomized the log2-transformed expression levels of LEF1-AS1 using a cut-off determined by Youden’s index (to maximize specificity and sensitivity), patients who had LEF1-AS1 levels above 0.043 had 5-fold higher odds of surviving to hospital discharge (OR 5.08, 95% CI 2.02–12.73).
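
The dichotomization step can be illustrated as follows. This sketch picks the cut-off maximizing Youden's index (sensitivity + specificity − 1) and computes an odds ratio with a Woolf (log-normal) 95% CI from a 2×2 table. The CI method, the function names and all input values are assumptions for illustration; the study's odds ratios may have been obtained differently (for example by logistic regression).

```python
import math


def youden_cutoff(values, survived):
    """Cut-off maximizing Youden's J, predicting survival when value > cut-off."""
    n_survivors = sum(survived)
    n_deceased = len(survived) - n_survivors
    best_cut, best_j = None, -1.0
    for cut in sorted(set(values)):
        true_pos = sum(1 for v, s in zip(values, survived) if s == 1 and v > cut)
        true_neg = sum(1 for v, s in zip(values, survived) if s == 0 and v <= cut)
        j = true_pos / n_survivors + true_neg / n_deceased - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Woolf 95% CI.

    a: survivors above cut-off      b: survivors at or below cut-off
    c: non-survivors above cut-off  d: non-survivors at or below cut-off
    """
    odds_ratio = (a * d) / (b * c)
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * std_err), math.exp(log_or + z * std_err)


# Toy inputs (not study data).
toy_cut, toy_j = youden_cutoff([-1.0, -0.5, 0.0, 0.2, 0.5, 1.0], [0, 0, 0, 1, 1, 1])
toy_or, toy_low, toy_high = odds_ratio_ci(20, 10, 10, 20)
```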

Fig. 6: Quantitative PCR assessment of LEF1-AS1 in whole blood samples collected in PAXgene tubes from 84 NAPKON patients (41 survivors and 43 non-survivors).

A Correlation between qPCR and RNAseq data obtained with the FIMICS panel. The gray area displays the confidence interval. A Spearman’s rank correlation coefficient and a two-sided t-test p value are indicated. B Box/violin plots of LEF1-AS1 expression, which was decreased in deceased patients (n = 43 patients). The box is drawn from Q1 (25th percentile) to Q3 (75th percentile) with a horizontal line inside it to denote the median. The length of the whiskers indicates 1.5 times the IQR (interquartile range, Q3–Q1). P-value is from a two-sided Student’s t-test.

Discussion

We hereby report the characterization of a machine learning model based on age and the lncRNA LEF1-AS1 able to predict in-hospital mortality of COVID-19 patients with clinically relevant accuracy.

The COVID-19 pandemic has impacted people’s lives in many different ways. Healthcare management during the pandemic has been challenging, partly due to a lack of preparedness and of ability to triage the large numbers of infected people arriving at the Emergency Department. Methods to help triage and risk stratify patients would have greatly facilitated the work of healthcare providers. Being able to identify patients at high risk of poor outcome or death, or conversely patients with a high chance of survival, would have allowed a more personalized approach to the use of healthcare that could have improved outcomes overall.

Initiated in March 2020 during the first phase of the pandemic, this study aimed to address the above issue and design a new method to identify patients at high risk of poor outcome after being infected with SARS-CoV-2. We applied our previously developed FIMICS panel of lncRNAs7 to whole blood samples of COVID-19 patients collected from three different European cohorts and a Canadian cohort. This panel allows for targeted sequencing, which is about 70 times more sensitive than whole genome sequencing, and therefore more suitable to detect and quantify potentially weakly expressed lncRNAs. Other studies have identified biomarkers of disease severity and outcome of COVID-199,10,11. We previously reported that LEF1-AS1 expression in peripheral blood cells was negatively associated with disease severity and mortality in a modestly sized cohort of COVID-19 patients12, which is consistent with our present investigation in whole blood samples. Models to predict mortality of COVID-19 patients have been previously developed, yet they suffer from a high risk of bias13. The MLP model reported in the present study with only two features (age and LEF1-AS1) showed similar predictive performance in the BQC19 cohort and higher performance in the NAPKON cohort compared to a model including age, sex, CRP and LDH. As compared to previous reports13, the strength of our study lies in its methodological aspects, which reduce the risk of bias. We conducted a multi-center and well-powered study, with patient numbers well above those of previous studies. We used a machine learning pipeline including feature selection and testing of multiple machine learning models with Discovery and Validation cohorts, each split into training and testing subgroups. In each cohort, we evaluated models on the imbalanced datasets using 20-times repeated 5-fold cross-validation.

Even though we observed a consistently low expression of LEF1-AS1 in patients with high risk of death, a functional role of LEF1-AS1 in COVID-19 outcome has still to be demonstrated. LEF1-AS1 is an antisense RNA to the lymphoid enhancer binding factor 1 (LEF1) gene encoding a transcription factor expressed in pre-B and T cells which is involved in proliferation, activation of genes in the Wnt/β-catenin pathway and in regulating systemic inflammation. Consistent with our observed lower expression of LEF1-AS1 in severe patients, recent studies have illustrated that B cells undergo significant depletion following SARS-CoV-2 infection14. Additionally, pulmonary fibrosis stems from damage to alveoli and is a hallmark of SARS-CoV-2 infection. Recent work has demonstrated that alveolar damage can be suppressed through activation of LEF1, which is mediated by the transcription factor krüppel-like factor 4, thus hinting at a possible protective role of LEF1 following alveolar injury and SARS-CoV-2 infection15. These studies suggest a link between LEF1/LEF1-AS1, T or B cell proliferation, alveolar protection and COVID-19 severity which warrants further investigation.

The machine learning protocol used in the present study was inspired by the method from ref. 16, which used Boruta, a random forest-based algorithm, to select features from electronic health records and evaluate a quantitative marker of coronary artery disease. We adapted their design to suit RNAseq data by adding DESeq2 for differential expression analysis. Many conventional statistical methods, such as t-tests and ANOVA, assume normal data distributions, which is often not the case for data generated by high-throughput platforms, such as sequencing. Newer machine learning methods are able to deal with the scale, diverse data distributions, and non-linearity found in large omics datasets17. Multiple machine learning algorithms, including deep learning algorithms, have been developed to build powerful predictive models linking omics data to prediction of clinical outcomes18,19. While benefiting from modeling flexibility and robustness, these models often suffer from difficulty in interpreting the role of each individual feature. Identifying biomarkers functionally associated with disease progression could help establish novel hypotheses regarding prevention, diagnosis, and treatment of complex human diseases20.

Translational perspectives

The present investigation was conducted using patients’ whole blood samples collected in PAXgene RNA tubes, which are certified for in vitro diagnostics. Other matrices could also be used and we do not exclude that other biomarkers with relevant predictive value may be found. Using quantitative PCR, a cost-effective technique available in most hospital labs, we confirmed that LEF1-AS1 was readily and reliably detected. Furthermore, we validated that low levels of LEF1-AS1 were associated with a high risk of death. These data support the potential translation of our findings to clinical application.

With the current excitement for the use of RNA as both biomarkers and therapeutic targets, lncRNAs may constitute a novel generation of actionable disease-monitoring biomarkers and drugs. Our data showing that lncRNAs are associated with mortality of COVID-19 patients support their potential as theranostic drugs, usable for both risk assessment and treatment of COVID-19. Circular RNAs (circRNAs) have raised particular interest for future drug development since these closed RNA molecules are not only able to induce therapeutic protein production more stably than linear RNAs, but also have the potential to capture and sequester unwanted molecules and thereby function as antisense RNAs, or to regulate RNA editing21. Whether lncRNAs find utility for COVID-19 remains to be determined, as well as whether circRNAs hold similar or superior value to reduce disease burden. It will be interesting in such endeavors to develop multimodal approaches taking into account not only baseline clinical characteristics and biomarkers but also mental health indicators, considering the importance of pre-existing health problems and especially psychological problems in the development of post-COVID condition22. It would be interesting to apply a similar approach to see whether lncRNAs are associated with the long-term impact of COVID-19, such as long COVID23. Considering the prevalence and devastating consequences of this novel disease24, setting up methods to predict the risk of developing long COVID symptoms would have a significant impact on the enormous burden of long COVID or post-COVID symptoms.

Limitations

This work has some limitations. First, since patients enrolled in this study were from the first phase of the pandemic, we assume that most if not all patients were infected by the original SARS-CoV-2 variant. However, we cannot exclude that some patients were infected by other variants, and we could not test the performance of the model in patients infected by different viral variants. Also, there was limited information on vaccination status because vaccines were not widely available at the time of study enrolment. Second, only limited clinical descriptions of the patients enrolled in the study could be provided due to the heterogeneity of cohorts and the difficulty of merging clinical data from different cohorts. Third, none of the participants of the Luxembourg PrediCOVID cohort died in hospital, most probably due to the nationwide mass screening program, which allowed an improved control of the virus and an earlier hospitalization of patients25. Since this cohort was included at project inception, it was kept in the analyses even though the main aim of this study was to predict in-hospital mortality, and we verified that its removal did not affect study findings. Fourth, survival analysis using Cox regression and Kaplan–Meier curves could be conducted only in the ISARIC4C and BQC19 subgroups for which we had data on time to death. Fifth, even though we tested six different ML classifiers, others could provide stronger predictive value. Lastly, a full functional characterization of the role of LEF1-AS1 in post-COVID-19 outcome remains to be done.

We identified a machine learning-supported model combining age and the lncRNA LEF1-AS1 predictive of COVID-19 in-hospital mortality. This model may find utility for the management of COVID-19 patients. Its usefulness for long COVID patients remains to be tested.

Methods

Patient cohorts

This study was performed in full compliance with the Declaration of Helsinki. Involved cohorts comprise COVID-19-positive patients aged 18 years and older from Luxembourg (PrediCOVID study), Germany (NAPKON study), United Kingdom (ISARIC4C study), and Canada (BQC19 study). The Luxembourg PrediCOVID study was approved by the National Research Ethics Committee of Luxembourg (study number 202003/07) and was registered under ClinicalTrials.gov (NCT04380987)26. The ISARIC4C study was approved by the Oxford C Research Ethics Committee (Reference 13/SC/0149) (details on study design, registration and approvals are available in the online supplement). For the NAPKON Cross-Sectoral Platform, a primary ethics vote was obtained at the Ethics Committee of the Department of Medicine at Goethe University Frankfurt, Germany (local ethics ID approval 20-924). All further study sites received their local ethics votes at the respective ethics committees. The NAPKON Cross-Sectoral Platform is registered at ClinicalTrials.gov (Identifier: NCT04768998)27. The Biobanque québécoise de la COVID-19 (BQC19) study has been approved by the Research Ethics Board of the Centre Hospitalier de l’Université de Montréal (CHUM) (#13.389)28. Periods of patient enrolment and biological sample collection were as follows: May 2020 - Present for PrediCOVID, July 2020 - Present for NAPKON, February 2020 - September 2020 for ISARIC4C, March 2020 - Present for BQC19. Informed consent was signed by all patients enrolled in these studies. Legal agreements for material and data sharing have been signed between each cohort and the COVIRNA project coordinator, the Luxembourg Institute of Health (LIH).

Sample storage and RNA extraction

All procedures were performed in the ISO 17025, ISO 9001, and CAP accredited facility of Firalis. Whole blood samples collected in PAXgene™ Blood RNA tubes (PreAnalytiX, Cat. #762165; BD Biosciences, Aalst, Belgium) were shipped from the different patient cohorts to our central NF S96-900 certified Biobank and were stored at −80 °C. Whole blood samples were randomized according to age and sex in batches of 64 prior to RNA extraction. Total RNA was extracted with the KingFisher Apex instrument (Cat. #5400930 P, Thermo Scientific, Waltham, MA, USA) using the MagMAX™ for Stabilized Blood Tubes RNA Isolation Kit (Cat. #4451894, Invitrogen, Thermo Scientific). Extracted RNA samples were quantified using the Qubit 3.0 fluorometer (Cat. #Q33216, Invitrogen, Thermo Fisher Scientific) with the RNA high sensitivity Assay kit. Sample quality was assessed using a TapeStation 4150 electrophoresis platform (Cat. #G2992AA, Agilent, Santa Clara, CA, USA).

Library preparation, targeted RNA sequencing and raw data analysis

An extended version of this section is available in the Supplementary Material. Briefly, a second stratified randomization by age and sex was performed in batches of 46 samples prior to library preparation. The libraries were generated by the EpMotion 5075t NGS solution (Cat. #5075000962, Eppendorf, Hamburg, Germany) using the KAPA Stranded RNAseq Kit with RiboErase (HMR; Cat. #634444, Roche diagnostics, Basel, Switzerland) for ribosomal RNA (rRNA) depletion and total RNA libraries construction. The clean-ups were performed with Celemag clean-up beads (Cat. #CMCB57.6, Celemics, Seoul, Korea) and the purified libraries were dual indexed during a 13-cycle PCR using the library preparation box #2 (Cat. # LI20D96, Celemics).

The indexed libraries were then captured using the FIMICS panel targeting 2906 lncRNAs7 (Cat. #BO5096, Celemics) and purified using Celemag streptavidin coated magnetic beads (Cat. #CMSB5.76, Celemics) and Celemics wash buffer (Cat. #TC4096, Celemics). The on-beads captured sequences were enriched by a 14 cycle PCR and purified using Celemag clean-up beads before quality assessment and quantification. The libraries were then normalized and pooled prior to being sequenced on the NextSeq 2000 platform (Cat. #20038897, Illumina Inc., San Diego, CA, USA) using the NextSeq 2000 P2 kit (Cat. #20046811, Illumina Inc.). Raw sequencing data were analysed using the Firalink pipeline8.

Data management and curation

RNA sequencing (RNAseq) datasets with a relative standard deviation <0.46 and with fewer than 10% of the total FIMICS lncRNAs detected with more than 10 reads were excluded. LncRNA data were merged with age, sex, and smoking status for the feature selection process. The missing values of these clinical data were imputed using the missForest function from the missForest R package29. Voom-transformed RNAseq data were used for ML analysis30.

Machine learning models

The three European cohorts (PrediCOVID, NAPKON, ISARIC4C) were combined and used as a discovery cohort, on which a machine learning procedure was iterated 100 times (Supplementary Fig. 1), following these steps: (1) random selection of 80% of deceased patients and a balanced set of living patients to construct a training dataset; (2) use of the remaining 20% of deceased patients along with a balanced set of the remaining living patients to form a test dataset; (3) identification of differentially expressed lncRNAs in the training dataset with a false discovery rate (FDR) < 0.00001 using the DESeq2 algorithm31; (4) feature selection in R on clinical variables (age, sex, and smoking status) and differentially expressed lncRNAs from the training dataset using the Boruta function from the Boruta package32 and the vif function from the rms package (https://CRAN.R-project.org/package=rms) with a cut-off of 5 to avoid multi-collinearity; (5) use of repeated (2x) 5-fold cross-validation to fine-tune various machine learning models, including random forest (RF), k-nearest neighbor (kNN), logistic regression (logit), multilayer perceptron (MLP), XGBoost (XGB) and support vector machine (SVM) models in the training dataset using the scikit-learn package in Python; (6) evaluation of the model in the test dataset. Features that appeared more than 70 times during the 100 iterations were retained as the selected features that were used to train and evaluate ML models by repeating steps 1, 2, 5 and 6 within 100 iterations with the same seed. The algorithm yielding the model with the highest AUC with the selected features in the test dataset was retained for use in the validation cohort.
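
Steps 1 and 2, and the final frequency-based retention of features, can be sketched as follows. This is a minimal illustration with the standard library, not the study's code: the function names, the rounding of the 80% fraction and the toy inputs are assumptions.

```python
import random
from collections import Counter


def balanced_split(labels, train_frac=0.8, seed=0):
    """Steps 1-2: balanced training and test sets of patient indices.

    labels: 1 = deceased, 0 = alive. The training set takes train_frac of
    the deceased plus as many living patients; the test set takes the
    remaining deceased plus as many of the remaining living patients.
    """
    rng = random.Random(seed)
    deceased = [i for i, y in enumerate(labels) if y == 1]
    alive = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(deceased)
    rng.shuffle(alive)
    n_train = round(train_frac * len(deceased))  # rounding rule is an assumption
    n_test = len(deceased) - n_train
    train = deceased[:n_train] + alive[:n_train]
    test = deceased[n_train:] + alive[n_train:n_train + n_test]
    return train, test


def retained_features(selected_per_iteration, min_count):
    """Features selected in more than min_count iterations."""
    counts = Counter(f for selected in selected_per_iteration for f in set(selected))
    return sorted(f for f, n in counts.items() if n > min_count)


# Discovery cohort layout from the text: 62 non-survivors among 804 patients.
labels = [1] * 62 + [0] * 742
train_idx, test_idx = balanced_split(labels, seed=1)

# Toy selection results from three iterations (illustration only).
toy_selected = [["age", "LEF1-AS1"], ["age"], ["age", "sex"]]
toy_retained = retained_features(toy_selected, min_count=1)
```

Reusing the same `seed` across runs reproduces the same splits, which mirrors the requirement that models with the selected features be evaluated on the same data splits as feature selection.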

The BQC19 cohort was used as the validation cohort. We repeated steps 1 and 2 described above 100 times to split the validation cohort into training and test datasets. In each iteration, a model was trained with the algorithm and the features selected in the discovery cohort, and then evaluated. We also evaluated the selected model on the original imbalanced datasets from the discovery and validation cohorts, respectively, using repeated (20x) 5-fold cross-validation. To test model robustness, we compared the selected model with the model obtained after adding the 4 top-ranked lncRNAs that had not been selected. The performance metrics, including the AUC, balanced accuracy (accuracy for balanced datasets), sensitivity, and specificity, were reported as the mean and 95% CI across the 100 iterations or the cross-validation. The sensitivity and specificity were determined using 0.5 as the threshold for the predicted class probability.
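The threshold-based metrics and their summary across iterations can be sketched as below. This is a hedged illustration: the function names are hypothetical, a probability exactly equal to 0.5 is assumed to be classified as positive, and the 95% CI uses a normal approximation (mean ± 1.96 × standard error), which is an assumption because the text does not state how the intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev


def threshold_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity and balanced accuracy at a probability threshold."""
    # Assumption: probabilities equal to the threshold count as positive.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def mean_ci95(values):
    """Mean and normal-approximation 95% CI (assumed method) across iterations."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Toy example: 1 = deceased, 0 = living.
print(threshold_metrics([1, 1, 1, 0, 0, 0], [0.9, 0.6, 0.4, 0.2, 0.7, 0.1]))
print(mean_ci95([0.8, 0.9, 0.7, 0.8]))
```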

Quantitative PCR (qPCR)

RNA samples extracted from whole blood collected in PAXgene tubes were used to assess the expression levels of LEF1-AS1. For each RNA sample, 200 ng was reverse transcribed with the High-Capacity cDNA Reverse Transcription Kit (ThermoFisher Scientific, Cat # 4368814). To avoid any batch effect, cDNA samples were then randomized into 3 different batches prior to being assessed by quantitative PCR using the CFX-OPUS-96 Dx qPCR device (Bio-Rad, Temse, Belgium) with IQ SYBR Green Supermix (Bio-Rad). Each sample was quantified in duplicate. The following primer sequences, designed with the Beacon Designer software (Premier Biosoft), were used for LEF1-AS1: forward 5′-GTCCATGCTATGACCATCTCCA-3′, reverse 5′-ACACGAGTTAAGGCACATTCA-3′; and for the housekeeping gene Splicing Factor 3a Subunit 1 (SF3A1), which was used as normalizer: forward 5′-GATTGGCCCCAGCAAGCC-3′, reverse 5′-TGCGGAGACAACTGTAGTACG-3′. Expression levels were calculated by the relative quantification method (ΔΔCt) using the CFX Manager 2.1 software (Bio-Rad).
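For readers unfamiliar with relative quantification, the textbook 2^(−ΔΔCt) calculation can be sketched as follows. This is only an illustration: it assumes 100% amplification efficiency for both primer pairs, averages the duplicate Ct values, and takes a calibrator ΔCt as input, none of which is detailed in the text. The function name and the Ct values are hypothetical, and the calculation performed by the CFX Manager software may differ.

```python
from statistics import mean


def relative_expression(target_cts, reference_cts, calibrator_delta_ct):
    """Fold change by the 2^(-ddCt) method (assumes 100% PCR efficiency).

    target_cts / reference_cts: duplicate Ct values for the target
    (e.g. LEF1-AS1) and the normalizer (e.g. SF3A1) in one sample.
    calibrator_delta_ct: dCt of the calibrator sample or group (assumed input).
    """
    delta_ct = mean(target_cts) - mean(reference_cts)  # normalize to SF3A1
    delta_delta_ct = delta_ct - calibrator_delta_ct    # compare to calibrator
    return 2 ** (-delta_delta_ct)


# Hypothetical Ct values: the sample's dCt is 4, the calibrator's is 5,
# so the target is expressed two-fold higher than in the calibrator.
print(relative_expression([24.0, 24.0], [20.0, 20.0], calibrator_delta_ct=5.0))
```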

Statistical analysis

Continuous and categorical variables were compared with the two-sided unpaired Student’s t-test and Fisher’s exact test, respectively. The Mann–Whitney test was used to compare datasets that were not normally distributed, as assessed by the Shapiro–Wilk test. Correlation between qPCR and RNAseq data was evaluated using Spearman’s rank correlation test. Cox proportional hazards regression was used to test the association of lncRNAs with survival using the coxph function from the survival R package (https://cran.r-project.org/web/packages/survival/index.html). For survival time, the start date was the date of admission, and the end date was the date of death or the date of discharge. The association between lncRNAs and survival is reported as a hazard ratio (HR), along with a measure of precision (95% confidence interval, CI). The significance level was set at 0.05. Kaplan–Meier curves stratified by lncRNA quartile were generated for survival analyses using the ggsurvplot function from the survminer R package (https://cran.r-project.org/web/packages/survminer/index.html).
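The Kaplan–Meier estimate underlying such curves can be illustrated with a short standard-library Python sketch. It is not a substitute for the survival and survminer R packages used in the study. The function name is hypothetical, and the sketch assumes that patients discharged alive are treated as censored at discharge, which the text implies but does not state. Quartile stratification would amount to applying the function to each quartile group separately.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with a death.

    durations: days from admission to death or discharge.
    events: 1 if the patient died, 0 if censored (assumed: discharged alive).
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Product-limit step: deaths at t among those still at risk.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data: five patients, three deaths, two censored observations.
print(kaplan_meier([5, 8, 8, 12, 20], [1, 0, 1, 1, 0]))
```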

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.