Introduction

Circular RNAs (circRNAs) are a distinct class of RNA molecules generated by back-splicing and characterized by a ring structure1. The interplay of circRNAs with microRNAs or RNA-binding proteins plays a vital role in nearly all essential biological activities2,3,4. Beyond these classical mechanisms, emerging evidence of coding capacity in recent years adds another layer to circRNA function, whereby these newly identified translation templates may generate novel protein products with biological implications5,6,7. By estimation, the circRNA-derived non-canonical proteome comprises products that are small in size but large in number8,9,10. With accumulating reports of circRNA-encoded products of clinical significance, and of circRNAs artificially engineered for stable translation, circRNAs with coding potential are becoming a rising star in RNA biology, and their large-scale detection, annotation and characterization present unprecedented opportunities and challenges for both experimental and computational science11,12,13.

Coding elements such as the internal ribosome entry site (IRES), open reading frame (ORF) and m6A modification form the structural basis of translatable circRNAs14,15. IRESs and ORFs were the common targets of early bioinformatic tools such as IRESite and ORF Finder16,17, which were followed by software performing comprehensive assessment of RNA coding potential (e.g., Coding Potential Calculator or CPC, Coding-Potential Assessment Tool or CPAT)18,19. In contrast to investigation on an individual basis, recent high-throughput approaches open the possibility of global surveys of translated circRNAs. Of note, circRNAs can be harvested efficiently genome-wide by detecting back-splicing junctions (BSJs) with the aid of bioinformatics20. Essentially, the gold standard for defining protein-coding circRNAs is to uncover ribosome-protected fragments (RPFs) by aligning ribosome profiling (Ribo-seq) reads to circRNAs identified from circRNA sequencing (circRNA-seq) or total RNA sequencing (RNA-seq) data21,22. Given the functional complexity of circRNAs, coding potential software can further identify circRNAs with translation potential within the repertoire of RPF-supported circRNAs.

The efficiency of retrieving translatable circRNAs via data mining depends heavily on the performance of circRNA detection and coding prediction software23,24. As RNA-associated algorithms have prospered in discovery-driven studies, comprehensive, efficient and reliable de novo detection of circRNAs has been achieved by an ever-increasing number of continually updated computational methods (e.g., CIRI2, CIRCexplorer2, Findcirc). Recently, Gaffo et al. showed that particular combinations of algorithms outperform any single method, and accordingly developed CirComPara2, which integrates seven algorithms and exhibits both heightened reliability and sensitivity25. On the coding prediction side, CPC, an early RNA coding prediction tool based on a support vector machine classifier, has been updated to the faster and more accurate CPC226. In addition to alignment-based methods, CPAT sets an alignment-free paradigm for efficient and reliable prediction of RNA coding potential19. To date, CircPro, a highly modularized, integrative and well-established tool, has provided a major bioinformatic solution for comprehensive circRNA characterization23. However, owing to the limitations of its implemented tools (i.e., CIRI2, CPC), CircPro is no longer optimal for current research, and more potent software updates are needed for current in silico investigation. To this end, we improved CircPro into CircProPlus with enhanced performance in the detection of protein-coding circRNAs. Major improvements include allowing user-defined input of circRNAs calculated elsewhere, updating CPC to CPC2 and adding CPAT as an alternative option in the coding prediction module, and optimizing the overall workflow by removing time-wasting steps. As a result, CircProPlus offers more flexibility and accessibility as well as heightened efficiency and accuracy, highlighting its potency in de novo detection of translatable circRNAs.

Results

Overview of the CircProPlus workflow and characteristics

We developed CircProPlus from CircPro by updating functional modules and optimizing the overall workflow for better data mining. Like CircPro, CircProPlus implements an automated computational pipeline for de novo detection of protein-coding circRNAs (Fig. 1). However, the following major modifications were applied to boost software performance. (1) Module 1: De novo circRNA detection. CIRI2 is set as the default for circRNA detection, while other algorithms are available for users to customize their circRNA analysis. Pre-built genome index files are reused, which allows parallel calculation. (2) Module 2: Coding potential prediction. The newer CPC2 and CPAT are implemented to cross-check prediction results. (3) Module 3: Ribo-seq reads identification. Ribo-seq reads are allowed to align to any site of a circRNA, not only to the BSJ. As a result, CircProPlus offers users more options for harnessing the power of state-of-the-art algorithms. Meanwhile, CircProPlus improves the detection of circRNAs bound by Ribo-seq reads, providing a much larger repertoire of translated circRNAs. Additionally, because genome indexes are reusable, repeated steps are avoided in each run and parallel calculation becomes available. Taken together, these improvements reduce computational load while boosting the performance of CircProPlus by offering more flexibility, accessibility and reliability.

Figure 1

The workflow of CircProPlus. Three functional modules constitute the CircProPlus pipeline. BSJ, back-splicing junction.

Performance of CircProPlus in processing circRNA-seq data

To minimize the impact of linear RNAs, a poly(A)-depleted, RNase R-treated RNA library is highly recommended for efficient characterization of genome-wide circRNAs. We first tested the performance of CircProPlus using real circRNA-seq data and matched Ribo-seq reads from human breast tissue. Both CIRI2 and CirComPara2 were used with CircProPlus for circRNA detection. On the same computer hardware, the CIRI2 implementation yielded 20,777, 9754, 9110 and 18,153 translated circRNAs in the four samples, whereas CirComPara2 yielded 42,917, 31,455, 23,431 and 35,418 translated circRNAs in the corresponding samples (Fig. 2a, Table S1, 2).

Figure 2

The number of translated circRNAs detected by CircProPlus from circRNA-seq reads (RNase R treated) using the CIRI2 or CirComPara2 implementation. (a, b) Performance of CircProPlus was tested in human (a) and mouse (b) datasets. Paired t-test.

Next, we applied CircProPlus to mouse sequencing data. Using the CIRI2 implementation, CircProPlus identified 2175 and 4541 translated circRNAs in the two embryonic stem cell (ESC) groups and 5055 and 7670 in the neural progenitor cell (NPC) groups (Fig. 2b, Table S1, 2). CirComPara2, by contrast, retrieved many more translated circRNAs when integrated with CircProPlus: 11,540 and 19,263 in the ESC groups and 20,438 and 28,785 in the NPC groups.
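The paired comparison between the two detection options can be reproduced from the counts reported above. The following is a minimal sketch in plain Python, not the analysis actually run in Prism; the function name `paired_t` is illustrative, and it assumes that the four mouse samples are paired in the order listed (two ESC groups, then two NPC groups). It reports only the mean difference, t statistic and degrees of freedom, leaving the P value to a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (mean difference b - a, t statistic, degrees of freedom)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("paired samples must have equal length of at least 2")
    diffs = [y - x for x, y in zip(a, b)]
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = d_mean / (d_sd / math.sqrt(len(diffs)))
    return d_mean, t_stat, len(diffs) - 1


# Translated circRNA counts from circRNA-seq data, as reported in the text.
human_ciri2 = [20777, 9754, 9110, 18153]
human_circompara2 = [42917, 31455, 23431, 35418]
mouse_ciri2 = [2175, 4541, 5055, 7670]            # ESC, ESC, NPC, NPC
mouse_circompara2 = [11540, 19263, 20438, 28785]  # same sample order (assumed)

for label, a, b in [
    ("human", human_ciri2, human_circompara2),
    ("mouse", mouse_ciri2, mouse_circompara2),
]:
    d_mean, t_stat, df = paired_t(a, b)
    print(f"{label}: mean difference = {d_mean:.2f}, t = {t_stat:.2f}, df = {df}")
```

The same function applies unchanged to the total RNA-seq counts.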

Performance of CircProPlus in analyzing total RNA-seq data

In deep sequencing of RNase R-untreated libraries, linear RNAs predominate, leaving only a small fraction of reads attributable to circRNAs. We next asked whether CircProPlus could still efficiently discern circRNAs from total RNA-seq data, an alternative potential source of circRNAs. Implementing CIRI2 within CircProPlus revealed 3237, 5268, 3786 and 5730 translated circRNAs in the four human samples, in contrast to 12,194, 17,776, 15,581 and 20,167 translated circRNAs retrieved when CIRI2 was replaced with CirComPara2 (Fig. 3a, Table S3, 4).

Figure 3

The number of translated circRNAs detected by CircProPlus from total RNA-seq reads (RNase R untreated) using the CIRI2 or CirComPara2 implementation. (a, b) Performance of CircProPlus was tested in human (a) and mouse (b) datasets. Paired t-test.

Besides human samples, mouse RNA-seq reads from RNase R-untreated libraries were also tested. Consistently, CIRI2-implemented CircProPlus discovered only 661 and 893 translated circRNAs in the ESC groups and 1721 and 2133 in the NPC samples, fewer than CirComPara2-implemented CircProPlus, which yielded 3527 and 4305 translated circRNAs in the ESC groups and 7779 and 7824 in the NPC samples (Fig. 3b, Table S3, 4).

Runtime and memory consumption

CircProPlus inherited the highly modularized framework of CircPro and removed redundancy in its computational design. For example, genome index files are generated only once. Meanwhile, CPC, the most time-demanding tool in CircPro, whose running speed depends on the online environment, has been upgraded to the much faster CPC2. As a result, CircProPlus with the default CIRI2 implementation finished each run within hours using a maximum of 32 threads. In addition, the running time of CircProPlus scaled almost linearly with the number of input reads (Fig. 4a,b). Peak memory, by comparison, varied only mildly as the number of input reads grew.

Figure 4

Runtime of CircProPlus with the CIRI2 implementation plotted against the number of processed reads from circRNA-seq (a) or RNA-seq (b) data, using a maximum of 32 threads. Simple linear regression was performed.
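For readers who wish to check how runtime scales on their own hardware, the regression can be sketched as an ordinary least-squares fit. The example below is illustrative only: the function name `fit_line` and the read counts and runtimes are hypothetical placeholders, not measurements from this study.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept; also returns R^2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Hypothetical values for illustration only (not data from this study).
input_reads_millions = [20, 40, 60, 80]
runtime_hours = [1.1, 1.9, 3.2, 3.8]

slope, intercept, r_squared = fit_line(input_reads_millions, runtime_hours)
print(f"hours per million reads: {slope:.4f}")
print(f"intercept (hours): {intercept:.4f}")
print(f"R^2: {r_squared:.4f}")
```

A slope that stays stable as more samples are added, together with an R^2 close to 1, would be consistent with near-linear scaling.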

Discussion

CircRNAs with biological significance and coding potential are becoming a rising focus in RNA biology. Although high-throughput sequencing technologies are highly efficient in transcriptome characterization, genome-wide detection of circRNAs with coding potential poses a challenge for bioinformatics. Earlier strategies, such as targeting basic coding elements (e.g., IRESite, ORF Finder) and adopting sequence alignment (e.g., CPC) or feature comparison (e.g., CPAT), provide only a theoretical basis for RNA coding ability. Ribosome profiling coupled with transcriptome analysis, by contrast, provides experimental support that reinforces the identity of such noncanonical circRNAs. However, because ribosome-bound circRNAs play multiple functional roles, circRNAs serving as coding templates should be further distinguished from those acting as ribosome regulators27,28,29,30.

To date, CircPro still represents a popular, integrative pipeline for de novo identification of protein-coding circRNAs. Among the three functional modules of CircPro, CIRI2 and CPC are the core algorithms implemented in the circRNA detection and coding potential prediction modules, respectively. CIRI2 has long been a benchmark method in circRNA mining31. CirComPara2, a more recently developed tool, achieves even higher efficacy and accuracy by integrating seven circRNA detection methods including CIRI2. Furthermore, Gaffo et al. showed that CirComPara2 could recall true circRNAs missed by any single method, which is consistent with our finding that CirComPara2 outperformed CIRI2 in circRNA retrieval when coupled with CircProPlus25. We also introduced newer algorithms to revitalize the coding prediction module. CPC2, which assesses coding potential from intrinsic sequence features and runs about 1000 times faster than CPC, and CPAT, an alignment-independent method that distinguishes coding from non-coding genes via sequence features, were both integrated into CircProPlus, aiming to improve the accuracy of predicted coding circRNAs19,26. Moreover, CircProPlus allows the discovery of all circRNAs aligned with Ribo-seq reads regardless of the binding site, which remarkably improves the yield of translated circRNAs. Following these revisions, we present CircProPlus as a potent computational tool for the detection of translated circRNAs, whose high efficacy has been tested in both RNase R-treated and untreated libraries from human and mouse samples.

In addition to updating the implemented tools, we also optimized the overall workflow of CircProPlus to make it more convenient, resilient and efficient. Time-consuming steps such as creating reference genome indexes in each run were removed and index files can be reused, which dramatically decreases computational time and enables parallel processing of multiple tasks.

The tools implemented in CircProPlus remain the major source of bias in exploring protein-coding circRNAs. Firstly, most circRNA detection tools identify back-splice junction reads through chimeric mapping to the reference genome with aligners such as TopHat2, STAR, BWA-MEM and Segemehl32,33,34,35. Less common strategies, such as sequence feature identification and machine learning, were developed more recently. However, none of them achieves both high sensitivity and high precision, which undermines the reliability of the computational pipelines built on them, such as CIRI2, CIRCexplorer2 and Findcirc25. Combining multiple methods for cross-checking provides a possible solution to this dilemma, as Gaffo et al. found that CirComPara lowered the false positive rate by focusing on circRNAs commonly discovered by at least two algorithms25. Nevertheless, a few RNase R-sensitive circRNAs (e.g., circCDR1as) could still be missed by circRNA library preparation based on RNase R treatment36. Thus, how to efficiently recall circRNAs with low abundance poses a further challenge for both methodology and bioinformatics. Secondly, CPC2 and CPAT facilitate fast and accurate assessment of the coding ability of RNA transcripts, especially long non-coding RNAs. To optimize their performance on circular transcripts, additional software improvements are required that consider the unique structure of circRNAs and the possibility of rolling translation. Moreover, although evidence provided by Ribo-seq data strengthens the predictions, false positives are unavoidable given that some ribosome-bound circRNAs merely regulate ribosome biology rather than encode proteins. With constant advances in next-generation sequencing technologies, evidence from multi-omics will lend more confidence. In addition to updating modular tools, a more potent CircProPlus that integrates the transcriptome, translatome and proteome is much anticipated for circRNA biology.
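The cross-check idea, keeping only circRNAs reported by at least two algorithms, can be illustrated with a short voting filter. This is a simplified sketch and not the CirComPara2 implementation; the function name, the tool labels and the back-splice coordinates are hypothetical, and each circRNA is assumed to be identified by a (chromosome, start, end, strand) tuple.

```python
from collections import Counter


def consensus_circrnas(calls_by_tool, min_tools=2):
    """Return the circRNAs reported by at least `min_tools` detection tools."""
    if min_tools < 1:
        raise ValueError("min_tools must be at least 1")
    votes = Counter()
    for calls in calls_by_tool.values():
        votes.update(set(calls))  # each tool votes at most once per circRNA
    return {bsj for bsj, n in votes.items() if n >= min_tools}


# Hypothetical calls; keys are (chromosome, start, end, strand).
calls_by_tool = {
    "tool_a": [("chr1", 1000, 2000, "+"), ("chr2", 500, 900, "-")],
    "tool_b": [("chr1", 1000, 2000, "+"), ("chr3", 10, 400, "+")],
    "tool_c": [("chr2", 500, 900, "-"), ("chr1", 1000, 2000, "+")],
}

kept = consensus_circrnas(calls_by_tool, min_tools=2)
for bsj in sorted(kept):
    print(bsj)
```

Raising `min_tools` trades sensitivity for precision, which is the balance discussed above.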

Recently, circRNA-encoded products have been highlighted in almost all biological fields, with versatile roles in oncogenesis, immunity, metabolism, infection and development3. Shedding light on circRNAs with coding potential with the aid of bioinformatics therefore holds remarkable significance. Despite unprecedented interest in the nonclassical proteome, existing computational pipelines are few and insufficiently efficient to meet the ever-increasing needs of biological research. CircProPlus provides a renewed, promising computational solution for uncovering the coding potential of circRNAs. By integrating state-of-the-art algorithms, CircProPlus draws on the advantages of each. In conclusion, with CircProPlus we provide highly efficient software that aims to facilitate comprehensive exploration of protein-coding circRNAs in living organisms.

Methods

Real datasets and reference genome

Overall, four samples of human breast tissue with matched circRNA-seq, total RNA-seq and Ribo-seq datasets were retrieved from Gene Expression Omnibus (GEO) under the accession GSE210793. Human reference genome files (ucsc.hg19.fasta, hg19.ensGene.gtf) were downloaded from UCSC (https://genome.ucsc.edu). For mouse data, sequencing reads from RNase R-treated or untreated RNA libraries of embryonic stem cells (ESCs) and neural progenitor cells (NPCs) were obtained from GSE157788, and Ribo-seq data were collected from GSE248725 and GSE154992. The GRCm38/mm10 assembly of the mouse genome and its annotation (mm10.ncbiRefSeq.gtf) were obtained from UCSC. Further details of the data used in this study are provided in Table 1.

Table 1 Real datasets used to test the performance of CircProPlus.

Implementation of CircProPlus

The original CircPro integrates three modules: (1) Module 1: circRNA detection. CIRI2 is specified for de novo circRNA detection. (2) Module 2: protein-coding potential score. CPC is designated for coding potential prediction. (3) Module 3: junction reads from Ribo-seq. Only BSJ-spanned reads are considered. CircProPlus is an improved version of CircPro with the following revisions: (1) retaining CIRI2 as the default algorithm for circRNA detection while also allowing the input of circRNAs calculated by user-defined algorithms, (2) preprocessing genome index files and rendering them reusable, (3) introducing CPC2 and CPAT to supersede CPC in the coding prediction module, (4) allowing Ribo-seq reads mapped to all sites of circRNAs (BSJ-spanned or BSJ-free). In addition, the running script has been rearranged to lower operational load and heighten efficiency via parallel multi-task processing.
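The difference between counting only BSJ-spanned reads and counting reads at any site can be illustrated with a small classifier. This is a conceptual sketch and not the CircProPlus source code. It assumes a hypothetical reference in which the circRNA sequence is written twice in tandem, so that the BSJ lies between 0-based positions circ_len - 1 and circ_len; the function names, the `min_overhang` parameter and the example alignments are all illustrative.

```python
from collections import Counter


def classify_alignment(start, read_len, circ_len, min_overhang=1):
    """Label one Ribo-seq alignment on a doubled circRNA reference.

    Assumption: `start` is the 0-based alignment start within the first copy,
    and the BSJ sits between positions circ_len - 1 and circ_len.
    """
    if read_len <= 0 or circ_len <= 0:
        raise ValueError("lengths must be positive")
    if not 0 <= start < circ_len:
        raise ValueError("start must fall within the first copy")
    left = circ_len - start              # read bases upstream of the BSJ
    right = start + read_len - circ_len  # read bases downstream of the BSJ
    if left >= min_overhang and right >= min_overhang:
        return "BSJ-spanned"
    return "BSJ-free"


def count_support(alignments, circ_len, min_overhang=1):
    """Tally BSJ-spanned and BSJ-free reads for one circRNA."""
    return Counter(
        classify_alignment(start, read_len, circ_len, min_overhang)
        for start, read_len in alignments
    )


# Hypothetical alignments (0-based start, read length) on a 100-nt circRNA.
alignments = [(90, 30), (10, 30), (70, 30), (85, 28)]
support = count_support(alignments, circ_len=100)

bsj_only_reads = support["BSJ-spanned"]  # junction-only counting
any_site_reads = sum(support.values())   # any-site counting
print(dict(support), bsj_only_reads, any_site_reads)
```

Under the junction-only rule a circRNA is supported only by the first count, whereas the any-site rule also credits reads that map entirely within the circRNA body.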

Details for running CircProPlus

To run CircProPlus with the default CIRI2 algorithm, the circRNA input data are FASTQ files from circRNA-seq or total RNA-seq. Clean input data are recommended; alternatively, adapters can be removed automatically by CircProPlus based on user-supplied adapter information. The reference genome in FASTA format, annotation files in GTF format and rRNA sequences in FASTA format should be prepared and specified. CircProPlus can also analyze the output generated by other circRNA detection algorithms such as CirComPara2, in which case the following additional steps are required: the circRNA output in GTF format should be converted into a list format using the Perl script convert.pl and into BED format using the gtf2bed tool. CircRNA sequences are then extracted from the reference genome with the aid of this BED file. The list file and sequence file should be passed to the parameters “-L” and “-E”, respectively.
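The coordinate change involved in the GTF-to-BED step is sketched below for orientation. It does not replace gtf2bed or convert.pl, and the exact fields those tools emit are not reproduced here. The sketch assumes the standard nine-column, tab-separated GTF layout with 1-based inclusive coordinates and writes six-column BED with 0-based half-open coordinates; the function names, the name-column format and the example record are hypothetical.

```python
def gtf_line_to_bed6(line):
    """Convert one GTF record (1-based, inclusive) to BED6 (0-based, half-open)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    chrom, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    if start < 1 or end < start:
        raise ValueError("invalid GTF coordinates")
    name = f"{chrom}:{start}-{end}:{strand}"  # illustrative identifier
    # BED start is GTF start - 1; BED end equals GTF end; score is a placeholder.
    return "\t".join([chrom, str(start - 1), str(end), name, "0", strand])


def gtf_to_bed6(lines):
    """Yield BED6 rows, skipping blank lines and '#' comment lines."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield gtf_line_to_bed6(line)


# Hypothetical record for illustration only.
example_gtf = [
    "# demo annotation",
    "chr1\tdemo\tcircRNA\t1000\t2000\t.\t+\t.\tgene_id demo;",
]

for row in gtf_to_bed6(example_gtf):
    print(row)
```

Because BED intervals are half-open, the interval length (end minus start) equals the number of bases, which is what sequence extraction from the reference genome relies on.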

Statistics

Statistical analysis was performed using Prism 9.0 (GraphPad Software). A paired t-test was used to compare the performance of CircProPlus with the CIRI2 or CirComPara2 implementation. Simple linear regression analysis was performed to evaluate the relationship between running time and the number of input reads. A P value of < 0.05 was considered statistically significant.

Algorithm versions

The software versions of the major tools implemented in CircProPlus are as follows: CIRI2 v2.0.5, CPC2 v0.9-r2, CPAT v3.0.5, BWA-MEM v0.7.5 and Bowtie2 v2.4.1.