Talk:WikiJournal of Science/A broad introduction to RNA-Seq

From Wikiversity


WikiJournal of Science
Open access • Publication charge free • Public peer review • Wikipedia-integrated

WikiJournal of Science is an open-access, free-to-publish, Wikipedia-integrated academic journal for science, mathematics, engineering and technology topics.


Article information

Author: Felix Richter, et al.

Felix Richter; et al. (17 May 2021), "A broad introduction to RNA-Seq", WikiJournal of Science, 4 (2): 4, doi:10.15347/WJS/2021.004, ISSN 2470-6345, Wikidata Q100146647


Plagiarism check

Pass. Report from WMF copyvios tool. The abstract overlaps with multiple other internet pages; however, those pages are reusing the content from Wikipedia (often outside the cc-by-sa license). No genuine plagiarism detected. T.Shafee(Evo﹠Evo)talk 01:30, 21 August 2019 (UTC)

First peer review

Review by Till Adhikary, Institute for Medical Bioinformatics and Biostatistics, Centre for Tumour Biology and Immunology, Philipps University of Marburg

These assessment comments were submitted on , and refer to this previous version of the article

The article by Richter et al. gives a comprehensive and balanced overview of RNA-seq. It summarises technical principles, in silico tools, and common approaches, the advantages and disadvantages of which are discussed. In my opinion, only minor additions are needed.


Thank you, your feedback is very much appreciated.

Library generation, step 2

I suggest mentioning that poly(A) selection not only ignores noncoding RNA but also misses several protein-coding transcripts, such as those encoding core histones, which are not polyadenylated. Furthermore, many cytokine-encoding transcripts are regulated via the length of their poly(A) tail, which can lead to failure of detection when using oligo(dT) for enrichment.

Simple, streamlined approaches such as Quantseq omit enrichment/depletion steps and generate libraries with strong 3' bias.


These poly(A) selection caveats are now in the article and Quantseq is provided as an example method for enrichment/depletion.

Single cell RNA sequencing, Experimental procedures

It is correct that each droplet carries a different barcode, but this may be confusing for readers with a basic level of knowledge. I think it should be mentioned that this is mediated by the presence of a bead which carries the barcoded oligonucleotides on its surface. Labeling of single cells is usually achieved by supplying limiting amounts of both cells and beads so that co-occupancy of a droplet by more than one cell, more than one bead, or more than one of each is a very rare event. Since the use of beads is a standard approach, I also suggest including this in the workflow presented in figure 5. Another option is to mechanically separate cells into single wells (e.g. Takara ICELL8 or similar devices) and to process them individually.

It may also be worth including information on unique molecular identifiers (UMIs), which enable the identification of artefacts (overrepresentation of sequences derived from a single RNA molecule) introduced during library preparation.


The scRNA-Seq section has been improved to clarify droplets vs microwells, bead labeling, and UMIs.
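
As an illustration of the UMI point raised above, the following is a minimal sketch of how reads sharing a cell barcode, gene, and UMI can be collapsed into a single molecule so that amplification duplicates are not counted twice. The tuple layout and the function name `count_molecules` are assumptions made for this example and do not correspond to any specific tool or file format.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        # Reads with the same barcode, gene, and UMI are treated as
        # copies of one original RNA molecule.
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


reads = [
    ("CELL_A", "AACG", "GAPDH"),
    ("CELL_A", "AACG", "GAPDH"),  # duplicate of the first read
    ("CELL_A", "TTGC", "GAPDH"),
    ("CELL_B", "AACG", "GAPDH"),  # same UMI, but a different cell
]
counts = count_molecules(reads)
print(counts)
```

This sketch uses exact UMI matching only; it makes no attempt to handle sequencing errors within the UMI itself.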

Gene expression quantification

I suggest changing the sentence "Gene expression is often used as a proxy for protein abundance" to "Transcript levels are often used as a proxy for protein abundance", since gene expression encompasses processes beyond transcription.


This has been updated.

The use of spike-ins does not only allow for absolute quantification but also enables detection of genome-wide effects, e.g. after depletion of global regulators such as MYC, chromatin remodelers, and acetyltransferase complexes. Here, standard normalisation procedures strongly distort the outcome of the analysis (see e.g. DOI:10.1128/MCB.00970-14), which can easily go unnoticed if spike-ins are omitted.


Because these concepts are so closely related, I expanded this section to "Spike-ins for absolute quantification and detection of genome-wide effects" and have incorporated these points in the text.
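
The reviewer's point about genome-wide effects can be shown with a toy calculation. This is a sketch under stated assumptions: the counts are invented, every endogenous transcript is assumed to halve per cell in the treated sample, the same amount of spike-in is assumed to be added per cell, and the treated library is assumed to be sequenced twice as deeply. Library-size normalisation here uses endogenous reads only.

```python
def library_size_normalise(gene_counts):
    """Scale counts to counts per million of endogenous reads."""
    total = sum(gene_counts.values())
    return {gene: count / total * 1e6 for gene, count in gene_counts.items()}


def spike_in_normalise(gene_counts, spike_in_count):
    """Scale counts relative to the reads obtained from the spike-in."""
    return {gene: count / spike_in_count for gene, count in gene_counts.items()}


def fold_changes(treated_norm, control_norm):
    """Treated / control ratio for each gene."""
    return {gene: treated_norm[gene] / control_norm[gene] for gene in control_norm}


# Invented counts: the raw gene counts look identical, but the treated
# sample yields twice as many spike-in reads for the same spike-in input.
control = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
treated = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
control_spike, treated_spike = 500, 1000

fc_library = fold_changes(library_size_normalise(treated),
                          library_size_normalise(control))
fc_spike = fold_changes(spike_in_normalise(treated, treated_spike),
                        spike_in_normalise(control, control_spike))

print(fc_library)  # every gene appears unchanged
print(fc_spike)    # every gene appears halved
```

In this toy case, library-size normalisation reports no change for any gene, whereas the spike-in reference recovers the assumed global two-fold decrease.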

Minor comments

RNA-seq is sometimes spelled RNA-Seq, RNAseq, or RNASeq. If I am not mistaken, "RNA-seq" is the common spelling.


The article now uses RNA-Seq and I have added the other spellings to the introductory sentence. RNA-Seq and RNA-seq seem equally popular (e.g., our other reviewer used RNA-Seq), so I opted for RNA-Seq to stay consistent with the Wikipedia article title.

Several figures are not referred to in the main text, and some of them (3,4) represent specific workflows, which might confuse entry-level readers.


Figures are now referenced in the text and figures 3 and 4 (on magnetic bead selection) have been removed from the article.

Second peer review

Review by Martin Hölzer, MF1 Bioinformatics, Robert Koch Institute, 13353 Berlin, Germany

These assessment comments were submitted on , and refer to this previous version of the article

The article on whole transcriptome sequencing (RNA-Seq) presented here gives a comprehensive overview of this impressive technology, which has been developed over the last ten years and is now widely used in various fields of molecular biology. First of all, I would like to thank Richter et al. for all their work and for preparing this article. While I believe that the article already covers most of the important aspects and applications of the technology, I would like to add/discuss a few more points. I do not think that all this needs to be included, especially as the article is intended for a wider audience. However, I wanted to add the latest knowledge from the areas I am most familiar with and suggest some changes to describe certain aspects of RNA-Seq in more detail.


Thank you for the thoughtful feedback. I felt that your suggestions were applicable for a wider audience and have incorporated all of them.

  • Methods: Sequencing. I have the feeling that the Methods section is missing a subsection about the last but crucial step of RNA-Seq: the sequencing itself. Of course, there are other articles covering NGS, but a short paragraph "Sequencing" in between "Library preparation" and "Small RNA/non-coding RNA sequencing" (maybe consider renaming the latter to "Small RNA/non-coding RNA library preparation") could emphasize the sequencing itself a bit more, referencing short-read (mainly Illumina) and long-read (mainly PacBio, ONT) sequencing.

There is now a Sequencing section and paragraph listing commonly used technology platforms, workflows, and experimental design considerations. I condensed the Small RNA/non-coding RNA library preparation section into a single sentence under "RNA selection/depletion" in the Library Preparation section.

  • Methods: Long-read technologies. Long-read sequencing of cDNA (PacBio IsoSeq, Oxford Nanopore Technologies (ONT)) and native RNA (ONT) using SMRT long-read technologies is mentioned in some parts throughout the article. However, due to the increased emergence and use of such technologies, I think that the section on "Direct RNA sequencing" can be expanded a bit and perhaps renamed "Single-molecule real-time RNA sequencing", with a subsection on direct RNA sequencing. This section could then also have an "Applications" section, like the scRNA-Seq part, mentioning RNA modification analyses, full-length transcript characterization, ... The SMRT and dRNA-Seq part appears relatively short, e.g. in comparison to the scRNA-Seq part. The authors could also mention that, because the RNA is sequenced in its native form, modifications are preserved and can be investigated directly. In addition, differences between short reads (commonly derived from Illumina sequencing) and long reads derived from PacBio or Nanopore sequencing can be emphasized: long reads have the potential to cover transcripts in full length but also have a higher error rate (in particular for dRNA-Seq). Regarding dRNA-Seq, I suggest adding: "ONT allows direct sequencing of native RNA molecules (dRNA-Seq) without any fragmentation steps and cDNA conversion errors. For example, the recent use of ONT dRNA-Seq for the differential expression of human cell populations has impressively demonstrated that this technology can overcome many limitations of short and long cDNA sequencing methods [7]."

The Direct RNA-Seq section has been replaced with a Single-molecule real-time RNA sequencing section that emphasizes the strengths and limitations of this technology.

  • Methods: polyA. "filtered for RNA with 3' polyadenylated (poly(A)) tails to include only mRNA": long non-coding RNA, for example, can also be polyadenylated; see [8]: "Most of the lncRNAs contain normal 5’-caps and 3’ poly-A tails." Thus I suggest changing the sentence to "The RNA with 3' poly(A) tails are [mainly composed of] mature, processed, coding sequences." and removing or rephrasing "Poly(A) selection ignores noncoding RNA", because there is also polyadenylated (long) ncRNA.

This text has been updated. I also changed the RNA selection/depletion table column header from "Type of RNA" to "Predominant Type of RNA".

  • Methods: RNA selection and depletion methods table. Why is the genomic DNA content for rRNA-depleted library preparation "high"? If treated with DNase there should be almost no DNA left. Or do you mean that there is a higher content of RNA transcribed from non-protein-coding genomic regions?

The genomic DNA content column likely reflected older procedures and has been removed.

  • Methods: Data management. "A single RNA-Seq experiment in humans is usually on the order of 1 Gb." I would like to discuss whether this 1 Gb slightly underestimates the actual size of a single human RNA-Seq sample sequenced with Illumina, in particular when not further compressed, e.g. via gzip. An uncompressed RNA-Seq FASTQ for total RNA can easily reach >5-10 Gb. I checked some of our recent bat/mouse/human RNA-Seq runs and would suggest writing "1-5 Gb (compressed)". In addition, this "Data management" section could also mention the high computational load and the storage of all the intermediate files (such as mapping files, counting files, ...) that can occur and that make up a large proportion of the stored data.

These size caveats are now in the text.
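
The size discussion above can be made concrete with a back-of-the-envelope estimate of uncompressed FASTQ size. This is only a rough sketch: the read count, read length, and header length below are assumptions chosen for illustration (not figures from the article or the review), and "Gb" is read here as gigabytes.

```python
def estimate_fastq_bytes(n_reads, read_length, header_length=40):
    """Rough uncompressed FASTQ size in bytes.

    Each record has four lines: header, sequence, a "+" separator,
    and a quality string as long as the sequence. One newline byte
    is added per line. `header_length` is an assumed average.
    """
    bytes_per_record = (
        (header_length + 1)   # "@..." header line
        + (read_length + 1)   # sequence line
        + 2                   # "+" separator line
        + (read_length + 1)   # quality line
    )
    return n_reads * bytes_per_record


# Hypothetical run: 30 million single-end reads of 100 bp.
size_bytes = estimate_fastq_bytes(n_reads=30_000_000, read_length=100)
print(f"{size_bytes / 1e9:.2f} GB uncompressed")
```

Under these assumptions the estimate falls in the multi-gigabyte range the reviewer describes for uncompressed files; compressed sizes depend on the data and the compressor and are not modelled here.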

  • Analysis: Assembly. "Examples of assemblers that use de Bruijn graphs are ..." I suggest rewriting and extending this to also include more state-of-the-art tools: "Examples of RNA-Seq assemblers that use de Bruijn graphs are Oases (derived from the genome assembler Velvet), Trinity, Bridger, and rnaSPAdes [1]". The genome-guided transcriptome assembly part can be improved. The tools mentioned here are more general RNA-Seq mapping tools and not real assembly tools, because another step is usually needed to actually derive contigs from RNA-Seq mapping files. I suggest adding a sentence, directly after the listing of the mapping tools, about tools that actually take the mapping output (SAM/BAM file) and then reconstruct contigs (a FASTA file): "The output of these genome-guided alignment (mapping) tools can be further utilized by tools such as Cufflinks [2] or StringTie [3] to actually reconstruct contiguous transcript sequences." In addition, can Subread really map reads? I thought it is a tool for read counting from already available BAM files (but this might be wrong). Further, can Sailfish and Kallisto really be used for genome-guided assembly? I thought these are fast alignment-free read counting tools; thus, I am not sure whether one can actually assemble transcripts with these tools (or based on their output). Please also consider adding another recent paper about the comparison of de novo transcriptome assembly tools and comparative metrics to the citations in this paragraph: "A note on assembly quality" [4]. "assemblies that scored well in one species" --> "assembly tools that scored well in one species"

The assembly section now includes newer tools, a sentence on reconstructing contigs, the suggested citation comparing de novo transcriptome assembly tools, and the suggested minor edit. I also moved Kallisto/Sailfish to the gene expression quantification section.

Yes, Subread can map reads and is available as a function in the Rsubread package. featureCounts is the much more commonly used function from that package.
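
For readers unfamiliar with the de Bruijn graphs mentioned in the assembly comment above, here is a minimal sketch of how such a graph is built from reads: every k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a teaching example with invented reads, not the implementation used by any of the assemblers named above.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix -> suffix edge
    return {node: sorted(successors) for node, successors in graph.items()}


# Three overlapping toy reads drawn from the sequence ATGGCAA.
reads = ["ATGGC", "TGGCA", "GGCAA"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the edges from AT through to AA spells out the original sequence, which is the basic idea behind reconstructing contigs from overlapping reads.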

  • Analysis: Gene expression quantification, Sequencing depth/coverage. Because there is still confusion about what the actual difference between RPKM (reads per ...) and FPKM (fragments per ...) is, I suggest mentioning this briefly in the "Sequencing depth/coverage" section. Please consider changing to "... is typically normalized by converting counts to reads, fragments, or counts per million mapped reads (RPM, FPM, or CPM). The differentiation between RPM and FPM historically derived from the initial single-end sequencing of fragments which evolved to paired-end sequencing. Thus, in single-end sequencing, only a single read is produced from one fragment whereas, in paired-end sequencing, two reads derive from one fragment. Thus, when dealing with paired-end read data, the term FPM is used to account for the relationship of two reads belonging to the same single fragment, instead of RPM, where a single read only represents one fragment." See also "fragments per kilobase of transcript per million mapped reads (FPKM)". I think the more common expansion of FPKM is "Fragments per kilobase of exon per million reads mapped" or, more generically, "Fragments per kilobase of feature per million reads mapped", where a feature can be an exon, transcript, gene, ... depending on what the counting was performed on. Finally, in the "Differential expression" section, paragraph "Methods": "... to identify differentially expressed genes, and are either count-based (DESeq2, limma, edgeR) or assembly-based (via alignment-free quantification, sleuth, Cuffdiff, Ballgown)". I am not sure whether "count-based" and "assembly-based" are misleading in this context. As far as I know, tools such as Sleuth, Cuffdiff, ... also need a reference genome for quantification, or in other words, they need the output of other tools that quantify RNA-Seq data against a reference in an alignment-free way, such as Kallisto. But Kallisto also needs to build an index from a FASTA-formatted file of target sequences, and this reference can also be a (genome, transcriptome) assembly. The results of Kallisto are (read, k-mer) counts that are fed into sleuth, cuffdiff, ... Taking all of this into account, I suggest changing "count-based" into "based on read counts derived from mapping against a reference" and "assembly-based" into "based on read counts derived from alignment-free quantification". I don't want to complicate things here for general readers, but on the other hand, I also want to be as precise as possible.

The distinction between RPM and FPM is a fantastic addition. I further clarified the language on this distinction. I also replaced count-based and assembly-based with the suggested, more precise language.
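
The read versus fragment distinction discussed above can be sketched in a few lines. In this toy example, paired-end alignments are counted once per read and once per fragment, and the fragment counts are then converted to FPM and FPKM. The tuple layout, function names, and all numbers are assumptions made for illustration only.

```python
def count_reads_and_fragments(alignments):
    """Count reads and fragments per feature.

    `alignments` is an iterable of (fragment_id, mate_number, feature)
    tuples (a hypothetical layout). Both mates of a pair share one
    fragment_id, so a pair counts as two reads but one fragment.
    """
    read_counts = {}
    fragment_ids = {}
    for fragment_id, _mate_number, feature in alignments:
        read_counts[feature] = read_counts.get(feature, 0) + 1
        fragment_ids.setdefault(feature, set()).add(fragment_id)
    fragment_counts = {f: len(ids) for f, ids in fragment_ids.items()}
    return read_counts, fragment_counts


def per_million(count, total_mapped):
    """Counts per million mapped reads or fragments (RPM, FPM, or CPM)."""
    return count / total_mapped * 1e6


def fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragment_count / (feature_length_bp / 1e3) / (total_mapped_fragments / 1e6)


paired_end = [
    ("frag1", 1, "geneA"), ("frag1", 2, "geneA"),
    ("frag2", 1, "geneA"), ("frag2", 2, "geneA"),
    ("frag3", 1, "geneB"), ("frag3", 2, "geneB"),
]
read_counts, fragment_counts = count_reads_and_fragments(paired_end)
print(read_counts)      # two reads per sequenced fragment
print(fragment_counts)  # one count per sequenced fragment

# Hypothetical gene: 2,000 fragments, 2,000 bp long, 20 million mapped fragments.
print(per_million(2000, 20_000_000))  # FPM
print(fpkm(2000, 2000, 20_000_000))   # FPKM
```

With single-end data each fragment yields exactly one read, so the read and fragment counts coincide and RPM equals FPM.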

  • Minor
      * "Recent advances in RNA-seq include single cell sequencing and in situ sequencing of fixed tissue." Please consider expanding to "... single-cell sequencing, in situ sequencing of fixed tissue, and the sequencing of native RNA molecules using single-molecule real-time nanopore sequencing."
      * I have already edited the caption of Figure 1 to better reflect that RNA is reverse-transcribed into cDNA (I replaced 'copy') and that 'ds' stands for 'double-stranded'.
      * "Once isolated, linkers are added to the 3' and 5' end [and] then purified."
      * Please reference all figures in the main text; sometimes it only says "see figure" (which one?).
      * Bowtie vs BowTie
      * Please harmonize RNA-Seq vs RNASeq vs RNA Seq, ... I think RNA-Seq is the most common spelling.
      * Great that the dangerous Excel conversion of gene names is mentioned in this article! :)
      * "Common tools for gene set enrichment include web interfaces (e.g., ENRICHR, g:profiler)": I suggest adding WEBGESTALT, which is a still-maintained web tool for (not only) GSEA [5].
      * "and used to predict [and] their biological functions." Remove the second 'and'.
      * Section "Variant discovery": please consider adding the following sentence to the paragraph to account for recent findings regarding bias in variant detection based on de novo transcriptome assembly: "However, nucleotide variants derived from de novo transcriptome assemblies were recently shown to contain bias and thus underestimate heterozygosity [6]."
      * "approach is to use pair[ed]-end reads"

All suggested improvements have been implemented.

  • Further suggested references
      [1] Bushmanova, Elena, et al. "rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data." GigaScience 8.9 (2019): giz100.
      [2] Trapnell, Cole, et al. "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks." Nature Protocols 7.3 (2012): 562-578.
      [3] Pertea, Mihaela, et al. "StringTie enables improved reconstruction of a transcriptome from RNA-seq reads." Nature Biotechnology 33.3 (2015): 290-295.
      [4] Hölzer, Martin, and Manja Marz. "De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers." GigaScience 8.5 (2019): giz039.
      [5] Liao, Yuxing, et al. "WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs." Nucleic Acids Research 47.W1 (2019): W199-W205.
      [6] Freedman, Adam H., Michele Clamp, and Timothy B. Sackton. "Error, noise and bias in de novo transcriptome assemblies." Molecular Ecology Resources (2020).
      [7] Gleeson, J., et al. "Nanopore direct RNA sequencing detects differential expression between human cell populations." (2020). doi: 10.1101/2020.08.02.232785.
      [8] Sun, Qinyu, Qinyu Hao, and Kannanganattu V. Prasanth. "Nuclear long noncoding RNAs: key regulators of gene expression." Trends in Genetics 34.2 (2018): 142-157.

These citations are included in the text with your suggested improvements.

Thanks a lot for all the hard work!


Thank you again for the thorough review.