Preprint/K-mer

From Wikiversity

PLOS Topic Pages
PLOS Computational Biology • PLOS Genetics • PLOS ONE

This is a PLOS Topic Page draft

Public peer review comments are posted here
All content on this page is being developed under a CC BY 4.0 license


Authors
Benjamin Lee
AFFILIATION: Department of Computer Science, Harvard University, Cambridge, MA, USA
ORCID: 0000-0002-7133-8397


Abstract

The sequence ATGG has two 3-mers: ATG and TGG.

In biological sequence analysis, k-mers are subsequences of length k contained within a biological sequence. Primarily used within the context of computational genomics, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are used to assemble sequences,[1] improve heterologous gene expression,[2][3] identify species in metagenomic samples,[4] and create attenuated vaccines.[5] Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have $L - k + 1$ k-mers, and there are $n^k$ total possible k-mers, where n is the number of possible monomers (four in the case of DNA).

Introduction

k-mers are simply the subsequences of length k contained in a sequence. For example, all the k-mers of the sequence GTAGAGCTGT, for every value of k, are shown in Table 1:

Table 1: k-mers for GTAGAGCTGT
k k-mers
1 G, T, A, G, A, G, C, T, G, T
2 GT, TA, AG, GA, AG, GC, CT, TG, GT
3 GTA, TAG, AGA, GAG, AGC, GCT, CTG, TGT
4 GTAG, TAGA, AGAG, GAGC, AGCT, GCTG, CTGT
5 GTAGA, TAGAG, AGAGC, GAGCT, AGCTG, GCTGT
6 GTAGAG, TAGAGC, AGAGCT, GAGCTG, AGCTGT
7 GTAGAGC, TAGAGCT, AGAGCTG, GAGCTGT
8 GTAGAGCT, TAGAGCTG, AGAGCTGT
9 GTAGAGCTG, TAGAGCTGT
10 GTAGAGCTGT
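
The rows of Table 1 can be reproduced with a few lines of Python. The following is a minimal sketch; the helper name `all_k_mers` is chosen here for illustration and does not come from any library. It also checks that each row contains L − k + 1 k-mers, counting repeats.

```python
def all_k_mers(sequence, k):
    """Return every length-k substring of sequence, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "GTAGAGCTGT"
table = {k: all_k_mers(sequence, k) for k in range(1, len(sequence) + 1)}

for k, k_mers_for_k in table.items():
    # A sequence of length L contains L - k + 1 k-mers (counting repeats).
    assert len(k_mers_for_k) == len(sequence) - k + 1
    print(k, ", ".join(k_mers_for_k))
```

Repeated k-mers (such as GT and AG for k = 2) are listed once per occurrence, matching the table.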

A method of visualizing k-mers, the k-mer spectrum, plots each multiplicity against the number of k-mers in a sequence that have that multiplicity (see figure 2 for an example).[6] Interestingly, the number of modes in the k-mer spectrum of a species' genome varies, with most species having a unimodal distribution but all tetrapods having a multimodal distribution.[7]

An example 8-mer spectrum for E. coli, comparing the frequencies (i.e. multiplicities) of 8-mers with the number of 8-mers that occur at each frequency.
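
A k-mer spectrum can be computed by counting k-mers twice over: first the occurrences of each k-mer, then the number of k-mers sharing each occurrence count. The sketch below assumes a single in-memory sequence; the function name `k_mer_spectrum` is illustrative only.

```python
from collections import Counter

def k_mer_spectrum(sequence, k):
    """Map each multiplicity to the number of distinct k-mers having it."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(counts.values())

spectrum = k_mer_spectrum("GTAGAGCTGT", 2)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

For the sequence of Table 1 with k = 2, five 2-mers occur once and two (GT and AG) occur twice.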

Forces Affecting k-mer Frequency

The frequency of k-mer usage is affected by numerous forces working at multiple levels. This section describes some of these forces, but it is important to note that k-mers for higher values of k are also affected by the forces acting at lower values of k. For example, if the 1-mer A does not occur in a sequence, none of the 2-mers containing A (AA, AT, AG, AC, TA, GA, and CA) will occur either.

k = 1

When k = 1, there are four DNA k-mers: A, T, G, and C. At the molecular level, there are three hydrogen bonds between G and C, whereas there are only two between A and T. As a result of the extra hydrogen bond (and stronger stacking interactions), GC base pairs are more thermally stable than AT base pairs.[8] Mammals and birds have a higher ratio of Gs and Cs to As and Ts (GC-content), which led to the hypothesis that thermal stability was a driving factor of GC-content variation.[9] However, while promising, this hypothesis did not hold up under scrutiny: analysis among a variety of prokaryotes showed no evidence of GC-content correlating with temperature, as the thermal adaptation hypothesis would predict.[10] Indeed, if natural selection were the driving force behind GC-content variation, single nucleotide changes, which are often silent, would have to alter the fitness of an organism.[11]

Rather, current evidence suggests that GC-biased gene conversion (gBGC) is a driving factor behind variation in GC content.[11] gBGC is a process that occurs during recombination and that tends to replace As and Ts with Gs and Cs.[12] This process, though distinct from natural selection, can nevertheless mimic selective pressure by favoring the fixation of GC alleles in the genome. gBGC can therefore be seen as an "impostor" of natural selection. As would be expected, GC content is greater at sites experiencing greater recombination.[13] Furthermore, organisms with higher rates of recombination exhibit higher GC content, in keeping with the gBGC hypothesis's predicted effects.[14] Interestingly, gBGC does not appear to be limited to eukaryotes.[15] Asexual organisms such as bacteria and archaea also experience recombination by means of gene conversion, a process of homologous sequence replacement resulting in multiple identical sequences throughout the genome.[16] That recombination is able to drive up GC content in all domains of life suggests that gBGC is universally conserved. Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined. The exact mechanism and evolutionary advantage or disadvantage of gBGC are currently unknown.[17]
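
GC-content itself is simple to compute. The sketch below assumes the conventional definition, the fraction of all bases in a sequence that are G or C; any ambiguous bases (such as N) count toward the length but not toward the G and C total. The function name `gc_content` is illustrative.

```python
def gc_content(sequence):
    """Fraction of bases in sequence that are G or C (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)

print(gc_content("GTAGAGCTGT"))
```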

k = 2

Despite the comparatively large body of literature discussing GC-content biases, relatively little has been written about dinucleotide biases. What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably.[18] If dinucleotide bias were subject to pressures resulting from translation, then we would expect to see differing patterns of dinucleotide bias in coding and noncoding regions, driven by some dinucleotides' reduced translational efficiency.[19] Because we do not, we can infer that the forces modulating dinucleotide bias are independent of translation. Further evidence against translational pressures affecting dinucleotide bias is the fact that the dinucleotide biases of viruses, which rely heavily on translational efficiency, are shaped by their viral family more than by their hosts, whose translational machinery the viruses hijack.[20]

Counteracting gBGC's tendency to increase GC-content is CG suppression, in which deamination of methylated CG dinucleotides substitutes TGs for CGs, reducing both the frequency of CG 2-mers and the overall GC-content.[21] This interaction highlights the interrelationship between the forces affecting k-mers for varying values of k.

One interesting fact about dinucleotide bias is that it can serve as a "distance" measurement between phylogenetically similar genomes. The genomes of closely related pairs of organisms have more similar dinucleotide biases than those of more distantly related pairs.[18]
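
This article does not give a formula for dinucleotide bias or for the resulting distance. One common formulation, assumed here, expresses the bias of each dinucleotide XY as the odds ratio f(XY) / (f(X) f(Y)) and takes the distance between two sequences to be the mean absolute difference of their sixteen odds ratios. The sketch below uses single-strand frequencies only, without the reverse-complement symmetrization that published analyses typically apply, and all function names are illustrative.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

def dinucleotide_odds_ratios(sequence):
    """Return rho[XY] = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides."""
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = Counter(sequence)
    di = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    n_mono, n_di = len(sequence), len(sequence) - 1
    rho = {}
    for xy in DINUCLEOTIDES:
        expected = (mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono)
        # Assumption: a dinucleotide whose bases are absent gets ratio 0.0.
        rho[xy] = (di[xy] / n_di) / expected if expected else 0.0
    return rho

def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two sequences' odds ratios."""
    rho_a = dinucleotide_odds_ratios(seq_a)
    rho_b = dinucleotide_odds_ratios(seq_b)
    return sum(abs(rho_a[xy] - rho_b[xy]) for xy in DINUCLEOTIDES) / 16
```

An odds ratio above 1 indicates a dinucleotide that is over-represented relative to what its component bases would predict; a ratio below 1 indicates under-representation.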

k = 3

There are twenty natural amino acids that are used to build the proteins that DNA encodes. However, there are only four nucleotides. Therefore, there cannot be a one-to-one correspondence between nucleotides and amino acids. Similarly, there are $4^2$, or 16, 2-mers, which is also not enough to unambiguously represent every amino acid. However, there are $4^3$, or 64, distinct 3-mers in DNA, which is enough to uniquely represent each amino acid. In coding sequences, the non-overlapping 3-mers read in frame are called codons. While each codon maps to only one amino acid, each amino acid can be represented by multiple codons. Thus, the same amino acid sequence can have multiple DNA representations. Interestingly, the codons for an amino acid are not used in equal proportions.[22] This is called codon-usage bias (CUB). When k = 3, a distinction must be made between true 3-mer frequency and CUB. For example, the sequence ATGGCA contains four 3-mers (ATG, TGG, GGC, and GCA) but only two codons (ATG and GCA). However, CUB is a major driving factor of 3-mer usage bias (accounting for up to ⅓ of it, since ⅓ of the k-mers in a coding region are codons) and will be the main focus of this section.
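
The distinction between overlapping 3-mers and in-frame codons can be made concrete in code. The sketch below assumes the coding sequence starts in frame at its first base; the function names are illustrative.

```python
from collections import Counter

def overlapping_3_mers(sequence):
    """All 3-mers, sliding one base at a time."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def codons(coding_sequence):
    """Non-overlapping 3-mers read in frame from the first base."""
    usable = len(coding_sequence) - len(coding_sequence) % 3
    return [coding_sequence[i:i + 3] for i in range(0, usable, 3)]

print(overlapping_3_mers("ATGGCA"))  # ['ATG', 'TGG', 'GGC', 'GCA']
print(codons("ATGGCA"))              # ['ATG', 'GCA']

# Codon usage tallies in-frame codons only. GCA and GCC both encode
# alanine, so an uneven split between them is an instance of codon-usage bias.
codon_usage = Counter(codons("ATGGCAGCAGCC"))
print(codon_usage)
```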

The exact cause of variation between the frequencies of various codons is not fully understood. It is known that codon preference is correlated with tRNA abundances, with codons matching more abundant tRNAs being correspondingly more frequent[22] and that more highly expressed proteins exhibit greater CUB.[23] This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation. 

k = 4

Similar to the effect seen in dinucleotide bias, the tetranucleotide biases of phylogenetically similar organisms are more similar than those of less closely related organisms.[4] The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level.[24]

Applications

Sequence Assembly and Alignment

This figure shows the process of splitting reads into smaller k-mers (4-mers in this case) so that they can be used in a de Bruijn graph. (A) The initial segment of DNA being sequenced. (B) The reads output from sequencing and how they align. The problem with this alignment is that the reads overlap by k-2, not by k-1 as a de Bruijn graph requires. (C) The reads split into smaller 4-mers. (D) The 4-mers after repeated 4-mers are discarded, and their alignment. Note that these k-mers overlap by k-1 and can therefore be used in a de Bruijn graph.

In DNA sequence assembly from short-read sequencers, k-mers are used to construct de Bruijn graphs that can be used to reconstruct the original sequence.[25] Specifically, each read (which can be of varying length in next-generation sequencers) is broken into its constituent k-mers (for some value of k). These k-mers are then used to create a de Bruijn graph modeling the original, complete sequence (see figure for details). Notably, this approach is very susceptible to read errors, highlighting the importance of error correction.[25] Introduced by the Euler assembler,[26] de Bruijn graph-based methods have become popular, with Velvet,[27] SOAPdenovo,[28] and ALLPATHS[29] also using variations of this procedure.
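
The graph construction can be sketched in a few lines, under strong simplifying assumptions: the reads are error-free, come from one strand, and produce a graph with no branches or repeats. In this sketch each distinct k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The reads and function names below are hypothetical, and this is not the algorithm of any assembler named above, all of which must handle errors, branching, and Eulerian path finding.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            k_mer = read[i:i + k]
            # Each distinct k-mer contributes one edge: prefix -> suffix.
            graph[k_mer[:-1]].add(k_mer[1:])
    return graph

def walk_unbranched(graph, start):
    """Spell the sequence along a branch-free path beginning at start."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop rather than loop forever on a cycle
            break
        seen.add(node)
        path += node[-1]
    return path

reads = ["GTAGAG", "AGAGCT", "GCTGT"]  # hypothetical error-free reads
graph = de_bruijn_graph(reads, 4)
print(walk_unbranched(graph, "GTA"))
```

Because the k-mer AGAG appears in two reads but is stored once, duplicate k-mers are discarded automatically, as in panel (D) of the figure.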

Additionally, k-mers can be used to detect genome mis-assembly[30] and bacterial contamination of eukaryotic genome assemblies.[31][32] Similarly, k-mer spectra have been applied to correcting errors in next-generation sequence data.[33]

Genetics and Genomics

With respect to disease, dinucleotide bias has been applied to the detection of genomic islands associated with pathogenicity.[18] Prior work has also shown that tetranucleotide biases can be used to detect horizontal gene transfer effectively in both prokaryotes[34] and eukaryotes.[35]

Another application of k-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of Erwinia with moderate success.[36] Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA. Because GC base pairs are more thermally stable, sequences with higher GC content exhibit a higher Tm. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as a factor in determining species boundaries as part of the phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community.[37]

Other applications within genetics and genomics include the detection of recombination sites in genomes,[38] estimation of genome size,[39][40] identification of CpG islands,[41][42] and the DNA barcoding of species.[43][44]

Metagenomics

k-mer frequency variation is heavily used in metagenomic binning algorithms. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism, which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (k = 4) frequencies.[45]  Other tools that similarly rely on k-mer frequency for metagenomic binning are CompostBin (k = 6),[46] PCAHIER,[47] PhyloPythia (5 ≤ k ≤ 6),[48] CLARK (k ≥ 20),[49] and TACOA (2 ≤ k ≤ 6).[50] Recent developments have even applied deep learning to metagenomic binning using k-mers.[51]
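
The idea underlying composition-based binning can be illustrated with tetranucleotide profiles: each fragment is summarized as a vector of 256 4-mer frequencies, and fragments with highly correlated vectors are candidates for the same bin. The sketch below is a simplified illustration only and is not the algorithm of TETRA or of any other tool listed above, which apply further normalization and statistical modeling. The function names are illustrative, and `pearson` assumes that neither vector is constant.

```python
from collections import Counter
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 4**4 = 256

def tetranucleotide_profile(sequence):
    """Return the 256 relative 4-mer frequencies of sequence."""
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    total = sum(counts[t] for t in TETRAMERS)  # ignores 4-mers containing N
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

profile_a = tetranucleotide_profile("GTAGAGCTGTGTAGAGCTGT")
profile_b = tetranucleotide_profile("GTAGAGCTGTAGAGCTGTGT")
print(pearson(profile_a, profile_b))
```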

k-mers can also be used to estimate species abundance in metagenomic samples,[52] determine which species are present in samples,[53][54] and identify biomarkers for diseases from samples.[55]

Biotechnology

Modifying k-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.

With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis.[56] In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates.[3][57] Similarly, codon pair optimization, a combination of dinucleotide and codon optimization, has also been successfully used to increase expression.[58]

The most studied application of k-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode dengue virus, the virus that causes dengue fever, such that its codon-pair bias differed more from mammalian codon-usage preference than that of the wild type.[59] Though encoding an identical amino-acid sequence, the recoded virus demonstrated significantly weakened pathogenicity while eliciting a strong immune response. This approach has also been used effectively to create an influenza vaccine[60] as well as a vaccine for Marek's disease herpesvirus (MDV).[5] Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.

Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias.[61][62] By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.
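
The link between codon pairs and dinucleotides can be seen at codon boundaries: the last base of one codon and the first base of the next form a dinucleotide, so choosing a different synonymous codon changes the dinucleotide content without changing the protein. The sketch below counts these junction dinucleotides. It is an illustration only, not the analysis performed in the cited articles, and the function name is hypothetical.

```python
from collections import Counter

def junction_dinucleotides(coding_sequence):
    """Count dinucleotides spanning codon boundaries (codon position 3 to 1)."""
    counts = Counter()
    usable = len(coding_sequence) - len(coding_sequence) % 3
    # Boundaries sit after bases 3, 6, 9, ... of an in-frame coding sequence.
    for i in range(3, usable, 3):
        counts[coding_sequence[i - 1:i + 1]] += 1
    return counts

# GCC and GCA both encode alanine; GAA encodes glutamate.
print(junction_dinucleotides("GCCGAA"))  # junction is CG
print(junction_dinucleotides("GCAGAA"))  # junction is AG
```

Both sequences encode the same two amino acids, yet only the first introduces a CG dinucleotide at the codon boundary.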

GC-content, due to its effect on DNA melting point, is used to predict annealing temperature in PCR, another important biotechnology tool.
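
This article does not specify how the prediction is made. One widely used rule of thumb for short primers (roughly 14 to 20 bases), often called the Wallace rule, estimates the melting temperature as 2 °C for each A or T plus 4 °C for each G or C. The sketch below implements that rule as an assumption for illustration; more accurate nearest-neighbor models exist, and the function name is hypothetical.

```python
def wallace_tm(primer):
    """Estimate melting temperature in degrees Celsius via the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    # Each G or C contributes twice as much as each A or T.
    return 2 * at + 4 * gc

print(wallace_tm("GTAGAGCTGTGTAGAGCT"))
```

The annealing temperature is then usually chosen somewhat below the estimated melting temperature.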

Implementation

Finding all the k-mers in a sequence is a simple task, although, as will be seen, doing so efficiently at scale is a nontrivial challenge. Below are sample implementations in common bioinformatics languages.

Pseudocode

procedure k_mers(sequence, k)

     /* iterate over every 1-based start position that leaves room for a full k-mer */
     for i=1 to length(sequence)-k+1 inclusive do

          /* output the k-mer spanning positions i through i+k-1, inclusive */
          output sequence[i:i+k-1]

     end for

end procedure

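Python

In Python, an efficient generator function can be adapted from a recipe in the itertools documentation:[63]

from itertools import islice

def k_mers(sequence, k):
    """Lazily yield each k-mer of sequence, in order of occurrence."""
    it = iter(sequence)
    # Fill the first window with the first k characters.
    result = tuple(islice(it, k))
    # Yield nothing when the sequence is shorter than k.
    if len(result) == k:
        yield "".join(result)
    # Slide the window forward one character at a time.
    for elem in it:
        result = result[1:] + (elem,)
        yield "".join(result)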

Perl

sub k_mers {
	my ($sequence, $k) = @_;
	my $len = length($sequence);
	my @result = ();
	for (my $i = 0; $i <= $len-$k; $i++) {
		push(@result, substr($sequence, $i, $k));
	}
	return @result;
}

R

k_mers <- function(sequence, k) {
  substring(sequence, 1:(nchar(sequence)-k+1), k:nchar(sequence)) 
}

In Bioinformatics Pipelines

Because the number of possible k-mers grows exponentially with k, counting k-mers for large values of k (>10) is a computationally difficult task. While the simple implementations above work for small values of k, they need to be adapted for high-throughput applications or when k is large. To solve this problem, various approaches have been developed:

  • Jellyfish uses a multithreaded, lock-free hash table for k-mer counting and has Python, Ruby, and Perl bindings.[64]
  • KMC is a tool for k-mer counting that uses a multidisk architecture for optimized speed.[65]
  • Gerbil uses a hash table approach but with added support for GPU acceleration.[66]
  • K-mer Analysis Toolkit (KAT) uses a modified version of Jellyfish to analyze k-mer counts.[6]
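
One space-saving idea that helps k-mer counting scale is to store each DNA k-mer as an integer using two bits per base, updating the integer incrementally as the window slides rather than slicing a new string at every position. The sketch below illustrates this encoding only; it is not a description of how Jellyfish, KMC, Gerbil, or KAT are implemented, and the names `ENCODING`, `count_encoded_k_mers`, and `decode` are hypothetical.

```python
from collections import Counter

ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_encoded_k_mers(sequence, k):
    """Count k-mers, storing each as an integer with 2 bits per base."""
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2k bits
    counts = Counter()
    code, valid = 0, 0
    for base in sequence:
        if base not in ENCODING:
            # Restart the window after an ambiguous base such as N.
            code, valid = 0, 0
            continue
        # Shift in the new base and drop the base that left the window.
        code = ((code << 2) | ENCODING[base]) & mask
        valid += 1
        if valid >= k:
            counts[code] += 1
    return counts

def decode(code, k):
    """Convert a 2-bit-encoded integer back into its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))

counts = count_encoded_k_mers("GTAGAGCTGT", 2)
for code in sorted(counts):
    print(decode(code, 2), counts[code])
```

With this encoding, any k-mer with k ≤ 32 fits in a single 64-bit integer, which is considerably more compact than storing the k-mer as text.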

Acknowledgements

The author would like to thank James Mallet and Nathaniel Edelman for their early feedback on this article as well as Paul Gamble, Maria Barrios, Karl Ni, Cory Stephenson, and Todd Stavish for their feedback on the completed manuscript.


References

  1. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology. Springer Nature; 2011;29: 987–991. doi:10.1038/nbt.2023
  2. Welch M, Govindarajan S, Ness JE, Villalobos A, Gurney A, Minshull J, et al. Design Parameters to Control Synthetic Gene Expression in Escherichia coli. Kudla G, editor. PLoS ONE. Public Library of Science (PLoS); 2009;4: e7002. doi:10.1371/journal.pone.0007002
  3. Gustafsson C, Govindarajan S, Minshull J. Codon bias and heterologous protein expression. Trends in Biotechnology. Elsevier BV; 2004;22: 346–353. doi:10.1016/j.tibtech.2004.04.006
  4. Perry SC, Beiko RG. Distinguishing Microbial Genome Fragments Based on Their Composition: Evolutionary and Comparative Genomic Perspectives. Genome Biology and Evolution. Oxford University Press (OUP); 2010;2: 117–131. doi:10.1093/gbe/evq004
  5. Eschke K, Trimpert J, Osterrieder N, Kunec D. Attenuation of a very virulent Marek’s disease herpesvirus (MDV) by codon pair bias deoptimization. Mocarski E, editor. PLOS Pathogens. Public Library of Science (PLoS); 2018;14: e1006857. doi:10.1371/journal.ppat.1006857
  6. Mapleson D, Garcia Accinelli G, Kettleborough G, Wright J, Clavijo BJ. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. Oxford University Press (OUP); 2016; btw663. doi:10.1093/bioinformatics/btw663
  7. Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biology. Springer Nature; 2009;10: R108. doi:10.1186/gb-2009-10-10-r108
  8. Yakovchuk P. Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Research. Oxford University Press (OUP); 2006;34: 564–574. doi:10.1093/nar/gkj454
  9. Bernardi G. Isochores and the evolutionary genomics of vertebrates. Gene. Elsevier BV; 2000;241: 3–17. doi:10.1016/s0378-1119(99)00485-0
  10. Hurst LD, Merchant AR. High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proceedings of the Royal Society B: Biological Sciences. The Royal Society; 2001;268: 493–497. doi:10.1098/rspb.2000.1397
  11. Mugal CF, Weber CC, Ellegren H. GC-biased gene conversion links the recombination landscape and demography to genomic base composition. BioEssays. Wiley; 2015;37: 1317–1326. doi:10.1002/bies.201500058
  12. Romiguier J, Roux C. Analytical Biases Associated with GC-Content in Molecular Evolution. Frontiers in Genetics. Frontiers Media SA; 2017;8. doi:10.3389/fgene.2017.00016
  13. Spencer CCA. Human polymorphism around recombination hotspots: Figure 1. Biochemical Society Transactions. Portland Press Ltd.; 2006;34: 535–536. doi:10.1042/bst0340535
  14. Weber CC, Boussau B, Romiguier J, Jarvis ED, Ellegren H. Evidence for GC-biased gene conversion as a driver of between-lineage differences in avian base composition. Genome Biology. Springer Nature; 2014;15. doi:10.1186/s13059-014-0549-1
  15. Lassalle F, Périan S, Bataillon T, Nesme X, Duret L, Daubin V. GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands. Petrov DA, editor. PLOS Genetics. Public Library of Science (PLoS); 2015;11: e1004941. doi:10.1371/journal.pgen.1004941
  16. Santoyo G, Romero D. Gene conversion and concerted evolution in bacterial genomes. FEMS Microbiology Reviews. Wiley; 2005;29: 169–183. doi:10.1016/j.femsre.2004.10.004
  17. Bhérer C, Auton A. Biased Gene Conversion and Its Impact on Genome Evolution [Internet]. eLS. John Wiley & Sons, Ltd; 2014. doi:10.1002/9780470015902.a0020834.pub2
  18. Karlin S. Global dinucleotide signatures and analysis of genomic heterogeneity. Current Opinion in Microbiology. Elsevier BV; 1998;1: 598–610. doi:10.1016/s1369-5274(98)80095-7
  19. Beutler E, Gelbart T, Han JH, Koziol JA, Beutler B. Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proceedings of the National Academy of Sciences. Proceedings of the National Academy of Sciences; 1989;86: 192–196. doi:10.1073/pnas.86.1.192
  20. Di Giallonardo F, Schlub TE, Shi M, Holmes EC. Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species. Dermody TS, editor. Journal of Virology. American Society for Microbiology; 2017;91: e02381-16. doi:10.1128/jvi.02381-16
  21. Zemojtel T, Kielbasa SM, Arndt PF, Behrens S, Bourque G, Vingron M. CpG Deamination Creates Transcription Factor-Binding Sites with High Efficiency. Genome Biology and Evolution. Oxford University Press (OUP); 2011;3: 1304–1311. doi:10.1093/gbe/evr107
  22. Hershberg R, Petrov DA. Selection on Codon Bias. Annual Review of Genetics. Annual Reviews; 2008;42: 287–299. doi:10.1146/annurev.genet.42.110807.091442
  23. Sharp PM, Li W-H. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research. Oxford University Press (OUP); 1987;15: 1281–1295. doi:10.1093/nar/15.3.1281
  24. Noble PA, Citek RW, Ogunseitan OA. Tetranucleotide frequencies in microbial genomes. Electrophoresis. Wiley; 1998;19: 528–535. doi:10.1002/elps.1150190412
  25. Nagarajan N, Pop M. Sequence assembly demystified. Nature Reviews Genetics. Springer Nature; 2013;14: 157–167. doi:10.1038/nrg3367
  26. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences. Proceedings of the National Academy of Sciences; 2001;98: 9748–9753. doi:10.1073/pnas.171285098
  27. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. Cold Spring Harbor Laboratory; 2008;18: 821–829. doi:10.1101/gr.074492.107
  28. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. Cold Spring Harbor Laboratory; 2009;20: 265–272. doi:10.1101/gr.097261.109
  29. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research. Cold Spring Harbor Laboratory; 2008;18: 810–820. doi:10.1101/gr.7337908
  30. Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology. Springer Nature; 2008;9: R55. doi:10.1186/gb-2008-9-3-r55
  31. Delmont TO, Eren AM. Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies. PeerJ. PeerJ; 2016;4: e1839. doi:10.7717/peerj.1839
  32. Bemm F, Weiß CL, Schultz J, Förster F. Genome of a tardigrade: Horizontal gene transfer or bacterial contamination? Proceedings of the National Academy of Sciences. Proceedings of the National Academy of Sciences; 2016;113: E3054–E3056. doi:10.1073/pnas.1525116113
  33. Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics. Oxford University Press (OUP); 2012;29: 308–315. doi:10.1093/bioinformatics/bts690
  34. Goodur HD, Ramtohul V, Baichoo S. GIDT - A tool for the identification and visualization of genomic islands in prokaryotic organisms. 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE). IEEE; 2012. doi:10.1109/bibe.2012.6399707
  35. Jaron KS, Moravec JC, Martinkova N. SigHunt: horizontal gene transfer finder optimized for eukaryotic genomes. Bioinformatics. Oxford University Press (OUP); 2013;30: 1081–1086. doi:10.1093/bioinformatics/btt727
  36. Starr MP, Mandel M. DNA Base Composition and Taxonomy of Phytopathogenic and Other Enterobacteria. Journal of General Microbiology. Microbiology Society; 1969;56: 113–123. doi:10.1099/00221287-56-1-113
  37. Wayne LG, Moore WEC, Stackebrandt E, Kandler O, Colwell RR, Krichevsky MI, et al. Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics. International Journal of Systematic and Evolutionary Microbiology. Microbiology Society; 1987;37: 463–464. doi:10.1099/00207713-37-4-463
  38. Wang R, Xu Y, Liu B. Recombination spot identification Based on gapped k-mers. Scientific Reports. Springer Nature; 2016;6. doi:10.1038/srep23934
  39. Lamichhaney S, Fan G, Widemo F, Gunnarsson U, Thalmann DS, Hoeppner MP, et al. Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax). Nature Genetics. Springer Nature; 2015;48: 84–88. doi:10.1038/ng.3430
  40. Hozza M, Vinař T, Brejová B. How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra. String Processing and Information Retrieval. Springer International Publishing; 2015. pp. 199–209. doi:10.1007/978-3-319-23826-5_20
  41. Chae H, Park J, Lee S-W, Nephew KP, Kim S. Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes. Nucleic Acids Research. Oxford University Press (OUP); 2013;41: 4783–4791. doi:10.1093/nar/gkt144
  42. Mohamed Hashim EK, Abdullah R. Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter. Journal of Theoretical Biology. Elsevier BV; 2015;387: 88–100. doi:10.1016/j.jtbi.2015.09.014
  43. Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biology. Springer Nature; 2009;10: R108. doi:10.1186/gb-2009-10-10-r108
  44. Meher PK, Sahu TK, Rao AR. Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier. Gene. Elsevier BV; 2016;592: 316–324. doi:10.1016/j.gene.2016.07.010
  45. Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner F. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. Springer Nature; 2004;5: 163. doi:10.1186/1471-2105-5-163
  46. Chatterji S, Yamazaki I, Bai Z, Eisen JA. CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads. Lecture Notes in Computer Science. Springer Berlin Heidelberg; pp. 17–28. doi:10.1007/978-3-540-78839-3_3
  47. Zheng H, Wu H. Short Prokaryotic DNA Fragment Binning Using a Hierarchical Classifier Based On Linear Discriminant Analysis and Principal Component Analysis. Journal of Bioinformatics and Computational Biology. World Scientific Pub Co Pte Lt; 2010;8: 995–1011. doi:10.1142/s0219720010005051
  48. McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nature Methods. Springer Nature; 2006;4: 63–72. doi:10.1038/nmeth976
  49. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. Springer Nature; 2015;16. doi:10.1186/s12864-015-1419-2
  50. Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics. Springer Nature; 2009;10: 56. doi:10.1186/1471-2105-10-56
  51. Fiannaca A, La Paglia L, La Rosa M, Lo Bosco G, Renda G, Rizzo R, et al. Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinformatics. Springer Nature; 2018;19. doi:10.1186/s12859-018-2182-6
  52. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science. PeerJ; 2017;3: e104. doi:10.7717/peerj-cs.104
  53. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. Springer Nature; 2014;15: R46. doi:10.1186/gb-2014-15-3-r46
  54. Rosen G, Garbarine E, Caseiro D, Polikar R, Sokhansanj B. Metagenome Fragment Classification Using N-Mer Frequency Profiles. Advances in Bioinformatics. Hindawi Limited; 2008;2008: 1–12. doi:10.1155/2008/205969
  55. Wang Y, Fu L, Ren J, Yu Z, Chen T, Sun F. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures. Frontiers in Microbiology. Frontiers Media SA; 2018;9. doi:10.3389/fmicb.2018.00872
  56. Al-Saif M, Khabar KS. UU/UA Dinucleotide Frequency Reduction in Coding Regions Results in Increased mRNA Stability and Protein Expression. Molecular Therapy. Elsevier BV; 2012;20: 954–959. doi:10.1038/mt.2012.29
  57. Welch M, Govindarajan S, Ness JE, Villalobos A, Gurney A, Minshull J, et al. Design Parameters to Control Synthetic Gene Expression in Escherichia coli. Kudla G, editor. PLoS ONE. Public Library of Science (PLoS); 2009;4: e7002. doi:10.1371/journal.pone.0007002
  58. Trinh R, Gurbaxani B, Morrison SL, Seyfzadeh M. Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression. Molecular Immunology. Elsevier BV; 2004;40: 717–722. doi:10.1016/j.molimm.2003.08.006
  59. Shen SH, Stauft CB, Gorbatsevych O, Song Y, Ward CB, Yurovsky A, et al. Large-scale recoding of an arbovirus genome to rebalance its insect versus mammalian preference. Proceedings of the National Academy of Sciences. Proceedings of the National Academy of Sciences; 2015;112: 4749–4754. doi:10.1073/pnas.1502864112
  60. Kaplan BS, Souza CK, Gauger PC, Stauft CB, Robert Coleman J, Mueller S, et al. Vaccination of pigs with a codon-pair bias de-optimized live attenuated influenza vaccine protects from homologous challenge. Vaccine. Elsevier BV; 2018;36: 1101–1107. doi:10.1016/j.vaccine.2018.01.027
  61. Kunec D, Osterrieder N. Codon Pair Bias Is a Direct Consequence of Dinucleotide Bias. Cell Reports. Elsevier BV; 2016;14: 55–67. doi:10.1016/j.celrep.2015.12.011
  62. Tulloch F, Atkinson NJ, Evans DJ, Ryan MD, Simmonds P. RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies. eLife. eLife Sciences Organisation, Ltd.; 2014;3. doi:10.7554/elife.04531
  63. "5.14.2 Examples". Python Library Reference. Python Software Foundation. Retrieved 2018-05-29.
  64. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. Oxford University Press (OUP); 2011;27: 764–770. doi:10.1093/bioinformatics/btr011
  65. Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics. Oxford University Press (OUP); 2015;31: 1569–1576. doi:10.1093/bioinformatics/btv022
  66. Erbert M, Rechner S, Müller-Hannemann M. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology. Springer Nature; 2017;12. doi:10.1186/s13015-017-0097-9