the theoretical “fold-coverage” of a shotgun sequencing experiment:

<number of reads> * <read length> / <target size>
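As a quick illustration, the formula above can be written as a small function. This is a minimal sketch; the function name, parameter names, and the example numbers are made up for illustration.

```python
def fold_coverage(num_reads: int, read_length: float, target_size: int) -> float:
    """Theoretical fold-coverage: reads * read length / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


# Hypothetical experiment: one million 100 bp reads over a 5 Mb genome.
print(fold_coverage(1_000_000, 100, 5_000_000))  # 20.0
```

Note that this is the theoretical average only; it says nothing about how evenly the reads are spread over the target.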


2.Amplicon

An amplicon is a piece of DNA or RNA that is the source and/or product of natural or artificial amplification or replication events.

It can be formed using various methods including polymerase chain reactions (PCR), ligase chain reactions (LCR), or natural gene duplication.

3.Whole genome mapping

A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a novel technology with unique capabilities in the field of microbiology, with specific applications in the areas of Comparative Genomics, Strain Typing, and Whole Genome Sequence Assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern, where the vertical lines indicate the locations of restriction sites and the distance between the lines represents the restriction fragment size.

4.Radiation hybrid mapping

A theory is developed to predict marker retention and conditional retention or loss in radiation hybrids. Applied to multiple pairwise analysis of a human chromosome 21 data set, this theory fits much better than proposed alternatives and gives a physical map that is consistent with other evidence and robust with respect to typing errors. Radiation hybrids have great promise to provide order and physical location at two levels of resolution, spanning the techniques of linkage and restriction fragments and not limited to polymorphic loci.

5.DNA barcoding

DNA barcoding is a taxonomic method that uses a short genetic marker in an organism’s DNA to identify it as belonging to a particular species.

6.metric space

In mathematics, a metric space is a set for which distances between all members of the set are defined. Those distances, taken together, are called a metric on the set.

7.Pseudometric space

In mathematics, a pseudometric space is a generalized metric space in which the distance between two distinct points can be zero.
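The difference between the two definitions can be checked by brute force on a small finite set. The sketch below is only an illustration: the helper names are invented here, Hamming distance on equal-length strings is used as an example of a metric, and the difference in string length is used as an example of a pseudometric (two distinct strings of the same length are at distance zero).

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; only meaningful for equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def length_gap(a: str, b: str) -> int:
    """Absolute difference in length; zero for distinct strings of equal length."""
    return abs(len(a) - len(b))


def is_pseudometric(points, d) -> bool:
    """Check d(x,x)=0, non-negativity, symmetry and the triangle inequality."""
    for x, y, z in product(points, repeat=3):
        if d(x, x) != 0 or d(x, y) < 0:
            return False
        if d(x, y) != d(y, x):
            return False
        if d(x, z) > d(x, y) + d(y, z):
            return False
    return True


def is_metric(points, d) -> bool:
    """A metric is a pseudometric in which distinct points are never at distance 0."""
    distinct_apart = all(
        d(x, y) > 0 for x, y in product(points, repeat=2) if x != y
    )
    return is_pseudometric(points, d) and distinct_apart


print(is_metric(["ACGT", "ACGA", "TCGA"], hamming))   # True
print(is_pseudometric(["A", "C", "AC"], length_gap))  # True
print(is_metric(["A", "C", "AC"], length_gap))        # False: d("A", "C") == 0
```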


8.Pyrosequencing

Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the “sequencing by synthesis” principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides. Only one of the four nucleotides (A, T, C or G) is added and available at a time, so light is emitted only when the added nucleotide is complementary to the next base of the single-stranded template (the sequence to be determined). The intensity of the light indicates whether more than one of the same “letter” is incorporated in a row. Unincorporated nucleotide is degraded before the next nucleotide is added, so each light signal can be attributed to a single dispensed nucleotide. This process is repeated with each of the four letters until the sequence of the single-stranded template is determined.
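The dispensing cycle described above can be mimicked with a toy simulation. This is a sketch under simplifying assumptions, not a model of a real instrument: the template is given in the order the polymerase reads it, the light signal is taken to be exactly proportional to the number of bases incorporated, and the names `pyrogram`, `decode` and the `ACGT` dispensation order are choices made here for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pyrogram(template: str, dispensation: str = "ACGT", max_cycles: int = 100):
    """Return (nucleotide, signal) pairs, one per dispensed nucleotide.

    The signal is the number of consecutive template bases that the
    dispensed nucleotide pairs with (0 means no light was emitted).
    """
    pos = 0
    signals = []
    for _ in range(max_cycles):
        for nt in dispensation:
            run = 0
            # Incorporate as long as the next template base pairs with nt.
            while pos < len(template) and COMPLEMENT[template[pos]] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
            if pos == len(template):
                return signals
    return signals


def decode(signals) -> str:
    """Rebuild the synthesized strand from the signal intensities."""
    return "".join(nt * run for nt, run in signals)


signals = pyrogram("TTGCA")
print(signals)          # [('A', 2), ('C', 1), ('G', 1), ('T', 1)]
print(decode(signals))  # AACGT
```

The first signal has intensity 2 because the template starts with two identical bases, which is the homopolymer case that the light intensity is meant to resolve.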


9.n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.


Example from DNA sequencing, where the unit is the base pair:

1-gram sequence: …, A, G, C, T, T, C, G, A, …

2-gram sequence: …, AG, GC, CT, TT, TC, CG, GA, …
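The two listings above can be reproduced with a sliding window. A minimal sketch, assuming the sample sequence is the eight bases `AGCTTCGA` shown in the listing; the function name `ngrams` is chosen here for illustration.

```python
def ngrams(items: str, n: int) -> list:
    """Return all contiguous runs of n items, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(len(items) - n + 1)]


sample = "AGCTTCGA"
print(ngrams(sample, 1))  # ['A', 'G', 'C', 'T', 'T', 'C', 'G', 'A']
print(ngrams(sample, 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```

In sequence analysis an n-gram over bases is usually called a k-mer.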


10.sequence space

In evolutionary biology, sequence space is a way of representing all possible sequences (for a protein, gene or genome).
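To get a feel for how large such a space is, the sketch below enumerates every DNA sequence of a given length and counts the size of the space for longer sequences. The function names are invented for this example.

```python
from itertools import product


def sequence_space(alphabet: str, length: int):
    """Lazily yield every possible sequence of the given length."""
    return ("".join(p) for p in product(alphabet, repeat=length))


def space_size(alphabet_size: int, length: int) -> int:
    """Number of points in the space: alphabet_size ** length."""
    return alphabet_size ** length


dinucleotides = list(sequence_space("ACGT", 2))
print(len(dinucleotides))        # 16
print(dinucleotides[:4])         # ['AA', 'AC', 'AG', 'AT']
print(space_size(4, 10))         # 1048576 DNA sequences of length 10
```

Because the size grows exponentially with length, the space of a whole gene or protein can only be sampled, never enumerated.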

11.k-mer distance





12.optical map(ordered restriction map)

Optical mapping is a technique for constructing ordered, genome-wide, high-resolution restriction maps, called “optical maps”, from single, stained molecules of DNA. By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique “fingerprint” or “barcode” for that sequence.

13.Restriction map

A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. In molecular biology, restriction maps are used as a reference to engineer plasmids or other relatively short pieces of DNA, and sometimes for longer genomic DNA.
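When the sequence is already known, a restriction map can be computed directly. The sketch below is an in-silico illustration only: it uses the EcoRI recognition site `GAATTC` (cut after the first G), treats the molecule as linear, and the toy sequence and function names are made up for the example.

```python
def restriction_sites(sequence: str, site: str) -> list:
    """Return the 0-based start positions of every occurrence of the site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_sizes(sequence_length: int, cut_positions: list) -> list:
    """Sizes of the fragments of a linear molecule cut at the given positions."""
    bounds = [0] + sorted(cut_positions) + [sequence_length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"
ECORI_CUT_OFFSET = 1  # EcoRI cuts between G and AATTC

toy_sequence = "AAGAATTCGGGGAATTCTT"  # hypothetical 19 bp sequence
sites = restriction_sites(toy_sequence, ECORI_SITE)
cuts = [p + ECORI_CUT_OFFSET for p in sites]

print(sites)                                      # [2, 11]
print(fragment_sizes(len(toy_sequence), cuts))    # [3, 9, 7]
```

The ordered list of fragment sizes is the same kind of information that an optical map measures experimentally, without knowing the sequence.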

14.Expressed sequence tag

An expressed sequence tag or EST is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs available in public databases (e.g. GenBank, 1 January 2013, all species).

15.Multiple Sequencing Alignment

A Multiple Sequence Alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences’ shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
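How point mutations and indels show up in alignment columns can be seen by scanning a finished alignment column by column. The sketch below does not compute an alignment; it only classifies the columns of one that is already aligned. The three toy sequences and the function name are invented for illustration.

```python
def column_summary(alignment: list) -> list:
    """Label each column as 'conserved', 'substitution' or 'indel'."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*alignment):
        residues = set(column) - {"-"}
        if "-" in column:
            summary.append("indel")         # a hyphen marks a gap
        elif len(residues) == 1:
            summary.append("conserved")     # same character in every row
        else:
            summary.append("substitution")  # differing characters
    return summary


toy_alignment = [
    "ACGT-A",
    "ACCTGA",
    "ACGTGA",
]
for index, label in enumerate(column_summary(toy_alignment)):
    print(index, label)
```

Here column 2 is reported as a substitution and column 4 as an indel; the remaining columns are conserved.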

16.POA(Partial Order Alignment)

Partial order alignment (POA) has been proposed as a new approach to multiple sequence alignment (MSA), which can be combined with existing methods such as progressive alignment. This is important for addressing problems both in the original version of POA (such as order sensitivity) and in standard progressive alignment programs (such as information loss in complex alignments, especially surrounding gap regions).

17.Progressive Alignment

This approach begins with the alignment of the two most closely related sequences (as determined by pairwise analysis) and subsequently adds the next closest sequence or sequence group to this initial pair [37,7]. This process continues in an iterative fashion, adjusting the positioning of indels in all sequences. The major shortcoming of this approach is that a bias may be introduced in the inference of the ordered series of motifs (homologous parts) because of an overrepresentation of a subset of sequences.

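18.Ribosomal small subunit (SSU)

The ribosomal small subunit is the smaller of the two ribosomal subunits. Every ribosome consists of one small subunit and one large subunit.[1] During translation, the small subunit is responsible for recognizing the information being translated. The 70S ribosomes of prokaryotic cells, the 80S cytoplasmic ribosomes of eukaryotic cells, and the mitochondrial ribosomes of eukaryotic cells each have a different small subunit: the 70S ribosome contains the 30S subunit, the 80S ribosome contains the 40S subunit, and the mitochondrial ribosome contains the 28S subunit.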

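| Cell type | Ribosome | Large subunit | Small subunit |
|---|---|---|---|
| Prokaryotic | 70S ribosome | 50S (contains 5S rRNA and 23S rRNA) | 30S (contains 16S rRNA) |
| Eukaryotic | Cytoplasmic ribosome (80S) | 60S (contains 5S rRNA, 5.8S rRNA and 28S rRNA) | 40S (contains 18S rRNA) |
| Eukaryotic | Mitochondrial ribosome | 39S (contains 16S rRNA, MT-RNR2) | 28S (contains 12S rRNA, MT-RNR1) |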

19.rare biosphere

The low-abundance, high-diversity fraction of a microbial community is what is now called the “rare biosphere”.

20.Phred quality score

Phred quality scores were originally developed by the program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces.[1][2] Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
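The text above does not give the formula, but the standard definition relates a quality score Q to the estimated probability P that the base call is wrong by Q = -10 · log10(P). A minimal sketch of the conversion in both directions, with function names chosen here for illustration:

```python
import math


def phred_quality(error_probability: float) -> float:
    """Q = -10 * log10(P), where P is the probability of a wrong base call."""
    if not 0 < error_probability <= 1:
        raise ValueError("error_probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: error probability {p:.4%}, accuracy {1 - p:.4%}")
```

So Q20 corresponds to roughly one wrong call in 100 bases and Q30 to roughly one in 1000.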

21.Base calling

Base calling is the process of assigning bases (nucleobases) to chromatogram peaks. A widely used program for this task is Phred, which has been adopted by both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy.

22.MIAME(Minimum Information About a Microarray Experiment)

MIAME describes the minimum information about a microarray experiment that is needed to interpret the results of the experiment unambiguously and, potentially, to reproduce the experiment. It covers six elements:

1. The raw data for each hybridisation.

2. The final processed data for the set of hybridisations in the experiment (study).

3. The essential sample annotation, including experimental factors and their values.

4. The experiment design, including sample data relationships.

5. Sufficient annotation of the array design.

6. Essential experimental and data processing protocols.