the theoretical “fold-coverage” of a shotgun sequencing experiment:

<number of reads> * <read length> / <target size>
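
As a minimal sketch, the formula can be evaluated directly. The function name and the run parameters below are illustrative assumptions, not values from a real experiment.

```python
def fold_coverage(num_reads: int, read_length: int, target_size: int) -> float:
    """Theoretical fold coverage: reads * read length / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


# Hypothetical run: 1,000,000 reads of 150 bp against a 5 Mb target.
coverage = fold_coverage(1_000_000, 150, 5_000_000)
print(f"{coverage:.1f}x")  # 30.0x
```

All three quantities must use the same unit (base pairs here), otherwise the ratio is meaningless.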


2.Amplicon

An amplicon is a piece of DNA or RNA that is the source and/or product of natural or artificial amplification or replication events.

It can be formed using various methods including polymerase chain reactions (PCR), ligase chain reactions (LCR), or natural gene duplication.

3.Whole genome mapping

A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a novel technology with unique capabilities in the field of microbiology, with specific applications in the areas of Comparative Genomics, Strain Typing, and Whole Genome Sequence Assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern, where the vertical lines indicate the locations of restriction sites and the distance between two lines represents the restriction fragment size.

4.Radiation hybrid mapping

A theory is developed to predict marker retention and conditional retention or loss in radiation hybrids. Applied to multiple pairwise analysis of a human chromosome 21 data set, this theory fits much better than proposed alternatives and gives a physical map that is consistent with other evidence and robust with respect to typing errors. Radiation hybrids have great promise to provide order and physical location at two levels of resolution, spanning the techniques of linkage and restriction fragments and not limited to polymorphic loci.

5.DNA barcoding

DNA barcoding is a taxonomic method that uses a short genetic marker in an organism’s DNA to identify it as belonging to a particular species.

6.Metric space

In mathematics, a metric space is a set for which distances between all members of the set are defined. Those distances, taken together, are called a metric on the set.

7.Pseudometric space

In mathematics, a pseudometric space is a generalized metric space in which the distance between two distinct points can be zero.
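
The difference between the two definitions can be checked by brute force on a small set. The sketch below is only an illustration: Hamming distance on equal-length DNA strings is used as an example of a metric, and the difference in G/C count as an example of a pseudometric. Both function names and the choice of examples are assumptions made for this note.

```python
from itertools import product


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def gc_count(s: str) -> int:
    """Number of G or C bases in the sequence."""
    return sum(base in "GC" for base in s)


def gc_difference(x: str, y: str) -> int:
    """Absolute difference in G/C counts; zero for many distinct pairs."""
    return abs(gc_count(x) - gc_count(y))


def check_axioms(d, points, pseudo: bool = False) -> bool:
    """Brute-force check of the (pseudo)metric axioms on a finite set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or d(x, y) != d(y, x):
            return False  # non-negativity or symmetry fails
        if x == y and d(x, y) != 0:
            return False  # every point must be at distance 0 from itself
        if x != y and d(x, y) == 0 and not pseudo:
            return False  # only a pseudometric may merge distinct points
    # Triangle inequality over every triple of points.
    return all(d(x, z) <= d(x, y) + d(y, z)
               for x, y, z in product(points, repeat=3))


points = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 3-mers
print(check_axioms(hamming, points))                      # True
print(check_axioms(gc_difference, points))                # False
print(check_axioms(gc_difference, points, pseudo=True))   # True
```

Here `gc_difference("AAA", "AAT")` is 0 although the two strings differ, which is exactly what a pseudometric allows and a metric forbids.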


8.Pyrosequencing

Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the “sequencing by synthesis” principle. It differs from Sanger sequencing in that it relies on detecting pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides. Only one of the four possible nucleotides (A, T, C or G) is added and available at a time, so light is emitted only when that nucleotide is complementary to the next position on the single-stranded template (the sequence to be determined). The intensity of the light indicates whether more than one of the same “letter” occurs in a row. Any nucleotide remaining from the previous step is degraded before the next one is added, so each addition reveals the next nucleotide(s) through the resulting light intensity. This cycle is repeated with each of the four nucleotides until the sequence of the single-stranded template is determined.


9.n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.


Field: DNA sequencing; unit: base pair

1-gram sequence: …, A, G, C, T, T, C, G, A, …

2-gram sequence: …, AG, GC, CT, TT, TC, CG, GA, …
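
The two rows above can be reproduced with a sliding window. This is a minimal sketch; `ngrams` is an illustrative helper name, and the input string is simply the eight bases shown in the example.

```python
def ngrams(items: str, n: int) -> list[str]:
    """Return every contiguous window of length n, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(len(items) - n + 1)]


seq = "AGCTTCGA"
print(ngrams(seq, 1))  # ['A', 'G', 'C', 'T', 'T', 'C', 'G', 'A']
print(ngrams(seq, 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
```

A sequence of length L yields L - n + 1 n-grams, so the bigram row is one item shorter than the unigram row.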


10.Sequence space

In evolutionary biology, sequence space is a way of representing all possible sequences (for a protein, gene or genome).
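
To make "all possible sequences" concrete, the sketch below enumerates the sequence space for a short DNA length and computes its size. The function names are illustrative assumptions. The size grows as alphabet size to the power of the length, so enumeration is only feasible for very short sequences.

```python
from itertools import product


def sequence_space(alphabet: str, length: int):
    """Yield every sequence of the given length over the alphabet."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Number of points in the sequence space."""
    return alphabet_size ** length


trimers = list(sequence_space("ACGT", 3))
print(len(trimers), sequence_space_size(4, 3))  # 64 64
print(sequence_space_size(20, 100))  # size of the space of 100-residue proteins
```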

11.k-mer distance





12.Optical map (ordered restriction map)

Optical mapping is a technique for constructing ordered, genome-wide, high-resolution restriction maps from single, stained molecules of DNA, called “optical maps”. By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique “fingerprint” or “barcode” for that sequence.

13.Restriction map

A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. In molecular biology, restriction maps are used as a reference to engineer plasmids or other relatively short pieces of DNA, and sometimes for longer genomic DNA.
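
For a sequence that is already known, a restriction map can be computed in silico. The sketch below is a simplified illustration: it assumes a linear molecule, scans only the given strand, and uses EcoRI (recognition site GAATTC, cutting after the first G) on a made-up toy sequence. The function names are hypothetical.

```python
def restriction_sites(sequence: str, site: str) -> list[int]:
    """0-based start positions of every occurrence of the recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_sizes(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Fragment lengths after cutting a linear molecule at every site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in restriction_sites(sequence, site)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


toy = "AAGAATTCTTTTGAATTCCC"  # made-up 20 bp sequence with two EcoRI sites
print(restriction_sites(toy, "GAATTC"))  # [2, 12]
print(fragment_sizes(toy, "GAATTC", 1))  # [3, 10, 7]
```

The ordered list of fragment sizes is the same kind of information that the vertical lines of an optical or whole genome map encode.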

14.Expressed sequence tag

An expressed sequence tag or EST is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs available in public databases (e.g. GenBank, 1 January 2013, all species).

15.Multiple Sequence Alignment

A Multiple Sequence Alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences’ shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

16.POA(Partial Order Alignment)

Partial order alignment (POA) has been proposed as a new approach to multiple sequence alignment (MSA), which can be combined with existing methods such as progressive alignment. This is important for addressing problems both in the original version of POA (such as order sensitivity) and in standard progressive alignment programs (such as information loss in complex alignments, especially surrounding gap regions).

17.Progressive Alignment

This approach begins with the alignment of the two most closely related sequences (as determined by pairwise analysis) and subsequently adds the next closest sequence or sequence group to this initial pair [37,7]. This process continues in an iterative fashion, adjusting the positioning of indels in all sequences. The major shortcoming of this approach is that a bias may be introduced in the inference of the ordered series of motifs (homologous parts) because of an overrepresentation of a subset of sequences.

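18.Ribosomal small subunit (SSU)

The ribosomal small subunit is the smaller of the two subunits of a ribosome. Every ribosome consists of one small subunit and one large subunit.[1] During translation, the small subunit is responsible for recognizing the genetic message. The 70S ribosome of prokaryotic cells, the 80S ribosome in the cytoplasm of eukaryotic cells, and the mitochondrial ribosome of eukaryotic cells each have a different small subunit: the 70S ribosome contains the 30S subunit, the 80S ribosome contains the 40S subunit, and the mitochondrial ribosome contains the 28S subunit.

Prokaryotic cells (70S ribosome)
  Large subunit: 50S (contains 5S rRNA and 23S rRNA)
  Small subunit: 30S (contains 16S rRNA)
Eukaryotic cells, cytoplasmic ribosome (80S ribosome)
  Large subunit: 60S (contains 5S rRNA, 5.8S rRNA and 28S rRNA)
  Small subunit: 40S (contains 18S rRNA)
Eukaryotic cells, mitochondrial ribosome
  Large subunit: 39S (contains 16S rRNA, MT-RNR2)
  Small subunit: 28S (contains 12S rRNA, MT-RNR1)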

19.Rare biosphere

The low-abundance, high-diversity group of organisms in a community is what is now called the “rare biosphere”.

20.Phred quality score

Phred quality scores were originally developed by the program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces.[1][2] Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
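
The passage does not state the definition itself. The standard one is Q = -10 · log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names and the sample string are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_quality(quality: str, offset: int = 33) -> list[int]:
    """Per-base Phred scores from a FASTQ quality string (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


print(phred_to_error_prob(20))      # about 0.01, i.e. 1 error in 100 calls
print(phred_to_error_prob(30))      # about 0.001
print(decode_fastq_quality("!5I"))  # [0, 20, 40]
```

Older Illumina pipelines used an offset of 64 instead of 33, so the offset should be confirmed for the data at hand.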

21.Base calling

Base calling is the process of assigning bases (nucleobases) to chromatogram peaks. One widely used program for this task is Phred, which has been adopted by both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy.

22.MIAME(Minimum Information About a Microarray Experiment)

MIAME describes the minimum information about a microarray experiment that is needed to interpret the results of the experiment unambiguously and, potentially, to reproduce the experiment. The six most critical elements are:

1.The raw data for each hybridisation.

2.The final processed data for the set of hybridisations in the experiment (study)

3.The essential sample annotation, including experimental factors and their values

4.The experiment design including sample data relationships

5.Sufficient annotation of the array design

6.Essential experimental and data processing protocols



Suffix array (doubling-algorithm implementation)
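
The notes give no code at this point, so what follows is only a minimal sketch of the prefix-doubling idea: sort suffixes by their first k characters, then reuse those ranks to sort by the first 2k characters. It assumes Python's built-in sort in place of the counting/radix sort that a tuned implementation would use, and `suffix_array` is an illustrative name.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # rank by first character
    sa = list(range(n))
    k = 1
    while True:
        # Sort key: rank of the first k characters, then of the next k.
        # A suffix shorter than k gets -1 so it sorts before longer ones.
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank.
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

With a comparison sort each round costs O(n log n), giving O(n log² n) overall; replacing the sort with two counting-sort passes per round brings this down to O(n log n).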





Counting sort





c[1] = 0;  c[2] = 1;  c[3] = 2;  c[4]=1;


a[2] = 1;a[3] = 3;a[4]=4;


for (int i = 1; i <= n; i++) rank[--c[a[i]]] = a[i];
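
The fragments above are only pieces of the procedure. A complete counting sort can be sketched as follows; this is a generic textbook version written for this note (0-based, stable), not a reconstruction of the original code, and `counting_sort` and `max_value` are illustrative names.

```python
def counting_sort(values: list[int], max_value: int) -> list[int]:
    """Stable sort of integers in the range 0..max_value."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1  # how many times each value occurs
    for v in range(1, max_value + 1):
        counts[v] += counts[v - 1]  # counts[v] = number of elements <= v
    output = [0] * len(values)
    # Walk backwards so equal elements keep their original order.
    for v in reversed(values):
        counts[v] -= 1
        output[counts[v]] = v
    return output


print(counting_sort([3, 1, 4, 1, 3], 4))  # [1, 1, 3, 3, 4]
```

The running time is O(n + max_value), which is why it suits the small rank alphabets that arise when building a suffix array.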




How to Become a Bioinformatics Professional

1 Understand what Bioinformaticians do.

  • Broadly, computational biology is involved with developing and implementing tools in order to use and manage biological data.
  • The medical field is a major employer of Bioinformaticians, but they are also needed in industry and agriculture.

2 Stay abreast of new developments in Bioinformatics and biotechnology.

  • This highly technological field is undergoing rapid changes.
  • The Bioinformatics Organization offers continuing education courses.

3 Become proficient in computer science.

  • This includes database administration and programming skills.
  • UNIX is currently the preferred operating system platform.
  • Be able to write programs in languages such as Perl, SQL and C.
  • Learn to use genomic sequence analysis and molecular modeling programs.

4 Study college level biology.

  • Biology courses should include analytical techniques and molecular biology.

5 Take math courses, particularly those for biologists.

  • Biostatistics is an important discipline in Bioinformatics.

6 Pursue higher education. Undergraduate degrees can be in biology, computer science or biotechnology.

  • In graduate school, find a program that combines both disciplines if possible; in practice, the emphasis tends to be on molecular biology, with information technology skills acquired alongside.
  • Bioinformatics or computational biology programs are still fairly new.
  • Researchers should have a doctorate in biology, statistics or math.

7 Learn to identify the right questions to ask in addition to the methodologies to apply.



DNA Packaging: Nucleosomes and Chromatin


Figure: levels of DNA packaging, from the double helix to the chromosome. The labels in the figure read:

  1. At the simplest level, chromatin is a double-stranded helical structure of DNA (arrow labeled 2 nm).
  2. DNA is complexed with histones to form nucleosomes.
  3. Each nucleosome consists of eight histone proteins around which the DNA wraps 1.65 times (arrow labeled 11 nm).
  4. A chromatosome consists of a nucleosome plus the H1 histone.
  5. The nucleosomes fold up to produce a 30-nm fiber…
  6. … that forms loops averaging 300 nm in length.
  7. The 300-nm fibers are compressed and folded to produce a 250-nm-wide fiber (arrow labeled 700 nm).
  8. Tight coiling of the 250-nm fiber produces the chromatid of a chromosome (arrow labeled 1400 nm).




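  1. 2016.5.4, zone 2 stake: 172000 (all in); payout: 516000; profit: 344000. Total Yi coins: 516000;
    1. Over the first eighty moves, the player '乌市少年宫' calculated much faster than 'ltsoo', and the position was still even at move eighty, so I decided to bet on 乌市少年宫;
    2. Seeing that 2/3 of the bettors backed ltsoo, I felt I should go all in for the maximum profit;
    3. I also had a hunch: I trust the fighting strength of kids;
  2. 2016.5.12, zone 1 stake: 100000 (a small bet); payout: 450000; profit: 350000. Total Yi coins: 855460.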

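Excel data analysis

  1. Regression analysis
  2. Line chart
  3. Quickly applying a formula to many cells:
    1. Enter the formula in one cell;
    2. Click that cell, then press Ctrl+Shift+arrow key to select all the cells the formula should apply to;
    3. Press Ctrl+D to fill the formula down and complete the calculation.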


Oxford Nanopore Technology – MinION

MinION is a portable device for molecular analyses that is driven by nanopore technology. It is adaptable for the analysis of DNA, RNA, proteins or small molecules with a straightforward workflow.



The MinION can be run for minutes or days according to the experimental need. Users can adjust settings such as the speed at which the DNA passes through the nanopore. PromethION, which will soon be released into early access, is designed to be fully scalable so that users can operate between one and 48 flow cells at any one time.

Long read lengths

The Oxford Nanopore system processes the reads that are presented to it rather than generating specific read lengths. The longest read reported by a MinION user to date is more than 200 kb, but the device can process the full spectrum of read lengths.


  1. The solitaire card game bundled with Windows (NP-hard)

Keywords of Genomics

Population genetics

Population genetics is the study of the distribution and change in frequency of alleles within populations, and as such it sits firmly within the field of evolutionary biology.

The main processes of evolution are natural selection, genetic drift, gene flow, mutation, and genetic recombination and they form an integral part of the theory that underpins population genetics.

Studies in this branch of biology examine such phenomena as adaptation, speciation, population subdivision, and population structure.

Population stratification

Population stratification refers to differences in allele frequencies between cases and controls that arise from systematic differences in ancestry rather than from an association of genes with disease.

Diploid genome

Diploid genome refers to a genome that contains a balanced set of chromosomes derived equally from maternal and paternal sources.

Coalescent theory

Coalescent theory is a retrospective stochastic model of population genetics that relates genetic diversity in a sample to demographic history of the population from which it was taken.

That is, it is a model of the effect of genetic drift, viewed backwards in time, on the genealogy of antecedents.



A repository is usually used to organize a single project. Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs. We recommend including a README, or a file with information about your project. GitHub makes it easy to add one at the same time you create your new repository. It also offers other common options such as a license file.


Branching is the way to work on different versions of a repository at one time.

By default your repository has one branch named master which is considered to be the definitive branch. We use branches to experiment and make edits before committing them to master.


On GitHub, saved changes are called commits.

Pull Request

When you open a pull request, you’re proposing your changes and requesting that someone review your contribution, pull it in and merge it into their branch. Pull requests show diffs, or differences, of the content from both branches. The changes, additions, and subtractions are shown in green and red.

GitHub Pages


GitHub Pages are public webpages hosted and published through our site.

You can create and publish GitHub Pages online using the Automatic Page Generator. If you prefer to work locally, you can use the GitHub Desktop or the command line.

Pages are served over HTTP, not HTTPS, so you shouldn’t use them for sensitive transactions, like sending passwords or credit card numbers.


  1. Read the abstract
  2. Read the figures
  3. Read the rest selectively

J.Q. Liu

  1. Yak whole-genome resequencing reveals domestication signatures and prehistoric population expansions (2015)
    1. genome variation of wild and domestic yaks
    2. evolution
  2. Genome resequencing: 13 wild yaks and 59 domestic yaks

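Windows install and configure


  1. Reinstall the operating system
  2. Reinstall hardware drivers
  3. Reinstall software
    1. DirectX
  4. Runtime libraries
  5. Programming language compilers and tools
    1. Java
    2. MinGW
    3. Strawberry Perl
  6. Utilities
    1. DAEMON Tools Lite
    2. PC Hunter
    3. Xming + PuTTY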
  7. 123

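NCBI usage notes and tips

  1. About the location operators join and complement in sequence annotations:

    gene complement(2872..3195)
    /gene="lacZ'"
    Sequence: NC_000913.3 (363231..366305, complement)

  2. To search for a target sequence within a specific genome: open the genome, enter the target sequence, and start the search.
  3. For proteins, NCBI provides a way to view conserved domains (CDs), called “Identify Conserved Domains”.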
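
In a feature location, coordinates are 1-based and inclusive on the displayed strand; complement(a..b) means the feature lies on the opposite strand, so its sequence is the reverse complement of bases a through b, and join(...) concatenates several spans first. The sketch below illustrates this on a made-up 12-base sequence. The function names are hypothetical, and a real workflow would more likely rely on a GenBank parser.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(seq: str, segments: list[tuple[int, int]],
                    complement: bool = False) -> str:
    """Sequence of a feature given 1-based inclusive (start, end) spans.

    One span models a plain location a..b; several spans model join(...).
    With complement=True the joined spans are reverse-complemented, as in
    complement(a..b) or complement(join(...)).
    """
    joined = "".join(seq[start - 1:end] for start, end in segments)
    return reverse_complement(joined) if complement else joined


toy = "AAACCCGGGTTT"  # made-up sequence for illustration
print(extract_feature(toy, [(4, 6)]))                         # CCC
print(extract_feature(toy, [(1, 6)], complement=True))        # GGGTTT
print(extract_feature(toy, [(1, 3), (7, 9)]))                  # AAAGGG
print(extract_feature(toy, [(1, 3), (7, 9)], complement=True)) # CCCTTT
```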

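WordPress site setup

  1. Connecting to the database
    1. Find the database IP address at your hosting provider;
    2. Find WordPress's wp-config.php file in the site root directory;
    3. Change the value of DB_HOST to the database IP address.
  2. Backups
    1. Use the Export tool in WordPress's built-in Tools menu to export the site content in a form that WordPress templates generally recognize, so that if something goes wrong, all site content (excluding images) can be imported into any new WordPress template.
    2. Use the hosting provider's data download to download the whole site directly; if the site breaks, re-upload the entire WordPress installation;
  3. Blocking spam comments
    1. WordPress itself asks the administrator to moderate every comment;
    2. Use Akismet, which will afterwards automatically filter comments from the same email address;
  4. Slow loading after hosting on a virtual host in mainland China:
    Fix: install two plugins: Disable Google Fonts and WP Acceleration for China
  5. Images with Chinese file names fail to load in WordPress; use English file names where possible.
  6. Migrating a whole WordPress site requires editing the configuration file.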

Windows 10 debug & optimize