Terminology

1.fold-coverage

The theoretical “fold-coverage” of a shotgun sequencing experiment is:

<number of reads> * <read length> / <target size>
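
As a quick worked illustration of the formula above, here is a minimal Python sketch. The function name fold_coverage and the example numbers are assumptions chosen for illustration, not values from a real experiment.

```python
def fold_coverage(num_reads, read_length, target_size):
    """Theoretical fold-coverage: number of reads * read length / target size."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


# Hypothetical run: 2 million reads of 100 bp over a 5 Mb target.
print(fold_coverage(2_000_000, 100, 5_000_000))  # 40.0
```

This is only the theoretical average; actual coverage at any one position varies around it.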

2.Amplicon

An amplicon is a piece of DNA or RNA that is the source and/or product of natural or artificial amplification or replication events.

It can be formed using various methods including polymerase chain reactions (PCR), ligase chain reactions (LCR), or natural gene duplication.

3.Whole genome mapping

A Whole Genome Map is a high-resolution, ordered, whole genome restriction map generated from single DNA molecules extracted from bacteria, yeast, or other fungi. Whole Genome Mapping is a technology with specific applications in microbiology, in the areas of Comparative Genomics, Strain Typing, and Whole Genome Sequence Assembly. Whole Genome Maps are generated de novo, independent of sequence information, require no amplification or PCR steps, and provide a comprehensive view of whole genome architecture. A Whole Genome Map is displayed in the MapCode pattern, where the vertical lines indicate the locations of restriction sites and the distance between the lines represents the restriction fragment size.

4.Radiation hybrid mapping

A theory is developed to predict marker retention and conditional retention or loss in radiation hybrids. Applied to multiple pairwise analysis of a human chromosome 21 data set, this theory fits much better than proposed alternatives and gives a physical map that is consistent with other evidence and robust with respect to typing errors. Radiation hybrids have great promise to provide order and physical location at two levels of resolution, spanning the techniques of linkage and restriction fragments and not limited to polymorphic loci.

5.DNA barcoding

DNA barcoding is a taxonomic method that uses a short genetic marker in an organism’s DNA to identify it as belonging to a particular species.

6.metric space

In mathematics, a metric space is a set for which distances between all members of the set are defined. Those distances, taken together, are called a metric on the set.

7.Pseudometric space

In mathematics, a pseudometric space is a generalized metric space in which the distance between two distinct points can be zero.
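
To make the difference between the two definitions concrete, the sketch below checks the axioms by brute force on a small finite set of DNA strings. The helper names (hamming, length_gap, check_axioms) and the example strings are assumptions introduced for illustration; Hamming distance stands in for a metric and the difference in length for a pseudometric.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def length_gap(a, b):
    """Absolute difference in length; zero for distinct strings of equal length."""
    return abs(len(a) - len(b))


def check_axioms(points, d):
    """Return (is_pseudometric, is_metric) for distance d on a finite set."""
    pairs = list(product(points, repeat=2))
    pseudo = all(d(x, x) == 0 for x in points)
    pseudo = pseudo and all(d(x, y) >= 0 and d(x, y) == d(y, x) for x, y in pairs)
    pseudo = pseudo and all(
        d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(points, repeat=3)
    )
    # A metric additionally separates distinct points.
    separates = all(d(x, y) > 0 for x, y in pairs if x != y)
    return pseudo, pseudo and separates


seqs = ["ACGT", "ACGA", "TCGA", "TTTT"]
print(check_axioms(seqs, hamming))     # (True, True)
print(check_axioms(seqs, length_gap))  # (True, False)
```

All four strings have the same length, so length_gap gives distance zero between distinct points, which is exactly what a pseudometric allows and a metric forbids.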

8.pyrosequencing

Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the “sequencing by synthesis” principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides. Only one of the four nucleotides (A, T, C, or G) is added and available at a time, so light is emitted only when the added nucleotide is complementary to the next base of the single-stranded template (the sequence to be determined). The intensity of the light indicates whether more than one of the same “letter” occurs in a row. The unincorporated nucleotide is degraded before the next one is added, and the process is repeated with each of the four nucleotides in turn until the sequence of the single-stranded template is determined.
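
The dispensation cycle described above can be sketched as an idealized simulation. The function name flowgram, the flow order "TACG", and the assumption that light intensity is exactly proportional to the number of incorporated bases are illustrative simplifications, not a model of a real instrument.

```python
def flowgram(synthesized, flow_order="TACG", max_flows=100):
    """Idealized signal for each nucleotide dispensation.

    `synthesized` is the strand being built on the template (its complement).
    Each entry is (nucleotide, bases incorporated), so a homopolymer run of
    length 3 yields an intensity of 3 and a non-matching flow yields 0.
    """
    signals = []
    pos = 0
    flow = 0
    # max_flows guards against symbols that never appear in flow_order.
    while pos < len(synthesized) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos + run < len(synthesized) and synthesized[pos + run] == base:
            run += 1
        signals.append((base, run))
        pos += run
        flow += 1
    return signals


print(flowgram("AAGT"))
# [('T', 0), ('A', 2), ('C', 0), ('G', 1), ('T', 1)]
```

Reading the non-zero entries back in order, with each nucleotide repeated by its intensity, recovers the synthesized strand.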

9.n-gram(k-mer)

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.

 

Example (DNA sequencing, where the items are base pairs):

…AGCTTCGA…

1-grams: …, A, G, C, T, T, C, G, A, …

2-grams: …, AG, GC, CT, TT, TC, CG, GA, …

3-grams: …, AGC, GCT, CTT, TTC, TCG, CGA, …
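
The listing above can be reproduced with a few lines of Python. The function name kmers is an assumption for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return every contiguous substring of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "AGCTTCGA"
print(kmers(seq, 2))  # ['AG', 'GC', 'CT', 'TT', 'TC', 'CG', 'GA']
print(kmers(seq, 3))  # ['AGC', 'GCT', 'CTT', 'TTC', 'TCG', 'CGA']

# Counting how often each k-mer occurs is the usual next step.
counts = Counter(kmers(seq, 1))
print(counts["A"], counts["T"])  # 2 2
```

A sequence of length L has L - k + 1 overlapping k-mers, which is why the 3-gram list is one item shorter than the 2-gram list.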

10.sequence space

In evolutionary biology, sequence space is a way of representing all possible sequences (for a protein, gene or genome).

11.k-mer distance

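1. li and lj denote the two sequences.

2. τ denotes a k-mer subsequence.

ni(τ) and nj(τ) denote the number of times that subsequence occurs among the k-mers of each of the two sequences.

3. ki,j denotes the k-mer similarity of the two sequences.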

12.optical map(ordered restriction map)

Optical mapping is a technique for constructing ordered, genome-wide, high-resolution restriction maps from single, stained molecules of DNA, called “optical maps”. By mapping the location of restriction enzyme sites along the unknown DNA of an organism, the spectrum of resulting DNA fragments collectively serves as a unique “fingerprint” or “barcode” for that sequence.

13.Restriction map

A restriction map is a map of known restriction sites within a sequence of DNA. Restriction mapping requires the use of restriction enzymes. In molecular biology, restriction maps are used as a reference to engineer plasmids or other relatively short pieces of DNA, and sometimes for longer genomic DNA.
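
When the sequence is already known, a restriction map can be computed directly. The sketch below is a minimal illustration for a linear molecule and a single enzyme; the function name restriction_map, the cut_offset parameter, and the demo sequence are assumptions, and EcoRI (recognition site GAATTC, cutting after the G) is used only as a familiar example.

```python
def restriction_map(seq, site="GAATTC", cut_offset=1):
    """Return (cut_positions, fragment_lengths) for a linear DNA sequence.

    cut_offset is the number of bases of the recognition site that lie to
    the left of the cut; 1 models EcoRI cutting G^AATTC on the top strand.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    # Fragment sizes are the gaps between consecutive cut positions.
    bounds = [0] + cuts + [len(seq)]
    fragments = [b - a for a, b in zip(bounds, bounds[1:])]
    return cuts, fragments


demo = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
print(restriction_map(demo))  # ([4, 14], [4, 10, 7])
```

The ordered list of fragment lengths is the information that an ordered restriction map (or optical map) records.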

14.Expressed sequence tag

An expressed sequence tag or EST is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs available in public databases as of 1 January 2013 (e.g. GenBank, all species).

15.Multiple Sequence Alignment

A Multiple Sequence Alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences’ shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

16.POA(Partial Order Alignment)

Partial order alignment (POA) has been proposed as a new approach to multiple sequence alignment (MSA), which can be combined with existing methods such as progressive alignment. This is important for addressing problems both in the original version of POA (such as order sensitivity) and in standard progressive alignment programs (such as information loss in complex alignments, especially surrounding gap regions).

17.Progressive Alignment

This approach begins with the alignment of the two most closely related sequences (as determined by pairwise analysis) and subsequently adds the next closest sequence or sequence group to this initial pair [37,7]. This process continues in an iterative fashion, adjusting the positioning of indels in all sequences. The major shortcoming of this approach is that a bias may be introduced in the inference of the ordered series of motifs (homologous parts) because of an overrepresentation of a subset of sequences.

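18.Ribosomal small subunit (SSU)

The ribosomal small subunit is the smaller of the two ribosomal subunits. Every ribosome consists of one small subunit and one large subunit.[1] During translation, the small subunit is responsible for recognizing the message (mRNA). The 70S ribosome of prokaryotic cells, the 80S cytoplasmic ribosome of eukaryotic cells, and the mitochondrial ribosome of eukaryotic cells each have a different small subunit: the 70S ribosome contains the 30S subunit, the 80S ribosome contains the 40S subunit, and the mitochondrial ribosome contains the 28S subunit.

Prokaryotic cells (70S ribosome)
  Large subunit: 50S (contains 5S rRNA and 23S rRNA)
  Small subunit: 30S (contains 16S rRNA)
Eukaryotic cells, cytoplasmic ribosome (80S ribosome)
  Large subunit: 60S (contains 5S rRNA, 5.8S rRNA, and 28S rRNA)
  Small subunit: 40S (contains 18S rRNA)
Eukaryotic cells, mitochondrial ribosome
  Large subunit: 39S (contains 16S rRNA, MT-RNR2)
  Small subunit: 28S (contains 12S rRNA, MT-RNR1)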

19.rare biosphere

The low-abundance, high-diversity group of organisms in a community is what is now called the “rare biosphere”.

20.Phred quality score

Phred quality scores were originally developed by the program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces.[1][2] Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.
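
A Phred quality score Q is related to the estimated probability P that a base call is wrong by Q = -10 * log10(P), so Q20 means 1 error in 100 and Q30 means 1 error in 1000. The sketch below shows the conversion in both directions. The function names are assumptions, and the FASTQ decoding helper assumes the common Sanger-style encoding (ASCII offset 33), which the text above does not discuss.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_quality(qual, offset=33):
    """Decode a FASTQ quality string, assuming the given ASCII offset."""
    return [ord(c) - offset for c in qual]


print(decode_fastq_quality("I5!"))  # [40, 20, 0]
```

Because the scale is logarithmic, scores should be converted to probabilities before they are averaged.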

21.Base calling

Base calling is the process of assigning bases (nucleobases) to chromatogram peaks. One widely used program for this task is Phred, which has been broadly adopted by both academic and commercial DNA sequencing laboratories because of its high base-calling accuracy.

22.MIAME(Minimum Information About a Microarray Experiment)

MIAME describes the minimum information about a microarray experiment that is needed to interpret the results of the experiment unambiguously and potentially to reproduce the experiment. It has six elements:

1. The raw data for each hybridisation.

2. The final processed data for the set of hybridisations in the experiment (study).

3. The essential sample annotation, including experimental factors and their values.

4. The experiment design, including sample data relationships.

5. Sufficient annotation of the array design.

6. Essential experimental and data processing protocols.

 

 

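Suffix array (prefix-doubling implementation)

The prefix-doubling algorithm calls radix sort repeatedly until it has the rank array “rank”;
the suffix array “sa” is then obtained from the rank array.

“Doubling” refers to the length of the substrings compared in each round, which grows as 2^0, 2^1, 2^2, … (a^b means a to the power b);
rank[i] is the rank of the suffix that starts at index i;
sa[i] is the start index of the suffix whose rank is i;
Pay attention to where the numbering starts.

Time complexity:
O(n*log2(n))
log2(n) is the number of doubling rounds, and the radix sort in each round takes O(n) time;

Pay attention to two important bounds: the range of each individual value and the number of values;
that is, the minimum and maximum of each value, and the total number of items to process;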
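
The doubling idea can be sketched compactly in Python. Note one deliberate substitution: this sketch uses Python's built-in comparison sort on (rank, rank) pairs instead of radix sort, so it runs in O(n*log2(n)^2) rather than the O(n*log2(n)) given above. The function name suffix_array and the 0-based numbering are assumptions for illustration.

```python
def suffix_array(s):
    """Build (sa, rank) by prefix doubling, with 0-based indices and ranks.

    sa[r] is the start index of the suffix with rank r;
    rank[i] is the rank of the suffix starting at index i.
    """
    n = len(s)
    if n == 0:
        return [], []
    rank = [ord(c) for c in s]  # round 0: rank by the first character
    sa = list(range(n))
    k = 1
    while True:
        # Compare suffixes by their first 2k characters, using the ranks
        # of the two halves; -1 marks a half that runs past the end.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            break
        k *= 2
    return sa, rank


sa, rank = suffix_array("banana")
print(sa)    # [5, 3, 1, 0, 4, 2]
print(rank)  # [3, 2, 5, 1, 4, 0]
```

Replacing the sa.sort call with two counting-sort passes (second key, then first key) gives the radix-sort version that the notes above describe.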

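Counting sort

Principle:

Assume the items to be sorted are integers in the range 1…1000;

Example: 2, 3, 3, 4

First count how many times each number occurs, stored in c[i]: i is the number and c[i] is the number of times i occurs;

c[1] = 0;  c[2] = 1;  c[3] = 2;  c[4] = 1;

Then, from smallest to largest, count how many numbers are less than or equal to each number, stored in a[i]: i is the number and a[i] = c[1] + c[2] + c[3] + … + c[i];

a[1] = 0;  a[2] = 1;  a[3] = 3;  a[4] = 4;

Next compute the position of each number: traverse the input numbers x[1..n] and place each one directly into its position in the array rank;

for (int i = n; i >= 1; i--) rank[--a[x[i]]] = x[i];

Here x[i] is the i-th input number and rank is indexed from 0. Only the array rank stores the sorted numbers, and the relative order of equal numbers depends on how the loop is written: traversing the input from back to front, as above, keeps equal numbers in their original order.

The book 《算法设计编程实验》 says that counting sort is used to compute the rank array of a suffix array (rank here means the suffix array's rank array); that is actually a slip of the pen, since radix sort is what is used there.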
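
The three steps of counting sort can be put together as follows. This is a minimal sketch: the function name counting_sort and the max_value parameter are assumptions, and a single array count plays the role of both the occurrence counts and the running totals described above.

```python
def counting_sort(values, max_value):
    """Stable counting sort for integers in the range 1..max_value."""
    count = [0] * (max_value + 1)
    for v in values:
        if not 1 <= v <= max_value:
            raise ValueError("value out of range: %r" % v)
        count[v] += 1  # step 1: occurrences of each number

    for v in range(1, max_value + 1):
        count[v] += count[v - 1]  # step 2: how many numbers are <= v

    out = [0] * len(values)
    # Step 3: walk the input backwards so equal numbers keep their order.
    for v in reversed(values):
        count[v] -= 1
        out[count[v]] = v
    return out


print(counting_sort([2, 3, 3, 4], 4))  # [2, 3, 3, 4]
print(counting_sort([4, 3, 2, 3], 4))  # [2, 3, 3, 4]
```

The running time is O(n + max_value), which is why the range of the values matters as much as their number.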

 

How to Become a Bioinformatics Professional

1 Understand what Bioinformaticians do.

  • Broadly, computational biology involves developing and implementing tools to use and manage biological data.
  • The medical field is a major employer of Bioinformaticians, but they are also needed in industry and agriculture.

2 Stay abreast of new developments in Bioinformatics and biotechnology.

  • This highly technological field is undergoing rapid changes.
  • The Bioinformatics Organization offers continuing education courses.

3 Become proficient in computer science.

  • This includes database administration and programming skills.
  • UNIX is currently the preferred operating system platform.
  • Be able to write programs in computer languages such as Perl, SQL and C.
  • Learn to use genomic sequence analysis and molecular modeling programs.

4 Study college level biology.

  • Biology courses should include analytical techniques and molecular biology.

5 Take math courses, particularly those for biologists.

  • Biostatistics is an important discipline in Bioinformatics.

6 Pursue higher education. Undergraduate degrees can be in biology, computer science or biotechnology.

  • In graduate school, find a program that combines both disciplines, if possible; however, the emphasis tends to be on molecular biology, with information technology skills acquired along the way.
  • Bioinformatics or computational biology programs are still fairly new.
  • Researchers should have a doctorate in biology, statistics or math.

7 Learn to identify the right questions to ask in addition to the methodologies to apply.

It sounds just that easy…

http://www.wikihow.com/Become-a-Bioinformatics-Professional

DNA Packaging: Nucleosomes and Chromatin


At the top right portion of the diagram, a vertical double-ended arrow indicates that the DNA double helix strands are 2 nm apart. The strands are represented as gray ribbons connected by vertical colored bars that are either half red/half green or half yellow/half cyan.

As the DNA strand reaches the left side of the illustration, all colors are replaced by gray. Box 1 has the text “At the simplest level, chromatin is a double-stranded helical structure of DNA.” The DNA strand turns down and goes back toward the right, still compacting along the way.

Below this is Box 2, with the text “DNA is complexed with histones to form nucleosomes.” Toward the center of the schematic are three sets of two brown discs, each disc quartered, and the cylinders are wrapped 1.65 times by the DNA, which has now compacted into a thick gray thread shape. Each nucleosome consists of eight histone molecules.

To the right of the first nucleosome complex is Box 3, with the text “Each nucleosome consists of eight histone proteins around which the DNA wraps 1.65 times.” The second nucleosome has a vertical red bar, about as long as the nucleosome is high, attached to the side of the nucleosome. This bar is labeled H1 histone. A horizontal, double-ended, black arrow indicates the nucleosome with DNA has a diameter of 11 nm. A third nucleosome to the right of the second is labeled “Chromatosome.” Above and to the right of the chromatosome is Box 4, with the text “A chromatosome consists of a nucleosome plus the H1 histone.”

Below this, the nucleosomes are folded in on each other to form a hollow, tube-like fiber, where many nucleosomes are arranged in parallel rings to form the tube’s outer layer. To the right of this is a vertical, double-ended, black arrow labeled 30 nm. To the right of this arrow is Box 5, with the text “The nucleosomes fold up to produce a 30-nm fiber…” The nucleosome tube continues to compact to form a gray spiral and gray squiggles as it continues leftward. Above this is Box 6 with the text “… that forms loops averaging 300 nm in length.” A black, vertical, double-ended arrow is labeled 300 nm. The squiggles compact further, going down and back toward the right, coiling like a telephone cord. Below this is Box 7 with the text “The 300-nm fibers are compressed and folded to produce a 250-nm-wide fiber.” A black, vertical, double-ended arrow is labeled 700 nm. Two, inward-pointing, black arrows indicate a gap labeled “250-nm-wide fiber.”

These coils continue to the right and compress further, forming a horizontal, X-shaped, chromosome. A black, vertical, double-ended arrow is labeled 1400 nm. Below this is Box 8 with the text “Tight coiling of the 250-nm fiber produces the chromatid of a chromosome.”

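The weakness here is that the folding details between Box 6 and Box 8 are not clearly explained. Moreover, the conformation of chromatin and chromosomes changes dynamically over time rather than staying fixed. So the description above should be regarded as a snapshot.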

http://www.nature.com/scitable/topicpage/DNA-Packaging-Nucleosomes-and-Chromatin-310

New betting records (Go)

  1. 2016.5.4 Bet in interval 2: 172000 (all in); received: 516000; profit: 344000. Total Yi coins: 516000;
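    Basis for the decision:
    1. Over the first eighty moves, the player ’乌市少年宫’ calculated much faster than ’ltsoo’, and up to move eighty the position was still even, so I decided to bet on 乌市少年宫;
    2. Seeing that two thirds of the bettors backed ltsoo, I felt I should go all in to maximize the profit;
    3. There was also a gut feeling: I trust the fighting strength of children;
    In the middle game I misread a seki and thought White had lost. I was heartbroken and even said in the chat area that White was far behind. On reflection, I should not have done that; one should never comment while others are playing.
    White's reading when Black invaded the lower right corner was fast and accurate. Making life in the upper left corner was superb. The reduction of Black's position pressed at every point and was judged just right. This is my highest Yi-coin income so far. I used to say that by the time my Yi coins reached one million my Go would surely have improved a lot, because my betting skill would have grown. But it seems that blind faith in children plus luck can get me to one million very quickly. It looks like my Go will only get a big boost from betting once I reach ten million.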
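  2. 2016.5.12 Bet in interval 1: 100000 (a small flutter); received: 450000; profit: 350000. Total Yi coins: 855460.
    As for winning this bet, I had sgo running. I downloaded the game record and evaluated it with sgo: Black was ahead by 8 points and the odds on Black were 4.5 at the time, so I bet 100,000 as a small flutter. After all, my total was over 500,000 Yi coins, so losing 100,000 would not have hurt much.
    Also, the exchange rate between Yi coins and RMB is: 1 yuan = 1 gold coin = 400,000 Yi coins.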

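Excel data analysis

  1. Regression analysis
  2. Line chart
  3. Quickly applying a formula to many cells:
    1. Enter the formula in one cell;
    2. Click that cell, then press Ctrl+Shift+arrow key to select all the cells that need the formula;
    3. Press Ctrl+D to fill down and complete the calculation.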

 

Oxford Nanopore Technology – MinION

MinION is a portable device for molecular analyses that is driven by nanopore technology. It is adaptable for the analysis of DNA, RNA, proteins or small molecules with a straightforward workflow.


Scalability

The MinION can be run for minutes or days according to the experimental need. Users can adjust settings such as the speed at which the DNA passes through the nanopore. PromethION, which will soon be released into early access, is designed to be fully scalable so that users can operate between one and 48 flow cells at any one time.

Long read lengths

The Oxford Nanopore system processes the reads that are presented to it rather than generating specific read lengths. The longest read reported by a MinION user to date is more than 200 kb, but the system can process the full spectrum of read lengths.

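Algorithms around us

  1. The solitaire game bundled with Windows (NP-hard)
    Generating a deal that is guaranteed to be solvable is an NP-hard problem; the computer deals a random deck each time and does not guarantee that a solution exists. So when a game cannot be finished no matter what you try, it may be that this deal really is unsolvable rather than your own fault. However, deciding whether the fault lies with you or with the deal is itself an NP-hard problem. One could write a program to compute, once we feel we can make no further progress, whether the cards revealed so far admit some new arrangement that would let another card be turned over.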

Keywords of Genomics

Population genetics

Population genetics is the study of the distribution and change in frequency of alleles within populations, and as such it sits firmly within the field of evolutionary biology.

The main processes of evolution are natural selection, genetic drift, gene flow, mutation, and genetic recombination and they form an integral part of the theory that underpins population genetics.

Studies in this branch of biology examine such phenomena as adaptation, speciation, population subdivision, and population structure.

Population stratification

Population stratification refers to differences in allele frequencies between cases and controls that are caused by systematic differences in ancestry rather than by association of genes with disease.

Diploid genome

Diploid genome refers to a genome that contains a balanced set of chromosomes derived equally from maternal and paternal sources.

Coalescent theory

Coalescent theory is a retrospective stochastic model of population genetics that relates genetic diversity in a sample to demographic history of the population from which it was taken.

That is, it is a model of the effect of genetic drift, viewed backwards in time, on the genealogy of antecedents.

GitHub

Repository

A repository is usually used to organize a single project. Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs. We recommend including a README, or a file with information about your project. GitHub makes it easy to add one at the same time you create your new repository. It also offers other common options such as a license file.

Branch

Branching is the way to work on different versions of a repository at one time.

By default your repository has one branch named master which is considered to be the definitive branch. We use branches to experiment and make edits before committing them to master.

Commit

On GitHub, saved changes are called commits.

Pull Request

When you open a pull request, you’re proposing your changes and requesting that someone review and pull in your contribution and merge it into their branch. Pull requests show diffs, or differences, of the content from both branches. The changes, additions, and subtractions are shown in green and red.

GitHub Pages


GitHub Pages are public webpages hosted and published through our site.

You can create and publish GitHub Pages online using the Automatic Page Generator. If you prefer to work locally, you can use GitHub Desktop or the command line.

Pages are served over HTTP, not HTTPS, so you shouldn’t use them for sensitive transactions, like sending passwords or credit card numbers.

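How to read a paper?

  1. Read the abstract
    The abstract quickly tells us the paper's topic, its subject of study, and its experimental conclusions, which helps us decide whether the paper contains the information we need;
  2. Read the figures
    The figures quickly give us the paper's most condensed information, so we can get straight to its core conclusions; figures are also easy to understand and help us form a first impression of the paper;
  3. Read selectively
    After the two steps above, pick the parts you are interested in and read them in depth.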

J.Q. Liu

  1. Yak whole-genome resequencing reveals domestication signatures and prehistoric population expansions (2015)
    1. genome variation of wild and domestic yaks
    2. evolution
  2. Genome resequencing: 13 wild yaks and 59 domestic yaks

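Windows install and configure

If you have a computer repair shop reinstall the system, take care to find a good one. Reinstalls may look alike, but the installation image and some configuration details differ from shop to shop. A system reinstalled by a poor shop makes later configuration much harder.

  1. Reinstall the operating system
    Windows 7
  2. Reinstall hardware drivers
    1. Graphics driver
  3. Reinstall software
    1. DirectX
  4. Runtime libraries
  5. Programming language toolchains
    1. Java
    2. MinGW
    3. Strawberry Perl
  6. Utilities
    1. DAEMON Tools Lite
    2. PC Hunter
    3. Xming + PuTTY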
  7. 123

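NCBI usage notes and tips

  1. About the location operators join and complement:
    join: the listed ranges are joined end to end to form one contiguous sequence (for example, the exons of a spliced gene);
    complement: the feature lies on the strand complementary to the displayed sequence and is read 5′->3′ on that strand;
    example:
    join:
    As of 2016.4.5, I no longer seem to see join in the records I look at.
    complement:
    All the sequences under the gene category that I have seen are given as complement.

    gene complement(2872..3195)
    /gene="lacZ'"
    Sequence: NC_000913.3 (363231..366305, complement)

  2. To search for a target sequence in a specific genome: open the genome, enter the target sequence, and start the search.
  3. For proteins, NCBI provides a tool for viewing their CDs (conserved domains), called “Identify Conserved Domains”;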
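
To show what a complement(...) or join(...) location in item 1 means in practice, the sketch below extracts the corresponding bases from a sequence string. It is a simplified illustration, not a full parser: the function names and the toy sequence are assumptions, coordinates are 1-based and inclusive as in GenBank records, and partial positions such as <1..5 or deeper nesting are not handled.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement, i.e. the other strand read 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(seq, location):
    """Extract the bases for 'a..b', 'join(...)' or 'complement(...)'."""
    location = location.strip()
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract(seq, inner))
    if location.startswith("join(") and location.endswith(")"):
        parts = location[len("join("):-1].split(",")
        return "".join(extract(seq, part) for part in parts)
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError("unsupported location: %r" % location)
    start, end = int(match.group(1)), int(match.group(2))
    return seq[start - 1:end]  # 1-based inclusive -> Python slice


toy = "AACCGGTT"
print(extract(toy, "2..4"))              # ACC
print(extract(toy, "complement(2..4)"))  # GGT
print(extract(toy, "join(1..2,7..8)"))   # AATT
```

For real records, an established parser such as Biopython's SeqIO is a better choice than hand-written string handling.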

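WordPress site setup

  1. Connect to the database
    1. Find the database IP address at the hosting provider;
    2. Find WP's wp-config.php file in the site's root directory;
    3. Change the value of DB_HOST to the database IP address
  2. Backup
    1. Use the export tool under WP's built-in Tools menu to export the site content in a format that WP installations generally recognize, so that if something goes wrong, all of the site's content (excluding images) can be imported into any new WP installation.
    2. Use the hosting provider's data download to download the whole site directly; if the site runs into problems, re-upload the entire WP installation;
  3. Comment spam protection
    1. WP itself can require the administrator to review every comment;
    2. Use Akismet, which afterwards automatically filters comments from the same email address;
  4. Slow loading after hosting on a virtual host in mainland China:
    Cause: Google's fonts and ajax libraries are used
    Solution: install two plugins: Disable Google Fonts and WP Acceleration for China
  5. Images with Chinese file names may fail to load in WordPress; use English file names where possible
  6. Migrating a whole WordPress site: the configuration file needs to be modified.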

Windows 10 debug & optimize