Expressed sequence tag

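An expressed sequence tag (EST) is a short fragment consisting of part of a cDNA sequence.^[1] In genetics, ESTs are used to identify gene transcripts and are useful for gene discovery and gene-sequence determination.^[2] The identification of ESTs has proceeded rapidly: approximately 74.2 million ESTs were available in public databases as of January 1, 2013 (GenBank, all species).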

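An EST is obtained by single-pass sequencing of a cloned cDNA. The cDNAs used to generate ESTs are typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality fragment whose length is limited by current technology to approximately 500–800 nucleotides. Because cDNA clones consist of DNA complementary to mRNA, an EST represents part of an expressed gene. An EST in a database represents either the cDNA/mRNA sequence itself or the sequence of the template strand, which is the reverse complement of the mRNA.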
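Because a database EST may be stored in either orientation, any comparison against a known transcript has to consider both strands. The following is a minimal Python sketch of that check, not a production aligner: the function names `reverse_complement` and `est_orientation` are hypothetical, and exact substring matching is assumed for simplicity even though real ESTs contain sequencing errors.

```python
# Complement table for the four DNA bases plus N (unknown base), both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def est_orientation(est: str, mrna: str) -> str:
    """Classify an EST relative to a known mRNA (written as DNA).

    Returns 'sense' if the EST occurs in the mRNA as given,
    'antisense' if only its reverse complement occurs, and
    'unmatched' otherwise. Exact matching is an illustrative
    simplification; it does not tolerate sequencing errors.
    """
    est, mrna = est.upper(), mrna.upper()
    if est in mrna:
        return "sense"
    if reverse_complement(est) in mrna:
        return "antisense"
    return "unmatched"


if __name__ == "__main__":
    transcript = "ATGGCATTC"  # made-up toy sequence
    for tag in ("GGCA", "TGCC", "AAAA"):
        print(tag, est_orientation(tag, transcript))
```

In practice the same two-strand check is performed with an error-tolerant alignment tool rather than substring search.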

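Mapping ESTs to specific chromosomal locations with physical mapping techniques such as radiation hybrid (RH) mapping, HAPPY mapping, or fluorescence in situ hybridization (FISH) gives clues to the chromosomal location of the corresponding gene. If the genome of the organism from which the EST was derived has been sequenced, the EST sequence can instead be aligned to that genome computationally.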

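The current understanding of the human gene set (as of 2006) includes several thousand genes whose existence is supported only by EST evidence. For such genes, ESTs serve as a tool for refining the predicted transcripts, which in turn leads to prediction of the protein products and, ultimately, of their function. In addition, the circumstances in which an EST was obtained (tissue, organ, or a disease state such as cancer) provide information about the conditions under which the corresponding gene acts. ESTs contain enough information to design precise probes for DNA microarrays, which can then be used to determine gene expression profiles.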

The term "EST" is also sometimes used to describe a gene for which little or no information exists beyond the tag.^[3]

History

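In 1979, teams at Harvard University and the California Institute of Technology extended the basic idea of making DNA copies of mRNAs in vitro to amplifying a library of such copies in bacterial plasmids.^[4] In 1982, the idea of selecting random or semi-random clones from a cDNA library for sequencing was explored by Greg Sutcliffe and co-workers.^[5] In 1983, Putney et al. sequenced 178 clones from a rabbit muscle cDNA library.^[6] In 1991, Adams et al. coined the term EST and began a more systematic sequencing project, starting with 600 brain cDNAs.^[2]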

Data sources and annotation

dbEST

dbEST was established in 1992 as a division of GenBank. Its data are submitted directly by laboratories worldwide and are not curated.

EST contigs

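EST sequencing often produces many distinct ESTs that correspond to the same mRNA. To reduce the number of ESTs used in downstream gene-discovery analyses, several groups have assembled ESTs into EST contigs. EST contigs are provided by, among others, the TIGR Gene Indices,^[7] UniGene,^[8] and STACK.^[9]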
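The idea of collapsing overlapping ESTs into contigs can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions and is not the method used by any of the resources named above: it requires exact suffix-prefix overlaps, ignores sequencing errors and strand orientation, and the names `overlap_length`, `assemble_contigs`, and `min_overlap` are hypothetical.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def _drop_contained(seqs):
    """Remove sequences that occur inside another, longer sequence."""
    return [s for s in seqs if not any(s != t and s in t for t in seqs)]


def assemble_contigs(ests, min_overlap=4):
    """Greedily merge ESTs that share exact suffix-prefix overlaps.

    At each step the pair with the longest overlap is merged, until
    no pair overlaps by at least `min_overlap` bases (assumed >= 1).
    """
    # Remove exact duplicates (order preserved), then contained reads.
    contigs = _drop_contained(list(dict.fromkeys(ests)))
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(a, b, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs = _drop_contained(rest + [merged])


if __name__ == "__main__":
    toy_ests = ["ATGGCGTA", "GCGTACCT", "TTTTGGGG", "GCGT"]  # made-up reads
    print(assemble_contigs(toy_ests))
```

With the toy reads shown, the first two ESTs merge into one contig, the contained read `GCGT` is absorbed, and the unrelated read stays separate.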

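Constructing EST contigs is not trivial and can introduce many artifacts, such as contigs that contain the products of two distinct genes. When the complete genome of a species has been sequenced and its transcripts annotated, contig construction can be bypassed and ESTs matched directly to transcripts. This approach is used in the TissueInfo system (described below) and makes it easy to link annotations in a genome database to the tissue information provided by EST data.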

Tissue information

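High-throughput analyses of ESTs often face similar data management challenges. The first challenge was that the tissue provenance of EST libraries is described in plain English text in dbEST.^[10] This made it difficult to write programs that can determine unambiguously whether two libraries were sequenced from the same tissue. Similarly, the disease state of the tissue is not annotated in a form suited to computation, and the fact that a library is derived from cancer is often mixed into the tissue name (for example, the tissue name "glioblastoma" means both that the EST library comes from brain tissue and that the disease state is cancer).^[11] Apart from exceptions such as cancer, the disease state is often not recorded in dbEST entries at all. The TissueInfo project was started in 2000 to help with these challenges. The project provides curated data (updated daily) that disambiguate tissue origin and disease state (cancer/non-cancer), and a tissue ontology that links tissues and organs by part-of relationships (formalizing knowledge such as that the hypothalamus is part of the brain, and the brain is part of the central nervous system). It also distributes open-source software that links transcript annotations from sequenced genomes to tissue expression profiles calculated from dbEST data.^[12]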
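The kind of curation and ontology described above can be sketched in a few lines of Python. Everything here is illustrative and assumed: the tables `PART_OF` and `CURATED` contain only the examples mentioned in the text, and the function names are hypothetical and do not reflect the actual TissueInfo software or data format.

```python
from collections import Counter

# Toy part-of ontology (child -> parent), using the examples from the text.
PART_OF = {
    "hypothalamus": "brain",
    "brain": "central nervous system",
}

# Toy curated table: free-text library description -> (tissue, is_cancer).
CURATED = {
    "glioblastoma": ("brain", True),
    "brain": ("brain", False),
    "hypothalamus": ("hypothalamus", False),
}


def normalize(description):
    """Map a free-text description to (tissue, is_cancer), or None if uncurated."""
    return CURATED.get(description.strip().lower())


def is_part_of(tissue, organ):
    """True if `tissue` equals `organ` or lies below it in the ontology."""
    while tissue != organ:
        if tissue not in PART_OF:
            return False
        tissue = PART_OF[tissue]
    return True


def same_tissue(desc_a, desc_b):
    """True if two curated library descriptions refer to the same tissue."""
    a, b = normalize(desc_a), normalize(desc_b)
    return a is not None and b is not None and a[0] == b[0]


def tissue_profile(est_hits):
    """Count ESTs per tissue for each transcript.

    `est_hits` is an iterable of (transcript_id, library_description)
    pairs; hits from uncurated libraries are skipped.
    """
    profile = {}
    for transcript, description in est_hits:
        curated = normalize(description)
        if curated is None:
            continue
        profile.setdefault(transcript, Counter())[curated[0]] += 1
    return profile


if __name__ == "__main__":
    hits = [("T1", "Glioblastoma"), ("T1", "brain"), ("T2", "hypothalamus")]
    print(tissue_profile(hits))
    print(is_part_of("hypothalamus", "central nervous system"))
```

Separating the tissue name from the cancer flag is what lets a program treat a "glioblastoma" library and a "brain" library as the same tissue while still distinguishing their disease state.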

References

1. ESTs Factsheet. National Center for Biotechnology Information.
2. “Complementary DNA sequencing: expressed sequence tags and human genome project”. Science 252 (5013): 1651–6. (Jun 1991). doi:10.1126/science.2047873. PMID 2047873.
3. “What is dbEST?”. www.ncbi.nlm.nih.gov. Retrieved March 31, 2020.
4. “Use of a cDNA library for studies on evolution and developmental expression of the chorion multigene families”. Cell 18 (4): 1303–16. (December 1979). doi:10.1016/0092-8674(79)90241-1. PMID 519770.
5. “Common 82-nucleotide sequence unique to brain RNA”. Proc Natl Acad Sci U S A 79 (16): 4942–6. (August 1982). doi:10.1073/pnas.79.16.4942. PMC 346801. PMID 6956902.
6. “A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing”. Nature 302 (5910): 718–21. (1983). doi:10.1038/302718a0. PMID 6687628.
7. “The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes”. Nucleic Acids Res. 33 (Database issue): D71–4. (Jan 2005). doi:10.1093/nar/gki064. PMC 540018. PMID 15608288.
8. “Identifying tissue-enriched gene expression in mouse tissues using the NIH UniGene database”. Appl Bioinform 2 (3 Suppl): S65–73. (2003). PMID 15130819.
9. “STACK: Sequence Tag Alignment and Consensus Knowledgebase”. Nucleic Acids Res. 29 (1): 234–8. (Jan 2001). doi:10.1093/nar/29.1.234. PMC 29830. PMID 11125101.
10. “TissueInfo: high-throughput identification of tissue expression profiles and specificity”. Nucleic Acids Res. 29 (21): E102–2. (Nov 2001). doi:10.1093/nar/29.21.e102. PMC 60201. PMID 11691939.
11. “Mining expressed sequence tags identifies cancer markers of clinical interest”. BMC Bioinformatics 7: 481. (2006). doi:10.1186/1471-2105-7-481. PMC 1635568. PMID 17078886.
12. Institute for Computational Biomedicine: TissueInfo. Archived June 4, 2008, at the Wayback Machine.

External links

“ESTs: Gene Discovery Made Easier”. Science Primer. NCBI. Retrieved April 1, 2020.
Pontius, Joan U.; Wagner, Lukas; Schuler, Gregory D. (Aug 13, 2003). “Ch. 21 - UniGene: A Unified View of the Transcriptome - sec. Expressed Sequence Tags (ESTs)”. NCBI Handbook. NCBI. NBK21101. "This publication is provided for historical reference only and the information may be out of date"
Friedel, CC; Jahn, KH; Sommer, S; Rudd, S; Mewes, HW; Tetko, IV (Apr 15, 2005). “Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage (ECLAT)”. Bioinformatics 21 (8): 1383–8. doi:10.1093/bioinformatics/bti200. PMID 15585526.
- “ECLAT”. Munich Information Center for Protein Sequences. Retrieved April 1, 2020. “Server for the classification of ESTs from mixed EST pools (from fungus infected plants) using codon usage”
“dbEST”. GenBank. Retrieved April 1, 2020.
- “dbEST summary”. GenBank. Retrieved April 1, 2020.
Ranganathan, Shoba. “Bioinformatics”. Retrieved April 1, 2020.
- “Web Resources for EST data and analysis”. Retrieved April 1, 2020.

TissueInfo

“TissueInfo”. Wiki. Retrieved April 1, 2020.
“TissueInfo”. Retrieved April 1, 2020. “Curated EST tissue provenance, tissue ontology, open-source software”
“TissueInfo: high-throughput identification of tissue expression profiles and specificity”. Nucleic Acids Res. 29 (21): E102–2. (Nov 1, 2001). doi:10.1093/nar/29.21.e102. PMC 60201. PMID 11691939.
