Statistically Improbable Phrases

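A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document than in some larger corpus^[1]^[2]^[3]. Because the keywords of a book or chapter tend to appear disproportionately within that section, Amazon.com used this concept to determine the keywords for a given book or chapter^[4]^[5]. In his book Dataclysm, Christian Rudder applied the same concept to data from online dating sites and Twitter posts to determine the phrases most characteristic of a given race or gender^[6].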

Examples

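In a document about computers, the most common word is likely to be "the". But "the" is also the most commonly used word in the English language, so any given document is likely to use it very frequently. A phrase such as "explicit Boolean algorithm", however, may occur in the document at a much higher rate than it does in English as a whole. It is therefore a phrase that is unlikely to occur in any given document but that did occur in the document at hand. "Explicit Boolean algorithm" would be a statistically improbable phrase.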
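The comparison described above can be sketched as a ratio between a phrase's rate in the document and its rate in a reference corpus. This is only a minimal illustration, not the method used by Amazon.com or any cited source: the function names `tokenize`, `phrase_rate`, and `improbability`, the additive smoothing on the corpus side, and the two sample texts are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def phrase_rate(tokens, phrase, smoothing=0.0):
    """Occurrences of `phrase` per possible position in `tokens`.

    `smoothing` is a hypothetical additive constant that keeps the rate
    above zero when the phrase never occurs.
    """
    target = tokenize(phrase)
    n = len(target)
    positions = max(len(tokens) - n + 1, 1)
    hits = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
    )
    return (hits + smoothing) / (positions + smoothing)


def improbability(document, corpus, phrase):
    """Ratio of the phrase's rate in the document to its rate in the corpus.

    Values well above 1 mean the phrase is over-represented in the document.
    """
    doc_rate = phrase_rate(tokenize(document), phrase)
    corpus_rate = phrase_rate(tokenize(corpus), phrase, smoothing=1.0)
    return doc_rate / corpus_rate


# Toy texts, invented for illustration only.
document = (
    "The explicit Boolean algorithm is the core of the design. "
    "The explicit Boolean algorithm runs first."
)
corpus = (
    "The cat sat on the mat. The dog chased the cat across the garden "
    "and the algorithm of the day was simple."
)

for phrase in ("the", "explicit Boolean algorithm"):
    print(phrase, improbability(document, corpus, phrase))
```

With these toy texts, "the" is frequent in both the document and the corpus, so its ratio stays below 1, while "explicit Boolean algorithm" appears only in the document and scores above 1. A real system would need a much larger corpus and a more careful estimate of corpus frequencies.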

The statistically improbable phrases of Darwin's On the Origin of Species are "temperate productions", "genera descended", "transitional gradations", "unknown progenitor", "fossiliferous formations", "our domestic breeds", "modified offspring", "doubtful forms", "closely allied forms", "profitable variations", "enormously remote", "transitional grades", and "very distinct species and mongrel offspring"^[7].

Footnotes

[1] “SIPping Wikipedia”. Courses.cms.caltech.edu. Retrieved 1 January 2017.
[2] Jonathan Bailey (3 July 2012). “How Long Should a Statistically Improbably Phrase Be?”. Plagiarism Today. Retrieved 16 February 2018.
[3] Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R. (1 June 2010). “Identifying duplicate content using statistically improbable phrases”. Bioinformatics 26 (11): 1453–1457. doi:10.1093/bioinformatics/btq146. PMC 2872002. PMID 20472545. Retrieved 1 January 2017.
[4] “What are Statistically Improbable Phrases?”. Amazon.com. Retrieved 18 December 2007.
[5] Weeks, Linton (30 August 2005). “Amazon's Vital Statistics Show How Books Stack Up”. The Washington Post. Retrieved 8 September 2015.
[6] Rudder, Christian (2014). Dataclysm: Who We Are When We Think No One's Looking. New York: Crown Publishers. ISBN 978-0-385-34737-2.
[7] “Sociologically Improbable Phrases”. Crooked Timber. April 2005.
