[Statistical Data] Poisson approximation for the search of rare words in DNA sequences

Posted by mingdashike22 on 2022-3-15 16:00:00

Title:
Poisson approximation for search of rare words in DNA sequences
---
Authors:
Nicolas Vergne (1) and Miguel Abadi (2) ((1) Laboratoire Statistique et Génome, France; (2) Universidade de Campinas, Brazil)
---
Year of latest submission:
2007
---
Classification:

Primary category: Mathematics
Secondary category: Probability
Description: Theory and applications of probability and stochastic processes: e.g. central limit theorems, large deviations, stochastic differential equations, models from statistical mechanics, queuing theory
--
Primary category: Mathematics
Secondary category: Statistics Theory
Description: Applied, computational and theoretical statistics: e.g. statistical inference, regression, time series, multivariate analysis, data analysis, Markov chain Monte Carlo, design of experiments, case studies
--
Primary category: Statistics
Secondary category: Applications
Description: Biology, Education, Epidemiology, Engineering, Environmental Sciences, Medical, Physical Sciences, Quality Control, Social Sciences
--
Primary category: Statistics
Secondary category: Statistics Theory
Description: stat.TH is an alias for math.ST. Asymptotics, Bayesian Inference, Decision Theory, Estimation, Foundations, Inference, Testing.

---
Abstract:
Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences generally modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence (under a Markov model) and its Poisson approximation. A global bound is already given by a Chen-Stein method. Our approach, the psi-mixing method, gives local bounds. Since we only need the error in the tails of distribution, the global uniform bound of Chen-Stein is too large and it is a better way to consider local bounds. We search for two thresholds on the number of occurrences from which we can regard the studied word as an over-represented or an under-represented one. A biological role is suggested for these over- or under-represented words. Our method gives such thresholds for a panel of words much broader than the Chen-Stein method. Comparing the methods, we observe a better accuracy for the psi-mixing method for the bound of the tails of distribution. We also present the software PANOW (available at http://stat.genopole.cnrs.fr/software/panowdir/) dedicated to the computation of the error term and the thresholds for a studied word.
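To make the idea of over- and under-representation thresholds concrete, here is a minimal sketch of a plain Poisson approximation for word counts under a first-order Markov model. It is not the PANOW software and it does not compute the Chen-Stein or psi-mixing error bounds from the paper; it only shows where the two thresholds come from once a Poisson mean is available. The function names, the `alpha` significance level, and the toy sequence are all assumptions made for illustration. A plain Poisson law is usually considered reasonable only for rare words that do not overlap with themselves.

```python
import math
from collections import Counter

def fit_markov1(seq, alphabet="ACGT"):
    """Estimate letter frequencies and first-order transition probabilities by counting."""
    single = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    freq = {a: single[a] / len(seq) for a in alphabet}
    trans = {}
    for a in alphabet:
        row_total = sum(pairs[(a, b)] for b in alphabet)
        for b in alphabet:
            trans[(a, b)] = pairs[(a, b)] / row_total if row_total else 0.0
    return freq, trans

def expected_count(word, n, freq, trans):
    """Expected number of occurrences of `word` in a sequence of length n."""
    p = freq[word[0]]
    for a, b in zip(word, word[1:]):
        p *= trans[(a, b)]
    return (n - len(word) + 1) * p

def count_occurrences(seq, word):
    """Observed number of (possibly overlapping) occurrences of `word`."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)

def poisson_pmf(k, lam):
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def thresholds(lam, alpha=0.01):
    """Return (lower, upper) under a Poisson(lam) approximation.

    lower: largest k with P(N <= k) <= alpha, or None if no such k exists.
    upper: smallest k with P(N >= k) <= alpha.
    """
    lower, k = None, 0
    while poisson_cdf(k, lam) <= alpha:
        lower = k
        k += 1
    k = 0
    while 1.0 - poisson_cdf(k - 1, lam) > alpha:
        k += 1
    return lower, k

if __name__ == "__main__":
    seq = "ACGTTGCAACGTAGCTAGCTAACGT" * 40  # toy sequence, not real data
    word = "ACGT"
    freq, trans = fit_markov1(seq)
    lam = expected_count(word, len(seq), freq, trans)
    lower, upper = thresholds(lam, alpha=0.01)
    print(word, count_occurrences(seq, word), round(lam, 2), lower, upper)
```

A word whose observed count is at or above `upper` would be flagged as over-represented, and one at or below `lower` as under-represented. In the paper's setting these thresholds would additionally be adjusted by the error term of the approximation, which this sketch leaves out.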
---
PDF link:
https://arxiv.org/pdf/711.2382


Keywords: DNA sequences, DNA, distribution, Applications, epidemiology, Chen, Poisson, process, thresholds, word
