[Exclusive Release] A good book on big data analysis: Frontiers in Massive Data Analysis P191 - 经管之家 (Jingguan Zhijia) official site


Posted by: olderp | Category: Big Data


Frontiers in Massive Data Analysis
THE CHALLENGE
Although humans have gathered data since the beginning of recorded
history—indeed, data gathered by ancestral humans provides much of the
raw material for the reconstruction of human history—the rate of acquisition
of data has surged in recent years, with no end in sight. Expectations
have surged as well, with hopes for new scientific discoveries pinned on
emerging massive collections of biological, physical, and social data, and
with major areas of the economy focused on the commercial implications
of massive data.
Although it is difficult to characterize all of the diverse reasons for the
rapid growth in data, a few factors are worth noting. First, many areas
of science are in possession of mature theories that explain a wide range
of phenomena, such that further testing and elaboration of these theories
requires probing extreme phenomena. These probes often generate very
large data sets. An example is the world of particle physics, where massive
data (e.g., petabytes per year for the Large Hadron Collider; 1 petabyte is
10^15 bytes) arises from the new accelerators designed to test aspects of the
Standard Model of particle physics. Second, many areas of science and engineering
have become increasingly exploratory, with large data sets being
gathered outside the context of any particular theory in the hope that new
phenomena will emerge. Examples include the massive data arising from
genome sequencing projects (which can accumulate terabytes (10^12 bytes)
of data for each project) as well as the massive data expected to arise from
the Large Synoptic Survey Telescope, which will be measured in petabytes.
Rapid advances in cost-effective sensing mean that engineers can readily
collect massive amounts of data about complex systems, such as those for
communication networks, the electric grid, and transportation and financial
systems, and use that data for management and control. Third, much
human activity now takes place on the Internet, and this activity generates
data that has substantial commercial and scientific value. In particular,
many commercial enterprises are aiming to provide personalized services
that adapt to individual behaviors and preferences as revealed by data associated
with the individual. Fourth, connecting these other trends is the
significant growth in the deployment of sensor networks that record biological,
physical, and social phenomena at ever-increasing scale, and these
sensor networks are increasingly interconnected.
In general, the hope is that if massive data could be exploited effectively,
science would extend its reach, and technology would become more
adaptive, personalized, and robust. It is appealing to imagine, for example,
a health-care system in which increasingly detailed data are maintained for
each individual—including genomic, cellular, and environmental data—and
in which such data can be combined with data from other individuals and
with results from fundamental biological and medical research, so that
optimized treatments can be designed for each individual. One can also
envision numerous microeconomic consequences of massive data analysis
where preferences and needs at the level of single individuals are combined
with fine-grained descriptions of goods, skills, and services to create new
markets. In general, what is particularly notable about the recent rise in
the prevalence of “big data” is not merely the size of modern data sets, but
rather that their fine-grained nature permits inferences and decisions at the
level of single individuals.
It is natural to be optimistic about the prospects. Several decades of
research and development in databases and search engines have yielded a
wealth of relevant experience in the design of scalable data-centric technology.
In particular, these fields have fueled the advent of cloud computing
and other parallel and distributed platforms that seem well suited to massive
data analysis. Moreover, innovations in the fields of machine learning, data
mining, statistics, and the theory of algorithms have yielded data-analysis
methods that can be applied to ever-larger data sets. When combined with
arguments that simple algorithms can work better than more sophisticated
algorithms on large-scale data (see, e.g., Halevy et al., 2009), it is natural
to be bullish on big data.
While not entirely unwarranted, such optimism overlooks a number
of major difficulties that arise in attempting to achieve the goals that
are envisioned in discussions of massive data. In part these difficulties
are those familiar from implementations of large-scale databases—involving
finding and mitigating bottlenecks, achieving simplicity and generality
of the programming interface, propagating metadata, designing a system
that is robust to hardware failure, and exploiting parallel and distributed
hardware—all at an unprecedented scale. But the goals for massive data go
beyond the storage, indexing, and querying that have been the province of
classical database systems (and classical search engines), instead focusing
on the ambitious goal of inference. Inference is the problem of turning data
into knowledge, where knowledge often is expressed in terms of variables
(e.g., a patient’s general state of health, or a shopper’s tendency to buy) that
are not present in the data per se, but are present in models that one uses to
interpret the data. Statistical principles are needed to justify the inferential
leap from data to knowledge, and many difficulties arise in attempting to
bring these principles to bear on massive data. Operating in the absence
of these principles may yield results that are not useful at best or harmful
at worst. In any discussion of massive data and inference, it is essential to
be aware that it is quite possible to turn data into something resembling
knowledge but which actually is not. Moreover, it can be quite difficult to
know that this has happened.
Consider a database where the rows correspond to people and the
columns correspond to “features” that are used to describe people. If the
database contains data on only a thousand people, it may suffice to measure
only a few dozen features (e.g., age, gender, years of education, city of
residence) to make the kinds of distinctions that may be needed to support
assertions of “knowledge.” If the database contains data on several billion
people, however, we are likely to have heightened expectations for the data,
and we will want to measure many more features (e.g., latest magazine
read, culinary preferences, genomic markers, travel patterns) to support the
wider range of inferences that we wish to make on the basis of the data. We
might roughly imagine the number of features scaling linearly in the number
of individuals. Now, the knowledge we wish to obtain from such data is
often expressed in terms of combinations of the features. For example, if
one lives in Memphis, is a male, enjoys reading about gardening, and often
travels to Japan, what is the probability that the person will click on an ad
about life insurance? The problem is that there are exponential numbers of
such combinations of features and, in any given data set, a vast number of
these combinations will appear to be highly predictive of any given outcome
by chance alone.
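
The point about chance-level "predictive" combinations can be illustrated with a small simulation. This is only a sketch under stated assumptions, not something from the report: the features and the click outcome are independent coin flips, so there is no real signal, and the function names, sample sizes, and thresholds (find_chance_predictors, min_support, min_gap) are all hypothetical choices made for illustration. The count 3^d assumes each of d binary features is either required to be 1, required to be 0, or left unconstrained.

```python
import itertools
import math
import random


def count_conjunctions(num_features: int) -> int:
    """Number of conjunctions when each binary feature is required to be 1,
    required to be 0, or ignored (includes the empty conjunction)."""
    return 3 ** num_features


def find_chance_predictors(num_people=200, num_features=30, combo_size=2,
                           min_support=20, min_gap=0.10, seed=0):
    """Return (base_rate, hits) for purely random data.

    hits lists feature combinations whose click rate differs from the
    overall rate by at least min_gap, even though no feature matters.
    """
    rng = random.Random(seed)
    # Assumption: features and outcome are independent fair coin flips.
    people = [[rng.random() < 0.5 for _ in range(num_features)]
              for _ in range(num_people)]
    clicked = [rng.random() < 0.5 for _ in range(num_people)]
    base_rate = sum(clicked) / num_people

    hits = []
    for combo in itertools.combinations(range(num_features), combo_size):
        # Outcomes of the people who have every feature in this combination.
        matches = [clicked[i] for i, row in enumerate(people)
                   if all(row[j] for j in combo)]
        if len(matches) < min_support:
            continue  # too few people to estimate a rate at all
        rate = sum(matches) / len(matches)
        if abs(rate - base_rate) >= min_gap:
            hits.append((combo, len(matches), rate))
    return base_rate, hits


if __name__ == "__main__":
    base_rate, hits = find_chance_predictors()
    total_pairs = math.comb(30, 2)
    print(f"conjunctions over 30 binary features: {count_conjunctions(30):,}")
    print(f"overall click rate: {base_rate:.2f}")
    print(f"{len(hits)} of {total_pairs} feature pairs look predictive by chance")
```

Even with only 30 features and pairs of features, a noticeable share of combinations typically clears the threshold, although the outcome was generated independently of every feature. Any real analysis would need a correction for the number of combinations examined; this sketch only shows why one is needed.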


Forum URL for this post: https://bbs.pinggu.org/thread-2685725-1-1.html


Article link: https://bbs.pinggu.org/jg/shuju_dashuju_2685725_1.html
