- BEIJING TOMORROW
- ----------------
- By Tavis Barr, tavisbarr@gmail.com, Copyright 2016
- Licensed under the GNU General Public License v2.0
- Contact me about other licensing arrangements
- This program uses satellite data from the NASA MODIS program, and pollution
- data from the US embassy in Beijing to develop a predictor of the next day's
- change in pollution based on today's satellite images.
- It is not designed as a first-rate forecaster -- it would need several
- improvements for that, including addition of weather data and probably better
- training of the models, or for that matter, just including the current
- pollution level as a feature -- but rather as a demonstration of how to use
- CaffeOnSpark to perform a complete modelling exercise from download to
- training to prediction. The CaffeOnSpark package is not very well documented
- as of the time of this writing, so the code serves to illustrate its usage.
- SETUP
- -----
- This program is designed to be run on Spark. I execute this through the
- Eclipse IDE using PyDev, but it can also be done via the command line. To get
- PyDev working with CaffeOnSpark, the following changes need to be made to the
- Python interpreter:
- (a) The following need to be added to PYTHONPATH, with ${SPARK_HOME} and
- ${CAFFE_ON_SPARK} replaced with the actual absolute paths:
- ${SPARK_HOME}
- ${SPARK_HOME}/python
- ${SPARK_HOME}/python/lib/py4j-[your_version]-src.zip
- ${SPARK_HOME}/python/lib/pyspark.zip
- ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
- (b) The following environment variables need to be added:
- SPARK_HOME needs to be set to the root of your Spark installation
- PYSPARK_SUBMIT_ARGS needs to be set to the following (YMMV):
- --master local[*] # OR URL if you are running on YARN
- --queue PyDevSpark1.6.1 # Can be called whatever
- --files ${CAFFE_ON_SPARK}/caffe-public/python/caffe/_caffe.so,
- ${CAFFE_ON_SPARK}/caffe-public/distribute/lib/libcaffe.so.1.0.0-rc3
- --jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar,
- ${CAFFE_ON_SPARK}/caffe-distri/target/caffe-distri-0.1-SNAPSHOT-jar-with-dependencies.jar"
- --driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
- --driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
- --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}:
- ${CAFFE_ON_SPARK}/caffe-public/distribute/lib:
- ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib"
- --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:
- ${CAFFE_ON_SPARK}/caffe-public/distribute/lib:
- ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib"
- pyspark-shell
- These are roughly the same arguments that will need to be made when
- invoking pyspark if this program is run from the command line. Again,
- ${CAFFE_ON_SPARK} should be replaced with its actual value.
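- As a rough illustration, the same environment can also be set up programmatically at the
- top of a script before pyspark is imported; the sketch below uses placeholder paths and
- abbreviates the --jars/--files lists shown above.

```python
# Minimal sketch: configure Spark and CaffeOnSpark programmatically instead of
# through the PyDev interpreter settings. SPARK_HOME and CAFFE_ON_SPARK are
# placeholders for the actual installation paths.
import os
import sys

SPARK_HOME = "/opt/spark"              # placeholder
CAFFE_ON_SPARK = "/opt/CaffeOnSpark"   # placeholder

os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master local[*] "
    "--jars {c}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar "
    "--driver-class-path {c}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar "
    "pyspark-shell"
).format(c=CAFFE_ON_SPARK)

# Make the PYTHONPATH entries listed in (a) importable; the py4j zip name
# depends on your Spark version, so it is left out here.
sys.path.extend([
    SPARK_HOME + "/python",
    SPARK_HOME + "/python/lib/pyspark.zip",
    CAFFE_ON_SPARK + "/caffe-grid/target/caffeonsparkpythonapi.zip",
])
```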
- Running the program takes place in four major steps, each of which uses its
- own module. These are: (1) Downloading the data, (2) Building the LMDB
- database, (3) Training the model, and (4) Prediction on today's data.
- Additionally, there is a module to train a standard (non-"deep") model
- using the Random Forest library on Apache Spark for comparison. Each of
- these modules is described in order.
- (1) Downloading the data
- The module that is responsible for downloading the satellite data,
- tilescraper, is available as a separate package, because it likely has uses
- aside from training models etc. The pollution data are much smaller, and
- are pulled in real time using the beijingpollutioncompiler module. The data
- are pulled from the US Embassy web site in Beijing; the Chinese Ministry of
- Environmental Protection also publishes data that are more accurate, but these
- are more difficult to obtain and also do not go back quite as far.
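- The scraping itself lives in beijingpollutioncompiler; purely to illustrate the kind of
- aggregation involved, here is a sketch that assumes the hourly embassy readings have
- already been parsed into (timestamp, PM2.5) pairs, with negative sentinel values marking
- missing hours.

```python
# Illustrative only: aggregate hourly PM2.5 readings into daily means.
# Assumes `hourly` is a list of (datetime.datetime, float) pairs already parsed
# from the embassy feed; missing readings are assumed to be negative sentinels.
from collections import defaultdict
import datetime

def daily_means(hourly):
    per_day = defaultdict(list)
    for ts, pm25 in hourly:
        if pm25 is None or pm25 < 0:   # skip missing / sentinel values
            continue
        per_day[ts.date()].append(pm25)
    return {day: sum(v) / len(v) for day, v in per_day.items()}

# Example:
readings = [(datetime.datetime(2016, 5, 1, h), 80.0 + h) for h in range(24)]
print(daily_means(readings))  # {datetime.date(2016, 5, 1): 91.5}
```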
- (2) Building the database
- CaffeOnSpark expects the data in one of a handful of database formats; I found
- LMDB easiest to work with, so I transformed the data into this format to use
- them. Note that the images need to have the index order of the underlying
- data changed from the standard image layout before Caffe can use them.
- Unfortunately, the Python module for Caffe also expects the outcome data to be
- an integer; fixing this is a one-line change in the Caffe source code, but I
- did not want to use a non-standard version of Caffe, so I just multiplied the
- changes by 100.
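- A sketch of what writing one observation to LMDB looks like under those constraints is
- given below; the file path, key format, and the rounding of the scaled label are my own
- illustrative choices rather than the program's exact code.

```python
# Sketch of writing one observation to LMDB in the layout Caffe expects.
# `img` passed to write_example is a C x H x W float array and `log_change`
# is the outcome to predict.
import lmdb
import numpy as np
import caffe

def write_example(txn, key, img, log_change):
    label = int(round(log_change * 100))          # Caffe's datum wants an integer label
    datum = caffe.io.array_to_datum(img, label)   # float data goes into datum.float_data
    txn.put(key.encode("ascii"), datum.SerializeToString())

env = lmdb.open("train_lmdb", map_size=1 << 32)   # map_size must exceed the final DB size
with env.begin(write=True) as txn:
    raw = np.zeros((256, 256, 3), dtype=np.uint8)     # stand-in H x W x C image
    chw = raw.transpose(2, 0, 1).astype(np.float32)   # HWC -> CHW, the order Caffe uses
    write_example(txn, "00000000", chw, 0.1234)
```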
- When no satellite data is available, MODIS does not throw an exception but
- merely returns a black image. The pollution data may be blank for some days.
- Since the CaffeOnSpark trainer expects all observations to be valid, days
- with missing data are discarded at the time the database is built.
- The outcome to be predicted is the logarithmic change between today's pollution
- level and tomorrow's pollution level.
- The features are expected to have a mean of zero, so the mean of each pixel is
- subtracted from every image of the training and test data before it is placed
- in the database.
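- Taken together, the preprocessing rules in this section amount to something like the
- following sketch, which uses hypothetical in-memory structures (`images[d]` is the
- C x H x W array for day d and `pm25[d]` the daily pollution level):

```python
# Sketch of the preprocessing described above: drop blank days, build the
# log-change label, and zero-center each pixel with the training mean.
import numpy as np

def build_examples(images, pm25, days):
    examples = []
    for d, next_d in zip(days, days[1:]):
        img = images.get(d)
        if img is None or img.max() == 0:        # all-black image => no MODIS data
            continue
        if pm25.get(d) is None or pm25.get(next_d) is None:
            continue                             # pollution reading missing
        label = np.log(pm25[next_d]) - np.log(pm25[d])   # log change to predict
        examples.append((img.astype(np.float32), label))

    # Subtract the per-pixel mean over the retained images before storage.
    mean_img = np.mean([img for img, _ in examples], axis=0)
    return [(img - mean_img, label) for img, label in examples]
```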
- (3) Training the Model
- The Python interface to CaffeOnSpark is somewhat limited as of this writing;
- in particular, it is necessary to have the configuration of the solver and
- network specified in .prototxt files rather than configured as Python objects
- in the code (the Caffe interfaces to define them in the code will work, but
- they cannot easily be attached to the CaffeOnSpark configuration).
- The steps to train the model are pretty much the same as in the demonstration
- examples widely available online; in fact, very little deviation from these
- steps is supported by the Python interface to CaffeOnSpark. The key is that
- we obtain our model configuration from a folder in the resource directory,
- and also deliver the model there. When it is finished training, the routine
- reports an R-squared of the regression on the test sample. In any event,
- most of the work in this section involves tweaking the .prototxt files to
- improve the model.
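- For reference, the training call follows the pattern of the CaffeOnSpark Python examples;
- the sketch below uses placeholder paths, and the exact Config attribute names may differ
- between CaffeOnSpark versions.

```python
# Sketch of driving training through the CaffeOnSpark Python API, following
# the pattern of the package's bundled examples; the .prototxt and model
# paths are placeholders for the files kept in the resource directory.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from com.yahoo.ml.caffe.RegisterContext import registerContext, registerSQLContext
from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
from com.yahoo.ml.caffe.Config import Config
from com.yahoo.ml.caffe.DataSource import DataSource

sc = SparkContext(conf=SparkConf().setAppName("BeijingTomorrow"))
sqlContext = SQLContext(sc)
registerContext(sc)
registerSQLContext(sqlContext)

cos = CaffeOnSpark(sc, sqlContext)
cfg = Config(sc)
cfg.protoFile = "resources/pollution_solver.prototxt"  # solver, which points at the net .prototxt
cfg.modelPath = "file:resources/pollution.model"       # where the trained weights are written
cfg.devices = 1
cfg.clusterSize = 1

train_source = DataSource(sc).getSource(cfg, True)     # True selects the training source
cos.train(train_source)
```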
- (4) Prediction
- Once the model is trained, it can be called from the regular Caffe
- interface without requiring the overhead of Spark. After checking that today's
- image is not blank, we load it and transform it the same way as the training
- images were transformed -- the axes are swapped and the mean is subtracted.
- Finally, the test image is expected to be part of a batch, in this case a
- batch of one, so we add an extra dimension to the image array.
- Additionally, we need a slight alteration to the .prototxt file to predict
- using this model. First, the top layers (loss and accuracy) need to be taken
- out of the model; otherwise their outputs will be returned instead of the prediction.
- Second, the batch size for the test model needs to be set to one.
- With that in hand, we are ready to make a prediction. Obviously, this model
- is quite rudimentary.
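- The prediction step looks roughly like the sketch below; the file names are placeholders,
- `load_todays_image` stands in for whatever routine fetches today's tile, and the output
- blob name ("score") depends on how the network .prototxt is written.

```python
# Sketch of prediction with the plain Caffe interface, assuming the deploy
# .prototxt has already had its loss/accuracy layers removed and batch size 1.
import numpy as np
import caffe

net = caffe.Net("resources/pollution_deploy.prototxt",
                "resources/pollution.caffemodel",
                caffe.TEST)

img = load_todays_image()                  # hypothetical helper returning an H x W x C array
if img.max() == 0:
    raise RuntimeError("no satellite image available for today")

mean_img = np.load("resources/mean_img.npy")              # per-pixel training mean (placeholder)
x = img.transpose(2, 0, 1).astype(np.float32) - mean_img  # HWC -> CHW, then center
x = x[np.newaxis, ...]                                    # add the batch dimension: (1, C, H, W)

net.blobs["data"].data[...] = x
out = net.forward()
predicted_log_change = out["score"].flatten()[0] / 100.0  # undo the x100 label scaling
```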
- (5) Comparison
- It may be interesting to see how well we can do without using "deep" learning.
- The pollutionrflearner downscales the image into a 4x4 grid, and then builds
- a random forest over the grid. The clear challenge in building any model
- of this phenomenon is that we have far more features than observations, so we
- need to intelligently simplify the features before attempting to train. In
- any event, the random forest model does slightly worse than the "deep" model,
- but not substantially so. Frankly, neither model is highly predictive.
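- As an illustration of that baseline, a sketch of the downscale-and-forest approach is
- shown below; it reuses the (image, label) pairs from the preprocessing sketch, assumes an
- existing SparkContext `sc`, and the tree count and depth are illustrative, not the values
- used in pollutionrflearner.

```python
# Sketch of the non-deep baseline: average each image down to a 4x4 grid and
# fit a random forest regressor with Spark MLlib.
import numpy as np
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

def to_grid_features(img, grid=4):
    """Average a single-channel H x W image into a grid x grid vector of block means."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [float(img[i*h:(i+1)*h, j*w:(j+1)*w].mean())
            for i in range(grid) for j in range(grid)]

# `examples` is a list of (C x H x W image, log_change) pairs as built earlier;
# only the first channel is used here for simplicity.
data = sc.parallelize([LabeledPoint(label, to_grid_features(img[0]))
                       for img, label in examples])

model = RandomForest.trainRegressor(data, categoricalFeaturesInfo={},
                                    numTrees=50, maxDepth=5)
```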
- Source code: https://github.com/tavisbarr/BeijingTomorrow