Thread starter: ReneeBK
Views: 1571 | Replies: 8

[BeijingTomorrow] Python Package for Beijing Pollution Forecast


#1
ReneeBK posted 2017-9-8 03:32:44

BEIJING TOMORROW
----------------

By Tavis Barr, tavisbarr@gmail.com, Copyright 2016
Licensed under the GNU General Public License v2.0
Contact me about other licensing arrangements


This program uses satellite data from the NASA MODIS program and pollution
data from the U.S. Embassy in Beijing to develop a predictor of the next
day's change in pollution based on today's satellite images.

It is not designed as a first-rate forecaster -- that would require several
improvements, including the addition of weather data, probably better
training of the models, and, for that matter, simply including the current
pollution level as a feature.  Rather, it is a demonstration of how to use
CaffeOnSpark to perform a complete modelling exercise, from download to
training to prediction.  The CaffeOnSpark package is not well documented
as of this writing, so the code also serves to illustrate its usage.


SETUP
-----

This program is designed to be run on Spark.  I execute it through the
Eclipse IDE using PyDev, but it can also be run from the command line.  To
get PyDev working with CaffeOnSpark, the following changes need to be made
to the Python interpreter configuration:

(a) The following need to be added to PYTHONPATH, with ${SPARK_HOME} and
${CAFFE_ON_SPARK} replaced with the actual absolute paths:

${SPARK_HOME}
${SPARK_HOME}/python
${SPARK_HOME}/python/lib/py4j-[your_version]-src.zip
${SPARK_HOME}/python/lib/pyspark.zip
${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip

(b) The following environment variables need to be set:

SPARK_HOME needs to be set to the root of your Spark installation.
PYSPARK_SUBMIT_ARGS needs to be set to the following (YMMV):

--master local[*]       # OR URL if you are running on YARN
--queue PyDevSpark1.6.1 # Can be called whatever
--files ${CAFFE_ON_SPARK}/caffe-public/python/caffe/_caffe.so,
        ${CAFFE_ON_SPARK}/caffe-public/distribute/lib/libcaffe.so.1.0.0-rc3
--jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar,
        ${CAFFE_ON_SPARK}/caffe-distri/target/caffe-distri-0.1-SNAPSHOT-jar-with-dependencies.jar"
--driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
--driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}:
        ${CAFFE_ON_SPARK}/caffe-public/distribute/lib:
        ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:
        ${CAFFE_ON_SPARK}/caffe-public/distribute/lib:
        ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib"
pyspark-shell

These are roughly the same arguments that will need to be passed when
invoking pyspark if this program is run from the command line.  Again,
${CAFFE_ON_SPARK} should be replaced with its actual value.
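
If you would rather configure this from plain Python than from PyDev's
settings dialog, here is a minimal sketch, assuming hypothetical install
paths and that these lines run before pyspark is first imported:

    import os
    import sys

    # Hypothetical install locations -- replace with your own.
    SPARK_HOME = "/opt/spark"
    CAFFE_ON_SPARK = "/opt/CaffeOnSpark"

    os.environ["SPARK_HOME"] = SPARK_HOME
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--master local[*] "
        "--jars " + CAFFE_ON_SPARK
        + "/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar "
        "pyspark-shell"
    )

    # Mirror the PYTHONPATH entries from (a); add the py4j-[your_version] zip too.
    sys.path.append(os.path.join(SPARK_HOME, "python"))
    sys.path.append(os.path.join(SPARK_HOME, "python/lib/pyspark.zip"))
    sys.path.append(os.path.join(CAFFE_ON_SPARK,
                                 "caffe-grid/target/caffeonsparkpythonapi.zip"))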



Running the program takes place in four major steps, each of which uses its
own module.  These are: (1) downloading the data, (2) building the LMDB
database, (3) training the model, and (4) predicting on today's data.
Additionally, there is a module that trains a standard (non-"deep") model
using the Random Forest library on Apache Spark for comparison.  Each of
these modules is described in order.

(1) Downloading the data

The module responsible for downloading the satellite data, tilescraper, is
available as a separate package, because it likely has uses beyond training
models.  The pollution data are much smaller and are pulled in real time
using the beijingpollutioncompiler module.  These data come from the web
site of the U.S. Embassy in Beijing; the Chinese Ministry of Environmental
Protection also publishes data that are more accurate, but those data are
harder to obtain and do not go back quite as far.
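
As a rough illustration of the real-time pull (this is not the actual
beijingpollutioncompiler code, and the URL and column names are
placeholders), fetching an hourly feed and reducing it to daily means might
look like:

    import csv
    import io
    from collections import defaultdict

    import requests

    FEED_URL = "https://example.org/beijing_pm25_hourly.csv"  # placeholder URL

    rows = csv.DictReader(io.StringIO(requests.get(FEED_URL).text))
    daily = defaultdict(list)
    for row in rows:
        value = float(row["pm25"])    # placeholder column name
        if value >= 0:                # negative values typically mark missing hours
            daily[row["date"]].append(value)

    daily_means = {day: sum(v) / len(v) for day, v in daily.items()}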


(2) Building the database

CaffeOnSpark expects the data in one of a handful of database formats; I
found LMDB the easiest to work with, so I transformed the data into this
format.  Note that the index order of the underlying image data must be
changed from the standard image format before Caffe can use it.
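
Concretely, the change is a transpose from height-first to channel-first
order; a minimal sketch with a made-up image size:

    import numpy as np

    img_hwc = np.zeros((256, 256, 3), dtype=np.uint8)  # H x W x C (PIL-style)
    img_chw = img_hwc.transpose(2, 0, 1)               # C x H x W (Caffe-style)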

Unfortunately, the Python module for Caffe also expects the outcome data to
be an integer; fixing this is a one-line change in the Caffe source code,
but I did not want to use a non-standard version of Caffe, so I simply
multiplied the changes by 100 and stored them as integers.

When no satellite data are available, MODIS does not throw an exception but
merely returns a black image.  The pollution data may also be blank for some
days.  Because the CaffeOnSpark trainer expects all observations to be
valid, days with such data are discarded when the database is built.
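
A sketch of the validity filter this implies (the function and argument
names are illustrative, not taken from the package):

    import numpy as np

    def is_usable(image, pm25_today, pm25_tomorrow):
        """Drop days with a blank MODIS tile or a missing pollution reading."""
        if image is None or not np.asarray(image).any():  # all-black tile
            return False
        return pm25_today is not None and pm25_tomorrow is not None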

The outcome to be predicted is the logarithmic change between today's
pollution level and tomorrow's pollution level.
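
Putting the last two points together, a worked example of the label with
made-up readings of 80 ug/m3 today and 100 ug/m3 tomorrow:

    import math

    log_change = math.log(100.0) - math.log(80.0)  # ~0.223
    label = int(round(100 * log_change))           # 22, an integer Caffe accepts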

The features are expected to have a mean of zero, so the mean of each pixel
is subtracted from every image of the training and test data before it is
placed in the database.
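
In NumPy terms, with a toy stack of training images in N x C x H x W order:

    import numpy as np

    images = np.random.rand(50, 3, 256, 256).astype(np.float32)  # stand-in data
    pixel_mean = images.mean(axis=0)   # per-pixel mean over the training set
    centered = images - pixel_mean     # what actually goes into the LMDB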

(3) Training the Model

The Python interface to CaffeOnSpark is somewhat limited as of this writing;
in particular, the solver and network configurations must be specified in
.prototxt files rather than built as Python objects in the code (the Caffe
interfaces for defining them in code will work, but the results cannot
easily be attached to the CaffeOnSpark configuration).

The steps to train the model are essentially the same as in the
demonstration examples widely available online; in fact, very little
deviation from these steps is supported by the Python interface to
CaffeOnSpark.  The key point is that we read our model configuration from a
folder in the resource directory and also deliver the trained model there.
When training is finished, the routine reports the R-squared of the
regression on the test sample.  In any event, most of the work in this
section involves tweaking the .prototxt files to improve the model.
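
For orientation, the training call follows the pattern of the CaffeOnSpark
Python examples; the paths below are placeholders and the attribute names
may vary across CaffeOnSpark versions:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
    from com.yahoo.ml.caffe.Config import Config
    from com.yahoo.ml.caffe.DataSource import DataSource

    sc = SparkContext(conf=SparkConf())
    sqlContext = SQLContext(sc)
    cos = CaffeOnSpark(sc, sqlContext)

    cfg = Config(sc)
    cfg.protoFile = "/path/to/solver.prototxt"  # placeholder: solver .prototxt
    cfg.modelPath = "file:/tmp/beijing.model"   # placeholder: model output path
    cfg.devices = 1
    cfg.label = "label"

    train_source = DataSource(sc).getSource(cfg, True)  # True => training source
    cos.train(train_source)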

(4) Prediction

Once the model is trained, it can be called from the regular Caffe interface
without the overhead of Spark.  After checking that today's image is not
blank, we load it and transform it the same way the training images were
transformed -- the axes are swapped and the mean is subtracted.  Finally,
the test image is expected to be part of a batch, in this case a batch of
one, so we add an extra dimension to the image array.
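
A minimal sketch of that pipeline, assuming hypothetical file names, a data
blob named "data", and the pixel_mean and loader from earlier (check your
own .prototxt for the real names):

    import caffe
    import numpy as np

    net = caffe.Net("deploy.prototxt", "beijing.caffemodel", caffe.TEST)

    image = load_todays_tile()          # hypothetical loader returning H x W x C
    image = image.transpose(2, 0, 1)    # to C x H x W, as in training
    image = image - pixel_mean          # the same mean used to build the LMDB
    batch = image[np.newaxis, ...]      # a batch of one

    net.blobs["data"].data[...] = batch
    output = net.forward()              # the predicted (log change x 100)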

Additionally, the .prototxt file needs a slight alteration before it can be
used for prediction.  First, the top layers (loss and accuracy) need to be
removed from the model; otherwise they, and not the prediction, will be
returned.  Second, the batch size for the test model needs to be set to one.

With that in hand, we are ready to make a prediction.  Obviously, this model
is quite rudimentary.

(5) Comparison

It may be interesting to see how well we can do without "deep" learning.
The pollutionrflearner module downscales the image onto a 4x4 grid and then
builds a random forest over the grid.  The clear challenge in building any
model of this phenomenon is that we have far more features than
observations, so we need to simplify the features intelligently before
attempting to train.  In any event, the random forest model does slightly
worse than the "deep" model, but not substantially so.  Frankly, neither
model is highly predictive.
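
For reference, the Spark MLlib side of this looks roughly as follows;
records (one label plus the flattened grid of features per day) and sc are
hypothetical stand-ins for the module's actual preprocessing and Spark
context:

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    # records: iterable of (label, features) pairs produced upstream (assumed).
    data = sc.parallelize(
        [LabeledPoint(label, features) for label, features in records])

    model = RandomForest.trainRegressor(
        data,
        categoricalFeaturesInfo={},  # all features are continuous pixel averages
        numTrees=100,                # illustrative hyperparameters
        impurity="variance",
        maxDepth=8,
    )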

Hidden content of this post:

https://github.com/tavisbarr/BeijingTomorrow


Keywords: Pollution Forecast Tomorrow Beijing package

#2
MouJack007 posted 2017-9-8 03:37:21
Thanks for sharing, OP!

#3
MouJack007 posted 2017-9-8 03:39:36

#4
zhaosl posted 2017-9-8 05:02:24

#5
nonewman posted 2017-9-8 07:43:32
day new month update

#6
auirzxp (verified student) posted 2017-9-9 17:58:55
Notice: the author has been banned or deleted; content hidden automatically.

#7
yeayee posted 2017-9-10 09:10:41

#8
goldgood88 posted 2017-9-14 05:20:44
Thanks, I'm learning from you.

#9
lifeng20511 posted 2017-9-24 19:28:33
Many thanks, OP.

