The top 10 data science projects on GitHub consist chiefly of tutorials and educational resources for learning and doing data science. Have a look at the resources others are using and learning from.
By Matthew Mayo, KDnuggets.
In our latest inspection of GitHub repositories, we focus on "data science" projects. Unlike other searches we have performed over the past several months, nearly all of the repositories which show up (listed by number of stars* in descending order) are resources for learning data science, as opposed to tools for doing it. As such, this is much less a software listing than it is a collection of tutorials and educational resources. There are, however, a few software surprises in here as well, such as a data science-oriented IDE and a great notebook-related project.
We include, however, the standard informational notice we have placed on our previous GitHub Top 10 lists: according to a recent KDnuggets survey, 73% of data scientists used open source tools in the 12 months prior to the survey. While the following repositories focus mainly on learning resources, previous offerings have been software-heavy; also, open source learning materials are the new black, and a main source of learning for data scientists these days.
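For readers who would like to reproduce a ranking like this one, here is a minimal Python sketch of the "sort by stars, descending" idea. It is an illustration only, not the method used to compile this list: the function names `build_search_url` and `rank_by_stars` are hypothetical, and the sample records simply reuse star and fork counts from three of the entries below. The URL targets GitHub's public repository search endpoint, but the sketch only builds the URL and does not make a network request.

```python
from urllib.parse import urlencode

# GitHub's public repository search endpoint.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query, per_page=10):
    """Build a search URL asking for results sorted by stars, highest first."""
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return SEARCH_URL + "?" + urlencode(params)


def rank_by_stars(repos, top_n=10):
    """Return the top_n repository records ordered by star count, descending."""
    return sorted(repos, key=lambda repo: repo["stars"], reverse=True)[:top_n]


# Star and fork counts copied from three entries in this list (illustrative).
sample = [
    {"name": "Rodeo", "stars": 2540, "forks": 229},
    {"name": "Data Science at the Command Line", "stars": 948, "forks": 260},
    {"name": "The Open Source Data Science Masters", "stars": 4338, "forks": 2624},
]

print(build_search_url("data science"))
for repo in rank_by_stars(sample):
    print(f'{repo["name"]}: {repo["stars"]} stars, {repo["forks"]} forks')
```

Star counts change constantly, so a live query will not match the figures quoted here, which reflect the time noted at the end of this post.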
1. Data Science iPython Notebooks
Stars: 5169, Forks: 902
Donne Martin has put together a great (and, apparently, wildly popular) resource for those looking for iPython notebooks for tutorials. The repo describes itself best:
Continually updated data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
2. The Open Source Data Science Masters
Stars: 4338, Forks: 2624
This is the official repository holding the curriculum of the Data Science Masters, the brainchild of data scientist Clare Corthell, designed as an open source alternative to formal data science education. With that in mind, this repo is a collection of materials for pursuit of this alternative route to data science mastery.
The open-source curriculum for learning Data Science. Foundational in both theory and technologies, the OSDSM breaks down the core competencies necessary to making use of data.
3. Rodeo
Stars: 2540, Forks: 229
Rodeo is a data science IDE. Developed by yhat, Rodeo is currently at version 1.0. Rodeo's philosophy builds on iPython notebooks:
We originally built Rodeo because we like the Jupyter Notebook for presentations and tutorials, but thought it was a bit clunky for daily work. We wanted a one-stop IDE for Python with a good text editor, a simple plot window and a terminal with autocomplete.
Stars: 2307, Forks: 259
This is a simple, but extensive, list of data science blogs, listed in alphabetical order. You'll find all the big blogs in here (including KDnuggets, of course), but also many smaller, off-the-beaten-path selections as well. The repo appears to be updated often, with the most recent updates happening only hours prior to this writing.
Stars: 2142, Forks: 529
This is another of the Awesome... "brand" of curated lists. Straight to the point:
An open source Data Science repository to learn and apply towards solving real world problems.
Like other Awesome lists around (what, exactly, makes these lists more "awesome" than others?), there are countless resources broken down into several categories.
6. Data Science Specialization
Stars: 1986, Forks: 20800
This is a collection of the resources for the Johns Hopkins Data Science Specialization on Coursera. A wildly popular course with names like Roger Peng, Jeff Leek, and Brian Caffo attached to it, it is responsible for teaching data science and R to thousands of learners. Get all of the resources used in all of the courses collected here.
7. Data Science Specialization Community Site
Stars: 1153, Forks: 2307
This is a community-curated content companion site for the Johns Hopkins Data Science Specialization on Coursera.
A couple students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. This site is meant to serve as a central directory for community created content.
If you have a resource which would be useful to others in the program, a pull request can be submitted in order to have it included in the curated knowledge pages list.
Stars: 1087, Forks: 258
Andy Petrella forked scala-notebook and refactored it for massive dataset analysis with Apache Spark, and this is the result. From the repo:
The tool allows performing reproducible analysis with Scala, Apache Spark and more.
This is achieved through an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner.
Stars: 993, Forks: 541
Nitin Borwankar has put together another compilation of resources for learning data science. It is a collection of iPython notebooks focusing on machine learning, specifically the topics of:
- Linear Regression
- Logistic Regression
- Random Forests
- K-Means Clustering
It appears to be a beginner's guide to fundamental concepts in machine learning, but a well-crafted one.
10. Data Science at the Command Line
Stars: 948, Forks: 260
This repository contains the virtual machine, data, scripts, and custom command-line tools used in the book Data Science at the Command Line.
Included is the Data Science Toolbox, a virtual environment for data science. Author Jeroen Janssens' brand of data science includes the interplay of Python, R, numerous packages, and command line utilities. If you have read the book, or reading these few lines has captured your interest, give the repo a look.
* As viewed 6:00 PM EST, March 21, 2016.