By Dominik Haitz, IONOS.
Introduction
Drew Conway’s visualization of the data science skill set is an often cited classic. Different opinions and the versatility of the role have spawned numerous variations:
Various data science Venn diagrams. Image courtesy of Google Images. Source: https://sinews.siam.org/Details-Page/a-timely-focus-on-data-science
There seems to be no consensus on the data science skill set. Additionally, as the field evolves, shortcomings become obvious and new challenges arise. How can we describe this evolution?
The first wave of data scientists came before data became big and before data science was actually a thing (pre-2010s): statisticians and analysts who had always been around, doing a lot of what modern data scientists are doing, but accompanied by less hype.
Second wave: Large-scale data collection created a demand for smart minds who can work magic and turn all this big data into big money. Companies were still figuring out what kind of people to employ and often turned to science graduates. While the second wave data scientists did a lot right, their carefully crafted models often ended up as PoCs and failed to bring about actual change.
Now, at the end of the 2010s, amidst the hype around deep learning and AI, enter the third wave of data scientists: experimenting and innovating, efficiently seeking out business value and bridging the deployment gap to create great data products. What skills are required here?
The skill portfolio of the third wave data scientist.
1. Business Mindset
The business mindset is the centerpiece of the data science skill set, as it sets goals and applies the other skills to reach them. Patrick McKenzie states in this blog post:
Engineers are hired to create business value, not to program things: Businesses do things for irrational and political reasons all the time […], but in the main they converge on doing things which increase revenue or reduce costs.
Likewise, data scientists are hired to create business value, not just to build models. Ask yourself: How will the outcome of my work influence company decisions? What do I have to do to maximize this effect? With this entrepreneurial spirit, the third wave data scientist not only produces actionable insights, but also makes sure that they bring about real change.
Look where the money flows in your organization — the divisions with the largest cost or revenue will likely offer the highest financial leverage. However, business value is a fuzzy concept: It goes beyond cost and revenue of the current fiscal year. Experimenting and creating an innovative data culture will increase a company’s long-term competitiveness.
Prioritizing your work and knowing when to stop is the key to efficiency. Think of diminishing returns: Is it worth spending weeks to tweak a model for another 0.2% of precision? Quite often, good enough is the real perfect.
Domain expertise, which makes up a third of Conway’s skill set, is by no means to be neglected. However, in almost every job you will have to learn it on the spot. This includes knowledge about your industry as well as all the company processes, naming schemes and peculiarities. This knowledge not only sets the boundary conditions for your work, but is often indispensable for understanding and interpreting your data.
Keep it simple, stupid
Mat Velloso (@matvelloso):
Half of the time when companies say they need "AI" what they really need is a SELECT clause with GROUP BY.
You're welcome.
2:53 AM - May 31, 2018
Look out for the low-hanging fruit and quick wins. A simple SQL query on an existing data warehouse might yield valuable insights unbeknownst to product managers or executives. Don’t fall into the trap of doing “buzzword-driven data science”, focusing on state-of-the-art deep learning where a beautifully simple regression model would be sufficient, and much less work to build, implement and maintain. Know the complicated things, but do not overcomplicate things.
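To make the “SELECT clause with GROUP BY” point concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The in-memory `orders` table, its columns and all values are made up for illustration and merely stand in for a real data warehouse.

```python
import sqlite3

# Hypothetical in-memory table standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (division TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders (division, revenue) VALUES (?, ?)",
    [
        ("hosting", 120.0),
        ("hosting", 80.0),
        ("domains", 40.0),
        ("cloud", 250.0),
        ("domains", 10.0),
    ],
)

# Order count, total and average revenue per division, largest total first.
query = """
    SELECT division,
           COUNT(*)     AS n_orders,
           SUM(revenue) AS total_revenue,
           AVG(revenue) AS avg_revenue
    FROM orders
    GROUP BY division
    ORDER BY total_revenue DESC
"""
rows = conn.execute(query).fetchall()
for division, n_orders, total, avg in rows:
    print(f"{division:8s} orders={n_orders} total={total:.2f} avg={avg:.2f}")
conn.close()
```

A query of this kind takes minutes to write, yet it already answers the question of where the money flows.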
2. Software Engineering Craftsmanship
The notion of (second wave) data scientists needing only “hacking skills” instead of proper software engineering has been repeatedly criticized. Lack of readability, modularity or versioning hinders collaboration, reproducibility and productionizing.
Instead, learn the craft from proper software engineers. Test your code and use version control. Follow an established coding style (e.g. PEP8) and learn how to use an IDE (e.g. PyCharm). Try pair programming. Modularize and document your code, use meaningful variable names and refactor, refactor, refactor.
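As a small illustration of the advice above, the following sketch shows a documented function with meaningful names, accompanied by tests. The function `precision` and the test cases are invented for this example; the tests are written as plain functions in the style that a test runner such as pytest would collect.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Return the share of positive predictions that were correct.

    Returns 0.0 when no positive predictions were made, so callers
    do not have to guard against division by zero.
    """
    if true_positives < 0 or false_positives < 0:
        raise ValueError("counts must be non-negative")
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        return 0.0
    return true_positives / predicted_positives


def test_precision_typical_case():
    assert precision(true_positives=8, false_positives=2) == 0.8


def test_precision_without_positive_predictions():
    assert precision(true_positives=0, false_positives=0) == 0.0


def test_precision_rejects_negative_counts():
    try:
        precision(true_positives=-1, false_positives=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for negative counts")
```

Keeping the edge cases (no predictions, invalid input) in tests documents the intended behavior and makes later refactoring much less risky.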
Bridge the deployment gap for agile prototyping of data products: Learn to use tools for logging and monitoring. Know how to build a REST API (e.g. using Flask) to provide your results to others. Learn how to ship your work inside a Docker container or deploy it to a platform like Heroku. Instead of letting your models rot on your laptop, wrap them into data-driven services that fit snugly into your company’s IT landscape.
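The idea of wrapping a model into a small service with logging could look roughly like the sketch below. The text suggests Flask; to stay free of third-party dependencies, this sketch swaps in Python’s built-in http.server module instead. The linear “model”, its coefficients, the `/predict` path and the port are all assumptions made for illustration.

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_service")

# Hypothetical coefficients standing in for a trained regression model.
INTERCEPT = 1.5
SLOPE = 0.5


def predict(payload: dict) -> dict:
    """Apply the toy linear model to a request payload like {"x": 2}."""
    x = float(payload["x"])
    return {"prediction": INTERCEPT + SLOPE * x}


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = predict(payload)
        except (KeyError, TypeError, ValueError) as error:
            # Malformed JSON or a missing/invalid "x" ends up here.
            logger.warning("Rejected request: %r", error)
            self._send_json(400, {"error": "expected JSON like {\"x\": 1.0}"})
            return
        logger.info("Served prediction %s", result)
        self._send_json(200, result)

    def _send_json(self, status: int, body: dict) -> None:
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def run(port: int = 8000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Calling `run()` starts the service, and a POST request with a JSON body such as `{"x": 2}` to `/predict` returns the prediction as JSON. Keeping `predict` separate from the HTTP handler makes the model logic testable without a running server, and the same function could later be mounted in a Flask route or shipped in a container.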
3. Statistics and Algorithms Toolbox
Data scientists have to thoroughly understand the basic concepts in statistics and particularly in machine learning (a STEM university education is probably the best way to acquire this foundation). There are tons of resources on what’s important, so I’m not going to delve further into this here. You will often have to explain algorithms or concepts like statistical uncertainty to your clients, or red-flag an insight because of a confusion between correlation and causation.
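One way to make statistical uncertainty tangible for a client is to report an interval instead of a single number. The sketch below uses a percentile bootstrap, which is just one of several possible techniques and is not prescribed by the text. The function name, its parameters and the sample values are invented for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, e.g. daily uplift in percent.
uplift = [1.2, 0.4, 2.1, -0.3, 1.7, 0.9, 1.1, 2.4, 0.2, 1.5]
lower, upper = bootstrap_mean_interval(uplift)
print(f"mean = {statistics.fmean(uplift):.2f}, "
      f"95% interval = [{lower:.2f}, {upper:.2f}]")
```

A statement like “the average uplift is probably somewhere between these two values” is usually easier to communicate than a bare point estimate, and it discourages over-interpreting small differences.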