One of my favorite subreddits is (sorry in advance) r/programmingcirclejerk, because it offers a place to call out some of the more ridiculous, lofty statements that people make about various programming languages. Sometimes it can be mean-spirited, but often it’s right on the mark. One of the recent posts was a quote from a blog post that said,
I often think Python is too easy. Can you really call it “programming” if you can generate classification predictions with only 6 lines of code? Especially if 3 of those lines are a dependency and your training data, I would argue that someone else did the real programming.
The discussion of the quote centered on the fact that it’s ridiculous to rebuild programming APIs from scratch when a community that specializes in that particular problem has already built a full set of them.
This cut to the heart of a trend I’ve been thinking about recently: how the process of data science itself is becoming a commodity.
To be clear, I don’t mean analysis. Data analysis will never be fully automated, because it involves too much business logic, trial and error, and human judgment. But the data science models and the underlying algorithms, the pieces of code that go something like this:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset and use a single feature
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
```
are already in the process of automation.
This is your run-of-the-mill linear regression model, straight from the scikit-learn (a commonly used Python machine learning library) documentation, predicting the progression of diabetes from baseline measurements like body mass index, blood pressure, and blood serum readings.
It’s also the least interesting part of data science, because it assumes:

1) Clean data, and
2) Running the code interactively on one single machine (i.e., you can’t hook it up to a web app or ask whether other people are going to get diabetes).

Neither of these situations ever occurs in the wild west environments that data scientists and engineers work in.
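On the first point, real-world feature matrices arrive with missing values that the textbook snippet assumes away. Here is a minimal sketch (NumPy only, with made-up data) of the kind of imputation step that has to happen before `regr.fit` will even run:

```python
import numpy as np

def impute_column_means(X):
    """Replace NaNs in each column with that column's mean,
    the bare minimum cleanup before a model will accept the data."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)           # per-column mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))  # locations of missing values
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

# A tiny "wild west" matrix: two features, one missing value in each.
raw = [[1.0, np.nan],
       [np.nan, 4.0],
       [3.0, 6.0]]
clean = impute_column_means(raw)
```

And that is the easy case; in practice this step also covers mismatched schemas, duplicate rows, and silently wrong units, none of which the model code above knows anything about.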
A while back, I read a really interesting post by Vincent Warmerdam called “The Future of Data Science is Past,” where he hypothesizes that the data science algorithms we’ve all come to know and love, like this one, are merely small parts of very complicated systems, and that what’s becoming more important than the model code itself is those systems and how they talk to each other.
In short, the hype around algorithms forgot to mention that:
- algorithms are merely cogs in a system
- algorithms don’t understand problems
- algorithms aren’t creative solutions
- algorithms live in a mess
His main reasoning is that algorithms like decision trees, neural nets, and the like, are blunt tools that can only be used within the context of people shepherding them. Systems are complicated and take a long time to build, and the value is in the entire system working from end to end rather than a single algorithm making a prediction:
It is unrealistic that a self learning algorithm is able to pick up these rules as hard constraints so it makes sense to handle this elsewhere in the system. With that in mind; notice how the algorithm is only a small cog in the entire system. The jupyter notebook that contains the recommender algorithm is not the investment that needs to be made by the team but rather it is all the stuff around the algorithm that requires practically all the work.
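The “hard constraints handled elsewhere in the system” point can be sketched in a few lines: the model emits raw scores, and a separate, human-written rule layer vetoes anything the business forbids. The item names and rules below are invented for illustration:

```python
def apply_business_rules(scores, catalog):
    """Post-filter raw model scores with hard constraints that
    the model itself cannot be trusted to learn."""
    allowed = {}
    for item, score in scores.items():
        info = catalog[item]
        if not info["in_stock"]:    # never recommend what we can't sell
            continue
        if info["age_restricted"]:  # legal constraint, non-negotiable
            continue
        allowed[item] = score
    # Return surviving items, best score first
    return sorted(allowed, key=allowed.get, reverse=True)

# Invented example: the recommender scored three items...
scores = {"whiskey": 0.9, "novel": 0.7, "lamp": 0.4}
catalog = {
    "whiskey": {"in_stock": True, "age_restricted": True},
    "novel":   {"in_stock": True, "age_restricted": False},
    "lamp":    {"in_stock": False, "age_restricted": False},
}
recommendations = apply_business_rules(scores, catalog)
```

The rule layer is dumb on purpose: it is the part of the system that product, legal, and operations people can read, argue about, and change without retraining anything.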
This is increasingly true from my perspective. On any given data science project, with the exception of specialized ones that are very industry specific, it takes much more time to implement an algorithm than to choose one.
There has been a lot of talk about how machines are replacing humans with the advent of <scary CNN anchor voice> AI and machine learning </scary CNN anchor voice>. Ironically, what’s happening is that the data scientists who started out working on these algorithms are getting crowded out of that space.
As I’ve written before, it’s my opinion that data scientists will now need to become much more like developers than statisticians, and we’re seeing that bear out in industry, particularly as the big shops (Amazon, Google, and Microsoft) build out products that do some (not all!) of the heavy lifting of machine learning (SageMaker, Azure ML, and Google’s AI tools), and as products like H2O.ai and Yellowbrick automate feature selection and other parts of the machine learning process.
What does this mean? Not that data science will become obsolete. Analysis and model selection will never be fully automated, and, now more than ever, in an age of extremely large corporate goofs – both intentional and not – in the way algorithms and data are used, humans need to be in the loop.
As I’ve written before,
I realized that there are three fundamental parts of any data project:

- creating code that moves data
- creating code that analyzes data
- human judgment to interpret the results of parts one and two

If any one of the three is de-prioritized, a data project goes awry.
But there will be a shift in the value of the work that a data scientist does. It used to be all about exploration, finding correlations, and modeling.
Now, it’s about putting those model-commodities into production, where other people can use and see them.
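Concretely, “putting the model into production” starts with getting it out of the notebook: serializing the fitted estimator so a separate process can load it and predict without retraining. A minimal sketch using Python’s standard pickle module, with a plain stand-in class in place of a fitted scikit-learn model:

```python
import pickle

class MeanModel:
    """Stand-in for a fitted estimator: always predicts the training mean."""
    def __init__(self, mean):
        self.mean = mean

    def predict(self, n):
        return [self.mean] * n

# "Training" happens once, in the data scientist's environment...
model = MeanModel(mean=3.5)
blob = pickle.dumps(model)

# ...and a separate process (a web app, a batch job) deserializes
# the artifact and serves predictions without the training code.
served = pickle.loads(blob)
predictions = served.predict(3)
```

Everything downstream of that artifact, versioning it, loading it behind an API, monitoring its predictions, is exactly the “stuff around the algorithm” that Warmerdam argues is where the real work lives.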