This summer, my Amalgam Insights colleague Hyoun Park and I will be teaming up to address that question. When it comes to data science platforms, there’s no such thing as “one size fits all.” We are writing this landscape because understanding how to scale data science beyond individual experiments and integrate it into your business is difficult. By breaking down the key characteristics of the data science platform market, this landscape will help potential buyers choose the appropriate platform for their organizational needs. We will examine the following questions, which serve as key differentiators among data science platforms, to determine which characteristics, functionalities, and policies separate platforms supporting introductory data science workflows from those supporting scaled-up, enterprise-grade workflows.
Amalgam’s Assumptions: The baseline order of operations for conducting data science experiments begins with understanding the business problem you’re trying to address. Gathering, prepping, and exploring the data are the next steps, done to extract appropriate features and start creating your model. The modeling process is iterative, and data scientists will adjust their model throughout the process based on feedback. Finally, if and when a model is deemed satisfactory, it can be deployed in some form.
How do these platforms support reproducibility of data, workflows, and results?
One advantage some data science platforms provide is the ability to track and save the data and hyperparameters used in each experiment, so that the experiment can be re-run at any time. Individual data scientists running ad hoc experiments need to do this tracking manually, if they even know to bother with it.
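To make the manual version of this tracking concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how any particular platform does it: the function names (`fingerprint`, `run_experiment`, `record_run`, `rerun`) are hypothetical, and the "model" is a toy stand-in for real training code.

```python
import hashlib
import json
import random


def fingerprint(rows):
    """Return a SHA-256 hash of the dataset so a later run can verify it is unchanged."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(rows, params):
    """Toy stand-in for model training (an assumption): a seeded, noisy average."""
    rng = random.Random(params["seed"])
    noise = rng.uniform(-params["noise"], params["noise"])
    return sum(rows) / len(rows) + noise


def record_run(rows, params):
    """Run the experiment and return a record with what is needed to repeat it."""
    return {
        "data_sha256": fingerprint(rows),
        "params": params,
        "result": run_experiment(rows, params),
    }


def rerun(rows, record):
    """Re-run from a saved record, refusing if the data has changed."""
    if fingerprint(rows) != record["data_sha256"]:
        raise ValueError("dataset differs from the one used in the recorded run")
    return run_experiment(rows, record["params"])


data = [1.0, 2.0, 3.0, 4.0]
# Round-trip through JSON to mimic saving the record to disk and loading it later.
saved = json.loads(json.dumps(record_run(data, {"seed": 7, "noise": 0.1})))
```

Calling `rerun(data, saved)` reproduces the recorded result because the seed and hyperparameters come from the saved record, while calling it with altered data raises an error instead of silently producing a different answer. A platform that handles this bookkeeping automatically removes the need for each data scientist to remember to do it.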
How secure, governable, and compliant are these platforms compared to corporate, standards-based, and legislative needs?
Data access is fragmented, and in early-stage data science setups, it’s not uncommon for data scientists to copy and paste and store the data they need on their own laptop, because they lack the ability to use that data directly while keeping it secure in an IT-approved manner. Data science platforms can help make this secure access process easier.
How do these platforms support collaboration between data scientists, data analysts, IT, and line-of-business departments?
Your data scientists should be able to share their results in a usable form with the rest of the business, whether that takes the form of reports, dashboards, microservices, or apps. In addition, the consumers of these data outputs need to be able to give feedback to the producers to improve results. To capitalize on data science experiments being done in a company, some level of collaboration is necessary, but this may mean different things to different organizations. Some have shared code repositories. Some use chat. Effectively scaling up data science operations requires a more consistent experience across the board, so that everybody knows where to find what they need to get their work done. Centralizing feedback on models into the platform, associated with the models and their outputs, is one example of enabling the necessary consistency.
How do these platforms support a consistent view of data science based on the user interfaces and user experiences that the platforms provide to all users?
This consistency isn’t limited to creating a model catalog with centralized feedback. Moving from individual data scientists working ad hoc with their preferred tools to a standardized experience can meet resistance. Data science platforms often support a wide variety of such tools, which can ease this transition, but not all platforms support the same sets of tools. Still, moving to a unified experience makes it easier to onboard new data scientists into your environment.
What do data science teams look like when they are using data science platforms?
Some teams consist of a couple of people constructing skunkworks pipelines out of code as an initial or side project. Others may do enough ongoing data science work that they work with line-of-business stakeholders, perhaps with the assistance of a project manager. If data science is core to your organization’s business, the team will be large relative to your company’s size, no matter how large the company is. These teams all have different needs. A focus of this research is to categorize typical experiences across the spectrum by team size and complexity, code-centricness, and other measures.
By exploring the people, processes, and technological functionalities associated with data science platforms over this summer, Amalgam Insights looks forward to bringing clarity to the market and providing directional recommendations to the enterprise community. This Vendor SmartList on Data Science Platforms will explore these questions and more to differentiate among a variety of data science platforms currently on the market, including but not limited to Alteryx, Anaconda, Cloudera, Databricks, Dataiku, Domino, H2O.ai, IBM, KNIME, MathWorks, Oracle, RapidMiner, SAP, SAS Viya, Teradata, TIBCO, and other startups and new entrants in this space that establish themselves over the summer of 2018.
If you’d like to learn more about this research initiative, or set up a briefing with Amalgam Insights for potential inclusion, please email me at firstname.lastname@example.org.