Data Science and Machine Learning News Roundup, January 2019

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, Google, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Cloudera and Hortonworks Complete Planned Merger

In early January, Cloudera and Hortonworks completed their planned merger. With this, Cloudera becomes the default machine learning ecosystem for Hadoop-based data, while providing an easy pathway for expanding into machine learning and analytics capabilities for Hortonworks customers.

Study: 89 Percent of Finance Teams Yet to Embrace Artificial Intelligence

A study conducted by the Association of International Certified Professional Accountants (AICPA) and Oracle revealed that 89% of organizations have not deployed AI to their finance groups. Although a correlation exists between companies with revenue growth and companies that are using AI, the key takeaway is that artificial intelligence is still in the early adopter phase for most organizations.

Gartner Magic Quadrant for Data Science and Machine Learning Platforms

In late January, Gartner released its Magic Quadrant for Data Science and Machine Learning Platforms. New to the Data Science and Machine Learning MQ this year are both DataRobot and Google – two machine learning offerings with completely different audiences and scope. DataRobot offers an automated machine learning service targeted towards “citizen data scientists,” while Google’s machine learning tools, though part of Google Cloud Platform, are more of a DIY data pipeline targeted towards developers. By contrast, I find it curious that Amazon’s SageMaker machine learning platform – and its own collection of task-specific machine learning tools, despite their similarity to Google’s – failed to make the quadrant, given this quadrant’s large umbrella.

While data science and machine learning are still emerging markets, the contrasting demands that citizen data scientists and cutting-edge developers make of these technologies warrant splitting the next Data Science and Machine Learning Magic Quadrant into separate reports targeted to the considerations of each of these audiences. In particular, the continued growth of automated machine learning technologies will likely drive such a split, as citizen data scientists pursue a “good enough” solution that provides quick results.

Amazon Expands Toolkit of Machine Learning Services at AWS re:Invent

At AWS re:Invent, Amazon Web Services expanded its toolkit of machine learning application services with the announcements of Amazon Comprehend Medical, Amazon Forecast, Amazon Personalize, and Amazon Textract. These new services augment the capabilities Amazon provides to end users when it comes to text analysis, personalized recommendations, and time series forecasts. The continued growth of these individual services removes obstacles for companies looking to get started with common machine learning tasks on a smaller scale; rather than building a wholesale data science pipeline in-house, these services allow companies to quickly get one task done, and this permits an incremental introduction to machine learning for a given organization. Forecast, Personalize, and Textract are in preview, while Comprehend Medical is available now.

Amazon Comprehend Medical, Forecast, Personalize, and Textract join a collection of machine learning services that include speech recognition (Transcribe) and translation (Translate), speech-to-text and text-to-speech (Lex and Polly) to power machine conversation such as chatbots and Alexa, general text analytics (Comprehend), and image and video analysis (Rekognition).

New Capabilities

Amazon Personalize lets developers add personalized recommendations into their apps, based on a given activity stream from that app and a corpus of what’s available to be recommended, whether that’s products, articles, or other things. In addition to recommendations, Personalize can also be used to customize search results and notifications. By combining a given search string or location with contextual behavior data, Amazon looks to give its customers a way to build trust with their own users through more relevant results.

Amazon Forecast builds private, custom time-series forecast models that predict future trends based on a customer’s own data. Customers provide both historical data and related causal data, and Forecast analyzes the data to determine the relevant factors in building its models and providing forecasts.
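
To illustrate the idea of combining historical data with a related causal variable, here is a minimal sketch in Python. It is not how Amazon Forecast works internally, which the announcement does not describe; it swaps in a plain one-variable least-squares fit. The function name `fit_line`, the discount and sales figures, and the planned discount are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical history: weekly discount (%) is the related variable,
# units sold is the quantity to forecast.
discount_pct = [0, 10, 0, 20, 10, 0]
units_sold = [100, 120, 100, 140, 120, 100]

intercept, slope = fit_line(discount_pct, units_sold)

# Forecast next week's demand for a planned (hypothetical) 15% discount.
planned_discount = 15
forecast = intercept + slope * planned_discount
print(round(forecast, 1))
```

A real forecasting model would also account for trend and seasonality in the history itself; this sketch only shows how a related variable can inform the prediction.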

Amazon Textract extracts text and data from scanned documents, without requiring manual data entry or custom code. In particular, using machine learning to recognize when data is in a table or form field and treat it appropriately will save a significant amount of time over the current OCR standard.

Finally, Amazon Comprehend Medical, an extension of last year’s Amazon Comprehend, uses natural language processing to analyze unstructured medical text such as doctor’s notes or clinical trial records, and extract relevant information from this text.

Recommendations

Organizations doing resource planning, financial planning, or other similar forecasting that currently lack the capability to do time series forecasting in-house should consider using Amazon Forecast to predict product demand, staffing levels, inventory levels, and material availability, and to perform financial forecasting. Outsourcing the construction of complex forecasting models lets departments focus on the predictions.

Consumer-oriented organizations that want to build higher levels of engagement with their customers, but that currently provide generic, uncontextualized recommendations (based on popularity or other simple measures), should consider using Amazon Personalize to provide personalized recommendations, search results, and notifications via their apps and websites. Providing high-quality, relevant recommendations à la minute builds customer trust in the quality of a given organization’s engagement efforts, particularly compared to the average spray-and-pray marketing communication.

Organizations that still depend on physical documents, or who have an archive of physical documents to scan and analyze, should consider using Amazon Textract. OCR’s limits are well-known, especially when it comes to accurately interpreting and formatting semi-structured blocks of text data such as form fields and tables, resulting in significant time devoted to post-processing manual correction. Textract handles complex documents without the need for custom code or maintaining templates; being able to automate text interpretation and analysis further accelerates document processing workflows, and better permits organizations to maintain compliance.

Medical organizations using software that depends on manually-implemented rules to process their medical text should consider using Amazon Comprehend Medical. By removing the need to maintain a list of rules in-house, Comprehend Medical accelerates the ability to extract and analyze medical information from unstructured text fields like doctor’s notes and health records, improving processes such as medical coding, cohort analysis to recruit patients for clinical trials, and health monitoring of patients.

All organizations looking to use machine learning services from external providers need to consider whether outsourcing will work for their circumstances. Data privacy is a key concern, and even more so in regulated verticals with industry-specific rules such as HIPAA. Does the service you want to use respect those rules? From a compliance perspective, why a model gives the results it does needs to be explained as well; merely accepting results from the black box at face value is insufficient. Machine learning products that automatically provide such an explanation in plain English do exist, but this feature is still uncommon and in its infancy.

Conclusion

With its latest announcements, Amazon continues to broaden the scope of customer issues it addresses with machine learning services. Medical companies need better text analytics yesterday, but struggle to comply with HIPAA while assessing the data they have. Customer-facing organizations face stiff competition when their competitor is only a click away. And any company trying to plan for the future based on past data grapples with understanding what factors affect future results. Amazon’s machine learning application services address common tactical business issues by reducing the work of implementing task-specific machine learning models to supplying inputs and consuming outputs. These services present outsourcing opportunities for overworked departments struggling to keep up.

Data Science and Machine Learning News, November 2018

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, SnapLogic, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Data Science and Machine Learning News, October 2018

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Data Science Platforms News Roundup, September 2018

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Learning Elastic’s Machine Learning Story at Elastic{ON} in Boston

Why is a Data Science and Machine Learning Analyst at Elastic’s road show when they’re best known for search? In early September, Amalgam Insights attended Elastic{ON} in Boston, MA. Prior to the show, my understanding of Elastic was that they were primarily a search engine company. Still, the inclusion of a deep dive into machine learning interested me, and I was also curious to learn more about their security analytics, which were heavily emphasized in the agenda.

In exploring Elastic’s machine learning capabilities, I got a deep dive with Rich Collier, the Senior Principal Solutions Architect and Machine Learning specialist. Elastic acquired Prelert, an incident management company with unsupervised machine learning capabilities, in September 2016 with the goal of incorporating real-time behavioral analytics into the Elastic Stack. In the two years since, integrating Prelert has expanded Elastic’s ability to act on time-series data anomalies found in the Elasticsearch data store, and Elastic now offers an extension to the Elastic Stack called “Machine Learning” as part of its Platinum-level SaaS offerings.

Elastic Machine Learning users no longer have to define rules to identify abnormal time-series data, nor do they even need to code their own models – the Machine Learning extension analyzes the data to understand what “normal” looks like in that context, including what kind of shifts can be expected over different periods of time from point-to-point all the way up to seasonal patterns. From that, it learns when to throw an alert on encountering atypical data in real time, whether that data is log data, metrics, analytics, or a sudden upsurge in search requests for “$TSLA.” Learning from the data rather than configuring blunt rules yields finer-grained precision and reduces false-positive alerts.
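
To make the contrast with fixed rules concrete, here is a minimal sketch in Python. It is not Elastic’s algorithm, which was not described in detail; it assumes a simple rolling baseline (the median and median absolute deviation of a trailing window). The function name `rolling_anomalies`, the window size, the threshold, and the sample data are all hypothetical choices for illustration.

```python
from statistics import median


def rolling_anomalies(values, window=8, threshold=5.0):
    """Flag points that deviate strongly from a baseline learned from recent data.

    Illustrative only: the baseline is the median of the trailing window and
    the spread is the median absolute deviation (MAD). Neither is claimed to
    be what Elastic's Machine Learning extension uses.
    """
    flagged = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        baseline = median(recent)
        mad = median(abs(v - baseline) for v in recent)
        # Guard against a zero spread on perfectly flat data.
        spread = mad if mad > 0 else 1e-9
        if abs(values[i] - baseline) / spread > threshold:
            flagged.append(i)
    return flagged


# Hypothetical request counts per minute with one sudden surge at index 10.
requests_per_minute = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 400, 101, 98]
print(rolling_anomalies(requests_per_minute))
```

No absolute limit such as “alert above 300 requests” appears anywhere in the sketch; what counts as unusual is derived from the recent data itself, which is the property the paragraph above describes.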

The configuration for the Machine Learning extension is simple and requires no coding experience; front-line business analysts can customize the settings via pull-down menus and other graphical form fields to suit their needs. To simplify the setup process even further, Elastic offers a list of “machine learning recipes” on their website for several common use cases in IT operations and security analytics; given how graphically oriented the Elastic stack is, I wouldn’t be surprised to see these “recipes” implemented as default configuration options in the future. Doing so would simplify the configuration from several minutes of tweaking individual settings to selecting a common profile in a click or two.

Elastic also stated that one long-term goal is to “operationalize data science for everyone.” At the moment, that’s a fairly audacious claim for a data science platform, let alone a company best known for search and search analytics. One relevant initiative that Steve Kearns mentioned in the keynote was the debut of the Elastic Common Schema, a common set of fields for ingesting data into Elasticsearch. Standardizing data collection makes it easier to correlate and analyze data through these relational touch points, and opens up the potential for data science initiatives in the future, possibly through partnerships or acquisitions. But they’re not trying to be a general-purpose data science company right now; they’re building on their core of search, logging, security, and analytics, and machine learning ventures are likely to fall within this context. Currently, that offering is anomaly detection on time series data.

Providing users who aren’t data scientists with the ability to do anomaly detection on time series data may be just one step in one category of data modeling, but having that sort of specialized tool accessible to data and business analysts would help organizations better understand the “periodicity” of typical data. Retailers could track peaks and valleys in sales data to understand purchasing patterns, for example, while security analysts could focus on responding to anomalies without having to define what anomalous data looks like ahead of time as a collection of rules.

Elastic’s focus on making this one specific machine learning tool accessible to non-data-scientists reminded me of Google’s BigQuery ML initiative  – take one very specific type of machine learning query, and operationalize it for easy use by data and business analysts to address common business queries. Then, once they’ve perfected that tool, they’ll move onto building the next one.

Improving the quality of data acquired and stored in Elasticsearch from search results will be key to improving on the user experience. I spoke with Steve Kearns, the Senior Director of Product Management at Elastic, who delivered the keynote speech with a sharp focus on “scale, speed, and relevance” for improving search results. Better search data can be used to optimize machine learning applied to that data. With how Elastic has created the Machine Learning extension focused on anomaly detection across time series data – data Elasticsearch specializes in collecting, such as log data – this can support more accurate data analysis and better business results for data-driven organizations.

Overall, it was intriguing to see how machine learning is being incorporated into IT solutions that aren’t directly supporting data science environments. Enabling growth in the use of machine learning tactics effectively spreads the use of data across an organization, bringing companies closer to the advantages of the data-driven ideal. Elastic’s Machine Learning capability potentially opens up a specific class of machine learning for a broader spectrum of Elastic users without requiring them to acquire coding and statistics backgrounds; this positions Elastic as a provider of a specific type of machine learning services in the present, and makes it more plausible to consider them as a company for providing machine learning services in the future.

Data Science Platforms News Roundup, August 2018

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Anaconda, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta.

Code-Free to Code-Based: The Power Spectrum of Data Science Platforms

The spectrum of code-centricity on data science platforms ranges from “code-free” to “code-based.” Data science platforms frequently boast that they provide environments that require no coding, and that are code-friendly as well. Where a given platform falls along this spectrum affects who can successfully use a given data science platform, and what tasks they are able to perform at what level of complexity and difficulty. Codeless interfaces supply drag-and-drop simplicity and relatively quick responses to straightforward questions at the expense of customizability and power. Code-based interfaces require specialized coding, data, and statistics skills, but supply the flexibility and power to answer cutting-edge questions.

Codeless and hybrid code environments furnish end users who may lack a significant coding and statistics background with some level of data science capabilities. If a problem is relatively simple, such as a straightforward clustering question to identify customer personas for marketing, graphic interfaces provide the ability to string together data workflows from a pool of available algorithms without needing to know Python or other coding languages. Even for data scientists who do know how to code, the ability to pull together relatively simple models in a drag-and-drop GUI can be faster than manually coding them; it also avoids typos and reduces the need to debug code technicalities, leaving more attention for the pure logic of the model.
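
For a sense of what the code-based end of the spectrum looks like for that same clustering question, here is a minimal sketch in Python using only the standard library. It is a plain k-means written for illustration, not the implementation of any particular platform. The function name `kmeans`, the choice to seed with the first k points, and the customer figures are all hypothetical, and the features are left unscaled for brevity.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    # Seed with the first k points so the run is deterministic (a simplification).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical customers: (orders per year, average order value in dollars).
customers = [(2, 20.0), (40, 25.0), (3, 22.0), (38, 30.0), (1, 18.0), (42, 27.0)]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

Even this small example involves decisions about seeding, iteration count, and empty clusters, which is the kind of detail a drag-and-drop interface hides from the analyst.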

Answering a more advanced question may require some level of custom coding. Your data workflow may be constructed in a hybrid manner, composed of some pre-built models connected to nodes that can include bespoke code. This permits more adaptability of models, and makes them more powerful than those restricted solely to what a given data science platform supplies out of the box. However, even if a data science platform includes the option to include custom code in a hybrid model, taking advantage of this feature requires somebody with coding knowledge to create the code.

If the problem being addressed is complex enough, sharper coding, statistics, and data skills are necessary to create appropriately tailored models. At this level of complexity, a code-centric interactive development environment is necessary so that the data scientist can put their advanced skills into model construction and customization.

Data science platforms can equip data science users and teams with multiple interfaces for creating machine learning models. Which interfaces are included says a fair bit about what kind of end users a given platform aims to best serve, and the level of skill expected of the various members of your data science team. A fully-inclusive data science platform includes both a GUI environment for data analysts to construct simple workflows (and for project managers and line of business to understand what the model is doing from a high-level perspective), as well as a proper coding environment for data scientists to code more complex custom models.

Oracle GraphPipe: Expediting and Standardizing Model Deployment and Querying

On August 15, 2018, Oracle announced the availability of GraphPipe, a network protocol designed to transmit machine learning data between remote processes in a standardized manner, with the goal of simplifying the machine learning model deployment process. The spec is now available on Oracle’s GitHub, along with clients and servers that have implemented the spec for Python and Go (with a Java client soon to come); and a TensorFlow plugin that allows remote models to be included inside TensorFlow graphs.

Oracle’s goal with GraphPipe is to standardize the process of model deployment regardless of the frameworks utilized in the model creation stage.

Growing Your Data Science Team: Diversifying Beyond Unicorns

If your organization already has a data scientist, but your data science workload has grown beyond their capacity, you’re probably thinking about hiring another data scientist. Perhaps even a team of them. But cloning your existing data scientist isn’t the best way to grow your organization’s capacity for doing data science.

Why not simply hire more data scientists? First, so many of the tasks listed above are actually well outside the core competency of data scientists’ statistical work, and other roles (some of whom likely already exist in your organization) can perform these tasks much more efficiently. Second, data scientists who can perform all of these tasks well are a rare find; hoping to find their clones in sufficient numbers on the open market is a losing proposition. Third, though your organization’s data science practice continues to expand, the amount of time your original domain expert is able to spend with the data scientist on a growing pool of data science projects does not; it’s time to start delegating some tasks to operational specialists.