
From BI to AI: A Weekly Recap of Data and Analytics

Amalgam Insights is relaunching its weekly summary of important announcements in the data and analytics space. If you would like your announcement to be included in these roundups, please email lynne@amalgaminsights.com.

Product Launches and Updates

Alkymi Launches Patterns to Allow Business Users to Identify and Extract Data in Real-Time to Automate Daily Workflows

On Wednesday, May 5, Alkymi launched Alkymi Patterns, a tool that automates process workflows based on data from tables and text in emails and documents. Users can quickly train Patterns to extract the necessary information from unstructured data in their inbox, then apply the demonstrated workflow automatically as future emails and documents arrive. Alkymi Patterns is now available to all Alkymi Data Inbox customers.

TIBCO Adds Additional Automation Capabilities to TIBCO Cloud Integration

On Tuesday, May 4, TIBCO announced updates to TIBCO Cloud Integration, its integration platform as a service (iPaaS), including process automation improvements and new accelerators. The new features will help customers with logistics and transportation challenges by reducing the need to develop custom integrations from scratch.

Databricks on Google Cloud Now Generally Available

Databricks announced general availability of Databricks on the Google Cloud Platform on Tuesday, May 4. The latest version includes new features such as a Tableau connector to Databricks on GCP, as well as a Terraform provider to provision and manage Databricks in GCP along with the associated cloud infrastructure. This announcement is indicative of the increasing popularity of multicloud and multi-vendor solutions for solving analytic challenges.

Oracle Announces New Analytics Cloud Capabilities

Oracle added capabilities to its Analytics Cloud on Monday, May 3, expanding its “self-serve analytics” capacity. Among the new features are “explainable machine learning” (surfacing which factors mattered most in a model’s prediction of a given outcome), automated data prep, text analytics, “affinity analysis” (identifying sets of items that frequently go together), graph analytics, custom map analytics, natural language querying, and a new mobile app.

New SAS Viya offerings help better manage and navigate big data for AI and analytics

On Monday, May 3, SAS announced the addition of two new data management solutions to its Viya platform, both available now. The first, SAS Studio Analyst, provides a more visually oriented IDE to enable end users to more quickly add data quality and data prep steps to their data workflow. The second, SAS Information Governance, integrates metadata search capabilities and a data catalog into Viya, allowing data professionals to find and manage their data resources more easily.

Anaconda and HP Announce Collaboration on Products for Data Scientists

On April 28, Anaconda and HP announced a “deepened collaboration” to support the data science community. Anaconda is now fully integrated into the Z by HP for Data Science product line as part of the pre-loaded software stack. By automatically including Anaconda’s curated repositories, Z by HP for Data Science users will be able to spend less time on Anaconda package management and environment configuration.

Red Hat Launches OpenShift Data Science

On April 27, Red Hat added new managed cloud services to its portfolio. Among these new services is Red Hat OpenShift Data Science, based on Open Data Hub. This service provides for faster development, training, and testing of machine learning models without the typical infrastructure demands. OpenShift Data Science is currently available in beta, as an add-on to OpenShift Dedicated and to Red Hat OpenShift Service on AWS; general availability is expected later this year. Red Hat’s offering is important because of its position as a leading Open Source development platform. By combining its existing software development platform capabilities with a data science platform, Red Hat provides a one-stop shop for developing machine learning-based applications.

Funding

Predictive analytics startup Pecan.ai raises $35M to boost AI adoption

Pecan.ai, a predictive analytics startup, announced a $35M Series B round on Thursday, led by GGV Capital. Pecan.ai offers a no-code platform that pits neural networks against each other and selects the most accurate one for any given prediction task.

StarTree Secures $24 Million Funding to Commercialize “Blazing Fast” Analytics Platform Used by LinkedIn and Uber

StarTree, a real-time cloud analytics platform, announced a $24M Series A round of funding. Bain Capital Ventures and GGV Capital (spreading the wealth) led the round, with participation from existing investor CRV. StarTree, built on the Apache Pinot OLAP datastore, is designed to let users access their data more quickly and easily. With this funding, StarTree is expected to pursue business from companies that need to provide real-time analytics to large customer bases.

Timescale Raises $40M Series B to Support its Time Series Database

Timescale raised a $40M Series B funding round on Thursday, May 6. Redpoint Ventures led this round, and all existing investors participated. The funds will be used to enhance its product suite, including new features for managing large-scale deployments and improved visibility for developers using its time-series relational database.

Tellius Raises $8 Million to Provide AI-Supported Decision Intelligence

Tellius raised an $8M Series A funding round on April 27. Sands Capital Ventures led the funding round, with participation from Grotech Ventures. The funds will be used to further enhance its decision intelligence solution, with the goal of helping customers make more data-driven decisions more easily.

n8n raises $12m Series A for its ‘fair-code’ workflow automation tool

n8n announced a $12M Series A funding round on April 26. Sequoia Capital, firstminute Capital, and Harpoon Ventures participated in the round, which was led by Felicis Ventures. n8n provides a “fair-code” licensed, no-code environment for building workflow automations and aims to grow a community of citizen developers. n8n intends to use the funds to expand hiring in engineering, developer relations, sales, and marketing, as well as to invest further in its workflow automation tool.

Acquisitions

Francisco Partners and TPG to Acquire Boomi from Dell Technologies

Francisco Partners and TPG will acquire Boomi from Dell Technologies for $4B. Boomi provides a cloud-based integration platform as a service (iPaaS). Francisco Partners has a history of investing in data companies such as Lucidworks and Redis Labs, while prior TPG data and AI investments include C3.ai and Domo.

Upcoming Events

DataRobot’s 2021 AI Experience Worldwide, May 11-12, 2021

On May 11-12, DataRobot will hold their 2021 AI Experience Worldwide event online. Broad themes include the new era of democratized AI, and why trust in AI is a requirement. To register for this event, please visit AI Experience Worldwide.

TIBCO Analytics Forum 2021

On May 25-26, TIBCO will hold the TIBCO Analytics Forum. The event will focus on recent innovations in data management, data science, and visual and streaming analytics. To register for this event, please visit the TIBCO Analytics Forum.


IBM and Cloudera Join Forces to Expand Data Science Access

On June 21, IBM and Cloudera jointly announced that they were expanding their existing relationship to bring more advanced data science solutions to Hadoop users by developing a shared go-to-market program. IBM will now resell Cloudera Enterprise Data Hub and Cloudera DataFlow, while Cloudera will resell IBM Watson Studio and IBM BigSQL.

In bulking up their joint go-to-market programs, IBM and Cloudera are reaffirming their pre-existing partnership to amplify each other’s capabilities, particularly in heavy data workflows. Cloudera Hadoop is a common enterprise data source, but Cloudera’s existing base of data science users is small despite growing demand for data science options, and its Data Science Workbench is coder-centric. Offering the more user-friendly IBM Watson Studio gives Cloudera’s existing data customers a convenient option for doing data science without necessarily needing to know Python, R, or Scala. IBM, in turn, can now sell Watson Studio, BigSQL, and IBM consulting and services more deeply into Cloudera’s customer base, broadening its ability to upsell additional offerings.

Because IBM and Cloudera each hold significant amounts of on-prem data, it’s interesting to look at this partnership in terms of the 800-pound gorilla of cloud data: AWS. IBM, Cloudera, and Amazon are all leaders when it comes to the sheer amount of data each holds. But Amazon is the biggest cloud provider on the planet, holding a plurality of the cloud hosting market, while most of IBM’s and Cloudera’s customer data sits on-prem. Because that data is hosted on-prem, it is data Amazon cannot reach; IBM and Cloudera are teaming up to sell their own data science and machine learning capabilities on on-prem data where security or policy reasons may keep it out of the cloud.

A key differentiator in comparing AWS with the IBM-Cloudera partnership lies in AWS’ breadth of machine learning offerings. In addition to a general-purpose data science and machine learning platform in SageMaker, AWS offers task-specific tools such as Amazon Personalize and Textract that address precise use cases for customers who don’t need a full-blown data science platform. IBM has APIs for visual recognition, natural language classification, and decision optimization, but it has not packaged them into higher-level services the way AWS has. Cloudera customers building custom machine learning models may find that IBM’s Watson Studio suits their needs; however, IBM lacks the variety of off-the-shelf machine learning applications that AWS provides, supplying its machine learning capabilities as individual APIs that an application development team must fit together to create in-house apps.

Recommendations

  • For Cloudera customers looking to do broad data science, IBM Watson Studio is now an option. This offers Cloudera customers an alternative to Data Science Workbench; in particular, an option that has a more visual interface, with more drag-and-drop capabilities and some level of automation, rather than a more code-centric environment.
  • IBM customers can now choose Cloudera Enterprise Data Hub for Hadoop. IBM and Hortonworks had a long-term partnership; IBM supporting and cross-selling Enterprise Data Hub demonstrates that IBM will continue to sell enterprise Hadoop in some flavor.

Data Science and Machine Learning News Roundup, May 2019

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, Google, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Domino Data Lab Champions Expert Data Scientists While Outpacing Walled-Garden Data Science Platforms

Domino announced key updates to its data science platform at Rev 2, its annual data science leader summit. For data science managers, the new Control Center provides information on what an organization’s data science team members are doing, helping managers address blocking issues and prioritize projects appropriately. The Experiment Manager’s new Activity Feed gives data scientists better organizational and tracking capabilities for their experiments. The Compute Grid and Compute Engine, built on Kubernetes, will make it easier for IT teams to install and administer Domino, even in complex hybrid cloud environments. Finally, the beta Domino Community Forum will allow Domino users to share best practices with each other, as well as submit feature requests and feedback to Domino directly. With governance becoming a top priority across data science practices, Domino’s platform improvements around monitoring and experiment repeatability will make governance easier for its users.

Informatica Unveils AI-Powered Product Innovations and Strengthens Industry Partnerships at Informatica World 2019

At Informatica World, Informatica publicized a number of key partnerships, both new and enhanced. Most of these partnerships involve additional support for cloud services, including storage for both data warehouses (Amazon Redshift) and data lakes (Azure, Databricks). Informatica also announced a new Tableau Dashboard Extension that makes Informatica Enterprise Data Catalog available from within the Tableau platform. Finally, Informatica and Google Cloud are broadening their existing partnership by making Intelligent Cloud Services available on Google Cloud Platform and providing increased support for Google BigQuery and Google Cloud Dataproc within Informatica. Amalgam Insights attended Informatica World and provides a deeper assessment of Informatica’s partnerships, as well as CLAIRE-ity on Informatica’s AI initiatives.

Microsoft delivers new advancements in Azure from cloud to edge ahead of Microsoft Build conference

Microsoft announced a number of new Azure Machine Learning and Azure AI capabilities. Azure Machine Learning has been integrated with Azure DevOps to provide “MLOps” capabilities that enable reproducibility, auditability, and automation of the full machine learning lifecycle. This marks a notable step toward making the machine learning lifecycle more governable and compliant with regulatory needs. Azure Machine Learning also has a new visual drag-and-drop interface to facilitate codeless machine learning model creation, making the model-building process more user-friendly. On the Azure AI side, Azure Cognitive Services launched Personalizer, which provides users with specific recommendations to inform their decision-making process. Personalizer is part of the new “Decisions” category within Azure Cognitive Services; other Decisions services include Content Moderator, an API to assist in moderating and reviewing text, images, and videos, and Anomaly Detector, an API that ingests time-series data and chooses an appropriate anomaly detection model for that data. Finally, Microsoft added a “cognitive search” capability to Azure Search, which allows customers to apply Cognitive Services algorithms to their structured and unstructured content when generating search results.

Microsoft and General Assembly launch partnership to close the global AI skills gap

Microsoft also announced a partnership with General Assembly to address the dearth of qualified data workers, with the goal of training 15,000 workers by 2022 for various artificial intelligence and machine learning roles. The two companies will found an AI Standards Board to create standards and credentials for artificial intelligence skills. In addition, Microsoft and General Assembly will develop scalable training solutions for Microsoft customers, and establish an AI Talent network to connect qualified candidates to AI jobs. This continues the trend of major enterprises building internal training programs to bridge the data skills gap.


The CLAIRE-ion Call at Informatica World 2019: AI Needs Managed Data and Data Management Needs AI

From May 20-23, Amalgam Insights attended Informatica World 2019, Informatica’s end user summit dedicated to the world of data management. Over the years, Informatica has transformed from a data integration and data governance vendor to a broad-based enterprise data management vendor offering what it calls the “Intelligent Data Platform,” consisting of data integration, Big Data Management, Integration Platform as a Service, Data Quality and Governance, Master Data Management, Enterprise Data Catalog, and Data Security & Privacy. Across all of these areas, Informatica has built market-leading products with the goal of providing high-quality point solutions that can work together to solve broad data challenges.

To support this holistic enterprise data management approach, Informatica has developed an artificial intelligence layer, called CLAIRE, to support metadata management, machine learning, and artificial intelligence services that provide context, automation, and anomaly recognition across Informatica’s varied data offerings. And at Informatica World 2019, CLAIRE was everywhere as AI served as a core theme of the event.

CLAIRE was mentioned in Informatica’s Intelligent Cloud Services, automating redundant, manual data processing steps. It was in Informatica Data Quality, cleansing and standardizing incoming data. It was in Data Integration, speeding up the integration process for non-standard data. It was in the Enterprise Data Catalog, helping users to understand where data was going across their organization. And it was in Data Security, identifying patterns and deviations in user activities that could indicate suspicious behavior, while contextually masking sensitive data for secure use.

What’s meant by CLAIRE? It’s the name tying together all of Informatica’s automated smart contextual improvements across its Intelligent Data Platform, surfacing relevant data, information, and recommendations just-in-time throughout various parts of the data pipeline. By bringing “AI” to data management, Informatica hopes to improve efficiency throughout the whole process, helping companies manage the growing pace of data ingestion.


Data Science and Machine Learning News Roundup, April 2019

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, Google, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Alteryx Acquires ClearStory Data to Accelerate Innovation in Data Science and Analytics

Alteryx acquired ClearStory Data, an analytics solution for complex and unstructured data with a focus on automating Big Data profiling, discovery, and data modeling. This acquisition reflects Alteryx’s interest in expanding its native capabilities to include more in-house data visualization tools. ClearStory Data’s visual approach to data prep, blending, and dashboarding through its Interactive Storyboards complements Alteryx’s ongoing buildout of visualization capabilities throughout the workflow, such as Visualytics.

Dataiku Announces the Release of Dataiku Lite Edition

Dataiku released two new versions of its machine learning platform, Dataiku Free and Dataiku Lite, targeted at small and medium businesses. Dataiku Free will allow teams of up to three users to work together simultaneously; it is available both on-prem and on AWS and Azure. Dataiku Lite will provide support for Hadoop and job scheduling beyond the capabilities of Dataiku Free. Since Dataiku already partners with over 1,000 small and medium businesses, creating versions of its existing platform that are more financially accessible to such organizations lowers a significant barrier to entry and grooms smaller companies to grow their nascent data science practices within the Dataiku family.

DataRobot Celebrates One Billion Models Built on Its Cloud Platform

DataRobot announced that as of mid-April, its customers had built one billion models on its automated machine learning platform. Vice President of Product Management Phil Gurbacki noted that DataRobot customers build more than 2.5 million models per day. Given that the majority of models created are never successfully deployed – a common theme cited this month at both Enterprise Data World and at last week’s Open Data Science Conference – it seems likely that DataRobot customers don’t currently have one billion models operationalized. If the percentage of deployed models is significantly higher than the norm, though, this would certainly boost DataRobot in potential customers’ eyes, and serve to further legitimize AutoML software solutions as plausible options.

Microsoft, SAS, TIBCO Continue Investments in AI and Data Skills Training

Microsoft announced a new partnership with OpenClassrooms to train students for the AI job marketplace via online coursework and projects. Given an estimate that projects 30% of AI and data jobs will go unfilled by 2022, OpenClassrooms’ recruiting 1000 promising candidates seems like just the beginning of a much-needed effort to address the skills gap.

SAS provided more details on the AI education initiatives they announced last month. First, they launched SAS Viya for Learners, which will allow academic institutions to access SAS AI and machine learning tools for free. A new SAS machine learning course and two new Coursera courses will provide access to SAS Viya for Learners to those wanting to learn AI skills without being affiliated with a traditional academic institution. SAS also expanded on the new certifications they plan to offer: three SAS specialist certifications in machine learning; natural language and computer vision; and forecasting and optimization. Classroom and online options for pursuing these certifications will be available.

Meanwhile, TIBCO continued expanding its partnerships with educational institutions in Asia to broaden analytics knowledge in the region. Most recently, it has augmented its existing partnership with Singapore Polytechnic to train 1000 students in analytics and IoT skillsets by 2020. Other analytics education partnerships TIBCO has announced in the last year include Yuan Ze University in Taiwan, Asia Pacific University of Technology and Innovation in Malaysia, and BINUS University in Indonesia.

The big picture: existing data science degree programs and machine learning and AI bootcamps are not producing highly skilled job candidates quickly enough, or in large enough volume, to fill many of these data-centric positions. Expect to hear about more educational efforts from data science, machine learning, and AI vendors.


Enterprise Data World 2019: Data Science Will Take Over The World! … Eventually.

Amalgam Insights attended Enterprise Data World, a conference focused on data management, in late March. Though the conference tracks covered a wide variety of data practices, our primary interest was in the sessions on the AI and Machine Learning track. We came away with the impression that the data management world is starting to understand and support some of the challenges that organizations face when trying to get complex data initiatives off the ground, but that the learning process will continue to have growing pains.

Data Strategy Bootcamp

I began my time at Enterprise Data World with the Data Strategy Bootcamp on Monday. Organizations often focus on getting smaller data projects done quickly in a tactical fashion at the expense of consciously developing their broader data strategy. The bootcamp addressed how to incorporate these “quick wins” into the bigger picture, and delved into the details of what a data strategy should include and what the process of building one looks like. For people in data analyst and data scientist roles, understanding and contributing to your organization’s data strategy is important because well-documented and properly managed data means analysts and data scientists can spend more of their time doing analytics and building machine learning models. The “data scientists spend 80% of their time cleaning and preparing data” figure continues to circulate without measurable improvement. To build a successful data strategy, organizations need to identify data-centric business goals that align the data strategy with the business strategy, assess the organization’s maturity and capabilities across its data ecosystem, and determine both long-term goals and “quick wins” that will provide measurable progress towards those goals.

Getting Started with Data Science, Machine Learning, and Artificial Intelligence Initiatives

Actually getting started on data science, machine learning, and artificial intelligence initiatives remains a point of confusion for many organizations looking to expand beyond the basic data analytics they are already doing. Kristin Serafin and Lizzie Westin of FINRA, along with Vinay Seth Mohta of Manifold, led sessions discussing how to turn talk about machine learning and artificial intelligence into action in your organization, and how to do so in a way that scales quickly. Key takeaways: your organization needs to understand its data well enough to know which of the questions it wants answered require a machine learning approach; it needs to understand what tools are necessary to move forward; it needs to understand who already has pertinent data capabilities within the organization and who is best positioned to improve their skills in the necessary manner; and it needs to obtain buy-in from relevant stakeholders.

Data Job Roles

Data job roles were discussed in multiple sessions; I attended one on how analytical jobs themselves are evolving, and one on analytical career development. Despite the hype, not everyone is a data scientist, even if they perform some tasks that are part of a data science pipeline. Data engineers are the difference between data scientists’ experiments sitting in silos and those experiments reaching production, where they can affect your company. Data analysts aren’t going anywhere – yet. (Though Michael Stonebraker, in his keynote Tuesday morning, stated that he believed data science would eventually replace BI, pending the upskilling of a sufficient number of data workers.) And data scientists spend 80% of their time doing data prep instead of building machine learning models; they would like to do more of the latter, and because they are an expensive asset, the business needs them doing less prep and more building as well.

By the same token, there are so many different specialties across the data environment, and the tool landscape is incredibly large. No one will know everything; even relatively low-level people will need to provide leadership in their particular roles to bridge the much-bemoaned gap between IT and Business. So how can data people do that? They’ll need to learn to talk about their initiatives and accomplishments in business terms – increasing revenue, decreasing cost, managing risk. By doing this, data strategy can be tied to business strategy, and this barrier to success can be surmounted.

Data Integration at Scale

Michael Stonebraker’s keynote highlighted the growing need for people with data science capabilities, but the real meat of his talk centered on how to support complex data science initiatives: doing data integration at scale. One example: General Electric’s procurement system problem. Clearly, the ideal number of procurement systems in any company is one. Given mergers and acquisitions over time, GE had accumulated 75 procurement systems. GE could save $100M if it could bring together all of these systems, with all of the information on the terms and conditions negotiated with each vendor through each of them, but doing so required a rather complex data integration effort. And once that was done, the same work remained for GE’s supplier databases, its customer databases, and a whole host of other data. Machine learning can help with this – once there are enough people with machine learning skills to address such large problems. But data integration at scale will remain a significant challenge for enterprises for now: machine learning skills are relatively costly and rare, data accumulation continues to grow exponentially, and third-party data keeps being brought in to supplement existing analyses.
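
To make the integration challenge concrete: the same supplier typically appears under slightly different names in each procurement system, so records cannot simply be joined on exact matches. The sketch below is a toy illustration of that matching problem using Python’s standard-library difflib; the supplier names and the 0.75 similarity threshold are invented for illustration, and this is not GE’s or Stonebraker’s actual approach.

```python
# Toy illustration of one hurdle in large-scale data integration: the same
# supplier recorded under slightly different names across procurement systems.
# Not a description of GE's actual approach -- just a sketch using difflib.
from difflib import SequenceMatcher

system_a = ["Acme Industrial Supply", "Globex Corporation", "Initech LLC"]
system_b = ["ACME Industrial Supply Inc.", "Globex Corp", "Initrode"]

def best_match(name, candidates, threshold=0.75):
    """Return the closest candidate name and its similarity score, if any."""
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return (match, score) if score >= threshold else (None, score)

for name in system_a:
    match, score = best_match(name, system_b)
    print(f"{name!r} -> {match!r} (similarity {score:.2f})")
```

Even this toy version shows why the work is expensive: every pair of systems multiplies the matching problem, and borderline scores still need human review.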

Knowledge Graphs and Semantic AI

A number of sessions discussed knowledge graphs and their importance for supporting both data management and data science tasks. Knowledge graphs provide a “semantic” layer over standard relational databases – they prioritize documenting the relationships between entities, making it easier to understand how different parts of your organization’s data are interrelated. Because having a knowledge graph about your organization’s data provides natural-language context around data relationships, it can make machine learning models based on that data more “explainable” due to the additional human-legible information available for interpretation and understanding. Another example: if you’re trying to perform a search, most results rely on exact matches. Having a knowledge graph makes it simple to pull up “related” results based on the relationships documented in that knowledge graph.
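
As a minimal sketch of that “related results” idea, the snippet below uses networkx as a stand-in for a real graph store (production knowledge graphs typically live in RDF or property-graph databases); the entities and relationships are invented for illustration.

```python
# Minimal sketch of the "related results" idea behind a knowledge graph.
# networkx stands in for a real graph store; entities and relationships
# are invented for illustration.
import networkx as nx

kg = nx.Graph()
kg.add_edge("Customer Churn Model", "customers table", relation="trained_on")
kg.add_edge("customers table", "CRM system", relation="sourced_from")
kg.add_edge("customers table", "Marketing dept", relation="owned_by")
kg.add_edge("Customer Churn Model", "Retention dashboard", relation="feeds")

query = "customers table"

# Exact-match search only finds the node itself...
print("Exact match:", [n for n in kg.nodes if n == query])

# ...while the documented relationships surface related assets and their context.
related = [(nbr, kg.edges[query, nbr]["relation"]) for nbr in kg.neighbors(query)]
print("Related via knowledge graph:", related)
```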

Data Access, Control, and Usage

My big takeaway from Scott Taylor’s Data Architecture session: data should be a shared, centralized asset for your entire organization; it must be 1) accessible by its consumers 2) in the format they require 3) via the method they require 4) if they have permission to access it (security) 5) and they will use it in a way that abides by governance standards and laws. Data scientists care about this because they need data to do their job, and any hurdle in accessing usable data makes it more likely they’ll avoid using official methods to access the data. Nobody has three months to wait for a data requisition from IT’s data warehouses to be turned around anymore; instead, “I’ll just use this data copy on my desktop” – or more likely these days, in a cloud-hosted data silo. Making centralized access easy to use makes data users much more likely to comply with data usage and access policies, which helps secure data properly, govern its use appropriately, and prevent data silos from forming.

Digging a bit more into the security and governance aspects mentioned above, it is surprisingly easy to identify individuals in a set of anonymized data. Matt Vogt of Immuta demonstrated this with a dataset of anonymized NYC taxi data, showing that individuals could still be identified even as more and more information was redacted from it. Jeff Jonas of Senzing took this further in his keynote – as context accumulates around data, it gets easier to make inferences, even when the data is far from clean. With GDPR on the table and CCPA coming into effect in nine months, how data workers can use data, ethically and legally, will shift, significantly affecting data workflows. Both the use of data and the results produced by black-box machine learning models will be challenged.
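
The re-identification risk is easy to see in miniature. The sketch below is a toy linkage attack in pandas: an “anonymized” trip table joined to a small piece of auxiliary public information on quasi-identifiers (pickup ZIP and date) attaches names to supposedly anonymous rows. The data is invented, and this is not the Immuta or Senzing demonstration itself.

```python
# Toy linkage-attack sketch: an "anonymized" trip table joined to auxiliary
# public data on quasi-identifiers re-identifies individuals. All rows invented.
import pandas as pd

anonymized_trips = pd.DataFrame({
    "pickup_zip": ["10001", "10019", "11201"],
    "pickup_date": ["2019-03-12", "2019-03-12", "2019-03-13"],
    "fare": [23.50, 8.75, 14.20],
})

# Auxiliary information, e.g. scraped from social media or public records.
public_sightings = pd.DataFrame({
    "name": ["A. Celebrity", "B. Executive"],
    "pickup_zip": ["10019", "11201"],
    "pickup_date": ["2019-03-12", "2019-03-13"],
})

# A simple join on the quasi-identifiers attaches names to "anonymous" trips.
reidentified = anonymized_trips.merge(public_sightings, on=["pickup_zip", "pickup_date"])
print(reidentified[["name", "pickup_zip", "pickup_date", "fare"]])
```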

Recommendations

Data scientists and machine learning practitioners should familiarize themselves with the broader data management ecosystem. Practitioners understand why dirty data is problematic, given that they spend most of their work hours cleaning data before they can do any machine learning model-building, but there are numerous tools available to help with this process and, potentially, to remove the need to repeat a cleaning job that has already been done. As enterprise data catalogs become more common, they will keep data scientists from spending hours on duplicative work when someone else has already cleaned the dataset they were planning to use and made it available to the organization.

Data scientists and data science managers should also learn how to communicate the business value of their data initiatives when speaking to business stakeholders. From a technical point of view, making a model more accurate is an achievement in and of itself; framing that improvement in business terms shows what the added accuracy or speed means for the business as a whole. Maybe your 1% improvement in model accuracy means you save your company tens of thousands of dollars by more accurately targeting potential customers who are ready to buy your product – that’s what will get the attention of your line-of-business partners.

Data science directors and Chief Data or Chief Analytics Officers should approach building their organization’s data strategy and culture with the long-term view in mind. Aligning the data strategy with the organization’s business strategy is crucial to success. Rather than having the data and business sides tugging in different directions, develop an understanding of each other’s needs and capabilities and apply that knowledge to keep everyone focused on the same goal.

Chief Data Officers and Chief Analytics Officers should understand their organization’s capabilities by assessing both the data skills and capacity available by individual and the general maturity of each data practice area (such as Master Data Management, Data Integration, and Data Architecture). Knowing the availability of both technical and people-based resources is necessary to develop a scalable set of data processes that delivers consistent results no matter which data scientist or analyst executes the process for any given project.

As part of developing their organization’s data strategy, Chief Data Officers and Chief Analytics Officers must work with their legal department to develop rules and processes for accumulating, storing, accessing, and using data appropriately. As laws like GDPR and the California Consumer Privacy Act start being enforced, data access and usage will come under much more scrutiny, and companies not adhering to the letter of those laws will find themselves fined heavily. Data scientists and data science managers working on projects that involve sensitive or personal data should talk to their general counsel to ensure they remain on the right side of the law.


Data Science and Machine Learning News Roundup, March 2019

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, Google, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Dataiku Releases Version 5.1 in Anticipation of AI’s Surge in the Enterprise

Dataiku released version 5.1 of its software platform. The release includes a GDPR framework for governance and control, as well as user-experience upgrades such as the ability to copy and reuse analytic workflows in new projects, support for coders to work in their preferred development environment from within Dataiku, and easier navigation of complex analytics projects whose data sources may number in the hundreds.

Being able to document when sensitive data is used, and to prevent inappropriate use of that data, is key for companies trying to work within GDPR and similar laws without losing significant funds to violations. Dataiku’s inclusion of a governance component within its data science platform distinguishes it from competitors, many of whom lack such a component natively, and enhances Dataiku’s attractiveness as a data science platform.

Domino Data Lab Platform Enhancements Improve Productivity of Data Science Teams Across the Entire Model Lifecycle

Domino announced three new capabilities for its data science platform. Datasets is a high-performance data store that makes it easier for data scientists to find, share, and reuse large data resources across multiple projects, saving time in the search process. Experiment Manager gives data science teams a system of record for ongoing experiments, making it easier to avoid unnecessary duplicate work. Activity Feed gives data science leads visibility into changes in any given project when they are tracking multiple projects at once. Together, these three collaboration capabilities enhance Domino users’ ability to do data science in a documented, repeatable, and mature fashion.

SAS Announces $1 Billion Investment in Artificial Intelligence (AI)

SAS announced a $1B investment in AI across three key areas: research and development, education initiatives, and a Center of Excellence. The goal is to enable SAS users to apply AI even without a significant baseline of AI skills, to help SAS users improve those skills through training, and to help organizations using SAS bring AI projects into production more quickly with the help of AI experts as consultants. A significant percentage of SAS users aren’t currently using SAS to perform complex machine learning and artificial intelligence tasks; helping these users get actual SAS-based AI projects into production enhances SAS’ ability to sell its AI software.

NVIDIA-Related Announcements

H2O.ai and SAS both announced partnerships with NVIDIA this month. H2O.ai’s Driverless AI and H2O4GPU are now optimized for NVIDIA’s Data Science Workstations, and NVIDIA RAPIDS will be integrated into H2O as well. SAS disclosed plans to expand NVIDIA GPU support across SAS Viya and to use these GPUs and the CUDA-X AI acceleration library to support SAS’ AI software. Both H2O.ai and SAS are using NVIDIA’s GPUs and CUDA-X to make certain types of machine learning algorithms run more quickly and efficiently.

These follow prior announcements about NVIDIA partnerships with IBM, Oracle, Anaconda, and MathWorks, reflecting NVIDIA’s importance in machine learning. With NVIDIA GPUs making up an estimated 70% of the world market share, data science and machine learning software programs and platforms need to be able to work well on the de facto default GPU.


At IBM Think, Watson Expands “Anywhere”

At IBM Think in February, IBM made several announcements around the expansion of Watson’s availability and capabilities, framing these announcements as the launch of “Watson Anywhere.” This piece is intended to provide guidance to data analysts, data scientists, and analytic professionals seeking to implement machine learning and artificial intelligence capabilities and evaluating the capabilities of IBM Watson’s AI and machine learning services for their data.

Announcements

IBM declared that Watson is now available “anywhere” – both on-prem and in any cloud configuration, whether private, public, singular, multi-cloud, or a hybrid cloud environment. Data that needs to remain in place for privacy and security reasons can now have Watson microservices act on it where it resides. The obstacle of cloud vendor lock-in can be avoided by simply bringing the code to the data instead of vice versa. This ubiquity is made possible via a connector from IBM Cloud Private for Data that makes these services available via Kubernetes containers. New Watson services that will be available via this connector include Watson Assistant, IBM’s virtual assistant, and Watson OpenScale, an AI operation and automation platform.

Watson OpenScale is an environment for managing AI applications that puts IBM’s Trust and Transparency principles into practice around machine learning models. It builds trust in these models by providing explanations of how said models come to the conclusions that they do, permitting visibility into what’s seen as a “black box” by making their processes auditable and traceable. OpenScale also claims the ability to automatically identify and mitigate bias in models, suggesting new data for model retraining. Finally, OpenScale also provides monitoring capabilities of AI in production, validating ongoing model accuracy and health from a central management console.

Watson Assistant lets organizations build conversational bot interfaces into applications and devices. When interacting with end users, it can perform searches of relevant documentation, ask the user for further clarification, or redirect the user to a person for sufficiently complex queries. Its availability as part of Watson Anywhere permits organizations to implement and run virtual assistants in clouds outside of the IBM Cloud.

These new services join other Watson services currently available via the IBM Cloud Private for Data connector including Watson Studio and Watson Machine Learning, IBM’s programs for creating and deploying machine learning models. Additional Watson services being made available for Watson Anywhere later this year include Watson Knowledge Studio and Watson Natural Language Understanding.

In addition, IBM also announced IBM Business Automation with Watson, a future AI capability that will permit businesses to further automate existing work processes by analyzing patterns in workflows for commonly repeated tasks. Currently, this capability is available via limited early access; general availability is anticipated later in 2019.

Recommendations

Organizations seeking to analyze data “in place” have a new option with Watson services now accessible outside of the IBM Cloud. Data that must remain where it is for security and privacy reasons can now have Watson analytics processes brought to it via a secure container, whether that data resides on-prem or in any cloud, not just the IBM cloud. This opens up Watson to enterprises in regulated industries like finance, government, and healthcare, as well as to departments where governance and auditability are core requirements, such as legal and HR.

With the IBM Cloud Private for Data connector enabling Watson Anywhere, companies now have a net-new reason to consider IBM products and services in their data workflow. While Amazon and Azure dominate the cloud market, Watson’s AI and machine learning tools are generally easier to use out of the box. For companies who have made significant commitments to other cloud providers, Watson Anywhere represents an opportunity to bring more user-friendly data services to their data residing in non-IBM clouds.

Companies concerned about the “explainability” of machine learning models, particularly in regulated industries or for governance purposes, should consider using Watson OpenScale to monitor models in production. Because OpenScale can provide visibility into how models behave and make decisions, concerns about “black box models” can be mitigated with the ability to automatically audit a model, trace a given iteration, and explain how the model determined its outcomes. This transparency boosts the ability for line of business and executive users to understand what the model is doing from a business perspective, and justify subsequent actions based on that model’s output. For a company to depend on data-driven models, those models need to prove themselves trustworthy partners to those driving the business, and explainability bridges the gap between the model math and the business initiatives.

Finally, companies planning for long-term model usage need to consider how they plan to support model monitoring and maintenance. Longevity is a concern for machine learning models in production: model drift reflects changes in the underlying data that your company needs to be aware of. How do companies ensure that model performance and accuracy are maintained over the long haul? What parameters determine when a model requires retraining, or should be taken out of production? Consistent monitoring and maintenance of operationalized models is key to their ongoing dependability.


Data Science and Machine Learning News Roundup, February 2019

On a monthly basis, I will be rounding up key news associated with the Data Science Platforms space for Amalgam Insights. Companies covered will include: Alteryx, Amazon, Anaconda, Cambridge Semantics, Cloudera, Databricks, Dataiku, DataRobot, Datawatch, Domino, Elastic, Google, H2O.ai, IBM, Immuta, Informatica, KNIME, MathWorks, Microsoft, Oracle, Paxata, RapidMiner, SAP, SAS, Tableau, Talend, Teradata, TIBCO, Trifacta, TROVE.

Four Key Announcements from H2O World in San Francisco

At H2O World in San Francisco, H2O.ai made several important announcements. Partnerships with Alteryx, Kx, and Intel will extend Driverless AI’s accessibility, capabilities, and speed, while improvements to Driverless AI, H2O, Sparkling Water, and AutoML focused on expanding support for more algorithms and heavier workloads. Amalgam Insights covered H2O.ai’s H2O World announcements.

IBM Watson Now Available Anywhere

At IBM Think in San Francisco, IBM announced the expansion of Watson’s availability “anywhere” – on-prem, and in any cloud configuration, whether private or public, singular or multi-cloud. Data no longer has to be hosted on the IBM Cloud to use Watson on it – instead, a connector from IBM Cloud Private for Data permits organizations to bring various Watson services to data that cannot be moved for privacy and security reasons. Update: Amalgam Insights now has a more in-depth evaluation of IBM Watson Anywhere.

Databricks’ $250 Million Funding Supports Explosive Growth and Global Demand for Unified Analytics; Brings Valuation to $2.75 Billion

Databricks has raised $250M in a Series E funding round, bringing its total funding to just shy of $500M. The funding round raises Databricks’ valuation to $2.75B in advance of a possible IPO. Microsoft joins this funding round, reflecting continuing commitment to the Azure Databricks collaboration between the two companies. This continued increase in valuation and financial commitment demonstrates that funders are satisfied with Databricks’ vision and execution.


Four Key Announcements from H2O World San Francisco

Last week at H2O World San Francisco, H2O.ai announced a number of improvements to Driverless AI, H2O, Sparkling Water, and AutoML, as well as several new partnerships for Driverless AI. The announcements provide incremental improvements across the platform, while the partnerships reflect H2O.ai expanding its audience and capabilities. This piece is intended to provide guidance to data analysts, data scientists, and analytic professionals working on including machine learning in their workflows.

Announcements

H2O.ai has integrated H2O Driverless AI with Alteryx Designer; the connector is available for download in the Alteryx Analytics Gallery. This will permit Alteryx users to implement more advanced and automatic machine learning algorithms in analytic workflows in Designer, as well as perform automatic feature engineering for their machine learning models. In addition, Driverless AI models can be deployed to Alteryx Promote for model management and monitoring, reducing time to deployment. Both of these new capabilities give Alteryx-using business analysts and citizen data scientists more direct and expanded access to machine learning via H2O.ai.

H2O.ai is integrating Kx’s time-series database, kdb+, into Driverless AI. This will extend Driverless AI’s ability to process large datasets, resulting in faster identification of more performant predictive capabilities and machine learning models. Kx users will be able to perform feature engineering for machine learning models on their time series datasets within Driverless AI, and create time-series specific queries.

H2O.ai also announced a collaboration with Intel that will focus on accelerating H2O.ai technology on Intel platforms, including the Intel Xeon Scalable processor and H2O.ai’s implementation of XGBoost, with the goal of making Driverless AI perform well on Intel globally. Accelerating H2O on Intel will help establish Intel’s credibility in machine learning and artificial intelligence for heavy compute loads. Other aspects of this collaboration include expanding the reach of data science and machine learning by supporting efforts to integrate AI into analytics workflows and using Intel’s AI Academy to teach relevant skills. The details of the technical projects will remain under wraps until spring.

Finally, H2O.ai announced numerous improvements to both Driverless AI and their open-source H2O, Sparkling Water, and AutoML, mostly focused on expanding support for more algorithms and heavier workloads among their product suite. Among the improvements that caught my eye was the new ability to inspect trees thoroughly for all of the tree-based algorithms that the open-source H2O platform supports. With concern about “black-box” models and lack of insight around how a given model performs its analysis and why it yields the results it does for any given experiment, providing an API for tree inspection is a practical step towards making the logic behind model performance and output more transparent for at least some machine learning models.
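
As a rough sketch of what that tree inspection looks like from the open-source H2O Python client, the snippet below trains a small GBM on invented data and pulls the structure of its first tree via the H2OTree class; exact attribute names can vary across H2O versions, so treat this as an illustrative outline rather than a reference implementation.

```python
# Hedged sketch of tree inspection with the open-source H2O Python client.
# The toy data is invented, and attribute names may differ by H2O version.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.tree import H2OTree

h2o.init()

# Tiny invented dataset: predict "bought" from age and visits.
frame = h2o.H2OFrame({
    "age":    [22, 35, 47, 52, 29, 41],
    "visits": [1, 4, 2, 8, 3, 6],
    "bought": ["no", "yes", "no", "yes", "no", "yes"],
})
frame["bought"] = frame["bought"].asfactor()

# min_rows lowered so the tiny illustrative frame can actually split.
gbm = H2OGradientBoostingEstimator(ntrees=3, max_depth=2, min_rows=1)
gbm.train(x=["age", "visits"], y="bought", training_frame=frame)

# Pull the first tree of the model and inspect its splits.
tree = H2OTree(model=gbm, tree_number=0)
print("Split features:", tree.features)      # feature used at each node
print("Split thresholds:", tree.thresholds)  # numeric threshold at each split
print("Left/right children:", tree.left_children, tree.right_children)
```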

Recommendations

Alteryx users seeking to implement machine learning models into analytic workflows should take advantage of increased access to H2O Driverless AI. Providing more machine learning capabilities to business analysts and citizen data scientists enhances the capabilities available to their data analytics workflows; Driverless AI’s existing AutoDoc capability will be particularly useful for ensuring Alteryx users understand the results of the more advanced techniques they now have access to.

If your organization collects time-series data but has not yet pursued analytics of that data with machine learning, consider trialing Kx’s kdb+ and H2O’s Driverless AI. With this integration, Driverless AI will be able to quickly and automatically process time-series data stored in kdb+, allowing swift identification of performant models and predictive capabilities.

If your organization is considering making significant investments in heavy-duty computing assets for heavy machine learning loads in the medium-term future, keep an eye on the work Intel will be doing to design chips for specific types of machine learning workloads. NVIDIA has its GPUs and Google its TPUs; by partnering with H2O, Intel is declaring its intentions to remain relevant in this market.

If your organization is concerned about the effects of “black box” machine learning models, the ability to inspect tree-based models in H2O, along with the AutoDoc functionality in Driverless AI, are starting to make the logic behind machine learning models in H2O more transparent. This new ability to inspect tree-based algorithms is a key step towards more thorough governance surrounding the results of machine learning endeavors.