If your organization already has a data scientist, but your data science workload has grown beyond their capacity, you’re probably thinking about hiring another data scientist. Perhaps even a team of them. But cloning your existing data scientist isn’t the best way to grow your organization’s capacity for doing data science.
Why not simply hire more data scientists? First, so many of the tasks listed above are actually well outside the core competency of data scientists’ statistical work, and other roles (some of whom likely already exist in your organization) can perform these tasks much more efficiently. Second, data scientists who can perform all of these tasks well are a rare find; hoping to find their clones in sufficient numbers on the open market is a losing proposition. Third, though your organization’s data science practice continues to expand, the amount of time your original domain expert is able to spend with the data scientist on a growing pool of data science projects does not; it’s time to start delegating some tasks to operational specialists.
What does a data science team look like? Most organizations start out with a line-of-business director who recognizes the need for a data scientist to find answers in the mass of company data, hires that data scientist, and then gives them some guidance on what the business wants to learn from its data – a data science team of two people. The data scientist engages with the data to understand it, get it into shape for analysis, extract relevant features, create a machine learning model, and then turn that model into the desired type of outputs so that the line-of-business director can act on the results. But when you’re looking to do data science in your organization at a larger scale, you need to build out these skill sets across multiple roles, evolving over time to construct a team of specialists.
Growing the Team
The first person you’re likely to add to your baseline data science team of a domain expert and a data scientist is a data analyst or business analyst. It may be a cliche that the majority of time spent on the technical aspects of doing data science involves data preparation, but it’s a key part of data analyst work. Gathering the data and ensuring both that it’s the right kind of data to answer the questions asked and that it’s in a properly structured form for an analysis to be effective is core data analyst work. If the model output is destined for a report or a dashboard, the data analyst can take care of this simple work as well, leaving data scientists more time to focus on their core competency: model creation.
As a project grows in complexity, it’s likely that the domain expert who recognized the strategic need for initiating data science projects will need to offload the day-to-day aspects of managing those projects on the business side to a departmental operations manager or project manager. That line-of-business manager will be responsible for keeping everybody on the project on the same page – you’re now up to a data science team of four people, but each of these people is able to more effectively work in the comfort zone of their specialties.
The project manager is also likely to be the key person interfacing with IT to request necessary compute infrastructure and other resources. If your organization has multiple data science projects occurring at the same time, especially once your organization is doing enough data science to require multiple data scientists and then multiple data science teams, they will need to share available technical resources, and an IT manager will need to manage the availability of those resources and determine how they are to be shared.
If the desired output of a model is not a report or dashboard, but an API, a microservice, or a component embedded in an app, data science teams will frequently add a software engineer specifically to bring that model into production in those forms. Often, models are created in Python or R, but need to be translated to a different language to work within the context of other software; a software engineer can usually do this translation and optimization better and faster than a data scientist can.
By the time your organization has multiple data scientists doing numerous data science projects across a number of data science teams, the need for a specific data science manager is clear. A data science manager manages data scientists, keeps track of your organization’s data science projects from a broader perspective, and is able to provide guidance and resources to your data scientists that the original two-person team didn’t have access to.
As data science work continues to scale across the organization, some companies are finding value in adding data engineers to the team: individuals with a background in big data management, designing the workflow of data to be more efficient across distributed systems by writing specialized software bringing together various frameworks to create this data pipeline. They may come from IT, or they may come from your organization’s software development department; they may even be a data scientist who’s found their niche is really the management of the data pipeline.
A modern, fully scaled-out data science team consists of about eight people: a data scientist, a data analyst, a manager or director-level domain expert who sees the strategic need for data science to answer specific questions and serves as the executive champion, a project manager from the line-of-business side to keep the team updated, a software engineer to generate coded outputs (whether batch analytics, real-time code, or something else), an IT manager to appropriately provision hardware resources, a data science manager who has oversight of multiple data scientists’ work across the organization, and a data engineer to control the data pipeline. Remember your original two-person team? Your data scientist was doing specialized work across at least five domains: data preparation (data analyst work), project management (project manager), provisioning compute resources (often just running models on their company laptop, maybe outsourcing to the cloud if they have the budget – IT work), putting models into production (either a data analyst or a software engineer depending on the outputs), and probably managing multiple data science initiatives from multiple departments. No wonder we keep referring to full-stack data scientists as unicorns!
If your organization is considering hiring its first data scientist, the business executive expressing the need to hire a data scientist for a given initiative will need to work directly with them. The data scientist will know how to manipulate the data, but may not have domain expertise in the particular area of study, and will not be aware of company-specific nuances for which the business executive can provide valuable context.
Remember that solo data scientists perform tasks that can often be delegated to other roles when time is short. Data analysts can do data prep and reporting work just as easily as data scientists can. Software engineers can translate models into custom apps faster. Data engineers focus on optimizing the data pipeline for multiple data scientists and data science projects. And project managers keep everyone in the loop, getting the whole process out of the data scientist’s head and into an accessible shared resource that can be used as a model to standardize the process for future data science projects and make them more efficient and repeatable.
When your organization’s data science workload has ramped up enough that your project needs are more than a solo data scientist can support, before hiring another data scientist, look to other departments that may already exist in your organization to offload key tasks from your data scientist that aren’t “analyze the data to create a working model.” For example, if your organization has been collecting significant data for a while, you likely have data analysts in-house performing some level of analytics on that data. They can do some of the prep and reporting work, allowing your data scientist to spend their time on the higher-value activity of model creation.
Consider your desired outputs. What form does your organization require these outputs to take? Do the results of a model need to appear in a report or dashboard? Or do those results need to be made available programmatically via an API, as a microservice, or embedded into an app? The business demands for the outputs determine the necessary form, and the role to which you can pass along the task of generating those outputs (data analysts for reports and dashboards, software engineers for coded options). For a model to be useful, it needs to be put into production.