How complete are existing Data Science methodologies?
In 'How successful are your Data Science projects?' we asked all stakeholders, whether sponsors who are investing, leaders who have to justify the investment or passionate data scientists, whether they were happy with the results of their Data Science efforts. For those wanting to improve on their answer, we recommended reviewing their current approach to Data Science projects.
In this article, we assess some existing Data Science methodologies to understand how they address the fundamental, yet not-so-obvious challenges facing Data Science:
- Data Science is experimental in nature – and this makes executives nervous when investing.
- Data Science is often executed as a technical exercise, without enough focus on how much value the Data Science idea will deliver and whether that value is believable.
- Data Science is tackled inconsistently within many organisations, lacking the appropriate guidelines, templates and accelerators, leading to inconsistent results and prolonged time to outcomes – be that success or failure.
We have chosen a representative sample of offerings, selected because they are either well established or popular, or because they offer substantial improvements over previous work:
- Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Microsoft’s Team Data Science Process (TDSP)
- Domino Data Science Lifecycle (DDSL)
CRISP-DM is presented as a lengthy reference manual and user guide, detailing the phases, tasks and outputs that should be executed during a Data Science project. The high-level structure and steps within each phase are clearly outlined, with details on the tasks as well as considerations and potential pitfalls.
[Figure: CRISP-DM process flow]
What we like
The high-level process flow of CRISP-DM is a good representation of what is required in a typical Data Science project, including the need to iterate between phases.
The user guide is extensive and describes important activities such as identifying business objectives and preparing a cost-benefit analysis.
Where we see gaps
The CRISP-DM process lacks checkpoints and key questions to help teams navigate it. Given the experimental nature of Data Science, it is critical that Data Science teams know how to measure their progress and how to decide whether they should proceed.
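To make the idea of a checkpoint concrete, here is a minimal sketch of what one could look like in code. It is not part of CRISP-DM or any of the methodologies reviewed here; the `Checkpoint` class, the `Decision` values, the example questions and the decision rule (all questions pass: proceed; all fail: stop; otherwise revisit) are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    REVISIT = "revisit an earlier phase"
    STOP = "stop"


@dataclass
class Checkpoint:
    """A hypothetical end-of-phase gate made up of yes/no key questions."""
    phase: str
    questions: list[str]
    answers: dict[str, bool] = field(default_factory=dict)

    def answer(self, question: str, ok: bool) -> None:
        # Only questions defined for this checkpoint may be answered.
        if question not in self.questions:
            raise KeyError(question)
        self.answers[question] = ok

    def decide(self) -> Decision:
        # Illustrative rule only: unanswered or partly failed -> revisit,
        # everything failed -> stop, everything passed -> proceed.
        if any(q not in self.answers for q in self.questions):
            return Decision.REVISIT
        failed = [q for q in self.questions if not self.answers[q]]
        if not failed:
            return Decision.PROCEED
        if len(failed) == len(self.questions):
            return Decision.STOP
        return Decision.REVISIT


if __name__ == "__main__":
    gate = Checkpoint(
        phase="Data Understanding",
        questions=[
            "Is the data quality sufficient to test the hypothesis?",
            "Is the expected business value still believable?",
        ],
    )
    gate.answer("Is the data quality sufficient to test the hypothesis?", True)
    gate.answer("Is the expected business value still believable?", False)
    print(gate.phase, "->", gate.decide().value)
```

In practice the questions and the decision rule would be agreed with business stakeholders rather than hard-coded; the point is only that progress and the decision to proceed become explicit.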
CRISP-DM is also light-touch on proving business value, with little guidance on how to go about quantifying the potential return on investment. There is very little mention of assessing the feasibility of the idea with respect to the impact on the current business process.
The Evaluation phase in CRISP-DM recommends an assessment against business objectives but provides no further guidance on what that could look like. There is little emphasis on assessing how an accurate statistical model might translate into realised value for the business, e.g. through a prototype demonstration for stakeholders.
CRISP-DM offers no accelerators beyond recommending which details should be captured for each task. Being tool-agnostic makes the methodology more general, but it also means a team must put in additional effort to execute it.
Microsoft’s TDSP is available as a documented Data Science lifecycle on Microsoft’s website, describing the phases, activities, roles and outputs that are expected during a Data Science project. Some tools and artefacts are available to help get projects going, along with tutorials and walkthroughs. The material frequently references relevant Microsoft Azure services, but the core process documentation is platform-agnostic.
[Figure: Microsoft TDSP - Data Science Lifecycle]
What we like
This methodology follows largely from CRISP-DM in terms of the high-level phases, and similarly makes it clear that the team may move between the phases in a very non-linear way. The activities are all appropriate for Data Science.
TDSP excels in the provision of accelerators. A standardised project file structure is available to download; it can be easily deployed for new projects and contains templates for the common types of documentation and artefacts that should be completed during a project. Tools are provided for automating basic exploratory data analysis and model training, and these emit standard artefacts that are ready to go for project delivery. Whilst these templates and tools are not particularly sophisticated and may not suit the workflows of all Data Science teams, they do provide a great starting point to help get projects going. They will be especially helpful for less experienced Data Science teams.
Where we see gaps
Although it indicates the need to iterate between phases, TDSP entirely lacks clear checkpoints and questions to guide teams through the process. The documentation gives no guidance on how to measure progress or decide how to proceed.
TDSP pays scant attention to the importance of proving business value. Although a Business Understanding phase is presented upfront, the emphasis is heavily on the technical aspects of defining data and modelling requirements, as opposed to understanding business objectives and success criteria. The implication is that understanding the potential return on investment and justifying the cost-benefit trade-off is a separate concern and not part of the Data Science process. This increases the risk associated with this type of work, because getting carried away with the technical work without a clear focus on business objectives is a frequent failure mode in Data Science.
TDSP does not encourage the team to think through what type of solution or product will be delivered to end users and whether the impact on the current process would even be feasible. There is no separate evaluation phase to assess the results against business objectives, only a technical evaluation of model performance prior to deployment.
The Domino Data Science Lifecycle is detailed in a whitepaper that Domino Data Lab has issued to provide guidance on best practices in executing Data Science projects. The guide takes the form of an end-to-end flowchart with guidance for each of the phases. We presume that the lifecycle is intended to accompany the use of the company's flagship product, the Domino Data Science Platform.
What we like
The flowchart describes high-level phases similar to those of CRISP-DM and TDSP, and it is clear about where the decision points are, indicating when the team may need to return to an earlier activity. At a high level this encourages the Data Science team to work through the right activities and pause for decisions at appropriate junctures.
The initial ideation phase of DDSL emphasises that the business problem must be carefully thought through before proceeding to the technical work. The guidance strongly encourages the team to quantify the potential return on investment as a way to understand whether the idea should be prioritised, and stresses that the potential change in business process needs to be mapped out so that the team can be confident that implementation would even be feasible.
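As a rough illustration of quantifying potential return on investment to prioritise ideas, the sketch below ranks candidate ideas by a simple expected ROI. This is not taken from the DDSL whitepaper; the `Idea` fields, the `expected_roi` formula (benefit weighted by an estimated probability of success, to reflect the experimental nature of the work) and all figures are hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    """A hypothetical Data Science idea with rough business estimates."""
    name: str
    annual_benefit: float       # estimated yearly benefit if the idea works
    delivery_cost: float        # estimated cost to build and deploy
    success_probability: float  # rough chance the experiment pays off (0 to 1)


def expected_roi(idea: Idea, years: int = 1) -> float:
    """Expected ROI = (p * annual_benefit * years - cost) / cost."""
    if idea.delivery_cost <= 0:
        raise ValueError("delivery_cost must be positive")
    expected_benefit = idea.success_probability * idea.annual_benefit * years
    return (expected_benefit - idea.delivery_cost) / idea.delivery_cost


def prioritise(ideas: list[Idea], years: int = 1) -> list[Idea]:
    """Order ideas from highest to lowest expected ROI."""
    return sorted(ideas, key=lambda i: expected_roi(i, years), reverse=True)


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    backlog = [
        Idea("churn model", annual_benefit=200_000, delivery_cost=50_000,
             success_probability=0.5),
        Idea("demand forecast", annual_benefit=100_000, delivery_cost=80_000,
             success_probability=0.5),
    ]
    for idea in prioritise(backlog):
        print(f"{idea.name}: expected ROI {expected_roi(idea):.2f}")
```

A real assessment would also need to account for the cost of changing the business process and for how believable the benefit estimate is; the sketch only shows that the comparison can be made explicit before technical work begins.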
The guidance recommends thinking carefully through the form of the final deliverable to ensure the end goal is concrete and to improve the odds of effective adoption; mock-ups and early engagement with product delivery teams are encouraged.
Where we see gaps
Despite clearly indicating where the decision points are, DDSL is less clear on how those decisions should be made, for example through key questions that keep the focus on business outcomes. The flowchart also does not acknowledge the more significant decisions that may need to be made, such as rethinking the hypothesis should adverse findings arise, e.g. the data being of low quality.
The final evaluation places little emphasis on assessing the results against business objectives. There is a decision point to validate with business stakeholders, but no further guidance, and the focus is mainly on technical validation.
With regard to accelerators, DDSL itself is a guide outlining the process and does not come with tools or templates, but the Domino Data Science Platform does explicitly aim to accelerate the Data Science process, in particular model training, experiment tracking and deployment. This platform would not be an option for all Data Science teams, however, and the platform and the lifecycle guide are not tightly integrated.
It is important to recognise that all methodologies reviewed here contribute towards the goal of helping Data Science succeed. However, there are still some gaps in the available offerings. The table below summarises our assessment of these Data Science methodologies with respect to the key areas of focus.
In 'How to boost confidence in your Data Science investments' we discuss how we have incorporated lessons learnt from delivering Data Science projects into our approach to closing the gaps identified above. This approach has evolved through exposure to a range of industries and to organisations at varying levels of Data Science maturity.
- CRISP-DM Reference Manual
- Microsoft Team Data Science Process Documentation
- Domino Data Lab Whitepaper