27 Oct Doing Data Science with SAP Predictive Analytics – Overview
In this series, we explore the data science processes involved in the SAP BusinessObjects Predictive Analytics suite. The series is aimed at readers who are familiar with the SAP technology stack, but the themes and processes discussed are universal and should appeal to a broad range of readers interested in data science, predictive analytics and how these workflows can be integrated into enterprise environments.
SAP have presented this suite of tools to appeal to business and data analysts who do not have extensive data science experience. The product overview claims that using the suite will increase the accuracy with which analysis can be performed, and increase ROI by reusing existing datasets in numerous ways. This is all done through a web wizard with a Graphical User Interface (GUI), negating the need for extensive coding knowledge from a dedicated data scientist.
The product overview tells the casual observer that the Predictive Analytics suite will give them robust and accurate insights about their business in “days rather than weeks, by automating the entire predictive lifecycle from dataset preparation through to model training and scoring.” The suite is advertised as an all-purpose guided analytics workbench that enables more users to get more solid analytical information out of a single dataset.
We will explore the suite through the lens of a specific case study involving AdventureWorks, a fictional cycling wholesaler data set from Microsoft. The case study creates a realistic and relatable context so that the exploration is grounded and the issues presented are plausible.
We will be working with a few different tools as part of the exploration in this series. In addition to using the Automated Analytics component of Predictive Analytics (composed of the Data Manager, Data Manipulation Editor and several different modelling tools), we will also use RStudio and HANA Studio as a comparative approach to the data science process.
DATA SCIENCE PROCESS
Before getting stuck into this case study, it’s important to outline the data science process itself. This process can (and ought to) be interpreted differently in different industries, but we will define it for a proof-of-value type scenario that should be broadly applicable. Typically, the process will follow some or all of the following steps:
- Getting relevant data
  - Often the hardest part of data science is physically getting to the data, which can be difficult for security, technology and bureaucratic reasons. The process of understanding the data once it has been obtained can also be challenging since it will often be undocumented, no data model may exist, or the column and table names may be incomprehensible. This article will assume the data has been retrieved, cleaned and made “nice”, which is almost never the case in reality.
- Exploration and transformation of data using statistics and visualisations
  - This is often broken up into several discrete steps but our data scientists mesh them all together
  - A data scientist will spend most of their time here, producing visualisations and complex data transformations
- Developing a model
  - This may be, for example, a machine learning model
- Transferring the data flow and model logic into a production environment
  - A related issue is model upkeep and retirement
In this overview we set aside the modelling and machine learning aspect of the process: it is typically the smallest and easiest part of the workflow, and we will cover it more comprehensively in future instalments.
CASE STUDY: ADVENTUREWORKS
To undertake this exploration of data science workflows we will be using the AdventureWorks data set provided by Microsoft. This data set represents the complete data platform of a fictitious cycling parts wholesaler. The data set contains 68 interconnected tables relating to customers, sales, production, etc. We will work with just a small subset of these tables.
We have constructed a fairly straightforward predictive scenario relating to customer behaviour. Based on customer demographics and past purchasing behaviour, we would like to be able to predict whether the customer is likely to purchase again in the near future. The data set and scenario are typical of an organisation of this nature. We have deliberately chosen such a problem so that we can explore the data science workflow in a plausible real-world context.
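To make the prediction target concrete, here is a minimal sketch of how a "purchases again in the near future" label might be derived from purchase history. The purchase records, the cutoff date and the 90-day horizon are all assumptions made for illustration; the scenario above does not fix what "near future" means, and the function name `label_repeat_buyers` is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical purchase history: (customer_id, purchase_date).
# These rows are invented and are not taken from AdventureWorks.
purchases = [
    (1, date(2013, 1, 10)),
    (1, date(2013, 7, 2)),
    (2, date(2013, 3, 5)),
    (3, date(2013, 6, 20)),
    (3, date(2013, 9, 30)),
]


def label_repeat_buyers(purchases, cutoff, horizon_days=90):
    """Return {customer_id: 1 or 0} for customers seen on or before the cutoff.

    A customer is labelled 1 if they purchase again within horizon_days
    after the cutoff date, otherwise 0. The 90-day default is an assumption.
    """
    window_end = cutoff + timedelta(days=horizon_days)
    labels = {}
    # First pass: every customer with history up to the cutoff starts at 0.
    for customer_id, when in purchases:
        if when <= cutoff:
            labels[customer_id] = 0
    # Second pass: flip to 1 when a purchase falls inside the target window.
    for customer_id, when in purchases:
        if customer_id in labels and cutoff < when <= window_end:
            labels[customer_id] = 1
    return labels


labels = label_repeat_buyers(purchases, cutoff=date(2013, 6, 30))
print(labels)
```

Splitting the history at a cutoff date keeps the model's inputs (behaviour up to the cutoff) separate from the outcome it is asked to predict (behaviour after it), which helps avoid leaking future information into the features.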
It is rarely the case that data is sitting there in a single table, ready to be plugged into a model. In practice, a great deal of preparation, cleaning and transformation is necessary to bring the predictive gold to the surface. Roughly speaking, getting the data set ready for modelling involves joining tables, performing aggregations, normalising values and applying other, more complex functions. We describe this process in more detail in the next section.
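As a rough illustration of the kinds of transformation mentioned above, the sketch below joins a customer table to a sales table, aggregates spend per customer and min-max normalises the result. The table layouts, column names and values are invented for illustration and are not the real AdventureWorks schema; the helper `build_features` is hypothetical, and only the Python standard library is used.

```python
from collections import defaultdict

# Hypothetical, hand-made rows; not the real AdventureWorks schema.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
    {"customer_id": 3, "region": "North"},
]
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
]


def build_features(customers, sales):
    """Join, aggregate and normalise into one row per customer."""
    # Aggregate: total spend and order count per customer.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sale in sales:
        totals[sale["customer_id"]] += sale["amount"]
        counts[sale["customer_id"]] += 1

    # Left join: keep every customer, even those with no sales.
    rows = [
        {
            "customer_id": c["customer_id"],
            "region": c["region"],
            "total_spend": totals.get(c["customer_id"], 0.0),
            "order_count": counts.get(c["customer_id"], 0),
        }
        for c in customers
    ]
    if not rows:
        return rows

    # Min-max normalise total spend into the range [0, 1].
    spends = [r["total_spend"] for r in rows]
    low, high = min(spends), max(spends)
    span = (high - low) or 1.0  # avoid division by zero when all values match
    for r in rows:
        r["spend_scaled"] = (r["total_spend"] - low) / span
    return rows


features = build_features(customers, sales)
for row in features:
    print(row)
```

The left join is a deliberate choice here: customers with no recorded sales still appear in the output with zero spend, which matters when the goal is to score every customer rather than only those who have already bought.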