Guiding Principles For Data Science | Peak Indicators
In the world of user experience and Human-Computer Interaction, there is an oft-quoted saying, “Know Thy User”, that highlights the importance of understanding your audience, getting to know them and ultimately designing for them. This has become one of the guiding principles of good user interface design. Recently at Peak Indicators I have been involved in a number of data science projects for clients, and I want to share what I have learned about doing data science in practice.
In the spirit of the usability quote, I have summarised my experiences into three sayings or guiding principles (presented in no particular order): Know Thy Data, Know Thy Business and Know Thy Algorithm.
Know Thy Data
This guiding principle reflects the need to understand your data. Often at the start of a project, I will work with a sample of data and get my hands dirty by “playing” with it. Typically, I use a combination of data visualisation, descriptive summaries of individual variables (e.g. a five-number summary) and statistical analysis of multiple variables (e.g. correlation), often in the form of exploratory data analysis (EDA). For EDA, I tend to use existing libraries that provide automated functions (commonly referred to as AutoEDA).
Getting to know your data will start to reveal characteristics, problems and trends, such as the presence of missing and null values, extreme outliers and anomalies, what data types are present, whether there are associations or correlations between variables, the distribution or spread of the data (e.g. whether it is “normal”), and the range of values for categorical data. You may find, for example, that the values for a variable are constant, which is not very informative for prediction tasks. Understanding the data may lead to some kind of corrective action, such as filling in missing values, removing outliers or correcting errors.
Getting to know and understand your data will also continue throughout the lifetime of the project and may involve analysing enriched subsets, such as new variables created through feature engineering for use by machine learning. I also find it helpful to do a close inspection of the data: select specific cases and work through the values, typically trying to work out how they reflect the underlying business process (see “Know Thy Business”).
As an output from this stage, I have found it helpful to generate a summary document of the (sample) data comprising a definition of each variable (which may require input from the business) and its data type (i.e. similar to a data dictionary), a five-number summary for each variable, plots showing the distribution of variable values, and plots showing relationships and correlations between variables. The output can also provide useful insights for the business regarding data quality.
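As a rough illustration of the per-variable summaries mentioned in this section, the sketch below computes a missing-value count, a constant-value flag and a five-number summary for each column. It is only a minimal example using Python's standard library (3.8 or later); the function name `summarise_column`, the sample columns and the choice of the "inclusive" quartile method are all assumptions made for illustration, and in practice an AutoEDA library would do far more.

```python
import statistics


def summarise_column(values):
    """Summarise one numeric column, where None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    summary = {
        "n_missing": len(values) - len(present),
        # A column with at most one distinct value carries no signal for prediction.
        "is_constant": len(set(present)) <= 1,
    }
    if len(present) >= 2:
        # Quartiles via linear interpolation between observed values (an assumed choice).
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        summary["five_number"] = (present[0], q1, median, q3, present[-1])
    return summary


# Hypothetical sample data, invented purely for this example.
sample = {
    "age": [23, 35, None, 41, 29, 35],
    "region_code": [1, 1, 1, 1, 1, 1],
}

report = {name: summarise_column(column) for name, column in sample.items()}
for name, summary in report.items():
    print(name, summary)
```

A report like this can be pasted into the summary document alongside the variable definitions, and constant or heavily incomplete columns can be raised with the business early on.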
Know Thy Business
This principle captures the need to understand the context in which the data is being generated and used, such as business processes. In addition, finding out the technology stack being used, more general business needs and requirements, and cultures and organisational attitudes to data-driven methods can help. The more you can discover about the business, the more you can understand the data and interpret the outputs of your analysis or modelling. I have found that regular meetings with the client, reviewing relevant company documents and talking to people involved in creating (and using) the data can all help with getting to know the business.
In addition to understanding the wider organisational context, it is vital to fully understand the objectives of the particular problem being modelled. The company may express their needs in their own words, but you have to translate or map this into data science language and tasks. I have also found that reviewing relevant past work or case studies can help to provide useful background material, as well as identifying potentially appropriate approaches that can be used to generate solutions. This can help prevent you from reinventing the wheel and provide a foundation upon which to build: “standing on the shoulders of giants” and all that.
By way of example, I have recently been working with a company that wanted to identify which users of their website were likely to disengage and stop interacting with its services. This is similar to the existing problem of churn prediction. Reviewing the relevant scientific and business literature enabled me to look for common ways of defining and modelling the problem. My understanding of the data evolved over time following discussions with members of the organisation responsible for managing the data, as did my understanding of the users and their interactions with the system (the data-generating processes). By better understanding the client and their needs, we have been able to propose a tailored solution.
Know Thy Algorithm
My third guiding principle reflects the need to understand the algorithms, methods and processes you use to analyse and model the data. It also reflects the need to be scientific or principled in the way you approach the problem, seeking to produce reliable, valid and, ultimately, successful solutions. For example, when building a predictive model, you should seek to minimise bias and over-learning (overfitting) effects through the use of training and test sets and cross-validation. Reporting results on a completely unseen dataset (a hold-out set) gives you a better idea of how your model will perform in practice. Or when selecting thresholds, you should try different cut-off values and select the most appropriate one, rather than just going with what feels right.
Many approaches, particularly statistical methods and machine learning, rely on assumptions and presume certain conditions are met. For example, some statistical tests that are used to assess differences between two or more samples of data assume that the data are numeric and follow a certain distribution (e.g. a “normal” distribution). If your data breaks these assumptions, you need to select a more appropriate method; otherwise, your findings are questionable. It is also important to understand the assumptions and characteristics of the methods used to build predictive models, such as machine learning. Some methods, such as artificial neural networks, can be used to model complex data and problems; however, they typically require large volumes of data to work successfully. Some methods assume a certain relationship (for example, a linear one) between the target variable (what you want to predict) and the predictor variables (what you use to predict), and that the predictor variables are independent with no interactions.
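One way to make the advice in this section about training and test sets, cross-validation and a hold-out set concrete is to split the row indices up front, before any modelling starts. The sketch below is an illustration using only Python's standard library; the function name `holdout_and_folds`, the 20% hold-out fraction, the five folds and the fixed seed are hypothetical choices rather than settings from any project described here.

```python
import random


def holdout_and_folds(n_rows, holdout_fraction=0.2, k=5, seed=0):
    """Split row indices into a hold-out set plus k train/validation splits."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible

    # Rows set aside here are never used for fitting or tuning.
    n_holdout = int(n_rows * holdout_fraction)
    holdout, development = indices[:n_holdout], indices[n_holdout:]

    # Deal the remaining rows into k folds of near-equal size.
    folds = [development[i::k] for i in range(k)]

    splits = []
    for i, validation in enumerate(folds):
        # Train on every fold except the one currently used for validation.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return holdout, splits


holdout, splits = holdout_and_folds(n_rows=100)
print(len(holdout), [len(validation) for _, validation in splits])
```

Model selection and threshold tuning would use only the `splits`; the `holdout` rows are scored once at the end to report how the chosen model might perform in practice.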
In practice, I tend to follow a standard process for running machine learning experiments, whereby I test a range of data pre-processing steps (e.g. dealing with missing values, scaling numeric variables), algorithms and parameter settings. I then compare the results, typically favouring simpler approaches over more complex ones (an example of applying the “Occam’s razor” principle).
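The comparison step can be sketched as a simple rule: among the candidates whose average cross-validated score is within a small tolerance of the best, pick the simplest. This is only one possible way to encode a preference for simpler approaches, not a description of the process actually used; the function name `pick_simplest_competitive`, the `complexity` ranking, the tolerance and the scores below are all invented for illustration.

```python
from statistics import mean


def pick_simplest_competitive(results, tolerance=0.02):
    """Return (name, mean score) of the simplest candidate close to the best score.

    Each result is a dict with 'name', 'complexity' (lower means simpler)
    and 'fold_scores' (one score per cross-validation fold, higher is better).
    """
    scored = [(result, mean(result["fold_scores"])) for result in results]
    best_score = max(score for _, score in scored)
    # Keep every candidate whose mean score is within `tolerance` of the best.
    competitive = [(r, s) for r, s in scored if s >= best_score - tolerance]
    # Among those, prefer the lowest complexity rank.
    chosen, score = min(competitive, key=lambda pair: pair[0]["complexity"])
    return chosen["name"], score


# Hypothetical experiment results, invented purely for this example.
experiments = [
    {"name": "logistic_regression", "complexity": 1, "fold_scores": [0.81, 0.79, 0.80]},
    {"name": "random_forest", "complexity": 2, "fold_scores": [0.82, 0.80, 0.81]},
    {"name": "neural_network", "complexity": 3, "fold_scores": [0.82, 0.83, 0.80]},
]

name, score = pick_simplest_competitive(experiments)
print(name, round(score, 3))
```

The tolerance is a judgement call: a value of zero always returns the top scorer, while a larger value trades a little accuracy for a model that is easier to explain and maintain.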
In sum, experience has taught me the importance of investing time in properly understanding your data, the underlying data-generating processes and business needs, and understanding the best approaches and processes to analyse and model the data. I hope that these three guiding principles may also help you as you approach data science tasks.
Originally published at https://www.peakindicators.com on December 20, 2022.