Blog 2 in a 3-Part Series on the Analysis of Bias-Filled Data
Though most people associate the ability to predict the future with their neighborhood fortune-teller, customer experience practitioners are often in the business of forecasting customer behavior. Several flavors of regression model do this job well, using current customers’ survey responses to gain insight into how later customers will act.
Unfortunately, self-selection bias (a form of systematic bias outlined in the first blog in this series) violates one of the classical assumptions of regression modeling: that your sample is representative of the population in question. So how does one keep this issue out of the data in the first place? And when it is already there, how can the practitioner identify its presence before starting any analysis?
The lessons learned here all center around the concept of response propensity (RP), which is a customer’s likelihood of responding to the survey. This can be based on, among other things, cultural/geographical factors and communication hindrances (whether this customer is likely to be responsive to email or inundated by their inbox, for example).
Much like pre-treating stains on laundry before tossing it in the washer, pre-treating your survey design to account for differences in RP can result in a cleaner dataset. Though RP is usually calculated after a survey has been administered, past surveys can tell you how RP is distributed within your survey’s population. Have you found that Decision Makers are less likely than End Users to respond? Perhaps that group needs targeted reminders whose language emphasizes how much their responses matter. Or maybe past projects have shown that one region has a particularly low response rate, which translates into low RPs for its members. Offering incentives tailored to that demographic could yield the response rates you need to treat today’s responses as reliable predictors of tomorrow’s.
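As a sketch of that pre-treatment step, the snippet below estimates each segment’s historical RP from a past send/response log and flags low-propensity segments as candidates for targeted reminders or incentives. The segment names, the log, and the threshold are all hypothetical; substitute your own survey history.

```python
from collections import Counter

# Hypothetical historical log of survey invitations: (segment, responded) pairs.
history = [
    ("Decision Maker", False), ("Decision Maker", False), ("Decision Maker", True),
    ("End User", True), ("End User", True), ("End User", False), ("End User", True),
]

sent = Counter(seg for seg, _ in history)            # invitations per segment
responded = Counter(seg for seg, ok in history if ok)  # responses per segment

# Estimate each segment's response propensity from past behavior.
rp = {seg: responded[seg] / sent[seg] for seg in sent}

# Flag segments whose historical RP falls below a chosen threshold,
# as candidates for targeted reminders or personalized incentives.
LOW_RP_THRESHOLD = 0.5  # assumption: tune this to your own survey program
low_rp_segments = [seg for seg, p in rp.items() if p < LOW_RP_THRESHOLD]

print(rp)
print(low_rp_segments)
```

In a real program the log would come from your survey platform rather than a hard-coded list, and RP could be modeled on richer attributes (region, channel, role) instead of segment alone.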
Regardless of the steps you take to pre-treat your survey design, you must still gauge the extent to which this bias exists in your data. The most common approach is to compare response rates across the subgroups of every variable liable to influence RP. If any subgroup’s response rate is statistically significantly different from another’s, you will need to correct for this bias before performing any predictive analytics. This method is not foolproof: it assumes that every factor that impacts RP is recorded for the full population of survey recipients, non-respondents and respondents alike. Tracking any potentially relevant variables for the entire customer base can therefore help you identify self-selection bias and pin down exactly how it impacts your data. You’ll then be ready to attack the self-selection problem head on (how? I’ll explain in the next entry) and use your data as a crystal ball for future customer behavior.
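One way to run that subgroup comparison is a two-proportion z-test on response rates. The counts below are invented for illustration, and this sketch uses the normal approximation; a chi-square test on the full contingency table would serve equally well when you have more than two subgroups.

```python
from math import sqrt, erf

def two_proportion_z_test(resp_a, sent_a, resp_b, sent_b):
    """Two-sided z-test for a difference in response rates between two subgroups."""
    p_a, p_b = resp_a / sent_a, resp_b / sent_b
    p_pool = (resp_a + resp_b) / (sent_a + sent_b)        # pooled response rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, computed via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 40 of 400 Decision Makers vs. 90 of 500 End Users responded.
z, p = two_proportion_z_test(40, 400, 90, 500)
if p < 0.05:
    print(f"Response rates differ significantly (z={z:.2f}, p={p:.4f})")
```

A significant result here is the signal that self-selection bias is likely present and needs correcting before any predictive modeling.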
Which tools or techniques do you use to pre-treat for self-selection bias? Do you see different response rates for different groups or customer segments?