Fundamental Principle of Data Science
Assume the data matters.
It’s so obvious that it’s rarely stated. But look closely and you’ll find this principle violated in many data analyses. I’ll go out on a limb and say it’s violated in most or even almost all data analyses.
The principle is violated almost every time an outlier is removed or winsorized (i.e., deleted or capped), a missing value is imputed (i.e., made up), or a strange pattern is justified (i.e., rationalized) by adding extra assumptions.
I say almost every time and not every time, because sometimes deleting, making up, and rationalizing are the right thing to do. That’s why data analysis is so hard. The “science” part of “data science” isn’t about learning a few Python or R commands, or how to tabulate data in Tableau. The “science” is about knowing what and what not to do.
But it’s mostly about what not to do.
The principle seems to be much better internalized by those who know very little about data than by bona fide “data scientists” or statisticians. It’s similar to the disagreement between scientists who advocate for spraying chemicals into the air to stop global warming and everyone else who knows that’s a terrible idea. Just as so-called climate scientists believe their expertise gives them control over the climate, so-called data scientists believe their expertise gives them control over data. The non-experts understand that nature (whether the climate or the data) is in charge.
The fundamental principle above could therefore be restated as
Fundamental Principle of Data Science (restated)
The data is the data.
With this simple point of view, we realize that outliers, missing values, weird patterns, etc. are part of the data. They shouldn’t just be thrown out, filled in, or ignored as unexplained anomalies. On the contrary, the outliers, missing values, and weird patterns we find in data often contain more information than hundreds or thousands of other “well-behaved” data points.
Outliers
We have a data set of 10,000 observations, each with x (independent variables) and y (a response). For 9,997 observations, the y value is between -10 and 10. For 3 of the observations, the y value is larger than 100. What does this indicate?
A naive approach would immediately declare these 3 observations as outliers and remove them from the analysis.
By definition, they are “outliers”, as they are very extreme values that represent only 0.03% of the data. Technically, there are outliers in every dataset — there are always values at the extremes. But they’re still part of the data.
What justification is there for this “otherizing” of outliers? There is no logic underlying the idea that observations should be thrown away just for being different from the rest.
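To make this concrete, here is a minimal sketch in Python of flagging and inspecting the extreme observations instead of deleting them. The data are simulated to roughly match the scenario above, and every name in it (df, the columns x and y, the extreme_y flag) is hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulate a data set roughly shaped like the one described: 10,000
# observations, y mostly between -10 and 10, with three values above 100.
df = pd.DataFrame({"x": rng.normal(size=10_000)})
df["y"] = 2 * df["x"] + rng.normal(size=10_000)
df.loc[[17, 4242, 9001], "y"] = [150.0, 210.0, 480.0]

# Flag the extreme responses instead of deleting them.
df["extreme_y"] = df["y"] > 100
print(f'{df["extreme_y"].sum()} of {len(df)} observations have y > 100')

# Look at the flagged rows in full to see which x values produced them.
print(df.loc[df["extreme_y"]])
```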
Whether an outlier should be removed depends on some important questions:
What is causing these outliers to occur? Is it a data entry error or do they represent actual extreme observations that occur infrequently?
If a data entry error, what is the source of the error? Can it be fixed?
If an extreme observation, how can these observations be modeled in the context of the entire dataset?
How could removal of these observations adversely impact conclusions drawn from the analysis?
How do these extreme observations relate to the questions we’re trying to answer with this analysis?
And so on.
If there is any possibility that the outliers matter, i.e., that they will affect our conclusions for the current analysis (and often it’s impossible to know for sure), then it’s best to avoid throwing out any observations, except in very rare circumstances.
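One quick way to gauge that possibility is to run the analysis with and without the flagged rows and compare. Continuing the hypothetical sketch above (same df, columns, and extreme_y flag), a sensitivity check on a simple linear fit might look like this:

```python
import numpy as np

def fitted_slope(data):
    # np.polyfit returns [slope, intercept] for a degree-1 fit
    return np.polyfit(data["x"], data["y"], 1)[0]

slope_all = fitted_slope(df)
slope_trimmed = fitted_slope(df.loc[~df["extreme_y"]])

print(f"slope with all data:          {slope_all:.3f}")
print(f"slope without the 3 outliers: {slope_trimmed:.3f}")
# A large gap means the outliers drive the conclusion, which is exactly
# the case where silently dropping them is most dangerous.
```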
Missing values
We have a data set of 10,000 observations x and y. The y value is missing for 400 observations. In this case, the missing observations represent 4% of the data, which is too much to simply ignore as outliers or rare events. What should we do about it?
There are two predominant ways to handle this:
Throw out missing observations
Impute missing observations
Before throwing out missing data:
Is the distribution of x the same for the missing values of y as for the non-missing values?
How familiar are we with the data collection process? Are there aspects of this process that could have led to a bias in the appearance of missing values?
How likely is it that there is an unobserved variable (i.e., a variable that is not part of x) whose distribution does vary between the missing values of y and the non-missing values?
The questions should be addressed in order. To have any possible justification for throwing away the missing values, the answers need to be:
Yes.
Very familiar. I collected the data myself or oversaw the process by which the data was collected. I am able to explain exactly why these few values are missing, and the explanation does not suggest any possible way of introducing bias into the results.
Very unlikely, with a clear explanation why.
If we come up with any different answers, then it’s best to stop analyzing and figure out a way to go get those 400 missing observations.
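For the first question on that list, a minimal sketch of the check might look like the following. The data are simulated to match the scenario above, all names are hypothetical, and the Kolmogorov-Smirnov test is just one convenient way to compare the two distributions of x:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulate the scenario described: 10,000 observations, y missing for 400.
df = pd.DataFrame({"x": rng.normal(size=10_000)})
df["y"] = 2 * df["x"] + rng.normal(size=10_000)
df.loc[rng.choice(10_000, size=400, replace=False), "y"] = np.nan

missing = df["y"].isna()

# Compare summary statistics of x where y is missing vs. observed.
print(df.groupby(missing)["x"].describe())

# Two-sample KS test: a small p-value suggests the distribution of x
# differs where y is missing, i.e., the missingness may not be ignorable.
stat, p_value = ks_2samp(df.loc[missing, "x"], df.loc[~missing, "x"])
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```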
As far as imputation goes, let’s save that for another post.
Weird patterns
Finding unintuitive behaviors in data is one of the reasons we analyze data in the first place. If the data just confirms whatever we already expected, then it doesn’t add any value. Finding strange patterns can indicate a new finding, a violation of a modeling assumption (such as stationarity or independence), or a data error. All of these are important and should be celebrated, not ignored.
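As one small example, here is a hedged sketch of a quick check for a violated independence assumption: the lag-1 autocorrelation of residuals from a simple linear fit. It reuses the hypothetical df and column names from the sketches above (dropping rows with missing values first):

```python
import numpy as np

# Use complete cases for this quick check.
cc = df.dropna(subset=["x", "y"])

# Residuals from a simple least-squares fit of y on x.
slope, intercept = np.polyfit(cc["x"], cc["y"], 1)
residuals = cc["y"] - (slope * cc["x"] + intercept)

# Lag-1 autocorrelation of the residuals: values far from zero suggest the
# observations are not independent (e.g., a time ordering we have ignored),
# which is worth investigating rather than explaining away.
print(f"lag-1 residual autocorrelation: {residuals.autocorr(lag=1):.3f}")
```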
Wow, otherizing the outliers and just tossing them? That's like looking at market performance by year and saying, "Welp, we'll just delete the data points for 1931, 1937, and 2008, those can't be important. Must be an error."
I don't think a finance modeler would do that, but in other situations outliers *do* get tossed. What do you think the difference is?