The Data is the Data

Feb 7, 2023

The Most Fundamental Principle of Data Science

4 Comments

Mar 18, 2023

Wow, otherizing the outliers and just tossing them? That's like looking at market performance by year and saying, "Welp, we'll just delete the data points for 1931, 1937, and 2008, those can't be important. Must be an error."

I don't think a finance modeler would do that, but in other situations/scenarios outliers *do* get tossed. What do you think the difference is?

Harry Crane

Mar 22, 2023

I think plenty of finance modelers do exactly that. It's a delicate situation: when the model is misspecified (which it almost always is), including huge outliers in the data can mess up the model's fit for all of the data. But removing them and forgetting about them will understate uncertainty and overstate confidence that a big outlier will never happen again.
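
A minimal sketch of that trade-off, under loudly stated assumptions: the annual returns below are invented for illustration, the normal model is deliberately misspecified, and the -20 cutoff and -30 crash level are arbitrary choices. None of these numbers or names come from the post. The sketch fits the same model with and without the two crash-like years, then compares the fitted mean, the spread, and the implied probability of another large loss.

```python
import math
import statistics

# Hypothetical annual returns in percent (made up for illustration).
# The two large losses stand in for crash years.
returns = [8, 12, -3, 15, 6, -40, 10, 4, 9, -2, 11, 7, -35, 5, 13]

# "Toss the outliers": drop anything below an arbitrary cutoff.
OUTLIER_CUTOFF = -20
trimmed = [r for r in returns if r >= OUTLIER_CUTOFF]


def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_and_score(data, crash_level=-30):
    """Fit a normal model by sample mean and sample standard deviation,
    then return (mean, std dev, P(return <= crash_level)) under that fit."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return mu, sigma, normal_cdf(crash_level, mu, sigma)


mu_full, sigma_full, p_full = fit_and_score(returns)
mu_trim, sigma_trim, p_trim = fit_and_score(trimmed)

print(f"all data: mean={mu_full:.2f} sd={sigma_full:.2f} P(crash)={p_full:.2e}")
print(f"trimmed:  mean={mu_trim:.2f} sd={sigma_trim:.2f} P(crash)={p_trim:.2e}")
```

With the crash years included, the fitted mean is pulled down and the spread inflated for every year, so the model describes the ordinary years poorly. With them removed, the ordinary years fit better, but the model assigns a far smaller probability to a loss of the size that was just deleted.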

Stephanie Losi

Mar 22, 2023

It seems bizarre to me that, even knowing the context of those years being important, the data points could get tossed, rather than adjusting the model spec or at least being able to toggle those data points in and out.

Harry Crane

Mar 30, 2023

Easier said than done. The limits are mostly technical, not conceptual. In theory, it's easy to think of how you'd model those anomalous situations. In practice, a much smaller subset of models is computationally feasible for applications, and that subset shrinks as the complexity and size of the application increase.
