Datamining, Data Snooping and P-Hacking
Cliff Asness (June 2, 2015) defines datamining as “discovering historical patterns that are driven by random, not real, relationships and assuming they’ll repeat...a huge concern in many fields”. In finance, datamining is especially relevant when investigators attempt to explain or identify patterns in stock returns. Often, they try to establish a relationship between firm characteristics and returns, using only US firms in the dataset. For example, a regression relates, say, the market value of equity, growth rates or the like to the firms' stock returns. The crux of the datamining issue is that the regression results come from a specific sample of firms observed over a specific period of time. The question then arises as to whether the results and their implications hold only for that period, only for that sample of firms, or both. It is difficult to ensure that the results are not “one-time wonders” within such an in-sample-only design.
In his Presidential Address for the American Finance Association in 2017, Campbell Harvey takes the issue further, into the intentional misuse of statistics. He defined intentional p-hacking as the practice of reporting only significant results after the investigator has run any number of correlations on finance variables; has tried a variety of statistical methods, such as ordinary regression versus cluster analysis versus linear or nonlinear probability approaches; or has manipulated the data through transformations or excluded data by eliminating outliers. There are likely other variants, but all share the same underlying motivation: the desire to be published when finance journals, to a large extent, publish only research with significant results.
The practices of p-hacking and datamining run a high risk of turning up significant results that are really just random phenomena. By definition, random events don’t repeat themselves in a predictable fashion. “Snooping the data” in this manner goes a long way toward explaining why predictions about investment strategies fail on a going-forward basis. Even worse, when such strategies lack a “theory” that proposes direct hypotheses about investment behavior, the failure to generate alpha in the real world is often a monumental disappointment. In finance, and specifically in the investments area, we therefore describe datamining as the statistical analysis of financial and economic data without a guiding, a priori hypothesis (i.e., without theory). This is an important distinction: if a sound theoretical basis can be articulated, then the negative aspects of datamining may be mitigated and the prospects for successful investing improve.
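The way repeated testing manufactures "significant" results can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the signals and the returns are independent Gaussian noise, the function names and parameters (correlation_t_stat, count_spurious_hits, family_wise_error, the 60 observations and the t cutoff of 2.0) are hypothetical choices made for this example, and nothing here is taken from Harvey's or Asness's own work.

```python
import math
import random


def correlation_t_stat(x, y):
    """Return the t-statistic of the Pearson correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def count_spurious_hits(n_signals=1000, n_obs=60, t_cutoff=2.0, seed=0):
    """Count noise signals that look 'significant' against noise returns."""
    rng = random.Random(seed)
    # Assumption: "returns" are pure noise, so no signal can truly predict them.
    returns = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_signals):
        # Each candidate signal is unrelated to returns by construction.
        signal = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if abs(correlation_t_stat(signal, returns)) > t_cutoff:
            hits += 1
    return hits


def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


hits = count_spurious_hits()
print(f"'Significant' noise signals out of 1000: {hits}")
print(f"P(at least one false positive in 20 tests): {family_wise_error(0.05, 20):.2f}")
```

With a cutoff near the conventional 5% level, roughly one in twenty of these meaningless signals should clear the bar, and with 20 independent tests the chance of at least one false positive is about 64%. An investigator who reports only the hits would appear to have found predictive variables where none exist.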
What is a sound theoretical basis? Essentially, sound theory is a story about the investment philosophy that you can believe in. There are likely numerous studies and backtests with great results that you cannot really trust. You cannot muster any confidence in the investment strategy because it makes no sense. The studies and backtests whose results you can believe in are likely those whose strategies have worked over long periods of time, across a variety of asset classes, across countries and on an out-of-sample basis, and that have a reasonable story behind them.
The root of the problem with financial data is that there is essentially one set of data and one set of variables, replicated by numerous vendors or available on the internet. This circumstance effectively eliminates the possibility of benefiting from independent replications of the research. Although statisticians and econometricians have always considered it “poor” practice, datamining has become increasingly problematic for investors because large data sets are now easy to obtain and easy to analyze. Nowadays, enormous amounts of quantitative data are available. Computers, spreadsheets and data subscriptions too numerous to list here are commonplace. Every conceivable combination of factors can be, and likely has been, tested and found to be spectacularly successful using in-sample empirical designs. Yet many of the same strategies show little or no predictive power when implemented on an out-of-sample basis. Despite these very negative connotations, datamining is part of the deal in data-driven investing; what it requires is a commitment to the proper use of scientific and statistical methods.
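The gap between in-sample success and out-of-sample failure can be sketched with simulated data. The example below is a minimal illustration, not a backtesting method from the sources cited here: every "strategy" is a series of random monthly returns with a true mean of zero, the 4% monthly volatility and the 60-month windows are arbitrary assumptions, and the names sharpe, pick_winner and run_experiment are hypothetical.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe ratio (mean over sample standard deviation)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var)


def pick_winner(n_strategies, n_months, rng):
    """Pick the best in-sample strategy; return its in- and out-of-sample Sharpe."""
    best_in, best_out = -math.inf, 0.0
    for _ in range(n_strategies):
        # Assumption: zero true mean, so no strategy has any real edge.
        series = [rng.gauss(0.0, 0.04) for _ in range(2 * n_months)]
        in_sample = sharpe(series[:n_months])
        if in_sample > best_in:
            # Record how the in-sample winner does on the held-out months.
            best_in, best_out = in_sample, sharpe(series[n_months:])
    return best_in, best_out


def run_experiment(n_trials=100, n_strategies=50, n_months=60, seed=1):
    """Average the winner's in- and out-of-sample Sharpe over many trials."""
    rng = random.Random(seed)
    winners = [pick_winner(n_strategies, n_months, rng) for _ in range(n_trials)]
    avg_in = sum(w[0] for w in winners) / n_trials
    avg_out = sum(w[1] for w in winners) / n_trials
    return avg_in, avg_out


avg_in, avg_out = run_experiment()
print(f"Average in-sample Sharpe of the selected strategy:     {avg_in:.3f}")
print(f"Average out-of-sample Sharpe of the selected strategy: {avg_out:.3f}")
```

Because the winner is chosen for its in-sample result, its in-sample Sharpe ratio is biased upward even though no strategy has an edge, while its out-of-sample Sharpe ratio should average close to zero. Holding back data that played no part in the selection is one simple way to expose this bias.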