Chapter 1
Calculus Ratiocinator
Abstract
There is more need than ever to implement Leibniz's Calculus Ratiocinator suggestion of a machine that simulates human cognition but without the inherent subjective biases of humans. This need is seen in how predictive models based upon observational data often vary widely across different blinded modelers or across standard automated variable selection methods applied to the same data. This unreliability is amplified in today's Big Data with very high-dimension sets of confounding candidate variables. The single biggest reason for this unreliability is uncontrolled error, which is especially prevalent with highly multicollinear input variables. Modelers thus need to make arbitrary or biased subjective choices to overcome these problems, because widely used automated variable selection methods like standard stepwise methods are simply not built to handle such error. The stacked ensemble method that averages many different elementary models is reviewed as one way to avoid such bias and error and generate a reliable prediction, but it has disadvantages, including a lack of automation and a lack of transparent, parsimonious, and understandable solutions. Reduced Error Logistic Regression (RELR), a form of logistic regression that also models error events as a component of the maximum likelihood estimation, is introduced as a method that avoids this multicollinearity error. An important neuromorphic property of RELR is that it shows stable explicit and implicit learning with small training samples and high-dimension inputs, as observed in neurons. Other important neuromorphic properties of RELR consistent with a Calculus Ratiocinator machine are also introduced, including the ability to produce unbiased, automatic, stable maximum probability solutions and stable causal reasoning based upon matched-sample quasi-experiments. Given RELR's connection to information theory, these stability properties are the basis of the new stable information theory that is reviewed in this book, with wide-ranging causal and predictive analytics applications.
Keywords
Analytic science; Big data; Calculus Ratiocinator; Causal analytics; Causality; Cognitive neuroscience; Cognitive science; Data mining; Ensemble learning; Explanation; Explicit learning and memory; High dimension data; Implicit learning and memory; Information theory; Logistic regression; Machine learning; Matching experiment; Maximum entropy; Maximum likelihood; Multicollinearity; Neuromorphic; Neuroscience; Observational data; Outcome score matching; Prediction; Predictive analytics; Propensity score; Quasi-experiment; Randomized controlled experiment; Reduced error logistic regression (RELR); Stable information theory
"It is obvious that if we could find characters or signs suited for expressing all our thoughts as clearly and as exactly as arithmetic expresses numbers or geometry expresses lines, we could do in all matters insofar as they are subject to reasoning all that we can do in arithmetic and geometry. For all investigations which depend on reasoning would be carried out by transposing these characters and by a species of calculus."
Gottfried Leibniz, Preface to the General Science, 1677.1
Contents
1. A Fundamental Problem with the Widely Used Methods
2. Ensemble Models and Cognitive Processing in Playing Jeopardy
3. The Brain's Explicit and Implicit Learning
4. Two Distinct Modeling Cultures and Machine Intelligence
5. Logistic Regression and the Calculus Ratiocinator Problem
At the end of his life, starting in 1703, Gottfried Leibniz engaged in a 12-year feud with Isaac Newton over who first invented the calculus and who committed plagiarism. All serious scholarship now indicates that Newton and Leibniz developed calculus independently.2 Yet, stories about Leibniz's invention of calculus usually focus on this priority dispute with Newton and give much less attention to how Leibniz's vision of calculus differed substantially from Newton's. Whereas Newton was trained in mathematical physics and remained associated with academia during the most creative time in his career, Leibniz's early academic failings in math led him to become a lawyer by training and an entrepreneur by profession.3 So Leibniz's deep mathematical insights that led to calculus occurred away from a university professional association. Unlike Newton, whose mathematical interests seemed entirely tied to physics, Leibniz clearly had a much broader goal for calculus, with applications in areas well beyond physics that would seem to have nothing to do with mathematics. His dream application was a Calculus Ratiocinator, which is synonymous with a Calculus of Thought.4 This can be interpreted as a very precise mathematical model of cognition that could be automated in a machine to answer any important philosophical, scientific, or practical question that traditionally would be answered with human subjective conjecture.5 Leibniz proposed that if we had such a cognitive calculus, we could just say "Let us calculate"6 and always find the most reasonable answers, uncontaminated by human bias.
In a sense, this concept of the Calculus Ratiocinator foreshadows today's predictive analytic technology.7 Predictive analytics are widely used today to generate better-than-chance longer-term projections for more stable physical and biological outcomes like climate change, schizophrenia, Parkinson's disease, Alzheimer's disease, diabetes, cancer, and optimal crop yields, and even good short-term projections for less stable social outcomes like marriage satisfaction, divorce, successful parenting, crime, successful businesses, satisfied customers, great employees, successful ad campaigns, stock price changes, and loan decisions, among many others. Until the widespread practice of predictive analytics that came with the introduction of computers in the past century, most of these outcomes were thought to be too capricious to have anything to do with mathematics. Instead, questions about them were traditionally answered with speculative and biased hypotheses or intuitions, often rooted in culture or philosophy (Fig. 1.1).
Figure 1.1 Gottfried Wilhelm Leibniz.8
Until very recently, standard computer technology could only evaluate a small number of predictive features and observations. But we are now in an era of big data and high-performance massively parallel computing, so our predictive models should now become much more powerful. It would seem reasonable to expect that the traditional methods that selected important predictive features from small data will scale to high-dimension data and suddenly produce predictive models that are much more accurate and insightful. This would give us a new and much more powerful big data machine intelligence technology that is everything Leibniz imagined in a Calculus Ratiocinator. Big data massively parallel technology should thus theoretically allow completely new data-driven cognitive machines to predict and explain capricious outcomes in science, medicine, business, and government.
Unfortunately, it is not this simple, because observation samples are still fairly small in most of today's predictive analytic applications. One reason is that most real-world data are not representative samples of the population to which one wishes to generalize. For example, the people who visit Facebook or search on Google might not be a good representative sample of many populations, so smaller representative samples will need to be taken if the analytics are to generalize very well. Another problem is that many real-world data are not independent observations and instead are often repeated observations from the same individuals. For this reason, data also need to be downsampled significantly to yield independent observations. Still another problem is that even when there are many millions of independent representative observations, there is usually a much smaller number of individuals who did things like respond to a particular type of cancer drug, commit fraud, or respond to an advertising promotion in the recent past. The informative sample for a predictive model is this group of targeted individuals together with a group of similar size that did not show such a response, but these are not usually big data samples in terms of large numbers of observations. So, the biggest limitation of big data in the sense of a large number of observations is that most real-world data are not "big" and instead have limited numbers of observations. This is especially true because most predictive models are not built from Facebook or Google data.9
Still, most real-world data are "big" in another sense: they are very high dimensional, given that interactions between variables and nonlinear effects are also candidate predictive features. Previously, we did not have the technology to evaluate high dimensions of potentially predictive variables rapidly enough to be useful. The slower processing that was the reason for this "curse of dimensionality" is now behind us. So many might believe that this suddenly allows the evaluation of almost unfathomably high dimensions of data for the selection of important features in much more accurate and smarter big data predictive models, simply by applying the traditional widely used methods.
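As a rough illustration of this second sense of "big", consider how quickly the candidate feature space grows once pairwise interactions and simple nonlinear transforms are counted as candidate features. The short Python sketch below assumes a hypothetical 1,000 raw input variables; the specific counts are illustrative only and are not taken from any data set discussed in this book.

from math import comb

# Hypothetical example: p raw input variables measured on each observation.
p = 1000

main_effects = p                    # the raw variables themselves
squared_terms = p                   # a squared (nonlinear) transform of each variable
cubed_terms = p                     # a cubed (nonlinear) transform of each variable
pairwise_interactions = comb(p, 2)  # one product x_i * x_j for every pair i < j

total = main_effects + squared_terms + cubed_terms + pairwise_interactions
print(total)  # 502,500 candidate features from only 1,000 raw variables

Even at this modest starting dimensionality, the number of candidate features far exceeds the number of observations available in the informative samples described above, which is the sense in which most real-world data are "big".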
Unfortunately, the traditional widely used methods often do not give unbiased or non-arbitrary predictions and explanations, and this problem will become ever more apparent with today's high-dimension data.
1 A Fundamental Problem with the Widely Used Methods
There is one glaring problem with today's widely used predictive analytic methods that stands in the way of our new data-driven science. This problem is inconsistent with Leibniz's idea of an automated machine that can reproduce the very computations of human cognition, but without the subjective biases of humans. The problem is suggested by the fact that there are probably at least hundreds of predictive analytic methods in use today. Each method makes differing assumptions that would not be agreed upon by all, and each has at least one and sometimes many arbitrary parameters. This arbitrary diversity is defended by those who believe a "no free lunch" theorem that argues that there is no one best method across all situations.10,11 Yet, when predictive modelers test various arbitrary algorithms based upon these methods to get a best model for a specific situation, they obviously will test only a tiny subset of the possibilities. So unless there is an obvious very simple best model, different modelers will almost always produce substantially different arbitrary models from the same data.
As examples of this problem of arbitrary methods, there are different types of decision tree methods, like CHAID and CART, which use different statistical tests to determine branching. Even with the very same method, different user-provided parameters for splitting the branches of the tree will often give quite different decision trees that generate very different predictions and explanations. Likewise, there are many widely used regression variable selection methods, like stepwise and LASSO logistic regression, that all differ in the arbitrary assumptions and parameters employed to select important "explanatory" variables. Even with the very same regression method, different user choices for these parameters will almost always generate widely differing explanations and often substantially differing predictions. There are other methods, like Principal Component Analysis (PCA), Variable Clustering, and Factor Analysis, that attempt to avoid the variable selection problem by greatly reducing the dimensionality of the variables. These methods work well when the data match their underlying assumptions, but most behavioral data will not be easily modeled with those assumptions, such as the orthogonal components of PCA, or the assumption in the other methods that one knows how to rotate the components to be nonorthogonal when there are an infinite number of possible rotations. Likewise, there are many other methods, like Bayesian Networks, Partial Least Squares, and Structural Equation Modeling, that modelers often use to make explanatory inferences. These methods each make differing arbitrary assumptions that often generate wide diversity in explanations and predictions. Likewise, there are a large number of fairly black-box methods, like Support Vector Machines, Artificial Neural Networks, Random Forests, Stochastic Gradient Boosting, and various Genetic Algorithms, that are not completely transparent in their explanations of how the predictions are formed, although some measure of variable importance often can be obtained. These methods can generate quite different predictions and important variables simply because of differing assumptions across the methods or differing user-defined modeling parameters within the methods.
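A minimal sketch of this sensitivity to arbitrary user-defined parameters is given below, using L1-penalized (LASSO) logistic regression from scikit-learn on synthetic multicollinear data. The data-generating recipe, the two penalty values, and the random seed are all illustrative assumptions rather than settings from any study cited in this chapter.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic multicollinear predictors: every column shares a common latent factor,
# a rough stand-in for correlated behavioral measurements.
rng = np.random.default_rng(0)
n, p = 300, 20
latent = rng.normal(size=(n, 1))
X = 0.8 * latent + 0.6 * rng.normal(size=(n, p))
y = (latent[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The very same method (LASSO logistic regression) under two arbitrary,
# user-chosen penalty strengths selects different "explanatory" variables.
for C in (0.05, 1.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    selected = np.flatnonzero(model.coef_[0])
    print(f"C={C}: selected variable indices {selected.tolist()}")

Because the columns are nearly interchangeable, which of them survives the penalty is largely an accident of the arbitrary choice of C, which is exactly the kind of user-dependent arbitrariness described above.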
Because there are so many methods, and because all require unsubstantiated modeling assumptions along with arbitrary user-defined parameters, if you gave exactly the same data to 100 different predictive modelers, you would likely get 100 completely different models unless there was an obvious simple solution. These differing models often would make very different predictions and would almost always generate different explanations, to the extent that the methods produce transparent models that can be interpreted. In cases where regression methods are used and raw interaction or nonlinear effects are parsimoniously selected without accompanying main effects, the model's predictions are even likely to depend on how variables are scaled, so that measuring a currency variable in Dollars versus Euros would give different predictions.12 Because of such variability, which can even defy basic principles of logic, it is unreasonable to interpret any of these arbitrary models as reflecting a causal and/or most probable explanation or prediction.
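One mechanism behind this scale dependence can be sketched as follows: when the selection step penalizes raw, unstandardized coefficients (as LASSO logistic regression does if features are not standardized), re-expressing a price variable in Euros rather than Dollars changes the effective penalty on that variable and on its interaction term, and therefore changes the fitted predictions. The variable names, exchange rate, and penalty value below are hypothetical, and this is only one way the dependence noted in footnote 12 can arise, not necessarily the one the footnote has in mind.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a 'price' measured in Dollars and a 'usage' level.
rng = np.random.default_rng(1)
n = 400
price_usd = rng.normal(100.0, 20.0, size=n)
usage = rng.normal(5.0, 2.0, size=n)
y = (0.02 * price_usd + 0.5 * usage + rng.normal(size=n) > 4.5).astype(int)

def fit_probs(price):
    # Raw interaction term included, no standardization, same penalty either way.
    X = np.column_stack([price, usage, price * usage])
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    return model.predict_proba(X)[:, 1]

p_usd = fit_probs(price_usd)          # prices expressed in Dollars
p_eur = fit_probs(price_usd * 0.92)   # the same prices under a hypothetical Euro exchange rate
print(np.max(np.abs(p_usd - p_eur)))  # typically nonzero: a pure unit change alters the predictions

Standardizing the variables before selection would remove this particular artifact, but whether and how to standardize is itself one of the arbitrary user decisions at issue.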
Because the widely used methods yield arbitrary and even illogical models in many cases, hardly can we say "Let us calculate" to answer important questions such as the most likely contribution of environmental versus genetic versus other biological factors in causing Parkinson's disease, Alzheimer's disease, prostate cancer, breast cancer, and so on. Hardly can we say "Let us calculate" when we wish to provide a most likely explanation for why there is climate change, why certain genetic and environmental markers correlate with diseases, why our business is suddenly losing customers, or how we might decrease costs and yet improve quality in health care. Hardly can we say "Let us calculate" when we wish to know the extent to which sexual orientation and other average gender differences are determined by biological or social factors, whether stricter gun control policies would have a positive or negative impact on crime and murder rates, or whether austerity as an economic intervention tool is helpful or hurtful. Because our widely used predictive analytic methods are so influenced by completely subjective human choices, predictive model explanations and predictions about human diseases, climate change, and business and social outcomes will have substantial variability simply due to our cognitive biases and/or our arbitrary modeling methods. The most important questions of our day concern economic, social, medical, and environmental outcomes that are related to human behavior by cause or effect, but our widely used predictive analytic methods cannot answer these questions reliably.
Even when the very same method is used to select variables, the important variables that the model selects as the basis of explanation are likely to vary across independent observation samples. This sampling variability will be especially prevalent if the observations available to train the model are limited or if there are many possible candidate explanatory features, and if there is also more than a modest correlation between at least some of the candidate explanatory variables. This problem of correlation between variables, or multicollinearity, is ultimately the real culprit, and it is almost always seen with human behavior outcomes. Unlike many physical phenomena, behavioral outcomes usually cannot be understood in terms of easy-to-separate uncorrelated causal components. Models based upon randomized controlled experimental selection methods can avoid this multicollinearity problem through designs that yield orthogonal variables.13 Yet, most of today's predictive analytic applications necessarily must deal with observation data, as randomized experiments are usually simply not possible with human behavior in real-world situations. Leo Breiman, who was one of the more prominent statisticians of recent memory, referred to this inability to deal with multicollinearity error as "the quiet scandal of statistics" because the attempts to avoid it in traditional predictive modeling methods are arbitrary and pro...