Section 1
BENFORD'S LAW
DIGITS VERSUS NUMBERS
The typical statistician, during a typical day at the office, spends most of the time intensely staring at data charts and scatter plots, seeking real or imaginary patterns where perhaps none exist, summarizing data, calculating averages and standard deviations, regressing and correlating seemingly unrelated variables, analyzing subtle variances between related data sets to determine whether they are significantly or randomly different from each other, dissecting and bisecting those pesky numbers sent by clients, government agencies, companies, and research institutes.
Interestingly, the statistician has recently taken on the role of a philosopher of sorts: instead of examining the numbers themselves, as is the standard practice, he or she investigates the digital language used in writing those numbers. What letters are to words, digits are to numbers. Why should a poetry lover seek patterns or beauty in the letters of Shakespeare's prose rather than in the elegantly combined words? Yet the relative proportions of our ten digits 0 to 9 occurring within typical everyday numbers are now routinely recorded and investigated by statisticians and data analysts, who even theorize, by applying mathematical and statistical reasoning, about how exactly the digits should be spread within any given data set. Moreover, the study of digit proportions is further subdivided according to position. For example, the proportions of the leftmost digit, namely the first digit of numbers, are examined separately. Another separate analysis is performed on the second-leftmost digit, which indeed shows quite different digital proportions from those of the first digit. But aren't all digits supposed to occur randomly and thus be equally distributed? Why should the digit 4, for example, have a higher or lower chance of occurring within numbers than, say, the digit 5? One wonders whether the occurrences of digits within numbers are just "too random" for the statistician to even consider and analyze. Is there indeed a particular statistical law governing digital proportions? In addition, it seems doubtful that there would be any use or consequence in looking into these digital proportions in the first place. Are there any applications that can exploit the examination of these digital proportions?
The answers to the latter two questions are both decisively positive, as evidenced by the newly created role of the statistician as a private detective who uses known digital patterns in data to detect fraud, knowing that fake data probably lacks those particular patterns. Previously, the task of the statistician was merely to analyze data, never to decide on the authenticity of the data provided. Data was traditionally taken as a given, without any ability to authenticate it. For how could the unsuspecting, honest, and naive statistician know that people were sending him or her fake data that was merely invented? One incentive to fake data by reducing reported revenues and income would naturally be to lower tax payments. Another incentive is the temptation to inflate revenues and profits in order to impress investors and present the company in a better light as being financially sound. There is therefore a strong need on the part of tax authorities, governmental financial regulatory and supervisory agencies worldwide, auditing and accounting companies, and others to obtain professional statistical advice on how to detect fake data. By wearing the philosopher's hat and examining the digital language used in writing the numbers in provided data sets, the statistician is then able to wear his or her other hat, namely the detective's hat, and forensically analyze data for any possible fraud.
TO FIND FRAUD, SIMPLY EXAMINE ITS DIGITS!
As our civilization progresses, we are able to do things previously thought impossible. Our collective mathematical and technological abilities have reached fantastic heights. We literally perform magic with our computers and other gadgets. But can we perform the simple task of telling when a friend or a spouse lies? Perhaps not, but the truly sophisticated statistician, aware of the latest developments in the field, can nowadays detect straight-faced fraudsters when presented with their fake data. Underpinning this ability is the fact that to concoct authentic-looking data one must know something about the particular properties of its digital language, while most fraudsters haven't got a clue about the topic and mistakenly believe that digital equality rules the universe of numbers. Yet in fact, low digits such as 1, 2, and 3 occur with very high frequencies in the first-place position of typical everyday data, while high digits such as 7, 8, and 9 occur far less often. So much so that the proportion of typical everyday numbers starting with digit 1 is about seven times that of numbers starting with digit 9! About 30% of typical everyday numbers in use start with digit 1, while only about 4% start with digit 9.
To illustrate how this peculiar digital phenomenon can be used in fraud detection, we shall digitally analyze hypothetical accounting data from five different companies, where the amounts represent revenues. The table in Fig. 1.1 shows 25 dollar amounts from each company. Nothing seems unusual or suspicious if we merely focus on the numbers themselves. Yet if we forensically investigate the digital language used in writing those numbers, namely the digits at the very beginning of each number (the leftmost ones), we can immediately reveal an abnormality in one particular data set. Figure 1.2 shows the proportions of the first digits for all five companies.
Clearly, MF Capital comes under strong suspicion in the eyes of the expert statistician, since typical accounting data rarely comes with anything near digital equality for the first position. The first-digit proportions of the other four companies show an overall pattern of gradual decrease, consistent with the pattern expected in almost all types of accounting data. The sequence of first digits for the MF Capital revenue data (commas omitted) is {4736281255914389752766432}, which is distinctly different from, say, Alcoa's {6111119321441128225618431}. Digits at the second and third positions are much more equal in proportion for all five companies and do not show any particular pattern; they also do not single out MF Capital in any way. Had the statistician's focus been misplaced on those digits, there wouldn't have been any clue about MF Capital's possible fraudulent activities.
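The contrast between the two digit sequences quoted above can be tallied directly. The following is a minimal sketch in Python; the function name `first_digit_proportions` and the variable names are illustrative choices, not part of the original analysis, and the only data used are the two sequences of 25 first digits given in the text.

```python
from collections import Counter

# First digits of the 25 revenue amounts, exactly as quoted in the text.
MF_CAPITAL = "4736281255914389752766432"
ALCOA = "6111119321441128225618431"


def first_digit_proportions(digits):
    """Return {digit: proportion} for digits 1..9 in a string of first digits."""
    counts = Counter(digits)
    total = len(digits)
    return {d: counts[str(d)] / total for d in range(1, 10)}


for name, digits in (("MF Capital", MF_CAPITAL), ("Alcoa", ALCOA)):
    props = first_digit_proportions(digits)
    row = "  ".join(f"{d}:{p:.0%}" for d, p in props.items())
    print(f"{name:<11}{row}")
```

In this tally digit 1 leads 10 of Alcoa's 25 amounts (40%) but only 2 of MF Capital's (8%), whose digits are spread almost evenly across 1 to 9.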
Figure 1.1 Hypothetical Accounting Data for Five Companies
Figure 1.2 1st Digits Proportions of the Data of Five Companies
FIRST LEADING DIGITS
The First Leading Digit (LD), or First Significant Digit, is the first non-zero digit of a given number, appearing on its leftmost side. For 567.34 the leading digit is 5. For 0.0367 the leading digit is 3, as we discard the zeros. For the lone integer 6 the leading digit is 6. For negative numbers we simply discard the sign, hence for -62.97 the leading digit is 6. Another way of defining the first digit of any non-zero number is by writing it in scientific notation as A × 10^N, with N being an integer and A being a real number such that 1 ≤ |A| < 10. In such a representation, the integer part of A (excluding the fractional part), with the positive or negative sign ignored, is what we consider the first leading digit. For example, the number 311.75 is written in scientific notation as 3.1175 × 10^2, and digit 3 leads the number. Naturally, when digit d appears first in a number composed of several digits, we call d the "leader", as it leads all the other digits trailing behind it to the right.
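The scientific-notation definition above translates almost line for line into code. The sketch below uses Python's standard `decimal` module to avoid binary floating-point rounding; the helper names `scientific_parts` and `first_leading_digit` are hypothetical, and rejecting zero is an assumption made here, since zero has no non-zero digit to lead it.

```python
from decimal import Decimal


def scientific_parts(x):
    """Split x into (A, N) with x = A * 10**N and 1 <= |A| < 10."""
    d = Decimal(str(x))
    if not d.is_finite() or d == 0:
        # Assumption of this sketch: zero, NaN and infinity have no leading digit.
        raise ValueError("no leading digit for zero or non-finite values")
    n = d.adjusted()      # exponent of the most significant digit
    a = d.scaleb(-n)      # shift the decimal point so that 1 <= |A| < 10
    return a, n


def first_leading_digit(x):
    """Integer part of |A|, i.e. the first significant digit of x."""
    a, _ = scientific_parts(x)
    return int(abs(a))


for value in (567.34, 0.0367, 6, -62.97, 311.75):
    print(value, "->", first_leading_digit(value))
```

For the examples in the text this yields 5, 3, 6, 6, and 3 respectively, and `scientific_parts(311.75)` gives A = 3.1175 with N = 2.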
EMPIRICAL EVIDENCE FROM REAL-LIFE DATA ON DIGIT DISTRIBUTION
Perhaps it is tempting to intuit that for numbers in typical real-life data sets, all nine digits {1, 2, 3, 4, 5, 6, 7, 8, 9} should be equally likely to occur in the first position and thus be uniformly distributed. Let us examine three typical data sets from a variety of real-life situations where digital results run counter to that misguided intuition and where, surprisingly, low digits such as 1, 2, and 3 are strongly favored over high digits such as 7, 8, and 9. The three data sets to be digitally examined are: (I) stock market prices and volume of stock traded, (II) the 10 by 10 multiplication table, and (III) house numbers in typical address data.
Examination of first digits of closing prices and daily volume of stocks traded on the New York Stock Exchange on December 23, 2011 reveals a definite pattern in which digital proportions are almost monotonically and consistently decreasing. The first 31 companies on top of the alphabetically-sorted list were arbitrarily chosen. Figure 1.3 shows the extracted data.
Low digits lead much more often than high digits, for both stock prices and volume. Figure 1.4 shows the exact LD distributions for this limited set of 31 companies. It should be noted that almost all other such subsets further down the long list on the NYSE website yield quite similar results, that there was nothing unusual about the trading day of December 23, 2011, and that very similar digital results are obtained on other trading days.
Let us examine the LD of the 10 by 10 multiplication table that we were all forced to memorize at school against our will, as shown in Fig. 1.5(A).
Surprisingly, out of 100 numbers, 21 start with the lowest digit 1 (shown in large and bold font), and only five start with the highest digit 9 (shown within circles), a ratio of roughly 4:1. This result is surprising yet approximately compatible with the digital results seen in the example of stock prices and volume data. In this digital analysis the numbers 1, 10, and 100 are grouped together under the same category, since all of them are led by digit 1. The digital proportions here, for digits 1 through 9, are {21%, 17%, 13%, 14%, 8%, 9%, 6%, 7%, 5%}.
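The multiplication-table count is small enough to reproduce by brute force. The following Python sketch enumerates all 100 products and tallies their leading digits; the names `leading_digit` and `table_counts` are illustrative only.

```python
from collections import Counter


def leading_digit(n):
    """First digit of a positive integer, read off its decimal string."""
    return int(str(n)[0])


# All 100 entries i * j of the 10 by 10 multiplication table.
products = [i * j for i in range(1, 11) for j in range(1, 11)]
counts = Counter(leading_digit(p) for p in products)

# With exactly 100 entries, each count is also a percentage.
table_counts = [counts[d] for d in range(1, 10)]
print(table_counts)  # [21, 17, 13, 14, 8, 9, 6, 7, 5]
```

The tally matches the proportions quoted above, with 1, 10, and 100 all falling under digit 1.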
Figure 1.3 Price and Volume of Stocks Traded on the NYSE
Interestingly, if the digital as...