Data Driven Statistical Methods

About this book

Calculations once prohibitively time-consuming can be completed in microseconds by modern computers. This has resulted in dramatic shifts in emphasis in applied statistics. Not only has it freed us from an obsession with the 5% and 1% significance levels imposed by conventional tables but many exact estimation procedures based on randomization tests are now as easy to carry out as approximations based on normal distribution theory. In a wider context it has facilitated the everyday use of tools such as the bootstrap and robust estimation methods as well as diagnostic tests for pinpointing or for adjusting possible aberrations or contamination that may otherwise be virtually undetectable in complex data sets. Data Driven Statistical Methods provides an insight into modern developments in statistical methodology using examples that highlight connections between these techniques as well as their relationship to other established approaches. Illustration by simple numerical examples takes priority over abstract theory. Examples and exercises are selected from many fields ranging from studies of literary style to analysis of survival data from clinical files, from psychological tests to interpretation of evidence in legal cases. Users are encouraged to apply the methods to their own or other data sets relevant to their fields of interest. The book will appeal both to lecturers giving undergraduate mainstream or service courses in statistics and to newly-practising statisticians or others concerned with data interpretation in any discipline who want to make the best use of modern statistical computer software.


1 Data-driven inference

1.1 Data-driven or model-driven

Data are the raw material of applied statistics - providing a focal point or pivot in statistical methodology. I say ‘focal’ rather than ‘starting’ point, because even before data are collected the statistician has a role advising what data are needed, and how these should be collected to best answer questions posed by experimenters and others who use numerical information. Both the statistician’s before- and after-collection roles are centered on data.
One impact of computers on data analysis has been to shift emphasis away from restricted probabilistic models chosen on the pragmatic grounds that hopefully they would reflect the main data characteristics while providing a method of analysis that was not too computationally demanding. The shift has been to computer-intensive analyses that let one explore more fully the characteristics of data without artificial constraints imposed by a particular preconceived mathematical model. Obvious limitations of simple probabilistic models are usually drawn to the attention of students early in their training, but some potential difficulties are far from self-evident, and even when they are, the way round them may not be clear.
Example 1.1 gives simple data where the assumptions for a standard classic test break down. Example 1.2 covers a situation where a standard test seems reasonable, but it produces results contrary to intuition and further analysis is needed to see why this happens.
Example 1.1 Look at any statistical journal, or indeed at any scientific journal published several decades ago, and at its modern counterpart and you are likely to see changes in the type of material and often in the way it is presented. This may be due in part to changes in editorial or production policy but it also reflects rapid developments in science and increasing pressure on scientists to publish their results in reputable journals. Recently I looked at the statistical aspects of such changes in several journals. I wanted to find out for each whether annual volumes were becoming larger, if the proportion of space devoted to particular topics had changed, whether papers now tended to be shorter or longer, if the geographical distribution of authors had changed, whether joint authorship was becoming more common than individual presentation, and so on. There were supplementary questions. For example, did authors of statistical papers in the 1990s tend to list more references than those writing in the 1950s? A priori, this seemed likely in the light of the ever-increasing store of theory and knowledge. As part of my larger study I took a random sample of papers published in the journal Biometrics in 1956-7 and another sample from that journal in 1990. The numbers of references in papers in the samples were:
[Table of reference counts for the 1956-7 and 1990 samples not reproduced in this excerpt.]
Do these data support the hypothesis of a higher mean number of references in the later period? The classic test for a difference between population means is the two-sample t-test, which is strictly relevant to random samples from normal distributions that differ, if at all, only in mean. The samples here are not from normal distributions, but from finite populations of, in each case, fewer than 100 papers. Also the data are counts, not the continuous variables implicit in a normal distribution. However, a wealth of empirical experience plus certain theoretical consequences of the often-quoted central limit theorem indicate that these factors may not matter greatly and that the t-test may still be useful.
But that test is not satisfactory here, because the observations 72 and 59 are isolated from the rest of the data and strongly suggest that the underlying distributions are skew, or that something is peculiar about these two observations. Some people would call them outliers - a concept I describe in section 3.2. The data reflect the fact that Biometrics publishes many papers presenting new theory and methods with typically a small to moderate number of references and also a few expository or review papers covering broad subject areas that usually include extensive bibliographies. This implies we are sampling from a mixture of at least two populations, one of which (theory and method papers) is the larger and thus more strongly represented in the random samples. This mixture takes away the near-symmetry expected in samples from a normal distribution even though there is no evidence here that the samples come from populations with different variances.
This lack of symmetry reduces the ability (power is the technical term, described in section 1.7) of the t-test to detect shifts in the mean. Indeed, for these data the t-test supports the hypothesis that the population means do not differ. Yet over half the papers in the 1990 sample give 20 or more references while only one paper in the 1956-7 sample does. In sections 4.1, 4.2 and 4.6 I discuss tests that are more appropriate than the t-test in these circumstances for detecting evidence of any location difference.
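To make the contrast concrete, the sketch below applies both a two-sample t-test and a two-sample permutation test on the difference in means, using scipy. The counts shown are hypothetical stand-ins (only the values 72 and 59 echo observations mentioned in the text; the original table is not reproduced in this excerpt), so the printed p-values illustrate the mechanics rather than the book's actual results.

```python
import numpy as np
from scipy import stats

# HYPOTHETICAL reference counts; each sample includes one large review-paper
# count (59, 72) to mimic the mixture of paper types described in the text.
sample_1956 = np.array([8, 5, 12, 3, 9, 7, 6, 4, 10, 59])
sample_1990 = np.array([15, 22, 31, 18, 25, 9, 27, 20, 14, 72])

# Classic two-sample t-test: assumes near-normal populations differing,
# if at all, only in mean.
t_stat, t_p = stats.ttest_ind(sample_1990, sample_1956, alternative='greater')

# Permutation test: repeatedly reallocate the pooled counts to two groups of
# the observed sizes and recompute the difference in means, so the reference
# distribution comes from the data themselves rather than from normal theory.
def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

perm = stats.permutation_test((sample_1990, sample_1956), mean_diff,
                              alternative='greater', n_resamples=9999)

print(f"t-test:       p = {t_p:.3f}")
print(f"permutation:  p = {perm.pvalue:.3f}")
```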
Example 1.2 Statisticians often meet data that are counts of individuals with various characteristics. A simple case is that where items are allocated to one or other of two categories for each of two criteria. This gives rise to the familiar 2×2 contingency table. We often want to know if there is evidence for association between criteria. Gastwirth and Greenhouse (1995) give data for promotions (first criterion) among 100 employees in a firm where each employee belongs to either a majority or a minority group (second criterion). Data like these are of interest when there are charges of discrimination against some group on the grounds of race, sex, age, religion, etc. The data given for firm A were:
                 Promoted   Not Promoted   Total
Minority group       1           31          32
Majority group      10           58          68
If the data were presented in a court case alleging discrimination on, say, ethnic grounds, Gastwirth and Greenhouse indicate that US courts may seek an explanation from a company as to why its policy is not discriminatory if the appropriate statistical test rejects at the 5% significance level a hypothesis that promotion is independent of ethnic grouping. The company must then try to explain the discrepancy on grounds other than ethnic considerations. It might argue, for example, that promotion required either a prior agreement or a willingness to work unsocial hours or spend long periods away from home and that relatively few in the minority group met that requirement.
One appropriate test for independence is Fisher's exact test. If you are not familiar with this test, it is described in section 4.6, but it suffices here to say that although the proportion promoted in the minority group is only 1/32, or 3.125%, compared to the overall rate of 11%, the hypothesis of independence is not rejected at the 5% level by the Fisher test. Gastwirth and Greenhouse give a similar table for a comparable group of 100 employees recruited from the same population for a second firm, B. For this firm the data are:
                 Promoted   Not Promoted   Total
Minority group       2           46          48
Majority group       9           43          52
The promotion rate is 4.167% in the minority group, greater than the corresponding proportion for firm A, while the overall rate is again 11%, so clearly there is not going to be significant evidence of association.
I made that statement with tongue in cheek, for if we apply the Fisher exact test to confirm lack of association, the result is significant at the 5% level, i.e. there is an indication of association that we did not get for firm A! Unlike the situation in Example 1.1, however, it is not the test that is inappropriate. I leave you in suspense until I return to this problem in section 13.6, telling you only that the anomaly arises because we have ignored information in the data that affects the relative performance of the test in the two cases.
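As a check on the two tables above, the sketch below runs Fisher's exact test on the firm A and firm B counts with scipy. The one-sided alternative (fewer minority promotions than independence would predict) is an assumption on my part; the text does not state which alternative was applied, only that the test fails to reject independence at the 5% level for firm A yet rejects it for firm B.

```python
from scipy.stats import fisher_exact

firm_A = [[1, 31],   # minority group:  promoted, not promoted
          [10, 58]]  # majority group:  promoted, not promoted
firm_B = [[2, 46],
          [9, 43]]

for name, table in [("firm A", firm_A), ("firm B", firm_B)]:
    # alternative='less' tests whether the minority promotion count is
    # smaller than expected under independence (an assumed direction).
    odds_ratio, p_value = fisher_exact(table, alternative='less')
    print(f"{name}: odds ratio = {odds_ratio:.3f}, one-sided p = {p_value:.4f}")
```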
In modern statistical inference, two complementary but often overlapping approaches are:
  • Model-driven analyses. These are based on probabilistic or mathematical models, simple or sophisticated, to encapsulate the main features of data. Once the model is specified, analyses tend to be driven by that model. A well known example is the general linear model that covers analysis of variance and linear regression; validity of these analyses in, for example, basic analysis of variance depends upon assumptions like additivity of effects and homogeneity of error variance. Many inferences are strictly valid only under further assumptions of normality. Transformation of data sometimes induces such conditions, but the price to pay may be increased difficulty of interpretation.
  • Data-driven analyses. There are three subcategories. The first comprises methods that work over a range of potential models and includes exploratory and robust methods. These are useful if there is not enough data-based or other information to select any one probabilistic model. Robust methods are ones that perform well for several models, even if optimal for none or only some.
    The second kind of data-driven analyses uses the data to squeeze out information with only limited assumptions about potential models. These include permutation tests and also the bootstrap and jackknife described ...
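The excerpt breaks off here; the bootstrap itself is the subject of chapter 2. As a generic illustration only (not the book's own treatment), the sketch below resamples a small hypothetical data set with replacement to estimate the sampling variability of its mean.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.3, 5.1, 3.8, 6.2, 5.5, 4.9, 7.0, 5.3])  # hypothetical data

# Basic bootstrap: resample with replacement many times and recompute the mean.
n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Bootstrap standard error and a simple percentile confidence interval.
se = boot_means.std(ddof=1)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, bootstrap SE = {se:.2f}")
print(f"95% percentile interval: ({lo:.2f}, {hi:.2f})")
```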

Table of contents

  1. Cover
  2. Halftitle
  3. Series Page
  4. Title Page
  5. Copyright Page
  6. Table of Contents
  7. Preface
  8. 1 Data-driven inference
  9. 2 The bootstrap
  11. 3 Outliers, contamination and robustness
  11. 4 Location tests for two independent samples
  12. 5 Location tests for single and paired samples
  13. 6 More one- and two-sample tests
  14. 7 Three or more independent samples
  15. 8 Designed experiments
  16. 9 Correlation and concordance
  17. 10 Bivariate regression
  18. 11 Other regression models and diagnostics
  19. 12 Categorical data analysis
  20. 13 Further categorical data analysis
  21. 14 Data-driven or model-driven?
  22. References
  23. Index