Mathematics

Box Plots

Box plots, also known as box-and-whisker plots, are graphical representations of the distribution of a dataset. They display the median, quartiles, and potential outliers of the data. The box represents the interquartile range, while the whiskers extend to the minimum and maximum values or, in many conventions, to the most extreme values that are not flagged as outliers. Box plots are useful for comparing the spread and central tendency of different datasets.

Written by Perlego with AI-assistance

12 Key excerpts on "Box Plots"

  • Book cover image for: An Essential Guide to Business Statistics
    • Dawn A. Willoughby(Author)
    • 2016(Publication Date)
    • Wiley
      (Publisher)
    Box Plots
    Based on a five-number summary, a box plot is a graphical display which tells us about the average value and variability of a data set. A box plot is a very useful way to visually summarise and compare data sets; it is sometimes known as a box-and-whisker plot. There are several steps involved in constructing a box plot:
    1. Create a horizontal axis with a scale that allows the smallest and largest data values to be included on the diagram.
    2. Draw a box above the axis; the left-hand vertical side of the box should be located at the position of the lower quartile, and the right-hand side at the position of the upper quartile.
    3. Inside the box, draw a vertical line at the position of the median on the axis scale.
    4. Draw a line that extends from the left-hand side of the box at Q1 to the smallest value in the data set, marked by a small vertical line. Create a corresponding line that joins the right-hand side of the box at Q3 to the largest value in the data set.
    To demonstrate the construction of a box plot, we will return to our example about the ages of the wealthiest billionaires, choosing to focus only on males at this stage. The five-number summary for males is:
              minimum   Q1   median   Q3   maximum
    males     45        66   72.5     78   85
    Step 1. The minimum age for the males is 45 and the maximum value is 85, so our horizontal axis could be drawn as follows:
    Step 2. We locate the position of our box using the lower quartile, 66, and the upper quartile, 78.
    Step 3. Using the median value, 72.5, we draw a vertical line inside the box.
    Step 4. We complete the box plot by drawing lines that extend from the sides of the box to the minimum and maximum values. These lines are sometimes known as ‘whiskers’, giving the alternative name for the diagram: box-and-whisker plot.
    Box Plots are most useful for visually comparing two or more data sets side by side on a single diagram using the same scale.
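The construction steps reduce to a little arithmetic on the five-number summary. The sketch below is only an illustration: the function name `box_geometry` and the dictionary keys are assumptions introduced here, not part of the book, and the numbers are the males' summary quoted in the excerpt.

```python
def box_geometry(minimum, q1, median, q3, maximum):
    """Return the lengths that a simple (no-outlier) box plot encodes."""
    if not minimum <= q1 <= median <= q3 <= maximum:
        raise ValueError("five-number summary must be in ascending order")
    return {
        "box_width": q3 - q1,           # step 2: the box spans Q1 to Q3
        "median_offset": median - q1,   # step 3: median line, measured from Q1
        "lower_whisker": q1 - minimum,  # step 4: line from Q1 down to the minimum
        "upper_whisker": maximum - q3,  # step 4: line from Q3 up to the maximum
    }


# Five-number summary for males, as quoted in the excerpt.
males = box_geometry(45, 66, 72.5, 78, 85)
print(males)
```

With these inputs the lower whisker (21 years) is three times as long as the upper whisker (7 years), which is the kind of asymmetry the finished diagram makes visible at a glance.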
  • Book cover image for: Applied Statistics and Probability for Engineers
    • Douglas C. Montgomery, George C. Runger(Authors)
    • 2020(Publication Date)
    • Wiley
      (Publisher)
    A chart of occurrences by category (in which the categories are ordered by the number of occurrences) is sometimes referred to as a Pareto chart. An exercise asks you to construct such a chart. In this section, we have concentrated on descriptive methods for the situation in which each observation in a data set is a single number or belongs to one category. In many cases, we work with data in which each observation consists of several measurements. For example, in a gasoline mileage study, each observation might consist of a measurement of miles per gallon, the size of the engine in the vehicle, engine horsepower, vehicle weight, and vehicle length. This is an example of multivariate data. In Section 6.6, we illustrate one simple graphical display for multivariate data. In later chapters, we discuss analyzing this type of data.
    6.4 Box Plots
    The stem-and-leaf display and the histogram provide general visual impressions about a data set, but numerical quantities such as x̄ or s provide information about only one feature of the data. The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of unusual observations or outliers. A box plot, sometimes called a box-and-whisker plot, displays the three quartiles, the minimum, and the maximum of the data on a rectangular box, aligned either horizontally or vertically. The box encloses the interquartile range with the left (or lower) edge at the first quartile, q1, and the right (or upper) edge at the third quartile, q3. A line is drawn through the box at the second quartile (which is the 50th percentile or the median), q2 = x̃. A line, or whisker, extends from each end of the box. The lower whisker is a line from the first quartile to the smallest data point within 1.5 interquartile ranges from the first quartile.
  • Book cover image for: Introduction to Bayesian Statistics
    • William M. Bolstad, James M. Curran(Authors)
    • 2016(Publication Date)
    • Wiley
      (Publisher)
    n = 29,
    This is halfway between the 7th- and 8th-order statistics, so
    The five number summary of a data set is y[1], Q1, Q2, Q3, y[n]. This gives the minimum, the three quartiles, and the maximum of the observations. The boxplot or box-and-whisker plot is a pictorial way of representing the five number summary. The steps are:
    1. Draw and label an axis.
    2. Draw a box with ends at the first and third quartiles.
    3. Draw a line through the box at the second quartile (median).
    4. Draw a line (whisker) from the lower quartile to the lowest observation, and draw a line (whisker) from the upper quartile to the highest observation.
    5. Warning: Minitab extends the whiskers only to a maximum length of 1.5 × the interquartile range. Any observation further out than that is identified with an asterisk (*) to indicate the observation may be an outlier. This can seriously distort the picture of the sample, because the criterion does not depend on the sample size. A large sample can look very heavy-tailed because the asterisks show that there are many possibly outlying values, when the proportion of outliers is well within the normal range. In Exercise 3.6, we show how this distortion works and how we can control it by editing the outlier symbol in the Minitab boxplot.
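The sample-size effect mentioned in the warning can be checked with a short calculation. The sketch below assumes normally distributed data and uses the theoretical quartiles of the normal distribution rather than sample quartiles; the function name `normal_fraction_outside` is hypothetical.

```python
from statistics import NormalDist


def normal_fraction_outside(k=1.5):
    """Probability that a normal observation lies beyond Q1 - k*IQR or Q3 + k*IQR."""
    z = NormalDist()  # standard normal; the fraction is the same for any mean and sd
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return z.cdf(lower) + (1.0 - z.cdf(upper))


p = normal_fraction_outside()
print(f"fraction flagged: {p:.4f}")
for n in (30, 1000, 100000):
    # Expected number of points drawn beyond the whiskers in a sample of size n.
    print(n, round(n * p, 1))
```

Under these assumptions roughly 0.7% of observations fall beyond the 1.5 × IQR limits, so the expected number of flagged points grows in proportion to the sample size even when nothing unusual is present.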
    The boxplot divides the observations into quarters. It shows you a lot about the shape of the data distribution. Examining the length of the whiskers compared to the box length shows whether the data set has light, normal, or heavy tails. Comparing the lengths of the whiskers shows whether the distribution of the data appears to be skewed or symmetric. Figure 3.2 shows the boxplot; the data distribution is fairly symmetric but with a slightly longer lower tail.
    Figure 3.2
    Boxplot of Earth density measurements by Cavendish.
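The fragment earlier in this excerpt (n = 29, with the first quartile halfway between the 7th and 8th order statistics) is consistent with locating Q1 at position (n + 1)/4 = 7.5. Assuming that positioning rule, which corresponds to the `"exclusive"` method of Python's `statistics.quantiles`, a five number summary could be sketched as follows. The data are hypothetical placeholders, not the Cavendish measurements.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (y[1], Q1, Q2, Q3, y[n]) using the (n + 1) * p position rule."""
    ys = sorted(data)
    q1, q2, q3 = quantiles(ys, n=4, method="exclusive")
    return ys[0], q1, q2, q3, ys[-1]


# 29 hypothetical observations: the integers 1..29, so that the k-th order
# statistic is simply k and the positions are easy to read off.
data = list(range(1, 30))
summary = five_number_summary(data)
print(summary)
```

Because the k-th order statistic equals k in this toy sample, the first quartile comes out as 7.5, halfway between the 7th and 8th order statistics. Other software may use a different interpolation rule and give slightly different quartiles.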

    Stem-and-Leaf Diagram

    The stem-and-leaf diagram is a quick and easy way of extracting information about the distribution of a sample of numbers. The stem represents the leading digit(s) to a certain depth (power of 10) of each data item, and the leaf represents the next digit of the data item. A stem-and-leaf diagram can be constructed by hand for a small data set. It is often the first technique used on a set of numbers. The steps are:
  • Book cover image for: Visual Statistics
    Visual Statistics

    Seeing Data with Dynamic Interactive Graphics

    • Forrest W. Young, Pedro M. Valero-Mora, Michael Friendly(Authors)
    • 2011(Publication Date)
    Boxplots were first described by Tukey (1977), who added some important variations shortly thereafter (McGill et al., 1978). Figure 6.4 shows the dot-plot in Figure 6.2 with a boxplot drawn on top of it. The elements of the boxplot are the following:
    • Box: The horizontal line in the center is located at the median: thus, half the data are above this line, half below. The bottom and top edges of the box are located at the first and third quartiles, which, with the median, divides the data into quarters: thus, half the data are inside the box, half outside; one-quarter below the box, one-quarter above; and each half of the box contains one-quarter of the data. Notched boxplots add intervals of confidence around the median as shown in the first plot of Figure 6.4.
    • The whiskers of the boxplot, which are the bottom and top horizontal lines, are located in the following way. The upper whisker is drawn at the largest observation, which is less than q3 + 1.5 × (q3 − q1), where q1 and q3 are the first and third quartiles, respectively. The lower whisker is located at the smallest observation that is larger than q1 − 1.5 × (q3 − q1). Points larger or smaller than whiskers are considered outside values. If the data were from a normal distribution, such values would occur less than once in 100 times. Tukey also defined points that would fall farther away than 3 × (q3 − q1) as far-out values but no indication of them is given in Figure 6.4.
    [Figure 6.4: Notched boxplot, boxplot, diamond plot, and box + diamond plot.]
    As defined by Tukey (1977) boxplots only plot individually the outside or far-out values. We prefer, however, to show all the points because the dynamic-interactive capabilities of the plot would suffer if some of the points could not be selected or identified. Also, the information portrayed by the schematic can be better judged with the help of the individual points, as we will soon see.
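The distinction between outside and far-out values can be expressed as a small classifier. This is a sketch under one stated assumption: the 1.5 and 3 multiples of the interquartile range are both measured outward from the nearer quartile. The function name `classify` and the quartile values are hypothetical.

```python
def classify(value, q1, q3):
    """Label a value as 'inside', 'outside', or 'far out' relative to the quartiles."""
    iqr = q3 - q1
    if value < q1 - 3.0 * iqr or value > q3 + 3.0 * iqr:
        return "far out"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "outside"
    return "inside"


# Hypothetical quartiles: q1 = 10, q3 = 20, so the limits are -5 and 35
# (1.5 interquartile ranges) and -20 and 50 (3 interquartile ranges).
labels = {v: classify(v, q1=10, q3=20) for v in (30, 40, 55, -8)}
print(labels)
```

Only values labelled "outside" or "far out" would be plotted individually under the original definition; the authors of this excerpt choose to draw every point.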
  • Book cover image for: Graphical Methods for Data Analysis
    Figure 2.1 . Ignoring the two largest and two smallest values, the rest of the data appear slightly skewed toward small values, which explains the position of the median relative to the quartiles. But we should remember that the number of observations in this sample is small and that we would quite likely see different behavior in another sample.
    Box Plots are useful in situations where it is either not necessary or not feasible to portray all details of the distribution. For example, if many distributions are to be compared, it is difficult to try to compare all aspects of the distributions. In situations where the summary values of the box plot do a good job of conveying the prominent features of the distribution and the less prominent detailed features do not matter, it makes sense to use the box plot and eliminate the unneeded information.
    The width of the box, as defined so far, has no particular meaning. The plot can be made quite narrow without affecting its visual impact so that it can be used in situations where compactness is important. This is useful in Chapter 3 when many distributions are being compared and in Chapter 4 when the box plot is added to the margin of another visual display.

    2.6 Histograms

    Another way to summarize a data distribution, one that has a long history in statistics, is to partition the range of the data into several intervals of equal length, count the number of points in each interval, and plot the counts as bar lengths in a histogram. This has been done in Figure 2.11 for the ozone data. The relative heights of the bars represent the relative density of observations in the intervals.
    The histogram is widely used and thus is familiar even to most nontechnical people and without extensive explanation. This makes it a convenient way to communicate distributional information to general audiences.
  • Book cover image for: Data Analysis and Statistics for Geography, Environmental Science, and Engineering
    • Miguel F. Acevedo(Author)
    • 2012(Publication Date)
    • CRC Press
      (Publisher)
    From this type of graph, we can see the variability of the data and identify potential outliers. For example, we can tell that there are three observations (10, 53, and 74) that have very low values (around and less than 10) and one observation (38) that has a very high value (near 100). In the computer session, we will see how to identify the observations on a plot.
    4.1.2 Boxplot
    The boxplot or box and whiskers plot (Figure 4.2) is a display of the main features of the descriptive summary: the median (a line inside the box), the first and third quartiles or lower and upper hinges (edges of the box), and the minimum and maximum nonoutlier values (the whiskers). These last two values are determined from the extremes of the range (or fence), which are the hinge (lower and upper respectively) minus or plus a factor (e.g., 1.5) of the inter-quartile distance (iqd, for short). The upper whisker is at the largest value within the range and the lower whisker is the smallest value within the range. Values above or below the extremes of the range are outliers and identified as circles on the plot.
    For example, for the 100 observations used for the boxplot of Figure 4.2, the following values are used: lower hinge (first quartile) = 38, upper hinge (third quartile) = 54, and median = 46. In this case, the iqd is 54 − 38 = 16, and therefore using 1.5 × 16 = 24 for the range, we obtain 38 − 24 = 14 and 54 + 24 = 78 for the extremes of the range. The lowest value contained within the range is 30 (this sets the lower whisker) and the largest value is 75 (upper whisker). In this case, below 14 we have three values (7, 10, 13) and above 78 we have one value (96). All these four values are outliers and displayed as small circles (Figure 4.2). It is helpful to label the outliers with the observation number (Figure 4.3).
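The arithmetic in this worked example can be replayed in a few lines. The sketch below takes the hinges as given (38 and 54) rather than estimating them, and the sample contains only the values named in the excerpt, not the full 100 observations; the function name `whiskers_and_outliers` is an assumption made for illustration.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Compute fences, whisker ends, and outliers from given hinges q1 and q3."""
    iqd = q3 - q1                                   # inter-quartile distance
    low_fence, high_fence = q1 - k * iqd, q3 + k * iqd
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),     # most extreme non-outlier values
        "outliers": outliers,
    }


# Hypothetical stand-in sample built from the values quoted in the excerpt.
sample = [7, 10, 13, 30, 38, 46, 54, 75, 96]
result = whiskers_and_outliers(sample, q1=38, q3=54)
print(result)
```

The fences come out at 14 and 78, the whiskers at 30 and 75, and the four outliers are 7, 10, 13, and 96, matching the figures in the text.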
  • Book cover image for: Statistics
    Statistics

    Learning from Data

    The five-number summary for a data set consists of the values of the minimum, the lower quartile, the median, the upper quartile, and the maximum. These five numbers divide a data set into quarters and are used to construct a boxplot. For an example of calculating the numbers in the five-number summary, see Example 3.13.
    M7: Given a numerical data set, construct a boxplot. A boxplot uses the numbers in the five-number summary to create a graph that provides information about center, variability, and symmetry or skewness in a data distribution. Modified boxplots also show outliers. For the steps in constructing a simple boxplot, see the box just prior to Example 3.13. For an example that illustrates the identification of outliers and construction of a boxplot that shows outliers, see Example 3.15 and the discussion just prior to that example.
    M8: Identify outliers in a numerical data set. For a univariate numerical data set, outliers are unusually small or unusually large data values. A data value is considered an outlier if it is more than 1.5 times the iqr away from the nearest quartile. For an example of identifying outliers in a data set, see Example 3.15.
    Putting It into Practice
    P3: Use boxplots to make comparisons between two or more groups. A comparative boxplot can be used to compare two or more groups. A comparative boxplot is two or more boxplots drawn using the same numerical scale. For an example of using boxplots to compare groups, see Example 3.14.
    Copyright 2019 Cengage Learning. All Rights Reserved.
  • Book cover image for: Beginning R
    Beginning R

    The Statistical Programming Language

    • Mark Gardener(Author)
    • 2012(Publication Date)
    • Wrox
      (Publisher)
    between samples. The basic form of the box-whisker plot shows the median value, the quartiles (or hinges), and the max/min values. This means that you get a lot of information in a compact manner. The box-whisker plot is also useful to visualize a single sample because you can show outliers if you choose. You can use the boxplot() command to create box-whisker plots. The command can work in a variety of ways to visualize simple or quite complex data.
    Basic Boxplots
    The following example shows a simple data frame composed of two columns:
    > fw
             count speed
    Taw          9     2
    Torridge    25     3
    Ouse        15     5
    Exe          2     9
    Lyn         14    14
    Brook       25    24
    Ditch       24    29
    Fal         47    34
    You have seen these data before. You can use the boxplot() command to visualize one of the variables here:
    > boxplot(fw$speed)
    This produces a simple graph like Figure 7-1. This graph shows the typical layout of a box-whisker plot. The stripe shows the median, the box represents the upper and lower hinges, and the whiskers show the maximum and minimum values. If you have several items to plot, you can simply give the vector names in the boxplot() command:
    > boxplot(fw$count, fw$speed)
    The resulting graph appears like Figure 7-2. In this case you specify vectors that correspond to the two columns in the data frame, but they could be completely separate.
    Customizing Boxplots
    A plot without labels is useless; the plot needs labels. You can use the xlab and ylab instructions to label the axes. You can use the names instruction to set the labels (currently displayed as 1 and 2) for the two samples, like so:
    > boxplot(fw$count, fw$speed, names = c('count', 'speed'))
    > title(xlab = 'Variable', ylab = 'Value')
    The resulting plot looks like Figure 7-3. In this case you used the title() command to add the axis labels, but you could have specified xlab and ylab within the boxplot() command.
  • Book cover image for: Understanding Business Statistics
    • Ned Freed, Stacey Jones, Timothy Bergquist(Authors)
    • 2013(Publication Date)
    • Wiley
      (Publisher)
    EXERCISES
    Use a leaf unit of .1. (Note: with a leaf unit of .1,
    Algeria                1.74    Madagascar             4.96
    Angola                 5.54    Malawi                 5.35
    Benin                  5.22    Mali                   6.35
    Botswana               2.46    Mauritania             4.22
    Burkina Faso           6.07    Morocco                2.19
    Burundi                6.08    Mozambique             5.40
    Cameroon               4.09    Namibia                2.41
    Cape Verde             2.44    Niger                  7.52
    Central African Rep.   4.57    Nigeria                5.38
    Chad                   4.93    Rep of Congo           5.59
    Comoros                4.09    Rwanda                 4.81
    Djibouti               2.63    Sao Tome and Principe  4.92
    DR Congo               5.09    Senegal                4.69
    Egypt                  2.94    Seychelles             1.90
    Equat Guinea           4.83    Sierra Leone           4.90
    Eritrea                4.37    Somalia                6.25
    Ethiopia               5.97    South Africa           2.28
    Gabon                  4.56    Sudan                  4.17
    Gambia                 4.10    Swaziland              3.03
    Ghana                  3.39    Tanzania               4.02
    Guinea                 5.04    Togo                   4.64
    Guinea-Bissau          4.44    Tunisia                2.02
    Ivory Coast            3.82    Uganda                 6.65
    Kenya                  3.98    West Sahara            4.22
    Lesotho                2.89    Yemen                  4.45
    Liberia                5.02    Zambia                 5.90
    Libya                  2.90    Zimbabwe               3.61
    29. Refer to Exercise 13 (Home Runs). Construct a “stretched” stem and leaf diagram to represent the data. Use stems of 3, 3, 4, 4, etc.
    30. Refer to Exercise 21 (iTunes Apps). Construct a stem and leaf diagram to represent the data. Use stems of 16, 17, 18, etc. and a leaf unit of 100.

    Box Plots
    Another useful visual format in exploratory data analysis is called a box-and-whisker plot, or, more simply, a box plot. With an emphasis on quartiles, Box Plots are rudimentary pictures that cleverly show the central tendency, the dispersion, the shape, and any extreme values in a data set. We can use the plot shown in Figure 3.1 to illustrate the idea (labels have been added here for explanation):
    [FIGURE 3.1: Box Plot Illustration. Labels: Smallest, Q1, Q2 (Median), Q3, Largest, Middle 50%. Axis from 220 to 250.]
    In a box plot, the box extends from the first quartile to the third quartile. The position of the median is indicated inside the box. The “whiskers” typically extend to the largest and smallest values (unless there are outliers in the data).
  • Book cover image for: Data Analysis for the Life Sciences with R
    A boxplot shows these 3 values along with a range of the points within median ± 1.5 (75th percentile - 25th percentile). Values outside this range are shown as points and sometimes referred to as outliers.
    [FIGURE 3.5: Histogram and QQ-plot of executive pay.]
    boxplot(exec.pay, ylab="10,000s of dollars", ylim=c(0, 400))
    [FIGURE 3.6: Simple boxplot of executive pay.]
    Here we show just one boxplot. However, one of the great benefits of boxplots is that we could easily show many distributions in one plot, by lining them up, side by side. We will see several examples of this throughout the book.
    3.3 Scatterplots and Correlation
    The methods described above relate to univariate variables. In the biomedical sciences, it is common to be interested in the relationship between two or more variables. A classic example is the father/son height data used by Francis Galton [1] to understand heredity. If we were to summarize these data, we could use the two averages and two standard deviations since both distributions are well approximated by the normal distribution. This summary, however, fails to describe an important characteristic of the data.
    data(father.son, package="UsingR")
    x = father.son$fheight
    y = father.son$sheight
    plot(x, y, xlab="Father's height in inches", ylab="Son's height in inches", main=paste("correlation =", signif(cor(x, y), 2)))
    [FIGURE 3.7: Heights of father and son pairs plotted against each other.]
    The scatter plot shows a general trend: the taller the father, the taller the son. A summary of this trend is the correlation coefficient, which in this case is 0.5. We will motivate this statistic by trying to predict the son’s height using the father’s height.
    3.4 Stratification
    Suppose we are asked to guess the height of randomly selected sons. The average height, 68.7 inches, is the value with the highest proportion (see histogram) and would be our prediction.
  • Book cover image for: Research Methods in Radiology
    • Andrea S. Doria, George Tomlinson, Joseph Beyene, Rahim Moineddin(Authors)
    • 2018(Publication Date)
    • Thieme
      (Publisher)
    [Fig. 2.7: Number of imaging examinations in a small region with a mode of 175 images.]
    when data are unimodal, or have a single mode. In this instance, most of the summary information in the sample is obtained by looking at the range, the median, and two matching quantiles, such as the first and third quartiles. When these five pieces of information are all someone wishes to use to obtain an impression of data, this can be easily represented in a boxplot, shown in Fig. 2.8. The boxplot is also useful for skewed, unimodal data. The ends of the box correspond to the first and third quartiles, with the mark in the middle of the box indicating the second quartile, the median. One indication of skewness comes from the line in the middle, the median, not lying in the middle of the first and third quartiles. Examples of this are shown in Fig. 2.9a and Fig. 2.9b.
    There are lines extending from the box called whiskers. The length of whiskers follows less universal rules. Whiskers never extend beyond the range of the data and are meant to indicate where another large mass of the data lies. In R the default is to make the whiskers no more than 50% longer than the distance between Q1 and Q3, consistent with a common convention by the boxplot’s inventor [10]. Any points beyond this range are explicitly drawn as being of particular interest, possibly outliers. Such points are few in number (relative to the size of the data) in unimodal data. It is possible that these points represent different populations or measurement errors, but given large enough samples, it is expected that some points will be far away from the middle of the data.
    When data are not unimodal, impressions of data are extremely unreliable. Recall our example of the duration of time to conduct imaging examinations. The boxplot is shown in Fig. 2.10. The boxplot, a summary of five numbers, is not capable of describing the nuances of multimodal data.
  • Book cover image for: Statistical Methods for Communication Science
    • Andrew F. Hayes(Author)
    • 2020(Publication Date)
    • Routledge
      (Publisher)
    A box plot contains information about the median of a distribution, the interquartile range (IQR), the measurement interval that contains the inner 50% of measurements, and the minimum and maximum measurements in a distribution, while at the same time highlighting measurements that are unusual using certain criteria. A box plot of the TV viewing data is displayed in Figure 4.4.
    [Figure 4.4: A box plot of the TV viewing data. Vertical axis: TV Viewing Hours. Labels: Median; Inner 50%; Median + 1.5 IQR (or maximum, whichever is smallest); Median - 1.5 IQR (or minimum, whichever is largest); outliers; extremes.]
    The figure itself is fairly self-explanatory. The dark line dividing the gray box is the median, while the upper and lower edges of the box define the end points of the ordinal middle 50% of the measurements. From the box plot, you can see that the median measurement is 2, whereas 50% of the measurements reside between 1.5 and 3. By definition, then, the interquartile range is 3 − 1.5 = 1.5. The long horizontal lines above and below the box are set at the median plus and minus 1.5 interquartile ranges. However, if the median plus 1.5 IQRs exceeds the maximum measurement, then the upper line is placed at the maximum. If the median minus 1.5 IQRs is smaller than the minimum measurement, then the lower line is set at the minimum. The box plot also depicts the “unusual” measurements, defined as those with measurements that are more than 1.5 IQRs from the median (in either direction). Different statistical programs will depict unusual cases differently. In SPSS (which generated this figure), “outliers” in a box plot are defined as cases with measurements between 1.5 and 3 IQRs from the median. “Extreme values” are defined by SPSS as measurements more than three IQRs from the median.
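The placement rule described in this excerpt, with distances measured from the median and the lines clamped to the observed minimum and maximum, can be sketched as below. The median (2) and quartiles (1.5 and 3) come from the text; the minimum of 0 and maximum of 12 hours are assumptions made only for illustration, and both function names are hypothetical.

```python
def whisker_lines(median, iqr, minimum, maximum, k=1.5):
    """Place the lines at median -/+ k*IQR, clamped to the observed range."""
    upper = min(median + k * iqr, maximum)
    lower = max(median - k * iqr, minimum)
    return lower, upper


def label(value, median, iqr):
    """Label a measurement by its distance from the median, in IQR units."""
    distance = abs(value - median) / iqr
    if distance > 3:
        return "extreme value"
    if distance > 1.5:
        return "outlier"
    return "usual"


iqr = 3 - 1.5  # interquartile range from the excerpt
lower, upper = whisker_lines(median=2, iqr=iqr, minimum=0, maximum=12)
print(lower, upper)
print(label(4, 2, iqr), label(5, 2, iqr), label(12, 2, iqr))
```

With these inputs the upper line sits at 2 + 2.25 = 4.25 hours, while the lower line is clamped to the assumed minimum of 0 because 2 − 2.25 would fall below it.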
Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), each adding context and meaning to key research topics.