The super-fabulous School of Data has an excellent introduction to databases that any aspiring data-head should read. Specifically, today’s entry by Noah Veltman, entitled “SQL: The Prequel (Excel vs. Databases),” is on the differences between spreadsheets (Microsoft Excel, in particular) and databases. I’m proud of them for pointing out the many things that Excel does very well. (John Foreman’s new book Data Smart: Using Data Science to Transform Information into Insight also praises Excel and shows how to do some very sophisticated data work with it. I’m in the process of reading the book and will have more to say on it later.)
For database newbies (such as myself), one particularly interesting part is this:
So when should I use a database instead of Excel?
Excel and databases are good for very different things. Neither is per se good or bad. A rule of thumb: you should strongly consider using a database for a project to the extent that the following are true:
- You have a lot of data.
- Your data is messy or complex.
- You want to power something else with your data.
- Other people need to work with the same data.
You can see the rest of the post here.
Great stuff. Thank you, School of Data!
In the previous post we took a look at histograms and how they can be used to describe the shapes of distributions for quantitative variables. Specifically, we looked at modes and a few particular shapes, like normal (meaning bell curve), uniform, and U-shaped. In this post we’ll look at a few variations on normality: skewness and kurtosis.
One important thing to look for in a histogram is whether the data are symmetrical or whether they are skewed one way or another. For example, a normal distribution like we saw earlier is completely symmetrical. On the other hand, a distribution might have most of the scores at the low end and a few especially high ones. This kind of distribution often happens with things relating to money, such as income, where most people earn a small or moderate amount of money each year but a smaller number of people earn an enormous amount (such as the “1%” that the Occupy Wall Street people talked about). This is called a positively skewed distribution, or skewed right, because that’s where the extreme scores are. (And by the way, it’s called “positively skewed” because there is a formula for skewness and if you calculate it for this kind of distribution, then you get a positive number.) A positively skewed distribution looks like this:
On the other hand, a distribution may have the extreme scores on the low end, which is called negatively skewed or left skewed, again, because that’s where the unusual scores are. This is what the distribution for infant health looks like (as most infants are relatively healthy but a smaller number are very sick), as well as the distribution of college GPAs (as people who do poorly tend to drop out). It looks like this:
Skewness is important because it can dramatically affect the results of analyses when you don’t know that it’s there, so it’s always important to check. And, finally, here’s a chart that shows all three versions of skewness at once:
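If you want to check skewness numerically rather than just eyeballing a histogram, the standard formula (the third standardized moment) is only a few lines of Python. This is a minimal sketch, assuming numpy is installed; the exponential sample standing in for income-like data is my own illustration, not from the post:

```python
import numpy as np

def skewness(x):
    """Sample skewness: the average cubed z-score (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
right_skewed = rng.exponential(size=10_000)  # long right tail, like income data
left_skewed = -right_skewed                  # mirror image: long left tail
symmetric = rng.normal(size=10_000)          # bell curve

print(skewness(right_skewed))  # positive number
print(skewness(left_skewed))   # negative number
print(skewness(symmetric))     # close to 0
```

This matches the naming convention in the text: extreme scores on the high end give a positive skewness statistic, extreme scores on the low end give a negative one, and a symmetrical distribution comes out near zero.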
The final characteristic of distributions that we’ll discuss is kurtosis, which comes from the Greek word κυρτός, kyrtos or kurtos, which means “bulging.” Kurtosis has to do with how flat or pointed the distribution is compared to a normal (i.e., bell curve) distribution. In practical terms, the thing that most influences kurtosis is the presence of outliers, as outliers will give the distribution unusually long tails. This has the effect of making the middle part look relatively narrow and pointed.
The normal distribution is called mesokurtic, which means “middle bulge.” A platykurtic distribution on the other hand, is relatively flat, as “platykurtic” means “flat bulge” (like in the flat-tailed platypus). Finally, a leptokurtic distribution is narrow and pointed compared to the normal distribution, as “leptokurtic” means “narrow/thin bulge.” Here are illustrations of the three variations on kurtosis:
Platykurtic Distribution (Kurtosis is Negative)
Mesokurtic Distribution (Kurtosis is 0)
Leptokurtic Distribution (Kurtosis is Positive)
Here is another chart with similar curves, except this time it also shows the range of values for K (as in Kurtosis) for each version: a mesokurtic, normal distribution has a value of 0 for kurtosis, whereas a skinny, leptokurtic distribution has a positive number, and a flat, platykurtic distribution has a negative number (and I apologize for the pixeliness of this one):
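The three kurtosis values in those charts can also be computed directly. Here’s a sketch, again assuming numpy; the “excess kurtosis” convention subtracts 3 so that a normal distribution scores 0, matching the K values above. The uniform and t-distribution examples are my own stand-ins for platykurtic and leptokurtic shapes:

```python
import numpy as np

def excess_kurtosis(x):
    """Average fourth-power z-score, minus 3 so the normal distribution scores 0."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3

rng = np.random.default_rng(1)
mesokurtic = rng.normal(size=100_000)              # bell curve: K near 0
platykurtic = rng.uniform(size=100_000)            # flat, short tails: K negative
leptokurtic = rng.standard_t(df=5, size=100_000)   # heavy tails/outliers: K positive

print(excess_kurtosis(mesokurtic))
print(excess_kurtosis(platykurtic))
print(excess_kurtosis(leptokurtic))
```

Notice that the leptokurtic example gets its large positive K from its heavy tails, which fits the point above that outliers are what drive kurtosis in practice.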
But best of all is this drawing by William Gosset, the inventor of the t-test (which we will cover later), that uses the tails of animals as mnemonic devices. (Gosset published under the pseudonym “Student.” This illustration is from “Student.” (1927). Errors of routine analysis, Biometrika, 19, 151–164.)
That’s a platypus with its short, flat tail on the left for the platykurtic distribution. On the right are two long-tailed kangaroos, which are known for “lepping,” which is an antiquated version of “leaping” (especially from horse jumping.) And now you’ll never forget. (Thank you, Student!)
And here I am going over this info in the second half of the video on histograms. The link takes you straight to the part on skewness and kurtosis.
When you have quantitative data – that is, interval or ratio level variables – then one of your best choices for charting is a histogram. The most familiar example of this is a bell curve. Histograms are exceptionally helpful for getting a feel for a set of scores – how spread out they are, where the middle is, whether there are extremely high or low scores, etc. They also make it very easy to describe the shape of a distribution. Finally, histograms can be used for a simple kind of prediction, where the predictor variable is the one listed across the bottom and the thing being predicted is, for example, the probability of falling into a certain score grouping, which is given by the height of the bar for that score on the X variable, as will be seen below.
One of the most prominent characteristics of a distribution is its general shape. For example, one of the most important distributions in statistics is the bell curve, which is technically known as a “normal distribution.” (Remember that “normal” has a special meaning in statistics and always refers to bell-curve shaped distributions.) A normal distribution is curved, with a single hump (called a mode) in the middle, and symmetrical tails, like this:
Now that’s just the mathematically determined line that shows the top of the normal distribution. If you randomly sample 10,000 scores from that distribution and make a histogram, then this is what it looks like:
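In case you’d like to reproduce that experiment yourself, here is a sketch of drawing 10,000 scores from a normal distribution and binning them for a histogram. I’m assuming numpy; matplotlib’s `plt.hist(scores, bins=30)` would draw the actual picture:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(loc=0, scale=1, size=10_000)  # 10,000 draws from a normal

# np.histogram does the binning that a histogram chart displays as bars
counts, edges = np.histogram(scores, bins=30)

# the tallest bar should land near the middle of a normal distribution
peak_bin = counts.argmax()
peak_center = (edges[peak_bin] + edges[peak_bin + 1]) / 2
print(round(peak_center, 2))
```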
But lots of other shapes are possible. For example, if every score in a distribution is equally likely, which often happens with random variables, then you can have a uniform distribution, which is flat across the top like this:
Or, you could have something that is essentially the opposite of a normal distribution, called a U-shaped distribution, because it’s high on the ends and low in the middle, like the letter “U”:
Lots of other shapes are possible, and these can be described using some of the concepts that follow.
In statistics, the mode of a distribution is the most frequently occurring score. In a histogram, this shows up as the highest point or hump of the distribution. If there is only one obvious peak, then the distribution is called unimodal, as in “one mode,” like the normal distribution we saw earlier:
On the other hand, a distribution could have two pronounced peaks or humps, in which case it is called bimodal, as in “two modes.” Here are a few examples of bimodal distributions with varying degrees of separation:
An important thing to note is that when you have a bimodal distribution, what is usually happening is that there are actually two unimodal distributions that got combined. Here’s an example based on the top distribution shown right above:
In this example, there is a narrower normal distribution on the left (shown by the red curve), and a wider one on the right (shown by the blue curve). (The two curves are different heights because they are spread out differently.) When the two different distributions are combined, the gray histogram results, which has a bimodal shape.
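This “two distributions got combined” idea is easy to simulate. Here’s a sketch, assuming numpy; the means and spreads are made-up values chosen to give a narrow hump on the left and a wider one on the right:

```python
import numpy as np

rng = np.random.default_rng(7)
left = rng.normal(loc=0, scale=1.0, size=5_000)    # narrow distribution on the left
right = rng.normal(loc=6, scale=1.5, size=5_000)   # wider distribution on the right
combined = np.concatenate([left, right])           # this mixture is bimodal

counts, edges = np.histogram(combined, bins=40)

def count_near(value):
    """Bar height of the histogram bin containing `value`."""
    return counts[np.searchsorted(edges, value) - 1]

# two tall humps near 0 and 6, with a valley in between near 3
print(count_near(0.0), count_near(3.0), count_near(6.0))
```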
A distribution can have more than two modes or peaks, in which case it is sometimes called multimodal, although it may be more accurate to say that the distribution doesn’t really have any mode at all. Here’s an example of a multimodal distribution:
Just like with bimodal distributions, a multimodal distribution can often result from having multiple normal distributions that got combined, as you can see below:
Now, you’ll also want to talk about things like skewness and kurtosis, both of which have to do with how a histogram deviates from a simple bell curve. That’s where we’ll go in the next entry.
Thank you, XKCD.
A boxplot is a way of looking at an entire distribution at once. Any time I’m looking at a quantitative variable, I make two charts: a boxplot (which we’ll talk about here) and a histogram (which we’ll talk about in a following post). The good things about boxplots are:
- It’s easy to judge symmetry
- It’s really easy to see outliers
Here’s a simple version of a boxplot (and I’ve deleted the scale on the X axis to make things a little cleaner):
Boxplots get their name from the box in the middle of the chart; that box shows the scores that mark off the lowest 25% of the distribution and the highest 25% (known as the first and third quartiles, respectively). The line in the middle of the box is the median, or the value that splits the distribution into two equal sized groups of people. The lines on the left and right of the box go out to the lowest and highest non-outlying scores, while the circles are used to show outliers.
Outliers, which are unusually high or low scores, play an important role in data analysis because they can dramatically distort many common statistics. While there are many ways to determine if a score is an outlier, one of the most common and effective ways is based on the size of the box in the middle of a boxplot. As the chart below shows, the width of the box gives something called the “interquartile range.” (This will be discussed in more detail in a later chapter.)
If that interquartile range is multiplied by 1.5 and then tacked on to the top and bottom of the box, then anything further away than that is considered an outlier and should be given special attention (or removed) in any data analysis.
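That 1.5 × IQR rule is simple to code up. Here’s a minimal sketch, assuming numpy; the ten sample scores are invented to include one obvious outlier:

```python
import numpy as np

def iqr_outliers(x):
    """Flag scores beyond 1.5 * IQR outside the box (the boxplot rule)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])          # edges of the box
    iqr = q3 - q1                                # width of the box
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lower) | (x > upper)]

scores = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 25])
print(iqr_outliers(scores))  # → [25.]
```

A boxplot of these scores would draw the 25 as a lone circle well past the right whisker.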
The chart below shows the anatomy or components of boxplots:
I like to draw boxplots horizontally because that puts the scale in the same orientation as the scale in a histogram, which makes it easier to compare the two. On the other hand, you can also draw boxplots vertically. Either way, boxplots are good for checking for outliers. They’re also good because they’re compact and you can put several right next to each other, which lets you compare the distributions of several variables (as long as they’re on the same scale) or several groups on the same variable, like this chart showing five random variables, each consisting of 200 values from the normal distribution:
What’s interesting about this chart is that, despite the fact that all five distributions were drawn randomly from the normal distribution, there are differences. For example, the top distribution has one extremely low score and the second has none. In addition, the medians (the lines in the middle of the boxes) seem to shift around a fair amount. Those kinds of differences are easiest to see with boxplots.
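If you want to recreate that chart, here’s a sketch of the setup, assuming numpy; matplotlib’s `plt.boxplot(data.T, vert=False)` would draw the five boxplots, but even the printed five-number summaries show the medians bouncing around:

```python
import numpy as np

rng = np.random.default_rng(3)
# five variables, each with 200 random draws from the same normal distribution
data = rng.normal(size=(5, 200))

for i, x in enumerate(data, start=1):
    lo, q1, med, q3, hi = np.percentile(x, [0, 25, 50, 75, 100])
    print(f"var {i}: min={lo:.2f}  Q1={q1:.2f}  median={med:.2f}  Q3={q3:.2f}  max={hi:.2f}")
```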
And from my playlist “Introduction to Data Analysis: Lectures,” here is the all singing/all dancing version of the discussion. (Well, except without any music or movement.)
[Thanks again, XKCD. Notice that the middle panel is easiest to read. More on this below.]
When presenting categorical data (i.e., nominal or ordinal variables), bar charts are an excellent choice but pie charts are also a common option. They are especially good if you want to emphasize the proportions of people in each category. There are, however, two important qualifications for using pie charts:
- They work best if you only have a few categories or, at least, a few large ones (a pie chart that looks like chromatic spokes on a wheel is not helpful)
- The categories have to be mutually exclusive; that is, people need to be in only one category so the pie segments add up to 100%. This is not necessary with bar charts: each of the bars could be, for example, 99% if people fall into more than one category.
With those caveats in mind, you can easily do a pie chart in either program. In fact, in Excel you can simply right click on the bar chart, choose “Change Series Chart Type…”, and it will walk you through it all. Here is a pie chart based on the data from the bar chart example, using the Excel defaults (which is why one of the labels is in a weird place):
Problems with Pie Charts
As mentioned earlier, pie charts are common but they have some problems. The first is that they indicate values through angles and areas of circles, both of which can be very hard to read. These two sets of charts demonstrate the problem. Each chart shows the percentage of respondents in each of six categories. Your job is to determine whether there are more people in slice C or slice D of each pie chart.
Obviously, it’s difficult to do. On the other hand, when the same data are presented in bar charts, it’s a much easier task.
On the left and center bar charts, D is larger, but on the right chart, C is larger.
[By the way, Excel normally makes all of the bars the same color, which I believe is a good choice. However, I clicked “color by point” so each bar would match the colors in the previous pie charts.]
Another issue is that in pie charts the categories need to be mutually exclusive (which is not the case with bar charts). That is, the sections need to add up to 100% or you look silly, as in this example based on an actual chart of poll results from Fox News:
In this case, the percentages add up to 193%. Oops. The way to deal with this is to make it so the categories are mutually exclusive. In the above chart, the results are from a primary election and so the candidates are all from the same party; it’s not too surprising that a voter might like more than one of them. (This is less likely to happen in a general election when the candidates are all from different parties.) What Fox News should have reported was which candidates people back the most. Or, to give another example, if you’re looking at psychopathologies, you have to worry about the problem of comorbidity or multiple illnesses/diagnoses, such as the common combination of depression and anxiety. You can make a bar chart to show the prevalence of each disorder and your data can include people who have both. Consequently, their data would contribute to the magnitude of each bar. With a pie chart, on the other hand, you would have to choose one or the other and make a chart of primary diagnoses. That way, each person fits into just one category, although you may lose a substantial amount of data in the process.
[I’d also like to point out that the above pie chart leans back and has a false third dimension. Both of these contribute to misreading. For example, Romney's slice is at the bottom of the chart and the thickness of the pie contributes to the perception that his slice is larger than Palin’s (even though 60% < 70%). Both of these practices should be avoided. Shame on them.]
Also, there are variations on the pie chart, such as this “doughnut” chart from Excel, which is like a series of nested pie charts with a hole in the middle. The next two charts show the percentage of residents who have earned bachelor’s degrees in the highest state (Colorado), the lowest (West Virginia), Utah (for local interest), and the US as a whole. Here is the result:
And while it’s fun to look at, the chart is essentially useless because you can’t label the four rings. Instead, it may be easier to show the data with something else called a 100% stacked bar chart, which looks like this (again, using the Excel defaults):
Given the problems with pie charts, it’s usually best to just ignore them completely. And, just in case you want to hear me say that again, here is the relevant video from my ”Introduction to Data Analysis: Lectures” playlist. Now you know for sure.
- The massive burden of pie charts (junkcharts.typepad.com)
Bar charts are one good way to display categorical variables where you are showing, for example, the number of people (as either a frequency – the raw number – or as a percentage) in a particular group. They also work well for showing the means of quantitative variables, which we will discuss at later time. Bar charts are easy to interpret and nearly everybody has experience with them. However, some people think that they’re boring and try to spice them up with all of the effects that Excel offers, like the whopper above. Here’s a closer look at its extraordinary badness:
This chart is supposed to show the percentage of adults in the USA with different levels of education. Now, you might think that I’m intentionally causing problems with this chart but I swear to you that I have seen charts with every one of these problems. Here are (some of) the problems:
- The background colors are distracting, as are the two shadows behind the entire chart
- The background image is distracting
- The texture on the bottom provides poor contrast for the cones
- Instead of bars, cones with elliptical bases are used, making it hard to see the top of the cone, which is the most important part
- The cones are all visually different: some have patterns, some have gradients, some have textures, some have photos
- The view is almost straight down so it’s hard to see how tall the bars are
- The false third dimension removes the cones from the walls so, again, it’s hard to see how tall they actually are
- There are no grid lines on the walls to line up with the numbers on the axis
- The font on the axis is incredibly small
- There are numbers on the axis for every percentage point
- Although no normal human can see it, the numbers on the axis start at 8%, which put the grad degrees (7%) underground and invisible
- The range stops before the highest value
- The cones are arranged by alphabetical order
- Et cetera, et cetera, et cetera
A better version of this chart (then again, you’d have to work hard to make it worse) might look like this:
This one has several advantages:
- The flat, 2D layout is much easier to read
- There are no distracting colors
- By laying the bars out sideways, it is much easier to read the labels
- The bars are in a logical order by level of education
- The scale starts at zero and has a reasonable number of reference lines
- It is extremely easy to see which groups are higher and which are lower
Anyhow, the first one is a lot better at grabbing your attention but the second one is a lot better at informing, which is, after all, the purpose of statistical graphics.
Finally, I have a playlist on my YouTube channel that is called “Introduction to Data Analysis: Lectures,” which is for the stats course I teach. The video on bar charts covers this same material in a more, well, dynamic way (or something like that). Enjoy!
- These 230-Year-Old Charts and Graphs Were the Very First Infographics (gizmodo.com)
- 10 Tools for Creating Infographics and Visualizations (moz.com)
In statistics, 2 does not always equal 2. That is, the same number can carry different kinds of information depending on its level of measurement. For example, if you are studying helping behavior, you need to know whether a 2 indicates one kind of helping behavior (e.g., helping pick up dropped books) while a 4 indicates another kind of helping behavior (e.g., giving directions) or whether a 2 means half as many helping behaviors as a 4 does, or whether a person who gets a 2 is the second most helpful person in the study, or whether people strongly agree that a person is helpful on a -2 to +2 rating scale.
The most common way to describe different levels of measurement is nominal, ordinal, interval, and ratio. And while I will use these terms below, it can be more helpful to think of variables as just two kinds: categorical and quantitative.
A categorical variable is one that, as its name implies, indicates different categories. Examples include:
- College major
- Experimental condition
Categorical variables can be subdivided into two other common types of variables: nominal and ordinal.
These are sometimes called nominal variables because the variable consists of names for categories, like sex/gender, country of origin, model of car, etc. There is no order to these categories, and any numbers assigned to them (i.e., the coding system) are ultimately arbitrary. For example, if you want to code a dichotomous (i.e., two values only) variable for employment, you can use any of the following:
It doesn’t matter which version you use; the variable is always categorical. However, you may find that one coding system is easier than another to enter or interpret in a given situation, so make your choices accordingly.
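To make the point concrete, here is a small Python sketch (the category labels and code values are my own invented examples): three different coding schemes for the same dichotomous employment variable produce the same category structure, so any statistic that only depends on category membership comes out identical.

```python
from collections import Counter

raw = ["employed", "unemployed", "employed", "employed"]

scheme_a = {"employed": 1, "unemployed": 0}      # 0/1 dummy coding
scheme_b = {"employed": 1, "unemployed": 2}      # 1/2 coding
scheme_c = {"employed": "E", "unemployed": "U"}  # letter codes

coded_a = [scheme_a[v] for v in raw]
coded_b = [scheme_b[v] for v in raw]
coded_c = [scheme_c[v] for v in raw]

# whichever codes you pick, the counts per category are the same: 3 and 1
print(Counter(coded_a))  # Counter({1: 3, 0: 1})
print(Counter(coded_b))  # Counter({1: 3, 2: 1})
print(Counter(coded_c))  # Counter({'E': 3, 'U': 1})
```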
There are also ordinal variables or rank variables, which indicate first, second, third, and so on. Examples include tallest to shortest, first to last, fastest to slowest, etc. Although there are special statistical procedures for ordinal variables, they are usually difficult to deal with. As such, ordinal variables are usually treated as categorical variables without doing too much damage to the data.
Quantitative variables are ones where you can measure the size of the differences between scores and not just that they are different, as with categorical variables, or that they differ in rank, as with ordinal variables. Examples include:
- IQ scores
- Temperature (in Fahrenheit or Celsius)
- Time to complete a task
Quantitative variables can also be subdivided into two other common types of variables: interval and ratio.
To be technically correct, the first two examples of quantitative variables above are called interval variables because they indicate the size of the difference between scores but they don’t have a true zero starting point, either because they don’t have a zero (as with IQ) or because you can go right past zero into the negative numbers (as with temperature). Because of this, it isn’t possible to say that one value is twice as much as another (e.g., that 80 degrees is twice as hot as 40 degrees). One situation where it is common to hear the word “interval” is in races where the time for the leader or winner is given and then intervals, or the time behind the leader, are given for all the competitors who follow.
The last three examples, however, have definite zeroes that indicate the complete absence of something. These are called ratio variables and you can say that one value is twice as much as another (e.g., 10 minutes is twice as long as 5 minutes). However, the distinction between these is usually irrelevant for most analytic purposes and so it’s easier to just call both of them quantitative variables. Whenever possible, you should use quantitative variables instead of categorical variables because quantitative variables make a lot more statistical procedures possible.