February 5th 2010

Divide and Color
What do these thematic maps have in common? They are all mapping the exact same data. What makes them so different? They are using different classification methods.

One of the first questions that pops up when whipping up a thematic data visualization that has discrete range breaks (commonly a choropleth map, but not necessarily) is, how do I bucketize my numbers? Picking range breaks to drive your color categorization can range anywhere from arbitrary and predefined to rigorously statistical and dynamic, and the various options will generate very different looking results.

Some Options
There are lots of ways to carve datasets into discrete classes. I’ll go over three of them…

Quantile
Breaks the data into equally filled groups
Standard Deviation
Breaks the data into statistical chunks diverging from the mean
Equal Interval
Breaks the data into equally distant groups

Or you could just eyeball the data and then divide it into range breaks that look good or are easy to read. This, actually, is probably the most common method that I’ve seen in online mapping applications. It is also the most fertile ground for misunderstanding. More details on these methods below, including examples and how-to’s.

Distribution
When choosing a method of classifying your dataset into discrete ranges, there are a couple of things to consider off the bat. First, what does the data distribution look like (or, if the data is dynamic, what does it generally look like)? Is it skewed toward one extreme or the other? Is it relatively normal (bell shaped on a histogram)? Are there outliers to consider?
Applying various classification methods can create very different impressions of the data. Any interface is a manifestation of tradeoffs, so let’s take a look at some examples…

Normal Data
With relatively normally-distributed data, picking a classification method may not make a massive difference in your visualization.
Your biggest concern is probably how many breaks to make, and what colors to use. Check out this example of average age per county: nice and normal.

Normal data tends to deliver a relatively consistent message across many classification methods.

Skewed Data
Now it gets fun. With a dataset like the percent of folks who consider themselves multi-ethnic, the distribution is far from normal. In this case, there is a bulge at the lower end and a long tail that eventually pinches off around 30% multi-ethnic. What a difference the classification methods make here!
Am I telling the truth with the map (above) on the left? Yes. I can clearly see the locations of higher and lower proportions of multi-ethnic US residents, even regional trends and abrupt shifts.
Am I telling the truth with the map on the right? Yes. I get a clear indication that most places in the US are, proportionally, pretty low in multi-ethnic residents.
I’m telling the truth about two different things.

Skewed data may look way different depending on the classification method.

Examples in Detail

		Equal Interval for Normal data… Equal interval slices the data into equally distant range breaks. Some color buckets get more counties than others, but if the distribution is wide, then the visualization will be adequate. The gist: Evenly spaced, Unevenly Filled Buckets.
		Quantile for Normal data… Quantile yields a pretty high-contrast map that is reliably good looking. The fact that the data is normally distributed doesn’t really matter: each bucket has the exact same number of counties, but you’ll notice that in order to accommodate that, the ranges have to span varying distances. The gist: Unevenly spaced, evenly filled buckets.
		Standard Deviation for Normal data… Trusty old Standard Deviation. It is going to look alright in most cases, but it really shines when applied to normal datasets. You’ve got the mean in the middle and you chunk it out from there based on standard deviation distances in either direction from the mean. Beautiful. Also, don’t do what I did: you should put actual values in your legend instead of the math nerd standard deviation breaks. And while we’re at it, it’s often a good idea to pick a diverging color scheme for data that is classified by Standard Deviation. Pick a neutral color for the mean (center) range and then transition to one color on the left and another color on the right. ColorBrewer gives some nice background here along with a rocking tool to generate your own cartographic color schemes. The gist: Evenly spaced (to a statistician), unevenly filled buckets.
		Equal Interval for Skewed data… Equal Interval falls apart pretty easily. If the data is remotely skewed then it’s feast or famine for the evenly spaced color buckets. In this case most of the buckets are largely empty while the low end bucket (0% – 6% multi-ethnic) is jam packed. Equal Interval is more fair to the population as a whole but does not capture smaller scale fluctuations. To be fair, just because all the eggs are in one basket and the map is largely monochromatic doesn’t mean that it’s useless. You could argue that it is a perfectly fair treatment of the data because it illustrates the predominant characteristic of the data: it’s highly skewed to the lower end. The gist: Evenly spaced, unevenly filled buckets.
		Quantile for Skewed data… Quantile to the rescue. When buckets are defined by an equal number of member counties… Now, Devil’s Advocate. It could be argued that this method implies a false or misleading heterogeneity of the data. While the vast majority of counties have a proportionally tiny multi-ethnic population, this method could imply a greater variance (as compared to the Equal Interval example above). It’s just not fair. Devil’s Advocate, Advocate. How could you get any more fair than groups of equal size? Plus the result illustrates a finer articulation of the variance. Just remember, when reading a map, read those legends and take the range breaks for what they are worth. Quantile is a good illustration of that. The gist: Unevenly spaced, evenly filled buckets.
		Standard Deviation for Skewed data… Standard Deviation. It is still providing valuable visual breaks when applied to highly skewed data. But I can never get too cozy with it because it is so darn hard to explain. The gist: Evenly spaced (to a statistician), unevenly filled buckets.
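
If you would rather see "evenly filled" versus "unevenly filled" as numbers instead of colors, here is a minimal Python sketch that counts how many values land in each bucket for a given set of breaks. The `bucket_counts` helper, the sample values, and both sets of breaks are made up for illustration; they are not the county data behind the maps in this post.

```python
from bisect import bisect_right
from collections import Counter


def bucket_counts(values, breaks):
    """Count how many values land in each bucket.

    `breaks` holds the interior break points in ascending order, so the
    result has len(breaks) + 1 counts. A value equal to a break is
    counted in the upper bucket.
    """
    counts = Counter(bisect_right(breaks, v) for v in values)
    return [counts.get(i, 0) for i in range(len(breaks) + 1)]


# Hypothetical skewed sample (percent values): a bulge at the low end
# and a long tail, loosely imitating the shape described above.
skewed = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.5, 12.0, 30.0]

evenly_spaced = [10.0, 20.0]  # hand-picked, equal-interval style breaks
evenly_filled = [1.5, 4.0]    # hand-picked, quantile style breaks

print(bucket_counts(skewed, evenly_spaced))  # [10, 1, 1]
print(bucket_counts(skewed, evenly_filled))  # [4, 4, 4]
```

Same twelve numbers, two honest answers: the evenly spaced breaks pile almost everything into the lowest bucket, while the evenly filled breaks spread the values out but squeeze the first two ranges into a few percentage points.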

How-To’s
Equal Interval

Determine overall population range (highest value – lowest value) for the value of interest…
Determine range break distance (population range / desired # of buckets)…
Starting at the lowest value, insert a break every time you advance by that distance.
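
The Equal Interval steps can be sketched in a few lines of Python. This is one possible reading of the recipe, not a reference implementation: the function name `equal_interval_breaks`, its `num_buckets` parameter, and the sample ages are all assumptions made for illustration.

```python
def equal_interval_breaks(values, num_buckets):
    """Return the interior break points that split the range of
    `values` into `num_buckets` equally wide buckets."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    lowest, highest = min(values), max(values)
    # Range break distance: overall range divided by the bucket count.
    step = (highest - lowest) / num_buckets
    # Walk up from the lowest value, one step at a time.
    return [lowest + step * i for i in range(1, num_buckets)]


# Hypothetical average ages, not the county data from the maps.
ages = [30.0, 34.0, 36.0, 38.0, 40.0, 42.0, 50.0]
print(equal_interval_breaks(ages, 4))  # [35.0, 40.0, 45.0]
```

Four buckets need only three interior breaks; the lowest and highest values act as the outer edges of the first and last bucket.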

Quantile

Sort the values, then determine how many items go in each group (total # of items / desired # of groups)…
Insert a break after every Nth item in the sorted list.
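
Here is a rough Python sketch of the Quantile steps. It assumes a simple convention (each break is the first value of the next group) and a hypothetical function name, `quantile_breaks`; statistical packages define quantiles in several slightly different ways, so treat this as an illustration rather than the one true method.

```python
def quantile_breaks(values, num_groups):
    """Return interior break points that put (nearly) the same number
    of items into each of `num_groups` groups."""
    ordered = sorted(values)
    n = len(ordered)
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and len(values)")
    breaks = []
    for i in range(1, num_groups):
        # Index of the first item belonging to group i.
        cut = i * n // num_groups
        breaks.append(ordered[cut])
    return breaks


# Hypothetical skewed percentages, not the county data from the maps.
percents = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 6.5, 12.0, 30.0]
print(quantile_breaks(percents, 3))  # [1.5, 4.0]
```

Notice how unevenly spaced the breaks are: the first two groups span only a few percentage points while the last one stretches from 4 all the way to 30. Also, if many items share the same value, the groups can end up unevenly filled, because tied values cannot be split across a break.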

Standard Deviation

http://en.wikipedia.org/wiki/Standard_deviation…
Good luck.
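
Luck aside, here is a minimal Python sketch of Standard Deviation breaks using the standard library. The break positions (half and one-and-a-half standard deviations either side of the mean, so the middle bucket straddles the mean) are one common convention that I am assuming here, not something this post prescribes. The function name `std_dev_breaks`, the `offsets` parameter, and the sample data are hypothetical too.

```python
from statistics import mean, pstdev


def std_dev_breaks(values, offsets=(-1.5, -0.5, 0.5, 1.5)):
    """Return break points placed at the given numbers of standard
    deviations away from the mean, as actual data values."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    return [center + k * spread for k in offsets]


# Hypothetical sample with mean 5 and standard deviation 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev_breaks(sample))  # [2.0, 4.0, 6.0, 8.0]
```

Because the function returns real data values rather than "-1.5 SD" labels, the results can go straight into a legend. Swap `pstdev` for `stdev` if your numbers are a sample of a larger population rather than the whole thing.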

Truth
It’s one thing to willfully mislead others by the categorization and representation of data (obviously not cool). It’s another to do it by accident and mislead your audience and yourself. Varying classification methods will produce very different results. By gaining a little background about various methods of classification you’ll be in a better position to…

Create better, more effective visualizations
More keenly understand the visualizations of others
Read the legend
Use what you learn for good instead of evil

In any case, the thing to keep in mind is effective and truthful communication; your visualization should enable the data to tell its story.

Let us know if we can be of any help.

John Nelson / IDV Solutions / [email protected]


This post first appeared on the User Experience Blog of IDV Solutions.