Why Data Visualizations
A picture is worth 1000 words
A proverb is a simple and concrete saying, popularly known and repeated, that expresses a truth based on common sense or the practical experience of humanity. The proverb "A picture is worth 1000 words" is one you have probably heard more than once.
A picture can also be worth 1000 data points. In 1973, the statistician Francis Anscombe demonstrated the importance of graphing data. Anscombe's Quartet shows how four data sets with nearly identical simple summary statistics can look very different when graphed.
Anscombe's Quartet Data Table
Simple Summary Statistics of Anscombe's Quartet Data Table
|Mean of x in each data set||9 (exact)|
|Sample variance of x in each data set||11 (exact)|
|Mean of y in each data set||7.50 (to 2 decimal places)|
|Sample variance of y in each data set||4.122 or 4.127 (to 3 decimal places)|
|Correlation between x and y in each data set||0.816 (to 3 decimal places)|
|Linear regression line for each data set||y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively)|
Graph of Anscombe's Quartet Data Table
source: Wikimedia Commons
It is hard to tell how the data behaves from the data table alone. The simple summary statistics table would lead us to believe that all four data sets are the same. Only when we graph the data do we get a clear picture of how each set behaves.
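The summary statistics in the table can be checked with a few lines of Python. The sketch below is illustrative only: the values are the quartet as it is commonly published (the data table itself is not reproduced in the text here), and the names `QUARTET` and `summarize` are made up for this example. It uses the sample variance (dividing by n - 1) and an ordinary least-squares line.

```python
from statistics import mean, variance

# Anscombe's quartet as commonly published. Treat these values as an
# assumption: the data table is not reproduced in the surrounding text.
QUARTET = {
    "I": (
        [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    ),
    "II": (
        [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    ),
    "III": (
        [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    ),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def summarize(x, y):
    """Return the simple summary statistics for one (x, y) data set."""
    mx, my = mean(x), mean(y)
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # least-squares slope
    return {
        "mean_x": mx,
        "var_x": variance(x),  # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
        f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.3f} "
        f"corr={s['corr']:.3f} line: y = {s['intercept']:.2f} + {s['slope']:.3f}x"
    )
```

The four printed rows should agree with one another, and with the table, to roughly the precision the table states. That agreement is the point: the numbers alone cannot tell the data sets apart, but a scatter plot of each one can.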
A Famous Data Visualization
In 1854, a cholera outbreak killed more than 600 people in London. The physician John Snow made this outbreak famous by using data visualization to show that cholera is spread by contaminated water.
At this point in history the germ theory of disease was not yet widely accepted. This theory proposes that micro-organisms are the cause of many diseases. Because the theory had not taken hold, the spread of cholera was a mystery to public health officials.
John Snow spoke with locals near the cholera outbreak to discover the source of the disease. He used a "dot distribution map" to plot the cholera cases, and the map showed that they were clustered around a public water pump on Broad Street.
John Snow's Dot Distribution Map of Broad Street Cholera Cases
source: Wikimedia Commons
In addition to this public health data visualization, John Snow conducted further research. His work is seen as a major event in the history of public health and geography, and it is regarded as the birth of modern epidemiology.
Data - now and in the future
Gigantic amounts of data are being generated on a daily basis, and the amount is growing exponentially every year. Below is an infographic showing the data generated by the 2012 Olympic Games, a single event.
2012 Olympic Games Data Generated Infographic
If one event generates this much data, just think of how much data is going to be generated and ripe for analysis on a daily basis. This has led big data analysts to posit that data is the new oil.
Data is the new Oil
In 2006, Michael Palmer wrote "Data is the new oil!" declaring "Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."
Since then, several prominent people have repeated and supported the "Data is the new oil" statement. For instance, “Data is the new oil,” said Andreas Weigend, head of Stanford’s Social Data Lab and former Chief Scientist at Amazon.
Hal Varian, Google's Chief Economist, made a related point: "The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it."
This brings us back to the first heading of this section: "A picture is worth 1000 words." Understanding data and extracting value from it is far easier with a data visualization than with the raw data or its simple summary statistics.