He’s dead, Jim! … musing on Open Data in the age of peta-data

I had a startling realization today, an aha moment if you like: in a few years, the clamor for open data will die out.


Human behavior and social economics, or at least my three pence thought on the issue, tell me as much. My reasoning is simple: we only throw away what we do not need now and never will.

Over the past few years, the talk around open data, data science, big data … and all things data has grown louder. Everyone wants a piece of the action. Everyone wants to use their data to optimize processes, to make better decisions, to illuminate spending, … to be better. Inadvertently, what was previously trash (non-usable) is now being hoarded in abundant stores and servers. Suddenly, data has become that diamond in the rough. And with reactions reflective of its newly acquired status, the grip on data has tightened, and it is likely to grow tighter with every new sparkle of the data diamond. We have collectively become compulsive hoarders.

In 2012, 2.5 quintillion bytes of data (a quintillion is 1 followed by 18 zeros) were created every day, with 90% of the world’s data created in the last two years.

The question that arises is just how much of this data is open, or ever will be. The reality is that only a negligible percentage will ever be open, and even then, the data will have been too heavily summarized and anonymized to be realistically usable. Human behavior suggests that we only throw out what we consider garbage: what we do not hope to use, ever. And with such growth of data, the data dumps will grow at an unprecedented rate, a rate that the release of open data will never match.

Watching the BBC documentary Trashopolis gives a glimmer of hope: that even in the putrid mess of outdated, highly summarized, and distorted data, there will be those who dive in and make some semblance of order, some meaning, some ideas. Maybe someone will be able to build a beautiful island of trash data, like these three amazing trash islands. Maybe, just maybe, some George Waring(s) will help bring some order to the chaos of trash data.


Stereotyping Kenyans on Twitter #KOT

Kenyans On Twitter (#KOT if you like) are peculiar. We are peculiar. How peculiar? … see the infographic.


I used standard t-tests and ANOVA to arrive at the conclusions. To download the data used, and other data, which is de-identified for privacy (?), check the blog_data folder at http://bit.ly/opendatake . If you are interested in the software output, it is in the same blog_data folder (I used R for the tests).

See! It’s three pence talk.



Should we ban night travel, and weekend travel too? A statistical answer

The most common kind of analysis, let’s call it man-on-the-street analysis, revolves around means (or averages, as they are usually called). The challenge with an average is that it can lead to very false conclusions. An interesting example of this rampant error is the recent debate over whether or not night travel should be banned. Though, as usual, the conversation degenerated, at least on the Twittersphere, into the very hilarious and totally irrelevant #EngKamauLogic, it revealed the flaws, or otherwise, in our thinking that banning night travel would reduce road accidents, and consequently deaths and injuries. Proponents of the ban have cited examples such as Tanzania, where incidents of road carnage are rather few, and have tied this to Tanzania’s policy of banning night travel.
I want to explore this question by questioning another assumption: that road accidents generally cause more deaths and injuries during weekends than during weekdays. The intention is to show, statistically, why our ‘common statistics’ is wrong, and why banning night travel (and weekend travel while they are at it) could be a mistake.
The data used in this analysis is taken from the #opendatake portal (someone needs to provide more recent data). It comprises 1354 recorded cases of road accidents across the country.

New Picture (1)

Standard descriptive analysis

[Figure: histogram of the number of accidents by day of the week]


Looking at the histogram above, there is a temptation to suggest that Fridays and Saturdays are black days, given the frequency of accidents on those days. When you set this against a strip chart (below), a few things come into perspective.

[Figure: strip chart of the number of deaths per accident by day of the week]
The strip chart puts the histogram into perspective. It suggests that while Friday and Saturday register roughly the same number of accidents (histogram), Saturday registers, on average, more deaths than Friday. It also shows that Mondays generally register fewer deaths (and accidents) than most other days, except for one outlier: the accident in Kitui on Monday, 19th September 2011 (http://www.standardmedia.co.ke/?id=2000043141&cid=159&articleID=2000043141). The distribution of deaths on the strip chart would suggest that Saturdays and Sundays have relatively more deaths than most weekdays. But is it right to assume that Saturday and Sunday are dark days? Note also the outliers, and the variation in the data points.
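To see why a single outlier matters so much when we reason from averages, here is a minimal sketch in Python. The numbers are made up for illustration and are not taken from the #opendatake data; the names `typical_monday`, `with_outlier` and `summarize` are my own.

```python
from statistics import mean, median

# Hypothetical deaths per accident on one weekday (illustrative values only,
# NOT taken from the #opendatake data).
typical_monday = [0, 1, 0, 2, 1, 0, 1, 1]

# The same day with one unusually deadly crash added.
with_outlier = typical_monday + [30]


def summarize(values):
    """Return (mean, median) so the two summaries can be compared."""
    return mean(values), median(values)


print("without outlier:", summarize(typical_monday))
print("with outlier:   ", summarize(with_outlier))
```

In this toy data the single extreme crash pulls the mean from 0.75 up to 4, while the median stays at 1. That is the sense in which an average alone can point to a "dark day" that the bulk of the data does not support.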
Distribution analysis (inferential statistics)
While analysis of frequencies (the most basic kind of analysis) is not really wrong when used correctly, a more robust approach is to look at the distribution of the data (inferential statistics) rather than only the frequencies (descriptive statistics). Analyzing the distribution allows more general and more accurate conclusions to be drawn from the data. Distribution-based tests are less swayed by outliers and extreme values, and they can be used to test hypotheses or assumptions about a data set.
I tested the hypothesis that there is no statistically significant difference between the number of deaths during weekdays and during weekends. (Null hypothesis: the number of deaths is the same whether it’s a weekend or a weekday.)
A crosstab of the data used is shown below.


[Figure: crosstab of the data used, weekdays versus weekends]
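For anyone who wants to rebuild a crosstab like this from the raw records, here is a rough Python sketch. The record layout (a date plus a death count) and the sample rows are assumptions for illustration; the real portal data may be organized differently.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, deaths) records. The layout and the values are
# assumptions for illustration, not the actual #opendatake columns.
records = [
    (date(2011, 1, 3), 3),   # Monday
    (date(2011, 1, 7), 1),   # Friday
    (date(2011, 1, 8), 2),   # Saturday
    (date(2011, 1, 9), 0),   # Sunday
    (date(2011, 1, 11), 1),  # Tuesday
]


def crosstab(rows):
    """Count accidents and deaths for weekdays versus weekends."""
    table = defaultdict(lambda: {"accidents": 0, "deaths": 0})
    for day, deaths in rows:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        group = "weekend" if day.weekday() >= 5 else "weekday"
        table[group]["accidents"] += 1
        table[group]["deaths"] += deaths
    return dict(table)


print(crosstab(records))
```

Here only Saturday and Sunday count as the weekend. If you think of Friday night as part of the weekend, the grouping rule is the one line to change.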


Non-parametric tests of independent samples (run in SPSS) produced the following results.

[Figure: SPSS output of the non-parametric independent-samples tests (Mann-Whitney and K-S)]


NB: The Mann-Whitney test suggests that we reject the null hypothesis. However, given that the data violates at least two assumptions of the Mann-Whitney test, we risk committing a Type I error (rejecting a null hypothesis that is actually true) if we rely on its result. The K-S test, on the other hand, makes fewer assumptions about the data and is consequently the better test in this setup. We therefore fail to reject the null hypothesis: there is no evidence of a statistically significant difference between deaths during the weekend and deaths during weekdays.

In short, the test shows that, at the 5% significance level, accidents during weekdays and those during the weekend do not differ significantly in the number of deaths and injuries.
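For the curious, the idea behind the two-sample K-S test can be sketched in a few lines of Python. This is not the SPSS procedure I ran, just a minimal illustration: the sample values are invented, the function names are my own, and the critical value uses the usual large-sample approximation, which is only rough for small samples or heavily tied data such as death counts.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * sqrt((n + m) / (n * m))


# Hypothetical deaths per accident (illustrative, not the real data).
weekday = [0, 1, 1, 2, 0, 1, 3, 1]
weekend = [1, 0, 2, 1, 1, 4, 0, 2]

d = ks_statistic(weekday, weekend)
critical = ks_critical_value(len(weekday), len(weekend))
print("D =", d, "critical value =", round(critical, 3))
print("reject null" if d > critical else "fail to reject null")
```

The test compares the whole shape of the two distributions rather than a single summary number. When the largest gap D stays below the critical value, as it does for these made-up samples, we fail to reject the null hypothesis.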

In conclusion, the above analysis shows how intuition and common statistics may lead to flawed conclusions. This conclusion must be taken with a few riders. First, the authenticity and correctness of this data cannot be ascertained. Secondly, the cases used in this analysis represent only a fraction of total accidents for 2011. According to police records, there were some 3000 deaths on the road that year. Our data only has about 1600. Thirdly, no statistical analysis is ever perfectly accurate; there is always some error.
At the very least, this analysis shows the error in drawing conclusions from common statistics of averages and means.
Elvis Bando (@levisdoban)