Johan's blog
Data accuracy, a killer app?
I am one of the 992,653 (and counting) followers of Tim O'Reilly on Twitter. O'Reilly tweets very often about "cool stuff", which makes it hard for me to distinguish the really cool stuff from the "nice tries". But one tweet, about "the killer app for Apps for America", caught my attention. It is about data integrity for government data.
The article O'Reilly refers to can be found on the Sunlight Labs blog. The bottom line is that the accuracy of (government) data is extremely important and is often beyond the developer's scope. The article lists three ways to improve data accuracy:
1. Make Data Imports Modular and Reusable
2. Create tests on the regulations around the data
3. Provide a way for users to report inaccuracies
The first and second ways are well known in software development, although they are often ignored in the government projects I am aware of, because the requirements in those projects are mostly non-technical.
The third way is still rare, although interest in the concept is growing. Allowing people to report inaccuracies is often hard, since it basically means admitting that you might be wrong, and that is something many customers and organizations still don't want to hear. However, the benefits of using the community to improve the quality of data are huge, depending of course on the nature of your community and data. Especially when large amounts of public data are at the core of a project, it definitely helps to let users report on the data quality. This goes further than reporting "inaccuracies", though: users can also indicate how important or valuable particular data is to them.
An interesting example is data related to road and street conditions. At least in the country where I live, government bodies are responsible for keeping road signs visible, making sure there are no dangerous road conditions, and so on. But this requires intensive checks by a number of people. A missing sign will probably be spotted by hundreds of people before it is seen by the government body.
It would be easier and cheaper if one of the hundreds of people who saw the defect earlier reported the problem online or via an in-car device, PDA or mobile phone. If this is where the software stops, though, the data quality is unlikely to improve. There will be quite a few jokers who report things as missing that are not missing at all. We could use strong authentication and written agreements about punishment for fake reports, but this would scare many people away. In many cases, the community can regulate itself: good citizens are recognized and rewarded, and bad behavior is filtered out as spam. I recently reviewed the book Algorithms of the Intelligent Web, which covers a number of techniques that can be used to improve the quality of data based on user behavior and algorithms.
Apart from the algorithms, there is a social aspect. In communities, users want to be recognized and appreciated. An implicit or explicit "ranking" amongst users can motivate them to be "good citizens". Good behavior is rewarded by an increase in ranking; bad behavior causes a drop in ranking.
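To make the ranking idea a bit more concrete, here is a minimal sketch in Python of how reports could be weighed by the reputation of the people who filed them. It is only an illustration under my own assumptions: the class name ReportBoard, the threshold, the starting reputation and the step size are all hypothetical, and it is not taken from the Sunlight Labs article or from the book mentioned above.

```python
from collections import defaultdict

# Hypothetical tuning values; a real community would need to calibrate these.
INITIAL_REPUTATION = 1.0
STEP = 0.5
MIN_REPUTATION = 0.0


class ReportBoard:
    """Collects defect reports and weighs them by reporter reputation."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        # Every user starts with the same reputation.
        self.reputation = defaultdict(lambda: INITIAL_REPUTATION)
        # Maps a defect id to the set of users who reported it.
        self.reports = defaultdict(set)

    def report(self, user, defect_id):
        # A set ensures one user cannot inflate a report by repeating it.
        self.reports[defect_id].add(user)

    def score(self, defect_id):
        # The weight of a report is the summed reputation of its reporters.
        return sum(self.reputation[u] for u in self.reports.get(defect_id, ()))

    def is_credible(self, defect_id):
        return self.score(defect_id) >= self.threshold

    def resolve(self, defect_id, confirmed):
        """After an inspection, reward or penalize everyone who reported."""
        delta = STEP if confirmed else -STEP
        for user in self.reports.pop(defect_id, set()):
            self.reputation[user] = max(MIN_REPUTATION, self.reputation[user] + delta)


board = ReportBoard(threshold=3.0)
for user in ("ann", "bob", "cas"):
    board.report(user, "missing-sign-42")

if board.is_credible("missing-sign-42"):
    # Pretend an inspector went out and confirmed the sign is gone.
    board.resolve("missing-sign-42", confirmed=True)
```

With these numbers, three fresh users are enough to make a report credible, while users whose earlier reports turned out to be fake carry less and less weight. The final check by an inspector is still a human step; the software only decides which reports are worth that check.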
My conclusion is that the quality of data can often be improved by software, but a dramatic improvement comes from the right combination of community software and human community development.