Publisher: O'Reilly Media
The Bad Data Handbook collects the experiences of 19 data scientists and experts, who share their methods for making data work for them rather than against them.
About the author
Q. Ethan McCallum works as a professional-services consultant, writer, and technology enthusiast. His technical interests range from data analysis to software and infrastructure.
Inside the book
The book is divided into 19 chapters, each written by a different author. Despite this, the book does not seem disorganized in the least, as each chapter tackles a distinct problem an analyst can encounter when handed a bucketload of data.
This book will take you, step by step, through methods of evaluating the quality of data of unknown provenance, teach you to turn a dataset meant for human consumption into a format that machines can analyze, and show you how to decipher questionable data stored in plain-text files (this last chapter also comes with a helpful set of exercises that will help you retain and deepen the knowledge you just acquired).
You'll also learn about all the things that can go wrong when you're scraping data off the Web.
The writers, each with a specific background and work history, draw on problems that they have had the (mis)fortune of being tasked with solving, and together they cover a wide spectrum of interests and sectors.
For example, Jacob Perkins will tell you how he managed to tell deceptive online reviews apart from genuine ones, and Spencer Burns will explain how the ever-changing data used in financial markets should be handled.
Other subjects touched on include musings on whether bad data (i.e. data that gets in the way) actually exists, the problems with data collection methods, how the way data is stored affects its analysis, the problems related to using the cloud for large-scale data analysis, and the pros and cons of using specific databases.
Questions about data traceability, the possibility of removing data from places where it should not be (social media, for example), and data quality are also brought up.
Without a doubt, this book is an extremely interesting read. But maybe I'm biased, because I love nothing more than to read about people's personal experiences. Still, I have to praise both the writers and the editors for doing a great job in keeping each chapter concise and to the point.
In this day and age, data collection, storage, and analysis are topics that everybody who works with information technology should be at least a little familiar with, and this book is perfect for learning about the problems that can arise in each stage as well as the solutions to them.