Improperly anonymized taxi logs reveal drivers' identity, movements
Posted on 24 June 2014.
Software developer Vijay Pandurangan has demonstrated that sometimes data anonymizing efforts made by governments and businesses are worryingly inadequate, as he managed to easily deanonymize data detailing 173 million individual trips made by New York City taxi drivers.


The data was provided to Chris Whong, an "urbanist, mapmaker, data junkie", following a Freedom of Information request, and he made it available to the public.

"Each trip record includes the pickup and dropoff location and time, anonymized hack licence number and medallion number (i.e. the taxi's unique id number), and other metadata," explained Pandurangan.

Government officials did, to their credit, try to anonymize the personally identifiable information (driver's licence number and taxi number), but unfortunately they did it poorly: they used the MD5 algorithm to hash it.

"A cryptographically secure hashing function, like MD5 is a one-way function: it always turns the same input to the same output, but given the output, it's pretty hard to figure out what the input was as long as you don't know anything about what the input might look like. The problem, however, is that in this case we know a lot about what the inputs look like," Pandurangan pointed out.

He knew that NYC taxi licence numbers are 6-digit numbers or 7-digit numbers starting with a 5, and the specific patterns to which taxi numbers had to conform.

He then simply calculated all the possible hashes for both numbers (in less than two minutes!), and used that list to discover the original numbers. With that information in hand he used online resources to look up the identities of the owners of medallions.
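The lookup-table approach described above can be sketched roughly as follows. This is an illustration, not Pandurangan's actual code: the function names are hypothetical, and it assumes that each licence number was hashed as a plain ASCII digit string and that 6-digit numbers may carry leading zeros. Only the licence format given in the article is covered, since the medallion patterns are not spelled out here.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Optional


def candidate_licence_numbers() -> Iterator[str]:
    """Yield every candidate licence number matching the article's description."""
    # Assumption: 6-digit numbers are zero-padded strings 000000..999999.
    for n in range(1_000_000):
        yield f"{n:06d}"
    # 7-digit numbers starting with a 5: 5000000..5999999.
    for n in range(5_000_000, 6_000_000):
        yield str(n)


def build_lookup_table(candidates: Iterable[str]) -> Dict[str, str]:
    """Map the MD5 hex digest of each candidate back to the candidate itself."""
    # Assumption: the published hash is MD5 over the ASCII digits, nothing else.
    return {hashlib.md5(c.encode("ascii")).hexdigest(): c for c in candidates}


def deanonymize(hashed: str, table: Dict[str, str]) -> Optional[str]:
    """Return the original number for a published hash, or None if unknown."""
    return table.get(hashed.lower())
```

Under these assumptions there are only 2,000,000 candidates, so `build_lookup_table(candidate_licence_numbers())` is a small job for any modern machine, and every hashed value in the dump can then be reversed with a single dictionary lookup.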

He was then effectively in possession of information that showed the daily movements of those individuals.

"This anonymization is so poor that anyone could, with less than 2 hours' work, figure which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers' gross income, or infer where they live," he noted, and added that using hash functions to anonymize data is not a good solution, and that fact has been proven over and over again.

He offered two alternative solutions for this particular example: assigning a random number to each hack licence number and medallion number once, and re-using it throughout the dump file, or creating a secret AES key, and encrypting each value individually.
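The first of these alternatives, a random identifier assigned once per value and reused throughout the dump, might look something like the sketch below. The `Pseudonymizer` class and its method name are hypothetical, chosen only for illustration. The AES option is not shown because Python's standard library does not include an AES implementation.

```python
import secrets
from typing import Dict, Set


class Pseudonymizer:
    """Replace each distinct value with a random token, reused on repeat sightings."""

    def __init__(self) -> None:
        self._mapping: Dict[str, str] = {}  # original value -> random token
        self._used: Set[str] = set()        # tokens already handed out

    def pseudonym_for(self, value: str) -> str:
        token = self._mapping.get(value)
        if token is None:
            # The token comes from a CSPRNG and is unrelated to the input,
            # so enumerating possible inputs reveals nothing about it.
            token = secrets.token_hex(16)
            while token in self._used:  # collision is astronomically unlikely
                token = secrets.token_hex(16)
            self._used.add(token)
            self._mapping[value] = token
        return token
```

The same licence number always maps to the same token, so trips by one driver can still be grouped for analysis. The mapping itself has to be kept secret or discarded once the dump is written, because anyone who holds it can reverse the substitution.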









Copyright 1998-2014 by Help Net Security.