Improperly anonymized taxi logs reveal drivers' identity, movements
Posted on 24 June 2014.
Software developer Vijay Pandurangan has demonstrated that sometimes data anonymizing efforts made by governments and businesses are worryingly inadequate, as he managed to easily deanonymize data detailing 173 million individual trips made by New York City taxi drivers.


The data was provided to Chris Whong, an "urbanist, mapmaker, data junkie", following a Freedom of Information request, and he made it available to the public.

"Each trip record includes the pickup and dropoff location and time, anonymized hack licence number and medallion number (i.e. the taxi's unique id number), and other metadata," explained Pandurangan.

Government officials did, to their credit, try to anonymize the personally identifiable information (driver's licence number and taxi number), but unfortunately they did it poorly: they used the MD5 algorithm to hash it.

"A cryptographically secure hashing function, like MD5 is a one-way function: it always turns the same input to the same output, but given the output, it's pretty hard to figure out what the input was as long as you don't know anything about what the input might look like. The problem, however, is that in this case we know a lot about what the inputs look like," Pandurangan pointed out.

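He knew that NYC hack licence numbers are either 6-digit numbers or 7-digit numbers starting with a 5, and he knew the specific patterns to which medallion numbers have to conform. For the licence numbers alone, that leaves at most two million possible inputs.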

He then simply calculated the MD5 hash of every possible licence and medallion number (in less than two minutes!), and used that list to map each hash in the data back to its original number. With that information in hand he used online resources to look up the identities of the owners of medallions.
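The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Pandurangan's actual code: it assumes the licence numbers were hashed as unsalted ASCII decimal strings and that 6-digit numbers may carry leading zeros, and it covers only hack licence numbers, since the medallion patterns are not spelled out here. The function names are hypothetical. Rather than storing every hash, it scans the candidates against the set of hashes to be reversed, which serves the same purpose.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def candidate_licences() -> Iterator[str]:
    """Yield every string matching the licence format described in the article."""
    # Assumption: 6-digit numbers are zero-padded decimal strings.
    for n in range(1_000_000):
        yield f"{n:06d}"
    # 7-digit numbers starting with a 5: 5000000 through 5999999.
    for n in range(5_000_000, 6_000_000):
        yield str(n)


def recover_licences(hashed_values: Iterable[str]) -> Dict[str, str]:
    """Map each MD5 hex digest found in the candidate space to its input."""
    # Normalize case so upper- and lower-case hex digests both match.
    targets = {h.lower() for h in hashed_values}
    recovered: Dict[str, str] = {}
    for candidate in candidate_licences():
        digest = hashlib.md5(candidate.encode("ascii")).hexdigest()
        if digest in targets:
            recovered[digest] = candidate
            # Stop early once every requested hash has been reversed.
            if len(recovered) == len(targets):
                break
    return recovered
```

Passing the hashed licence column of a dataset to `recover_licences` would return a mapping from each hash back to its licence number. Because the candidate space holds only two million values, the whole scan is a small, fixed amount of work.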

He was then effectively in possession of information that showed the daily movements of those individuals.

"This anonymization is so poor that anyone could, with less than 2 hours' work, figure out which driver drove every single trip in this entire dataset. It would even be easy to calculate drivers' gross income, or infer where they live," he noted, adding that using hash functions to anonymize data is not a good solution, a fact that has been proven over and over again.

He offered two alternative solutions for this particular example: assigning a random number to each hack licence number and medallion number once, and re-using it throughout the dump file, or creating a secret AES key, and encrypting each value individually.
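The first of these suggestions, a random number assigned once per identifier and reused throughout the dump, might look like the following sketch. The `Pseudonymizer` class and its method names are hypothetical, and the 128-bit token size is an assumption chosen so that collisions are negligible. The AES variant is not shown because it needs a third-party cryptography library.

```python
import secrets
from typing import Dict


class Pseudonymizer:
    """Replace identifiers with random tokens that are stable within one dump."""

    def __init__(self, token_bytes: int = 16) -> None:
        # Assumption: 16 random bytes (128 bits) per token.
        self._token_bytes = token_bytes
        self._mapping: Dict[str, str] = {}

    def token_for(self, identifier: str) -> str:
        """Return the token for an identifier, creating one on first use."""
        token = self._mapping.get(identifier)
        if token is None:
            # The token is drawn at random, so it carries no information
            # about the identifier and cannot be reversed by guessing inputs.
            token = secrets.token_hex(self._token_bytes)
            self._mapping[identifier] = token
        return token
```

The same licence number always receives the same token, so trips by one driver can still be grouped, but the mapping table itself has to be kept secret or discarded after the dump is produced.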









Copyright 1998-2014 by Help Net Security.