Three examples of machine learning for publishers


In a session entitled ‘Getting started with machine learning for reporting’ at this year’s NICAR conference in Chicago, Peter Aldhous from BuzzFeed, Rachel Shorey from the New York Times, Chase Davis from the Minneapolis Star Tribune, and Anthony Pesce from the Los Angeles Times discussed machine learning and what’s in it for publishers.

Los Angeles Times: Machine learning to uncover skewed crime stats

In an investigation powered by machine learning algorithms, the Los Angeles Times uncovered that the Los Angeles Police Department misclassified an estimated 14,000 serious assaults as minor offenses from 2005 to 2012, thereby artificially lowering the city’s crime levels.

The Los Angeles Times used an algorithm that parsed crime data from a previous Times investigation in order to learn the keywords that identify assaults as either serious or minor. Find the data and code of this machine learning investigation here.
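To make the idea of "learning the keywords" concrete, here is a minimal sketch in Python. It is not the Times' code: the source does not say which algorithm was used, so a small naive Bayes classifier is swapped in for brevity, and the six training narratives are invented for illustration rather than taken from the investigation's data.

```python
import math
from collections import Counter

# Invented, illustrative narratives; NOT records from the Times investigation.
TRAINING = [
    ("victim stabbed with knife during argument", "serious"),
    ("suspect shot victim in the leg", "serious"),
    ("victim struck with bat and hospitalized", "serious"),
    ("suspect pushed victim during argument", "minor"),
    ("suspect slapped victim no injury", "minor"),
    ("victim shoved outside bar no injury", "minor"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING)
print(classify("victim stabbed in the arm", word_counts, label_counts))
print(classify("suspect shoved victim no injury", word_counts, label_counts))
```

Words such as "stabbed" or "shot" pull a narrative towards the serious label, while "shoved" or "no injury" pull it towards minor. A real project would need far more labelled examples and a held-out set to measure how often the classifier is wrong.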

New York Times: Shazam-ing members of Congress

Another project featured in the NICAR panel was ‘Who the hill?’, an app that has been referred to as ‘Shazam, but for House members’ faces’. It is an MMS-based facial recognition service that identifies members of Congress. Reporters can text pictures to a number set up by the New York Times team.

The face recognition app was built by two New York Times interactive interns, Gautam Hathi and Sherman Hewitt. ‘Reporters can use it to help figure out who is talking or presenting if they missed the intro or if they run into a member they don’t immediately recognise in the halls of the capitol’, wrote Shorey in our exchange of emails.

But what actually happens when you use machine learning?

It can be scary to launch yourself into a machine learning project, especially if you’ve never done it before.

During the NICAR session, Aldhous demystified the process. He came up with the following list of steps to put a machine learning project together:

Find a good library in your favourite programming language;
Read the documentation;
Confirm this is actually a good approach for you and that you understand all the inputs and outputs (even if you don’t understand all the maths);
Spend days to weeks cleaning your data;
Write around ten lines of code.
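The data-cleaning step in the list above is where most of the time goes. As a rough sketch of what it can involve, the Python snippet below normalises free-text records and drops blanks and duplicates using only the standard library. The field names `narrative` and `label` and the sample rows are hypothetical, chosen for illustration; real datasets will need their own rules.

```python
import re


def clean_records(rows):
    """Normalise free-text records and drop blanks and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Lowercase, replace punctuation with spaces, collapse whitespace.
        text = re.sub(r"[^a-z0-9\s]", " ", row.get("narrative", "").lower())
        text = " ".join(text.split())
        label = row.get("label", "").strip().lower()
        if not text or not label:
            continue  # skip rows missing either field
        if (text, label) in seen:
            continue  # skip exact duplicates after normalisation
        seen.add((text, label))
        cleaned.append({"narrative": text, "label": label})
    return cleaned


# Hypothetical rows, invented for illustration.
rows = [
    {"narrative": "  Victim STABBED, with knife. ", "label": "Serious"},
    {"narrative": "victim stabbed with knife", "label": "serious"},
    {"narrative": "", "label": "minor"},
]
print(clean_records(rows))
```

Once the records are in a consistent shape, the final step really can be short: most libraries need only a few lines to fit a model and score it on held-out data.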