Will Google News get it right with its news retrieval algorithm?
Ten years after its launch, Google News' raw numbers are staggering: 50,000 sources scanned, 72 editions in 30 languages. Frederic Filloux writes.
Updated: Feb 26, 2013 02:43 IST
Google's crippled communication machine, plagued by bureaucracy and paranoia, has never been able to come up with tangible facts about its benefits for the news media it feeds on.
Its official blog merely mentions "6 billion visits per month" sent to news sites, and Google News claims to connect "1 billion unique users a week to news content". To put things in perspective, NYT.com and the Huffington Post are cruising at about 40 million unique visitors (UVs) per month.
Assuming the clicks are sent to relatively fresh news pages bearing higher-value advertising, the 6 billion monthly visits can translate into about $400m per year in ad revenue.
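To see what that figure implies per visit, the two numbers quoted can be worked backwards. The short sketch below uses only the 6 billion monthly visits and the $400m yearly revenue; the variable names are illustrative, and treating traffic as flat across all twelve months is an assumption made here for simplicity.

```python
# Back-of-the-envelope check using only the two figures quoted in the article.
MONTHLY_VISITS = 6_000_000_000        # "6 billion visits per month"
ANNUAL_AD_REVENUE_USD = 400_000_000   # "about $400m per year"

# Assumption: traffic is the same every month.
annual_visits = MONTHLY_VISITS * 12

# Implied revenue per visit and per thousand visits.
revenue_per_visit = ANNUAL_AD_REVENUE_USD / annual_visits
revenue_per_thousand_visits = revenue_per_visit * 1000

print(f"Annual visits: {annual_visits:,}")
print(f"Implied revenue per 1,000 visits: ${revenue_per_thousand_visits:.2f}")
```

In other words, the estimate amounts to a little over half a cent of advertising revenue per visit sent by Google.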
That's a very rough estimate. Again: Google should settle the matter and come up with accurate figures for its largest markets.
But how exactly does Google News work? What kind of media does its algorithm favour most? Last week, the search giant updated its patent filing with a new document detailing the 13 metrics it uses to retrieve and rank articles and sources for its news service.
What follows is a summary of those metrics, listed in the order shown in the patent filing.
A first metric in determining the quality of a news source may include the number of articles produced by the news source during a given time period [week or month].
This metric may be determined by counting the number of non-duplicate articles produced by the news source over the time period [or] counting the number of original sentences produced by the news source.
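The two counting options in the filing can be illustrated with a small sketch. This is not Google's implementation: the `Article` record, the function names, the exact-match duplicate test and the reading of "original" as "not found in any other source during the same period" are all assumptions made for the sake of the example.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one published article."""
    source: str
    published: date
    text: str


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(text.lower().split())


def _sentences(text: str) -> list[str]:
    # Naive split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_non_duplicate_articles(articles, source, start, end):
    """Count distinct article texts from `source` published in [start, end]."""
    seen = set()
    for a in articles:
        if a.source == source and start <= a.published <= end:
            seen.add(_normalize(a.text))
    return len(seen)


def count_original_sentences(articles, source, start, end):
    """Count sentences from `source` that no other source published in [start, end].

    Assumption: a sentence is "original" if it matches no sentence from
    another source after normalization.
    """
    own, others = set(), set()
    for a in articles:
        if not (start <= a.published <= end):
            continue
        sentences = {_normalize(s) for s in _sentences(a.text)}
        if a.source == source:
            own |= sentences
        else:
            others |= sentences
    return len(own - others)
```

A real system would need fuzzy duplicate detection, since lightly rewritten wire copy would slip past an exact-match test like this one.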
This metric clearly favours production capacity. It benefits big media companies deploying large staffs.
But the system can also be gamed by content farms (Google has already addressed these issues). New automated content-creation systems are gaining traction, and many of them could now easily pass the Turing Test.
The other metrics are listed in the box. The last metric concerns writing style. In the Google world, this means statistical analysis of content against a huge language model to assess "spelling correctness, grammar and reading levels".
Google intends to favour legacy media (print or broadcast news) over pure players. All the features recently added, such as editor's pick, reinforce this bias.