Saturday, December 3, 2011

A simple Google News - Lite Implementation

We describe here how a service like Google News might be built, and provide a simple prototype implementation. Google News was apparently invented by Krishna Bharat at Google.


At its simplest, it appears to work in three steps: collecting stories from various news sources; categorizing them by content similarity (first broadly, into a set of standard categories like Business, Health, Politics, etc., and next by collating links to the same story from different sources under a single story heading); and presenting these categories in an HTML interface for the user. The attached code snippet performs the collection of news stories indexed by date and the broad categorization, using category keywords found in each story's URL. TF-IDF is the natural tool for the finer, content-based grouping of stories.
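Indexing by date, in the code in this post, amounts to keeping only links whose URL embeds the day's date, written either as YYYY/MM/DD or as YYYYMMDD. A minimal Python 3 sketch of that check follows; the helper names date_markers and is_from are made up for illustration, and the third URL is a hypothetical older link added only to show a rejection.

```python
from datetime import date

def date_markers(day):
    # The two date spellings looked for in story URLs: 2011/12/01 and 20111201.
    return day.strftime("%Y/%m/%d"), day.strftime("%Y%m%d")

def is_from(url, day):
    # True if the URL embeds the given day in either spelling.
    return any(marker in url for marker in date_markers(day))

day = date(2011, 12, 1)
urls = [
    "http://www.cnn.com/2011/12/01/us/tennessee-crashes/index.html",
    "http://www.latimes.com/entertainment/news/la-et-dane-cook-20111201,0,2575521.story",
    "http://www.nytimes.com/2011/11/30/opinion/example.html",  # hypothetical older link
]
todays = [u for u in urls if is_from(u, day)]
```

Using strftime avoids hand-rolled zero padding, at the cost of still missing any site that does not put the date in its URLs at all.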


While building this, my first inclination was to perform the categorization (which may be viewed as simply computing the "closeness" of two news articles based on the keywords within them) using the text of the URLs themselves, but URLs do not carry enough information to do this effectively. It is perhaps best done by "deep filtering", i.e. fetching the articles behind the collected URLs, then putting two URLs into the same story category if their contents are relatively close. We do not implement the TF-IDF clustering mechanism in the code below, nor the details of an HTML interface. The former is implemented as an example elsewhere (in the desktop search prototype project), while the latter is a lot of work not directly relevant to the mechanics of the current project and is therefore left out for now (we might revisit this later).
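To make the notion of "closeness" concrete, here is a minimal Python 3 sketch of TF-IDF weighting followed by cosine similarity. It is not the code from the desktop search prototype: the function names, the smoothed IDF formula and the use of headlines in place of full article text are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters; a real system would also drop stop words.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    # docs: list of strings. Returns one {term: weight} dict per document.
    tokens = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for toks in tokens:
        df.update(set(toks))  # document frequency: count each term once per doc
    vectors = []
    for toks in tokens:
        tf = Counter(toks)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1, always positive.
        vectors.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    # Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "Bono charms lawmakers in push for AIDS funding",
    "Bono charms lawmakers in push for AIDS programs funding",
    "Powerful Santa Ana winds sweep across Southland",
]
vecs = tfidf_vectors(docs)
same_story = cosine(vecs[0], vecs[1])
different_story = cosine(vecs[0], vecs[2])
```

The two Bono headlines share almost all of their words and score high, while the third headline shares none with the first and scores 0. Two URLs would go under the same story heading when the score exceeds some threshold, and that threshold would have to be tuned on real article text.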


The code for the application and the generated news.html file are attached to this post. These will be updated over time to include more comments and bells and whistles...


One quibble with the attached code might be the relatively small number of news stories mined. That is partly due to the stringency of the filtering criteria used: a link is kept only if its URL embeds today's date and contains none of a list of excluded keywords (because this is a working prototype, we take some liberties with the implementation). Code follows (with sample output below)... enjoy!

# Written for Python 2: urllib.urlopen moved to urllib.request.urlopen in Python 3.
import urllib
from datetime import date


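def locStr(x,fstr):
 # Return the start index of every non-overlapping occurrence of fstr in x.
 r=[]
 n=x.count(fstr)
 st=0
 while len(r)<n:
  v=x[st:].index(fstr)  # offset of the next occurrence, relative to st
  r+=[st+v]
  st+=v+len(fstr)
 return r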


def incStr(x,L,M):
 # True if x contains ANY / ALL of the substrings in L, depending on mode M.
 hits=[x.find(i)>-1 for i in L]
 if M=="ANY": return any(hits)
 if M=="ALL": return all(hits)
 raise ValueError("M must be 'ANY' or 'ALL'")


def str0(s):
 # Zero-pad a day or month number to two digits, e.g. 3 -> '03'.
 if s<10: return '0'+str(s)
 return str(s)


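# news sources; each is fetched from http://www.<name>.com
nsrcs=["nytimes","washingtonpost","latimes","cnn"];
# a line of HTML must contain all of these to be scanned for links
inc=["a href=","http://"];
# a link containing any of these is discarded
exc=["()","ad.","advert","facebook","twitter","javascript","click","blog","video","photo","img"];
# category keywords, matched as substrings of the story URL
cats=["business","sports","nation","world","politics","local","opinion","obituaries","arts","dining","health","entertainment","lifestyle"];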


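# dctx maps headline -> URL; cx maps category -> list of headlines
dctx,cx={},{};
dt=date.today();
# today's date in the two spellings that appear in story URLs
dtstr1=str(dt.year)+"/"+str0(dt.month)+"/"+str0(dt.day);  # e.g. 2011/12/01
dtstr2=str(dt.year)+str0(dt.month)+str0(dt.day);          # e.g. 20111201
for j in cats: cx[j]=[];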


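def clnUrls2(x):
 # Scan the fetched HTML lines and keep the links that look like today's stories.
 # Side effects: fills dctx (headline -> URL) and cx (category -> headlines).
 r=[]
 for i in x:
  if incStr(i,inc,"ALL"):
   t1=locStr(i,"a href=")
   t2=locStr(i,"</a>")
   for j in zip(t1,t2):
    # text from just after the opening quote of href up to the closing </a>
    s=i[j[0]+len("a href=")+1:j[1]]
    if len(s)>0 and incStr(s,["http://www."],"ALL") and incStr(s,nsrcs,"ANY") and not incStr(s,exc,"ANY") and (s.find(dtstr1)>-1 or s.find(dtstr2)>-1):
     s1,s2=s.find("\""),s.find(">")
     if s1==-1 or s2==-1: continue  # malformed link: skip it rather than fail
     str1,str2=s[:s1],s[s2+1:]  # str1 is the URL, str2 the link text (headline)
     dctx[str2]=str1
     r+=[s]

 for i in dctx:  # collect stories by category
  for j in cats:
   if dctx[i].find(j)>-1: cx[j]+=[i]
 return r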




fc,us=[],[];


for i in nsrcs: 
 nlnk="http://www."+i+".com";
 f=urllib.urlopen(nlnk);
 fc+=f.readlines();
 f.close();


us=clnUrls2(fc);
# each entry of us has the form: hyperlink" ...>headline
# clnUrls2 takes the text before the first quote as the hyperlink and
# the text after the first ">" as the headline, saves dctx[headline]=hyperlink,
# and lists the headline under cx[category] for each category named in the hyperlink


g=open("urls.txt","w");
g.write("Collected Story URLs:\n")
for i in us: g.write(i+"\n");
g.write("\n\n");


g.write("Stories with URLs by Category\n");
for i in cx: 
 g.write(i+"\n");
 for j in cx[i]: 
  g.write(j+"\n");
  g.write("=> "+dctx[j]+"\n");
 g.write("\n");
g.close();


h=open("news.html","w");
h.write("Today's News:<br><br>\n\n");
for i in cx: 
 h.write(i+"<br><br>\n");
 for j in cx[i]: 
  h.write("<a href='"+dctx[j]+"'>"+j+"</a><br>\n");
  h.write("\n");
 h.write("\n<br>");
h.close();
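
The link extraction above pairs "a href=" and "</a>" offsets by position within each line, which can misalign when a line contains an anchor without an href or when a link spans several lines. As one possible alternative, here is a minimal Python 3 sketch built on the standard library's html.parser module. The class name LinkCollector is made up for illustration, and the test page is assembled from two links in the sample output plus an invented bare anchor.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (url, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (url, text) pairs
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = ('<li><a href="http://www.cnn.com/2011/12/01/us/tennessee-crashes/index.html">'
        '1 dead in Tenn. crashes involving 176 cars</a></li>'
        '<a name="top"></a>'  # invented anchor with no href; it should be ignored
        '<a href="http://www.nytimes.com/slideshow/2011/12/01/us/20111202-canine-ss.html">'
        '<span class="icon slideshow">Slide Show</span>: The Dogs of War</a>')
parser = LinkCollector()
parser.feed(page)
parser.close()
links = parser.links
```

Because only text nodes are collected, markup nested inside a link (such as the span in the slideshow headline) is dropped from the headline. The date, source and exclusion filters would then be applied to each (url, text) pair as before.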

----------Sample urls.txt file contents:-----------------------------------------------------------------
Collected Story URLs:
http://www.nytimes.com/slideshow/2011/12/01/us/20111202-canine-ss.html"><span class="icon slideshow">Slide Show</span>: The Dogs of War
http://www.nytimes.com/2011/12/01/opinion/kristof-a-banker-speaks-with-regret.html?hp">Kristof: A Banker Speaks
http://www.nytimes.com/2011/12/01/opinion/gail-collins-mitt-romney-pardon.html?hp">Collins: Romney Pardon
http://www.nytimes.com/2011/12/01/opinion/high-stakes-little-time.html">Editorial: Extend Benefits
http://www.nytimes.com/2011/12/01/opinion/to-understand-china-look-behind-its-laws.html">Op-Ed: China’s Laws
http://www.nytimes.com/2011/12/01/garden/the-holiday-gimme-guide.html">The Holiday Gimme Guide
http://www.nytimes.com/2011/12/01/us/florida-am-university-students-death-turns-spotlight-on-hazing.html">Student&rsquo;s Death Turns Spotlight on Hazing
http://www.nytimes.com/2011/12/01/arts/music/a-review-of-the-metropolitan-operas-faust.html?ref=arts">This Faust Builds Atom Bombs (He Still Sings)
http://www.nytimes.com/2011/12/01/opinion/a-decade-of-progress-on-aids.html">Bono: A Decade of Progress on AIDS
http://www.nytimes.com/2011/12/01/fashion/at-art-basel-miami-beach-insiders-meet-newcomers-scene-city.html">Insiders Meet Art Basel Miami Beach Newcomers
http://www.nytimes.com/2011/12/01/arts/design/san-francisco-museum-of-modern-art-expansion-aims-for-friendly.html?ref=arts">An Imposing Museum Turns Warm and Fuzzy
http://www.nytimes.com/2011/12/01/greathomesanddestinations/in-tasmania-a-place-to-watch-nature-tv.html">In Tasmania, a Place to Watch &lsquo;Nature TV&rsquo;
http://www.nytimes.com/2011/12/01/garden/sprucing-up-your-front-door-for-the-season.html">The Pragmatist: Spruced Up for the Season
http://www.nytimes.com/2011/12/01/fashion/jeremy-scott-fashions-last-rebel.html">Jeremy Scott, Fashion&rsquo;s Last Rebel
http://www.washingtonpost.com/business/economy/boeing-union-reach-tentative-deal-to-end-labor-dispute/2011/12/01/gIQAc7HUHO_story.html">Boeing
http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html">National Christmas Tree lights up
http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop">Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop">Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story" target="_top" title="Cain says wife didn't know about payments to White">Cain says wife didn't know about payments to White
http://www.latimes.com/entertainment/news/la-et-dane-cook-20111201,0,2575521.story">Dane Cook heads off road
http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story" target="_top" title="Bono charms lawmakers in push for AIDS funding">Bono charms lawmakers in push for AIDS funding
http://www.latimes.com/news/politics/la-pn-perry-leno-ad-20111201,0,1625791.story" target="_top" title="Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?">Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story" target="_top" title="Herman Cain says wife didn't know about payments to Ginger White">Herman Cain says wife didn't know about payments to Ginger White
http://www.latimes.com/news/politics/la-pn-campaign-finance-20111201,0,1953462.story" target="_top" title="House votes to end public funding of presidential campaigns">House votes to end public funding of presidential campaigns
http://www.latimes.com/news/politics/la-pn-gingrich-campaign-20111201,0,6923880.story" target="_top" title="Newt Gingrich comeback surprises even the candidate himself">Newt Gingrich comeback surprises even the candidate himself
http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss" target="_top">Britain shuts its Tehran embassy, expels Iran's diplomats
http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss" target="_top">Landmark Clinton visit to Myanmar includes a weapons concern
http://www.latimes.com/health/boostershots/la-heb-artificial-pancreas-fda-20111201,0,810564.story?track=rss" target="_top">Diabetes: FDA provides guidance on artificial pancreas
http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss" target="_top">On World AIDS Day, groups cheer progress and call for more focus
http://www.latimes.com/entertainment/news/books/la-et-book-night-eternal-box-20111201,0,1165802.story?track=rss" target="_top">'The Night Eternal' info
http://www.latimes.com/news/local/la-me-winds-20111201,0,3189154.story?track=rss" target="_top">Powerful Santa Ana winds sweep across Southland
http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss" target="_top">FAA promises changes to prevent tarmac delays
http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss" target="_top">NLRB, split along party lines, may be put out of work
http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss" target="_top">Prosecutors seek stiff sentence for Blagojevich
http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story?track=rss" target="_top">Bono charms lawmakers in push for AIDS programs funding
http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story?track=rss" target="_top">Herman Cain says wife didn't know about payments to Ginger White
http://www.latimes.com/news/obituaries/la-me-passings-20111201,0,2646776.story?track=rss" target="_top">PASSINGS: Chester McGlockton, Ray Elder, Ante Markovic
http://www.latimes.com/news/obituaries/la-me-judy-lewis-20111201,0,213916.story?track=rss" target="_top">Judy Lewis dies at 76; daughter of stars Loretta Young and Clark Gable
http://www.cnn.com/2011/12/01/world/meast/egypt-muslim-brotherhood/index.html?hpt=hp_t2">Is Muslim Brotherhood's time here? 
http://www.cnn.com/2011/12/01/election/2012/cain-accusation-affair/index.html">Cain: Wife didn't know about latest accuser
http://www.cnn.com/2011/12/01/us/tennessee-crashes/index.html">1 dead in Tenn. crashes involving 176 cars
http://www.cnn.com/2011/12/01/travel/american-airlines-meal-lawsuit/index.html">Family: In-flight meal killed flier
http://www.cnn.com/2011/12/01/us/florida-suspected-hazing/index.html">911 tape reveals efforts to save drum major


Stories with URLs by Category
arts
An Imposing Museum Turns Warm and Fuzzy
=> http://www.nytimes.com/2011/12/01/arts/design/san-francisco-museum-of-modern-art-expansion-aims-for-friendly.html?ref=arts
This Faust Builds Atom Bombs (He Still Sings)
=> http://www.nytimes.com/2011/12/01/arts/music/a-review-of-the-metropolitan-operas-faust.html?ref=arts

lifestyle
Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
=> http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop

business
Boeing
=> http://www.washingtonpost.com/business/economy/boeing-union-reach-tentative-deal-to-end-labor-dispute/2011/12/01/gIQAc7HUHO_story.html

entertainment
Dane Cook heads off road
=> http://www.latimes.com/entertainment/news/la-et-dane-cook-20111201,0,2575521.story
'The Night Eternal' info
=> http://www.latimes.com/entertainment/news/books/la-et-book-night-eternal-box-20111201,0,1165802.story?track=rss

opinion
Kristof: A Banker Speaks
=> http://www.nytimes.com/2011/12/01/opinion/kristof-a-banker-speaks-with-regret.html?hp
Collins: Romney Pardon
=> http://www.nytimes.com/2011/12/01/opinion/gail-collins-mitt-romney-pardon.html?hp
Bono: A Decade of Progress on AIDS
=> http://www.nytimes.com/2011/12/01/opinion/a-decade-of-progress-on-aids.html
Editorial: Extend Benefits
=> http://www.nytimes.com/2011/12/01/opinion/high-stakes-little-time.html
Op-Ed: China’s Laws
=> http://www.nytimes.com/2011/12/01/opinion/to-understand-china-look-behind-its-laws.html

nation
Prosecutors seek stiff sentence for Blagojevich
=> http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss
FAA promises changes to prevent tarmac delays
=> http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss
NLRB, split along party lines, may be put out of work
=> http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss
Britain shuts its Tehran embassy, expels Iran's diplomats
=> http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss
In Tasmania, a Place to Watch &lsquo;Nature TV&rsquo;
=> http://www.nytimes.com/2011/12/01/greathomesanddestinations/in-tasmania-a-place-to-watch-nature-tv.html
National Christmas Tree lights up
=> http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html
Landmark Clinton visit to Myanmar includes a weapons concern
=> http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss

dining

obituaries
Judy Lewis dies at 76; daughter of stars Loretta Young and Clark Gable
=> http://www.latimes.com/news/obituaries/la-me-judy-lewis-20111201,0,213916.story?track=rss
Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
=> http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop
PASSINGS: Chester McGlockton, Ray Elder, Ante Markovic
=> http://www.latimes.com/news/obituaries/la-me-passings-20111201,0,2646776.story?track=rss

health
Arsenic fears aside, apple juice can pose a health threat _ from calories, nutriti
=> http://www.washingtonpost.com/lifestyle/wellness/arsenic-fears-aside-apple-juice-can-pose-a-health-threat-_-from-calories-nutritionists-say/2011/12/01/gIQAelLpHO_story.html?tid=pm_pop
On World AIDS Day, groups cheer progress and call for more focus
=> http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss
Diabetes: FDA provides guidance on artificial pancreas
=> http://www.latimes.com/health/boostershots/la-heb-artificial-pancreas-fda-20111201,0,810564.story?track=rss

world
Prosecutors seek stiff sentence for Blagojevich
=> http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss
On World AIDS Day, groups cheer progress and call for more focus
=> http://www.latimes.com/health/boostershots/la-heb-world-aids-day-roundup-20111201,0,7630943.story?track=rss
FAA promises changes to prevent tarmac delays
=> http://www.latimes.com/news/nationworld/nation/la-na-airports-20111201,0,4817588.story?track=rss
NLRB, split along party lines, may be put out of work
=> http://www.latimes.com/news/nationworld/nation/la-na-nlrb-board-20111201,0,4182874.story?track=rss
Britain shuts its Tehran embassy, expels Iran's diplomats
=> http://www.latimes.com/news/nationworld/world/la-fg-iran-britain-embassy-20111201,0,2833016.story?track=rss
Landmark Clinton visit to Myanmar includes a weapons concern
=> http://www.latimes.com/news/nationworld/world/la-fg-myanmar-clinton-20111201,0,7597719.story?track=rss
Is Muslim Brotherhood's time here? 
=> http://www.cnn.com/2011/12/01/world/meast/egypt-muslim-brotherhood/index.html?hpt=hp_t2

politics
Rick Perry ad: Self-mockery, Iowans, and what was the 3rd thing?
=> http://www.latimes.com/news/politics/la-pn-perry-leno-ad-20111201,0,1625791.story
Newt Gingrich comeback surprises even the candidate himself
=> http://www.latimes.com/news/politics/la-pn-gingrich-campaign-20111201,0,6923880.story
Cain says wife didn't know about payments to White
=> http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story
Bono charms lawmakers in push for AIDS programs funding
=> http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story?track=rss
Bono charms lawmakers in push for AIDS funding
=> http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story
Herman Cain says wife didn't know about payments to Ginger White
=> http://www.latimes.com/news/politics/la-pn-cain-interview-20111201,0,5753490.story?track=rss
House votes to end public funding of presidential campaigns
=> http://www.latimes.com/news/politics/la-pn-campaign-finance-20111201,0,1953462.story

sports

local
Powerful Santa Ana winds sweep across Southland
=> http://www.latimes.com/news/local/la-me-winds-20111201,0,3189154.story?track=rss
Judy Lewis, daughter of Loretta Young and Clark Gable, dies at 76
=> http://www.washingtonpost.com/local/obituaries/judy-lewis-daughter-of-loretta-young-and-clark-gable-dies-at-76/2011/12/01/gIQAe85sHO_story.html?tid=pm_pop
National Christmas Tree lights up
=> http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html


----------End of Sample urls.txt file contents-----------------------------------------------------------
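
Two artifacts of the substring matching are visible in the sample above. "National Christmas Tree lights up" lands under nation because its URL contains "national-christmas-tree", every latimes "nationworld" story is filed under both nation and world, and the same latimes story can appear twice when one copy of its link carries "?track=rss". A possible refinement, sketched here in Python 3 with hypothetical helper names (canonical, categories, group), is to drop the query string before comparing URLs and to match a category only against whole path segments.

```python
from urllib.parse import urlsplit

CATS = ["business", "sports", "nation", "world", "politics", "local", "opinion",
        "obituaries", "arts", "dining", "health", "entertainment", "lifestyle"]

def canonical(url):
    # Drop the query string and fragment so tracking variants compare equal.
    parts = urlsplit(url)
    return parts.scheme + "://" + parts.netloc + parts.path

def categories(url):
    # A category matches only if it equals a whole path segment.
    segments = set(urlsplit(url).path.split("/"))
    return [c for c in CATS if c in segments]

def group(stories):
    # stories: iterable of (headline, url). Returns {category: [(headline, url), ...]}.
    seen, out = set(), {c: [] for c in CATS}
    for headline, url in stories:
        key = canonical(url)
        if key in seen:
            continue  # same story reached through a different link
        seen.add(key)
        for c in categories(key):
            out[c].append((headline, key))
    return out

stories = [
    ("Bono charms lawmakers in push for AIDS funding",
     "http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story"),
    ("Bono charms lawmakers in push for AIDS programs funding",
     "http://www.latimes.com/news/politics/la-pn-bono-congress-20111201,0,7706900.story?track=rss"),
    ("Prosecutors seek stiff sentence for Blagojevich",
     "http://www.latimes.com/news/nationworld/nation/la-na-blagojevich-20111201,0,7679329.story?track=rss"),
    ("National Christmas Tree lights up",
     "http://www.washingtonpost.com/local/national-christmas-tree-lights-up/2011/12/01/gIQAJSI5HO_gallery.html"),
]
grouped = group(stories)
```

The trade-off is recall: a story whose URL mentions a category only inside its slug (for example "health-threat") is no longer picked up for that category, so this is a stricter filter rather than a strictly better one.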

Image of the generated sample news summary HTML page: