I have been working with a company that recently got added to Google News, which is great.
I assumed that Google would do a fantastic job at grokking the news from the site.
I was unfortunately wrong.
In time, we started to see our content appear in Google News, but the headlines were all screwed up. Elements in a rightbar on the site would show up as a headline for an article. It all seemed very random too. Very strange indeed.
We contacted the Google News team, assuming that content providers could use some kind of microformat to help the Google document parser.
We would be very willing to say:
<h1 class="googlenews-headline header">Headline</h1>
No such thing existed. They thought that one of the problems was that the headline had a link within it (as it acts as a permalink to itself). They assume that a headline cannot also be a link, so they ignore it.
As we went back and forth on this, I thought: wait a minute. Why are they bothering to screen-scrape our HTML when we have RSS feeds for everything?
Surely it would be simpler to grok our feed than scrape our HTML? *sigh*
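To show how little work the feed route takes, here is a minimal sketch in Python that pulls headlines and permalinks out of an RSS 2.0 document using only the standard library. The sample feed, its URLs, and the function name `headlines_from_rss` are made up for illustration; this is not how Google News works, just what grokking a feed could look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be fetched from the site's RSS URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News Site</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/articles/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/articles/2</link>
    </item>
  </channel>
</rss>"""


def headlines_from_rss(xml_text):
    """Return a list of (headline, permalink) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    results = []
    # Each <item> carries its own <title> and <link>, so there is no
    # guessing about which element on the page is the headline.
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        results.append((title, link))
    return results


if __name__ == "__main__":
    for title, link in headlines_from_rss(SAMPLE_FEED):
        print(title, "->", link)
```

The headline and its permalink come out as separate, explicitly labelled fields, so the "headline is also a link" problem never comes up.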