Feb 18

Google News: Complicated algorithms, but not the simple ones

Google, Tech

I have been working with a company that recently got added to Google News, which is great.

I assumed that Google would do a fantastic job of grokking the news from the site.

I was unfortunately wrong.

In time, we started to see our content appear in Google News, but the headlines were all screwed up. Elements from the right sidebar on the site would show up as the headline for an article. It all seemed very random too. Very strange indeed.

We contacted the Google News team, assuming that content providers could use some kind of microformat to help the Google document parser.

We would be very willing to say:

<h1 class="googlenews-headline header">Headline</h1>

<h1 rel="headline">Headline</h1>

No such thing existed. They thought that one of the problems was that the headline had a link within it (as it acts as a permalink to itself). They assume that a headline cannot also be a link, so they ignore it.
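To show how little work such a hint would take to honor, here is a minimal sketch using only the Python standard library. It assumes the made-up `googlenews-headline` class from above (Google News defines no such convention), and `HeadlineParser` and `extract_headlines` are names invented for this illustration. It happily accepts a headline that wraps its own permalink.

```python
from html.parser import HTMLParser

# Hypothetical marker class from this post; Google News defines no such hint.
MARKER = "googlenews-headline"


class HeadlineParser(HTMLParser):
    """Collect the text of every <h1> that carries the marker class."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tags nested inside a marked headline (such as <a>) are simply ignored.
        if self._in_headline or tag != "h1":
            return
        classes = (dict(attrs).get("class") or "").split()
        if MARKER in classes:
            self._in_headline = True
            self._parts = []

    def handle_endtag(self, tag):
        if self._in_headline and tag == "h1":
            self._in_headline = False
            self.headlines.append("".join(self._parts).strip())

    def handle_data(self, data):
        # Text inside a nested permalink still counts as headline text.
        if self._in_headline:
            self._parts.append(data)


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

Fed a page whose sidebar has its own unmarked `<h1>`, this returns only the marked headline, link or no link.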

As we went back and forth on this, I thought: wait a minute. Why are they bothering to screen-scrape our HTML when we have RSS feeds for everything?

Surely it would be simpler to grok our feed than scrape our HTML? *sigh*
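For comparison, here is roughly what reading headlines from a feed looks like. This is a sketch under assumptions: the feed below is a made-up RSS 2.0 sample rather than our real one, and `headlines_from_rss` is a name invented for the example. It says nothing about how Google News actually works.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample for illustration; not the real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>http://example.com/</link>
    <description>Example feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def headlines_from_rss(feed_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (
            item.findtext("title", default="").strip(),
            item.findtext("link", default="").strip(),
        )
        for item in root.iter("item")
    ]


for title, link in headlines_from_rss(SAMPLE_FEED):
    print(title, link)
```

The headline and its permalink are already separate, labeled fields, so there is nothing to guess at and no sidebar to trip over.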

2 Responses to “Google News: Complicated algorithms, but not the simple ones”

  1. Anjan Bacchu Says:

    hi dion,

    You might have heard in the news that BMW Germany and another company were blacklisted from Google search because they showed one version of their pages to the search engines and a different one to ACTUAL users using web browsers.

    News corporations could do similar manipulations with Google. I assume that what Google wants is the “REAL THING”.

    BR,
    ~A

  2. Dion Says:

    The RSS view is a real one. One that users actually use to interface with the site. This is different to having a special Google view, although I know that you can cheat the system.

    Dion
