Daily Dilbert

Date: 08/02/2005
Author: Wayne Eggert

Updated: 09/21/2009 - Made some coding changes since the Dilbert site changed the way they're writing out the comic strip. Whenever code changes on a site, it will affect any "scrapers" you have parsing the data from the site. There are better ways to write scrapers to mitigate impact, but I originally wrote this code in 2005 and it's not very tolerant when code output changes on the Dilbert site.

Introduction
One of the great perks of PHP and databases is the ability to always have dynamic content on your site, no matter how many years it takes you to actually add content of your own =) I've always wanted to parse eBay search results and grab content out, but my attempts always fell short -- partially because I had no idea how to parse through HTML code correctly (there are hardly any tutorials out there on the subject) and mostly because eBay intentionally changes their pages so that they can stop people from "window scraping". So I decided to start small and parse a daily comic strip off of a web page, and what better than Dilbert?

Play Nice..
One thing you have to remember whenever you're window scraping is to be considerate. It's best if you can store the content you're pulling on your own site, whether it's in a database (if it's actual text) or in your file system (if it's images). Also be considerate about how often your program actually goes and scrapes the website -- you don't want every visitor on your site to generate a full page download of CNN.com just to scrape out the latest article. So keep these things in mind as you develop these types of applications and everyone will be happy =)
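One way to keep the scraping polite is to check how old your stored copy is before going back to the source site. Here's a minimal sketch of that idea. The helper name needsRefresh() and the one-day default are my own assumptions for illustration, not part of the original script:

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// the cached file is missing or older than $maxAgeSeconds.
function needsRefresh(string $cacheFile, int $maxAgeSeconds = 86400, ?int $now = null): bool
{
    $now = $now ?? time();

    // PHP caches stat() results, so clear them before asking about the file
    clearstatcache(true, $cacheFile);

    if (!file_exists($cacheFile)) {
        return true; // nothing stored yet, so we have to fetch
    }

    // Stale once the file's age exceeds the allowed maximum
    return ($now - filemtime($cacheFile)) > $maxAgeSeconds;
}
```

With a check like this wrapped around the scraping code, a thousand visitors in one day would still cause only one request to the remote site.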

Gameplan
Be realistic. The easiest sites to scrape are the ones that have data coming in the same way every day because they're fed out of a database & have predictable code. It gets A LOT trickier when you can't predict the code, as I mentioned with eBay's pages.. and near impossible if you want to cleanly scrape pages that someone is hand coding & changing on a routine basis. Since we're looking to just scrape an image from dilbert.com and the image is referenced pretty much the same way every day (it's just the image name that's changing), it makes it pretty easy to grab what we need -- today's image.

Putting The Wheels In Motion
First and foremost, go to http://www.dilbert.com if you haven't already, save the source code and open it up in your favorite editor. Search for the string "/dyn/str_strip/" and you should be immediately brought to the line of code containing the image. That's the image that we want to pull out and we'll be writing a "web bot" to do exactly what you just did -- pull the source code, search for the string, grab the image name and download the newest Dilbert comic strip.

Here's the code for grabbing the name of the image:
<?php
// Define the URL we'll be pulling from
$URL = "http://www.dilbert.com/";

// Grab the full contents of the web page, 8192 bytes at a time until done
$file = fopen($URL, "r");
$r = "";
do {
    $data = fread($file, 8192);
    if (strlen($data) == 0) {
        break;
    }
    $r .= $data;
} while (true);
fclose($file);

// Define a Start & End location for the regular expression
$Start = 'src="/dyn/str_strip/';
$End = '" border="0" />';

// Grab the text between the Start & End location (ie. The picture name).
// preg_match() stands in for eregi(), which was removed in PHP 7; the "i" flag keeps the match case-insensitive
$stuff = preg_match('#' . preg_quote($Start, '#') . '(.*?)' . preg_quote($End, '#') . '#i', $r, $content);

// Replace the stuff from the image string we aren't interested in (NOTE: a REGEX would be better)
$todaysDilbert = str_replace("src=\"", "", $content[0]);
$todaysDilbert = str_replace("\" border=\"0\" />", "", $todaysDilbert);

echo $todaysDilbert; // this will echo the image path from the Dilbert site
?>

I think the commenting is pretty much self-explanatory, but just in case.. here's a quick overview of what's happening:

The PHP script fetches the HTML served by www.dilbert.com and stores it in the variable $r
Start & End parameters are defined for the regular expression (RE) to use.
The text between the Start & End parameters is matched by preg_match() (the eregi() call this was originally written with no longer exists in PHP 7 and later) and stored in the array $content
$content[1] contains the image name, while $content[0] holds the whole match, which the two str_replace() calls trim down to the image path
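The Start & End markers only match while the image tag ends in exactly the same attributes, so any small markup change breaks the scraper. As a slightly more tolerant sketch, the pattern below captures whatever sits inside the src attribute and ignores the rest of the tag. The function name findStripPath() and the sample HTML are made up for illustration, and it still assumes the image path contains /dyn/str_strip/:

```php
<?php
// Hypothetical helper (not from the original script): returns the path of the
// first image whose src contains $marker, or null if none is found.
function findStripPath(string $html, string $marker = '/dyn/str_strip/'): ?string
{
    // Capture everything inside src="..." (or src='...') that contains the
    // marker, without caring which attributes come before or after it.
    $pattern = '#src=(["\'])([^"\']*' . preg_quote($marker, '#') . '[^"\']+)\1#i';

    if (preg_match($pattern, $html, $match)) {
        return $match[2]; // group 2 is the path, group 1 is the quote character
    }
    return null;
}

// Made-up sample markup, only to show the call
$sample = '<img alt="strip" src="/dyn/str_strip/000/example.gif" width="640" />';
echo findStripPath($sample); // prints /dyn/str_strip/000/example.gif
```

Returning null when nothing matches also gives the calling code a clean way to notice that the site changed, instead of echoing an empty image path.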

At this point, if you wanted to, you could add the following line of code at the end of the PHP script to view the image:
<?php echo "<img src='http://www.dilbert.com".$todaysDilbert."' />" ?>

But, as I said before, this wouldn't be very nice of you because you would be using their bandwidth & we want to be considerate, right? So what to do.. well, we have the full URL to the daily image now, so let's store it on our own web server.
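Here's a rough sketch of what that storing step could look like. The saveStrip() helper and the comics folder are invented for this example. It reads the remote image with file_get_contents(), which only works for URLs when allow_url_fopen is enabled in your PHP configuration:

```php
<?php
// Hypothetical helper (not from the original article): copies the image at
// $source into $dir and returns the local path, or null on failure.
function saveStrip(string $source, string $dir): ?string
{
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        return null; // could not create the storage folder
    }

    $data = @file_get_contents($source);
    if ($data === false || $data === '') {
        return null; // download failed, keep whatever we already have
    }

    // Name the local copy after the remote file, e.g. "example.gif"
    $path = parse_url($source, PHP_URL_PATH);
    if (!is_string($path) || basename($path) === '') {
        return null; // no usable file name in the source
    }
    $local = rtrim($dir, '/') . '/' . basename($path);

    return file_put_contents($local, $data) !== false ? $local : null;
}

// Assumed usage, with $todaysDilbert holding the image path scraped earlier:
// $saved = saveStrip("http://www.dilbert.com" . $todaysDilbert, __DIR__ . "/comics");
```

Once the file is on your own server, the img tag can point at the local copy, so visitors never touch the Dilbert site's bandwidth.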


Comments:

New Regex.

Posted 06/01/08 10:14AM by Anonymous Techdoser

The Regular expression had to be changed to reflect the new page with flash.

// Define a Start & End location for the regular expression
$Start = '';

Good one

Posted 10/24/07 7:11PM by Anonymous Techdoser

Good work, I was actually looking for something like this. in fact, didn't wanna spend time on that. thank you so much.

re: "An excellent article" and "Oddly written" com

Posted 11/14/05 4:43PM by Anonymous Techdoser

It looks correct. Just watch the curly braces.
if (file_exists($filename))
{
    $filedate = date("Y-m-d", filemtime($filename));
    $todaysdate = date("Y-m-d", time());
    if ($filedate != $todaysdate)
    {
        $update = 1;
    }
}
else
{
    $update = 1;
}

See code at http://www.bestof417.com/calvin.php

Posted 08/14/05 11:01PM by wolfdogg

View source of my code at: http://www.bestof417.com/calvin.php
Thx!
Jeff

Trying to implement this on another cartoon site f

Posted 08/14/05 11:00PM by wolfdogg

I'm getting a blank image to pull when trying to do this on Calvin and Hobbes site. Can anyone figure out what might be wrong? The concept seems simple enough, but it won't pull.
My code:

Oddly written

Posted 03/30/05 5:53PM by AceBHound

I think that part of the code is just written oddly. It's actually saying "If the image exists, check the date & if the time of the file is old, grab the image again. If the image does not exist, grab the image."

An excellent article

Posted 03/29/05 10:56PM by Anonymous Techdoser

Possible error in the code on the last page - at the top, shouldn't the "else{$update = 1;" be anything but 1, since that is the setting to update?