Yellow Pages Scraper Tutorial

Written By: Cxlos - Updated: Tuesday, July 8, 2008

Hello, and welcome to this Yellow Pages scraper tutorial. In this tutorial, I will show you how to create your very own Yellow Pages scraper for free! That's right, free! It's crazy that people charge $250.00 for such simple code. After this tutorial you will no longer have to pay for databases; you can just get the information yourself. So let's get started.

Prerequisites

  • The coding is very simple, but it may help to have a little scripting experience.
  • Some of the functions that I use only work with PHP 5, but don't let that stop you. Do a quick search on Google for the alternatives if you run into problems.

First, take a look at the structure of these two Yellow Pages URLs.
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=2&search_mode=all&search_terms=seo

Do you see the difference? In the second URL,

sort=content&page=2&

is added for pagination, but we only need to be concerned with the page number. If we increase or decrease this number, the browser will display the next or previous page. We need our scraper / spider to start at the first page of the search results, scrape it, and continue in that manner until the very last page. So, for example, let's do a search for seo in Texas. The URL is:

http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo
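
If you would rather let PHP spot the pagination parameters for you, here is a minimal sketch. The helper name queryParams is my own invention for this illustration; parse_url, parse_str and array_diff_key are standard PHP functions. It compares the query strings of the first-page and second-page URLs and prints whatever was added.

```php
<?php
// Hypothetical helper: return the query string of a URL as an associative array.
function queryParams($url) {
    $query = parse_url($url, PHP_URL_QUERY); // everything after the "?"
    $params = array();
    parse_str((string) $query, $params);     // "a=1&b=2" becomes array('a' => '1', 'b' => '2')
    return $params;
}

$first  = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo";
$second = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=2&search_mode=all&search_terms=seo";

// Keep only the parameters that appear in the second URL but not in the first.
$added = array_diff_key(queryParams($second), queryParams($first));
print_r($added);
```

The only keys left over are sort and page, which is the pagination piece described above.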

We need to know how many pages there are for this search. The easy way to do this is to look at the page number in the URL of the last page link, which is located at the bottom of the page.

For now, let's just remember that number (14). Later on (if you wanted to develop this for a client $$$) you could scrape that number automatically. But first we need to create a function that can dynamically create our URLs. For this example we will need 14 URLs, i.e.

http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=1&search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=2&search_mode=all&search_terms=seo
http://www.yellowpages.com/TX/Internet-Marketing-Advertising?sort=content&page=3&search_mode=all&search_terms=seo

. . . you get the idea.

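function createUrl($url, $lastnum) {
    $find = "?";
    // These two lines cut the URL back to its "?"; the loop below does not use the results.
    $trim = rtrim($url, 'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $number = 1;
    $counter = 0;
    $myArray = array(); // make sure we always return an array

    // Builds pages 1 through $lastnum - 1, so pass the page count plus one.
    while ($number < $lastnum) {
        // Turn "?" into "?page=N&" to select page N.
        $over = "?page=" . $number . "&";
        $replace = str_replace($find, $over, $url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}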

In the code above, we create a function called createUrl. createUrl takes the very first URL and the number of the last page plus one as its two arguments. Next we create a variable called $find and give it the character value ?. The next two lines of code cut the initial URL down to the character in $find, although the loop does not rely on the result. Then comes the while loop, which recreates our URLs by replacing ? with ?page=N& and puts them into an array. Moving on.
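
If you prefer not to rely on string replacement, here is one possible alternative sketch. The function name buildPageUrls is hypothetical and not part of the tutorial's code; parse_str and http_build_query are standard PHP functions. Note that, unlike createUrl, this version takes the real page count (no plus one) and appends page at the end of the query string, which I am assuming the site accepts in any position.

```php
<?php
// Hypothetical alternative to createUrl: set the "page" query parameter directly.
function buildPageUrls($url, $pageCount) {
    // Split the URL into the part before "?" and the query string after it.
    $parts  = explode('?', $url, 2);
    $base   = $parts[0];
    $params = array();
    if (isset($parts[1])) {
        parse_str($parts[1], $params);
    }

    $urls = array();
    for ($page = 1; $page <= $pageCount; $page++) {
        $params['page'] = $page; // add or overwrite the page number
        $urls[] = $base . '?' . http_build_query($params, '', '&');
    }
    return $urls;
}

$urls = buildPageUrls("http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo", 3);
print_r($urls);
```

Because the query string is parsed into an array first, this also works when the starting URL already contains a page parameter.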

This is all nice and cool, but what are we going to do now with all of these URLs in our array? Simple: we are going to use PHP's built-in file_get_contents function to open them all up and prepare them for scraping.

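function createList($url) {
    $counter = 0;
    $myArray = array();
    foreach ($url as $value) {
        // Download the page; file_get_contents returns false on failure.
        $html = file_get_contents($value);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}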

This function takes the URL array we created earlier, opens the URLs one at a time, and puts their content (i.e. all of the HTML on each page) into a new array.
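
Here is a slightly more careful sketch of the same download step. Everything named here (fetchPages, fakeFetch, the $delaySeconds parameter) is made up for illustration. It skips pages that fail to download and can pause between requests so you are not hammering the server. The fetch function is passed in as a parameter, so the demo below runs with a fake fetcher and never touches the network.

```php
<?php
// Hypothetical variant of createList: skips failed downloads and pauses between requests.
function fetchPages(array $urls, $fetch = 'file_get_contents', $delaySeconds = 1) {
    $pages = array();
    $first = true;
    foreach ($urls as $url) {
        // Wait before every request except the first one.
        if (!$first && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
        $first = false;

        $html = call_user_func($fetch, $url);
        if ($html === false) {
            continue; // download failed, move on to the next URL
        }
        $pages[] = $html;
    }
    return $pages;
}

// Fake fetcher for the demo: pretends that page 2 cannot be downloaded.
function fakeFetch($url) {
    if (strpos($url, 'page=2') !== false) {
        return false;
    }
    return "<html>" . $url . "</html>";
}

$demoUrls = array(
    'http://example.com/?page=1',
    'http://example.com/?page=2',
    'http://example.com/?page=3',
);
$pages = fetchPages($demoUrls, 'fakeFetch', 0);
echo count($pages) . " pages fetched\n";
```

With real URLs you would call fetchPages($urls) and let it use file_get_contents with the default one-second pause.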

Once we have all of this information, we need to go through it and pick out the pieces we want, like the name of the organization, address, state, zip code, phone number, etc. In the code below we are going to use preg_match_all and preg_match to grab those specific pieces of data. It's pretty self-explanatory. It will also output the data to the screen for us.

foreach ($list as $value) {
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block on the page. [^`]*? matches any characters
    // (newlines included) up to the first closing tag.
    preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
        preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);

        // A listing may lack one of the tags, so fall back to an empty string.
        $title = isset($temp[1]) ? trim(strip_tags($temp[1])) : '';
        $description = isset($desc[1]) ? trim(strip_tags($desc[1])) : '';
        $phone = isset($num[1]) ? trim(strip_tags($num[1])) : '';

        print "<b>$title</b><br>$description<br>$phone<br><br>";
    }
}
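
To see the two-step matching in isolation, here is a small sketch that runs the same idea against a made-up HTML snippet and returns the results as an array instead of printing them. The helper names parseListings and firstMatch are hypothetical, and the sample markup is my own assumption modeled on the patterns above, not copied from the real site.

```php
<?php
// Hypothetical helper: return the first captured group of $pattern, cleaned up, or ''.
function firstMatch($pattern, $text) {
    if (preg_match($pattern, $text, $m)) {
        return trim(strip_tags($m[1]));
    }
    return '';
}

// Hypothetical helper: turn one page of HTML into an array of listings.
function parseListings($html) {
    $listings = array();
    // Step 1: preg_match_all finds every listing block (the s flag lets . match newlines).
    preg_match_all('/<div class="description">(.*?)<\/div>/s', $html, $blocks);
    foreach ($blocks[1] as $block) {
        // Step 2: preg_match pulls single fields out of each block.
        $listings[] = array(
            'title'       => firstMatch('/<h2>(.*?)<\/h2>/s', $block),
            'description' => firstMatch('/<p>(.*?)<\/p>/s', $block),
            'phone'       => firstMatch('/<ul>(.*?)<\/ul>/s', $block),
        );
    }
    return $listings;
}

// Made-up sample markup for the demo.
$sample = '<div class="description"><h2><a href="#">Acme SEO</a></h2>'
        . '<p>123 Main St, Austin, TX 78701</p><ul><li>(512) 555-0100</li></ul></div>'
        . '<div class="description"><h2>No Phone Co</h2><p>Dallas, TX</p></div>';

$rows = parseListings($sample);
print_r($rows);
```

Returning an array makes it easy to send the rows somewhere else later, such as a CSV file or a database table, instead of only echoing them to the screen.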

This is the final code. Just set $lastnum to the number of pages the search has, plus one.

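<?php
// Raise the memory limit, since every downloaded page is held in memory at once.
ini_set('memory_limit', '256M');

function createUrl($url, $lastnum) {
    $find = "?";
    // These two lines cut the URL back to its "?"; the loop below does not use the results.
    $trim = rtrim($url, 'a..z,A..Z,=,_,&');
    $remove_to = strpbrk($trim, '?');
    $number = 1;
    $counter = 0;
    $myArray = array(); // make sure we always return an array

    // Builds pages 1 through $lastnum - 1, so pass the page count plus one.
    while ($number < $lastnum) {
        // Turn "?" into "?page=N&" to select page N.
        $over = "?page=" . $number . "&";
        $replace = str_replace($find, $over, $url);
        $myArray[$counter] = $replace;
        $number++;
        $counter++;
    }
    return $myArray;
}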
   
   
$url = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo";
$lastnum = 1 + 1; // number of result pages plus one (here: scrape one page)
$url = createUrl($url, $lastnum);

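function createList($url) {
    $counter = 0;
    $myArray = array();
    foreach ($url as $value) {
        // Download the page; file_get_contents returns false on failure.
        $html = file_get_contents($value);
        if ($html === false) {
            continue; // skip pages that could not be fetched
        }
        $myArray[$counter] = $html;
        $counter++;
    }
    return $myArray;
}

$list = createList($url);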
  
   
foreach ($list as $value) {
    echo "<span style='width:8px; background:blue'> </span>";
    // Grab every listing block on the page. [^`]*? matches any characters
    // (newlines included) up to the first closing tag.
    preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
    foreach ($matches[0] as $match) {
        preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
        preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
        preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);

        // A listing may lack one of the tags, so fall back to an empty string.
        $title = isset($temp[1]) ? trim(strip_tags($temp[1])) : '';
        $description = isset($desc[1]) ? trim(strip_tags($desc[1])) : '';
        $phone = isset($num[1]) ? trim(strip_tags($num[1])) : '';

        print "<b>$title</b><br>$description<br>$phone<br><br>";
    }
}
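
Earlier I mentioned that you could scrape the last page number automatically instead of reading it off the page. Here is a rough sketch of one way to do that. The function name findLastPage is hypothetical, and I am assuming that the pagination links in the page HTML carry page=N in their href (possibly with & written as &amp;); check the real markup before relying on it.

```php
<?php
// Hypothetical helper: find the highest page number linked from a results page.
function findLastPage($html) {
    // Match "?page=N", "&page=N" or "&amp;page=N" anywhere in the HTML.
    if (!preg_match_all('/[?&](?:amp;)?page=(\d+)/', $html, $m)) {
        return 1; // no pagination links found, assume a single page
    }
    return max(array_map('intval', $m[1]));
}

// Made-up pagination markup for the demo.
$sample = '<a href="?sort=content&amp;page=2&amp;search_terms=seo">2</a> '
        . '<a href="?sort=content&amp;page=3&amp;search_terms=seo">3</a> '
        . '<a href="?sort=content&amp;page=14&amp;search_terms=seo">14</a>';

$lastPage = findLastPage($sample);
echo $lastPage . "\n";
```

Since createUrl stops one short of $lastnum, you would pass findLastPage($html) + 1 to it.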