Php Page Scraping (Part II).

by Computothought in Circuits > Websites

In the first instructable, we extracted plain text. Now we want to extract lists of data. You will need a web server that supports PHP, unless you have PHP installed locally. Here we will look at some more advanced scripts that I have collected and modified. One wonderful thing about page scraping is that you can build your own information or news page. That way you do not have to depend on a search engine or on other web pages that collect your favorites for you. Doing this also gives you a bit more anonymity when using the web.

Note: The last two scripts worked in 2011. There is a minor bug in the scripts for 2012. I will wait until the week is over to see if it resolves itself. If it does not, the code will have to be modified.

Weather 1.

Getting weather information from a particular site, without including any formatting.

<?php

$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.

for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.
 
   $columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
   $columnCount = $columns->length;
   for($n=0;$n<$columnCount;$n++) { //go through the columns.
  if($n == 2) {
   $img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
   $value = $img->item(0)->getAttribute('title');
  } else {
   $value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
  }
${'a' . $n} = $value; //store each column in a numbered variable: $a0, $a1, $a2, ...
   }
}

$patterns[0] = '/[^0-9]/';
$replacements[0] = '';
ksort($patterns);
ksort($replacements);
$a3 = preg_replace($patterns, $replacements, $a3);
$a5 = preg_replace($patterns, $replacements, $a5);
$a6 = preg_replace($patterns, $replacements, $a6);

echo $a0, '<br>', $a1, '<br>', $a2, '<br>', $a3, '<br>', $a4, '<br>', $a5, '<br>', $a6, '<br>', $a7, '<br>', $a8;

?>
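
The row-and-column walk above can be tried without the live site. The following is a minimal sketch, assuming a small made-up table held in a string (the markup in $html and the function name table_row_values() are illustrative, not the real forecast page). It collects the cells into an array instead of numbered variables and returns an empty array if the table or row is missing.

```php
<?php
// Hypothetical sample markup standing in for a scraped forecast table.
$html = '<table>
  <tr><th>Date</th><th>Time</th><th>Weather</th><th>Temp</th></tr>
  <tr><td>Fri 06 Apr</td><td>0100</td><td><img src="w.png" title="Cloudy"></td><td>7 C</td></tr>
</table>';

// Return the cell values of one row of one table as a plain array.
// $tableIndex and $rowIndex are zero-based, as with item() in the script above.
function table_row_values($html, $tableIndex, $rowIndex) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ hides warnings about sloppy markup
    $table = $doc->getElementsByTagName('table')->item($tableIndex);
    if ($table === null) {
        return array(); // no such table: the page layout may have changed
    }
    $row = $table->getElementsByTagName('tr')->item($rowIndex);
    if ($row === null) {
        return array(); // no such row
    }
    $values = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        $img = $cell->getElementsByTagName('img')->item(0);
        // An image cell carries its text in the title attribute.
        $values[] = ($img !== null) ? $img->getAttribute('title') : trim($cell->nodeValue);
    }
    return $values;
}

$cells = table_row_values($html, 0, 1);
// Keep only the digits of the temperature, as preg_replace does above.
$temp = preg_replace('/[^0-9]/', '', $cells[3]);
echo implode('<br>', $cells), '<br>', $temp;
```

With an array, the number of columns no longer has to be known in advance, and a missing table shows up as an empty result rather than an error.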

Weather 2.

Getting weather information from a particular site, this time including formatting.

<?php
echo '<style type="text/css">
  table {
   border-collapse: collapse;
  }
  table, th, td {
   border: 1px solid black;
   padding: 2px;
  }
 
</style>';
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.
echo '<table id="weather"><tr>
          <th rowspan="2">Date</th>
          <th rowspan="2">Time</th>
          <th rowspan="2">Weather</th>

          <th rowspan="2">Temp</th>
          <th colspan="3">Wind</th>
          <th rowspan="2">Visibility</th>
         </tr>
         <tr>
          <th>Dir</th>
          <th>Speed</th>

          <th>Gust</th>
         </tr>'; //mock up of the original table headers.
for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.
  echo '<tr>'; //start row.
   $columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
   $columnCount = $columns->length;
   for($n=0;$n<$columnCount;$n++) { //go through the columns.
  if($n == 2) {
   $img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
   $value = $img->item(0)->getAttribute('title');
  } else {
   $value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
  }
  echo '<td>' . $value . '</td>'; //push the column to the screen.
   }
   echo '</tr>'; //end the row.
}
echo '</table>'; //end the table.

?>
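
The script above echoes scraped text straight into the page. A more cautious variant, sketched here on the assumption that the cell values are already in an array (the function name row_to_html() and the sample values are made up), passes each value through htmlspecialchars() so that stray markup in the source page cannot break the table.

```php
<?php
// Build one <tr> from an array of cell values, escaping each value.
function row_to_html(array $values) {
    $html = '<tr>';
    foreach ($values as $value) {
        $html .= '<td>' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</td>';
    }
    return $html . '</tr>';
}

// Hypothetical values; the second one contains characters that need escaping.
echo '<table>', row_to_html(array('Fri 06 Apr', 'Wind < 10 mph & rising')), '</table>';
```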

Pro Football Scores Example.

This is the original script I found, but I wanted a more user-friendly version.

<?php
//set current game week
$Current_Week = 'p4'; // a preseason week, just to see if the script works. When the season starts, '1' will denote week one, etc.

//load source code, depending on the current week, of the website into a variable as a string
$url = "http://sports.yahoo.com/nfl/scoreboard?w=$Current_Week";
$string = file_get_contents($url);

//set search pattern (using regular expressions)
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<td align="right" class="ysptblclbg6 total">.*?<span class="yspscores">(.*?)&nbsp;|is';

//search the string for the pattern, and store the content found inside the set of parens in the array $matches
//$matches[1] is going to hold team names in the order they appear on the page, and $matches[2] the scores
preg_match_all($find, $string, $matches);

//initiate scores array, to group teams and scores together in games
$scores = array();

//count number of teams found, to be used in the loop below
$count = count($matches[1]);

//loop from 0 to $count, in steps of 2
//this is done in order to group 2 teams and 2 scores together in games, with each iteration of the loop
//trim() is used to trim away any whitespace surrounding the team names and scores
//strip_tags() is used to remove the HTML bold tag (<b>) from the winning scores
for ($i = 0; $i < $count; $i += 2) {
    $away_team = trim($matches[1][$i]);
    $away_score = trim($matches[2][$i]);
    $home_team = trim($matches[1][$i + 1]);
    $home_score = trim($matches[2][$i + 1]);
    $winner = (strpos($away_score, '<') === false) ? $home_team : $away_team;
    $scores[] = array(
        'awayteam' => $away_team,
        'awayscore' => strip_tags($away_score),
        'hometeam' => $home_team,
        'homescore' => strip_tags($home_score),
        'winner' => $winner
    );
}

echo "<br><hr>";
echo  "Scores from week: $Current_Week";
echo "<hr>";
echo "<br>";
//see how the scores array looks
echo '<pre>' . print_r($scores, true) . '</pre>';

//game results and winning teams can now be accessed from the scores array
//e.g. $scores[0]['awayteam'] contains the name of the away team (['awayteam'] part) from the first game on the page ([0] part)
?>
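
To see what preg_match_all() hands back, the same idea can be run against a short string. This is a sketch under stated assumptions: the markup in $sample is invented and much simpler than the real scoreboard page, and the pattern is shortened to match it.

```php
<?php
// Invented markup: one game, away team first, winner's score in bold.
$sample = '<a href="/nfl/teams/buf">Buffalo</a><span class="yspscores"><b>35</b>&nbsp;</span>'
        . '<a href="/nfl/teams/gnb">Green Bay</a><span class="yspscores">10&nbsp;</span>';

// Two capture groups: (1) team name, (2) score, possibly wrapped in <b>.
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<span class="yspscores">(.*?)&nbsp;|is';

$found = preg_match_all($find, $sample, $matches);
if ($found === false || $found === 0) {
    // Nothing matched: the page layout has probably changed.
    echo 'No scores found';
} else {
    // $matches[1] holds the team names, $matches[2] the raw scores.
    for ($i = 0; $i < $found; $i++) {
        echo $matches[1][$i], ': ', strip_tags($matches[2][$i]), "<br>\n";
    }
}
```

Checking the return value of preg_match_all() is worth the extra lines, because a scraper tends to fail quietly when the site changes its markup.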


Pro Football Scores.

One thing I like about this script is that I can grab the pro football scores after the week is over and see how the teams did, without spending time clicking through the original site to get to them. The scores shown are from preseason week 2 of 2012, with one game still to be played. The script works best after all the games for the week have been played. It is a combination of several scripts that I edited together. I still have one more edit to do to trim up the table.

<?php
//----------------------------------------------------------------
//functions
function do_offset($level){
    $offset = "";             // offset for subarray
    for ($i=1; $i<$level;$i++){
    $offset = $offset  . "<td></td>";
    }
    return $offset;
}

function show_array($array, $level, $sub){
    if (is_array($array) == 1){          // check if input is an array
       foreach($array as $key_val => $value) {
           $offset = "";
           if (is_array($value) == 1){   // array is multidimensional
           echo "<tr>";
           $offset = do_offset($level);
           show_array($value, $level+1, 1);
           echo "</tr>\n";              // close the row opened above
           }
           else{                        // (sub)array is not multidim
           if ($sub != 1){          // first entry for subarray
           //    echo "<tr nosub>";
               $offset = do_offset($level);
           }
           $sub = 0;
           echo $offset . "<td width=\"120\">" . $key_val .
               "</td><td width=\"120\">" . $value . "</td>";
           // echo "</tr>";
    }
       } //foreach $array
      } 
    else{ // argument $array is not an array
        return;
    }
}

function html_show_array($array){
  echo "<table cellspacing=\"0\" border=\"2\">\n";
  show_array($array, 1, 0);
  echo "</table>\n";
}

//end functions
//---------------------------------------------------------------
//set current game week
$Current_Week = 'p1'; // a preseason week, just to see if the script works. When the season starts, '1' will denote week one, etc.

//load source code, depending on the current week, of the website into a variable as a string
$url = "http://sports.yahoo.com/nfl/scoreboard?w=$Current_Week";
$string = file_get_contents($url);

//set search pattern (using regular expressions)
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<td align="right" class="ysptblclbg6 total">.*?<span class="yspscores">(.*?)&nbsp;|is';

//search the string for the pattern, and store the content found inside the set of parens in the array $matches
//$matches[1] is going to hold team names in the order they appear on the page, and $matches[2] the scores
preg_match_all($find, $string, $matches);

//initiate scores array, to group teams and scores together in games
$scores = array();

//count number of teams found, to be used in the loop below
$count = count($matches[1]);

//loop from 0 to $count, in steps of 2
//this is done in order to group 2 teams and 2 scores together in games, with each iteration of the loop
//trim() is used to trim away any whitespace surrounding the team names and scores
//strip_tags() is used to remove the HTML bold tag (<b>) from the winning scores
for ($i = 0; $i < $count; $i += 2) {
    $away_team = trim($matches[1][$i]);
    $away_score = trim($matches[2][$i]);
    $home_team = trim($matches[1][$i + 1]);
    $home_score = trim($matches[2][$i + 1]);
    $winner = (strpos($away_score, '<') === false) ? $home_team : $away_team;
    $scores[] = array(
        'awayteam' => $away_team,
        'awayscore' => strip_tags($away_score),
        'hometeam' => $home_team,
        'homescore' => strip_tags($home_score),
        'winner' => $winner
    );
}

echo "<br><hr>";
echo  "Scores from week: $Current_Week";
echo "<hr>";
echo "<br>";
//see how the scores array looks
// echo '<pre>' . print_r($scores, true) . '</pre>';

$input = $scores;
$cols = 5;
// echo "<br>";
// echo count($scores);
//echo "<br>";   
//    echo "<table border=\"5\" cellpadding=\"10\">";
// echo "<tr>";
// echo "<td>away_team</td>";
// echo "<td>away_score</td>";
// echo "<td>home_team</td>";
// echo "<td>home_score</td>";
// echo "<td>winner</td>";
// echo "</tr>";
//   for ($i=0; $i < count($input); $i++)
//    {
//    echo "<tr>";
//        for ($c=0; $c<$cols; $c++)
//      {
// echo "<td>$input[$i]</td>";
//         echo "<td>$away_team</td>";
//         echo "<td>$away_score</td>";
//         echo "<td>$home_team</td>";
//         echo "<td>$home_score</td>";
//         echo "<td>$winner</td>";

//      }
//    echo "</tr>";
//   }
       
//    echo "</table>"; 


// foreach($scores as $key_val => $value) {
//           $offset = "";
//           if (is_array($value) == 1){   // array is multidimensional
//           echo "<tr>";
//           $offset = do_offset($level);
//           echo $offset . "<td>" . $key_val . "</td>";
//           show_array($value, $level+1, 1);
//           }
//           else{                        // (sub)array is not multidim
//           if ($sub != 1){          // first entry for subarray
//               echo "<tr nosub>";
//               $offset = do_offset($level);
//           }
//           $sub = 0;
//           echo $offset . "<td main ".$sub." width=\"120\">" . $key_val .
//           "</td><td width=\"120\">" . $value . "</td>";
//           echo "</tr>\n";
//           }
//       } //foreach $array
//game results and winning teams can now be accessed from the scores array
//e.g. $scores[0]['awayteam'] contains the name of the away team (['awayteam'] part) from the first game on the page ([0] part)

html_show_array($scores);
?>
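
Because every game in $scores has the same five keys, the table can also be built with two plain loops. This is a minimal sketch, assuming an array shaped like the one the script builds (the function name scores_to_table() and the sample game are made up for illustration):

```php
<?php
// Render a list of games (arrays with identical keys) as an HTML table.
function scores_to_table(array $scores) {
    if (count($scores) === 0) {
        return '<p>No scores found.</p>';
    }
    $html = "<table cellspacing=\"0\" border=\"2\">\n<tr>";
    // Header row taken from the keys of the first game.
    foreach (array_keys($scores[0]) as $key) {
        $html .= '<th>' . htmlspecialchars($key) . '</th>';
    }
    $html .= "</tr>\n";
    // One row per game, one cell per value.
    foreach ($scores as $game) {
        $html .= '<tr>';
        foreach ($game as $value) {
            $html .= '<td>' . htmlspecialchars($value) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

// Sample data in the same shape as $scores above.
$sample = array(
    array('awayteam' => 'Seattle', 'awayscore' => '30',
          'hometeam' => 'Denver', 'homescore' => '10', 'winner' => 'Seattle'),
);
echo scores_to_table($sample);
```

This gives one header row and one row per game, with every row opened and closed, at the cost of only handling a flat list of games rather than arbitrarily nested arrays.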


First the Model, Then the Window Dressing.

I have given you two sets of scripts to play with. The first script in each set gets you the data. The second adds tables to make the data more readable. In part III, we will put it all together.

Scraping Our Own Page.

We could even scrape our own page if we wanted to, and get the following.

$ cat allscoresfile.txt
 Buffalo Bills    35     Green Bay        10
 Cincinnati       24     Atlanta          19
 Tennessee        30     Tampa Bay        7
 Buffalo          14     Minnesota        36
 Detroit          27     Baltimore        12
 Miami            17     Carolina         23
 Jacksonville     27     New Orleans      24
 Oakland          27     Arizona          31
 NY Giants        26     NY Jets          3
 Washington       31     Chicago          33
 San Francisco    9      Houston          20
 Kansas City      17     St. Louis        31
 Dallas           20     San Diego        28
 Seattle          30     Denver           10
 Indianapolis     24     Pittsburgh       26
 Philadelphia     27     New England      17

You need to use the local script:

scoreget.sh
[code]
#===================================
# Get scores
#
# only keep the lines of the dump that contain this key
team="awayteam"
# output data
lynx -width 1000 -dump "http://oesrvr1/testcode/getscores1.php" | grep "$team" > scorefile
cut -c 12-25 scorefile > f1
cut -c 37-39 scorefile > f2
cut -c 49-60 scorefile > f3
cut -c 70-72 scorefile > f4
paste f1 f2 f3 f4 > allscoresfile.txt
[/code]
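
The same four-column text can also be produced in PHP itself, without lynx and cut. This is a sketch, not the method used above: it assumes a $scores array shaped like the one built earlier, and the function name scores_to_text() and the column widths are arbitrary choices.

```php
<?php
// Format each game as one fixed-width text line:
// away team, away score, home team, home score.
function scores_to_text(array $scores) {
    $lines = '';
    foreach ($scores as $game) {
        // %-16s pads a string on the right to 16 characters.
        $lines .= sprintf(" %-16s %-6s %-16s %s\n",
            $game['awayteam'], $game['awayscore'],
            $game['hometeam'], $game['homescore']);
    }
    return $lines;
}

// Sample data; in practice this would be the scraped $scores array.
$sample = array(
    array('awayteam' => 'Cincinnati', 'awayscore' => '24',
          'hometeam' => 'Atlanta', 'homescore' => '19'),
);
echo scores_to_text($sample);
// To save the result: file_put_contents('allscoresfile.txt', scores_to_text($sample));
```

Formatting from the array avoids the fixed character offsets that cut depends on, which shift whenever the layout of the dumped page changes.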

Espn App Developer

If you are really interested in getting scores off the web, you might want to become an ESPN app developer: http://developer.espn.com/