PHP Page Scraping (Part II).
by Computothought in Circuits > Websites
In the first instructable, we extracted plain text. Now we want to extract lists of data. You will need a web server that supports PHP, unless you have PHP installed locally. Here we will look at some more advanced scripts that I have collected and modified. One wonderful thing about page scraping is that you can build your own information or news page. That way you do not have to depend on a search engine or on other web pages that collect your favorites for you. This also gives you a bit more anonymity when using the web.
Note: The last two scripts worked in 2011. There is a minor bug in the scripts for 2012. I will wait until the week is over to see if it resolves itself. If it does not, the code will have to be modified.
Weather 1.
Get weather information from a specific site, without any formatting.
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.
for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.
$columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
$columnCount = $columns->length;
for($n=0;$n<$columnCount;$n++) { //go through the columns.
if($n == 2) {
$img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
$value = $img->item(0)->getAttribute('title');
} else {
$value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
}
${'a' . $n} = $value; //variable variable: stores column $n in $a0, $a1, $a2, ...
}
}
$patterns[0] = '/[^0-9]/';
$replacements[0] = '';
ksort($patterns);
ksort($replacements);
$a3 = preg_replace($patterns, $replacements, $a3);
$a5 = preg_replace($patterns, $replacements, $a5);
$a6 = preg_replace($patterns, $replacements, $a6);
echo $a0, '<br>', $a1, '<br>', $a2, '<br>', $a3, '<br>', $a4, '<br>', $a5, '<br>', $a6, '<br>', $a7, '<br>', $a8;
?>
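The row-and-column walk above can be tried without hitting the live site. The following is a minimal sketch, not the original script: the sample markup, the function name extract_row, and its parameters are assumptions made for illustration. It also checks that the table and row exist, since a page layout can change at any time.

```php
<?php
// Hypothetical sample markup standing in for the forecast page.
$html = '<html><body>
<table><tr><td>ignored</td></tr></table>
<table>
<tr><th>Date</th><th>Time</th><th>Weather</th><th>Temp</th></tr>
<tr><td>Mon</td><td>0900</td><td><img src="x.png" title="Light rain"></td><td>11 C</td></tr>
</table>
</body></html>';

// Return the text of every <td> in one row of one table (both zero-based).
function extract_row($html, $tableIndex, $rowIndex) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $table = $doc->getElementsByTagName('table')->item($tableIndex);
    if ($table === null) {
        return array(); // no such table: the layout probably changed
    }
    $row = $table->getElementsByTagName('tr')->item($rowIndex);
    if ($row === null) {
        return array(); // no such row
    }
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        $img = $cell->getElementsByTagName('img')->item(0);
        // Image cells carry their text in the title attribute.
        $cells[] = ($img !== null) ? $img->getAttribute('title') : trim($cell->nodeValue);
    }
    return $cells;
}

$cells = extract_row($html, 1, 1);               // second table, second row
$temp = preg_replace('/[^0-9]/', '', $cells[3]); // keep digits only
echo implode('<br>', $cells), '<br>', $temp;
```

Returning an array instead of creating $a0, $a1 and so on keeps the columns together and makes a missing table easy to detect with an empty result.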
Weather 2.
Get weather information from the same site, this time with formatting.
<?php
echo '<style type="text/css">
table {
border-collapse: collapse;
}
table, th, td {
border: 1px solid black;
padding: 2px;
}
</style>';
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.metoffice.gov.uk/weather/uk/he/stornoway_forecast_weather.html'); //load the file;
$desired_rows = 1; //How many rows you want from the table.
$table = $doc->getElementsByTagName('table'); //get our tables out, it should return 2 from the file, we only want the second.
$rows = $table->item(1)->getElementsByTagName('tr'); //pull the table rows from the second table (notice we select the second by item(1).)
$count = $rows->length; //returns a count of the table rows.
echo '<table id="weather"><tr>
<th rowspan="2">Date</th>
<th rowspan="2">Time</th>
<th rowspan="2">Weather</th>
<th rowspan="2">Temp</th>
<th colspan="3">Wind</th>
<th rowspan="2">Visibility</th>
</tr>
<tr>
<th>Dir</th>
<th>Speed</th>
<th>Gust</th>
</tr>'; //mock up of the original table headers.
for($i=2,$start=$i;$i<($start + $desired_rows);$i++) { //for loop, goes through the rows.
echo '<tr>'; //start row.
$columns = $rows->item($i)->getElementsByTagName('td'); //get columns for this row.
$columnCount = $columns->length;
for($n=0;$n<$columnCount;$n++) { //go through the columns.
if($n == 2) {
$img = $columns->item($n)->getElementsByTagName('img'); //the 3rd column is an image, so we must get the image title.
$value = $img->item(0)->getAttribute('title');
} else {
$value = $columns->item($n)->nodeValue; //else we will just take what is in the column.
}
echo '<td>' . $value . '</td>'; //push the column to the screen.
}
echo '</tr>'; //end the row.
}
echo '</table>'; //end the table.
?>
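Scraped text can contain characters such as < or & that would break the table if echoed as is. A minimal sketch of building the table as a string with every value escaped follows; render_table and the sample values are assumptions for illustration, not part of the original script.

```php
<?php
// Build an HTML table from scraped cell text, escaping each value.
function render_table(array $headers, array $rows) {
    $html = "<table id=\"weather\">\n<tr>";
    foreach ($headers as $header) {
        $html .= '<th>' . htmlspecialchars((string) $header) . '</th>';
    }
    $html .= "</tr>\n";
    foreach ($rows as $row) {
        $html .= '<tr>';
        foreach ($row as $value) {
            $html .= '<td>' . htmlspecialchars((string) $value) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . '</table>';
}

// Hypothetical sample values; a real run would pass the scraped columns.
$headers = array('Date', 'Time', 'Weather', 'Temp');
$rows = array(array('Mon', '0900', 'Light rain <showers>', '11'));
$table = render_table($headers, $rows);
echo $table;
```

Building the string first and echoing it once also makes it easy to cache the result or save it to a file.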
Pro Football Scores Example.
This is the original script I found. It works, but I wanted a more user-friendly version.
<?php
//set current game week
$Current_Week = 'p4'; #preweek 3, just to see if the script works. When the season starts, '1' will denote week one etc.
//load source code, depending on the current week, of the website into a variable as a string
$url = "http://sports.yahoo.com/nfl/scoreboard?w=$Current_Week";
$string = file_get_contents($url);
//set search pattern (using regular expressions)
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<td align="right" class="ysptblclbg6 total">.*?<span class="yspscores">(.*?) |is';
//search the string for the pattern, and store the content found inside the set of parens in the array $matches
//$matches[1] is going to hold team names in the order they appear on the page, and $matches[2] the scores
preg_match_all($find, $string, $matches);
//initiate scores array, to group teams and scores together in games
$scores = array();
//count number of teams found, to be used in the loop below
$count = count($matches[1]);
//loop from 0 to $count, in steps of 2
//this is done in order to group 2 teams and 2 scores together in games, with each iteration of the loop
//trim() is used to trim away any whitespace surrounding the team names and scores
//strip_tags() is used to remove the HTML bold tag (<b>) from the winning scores
for ($i = 0; $i < $count; $i += 2) {
$away_team = trim($matches[1][$i]);
$away_score = trim($matches[2][$i]);
$home_team = trim($matches[1][$i + 1]);
$home_score = trim($matches[2][$i + 1]);
$winner = (strpos($away_score, '<') === false) ? $home_team : $away_team;
$scores[] = array(
'awayteam' => $away_team,
'awayscore' => strip_tags($away_score),
'hometeam' => $home_team,
'homescore' => strip_tags($home_score),
'winner' => $winner
);
}
echo "<br><hr>";
echo "Scores from week: $Current_Week";
echo "<hr>";
echo "<br>";
//see how the scores array looks
echo '<pre>' . print_r($scores, true) . '</pre>';
//game results and winning teams can now be accessed from the scores array
//e.g. $scores[0]['awayteam'] contains the name of the away team (['awayteam'] part) from the first game on the page ([0] part)
?>
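The script above assumes that the download succeeds and that teams always come in pairs. The sketch below shows one way to guard against both cases. The sample markup, the shortened pattern, and the function name parse_games are assumptions for illustration; the real scoreboard markup is more complex and may have changed.

```php
<?php
// Hypothetical, simplified markup; the real scoreboard page is more complex.
$sample = '<a href="/nfl/teams/buf">Buffalo</a> <span class="score">14</span>
<a href="/nfl/teams/min">Minnesota</a> <span class="score"><b>36</b></span>';

// Same idea as the pattern above, shortened to fit the sample markup.
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<span class="score">(.*?)</span>|is';

// $page is the string from file_get_contents(), or false if the fetch failed.
function parse_games($page, $find) {
    if ($page === false) {
        return array();
    }
    if (!preg_match_all($find, $page, $matches)) {
        return array(); // no matches: the markup probably changed
    }
    $games = array();
    $count = count($matches[1]);
    // Require $i + 1 to exist so an unpaired team is skipped, not read past the end.
    for ($i = 0; $i + 1 < $count; $i += 2) {
        $away_score = trim($matches[2][$i]);
        $home_score = trim($matches[2][$i + 1]);
        $games[] = array(
            'awayteam'  => trim($matches[1][$i]),
            'awayscore' => strip_tags($away_score),
            'hometeam'  => trim($matches[1][$i + 1]),
            'homescore' => strip_tags($home_score),
            // As in the script above: a tag around the away score marks the away team as winner.
            'winner'    => (strpos($away_score, '<') === false)
                ? trim($matches[1][$i + 1]) : trim($matches[1][$i]),
        );
    }
    return $games;
}

$games = parse_games($sample, $find);
echo '<pre>' . print_r($games, true) . '</pre>';
```

With a live page, the first argument would be the result of file_get_contents($url), so a failed download simply yields an empty list of games.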
Pro Football Scores.
One thing I like about this script is that I can grab the pro football scores after the week is over and see how the teams did without spending time navigating the original site. The scores shown here are from preseason week 2 of 2012, with one game still to be played. The script works best after all the games for the week have been played. It is a combination of several scripts that I edited together. I still have one more edit to do to trim up the table.
<?php
//----------------------------------------------------------------
//functions
function do_offset($level){
$offset = ""; // offset for subarray
for ($i=1; $i<$level;$i++){
$offset = $offset . "<td></td>";
}
return $offset;
}
function show_array($array, $level, $sub){
if (!is_array($array)){ // argument $array is not an array
return;
}
foreach($array as $key_val => $value) {
$offset = "";
if (is_array($value)){ // array is multidimensional: one table row per subarray
echo "<tr>";
show_array($value, $level+1, 1);
echo "</tr>\n"; // close the row opened above
}
else{ // (sub)array is not multidim
if ($sub != 1){ // not the first entry of a subarray: pad with empty cells
$offset = do_offset($level);
}
$sub = 0;
echo $offset . "<td width=\"120\">" . $key_val .
"</td><td width=\"120\">" . $value . "</td>";
}
} //foreach $array
}
function html_show_array($array){
echo "<table cellspacing=\"0\" border=\"2\">\n";
show_array($array, 1, 0);
echo "</table>\n";
}
//end functions
//---------------------------------------------------------------
//set current game week
$Current_Week = 'p1'; #a preseason week ('p' prefix), just to see if the script works. When the season starts, '1' will denote week one etc.
//load source code, depending on the current week, of the website into a variable as a string
$url = "http://sports.yahoo.com/nfl/scoreboard?w=$Current_Week";
$string = file_get_contents($url);
//set search pattern (using regular expressions)
$find = '|<a href="/nfl/teams/.*?">(.*?)</a>.*?<td align="right" class="ysptblclbg6 total">.*?<span class="yspscores">(.*?) |is';
//search the string for the pattern, and store the content found inside the set of parens in the array $matches
//$matches[1] is going to hold team names in the order they appear on the page, and $matches[2] the scores
preg_match_all($find, $string, $matches);
//initiate scores array, to group teams and scores together in games
$scores = array();
//count number of teams found, to be used in the loop below
$count = count($matches[1]);
//loop from 0 to $count, in steps of 2
//this is done in order to group 2 teams and 2 scores together in games, with each iteration of the loop
//trim() is used to trim away any whitespace surrounding the team names and scores
//strip_tags() is used to remove the HTML bold tag (<b>) from the winning scores
for ($i = 0; $i < $count; $i += 2) {
$away_team = trim($matches[1][$i]);
$away_score = trim($matches[2][$i]);
$home_team = trim($matches[1][$i + 1]);
$home_score = trim($matches[2][$i + 1]);
$winner = (strpos($away_score, '<') === false) ? $home_team : $away_team;
$scores[] = array(
'awayteam' => $away_team,
'awayscore' => strip_tags($away_score),
'hometeam' => $home_team,
'homescore' => strip_tags($home_score),
'winner' => $winner
);
}
echo "<br><hr>";
echo "Scores from week: $Current_Week";
echo "<hr>";
echo "<br>";
//see how the scores array looks
// echo '<pre>' . print_r($scores, true) . '</pre>';
$input = $scores;
$cols = 5;
// echo "<br>";
// echo count($scores);
//echo "<br>";
// echo "<table border=\"5\" cellpadding=\"10\">";
// echo "<tr>";
// echo "<td>away_team</td>";
// echo "<td>away_score</td>";
// echo "<td>home_team</td>";
// echo "<td>home_score</td>";
// echo "<td>winner</td>";
// echo "</tr>";
// for ($i=0; $i < count($input); $i++)
// {
// echo "<tr>";
// for ($c=0; $c<$cols; $c++)
// {
// echo "<td>$input[$i]</td>";
// echo "<td>$away_team</td>";
// echo "<td>$away_score</td>";
// echo "<td>$home_team</td>";
// echo "<td>$home_score</td>";
// echo "<td>$winner</td>";
// }
// echo "</tr>";
// }
// echo "</table>";
// foreach($scores as $key_val => $value) {
// $offset = "";
// if (is_array($value) == 1){ // array is multidimensional
// echo "<tr>";
// $offset = do_offset($level);
// echo $offset . "<td>" . $key_val . "</td>";
// show_array($value, $level+1, 1);
// }
// else{ // (sub)array is not multidim
// if ($sub != 1){ // first entry for subarray
// echo "<tr nosub>";
// $offset = do_offset($level);
// }
// $sub = 0;
// echo $offset . "<td main ".$sub." width=\"120\">" . $key_val .
// "</td><td width=\"120\">" . $value . "</td>";
// echo "</tr>\n";
// }
// } //foreach $array
//game results and winning teams can now be accessed from the scores array
//e.g. $scores[0]['awayteam'] contains the name of the away team (['awayteam'] part) from the first game on the page ([0] part)
html_show_array($scores);
?>
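For a trimmer table, one option is to print one row per game, with a header row built from the array keys. This is only a sketch under assumptions: scores_table is a hypothetical function name, and the sample game is illustrative data shaped like the $scores array built above.

```php
<?php
// Render a list of games (associative arrays with identical keys) as a flat table.
function scores_table(array $scores) {
    if (count($scores) === 0) {
        return '<p>No scores found.</p>';
    }
    $html = "<table cellspacing=\"0\" border=\"2\">\n<tr>";
    foreach (array_keys($scores[0]) as $key) {
        $html .= '<th>' . htmlspecialchars((string) $key) . '</th>';
    }
    $html .= "</tr>\n";
    foreach ($scores as $game) {
        $html .= '<tr>';
        foreach ($game as $value) {
            $html .= '<td width="120">' . htmlspecialchars((string) $value) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html . "</table>\n";
}

// Illustrative sample data in the same shape as $scores.
$sample_scores = array(
    array(
        'awayteam'  => 'Buffalo',
        'awayscore' => '14',
        'hometeam'  => 'Minnesota',
        'homescore' => '36',
        'winner'    => 'Minnesota',
    ),
);
$table = scores_table($sample_scores);
echo $table;
```

Because the keys appear once in the header instead of beside every value, each game takes five cells per row rather than ten or more.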
First the Model, Then the Window Dressing.
I have given you two sets of scripts to play with. In each set, the first script gets the data, and the second adds tables to make the data more readable. In part III, we will put it all together.
Scraping Our Own Page.
We could even scrape our own page if we wanted to, and get the following.
$ cat allscoresfile.txt
Buffalo Bills 35 Green Bay 10
Cincinnati 24 Atlanta 19
Tennessee 30 Tampa Bay 7
Buffalo 14 Minnesota 36
Detroit 27 Baltimore 12
Miami 17 Carolina 23
Jacksonville 27 New Orleans 24
Oakland 27 Arizona 31
NY Giants 26 NY Jets 3
Washington 31 Chicago 33
San Francisco 9 Houston 20
Kansas City 17 St. Louis 31
Dallas 20 San Diego 28
Seattle 30 Denver 10
Indianapolis 24 Pittsburgh 26
Philadelphia 27 New England 17
You need to use the local script:
scoreget.sh
[code]
#===================================
# Get scores
#
# each game's row in the dumped page contains the text "awayteam", so grep keeps one line per game
team="awayteam"
# output data
lynx -width 1000 -dump "http://oesrvr1/testcode/getscores1.php" | grep "$team" > scorefile
cut -c 12-25 scorefile > f1
cut -c 37-39 scorefile > f2
cut -c 49-60 scorefile > f3
cut -c 70-72 scorefile > f4
paste f1 f2 f3 f4 > allscoresfile.txt
[/code]
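Since the scores already live in a PHP array, a similar text listing could also be produced without lynx and cut. The following is a minimal sketch, assuming a $scores array shaped like the one built in the football script; scores_to_lines is a hypothetical helper, and the column widths are arbitrary choices.

```php
<?php
// Turn each game into one fixed-width text line: team, score, team, score.
function scores_to_lines(array $scores) {
    $lines = array();
    foreach ($scores as $game) {
        // Names are left-justified in 14 columns, scores right-justified in 3.
        $lines[] = sprintf(
            '%-14s %3s %-14s %3s',
            $game['awayteam'],
            $game['awayscore'],
            $game['hometeam'],
            $game['homescore']
        );
    }
    return $lines;
}

// Illustrative sample data in the same shape as $scores.
$sample_scores = array(
    array('awayteam' => 'Buffalo', 'awayscore' => '14',
          'hometeam' => 'Minnesota', 'homescore' => '36', 'winner' => 'Minnesota'),
);
$lines = scores_to_lines($sample_scores);
// To save the listing instead, file_put_contents() could write the joined lines to a file.
echo implode("\n", $lines), "\n";
```

This avoids depending on exact character positions in the lynx dump, which shift whenever a team name or the table layout changes.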
ESPN App Developer
If you are really interested in getting scores off the web, you might be interested in becoming an app developer: http://developer.espn.com/