17 Dec 2014

Scraping Images from a Website in PHP

While working through some SEO enhancements for Homebräu, such as building a Sitemap with images, I needed a way to gather all the images on a page and return them in a usable form. I also needed to use this logic in a couple of places, so it had to be flexible.

Here is a tutorial for creating a script that scrapes a page or HTML string for images and returns all of their src attributes. Let's look at the PHP first, then a couple of example implementations.

PHP Function to Scrape Images from HTML
<?php

////////////////////////////////////
//                                //
//  function scrapeImages($html)  //
//                                //
////////////////////////////////////
//
//  DESCRIPTION: scrapes an HTML string to find image tags and collect their "src" attributes
//
//  PARAMETERS:
//
//  $html - String - an HTML string
//
//  RETURNS:
//
//  returns an array of image src attributes e.g. ['/images/cat.jpg','/images/dog.jpg']

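function scrapeImages($html) {
	$imgArray = array();
	//nothing to parse; loadHTML() also rejects an empty string
	if (trim((string) $html) === '') {
		return $imgArray;
	}
	$dom = new DOMDocument();
	//real-world HTML is rarely valid, so collect parser warnings instead of printing them
	$previous = libxml_use_internal_errors(true);
	$dom->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);
	//find all the images in the HTML
	$images = $dom->getElementsByTagName('img');
	//for each image tag, grab its src attribute and add it to the array
	foreach ($images as $image) {
		array_push($imgArray,$image->getAttribute('src'));
	}

	//return the array of image src values
	return $imgArray;
}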
?>
Example Call to Scrape a URL

The first example shows how to use the function to scrape images from a page by passing it the HTML fetched from a URL.

<?php
//example calls:

//scrape from a URL
$images1 = scrapeImages(file_get_contents('http://www.example.com/'));

//print all the images that were found
print_r($images1);

?>
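
One thing the call above doesn't handle is a failed request: file_get_contents() returns false when the page can't be fetched, and that value would go straight into the parser. Here is a minimal sketch of a wrapper that returns an empty array in that case. scrapeImagesFromUrl() is a hypothetical helper for illustration, not part of the original function, and it repeats the parsing steps only so the snippet runs on its own.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// fetch a page and return the src values of its <img> tags,
// or an empty array when the page can't be read.
function scrapeImagesFromUrl($url) {
	// file_get_contents() returns false on failure; @ hides the warning it emits
	$html = @file_get_contents($url);
	if ($html === false || trim($html) === '') {
		return array();
	}

	$dom = new DOMDocument();
	// collect parser warnings from malformed markup instead of printing them
	$previous = libxml_use_internal_errors(true);
	$dom->loadHTML($html);
	libxml_clear_errors();
	libxml_use_internal_errors($previous);

	$sources = array();
	foreach ($dom->getElementsByTagName('img') as $image) {
		$sources[] = $image->getAttribute('src');
	}
	return $sources;
}

// usage: $images1 = scrapeImagesFromUrl('http://www.example.com/');
```

Returning an empty array keeps the calling code simple, since a foreach over the result just does nothing when the fetch fails.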
Example Call to Scrape from a Database

In the second example, we'll scrape the images from some HTML saved in a database.

<?php

//scrape from a database entry
//assumes $db is an open mysqli connection (the older mysql_* functions were removed in PHP 7)

$query = "SELECT html FROM entries;";
$result = $db->query($query);

$images2 = array();

//loop through the data query, and build a list of images for all the entries
while($row = $result->fetch_assoc()) {
	$images2 = array_merge($images2,scrapeImages(stripslashes($row['html'])));
}

//print all the images that were found in the database
print_r($images2);

?>
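
The merge loop can also be pulled into a small helper. The sketch below is one way to do it, assuming each row exposes an 'html' key. collectImagesFromRows() is a hypothetical name, and the guarded scrapeImages() definition is only a stand-in so the snippet runs on its own. Because the helper just uses foreach, it should accept a plain array, a PDOStatement, or a mysqli_result. Unlike the loop above, it also drops duplicate paths.

```php
<?php
// Stand-in for scrapeImages() from earlier in the post, repeated here
// only so this sketch is self-contained.
if (!function_exists('scrapeImages')) {
	function scrapeImages($html) {
		$imgArray = array();
		if (trim((string) $html) === '') {
			return $imgArray;
		}
		$dom = new DOMDocument();
		$previous = libxml_use_internal_errors(true);
		$dom->loadHTML($html);
		libxml_clear_errors();
		libxml_use_internal_errors($previous);
		foreach ($dom->getElementsByTagName('img') as $image) {
			$imgArray[] = $image->getAttribute('src');
		}
		return $imgArray;
	}
}

// Hypothetical helper: merge the images found in every row's 'html' column.
// $rows is anything foreach can walk that yields rows with an 'html' key.
function collectImagesFromRows($rows) {
	$images = array();
	foreach ($rows as $row) {
		// stripslashes() mirrors the example above; skip it if your HTML is stored unescaped
		$images = array_merge($images, scrapeImages(stripslashes($row['html'])));
	}
	// the same image often appears in several entries, so drop repeats and reindex
	return array_values(array_unique($images));
}

// With PDO (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=blog', 'user', 'password');
// $images2 = collectImagesFromRows($pdo->query('SELECT html FROM entries'));
```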
Advanced Example - Scrape Images from HTML to Build an Images Sitemap

This example is pretty close to the implementation used for Homebräu. Note that in this example there are two fields in the database: one containing the HTML for the page entry and a second with the page URL.

<?php

//scrape from a database entry, grab both the column containing the HTML and the column with the page URL
//assumes $db is an open mysqli connection (the older mysql_* functions were removed in PHP 7)
$query = "SELECT html,url FROM entries;";
$result = $db->query($query);

$images = array();


//start by generating the header for the sitemap.xml file
$sitemap  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
//the image: namespace is what allows <image:image> entries inside each <url>
$sitemap .= "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\" xmlns:image=\"http://www.google.com/schemas/sitemap-image/1.1\">";


//loop through the data query, and build a list of images for all the entries
while($row = $result->fetch_assoc()) {

	//find the array of images
	$images = scrapeImages(stripslashes($row['html']));


	//build the XML, remember all the paths need to be absolute, so remember to include the domain
	$sitemap .= "<url>";
	$sitemap .= "<loc>http://www.example.com/".stripslashes($row['url'])."</loc>";
	//iterate over the images, and build the XML for each image found on the page
	foreach($images as $src) {
		$sitemap .= "<image:image>";
		$sitemap .= "<image:loc>http://www.example.com/".$src."</image:loc>";
		$sitemap .= "</image:image>";
	}
	$sitemap .= "</url>";
}

//close the xml
$sitemap .= "</urlset>";


//print the XML for the Sitemap
echo($sitemap);

?>
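
Two details are easy to trip over when building the image URLs: a src that already starts with a slash (or is already a full URL) shouldn't simply be glued onto the domain, and characters like & have to be escaped before they go into XML. The following is a rough sketch of helpers for both, assuming the image:image and image:loc elements from Google's image sitemap extension. absoluteImageUrl(), xmlEscape() and imageSitemapEntry() are hypothetical names, and plain relative paths are treated as relative to the site root, which is a simplification.

```php
<?php
// Hypothetical helper: turn an image src into an absolute URL.
// Covers the common cases only and does not resolve "../" segments.
function absoluteImageUrl($src, $domain = 'http://www.example.com') {
	// already absolute, e.g. https://cdn.example.com/cat.jpg
	if (preg_match('#^https?://#i', $src)) {
		return $src;
	}
	// protocol-relative, e.g. //cdn.example.com/cat.jpg: reuse the site's scheme
	if (substr($src, 0, 2) === '//') {
		$scheme = parse_url($domain, PHP_URL_SCHEME) ?: 'http';
		return $scheme . ':' . $src;
	}
	// root-relative or relative path: join with exactly one slash
	// (assumption: relative paths are resolved against the site root)
	return rtrim($domain, '/') . '/' . ltrim($src, '/');
}

// Escape a value for use inside an XML element (& becomes &amp;, and so on).
function xmlEscape($value) {
	return htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Build the sitemap entry for a single image.
function imageSitemapEntry($src) {
	return '<image:image><image:loc>' . xmlEscape(absoluteImageUrl($src)) . '</image:loc></image:image>';
}
```

Inside the foreach over the images, the three lines that build each image entry could then become a single `$sitemap .= imageSitemapEntry($src);`.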

There you have it: an easy, automated way to scrape images from HTML. As the last example shows, it can also be used to automate including images in your Sitemap.

Got feedback or a tip? Let me know in the comments.

Comments:

  • Replying to Adam Konieska on Scraping a Website for Images with PHP to build an Images sitemap.xml

  • Sunny Techo
    Saturday, July 9th 2016 at 1:51 AM

    There is a correction needed in the function "scrapeImages"
    because it always returns an empty array.
    In this line,
    array_push($img,$image->getAttribute('src'));
    change it to
    array_push($imgArray,$image->getAttribute('src'));
