Developer Snippet Diary

How to scrape data from a website using PHP: 5 methods (DOMDocument, cURL, Simple HTML DOM Parser, Guzzle, PHP PhantomJS)


Web scraping is the process of extracting data from websites. In PHP, there are several methods to achieve this. Here are some of the most popular ones:

1. file_get_contents() and DOMDocument:
Using the file_get_contents() function, you can fetch the HTML content of a webpage as a string. Then, you can use DOMDocument to parse the HTML content and extract the desired data using DOMXPath.

2. cURL:
cURL (Client URL) is a library that allows you to make HTTP requests in PHP. You can use cURL to fetch the HTML content of a webpage and then parse it with DOMDocument and DOMXPath.

3. Simple HTML DOM Parser:
Simple HTML DOM Parser is an external PHP library specifically designed for web scraping. It lets you select and manipulate HTML elements more easily with a jQuery-like syntax. You can download the library from http://simplehtmldom.sourceforge.net/.

4. Guzzle and Symfony's DomCrawler:
Guzzle is a popular PHP HTTP client that can be used to fetch webpage content. Symfony's DomCrawler is a separate component for traversing and querying HTML and XML documents. You can combine the two libraries: Guzzle fetches the page and DomCrawler parses it.

5. PHP PhantomJS:
PhantomJS is a headless web browser that can be used for web scraping. PHP PhantomJS is a PHP wrapper for PhantomJS, allowing you to use it within your PHP scripts. This method is especially useful when you need to scrape websites that rely heavily on JavaScript.

Which method is best for you depends on your specific needs and preferences. If you're looking for a simple and lightweight solution, using file_get_contents() or cURL with DOMDocument might be sufficient. If you prefer a more advanced and feature-rich library, you could consider using Simple HTML DOM Parser, Guzzle with Symfony's DomCrawler, or PHP PhantomJS.

1. file_get_contents() and DOMDocument with example

<?php
// The URL you want to scrape
$url = 'https://example.com';
$html = file_get_contents($url); // Fetch the HTML content of the webpage
if ($html === false) {
    die('Could not fetch ' . $url . PHP_EOL); // Stop if the request failed
}
$dom = new DOMDocument(); // Initialize DOMDocument
libxml_use_internal_errors(true); // Suppress warnings due to ill-formed HTML
$dom->loadHTML($html); // Load the HTML content into DOMDocument
libxml_clear_errors(); // Clear errors
$xpath = new DOMXPath($dom); // Initialize DOMXPath, used to find nodes, text and HTML

######### SINGLE ELEMENT FIND #########
$element = $xpath->query('//h1')->item(0);
// Or: the first h1 whose class attribute contains "title"
// $element = $xpath->query("(//h1[contains(@class, 'title')])[1]")->item(0);
if ($element !== null) {
    echo 'Single element: ' . $element->nodeValue . PHP_EOL;
}

######### MULTIPLE ELEMENT FIND #########
$elements = $xpath->query('//div[@class="example-class"]');

######### GET ATTRIBUTE VALUE #########
if ($elements->length > 0) { // item(0) is null when nothing matched
    $attributeValue = $elements->item(0)->getAttribute('data-example-attribute');
    echo 'Attribute value: ' . $attributeValue . PHP_EOL;
}

######### LOOP THROUGH ELEMENTS #########
foreach ($elements as $element) {
    // Get the text content of each element
    echo 'Element text: ' . $element->nodeValue . PHP_EOL;
}
?>

In this example, we're using DOMDocument and DOMXPath for web scraping:

The query() method is used to find a single element or multiple elements. It always returns a DOMNodeList, so call item(0) to get the first match.

The getAttribute() method is used to get the value of an attribute.

To get the text content of an element, use the nodeValue property of the DOMElement object.
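
As a self-contained illustration of the three calls described above, the sketch below runs them against a small inline HTML string instead of a live page. The markup, the class names and the variable names are made up for the example. It also shows one thing worth knowing about contains(@class, ...): it is a substring test, so it can match more than intended.

```php
<?php
// Hypothetical markup used only for this illustration.
$html = <<<HTML
<html><body>
  <h1 class="page title">Product list</h1>
  <div class="example-class" data-example-attribute="first">Apple</div>
  <div class="example-class" data-example-attribute="second">Banana</div>
  <p class="subtitle">Not a title</p>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// query() always returns a DOMNodeList; item(0) is the first match or null.
$heading = $xpath->query('//h1')->item(0);
$headingText = $heading !== null ? trim($heading->nodeValue) : '';

// Collect the attribute value and text of every matching element.
$items = [];
foreach ($xpath->query('//div[@class="example-class"]') as $node) {
    $items[$node->getAttribute('data-example-attribute')] = trim($node->nodeValue);
}

// contains(@class, 'title') is a substring test, so it also matches "subtitle".
$looseMatches = $xpath->query("//*[contains(@class, 'title')]");

// Padding the attribute with spaces matches whole class names only.
$exactMatches = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
);

echo $headingText . PHP_EOL;
print_r($items);
echo $looseMatches->length . ' loose, ' . $exactMatches->length . ' exact' . PHP_EOL;
```

Note that @class="example-class" only matches elements whose class attribute is exactly that string. For elements carrying several classes, the padded contains() form is usually the safer choice.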

To get the HTML of an element:

$html = $element->ownerDocument->saveHTML($element);
echo $html;
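
Passing a node to saveHTML() returns that element's outer HTML, that is, the element's own tag plus everything inside it. If only the inner HTML is needed, one possible approach is to serialize each child node and join the results. The sketch below assumes a made-up snippet of markup and hypothetical variable names.

```php
<?php
// Hypothetical markup used only for this illustration.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<div id="box"><p>Hello <b>world</b></p><p>Bye</p></div>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$element = $xpath->query('//div[@id="box"]')->item(0);

// Outer HTML: the element itself plus everything inside it.
$outerHtml = $dom->saveHTML($element);

// Inner HTML: serialize each child node and join the pieces.
$innerHtml = '';
foreach ($element->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

echo $outerHtml . PHP_EOL;
echo $innerHtml . PHP_EOL;
```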

2. Using cURL

<?php
$url = "https://example.com";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
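$output = curl_exec($curl);
if ($output === false) {
    die('cURL error: ' . curl_error($curl) . PHP_EOL); // Stop if the request failed
}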
curl_close($curl);
############### Now parse the response with DOMDocument ###############
$dom = new DOMDocument(); // Initialize DOMDocument
libxml_use_internal_errors(true); // Suppress warnings due to ill-formed HTML
$dom->loadHTML($output); // Load the HTML content into DOMDocument
libxml_clear_errors(); // Clear errors
$xpath = new DOMXPath($dom); // Initialize DOMXPath, used to find nodes, text and HTML

######### SINGLE ELEMENT FIND #########
$element = $xpath->query('//h1')->item(0);
// Or: the first h1 whose class attribute contains "title"
// $element = $xpath->query("(//h1[contains(@class, 'title')])[1]")->item(0);
if ($element !== null) {
    echo 'Single element: ' . $element->nodeValue . PHP_EOL;
}

?>
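
Real pages often redirect, time out or return an error status, so it can help to wrap the cURL calls in a small function that reports failures. The sketch below is one way to do that. The function name fetchHtml(), the timeout default and the user agent string are assumptions made for the example, not part of cURL itself.

```php
<?php
/**
 * Fetch a page with cURL and return its body, or throw on failure.
 * fetchHtml() is a hypothetical helper name chosen for this example.
 */
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit time spent connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit total request time
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // placeholder value
    ]);

    $body   = curl_exec($curl);
    $error  = curl_error($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.

    if ($body === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status: $status");
    }
    return $body;
}
```

With a helper like this, the $output variable in the example above could be filled with $output = fetchHtml($url); inside a try/catch block, and the DOMDocument code would stay unchanged.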
Posted by: R GONDAL
Email: rizikmw@gmail.com