PHP scraping using DOM and XPath tutorial

Table of contents

Introduction
DOM Tree Terminology
What are Nodes?
Retrieve your document
Prepare your file using Tidy
DOM Returned values
Locating Nodes
Load document
A single element example
A list of elements example
XPath
XPath Conditions
XPath real example
Resources

Introduction

Before scraping a website, check whether its API already provides the data you need; if it does, use it. If the data you want is not available through the API, it’s time to roll up your sleeves and write your own small script to extract it.

If you know how to program in PHP, it won’t be hard to use it to extract data from any website. In this tutorial I will not explain how to build a complete scraper; I will show you how to use the DOM and XPath together with PHP to navigate a document and extract the data you need.

DOM Tree Terminology

The Document Object Model (DOM) represents any XML or HTML document as a tree. It is like a family tree composed of nodes, and each node can have child nodes.

Let’s take a look at the following XML example:

[Figure: XML DOM tree visualization]

This document consists of:

  1. elements, shown in yellow
  2. attributes, shown in red
  3. node values, shown in green

breakfast is the root node.

food is a child of the breakfast element.

name is a child of the food element.

price is the next sibling of name (they share the same parent).

name is the previous sibling of price (they share the same parent).

$6.95 is the node value of the price element (also called a tag).
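The relationships above can be walked in PHP with the DOM extension. This is a minimal sketch: the XML string is a hypothetical one-item sample modeled on the description (element names, the id and currency attributes, and the values are assumptions, not the original file). It is written without whitespace between tags so that firstChild and nextSibling land on elements rather than on whitespace text nodes.

```php
<?php
// Hypothetical sample modeled on the description above; not the tutorial's exact file.
$xml = '<breakfast>'
     . '<food id="1"><name>Belgian Waffles</name><price currency="usd">$6.95</price></food>'
     . '</breakfast>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$root  = $dom->documentElement; // <breakfast>, the root node
$food  = $root->firstChild;     // <food>, a child of <breakfast>
$name  = $food->firstChild;     // <name>, a child of <food>
$price = $name->nextSibling;    // <price>, the next sibling of <name>

echo $root->nodeName, PHP_EOL;                   // breakfast
echo $price->previousSibling->nodeName, PHP_EOL; // name
echo $price->parentNode->nodeName, PHP_EOL;      // food
echo $price->nodeValue, PHP_EOL;                 // $6.95
```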

What are Nodes?

There are different types of nodes, including:
element nodes: an opening <tagname>, the closing </tagname>, and everything in between
attribute nodes: the attributes inside an element’s opening tag, as in <tagname attribute="value">
text nodes: the text content of an element, that is, the text between <tagname> and </tagname>
For example:

[Figure: XML node types]
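As a rough illustration of the three node types, the sketch below loads a single hypothetical price element (the tag, attribute, and value are assumptions chosen to match the tutorial's sample) and checks each node's nodeType against the constants provided by PHP's DOM extension.

```php
<?php
$dom = new DOMDocument();
// Hypothetical one-element document, used for illustration only.
$dom->loadXML('<price currency="usd">$6.95</price>');

$element   = $dom->documentElement;                  // the <price> element node
$attribute = $element->getAttributeNode('currency'); // the currency attribute node
$text      = $element->firstChild;                   // the text node inside <price>

var_dump($element->nodeType === XML_ELEMENT_NODE);     // bool(true)
var_dump($attribute->nodeType === XML_ATTRIBUTE_NODE); // bool(true)
var_dump($text->nodeType === XML_TEXT_NODE);           // bool(true)
```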

Retrieve your document

In PHP there are many ways to retrieve the document we are going to parse: built-in functions such as file_get_contents, or the cURL extension. For this tutorial I will keep it simple and retrieve documents with file_get_contents, which works with both the http and https protocols.
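A minimal sketch of that retrieval step might look like the following. The helper name fetch_document and the placeholder URL are assumptions made for illustration; the only real API used is file_get_contents, which returns false on failure.

```php
<?php
// Hypothetical helper: fetch a document and fail loudly instead of returning false.
function fetch_document(string $url): string
{
    // @ silences the PHP warning; the false return value is handled explicitly below.
    $contents = @file_get_contents($url);
    if ($contents === false) {
        throw new RuntimeException("Could not retrieve $url");
    }
    return $contents;
}

// Usage (the URL is a placeholder):
// $file = fetch_document('https://example.com/');
```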

This retrieves the contents of the web page into the $file variable.

Prepare your file using Tidy

At this point you have retrieved the body of the document and it’s ready for parsing. Before that, we need to clean up any malformed markup. The Tidy extension can handle this task for you; I have created a function for it:
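The original function is not reproduced here, so what follows is only a sketch of what such a helper could look like. The name tidy_clean, the $is_xml flag, and the choice of Tidy options are assumptions; it requires the tidy extension and throws if that extension is missing.

```php
<?php
// Hypothetical helper; name, signature, and option choices are assumptions.
function tidy_clean(string $input_string, bool $is_xml = false): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not installed');
    }

    // XML mode parses and emits XML; the default mode emits XHTML.
    $config = $is_xml
        ? ['input-xml' => true, 'output-xml' => true, 'wrap' => 0]
        : ['output-xhtml' => true, 'wrap' => 0];

    $tidy = new tidy();
    $tidy->parseString($input_string, $config, 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

// Usage:
// $clean_html = tidy_clean($file);        // HTML in, XHTML out
// $clean_xml  = tidy_clean($file, true);  // treat the source as XML
```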

 

This function cleans up your document’s markup. By default, it treats the input as HTML and produces XHTML output.

$input_string is your HTML or XML source. If you are dealing with an XML document, pass a second parameter to say so.

Now that you have retrieved the document and cleaned it up, it’s time to extract the data we want from it.

DOM Returned values

Each node we get back has methods and properties. In this tutorial I will focus only on the ones most frequently used in parsing:

  1. nodeName
  2. nodeValue
  3. getAttribute()

nodeName = price
nodeValue = $9.95
getAttribute('currency') = usd
If we get back a node list instead, it has a useful length property that holds the number of nodes returned; I will explain it later.
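The values listed above can be reproduced with a short sketch. The one-item XML string is a hypothetical stand-in built to match those values, not the tutorial's actual document.

```php
<?php
$dom = new DOMDocument();
// Hypothetical fragment matching the values listed above.
$dom->loadXML('<food id="1"><price currency="usd">$9.95</price></food>');

$nodelist = $dom->getElementsByTagName('price'); // a DOMNodeList
$price    = $nodelist->item(0);                  // the first (and only) <price> node

echo $price->nodeName, PHP_EOL;                 // price
echo $price->nodeValue, PHP_EOL;                // $9.95
echo $price->getAttribute('currency'), PHP_EOL; // usd
echo $nodelist->length, PHP_EOL;                // 1
```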

Locating Nodes

In parsing routines we need to locate either a single node or a list of nodes, and the DOMDocument class offers two methods to help with that:

  1. getElementById() locates a single element with a given id
  2. getElementsByTagName() locates a list of elements with a given tag name

The list returned by getElementsByTagName() has a useful length property that tells you how many nodes it contains (0 if it is empty), and the list can be iterated in a foreach loop like a PHP array.
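To make the two lookups concrete, here is a small sketch on a hypothetical HTML snippet (the menu id and the list items are invented for illustration). It also shows that a tag name with no matches yields an empty list with length 0 rather than an error.

```php
<?php
$dom = new DOMDocument();
// Hypothetical, well-formed HTML used only for illustration.
$dom->loadHTML('<html><body><ul id="menu"><li>Waffles</li><li>Toast</li></ul></body></html>');

$menu  = $dom->getElementById('menu');          // a single element, or null if not found
$items = $dom->getElementsByTagName('li');      // a list of matching elements
$none  = $dom->getElementsByTagName('table');   // no matches: an empty list

echo $menu->nodeName, PHP_EOL; // ul
echo $items->length, PHP_EOL;  // 2
echo $none->length, PHP_EOL;   // 0

// The list can be walked with foreach, like an array.
foreach ($items as $item) {
    echo $item->nodeValue, PHP_EOL;
}
```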

Load document

The DOMDocument class offers two methods to load an HTML document:

  1. loadHTML: load HTML from a string of source code
  2. loadHTMLFile: load HTML from a file

The DOMDocument class will emit warnings if the HTML page is not well formed, which is something you don’t want to encounter. So first pass the HTML code to the tidy function to clean it up; if this doesn’t eliminate the problems, the errors can be suppressed:

An example:
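One common way to keep parse warnings out of the output is the libxml_use_internal_errors() function, which is used in the sketch below; whether this is the exact approach the tutorial originally showed is an assumption. The HTML string is a deliberately sloppy, made-up sample.

```php
<?php
// Hypothetical markup with an unclosed <p> and a non-HTML <food> tag.
$html = '<html><body><p>Unclosed paragraph<food id="1">not an HTML tag</food></body></html>';

// Collect parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects, useful for logging
libxml_clear_errors();         // free the collected errors

echo count($errors), ' parse problem(s) collected', PHP_EOL;
echo $dom->getElementsByTagName('food')->length, ' food element(s) found', PHP_EOL;
```

The document is still parsed and usable after the errors are collected, so the remaining steps work the same way.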

So far we have loaded the document into a DOMDocument object, ready to be parsed.

 

A single element example

So far I have explained how the DOMDocument class loads documents and locates nodes, and what values it returns. Let’s put all of this together.
I will work on the XML example from the beginning of this tutorial, which you can find at this link:
https://raw.githubusercontent.com/abdul202/demos/master/dom_tree
In this example I want to extract a food element with all its child nodes, including its full HTML markup.
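A sketch of this step, under stated assumptions: instead of downloading the linked file, it uses a short inline stand-in whose names, ids, and prices are hypothetical, and it picks the id "2" arbitrarily. The document is loaded with loadHTML so that id attributes are recognized by getElementById(), and parse errors are collected internally because breakfast and food are not HTML tags.

```php
<?php
// Hypothetical stand-in for the tutorial's sample document.
$source = '<breakfast>'
        . '<food id="1"><name>Belgian Waffles</name><price currency="usd">$6.95</price></food>'
        . '<food id="2"><name>French Toast</name><price currency="usd">$4.50</price></food>'
        . '</breakfast>';

libxml_use_internal_errors(true); // <breakfast> and <food> are not HTML tags
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$element = $dom->getElementById('2'); // null when no element has that id
if ($element !== null) {
    // Passing the node limits the output to that node's own markup.
    echo $dom->saveHTML($element), PHP_EOL;
}
```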

The result:

Note that I passed $element as a parameter to saveHTML($element) so that it returns the full HTML markup of just the node returned by getElementById(); otherwise you get the source of the full page instead.

A list of elements example

In this example I want to extract the names of all the foods.
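A minimal sketch of the list case follows. The five food names are hypothetical placeholders standing in for the sample file, and the document is loaded with loadXML here simply because the inline sample is well-formed XML; getElementsByTagName() behaves the same way on a document loaded with loadHTML.

```php
<?php
// Hypothetical stand-in for the sample document, with five <food> entries.
$source = '<breakfast>'
        . '<food id="1"><name>Belgian Waffles</name></food>'
        . '<food id="2"><name>Strawberry Belgian Waffles</name></food>'
        . '<food id="3"><name>Berry-Berry Belgian Waffles</name></food>'
        . '<food id="4"><name>French Toast</name></food>'
        . '<food id="5"><name>Homestyle Breakfast</name></food>'
        . '</breakfast>';

$dom = new DOMDocument();
$dom->loadXML($source);

$nodelist = $dom->getElementsByTagName('name');

$names = [];
if ($nodelist->length > 0) {          // only loop when something was found
    foreach ($nodelist as $node) {
        $names[] = $node->nodeValue;  // the text inside each <name>
    }
}

echo $nodelist->length, ' names found', PHP_EOL; // 5 names found
print_r($names);
```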

The result

The result explained

 

The $nodelist->length property is useful for checking whether your script found any nodes at all.

In our previous example it is 5, because there are 5 <name> tags.

XPath

XPath is a query language for selecting nodes from an XML or HTML document. If you have never used it, there is a great overview of it in the resources section at the end of this tutorial.

The XPath syntax is designed to mimic URL (Uniform Resource Locator) and Unix-style file path syntax.
In our previous XML example:
/breakfast/food/name
returns a list of the name nodes of every food element in breakfast.
Another example, to get all name nodes:
//name

A single forward slash / separates the steps of a path down to our nodes.
A double forward slash // selects all matching nodes, no matter where they are located in the document.
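Both kinds of path can be tried with PHP's DOMXPath class. In this sketch the XML string is a hypothetical two-item sample; an extra name element is placed directly under breakfast to show that the absolute path and the // form do not always select the same nodes.

```php
<?php
$dom = new DOMDocument();
// Hypothetical sample; the <name> directly under <breakfast> is there to show the difference.
$dom->loadXML(
    '<breakfast><name>Morning menu</name>'
    . '<food id="1"><name>Belgian Waffles</name></food>'
    . '<food id="2"><name>French Toast</name></food>'
    . '</breakfast>'
);

$xpath = new DOMXPath($dom);

$by_path  = $xpath->query('/breakfast/food/name'); // only <name> nodes at exactly this path
$anywhere = $xpath->query('//name');               // every <name> node in the document

echo $by_path->length, PHP_EOL;  // 2
echo $anywhere->length, PHP_EOL; // 3
```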

XPath Conditions

Square brackets are used to delimit a conditional expression.
//food[@id = "2"]/name
returns all name nodes of the food node with an id of "2".
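The condition above can be run through DOMXPath as well. The two-item XML string is again a hypothetical sample, so the name printed is only what this made-up data contains.

```php
<?php
$dom = new DOMDocument();
// Hypothetical sample with two food entries.
$dom->loadXML(
    '<breakfast>'
    . '<food id="1"><name>Belgian Waffles</name></food>'
    . '<food id="2"><name>French Toast</name></food>'
    . '</breakfast>'
);

$xpath = new DOMXPath($dom);

// Single quotes around the PHP string leave the double quotes free for the XPath literal.
$matches = $xpath->query('//food[@id = "2"]/name');

foreach ($matches as $node) {
    echo $node->nodeValue, PHP_EOL; // French Toast
}
```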

XPath real example

In this example I will extract a list of the titles of new movies in theaters from IMDb, from this link: http://www.imdb.com/movies-in-theaters/

First we need to take a look at the page’s HTML source code. You will find that each movie’s details are wrapped in a table, and that the movie title is a link (an a tag) wrapped in an h4 tag with the attribute itemprop="name".

So we build an XPath query to extract a list of all the movie a tags:

It will return all the a tags that contain the titles. To explain it in more detail:

 

To extract all the titles from this table, here’s the code:
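The sketch below shows one way this could look. It does not fetch the live page: the HTML is a hypothetical fragment that imitates the structure described above (a table per movie, with the title link inside an h4 carrying itemprop="name"), the titles and links are invented, and the query //h4[@itemprop="name"]/a is an assumption derived from that description. The real page's markup may differ or may have changed.

```php
<?php
// Hypothetical fragment imitating the described structure; not real IMDb markup.
$html = <<<'HTML'
<html><body>
<table><tr><td><h4 itemprop="name"><a href="/title/tt0000001/">First Movie (2016)</a></h4></td></tr></table>
<table><tr><td><h4 itemprop="name"><a href="/title/tt0000002/">Second Movie (2016)</a></h4></td></tr></table>
</body></html>
HTML;

libxml_use_internal_errors(true); // real-world pages rarely parse without warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = $xpath->query('//h4[@itemprop="name"]/a'); // every title link

$titles = [];
foreach ($links as $link) {
    $titles[] = trim($link->nodeValue); // the link text is the movie title
}

print_r($titles);
```

To run this against a real page, $html would come from the retrieval and cleanup steps described earlier instead of the inline fragment.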

The result:

I hope this explained how to scrape data using PHP’s DOMDocument and DOMXPath.

Resources

XPath overview

The DOMDocument class

 

 
