Regex / Regular expressions and XPaths alternatives every SEO needs

Tobias Willmann
Apr 25, 2018

In some situations, regular expressions and crawls with XPath make your SEO life much easier. This article lists examples for:

  • Search within the URL with regex
  • Search within the HTML with regex + XPath alternatives

You can use both regex and XPath to get insights out of your crawls. In many cases it's useful to set up your crawler before you start crawling, so that these things are extracted directly during the crawl. We use DeepCrawl and Screaming Frog as default crawlers at blick.ch. DeepCrawl, for example, allows you to set up extraction rules with regular expressions to pull out custom data.

Screaming Frog supports custom extraction with both XPath and regular expressions.

In addition, have a look at Selenium if you want to do custom crawls or crawl more complicated websites.

Search within the URL with regex

Extract the host / subdomain.domain.tld out of the URL with regex

Use this, for example, to check which external domains you link to most often.

https?:\/\/(.*?)\/

https://regex101.com/r/hGE6RY/1

You can use it, for example, in the regex search of Sublime Text.

Javascript:

To get with protocol:

const input = "https://www.blick.ch/sport/"
const regex = RegExp("https?:\\/\\/(.*?)\\/", "g")
const output = regex.exec(input)[0]
console.log(output)
// https://www.blick.ch/

To get without protocol:

const input = "https://www.blick.ch/sport/"
const regex = RegExp("https?:\\/\\/(.*?)\\/", "g")
const output = regex.exec(input)[1]
console.log(output)
// www.blick.ch
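
To check which hosts you link to most often, the same pattern can be applied to a whole list of links. The following is a minimal sketch, assuming the links are already available as an array; the `links` array and the `countHosts` helper are hypothetical names made up for this illustration.

```javascript
// Hypothetical list of outgoing links, e.g. exported from a crawl.
const links = [
  "https://www.blick.ch/sport/",
  "https://zrce.eu/feed/",
  "http://www.blick.ch/life/",
  "https://zrce.eu/",
]

// Same pattern as above. Without the "g" flag every exec() call starts at index 0.
const hostRegex = /https?:\/\/(.*?)\//

// Count how often each host occurs in the list.
function countHosts(urls) {
  const counts = {}
  for (const url of urls) {
    const match = hostRegex.exec(url)
    if (match === null) continue // no slash after the host, pattern does not match
    const host = match[1]
    counts[host] = (counts[host] || 0) + 1
  }
  return counts
}

const hostCounts = countHosts(links)
console.log(hostCounts)
// { 'www.blick.ch': 2, 'zrce.eu': 2 }
```

Note that the pattern needs a slash after the host, so a URL like https://zrce.eu without a trailing slash is skipped in this sketch.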

Extract last part of the URL path with regex

with or without trailing slash:

([^\/]+[^\/]|[^\/]+[\/])$

https://regex101.com/r/Hx68gr/2

Useful if you want to ignore the hierarchy in the URL and e.g. check for duplicate detail pages which were created in multiple folders.

Javascript:

const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
const regex = RegExp("([^\/]+[^\/]|[^\/]+[\/])$", "g")
const output = regex.exec(input)[0]
console.log(output)
// neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html
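
The duplicate check mentioned above could look roughly like this: group the URLs by their last path part and keep the groups with more than one member. This is only a sketch; the example URLs and the `findDuplicateLastParts` helper are assumptions for illustration.

```javascript
// Hypothetical URL list: the same detail page exists in two folders.
const crawledUrls = [
  "https://www.blick.ch/sport/formel1/some-article-id1.html",
  "https://www.blick.ch/news/some-article-id1.html",
  "https://www.blick.ch/life/other-article-id2.html",
]

// Same pattern as above: last part of the path, with or without trailing slash.
const lastPartRegex = /([^\/]+[^\/]|[^\/]+[\/])$/

// Group URLs by their last path part and return only the parts used more than once.
function findDuplicateLastParts(urls) {
  const groups = new Map()
  for (const url of urls) {
    const match = lastPartRegex.exec(url)
    if (match === null) continue
    const lastPart = match[0]
    if (!groups.has(lastPart)) groups.set(lastPart, [])
    groups.get(lastPart).push(url)
  }
  return Array.from(groups).filter(([, members]) => members.length > 1)
}

const duplicates = findDuplicateLastParts(crawledUrls)
console.log(duplicates.map(([lastPart]) => lastPart))
// [ 'some-article-id1.html' ]
```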

Find article IDs in URLs with regex

This example uses a URL that contains an article ID. The pattern looks like this:

-id(THE-ID).html

https://www.blick.ch/life/essen/rezept/saucen/rezept-mit-weniger-kalorien-so-macht-man-leichtere-sauce-hollandaise-id6460280.html

-id([0-9]+)\.html

https://regex101.com/r/yvolLF/1

Google Sheets

=REGEXEXTRACT(A2,"(?:.*id)([0-9]+)(?:\.html)")

Javascript:

const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
// The backslash is doubled so that the regex engine receives \. (a literal dot)
const regex = RegExp("-id([0-9]+)\\.html", "g")
const output = regex.exec(input)[1]
console.log(output)
// 15375623

Find article IDs in URL list with regex

Sometimes you're not just interested in the ID itself, but in all the URLs that contain an ID.

For example, the URL is https://www.blick.ch/life/essen/rezept/saucen/rezept-mit-weniger-kalorien-so-macht-man-leichtere-sauce-hollandaise-id6460280.html and you want to filter for URLs like it, e.g. in Google Analytics.

.*-id[0-9].*

https://regex101.com/r/CEjmWK/1

Javascript:

const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
// test() returns true if the URL contains "-id" followed by a digit
const regex = RegExp(".*-id[0-9].*")
const output = regex.test(input)
console.log(output)
// true
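
To keep only the URLs with an ID from a longer list, the filter pattern can be combined with `Array.prototype.filter`. A small sketch, assuming the URLs are in an array called `urlList` (a name chosen for this example):

```javascript
const urlList = [
  "https://www.blick.ch/life/essen/rezept/saucen/rezept-mit-weniger-kalorien-so-macht-man-leichtere-sauce-hollandaise-id6460280.html",
  "https://www.blick.ch/sport/",
  "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html",
]

// No "g" flag here: with "g", test() remembers lastIndex between calls
// and can give surprising results when reused across several strings.
const idUrlRegex = /.*-id[0-9].*/

const urlsWithId = urlList.filter((url) => idUrlRegex.test(url))
console.log(urlsWithId.length)
// 2
```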

Find the fragment / hashtag in a URL with regex

#(.+)

Javascript:

const input = "https://zrce.eu#test"
const regex = RegExp("#(.+)", "g")
const output = regex.exec(input)[1]
console.log(output)
// test

Search for http and https URLs with regex

Note the escaped slash in these patterns:

http:\/

or

https:\/

https://regex101.com/r/oDOQKg/1

Javascript:

This will return an array if the URL is https and null if http

const input = "https://zrce.eu#test"
const regex = RegExp("https:\/", "g")
const output = regex.exec(input)
console.log(output)
// [ 'https:/',
// index: 0,
// input: 'https://zrce.eu#test',
// groups: undefined ]

Find subdomain with regex

http[s]?:\/\/(.*?)\..*\/

https://regex101.com/r/541B9X/2
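
JavaScript, as a rough sketch (the `firstHostLabel` helper is a name made up for this example). Keep in mind that the pattern captures everything up to the first dot of the host, so for a host without a subdomain it returns the domain name itself:

```javascript
// Same pattern as above.
const subdomainRegex = /http[s]?:\/\/(.*?)\..*\//

// Return the part of the host before the first dot, or null if nothing matches.
function firstHostLabel(url) {
  const match = subdomainRegex.exec(url)
  return match === null ? null : match[1]
}

console.log(firstHostLabel("https://www.blick.ch/sport/"))
// www
console.log(firstHostLabel("https://zrce.eu/feed/"))
// zrce
```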

Search for the first or second folder in the URL with regex

Search for the first folder within a URL with regex:

https?:\/\/.*?\/(.*?)\/

https://regex101.com/r/0RxSTe/1

Search for the second folder within a URL with regex:

https?:\/\/.*?\/.*?\/(.*?)\/
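
JavaScript, as a minimal sketch. It assumes the patterns should match both http and https, so it uses `https?` at the start; the helper names `firstFolder` and `secondFolder` are invented for this example.

```javascript
const firstFolderRegex = /https?:\/\/.*?\/(.*?)\//
const secondFolderRegex = /https?:\/\/.*?\/.*?\/(.*?)\//

// Return the first capture group, or null if the URL has no such folder.
function firstFolder(url) {
  const match = firstFolderRegex.exec(url)
  return match === null ? null : match[1]
}

function secondFolder(url) {
  const match = secondFolderRegex.exec(url)
  return match === null ? null : match[1]
}

const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
console.log(firstFolder(input))
// sport
console.log(secondFolder(input))
// formel1
console.log(secondFolder("https://www.blick.ch/sport/"))
// null
```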

Find all parameters in a URL with regex

(\?|\&)([^=\n]+)\=([^&\n]+)

https://regex101.com/r/pwaKkv/1
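
JavaScript, as a sketch: with the "g" flag, `String.prototype.matchAll` returns every parameter, with the name in group 2 and the value in group 3. The example URL and the `extractParams` helper are assumptions for illustration.

```javascript
// Same pattern as above, with the "g" flag so that matchAll() finds every parameter.
const paramRegex = /(\?|\&)([^=\n]+)\=([^&\n]+)/g

// Collect all parameters of a URL into a plain object.
function extractParams(url) {
  const params = {}
  for (const match of url.matchAll(paramRegex)) {
    params[match[2]] = match[3] // group 2: name, group 3: value
  }
  return params
}

// Hypothetical URL with two parameters.
const params = extractParams("https://www.blick.ch/suche/?q=formel1&page=2")
console.log(params)
// { q: 'formel1', page: '2' }
```

The pattern does not know about fragments, so a trailing #fragment would end up as part of the last value.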

Search within the HTML with regex + XPath alternatives

Extract JSON-LD out of HTML using regex

type="application\/ld\+json">?([^<]*)

https://regex101.com/r/guBw4z/1
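
JavaScript, as a minimal sketch that applies the pattern (with straight quotes) to an HTML string and parses the captured text. The HTML snippet and the `extractJsonLd` helper are made up for this example. Because the capture stops at the first `<`, JSON-LD that contains a `<` character would be cut off.

```javascript
// Hypothetical HTML source of a page.
const html = '<html><head><script type="application/ld+json">{"@context":"https://schema.org","@type":"NewsArticle","headline":"Test"}</script></head></html>'

const jsonLdRegex = /type="application\/ld\+json">?([^<]*)/

// Return the parsed JSON-LD object, or null if none is found or it is not valid JSON.
function extractJsonLd(source) {
  const match = jsonLdRegex.exec(source)
  if (match === null) return null
  try {
    return JSON.parse(match[1])
  } catch (error) {
    return null // captured text was not valid JSON
  }
}

const jsonLd = extractJsonLd(html)
console.log(jsonLd["@type"])
// NewsArticle
```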

Extract JSON-LD out of HTML using XPath

//script[@type="application/ld+json"]

You can use e.g. Screaming Frog for this

Find meta news_keywords from HTML with regex

(?i)name\s*=\s*['"]?news_keywords[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?news_keywords['"]?

https://regex101.com/r/0HCfaA/1
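
JavaScript does not accept the inline `(?i)` flag at the start of a pattern, so in JavaScript the case-insensitive matching goes into the "i" flag instead. A sketch, assuming the HTML is available as a string; the sample tags and the `extractNewsKeywords` helper are invented for this example. The same approach should carry over to the other meta patterns in this article.

```javascript
// Same pattern as above, without (?i) and with the "i" flag instead.
const newsKeywordsRegex = /name\s*=\s*['"]?news_keywords[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?news_keywords['"]?/i

// Return the content of the news_keywords meta tag, or null if it is missing.
function extractNewsKeywords(source) {
  const match = newsKeywordsRegex.exec(source)
  if (match === null) return null
  // Group 1 is filled when name comes before content, group 2 in the reverse order.
  return match[1] !== undefined ? match[1] : match[2]
}

console.log(extractNewsKeywords('<meta name="news_keywords" content="Formel 1, Reglement">'))
// Formel 1, Reglement
console.log(extractNewsKeywords('<meta content="Formel 1, Reglement" name="news_keywords">'))
// Formel 1, Reglement
```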

Find meta news_keywords from HTML with XPath

//meta[@name='news_keywords']/@content

Find hreflang with regex. Example with de-de

(?i)hreflang\s*=\s*['"]?de-de[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+hreflang\s*=\s*['"]?de-de['"]?

https://regex101.com/r/xfkmot/1

Find RSS Feed URL with regex

I’m searching for something like

<link rel="alternate" type="application/rss+xml" title="Zrce.eu &raquo; Feed" href="https://zrce.eu/feed/" />

Probably this will look different in your setup

(?i)type\s*=\s*['"]?application\/rss\+xml[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+type\s*=\s*['"]?application\/rss\+xml['"]?

https://regex101.com/r/hk6fGm/1

Find meta robots from HTML with regex

(?i)name\s*=\s*['"]?robots[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?robots['"]?

https://regex101.com/r/J94ZaS/1

Find meta date from HTML with regex

(?i)name\s*=\s*['"]?date[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?date['"]?

https://regex101.com/r/mkJcVl/1

Find meta date from HTML with XPath

//meta[@name='date']/@content

Extract Google Analytics property from HTML with regex

Is there the same Google Analytics tracking code on all URLs?

UA-?([^']*)

https://regex101.com/r/R7HAAB/1
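
To answer the question above for a set of pages, you could collect the matched property from every page and look at the distinct values. This is only a sketch: the `pages` object with example.com URLs and the `collectProperties` helper are hypothetical, and the pattern is written with a straight single quote. The match is case-sensitive and takes the first "UA" it finds in the source.

```javascript
// Hypothetical crawl result: URL mapped to HTML source.
const pages = {
  "https://example.com/": "<script>ga('create', 'UA-12345678-1', 'auto')</script>",
  "https://example.com/a": "<script>ga('create', 'UA-12345678-1', 'auto')</script>",
  "https://example.com/b": "<script>ga('create', 'UA-87654321-1', 'auto')</script>",
}

const gaRegex = /UA-?([^']*)/

// Return the distinct properties found across all pages (null for pages without one).
function collectProperties(pagesByUrl) {
  const found = new Set()
  for (const html of Object.values(pagesByUrl)) {
    const match = gaRegex.exec(html)
    found.add(match === null ? null : match[0])
  }
  return Array.from(found)
}

const properties = collectProperties(pages)
console.log(properties)
// [ 'UA-12345678-1', 'UA-87654321-1' ]
```

More than one entry in the result means that not all pages use the same tracking code.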

Extract Google Tag Manager property from HTML using regex

GTM-?([^']*)

Extract Google Search Console site verification from HTML using regex

(?i)name\s*=\s*['"]?google-site-verification[^>]+content\s*=\s*['"]?([^'"]*)['"]?

https://regex101.com/r/y2KKT3/30

Extract Google Search Console site verification from HTML using XPath

//meta[@name='google-site-verification']/@content

Extract AMP URL from HTML using regex

(?i)rel\s*=\s*['"]?amphtml[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+rel\s*=\s*['"]?amphtml['"]?

https://regex101.com/r/y2KKT3/24

Extract AMP URL from HTML using XPath

//link[@rel='amphtml']/@href

Find duplicate words in URLs

(\w+\-).*\1

https://regex101.com/r/Xpl8dP/1
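
JavaScript, as a small sketch with made-up example URLs and a hypothetical `repeatedWord` helper. Because `\w+` may also match just the end of a word, slugs that merely share a suffix before a hyphen (e.g. test-best-) are reported as well, so treat the result as a hint rather than a proof.

```javascript
// Same pattern as above: a word followed by a hyphen that occurs again later.
const duplicateWordRegex = /(\w+\-).*\1/

// Return the repeated word without its trailing hyphen, or null if there is none.
function repeatedWord(url) {
  const match = duplicateWordRegex.exec(url)
  return match === null ? null : match[1].slice(0, -1)
}

console.log(repeatedWord("https://www.blick.ch/life/sauce-hollandaise-sauce-rezept.html"))
// sauce
console.log(repeatedWord("https://www.blick.ch/sport/formel-1-news.html"))
// null
```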

Extract elements whose class name contains a given string

If you are looking for a specific part within a class name, e.g. article-metadata in

<div class="ArticleMetadata__Wrapper-sc-1xm0v61-0 dKWmkV article-metadata">

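you could use the XPath contains() function, for example like this (a sketch; adjust the element and class name to your markup):

//div[contains(@class, 'article-metadata')]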

Anything missing?

Which regular expressions and XPath expressions do you use frequently?

Is there a way to make the rules above more efficient or more elegant?

Please let me know if there are shorter or nicer versions of the regex rules in this article. Thanks!

Used tools
