Regex / regular expressions and XPath alternatives every SEO needs
In some situations, regular expressions and crawls with XPath make your SEO life much easier. This article lists examples for two tasks:
- Search within the URL with regex
- Search within the HTML with regex + XPath alternatives
You can use both regex and XPath to get insights out of your crawls. In many cases it's useful to set up your crawler before you start crawling and to extract these things directly while crawling. We use DeepCrawl + Screaming Frog as default crawlers at blick.ch. DeepCrawl, for example, lets you set up extraction rules with regular expressions to pull out custom data, and Screaming Frog supports both XPath and regular expressions.
In addition, have a look at Selenium if you want to do custom crawls or crawl more complicated websites.
Search within the URL with regex
Extract the host / subdomain.domain.tld out of the URL with regex
Use this e.g. to check which external domains you link to a lot.
https?:\/\/(.*?)\/
https://regex101.com/r/hGE6RY/1
You can use it e.g. in the Sublime Text search with the regular expression option enabled.
Javascript:
To get with protocol:
const input = "https://www.blick.ch/sport/"
const regex = RegExp("https?:\\/\\/(.*?)\\/", "g")
const output = regex.exec(input)[0]
console.log(output)
// https://www.blick.ch/
To get without protocol:
const input = "https://www.blick.ch/sport/"
const regex = RegExp("https?:\\/\\/(.*?)\\/", "g")
const output = regex.exec(input)[1]
console.log(output)
// www.blick.ch
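To get from a single URL to the "which domains do I link to a lot" question, the same pattern can be run over a whole list of link targets. The following is only a minimal sketch: the function name countHosts and the link list are made up for illustration, and the list would normally come from your crawler export.

```javascript
// Count how often each host appears in a list of link targets.
// countHosts and the sample links are hypothetical, for illustration only.
function countHosts(urls) {
  const hostPattern = /https?:\/\/(.*?)\//
  const counts = {}
  for (const url of urls) {
    const match = hostPattern.exec(url)
    if (match === null) continue // no protocol, or no slash after the host
    const host = match[1]
    counts[host] = (counts[host] || 0) + 1
  }
  return counts
}

const links = [
  "https://www.blick.ch/sport/",
  "https://www.blick.ch/life/",
  "https://zrce.eu/feed/",
]
console.log(countHosts(links))
// { 'www.blick.ch': 2, 'zrce.eu': 1 }
```

Keep in mind that the pattern needs a slash after the host, so a bare "https://zrce.eu" without trailing slash is skipped here.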
Extract last part of the URL path with regex
with or without trailing slash:
([^\/]+[^\/]|[^\/]+[\/])$
https://regex101.com/r/Hx68gr/2
Useful if you want to ignore the hierarchy in the URL and e.g. check for duplicate detail pages which were created in multiple folders.
Javascript:
const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
const regex = RegExp("([^\/]+[^\/]|[^\/]+[\/])$", "g")
const output = regex.exec(input)[0]
console.log(output)
// neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html
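The duplicate check mentioned above could look roughly like this: group all URLs by their last path part and keep the groups with more than one URL. This is a sketch under assumptions. The function name findDuplicateSlugs and the example.com URLs are invented, and a trailing slash is stripped so that "page/" and "page" count as the same page.

```javascript
// Group URLs by their last path part and return the parts used more than once.
// findDuplicateSlugs and the sample URLs are hypothetical.
function findDuplicateSlugs(urls) {
  const lastPart = /([^\/]+[^\/]|[^\/]+[\/])$/
  const bySlug = new Map()
  for (const url of urls) {
    const match = lastPart.exec(url)
    if (match === null) continue
    const slug = match[0].replace(/\/$/, "") // "page/" and "page" are treated alike
    if (!bySlug.has(slug)) bySlug.set(slug, [])
    bySlug.get(slug).push(url)
  }
  return [...bySlug.entries()].filter(([, list]) => list.length > 1)
}

const crawled = [
  "https://example.com/sport/artikel-id1.html",
  "https://example.com/news/artikel-id1.html",
  "https://example.com/news/other/",
]
const duplicates = findDuplicateSlugs(crawled)
console.log(duplicates.map(([slug]) => slug))
// [ 'artikel-id1.html' ]
```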
Find article IDs in URLs with regex
In this example, the URL ends with an article ID. The pattern looks like this:
-id(THE-ID).html
-id([0-9]+)\.html
https://regex101.com/r/yvolLF/1
Google Sheets
REGEXEXTRACT(A2,"(?:.*id)([0-9]+)(?:\.html)")
Javascript:
const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
const regex = RegExp("-id([0-9]+)\\.html", "g")
const output = regex.exec(input)[1]
console.log(output)
// 15375623
Find article IDs in URL list with regex
You're not just interested in the ID itself, but in all the URLs that contain an ID, e.g. to filter for them in Google Analytics.
An example URL is https://www.blick.ch/life/essen/rezept/saucen/rezept-mit-weniger-kalorien-so-macht-man-leichtere-sauce-hollandaise-id6460280.html
.*-id[0-9].*
https://regex101.com/r/CEjmWK/1
Javascript:
const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
const regex = RegExp(".*-id[0-9].*")
const output = regex.test(input)
console.log(output)
// true
Find the fragment / hashtag in a URL with regex
#(.+)
Javascript:
const input = "https://zrce.eu#test"
const regex = RegExp("#(.+)", "g")
const output = regex.exec(input)[1]
console.log(output)
// test
Search for http and https URLs with regex
In this case the escaped slash might be interesting…
http:\/
or
https:\/
https://regex101.com/r/oDOQKg/1
Javascript:
This will return an array if the URL is https and null if http
const input = "https://zrce.eu#test"
const regex = RegExp("https:\/", "g")
const output = regex.exec(input)
console.log(output)
// [ 'https:/',
// index: 0,
// input: 'https://zrce.eu#test',
// groups: undefined ]
Search for the first or second folder in the URL with regex
Search for the first folder within a URL with regex:
https?:\/\/.*?\/(.*?)\/
https://regex101.com/r/0RxSTe/1
Search for the second folder within a URL with regex:
https?:\/\/.*?\/.*?\/(.*?)\/
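JavaScript:
A small sketch for both folder patterns, written with https? so that plain http URLs match as well. The helper name getFolder is my own and not part of any tool.

```javascript
// Patterns for the first and second folder of a URL path.
const firstFolder = /https?:\/\/.*?\/(.*?)\//
const secondFolder = /https?:\/\/.*?\/.*?\/(.*?)\//

// getFolder is a hypothetical helper: returns the captured folder or null.
function getFolder(url, pattern) {
  const match = pattern.exec(url)
  return match === null ? null : match[1]
}

const input = "https://www.blick.ch/sport/formel1/neues-reglement-muss-wieder-warten-schwache-fuehrung-legt-formel-1-lahm-id15375623.html"
console.log(getFolder(input, firstFolder))
// sport
console.log(getFolder(input, secondFolder))
// formel1
console.log(getFolder("https://www.blick.ch/sport/", secondFolder))
// null
```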
Search within the HTML with regex + XPath alternatives
Extract JSON-LD out of HTML using regex
type="application\/ld\+json">?([^<]*)
https://regex101.com/r/guBw4z/1
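JavaScript:
If you already have the HTML as a string, the match can be handed straight to JSON.parse. This is a rough sketch: the function name extractJsonLd and the sample HTML are invented, and because the pattern stops at the first "<", a JSON-LD block that contains a "<" character is cut off and skipped here.

```javascript
// Extract and parse all JSON-LD blocks from an HTML string.
// extractJsonLd is a hypothetical helper; the HTML below is sample data.
function extractJsonLd(html) {
  const pattern = /type="application\/ld\+json">?([^<]*)/g
  const blocks = []
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]))
    } catch (error) {
      // not valid JSON (e.g. truncated at a "<"), so skip this block
    }
  }
  return blocks
}

const html = `<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "headline": "Test"}</script>
</head></html>`
console.log(extractJsonLd(html)[0].headline)
// Test
```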
Extract JSON-LD out of HTML using XPath
//script[@type="application/ld+json"]
You can use e.g. Screaming Frog for this
Find meta news_keywords from HTML with regex
(?i)name\s*=\s*['"]?news_keywords[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?news_keywords['"]?
https://regex101.com/r/0HCfaA/1
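JavaScript:
The pattern has two alternatives because the name and content attributes can come in either order, so the value ends up in group 1 or in group 2. JavaScript takes the case-insensitive option as the i flag instead of the (?i) prefix, so the prefix is dropped in this sketch. The function name getNewsKeywords and the sample tags are made up.

```javascript
// Same pattern as above, with the "i" flag instead of the (?i) prefix.
const newsKeywords = /name\s*=\s*['"]?news_keywords[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?news_keywords['"]?/i

// getNewsKeywords is a hypothetical helper: returns the content value or null.
function getNewsKeywords(html) {
  const match = newsKeywords.exec(html)
  if (match === null) return null
  // group 1: name before content, group 2: content before name
  return match[1] !== undefined ? match[1] : match[2]
}

console.log(getNewsKeywords('<meta name="news_keywords" content="Formel 1, Sport">'))
// Formel 1, Sport
console.log(getNewsKeywords('<meta content="Formel 1, Sport" name="news_keywords">'))
// Formel 1, Sport
```

The other meta patterns in this article follow the same structure, so the same group 1 or group 2 check applies to them.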
Find meta news_keywords from HTML with XPath
//meta[@name='news_keywords']/@content
Find hreflang with regex. Example with de-de
(?i)hreflang\s*=\s*['"]?de-de[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+hreflang\s*=\s*['"]?de-de['"]?
https://regex101.com/r/xfkmot/1
Find RSS Feed URL with regex
I’m searching for something like
<link rel="alternate" type="application/rss+xml" title="Zrce.eu » Feed" href="https://zrce.eu/feed/" />
Probably this will look different in your setup
(?i)type\s*=\s*['"]?application\/rss\+xml[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+type\s*=\s*['"]?application\/rss\+xml['"]?
Find meta robots from HTML with regex
(?i)name\s*=\s*['"]?robots[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?robots['"]?
Find meta date from HTML with regex
(?i)name\s*=\s*['"]?date[^>]+content\s*=\s*['"]?([^'"]*)['"]?|content\s*=\s*['"]?([^"']*)['"]?[^>]+name\s*=\s*['"]?date['"]?
https://regex101.com/r/mkJcVl/1
Find meta date from HTML with XPath
//meta[@name='date']/@content
Extract Google Analytics property from HTML with regex
Is there the same Google Analytics tracking code on all URLs?
UA-?([^']*)
Extract Google Tag Manager property from HTML using regex
GTM-?([^']*)
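JavaScript:
To check whether the same Analytics or Tag Manager ID is used on all URLs, you can collect the distinct IDs over all crawled pages. This is only a sketch with assumptions: I use stricter patterns than the ones above, assuming IDs shaped like UA-12345-1 and GTM-ABC123, and the collectIds helper and the pages object are invented sample data.

```javascript
// Assumed ID shapes: "UA-" digits "-" digits, and "GTM-" capitals/digits.
const gaPattern = /UA-\d+-\d+/g
const gtmPattern = /GTM-[A-Z0-9]+/g

// collectIds is a hypothetical helper: distinct IDs found across all pages.
function collectIds(pages, pattern) {
  const ids = new Set()
  for (const html of Object.values(pages)) {
    for (const match of html.matchAll(pattern)) ids.add(match[0])
  }
  return [...ids]
}

// Hypothetical crawl result: URL -> HTML source.
const pages = {
  "https://example.com/": "<script>ga('create', 'UA-12345-1', 'auto')</script>",
  "https://example.com/about": "<script>ga('create', 'UA-12345-2', 'auto')</script>",
}
console.log(collectIds(pages, gaPattern))
// [ 'UA-12345-1', 'UA-12345-2' ]
```

More than one entry in the result means that not all pages use the same ID.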
Extract Google Search Console site verification from HTML using regex
(?i)name\s*=\s*['"]?google-site-verification[^>]+content\s*=\s*['"]?([^'"]*)['"]?
https://regex101.com/r/y2KKT3/30
Extract Google Search Console site verification from HTML using XPath
//meta[@name='google-site-verification']/@content
Extract AMP URL from HTML using regex
(?i)rel\s*=\s*['"]?amphtml[^>]+href\s*=\s*['"]?([^'"]*)['"]?|href\s*=\s*['"]?([^"']*)['"]?[^>]+rel\s*=\s*['"]?amphtml['"]?
https://regex101.com/r/y2KKT3/24
Extract AMP URL from HTML using XPath
//link[@rel='amphtml']/@href
Find duplicate words in URLs
(\w+\-).*\1
https://regex101.com/r/Xpl8dP/1
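JavaScript:
A short sketch of the backreference in action. The helper findRepeatedWord and the example.com URLs are made up. One thing I noticed while testing: the pattern also fires on shared word endings, e.g. "rezept-mit-ideen" matches because "rezept-" and "mit-" both end in "t-". The second pattern with \b is one possible way to reduce that, not a definitive fix.

```javascript
// Original pattern, plus a stricter variant with word boundaries (my assumption).
const duplicateWord = /(\w+\-).*\1/
const strictDuplicateWord = /\b(\w+\-).*\b\1/

// findRepeatedWord is a hypothetical helper: returns the repeated part or null.
function findRepeatedWord(url, pattern = duplicateWord) {
  const match = pattern.exec(url)
  return match === null ? null : match[1]
}

console.log(findRepeatedWord("https://example.com/rezept/rezept-mit-rezept-ideen.html"))
// rezept-
console.log(findRepeatedWord("https://example.com/sport/formel1/"))
// null
console.log(findRepeatedWord("https://example.com/rezept-mit-ideen.html"))
// t-
console.log(findRepeatedWord("https://example.com/rezept-mit-ideen.html", strictDuplicateWord))
// null
```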
Extract if e.g. the class name contains a string
If you are looking for a specific part within a class name, e.g. article-metadata in
<div class="ArticleMetadata__Wrapper-sc-1xm0v61-0 dKWmkV article-metadata">
you could use contains like this:
//div[contains(@class, 'article-metadata')]
Anything missing?
Which regular expressions and XPaths do you use frequently?
Please let me know if there are shorter, nicer or more efficient versions of the regex rules in this article. Thanks!