How to cleanup by accident duplicated articles using Screaming Frog and Python pandas

If you website is growing it happens that editors or some bug in the system copies articles. You end up e.g. with two identical articles with just the URL having another id:

lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411458.html
lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411459.html

That’s not just bad for SEO but it in addition makes tracking and article updates complicated. You should really clean up that kind of duplicate errors.

How to clean up here

My thesis here is that URLs where the only difference is the article id in the URL are good candidates to check for being duplicates. I label them candidates because there may be cases in which editors copied an article, edited the whole text of the article but forgot to change the URL.

Here is some python code to do that: https://colab.research.google.com/drive/1ImtxYFiVZgC01Ael5OH4BwL5ER3N-Qkx

You can go with a crawl, Google Analyitcs export or other source

If you have a list of all URLs you want to cut off the id here

dfSitemap['ohneID'] = dfSitemap['loc'].str.extract('^(.*-id)', expand=False)

Next search for duplicated URLs within all URLs without ID and keep all duplicates.

dfSitemap = dfSitemap[dfSitemap.duplicated(subset=['woID'],keep=False)]

Cleanup the list to have an import ready txt for Screaming Frog

dfSitemapJustDuplicatesExport = dfSitemap['loc']
dfSitemapJustDuplicatesExport = dfSitemapJustDuplicatesExport.drop_duplicates()
dfSitemapJustDuplicatesExport.to_csv('dfSitemapJustDuplicatesExport.txt', index=False, header=False)

Before you start the crawl you need to setup 2 things in Screaming Frog

Follow the guide to setup the content area

In case of duplicates you want to keep the better version. So setup e.g. Google Search Console to get you Impression data. The plan is to keep the copy URL with the highest GSC impressions.

I went for one year of GSC data here.

You can now work with another python script or just import below.

dfContent = pd.DataFrame(pd.read_excel('./near_duplicates_report.xlsx', header=0))

Near Duplicate Report has every row “Address => Near Duplicate Address” again as row “Near Duplicate Address => Address”. We don’t need both. I hashed both, then summed up the hashes and dropped duplicates with the same summed hash (If you have a nicer solution for this let me know. Thanks)

dfContent['hashAddress'] = dfContent['Address'].apply(hash)
dfContent['hashNear Duplicate Address'] = dfContent['Near Duplicate Address'].apply(hash)
dfContent['hashSum'] = dfContent['hashAddress'] + dfContent['hashNear Duplicate Address']
dfContent = dfContent.drop_duplicates(subset=['hashSum'], keep='first')

Next we need to add every Address with a reference to itself again to not miss something.

dfContent2 = pd.DataFrame({'Address' : []})
dfContent2["Address"] = dfContent["Address"]
dfContent2["Near Duplicate Address"] = dfContent["Address"]
dfContent2 = dfContent2.drop_duplicates(subset=["Address"], keep="first")
dfContent = dfContent.append(dfContent2)
dfContent = dfContent.sort_values(["Address"], ascending = (False))

This basically transformes the data like this

Load GSC data into pandas datafram

dfSearchConsole = pd.DataFrame(pd.read_excel('./search_console_all.xlsx', header=0))

Merge with the duplicate report. Merge “Near Duplicate Address” with the “Address”

dfContent3 = pd.merge(dfContent[['Address','Near Duplicate Address']], dfSearchConsole[['Address','Clicks','Impressions']], left_on='Near Duplicate Address', right_on='Address', how='left')

and sort by Address + highest Impression

dfContent3 = dfContent3.sort_values(["Address", "Impressions_Near Duplicate Address"], ascending = (False, False))

Now we have something like this. Impression are Impressions of “Near Duplicate Address”

as mentioned the first URL is highest impression. We want to keep this one

dfKeepURLs = dfKeepURLs.drop_duplicates(subset=['Near Duplicate Address'], keep='first')

I select all the rows which are not in dfKeepURLs

dfDeleteURLs = dfDeleteURLs[~dfDeleteURLs['hashNear Duplicate Address_y'].notnull()]

The full jupyter notebook

it’s not possible to run exactly this in Google Colab… the code works with local files. Feel free to edit.

Head of #SEO at Blick.ch