How to clean up accidentally duplicated articles using Screaming Frog and Python pandas
If your website is growing, it can happen that editors or a bug in the system copy articles. You end up, e.g., with two identical articles where only the id in the URL differs:
lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411458.html
lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411459.html
That's not just bad for SEO; it also makes tracking and article updates complicated. You should really clean up this kind of duplication.
How to clean this up
My thesis here is that URLs that differ only in the article id are good candidates for duplicates. I call them candidates because there may be cases in which editors copied an article, rewrote the whole text, but forgot to change the URL.
1) Load all URLs from the sitemap
Here is some python code to do that: https://colab.research.google.com/drive/1ImtxYFiVZgC01Ael5OH4BwL5ER3N-Qkx
You can also go with a crawl, a Google Analytics export or another source.
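If you don't want to open the notebook, here is a minimal sketch for a plain <urlset> sitemap; the sitemap URL is a placeholder, and a sitemap index would need an extra loop:
import pandas as pd
import requests
from xml.etree import ElementTree as ET

# Placeholder URL; replace it with your own sitemap.
resp = requests.get('https://www.example.com/sitemap.xml', timeout=30)
root = ET.fromstring(resp.content)

# Grab every <loc> entry regardless of the sitemap namespace prefix.
locs = [el.text.strip() for el in root.iter() if el.tag.endswith('}loc')]
dfSitemap = pd.DataFrame({'loc': locs})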
2) Prepare candidates for a crawl
Once you have a list of all URLs, cut off the numeric id:
dfSitemap['woID'] = dfSitemap['loc'].str.extract('^(.*-id)', expand=False)
Next, search for duplicates among the URLs without id and keep all of them:
dfSitemap = dfSitemap[dfSitemap.duplicated(subset=['woID'],keep=False)]
Clean up the list so you have an import-ready .txt for Screaming Frog:
dfSitemapJustDuplicatesExport = dfSitemap['loc']
dfSitemapJustDuplicatesExport = dfSitemapJustDuplicatesExport.drop_duplicates()
dfSitemapJustDuplicatesExport.to_csv('dfSitemapJustDuplicatesExport.txt', index=False, header=False)
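As a quick sanity check, the two example URLs from the intro collapse to the same woID and get flagged (toy frame with a separate variable name so it does not clash with the real one):
import pandas as pd

dfTest = pd.DataFrame({'loc': [
    'lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411458.html',
    'lalala/atp-turnier-dubai-federer-verpasst-revanche-gegen-djokovic-id1411459.html',
]})
dfTest['woID'] = dfTest['loc'].str.extract('^(.*-id)', expand=False)
# Both rows share the same woID, so both are marked True.
print(dfTest.duplicated(subset=['woID'], keep=False))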
Before you start the crawl you need to set up two things in Screaming Frog.
3.1) Set up Screaming Frog to check for duplicates
Follow the Screaming Frog guide to set up the content area.
3.2) Set up Screaming Frog to get some performance data
In case of duplicates you want to keep the better version, so set up e.g. the Google Search Console integration to pull impression data. The plan is to keep the duplicate URL with the highest GSC impressions.
I went for one year of GSC data here.
4) Export near_duplicates_report and search_console_all from Screaming Frog
You can now continue with another Python script or just do the import below.
5) Import near_duplicates_report into Pandas
import pandas as pd

dfContent = pd.read_excel('./near_duplicates_report.xlsx', header=0)
The Near Duplicate Report contains every pair twice: each row "Address => Near Duplicate Address" appears again as "Near Duplicate Address => Address". We only need one of them. I hashed both addresses, summed up the hashes and dropped rows with the same sum. (If you have a nicer solution, let me know; one alternative is sketched after the snippet below. Thanks)
dfContent['hashAddress'] = dfContent['Address'].apply(hash)
dfContent['hashNear Duplicate Address'] = dfContent['Near Duplicate Address'].apply(hash)
dfContent['hashSum'] = dfContent['hashAddress'] + dfContent['hashNear Duplicate Address']
dfContent = dfContent.drop_duplicates(subset=['hashSum'], keep='first')
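A possibly nicer variant (a sketch, not part of my notebook): build an order-independent key from the sorted URL pair and drop duplicates on that.
# Same result without summing hashes: the key is identical for (A, B) and (B, A).
dfContent['pairKey'] = [
    tuple(sorted(pair))
    for pair in zip(dfContent['Address'], dfContent['Near Duplicate Address'])
]
dfContent = dfContent.drop_duplicates(subset=['pairKey'], keep='first')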
Next we need to add every Address with a reference to itself, so the original URL also shows up as one of its own candidates and nothing gets lost.
dfContent2 = pd.DataFrame({'Address' : []})
dfContent2["Address"] = dfContent["Address"]
dfContent2["Near Duplicate Address"] = dfContent["Address"]
dfContent2 = dfContent2.drop_duplicates(subset=["Address"], keep="first")
dfContent = pd.concat([dfContent, dfContent2])  # .append() is removed in newer pandas versions
dfContent = dfContent.sort_values(["Address"], ascending=False)
This basically transforms the data as in the toy example below.
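A toy example (made-up URLs, not from the real sitemap) of the shape at this point: each Address now lists all its candidate URLs, including itself.
import pandas as pd

# Toy pair, only to illustrate the structure after step 5.
toy = pd.DataFrame({
    'Address': ['https://example.com/foo-id2'],
    'Near Duplicate Address': ['https://example.com/foo-id1'],
})
# Add the self-reference, exactly like dfContent2 above.
toySelf = toy[['Address']].copy()
toySelf['Near Duplicate Address'] = toySelf['Address']
print(pd.concat([toy, toySelf]).sort_values('Address', ascending=False))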
6) Add the GSC data
Load the GSC data into a pandas DataFrame:
dfSearchConsole = pd.read_excel('./search_console_all.xlsx', header=0)
Merge it with the duplicate report, joining the report's "Near Duplicate Address" onto the GSC export's "Address":
# suffixes keeps the left 'Address' column name; the GSC address becomes 'Address_GSC'
dfContent3 = pd.merge(dfContent[['Address','Near Duplicate Address']], dfSearchConsole[['Address','Clicks','Impressions']], left_on='Near Duplicate Address', right_on='Address', how='left', suffixes=('', '_GSC'))
and sort by Address + highest Impression
dfContent3 = dfContent3.sort_values(["Address", "Impressions"], ascending=[False, False])
Now each row holds the group Address, one candidate "Near Duplicate Address" and the Clicks and Impressions of that candidate.
7) Find the URLs to keep
As mentioned, after the sort the first row of each Address group is the URL with the highest impressions. That is the one we want to keep:
dfKeepURLs = dfContent3.drop_duplicates(subset=['Address'], keep='first')
dfKeepURLs = dfKeepURLs.drop_duplicates(subset=['Near Duplicate Address'], keep='first')
8) Find the URLs to delete or redirect
I select all rows whose "Near Duplicate Address" is not in dfKeepURLs:
dfDeleteURLs = dfContent3[~dfContent3['Near Duplicate Address'].isin(dfKeepURLs['Near Duplicate Address'])]
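To hand the lists over (e.g. to set up redirects), both frames can simply be written to CSV; the file names here are placeholders:
dfKeepURLs[['Address', 'Near Duplicate Address']].to_csv('keep_urls.csv', index=False)
dfDeleteURLs[['Address', 'Near Duplicate Address', 'Impressions']].to_csv('delete_or_redirect_urls.csv', index=False)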
The full Jupyter notebook
It's not possible to run this one directly in Google Colab, because the code works with local files. Feel free to adapt it.