There are many reasons you might want to find every URL on a site, but your exact goal will determine what you’re searching for. For instance, you might want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.
In this post, I’ll walk you through some tools to build your URL list, then show how to deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.
Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org discovered it, there’s a good chance Google did, too.
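If you’d rather not rely on a scraping plugin, the Wayback Machine also exposes a CDX API that returns archived URLs as plain text, without the 10,000-URL cap of the web interface. Here’s a minimal Python sketch; the domain is a placeholder, and you may want to add filter parameters to drop redirects and resource files:

```python
import requests

# Query the Wayback Machine's CDX API for every archived URL on the domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",   # include subdomains; use "prefix" for path-only
        "fl": "original",        # return just the original URL field
        "collapse": "urlkey",    # one row per unique URL
        "output": "text",
    },
    timeout=120,
)
resp.raise_for_status()

urls = resp.text.splitlines()
print(f"{len(urls)} unique archived URLs found")

with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))
```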
Moz Pro
While you’d normally use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.
Performance → Search results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
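For larger properties, a rough sketch of paging through the Search Analytics endpoint with Google’s Python client might look like this. It assumes `creds` already holds OAuth credentials for the property (see Google’s Search Console API quickstart for the auth flow), and the site URL and dates are placeholders:

```python
from googleapiclient.discovery import build

# `creds` is assumed: OAuth credentials with access to the property.
service = build("searchconsole", "v1", credentials=creds)

site = "https://example.com/"   # or "sc-domain:example.com" for a domain property
pages, start_row = set(), 0

while True:
    response = service.searchanalytics().query(
        siteUrl=site,
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,    # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(pages)} pages with search impressions")
```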
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click “Create a new segment.”
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
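If clicking through segments is too slow, the same filtered list can be pulled with the GA4 Data API. A sketch using Google’s Python client follows; the property ID and date range are placeholders, and it assumes Application Default Credentials are already configured:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",   # your GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    # Equivalent of the /blog/ segment: only page paths containing /blog/
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog paths found")
```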
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be huge, and many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process, or a short script like the one sketched below can be enough.
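As a simple example, here’s a sketch that pulls unique request paths out of gzipped access logs. It assumes Apache/Nginx combined-format logs sitting in a hypothetical logs/ directory; adjust the regex and paths for your setup:

```python
import gzip
import re
from pathlib import Path

# Matches the request line in common/combined log format,
# e.g. '... "GET /blog/post-1 HTTP/1.1" 200 ...'
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
for log_file in Path("logs").glob("*.gz"):
    with gzip.open(log_file, "rt", errors="replace") as f:
        for line in f:
            match = REQUEST_RE.search(line)
            if match:
                # Strip query strings if you only care about unique paths
                paths.add(match.group(1).split("?")[0])

with open("log_urls.txt", "w") as out:
    out.write("\n".join(sorted(paths)))
print(f"{len(paths)} unique paths seen in logs")
```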
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
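For the Jupyter route, a pandas sketch might look like this. The file names are hypothetical (one URL per line from each source above), and the normalization steps, forcing HTTPS and trimming trailing slashes, are judgment calls you should adapt to your site:

```python
import pandas as pd

# Hypothetical exports: one URL per line from each source above.
sources = {
    "archive_org": "archive_org_urls.txt",
    "moz": "moz_urls.txt",
    "gsc": "gsc_urls.txt",
    "ga4": "ga4_urls.txt",
    "logs": "log_urls.txt",
}

frames = []
for name, path in sources.items():
    df = pd.read_csv(path, header=None, names=["url"])
    df["source"] = name   # keep track of where each URL came from
    frames.append(df)

urls = pd.concat(frames, ignore_index=True)

# Normalize before deduplicating: trim whitespace, unify the protocol,
# and strip trailing slashes (adjust if these distinctions matter to you).
urls["url"] = (
    urls["url"]
    .str.strip()
    .str.replace(r"^https?://", "https://", regex=True)
    .str.rstrip("/")
)

deduped = urls.drop_duplicates(subset="url")
deduped.to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs from {len(urls)} rows")
```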
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!