No, we’re not talking about the “Watermelon Crawl” by country music singer Tracy Byrd.
Webmasters, SEOs, and site authors need to understand the crawling process so they can optimize their site’s content and avoid pitfalls that prevent indexing. This is a summary of the spider’s crawling process, using the lyrics of Hoobastank’s “Crawling in the Dark” as a comparison tool.
“Crawling In The Dark”
(Spider version)
“I will dedicate
And sacrifice my everything for just a second’s worth
Of how my story’s ending
(comparison - Are there new internal or external links? Are there new URLs [posts/pages]? Additional content?)
And I wish I could know if the directions that I take
And all the choices that I make won’t end up all for nothing
(comparison - Will the page be there? Or will I find an empty house [404]?)
Show me what it’s for
Make me understand it
I’ve been crawling in the dark looking for the answer
Is there something more than what I’ve been handed?
(comparison - [relevancy] This isn’t addressed during crawling. Relevancy comes into play later, when someone runs a search query, once the pages found in a crawl have been indexed.)
I’ve been crawling in the dark looking for the answer
Help me carry on
Assure me it’s ok to use my heart and not my eyes
To navigate the darkness
(OK, actually the spiders use their eyes and not their heart.)
Will the ending be ever coming suddenly?
Will I ever get to see the ending to my story?
Show me what it’s for
Make me understand it
I’ve been crawling in the dark looking for the answer
Is there something more than what I’ve been handed?
I’ve been crawling in the dark looking for the answer
So when and how will I know?
How much further do I have to go?
How much longer until I finally know?
Because I’m looking and I just can’t see what’s in front of me
In front of me”
(comparison - Spiders like to read content only once. Since dynamic URLs usually indicate ever-changing [dynamic] content, the spiders can have problems with those URLs. The spiders prefer that the site visitor sees what they saw, and if the content is ever-changing, they can’t vouch for it being the same as what they indexed on the last crawl.
Imagine a spider crawls a page, indexes the content as ‘little Sally Sue in pigtails’, places little Sally in the database, then sends an unsuspecting visitor who was searching for little Sally to a page that advertises porn at a discount. Eeeek!)
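To make the dynamic-URL concern a little more concrete, here is a rough sketch of how a script might flag a URL as "dynamic". It rests on one assumption: that a URL carrying query parameters counts as dynamic. The function name `looks_dynamic` and the example.com URLs are made up for illustration, and this is not how any particular search engine actually classifies pages.

```python
from urllib.parse import urlsplit, parse_qs


def looks_dynamic(url: str) -> bool:
    """Heuristic only: treat any URL with query parameters as dynamic."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?id" (no value) still counts as a parameter
    return bool(parse_qs(query, keep_blank_values=True))


if __name__ == "__main__":
    for url in ("http://example.com/about/",
                "http://example.com/page.php?id=7&session=abc"):
        label = "dynamic" if looks_dynamic(url) else "static"
        print(url, "->", label)
```

A check like this says nothing about whether the content really changes between visits. It only spots the kind of URL that tends to go with changing content.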
To Sum It Up
Once your pages have been crawled, the spiders will repeat this process at a later date and time, logging the up-to-date version of each page as they crawl.
On each visit, the spider reads the Robots Exclusion Protocol file (robots.txt) from your site.
This communication uses
The user agent travels via HTTP to your server to see what you’ve got available (new URLs, content, etc.), and it repeats this process on scheduled crawls.
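Here is a small sketch of the robots.txt check a well-behaved user agent makes before requesting a page, using Python's standard `urllib.robotparser`. The robots.txt rules, the `ExampleBot` name, and the example.com URLs are invented for illustration. The rules are parsed from a string so the example runs without touching the network, where a real crawler would download the file from your server over HTTP.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        print(url, "->", "crawl" if allowed("ExampleBot", url) else "skip")
```

If a page you want indexed shows up as "skip" in a check like this, your robots.txt is the first place to look.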
Once this takes place, what the spider found is loaded into a database for safekeeping and voilà! The job is done until the next crawl.
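The crawl, store, and re-crawl cycle can be sketched roughly like this. Everything here is a simplifying assumption: the "database" is a plain dictionary, a page is summarized by a SHA-256 fingerprint of its text, and a 404 removes the page right away. Real search engines keep far more information and are not this quick to drop a page. The name `record_crawl` is hypothetical.

```python
import hashlib


def record_crawl(index: dict, url: str, status: int, body: str) -> str:
    """Store a fingerprint of the page and report what changed since last crawl."""
    if status == 404:
        # The "empty house": nothing to index, so forget the page (assumption).
        index.pop(url, None)
        return "missing"
    fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
    previous = index.get(url)
    index[url] = fingerprint  # log the up-to-date version
    if previous is None:
        return "new"
    return "unchanged" if previous == fingerprint else "changed"


if __name__ == "__main__":
    index = {}
    url = "http://example.com/sally/"
    print(record_crawl(index, url, 200, "little Sally Sue in pigtails"))
    print(record_crawl(index, url, 200, "little Sally Sue in pigtails"))
    print(record_crawl(index, url, 200, "something else entirely"))
    print(record_crawl(index, url, 404, ""))
```

The third call is the "little Sally" problem in miniature: the page at the same URL no longer matches what was stored on the earlier crawl.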
Wanna know what the spiders are seeing on your site?