Conversing With the Search Engine Bot

If you think conversing with search engine spiders sounds as creepy as talking to actual spiders, read on.

You might have been wondering why, in spite of doing most of your SEO really well, a piece of the puzzle still seems to be missing.

Even if you have done your on-page and off-page SEO fairly well, a page that isn’t friendly to search engine bots means you’ve skipped a huge chunk of your SEO strategy.

Relevant content, good structure and appealing design are not enough on their own; your website also needs to be easy for web crawlers to understand.

By understanding how search engine bots crawl and index your site, you can change your SEO game without much effort. By default, the bots will try to crawl every single page on your website. This is usually unnecessary and eats into the crawl budget Google allots to your site.

In this blog post you will get an overview of how a search engine perceives your website, along with insights on how conversing correctly with the search engine bot can ensure efficient crawling of your website and improve its indexing.

ROBOTS.TXT

The main job of the robots.txt file is to tell search engines which pages on your website they may crawl and which they should not. It’s best to make sure that search engines crawl only the content-rich pages on your site that add value to the user. This can help you rank higher for your targeted keywords.

Editing this file in Notepad (or any other simple text editor) can make a world of difference to how the search engine perceives your website and its content. The robots.txt file, which implements the robots exclusion protocol, could, if used wisely, help your pages rank higher in search listings.

Whenever web robots crawl your site, the first thing they look at is the robots.txt file. If the file is left unedited, it stays in its default state, which may lead to inefficient crawling. While editing your robots.txt file, keep the following in mind:

User agent

The User-agent line specifies which bot or web crawler the rules refer to.

In a typical scenario we use ‘User-agent: *’, which means the rules apply to all bots that crawl the website.

If the line says “User-agent: Googlebot”, the rules below it apply only to the Googlebot web crawler. In the same way, we can target the Bing bot, the Yahoo bot or any other search engine bot.

Disallow

As the name suggests, this is added when we do not want the web crawler (user agent) to crawl a certain page or set of pages. It is often used for pages that you don’t want users to reach unless they take a specific action.

For example, if you have a thank-you page that users see after signing up or providing their email address, you may not want people to find that page through a Google or Bing search. The examples below should make this clearer:

To block all web crawlers from viewing all content on the website:

User-agent: *
Disallow: /

To give all robots complete access, leave the Disallow value empty:

User-agent: *
Disallow:

To restrict a specific bot (here, Googlebot) from viewing a specific folder (in this case, a blog folder):

User-agent: Googlebot
Disallow: /blog/

To block a specific web page, enter its path:

User-agent: Googlebot
Disallow: /blog/blocked-page.html
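
If you would like to see how rules like these are read by a crawler, here is a minimal sketch using Python’s standard urllib.robotparser module. The rule set, the example.com URLs and the is_allowed helper are assumptions made for illustration; real search engine bots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: Googlebot is kept out of /blog/,
# while every other bot gets complete access (empty Disallow).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /blog/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may crawl the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot is blocked from the blog folder but not from other pages.
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post.html"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/about.html"))
# Any other bot falls back to the "*" group, which allows everything.
print(is_allowed(ROBOTS_TXT, "Bingbot", "https://example.com/blog/post.html"))
```

Swapping in your own rules and URLs is a quick way to sanity-check an edit before uploading the file.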

Sitemap

A sitemap is simply a list of the URLs on your website that you want search engine bots to crawl.

This matters because web pages are ranked individually; a website as a whole doesn’t get ranked.

Creating a sitemap helps the bots understand the key pages on your website. Writing one by hand can require some programming knowledge, but you can always use one of the free sitemap generators available online. The result can later be edited as needed in a spreadsheet such as Excel and should be saved as an .xml file.

Sitemaps are crucial if your website is not well structured or not well interlinked. E-commerce sites, where pages are created dynamically, should treat a sitemap as a must.
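
If you prefer not to rely on an online generator, a basic sitemap can be produced with a few lines of Python. This is only a sketch: the build_sitemap function and the example.com URLs are hypothetical, and it writes just the required loc entry for each page, following the public sitemaps.org format.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_element = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL text for us.
        ET.SubElement(url_element, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages; replace these with the key pages of your own site.
pages = ["https://example.com/", "https://example.com/blog/"]
print(build_sitemap(pages))
```

The output can be saved as sitemap.xml and uploaded alongside your robots.txt file.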

Robots.txt tester

The good news is that after all this work you don’t have to rack your brain wondering whether it turned out right. Google’s robots.txt tester exists specifically for this purpose. Enter the desired URL and select the user agent from the dropdown.

Voila! You will be able to see whether the URL is accepted or blocked for that web crawler.

META TAGS

NO follow & DO follow

By default, every link in the HTML source code is treated as a dofollow link unless it is marked with nofollow (typically by adding rel="nofollow" to the link).

A nofollow link is the exact opposite: it tells Google (or any other search engine) not to pass link value (or link juice) to the page the link points to.

Link juice, which matters a great deal from an SEO perspective, will therefore not be passed on through links marked nofollow.

We often get unwanted comments on our blogs from someone trying to build a backlink to their site with something like ‘Hey, nice blog. Check out a similar one..’. Ideally, we would mark such links as nofollow.
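
To check which links on a page are nofollow, you could run a small audit like the sketch below, built on Python’s standard html.parser module. The LinkRelAudit class, the audit_links helper and the sample HTML (including the spam.example address) are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkRelAudit(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


def audit_links(html):
    parser = LinkRelAudit()
    parser.feed(html)
    return parser.links


# Hypothetical page: one normal link and one comment-spam link marked nofollow.
SAMPLE_HTML = """
<p>Read our <a href="/guide">guide</a>.</p>
<p>Hey, nice blog. <a href="https://spam.example/" rel="nofollow">Check out a similar one</a></p>
"""

for href, is_nofollow in audit_links(SAMPLE_HTML):
    print(href, "nofollow" if is_nofollow else "dofollow")
```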

No index

If a web page is not in the search engine’s index, users will not be able to find it through search. The noindex meta tag is therefore important: it prevents the given page from being indexed and shown on the SERP.

It is a must when the site is not ready for crawling or is still in its testing phase. The “noindex” meta tag tells search engines that they are not allowed to display the URL on the search engine results page. It is also often applied to the older site once the new one is ready after testing. By using this meta tag, you instruct the search engine not to rank the old site, which avoids duplicate content.
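
As a rough illustration, the snippet below checks whether a page carries a robots meta tag with the noindex directive. It assumes the common form of the tag, a meta element named "robots" with a comma-separated content value; the RobotsMetaParser class, the is_noindex helper and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gather the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() != "robots":
            return
        content = attr_map.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical test-phase page versus a normal page.
STAGING_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
LIVE_PAGE = "<html><head><title>Home</title></head></html>"

print(is_noindex(STAGING_PAGE))
print(is_noindex(LIVE_PAGE))
```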

Canonicalisation

Canonical tags are used to declare a single page as the source, or for duplicate pages to refer back to their original. Search engines treat duplicate content as a black mark, which can negatively impact your SEO. The canonical tag is used to fight duplicate content issues and to give search engine ranking value to the content designated as the “source” URL.

The rel="canonical" tag, added in the <head> section, distinguishes the URL you want displayed on the results page from duplicates that have a different URL but display the same content or lead to the same page.

For instance, all of the URLs below point to the homepage of a top digital marketing agency, but a search engine will consider only one of them to be the canonical form of the URL. In this example, each of the URLs below would carry <link rel="canonical" href="The preferred URL you want to rank" />

socialorange.in

socialorange.in/?source=asdf
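
Here is one way the URLs above could be mapped to a single canonical tag. Treat it as a sketch under stated assumptions: the canonical_link_tag function is hypothetical, the https:// scheme is assumed because the URLs above are shown without one, and dropping the whole query string is only safe when the parameters (like source here) do not change what the page displays.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(url):
    """Build a canonical <link> tag by dropping the query string and fragment."""
    parts = urlsplit(url)
    # Treat an empty path as the site root so both URL forms normalise alike.
    path = parts.path or "/"
    canonical = urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'


# Both variants of the homepage URL produce the same tag (https is assumed).
for url in ("https://socialorange.in", "https://socialorange.in/?source=asdf"):
    print(canonical_link_tag(url))
```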


Key takeaways

By now you should have an overall idea of how to converse with the search engine bot, but here is a quick recap to sum it all up:

User agent: This line in the robots.txt file defines which search engine bot you are referring to.
Disallow: This determines which pages the user agent is restricted from crawling.
Sitemap: Referencing a sitemap in the robots.txt file makes your site easier for the bots to navigate.
No follow: This tag ensures that your precious link juice isn’t passed on to undesirable links.
No index: This meta tag is added so that the given page does not appear on the search engine results page.
Canonicalization: The canonical URL tag helps eliminate self-created duplicate content from the search engine’s index.
