Search Engine Indexing and Crawling

When a user enters keywords into a search engine, it returns a list of results that best match those keywords. The results come from a huge collection of pages stored in the search engine's database, so that database must be built before any query can be answered. Two basic sub-processes build it, crawling and indexing, followed by the search itself:

1. Crawling: In the crawling (also called spidering) process, performed by programs known as spiders or robots, the search engine discovers the pages of the sites on the web. It does this by downloading pages, typically starting from a set of seed sites already stored in its database. As it crawls those initial sites, it examines the links they contain. Whenever it finds a link that is not already in its database, it appends the link to a list of pages to be crawled later. The newly discovered pages may in turn contain further new links, which are appended to the same list, so the crawler keeps updating its list of newly discovered sites as the crawl proceeds.

This process repeats continuously, without stopping, so that the crawler picks up the ever-changing content of the web. Every newly discovered link leads to that page being crawled, and may lead to the entire site being crawled, because when the spider fetches a page it also looks for links to other pages on the same site as well as links to external sites. It is therefore important for website owners to attract such links in order to gain visibility in search engines: the more links point to a site, the more frequently it will be crawled and, once indexed, updated. A minimal sketch of this crawl loop follows.
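Below is a minimal sketch of that crawl loop using only Python's standard library. It is an illustration, not a production crawler: real crawlers also obey robots.txt, throttle their requests per host, and distribute the frontier across many machines, all of which is omitted here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # every URL discovered so far
    pages = {}                    # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue              # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)    # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)           # a newly discovered link is remembered
                frontier.append(link)    # and queued to be crawled later
    return pages
```

The `seen` set keeps the crawler from revisiting pages, while the `frontier` queue is exactly the growing list of discovered-but-not-yet-crawled URLs described above.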

2. Indexing: Once the search engine has collected the pages fetched during crawling, it feeds them to the indexing algorithm. The indexing algorithm compares and ranks related pages against one another, so that when a user searches for a keyword, the engine can extract the pages with the highest rank. Each search engine has its own algorithm for indexing and ranking; once a page has been ranked, it is stored in the database with its assigned rank. A common core data structure for this is sketched below.
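As a concrete illustration, the sketch below builds an inverted index, the data structure most search engines use at their core: it maps each term to the pages containing it, with a simple term-frequency count standing in for a real engine's far more elaborate ranking signals.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: term -> {url: term frequency}.

    `pages` is a mapping of url -> HTML, e.g. the output of crawl() above.
    """
    index = defaultdict(dict)
    for url, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)      # crude tag stripping
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][url] = index[term].get(url, 0) + 1
    return index
```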

One might imagine that only the keywords on a page control its rank, but another key factor controls that ranking: the backlink. The concept of backlinks is tied to voting and reachability, because an existing link to a page suggests that the linking site considers the page good and thus effectively votes for it; the page also becomes reachable from that site. Search engines care about this notion of reachability: if one browsed randomly through the sites on the web, what is the probability of arriving at a given page? The higher that probability, the higher the ranking, and vice versa. Thus the ranking depends on backlinks as well.
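The random-browsing probability described here is essentially Google's original PageRank. The sketch below computes it by power iteration, assuming a `links` mapping from each URL to the URLs it links out to; links pointing outside the crawled set are simply ignored, which a real implementation would handle more carefully.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: url -> list of URLs that page links to (its outgoing links)."""
    pages = list(links)
    n = len(pages)
    rank = {url: 1.0 / n for url in pages}    # start from a uniform random surfer

    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {url: (1.0 - damping) / n for url in pages}
        for url, outgoing in links.items():
            known = [t for t in outgoing if t in rank]
            if known:
                share = damping * rank[url] / len(known)
                for target in known:          # each link "votes" its share
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[url] / n
        rank = new_rank
    return rank
```

Pages with many (or highly ranked) backlinks accumulate a larger share of the probability, which is exactly the voting effect described above.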

3. Searching: Once the database is built, the user can search, and the results are extracted from the search engine's database, as in the simple lookup sketched below.
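Reusing the inverted index from the indexing sketch, a query can be answered by summing each page's term frequencies over the query terms and returning the best-scoring URLs; a real engine would combine many such signals, including a link-based rank like the one above.

```python
import re

def search(index, query, top_k=10):
    """Score pages by summed term frequency over the query terms."""
    scores = {}
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        for url, freq in index.get(term, {}).items():
            scores[url] = scores.get(url, 0) + freq
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```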
