# Building a search engine: crawl the web

We learnt to get the links from a web page. Now we use those links to discover other web pages and retrieve the links also from them and store everything in a list.

This machine is called web crawler.

## The previous code

This is the previous code modified to make it work even for links with the single quote.

In [1]:
def fixInternalLink(address,url):
    if url[0:4]=="http":
        return url
    else:
        if url[0:0]=="/" and address[-1]=="/":
            url=address[:-2]+url
        elif  url[0:0]!="/" and address[-1]!="/":
            url=address+"/"+url
        else:
            url=address+url
        return url

            
def getLink(p,n): # this function gets the first link in p starting from position n, adjusted to deal with single quotations
    startPosition1=p.find('href="',n)
    startPosition2=p.find("href=\\'",n)
    if startPosition1>-1 and (startPosition2==-1 or startPosition1<startPosition2):
        startPosition=startPosition1+6
        endPosition=p.find('"',startPosition)-1
        url=p[startPosition:endPosition+1]
        return endPosition,url
    elif startPosition2>-1 and (startPosition1==-1 or startPosition2<startPosition1):
        startPosition=startPosition2+7
        endPosition=p.find("\\'",startPosition)-1
        url=p[startPosition:endPosition+1]
        return endPosition,url
    else:
        return -1,""

    
import requests
pageAddress="http://www.paolocoletti.it/test"
page = requests.get(pageAddress)

n=0
while n>=0:
    n,url=getLink(str(page.content),n+1)
    if n>-1 and url!="": 
        print(fixInternalLink(pageAddress,url))

http://www.paolocoletti.it/test/test2.html
http://www.paolocoletti.it/test/test3.html
http://www.paolocoletti.it/test/test4.html
http://www.paolocoletti.it/test/../images/CervelliInFuga_copertina.jpg


## Use the links to get more web pages

We want to use the links that we found in the first web page to scan other web pages. In these web pages we will find other links and we will use them to scan other web pages. And so on! This technique is called web crawling. 

There is however a problem: if web page A has a link to B and B has a link to C and C has a link to A, we will continue forever to scan them! 

Stop now and think how would you do it manually. Really stop now.


... stop and think ...

... do it manually ...

You probably have a piece of paper with the list of pages that you have already visited and a piece of paper with the list of pages to visit. Whenever you visit a page, you delete it from the first paper and write it in the second. Whenever new pages to visit arrive, you check carefully that they are not already in any list.

So we must keep a list of web pages to scan (<b>toCrawl</b>) and another list of web pages which have been scanned (<b>Crawled</b>). When we scan a web page, we move its link from <b>toCrawl</b> to <b>Crawled</b>. When we have new links to insert in <b>toCrawl</b>, we have to check that they are not present in <b>Crawled</b> nor in <b>toCrawl</b>.

In [2]:
import requests
Crawled=[]
toCrawl=["http://www.paolocoletti.it/test"]

while len(toCrawl)>0: # until there are pages to crawl...
    pageAddress=toCrawl.pop(0) # we pop out the FIRST of the list and not the last one (just our choice)
    Crawled.append(pageAddress) # moved to the other list

    page = requests.get(pageAddress) # get page HTML code
  
    n=0
    while n>=0:
        n,url=getLink(str(page.content),n+1)
        if n>-1 and url!="":
            url=fixInternalLink(pageAddress,url)
            if url not in toCrawl and url not in Crawled:   # append to list only if not in other two lists
                toCrawl.append(url)

print(Crawled) # prints all visited links


['http://www.paolocoletti.it/test', 'http://www.paolocoletti.it/test/test2.html', 'http://www.paolocoletti.it/test/test3.html', 'http://www.paolocoletti.it/test/test4.html', 'http://www.paolocoletti.it/test/../images/CervelliInFuga_copertina.jpg', 'http://www.paolocoletti.it/test/test.html', 'http://www.paolocoletti.it/test/test.html/../images/CervelliInFuga_copertina.jpg']


## Setting up some limits

If we run this code on a real web page, it will start finding tons of other webpages very soon and we will really scan the entire web. That's trillions of pages!

It is better if we put a limit on the number of pages that are visited, i.e. on the size of Crawled. Once this limit is reached, the while loop ends and the program prints Crawled and toCrawl.

In [3]:
import requests
limitVisitedPages=30
Crawled=[]
toCrawl=["http://www.unibz.it"]

while len(toCrawl)>0 and len(Crawled)<limitVisitedPages: # until there are pages to crawl... but no more than 30 crawled
    pageAddress=toCrawl.pop(0) # we pop out the FIRST of the list and not the last one (just our choice)
    Crawled.append(pageAddress) # moved to the other list

    page = requests.get(pageAddress) # get page HTML code
  
    n=0
    while n>=0:
        n,url=getLink(str(page.content),n+1)
        if n>-1 and url!="":
            url=fixInternalLink(pageAddress,url)
            if url not in toCrawl and url not in Crawled:   # append to list only if not in other two lists
                toCrawl.append(url)

print(Crawled) # prints all visited links
print(toCrawl) # prints all links still to visit


['http://www.unibz.it', 'https://www.unibz.it/', 'https://www.unibz.it/de/', 'https://www.unibz.it/it/', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-57x57.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-60x60.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-72x72.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-76x76.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-114x114.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-120x120.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-144x144.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-152x152.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-180x180.png', 'https://www.unibz.it//_resources/themes/unibz/images/fav/favicon-32x32.png', 'https://www.unibz.it//_

A much more interesting limit to set is a limit on the depth. The starting page is at depth 1 and all the pages linked by it are at depth 2. If any of these pages links another page, then that one is at depth 3 and so on.

We have therefore to indicate for each web page its depth. So now our toCrawl list will be organized as
<br>[ [url, depth], [url, depth], ....]
<br>In this way we can set a limit and not visit web pages which are too deep (these ones will be added to a special Skipped list).

In [None]:
import requests
limitVisitedPages=200
depthLimit=2
Crawled=[]
Skipped=[]
toCrawl=["http://www.unibz.it"]
toCrawlWithDepth=[["http://www.unibz.it",1]] # new structure with URL and depth

while len(toCrawl)>0 and len(Crawled)<limitVisitedPages: 
    pageAddress,depth = toCrawlWithDepth.pop(0) # pay attention that now I pop out the url and the depth
    toCrawl.pop(0) 
    Crawled.append(pageAddress)
    print("I am crawling page: " + pageAddress) # I put this to have an idea of what the program is doing

    page = requests.get(pageAddress) 

    n=0
    while n>=0:
        n,url=getLink(str(page.content),n+1)
        if n>-1 and url!="":
            url=fixInternalLink(pageAddress,url)
            if url not in toCrawl and url not in Crawled:
                if depth<depthLimit:
                    toCrawlWithDepth.append([url,depth+1]) # if the depth of the current page is below the limit, I add it to the list
                    toCrawl.append(url) 
                elif depth==depthLimit:
                    if url not in Skipped:
                        Skipped.append(url) # otherwise if the depth of the current page is AT THE LIMIT, then the link is beyond the limit and I add it to the list of the Skipped
                else:
                    print("Error: what am I doing here????") # it should never happen that the depth of this page is above the limit!
                    
print(len(Crawled),len(toCrawl),len(Skipped))
print(Crawled)
print(toCrawl)
print(Skipped)