# Building a search engine: retrieve the links

Yes, you read right... a search engine, like Google!

But do not worry: it is just an excuse to work on a concrete, well-defined project. It does not have to perform as well as Google!

## Step one: get a web page's HTML code

Our search engine must be able to automatically retrieve web pages, in particular their HTML code.

In [2]:
import requests
page = requests.get("http://www.unibz.it/")
print(str(page.content))



Take these lines as magic for now; they will be explained later on.
<br>The web page's HTML code is now stored in page.content.
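
A side note on str(page.content): page.content is a bytes object, so str() returns its printable representation (a leading b' and escaped characters) rather than the decoded text. The sketch below illustrates the difference on a small hand-written bytes value, which is only an assumed stand-in for a downloaded page. The requests library also exposes the decoded text directly as page.text.

```python
# Hypothetical stand-in for page.content (not downloaded from any real site)
content = b'<a href="caf\xc3\xa9.html">menu</a>\n'

as_repr = str(content)             # printable representation of the bytes object
as_text = content.decode("utf-8")  # the actual text, with special characters decoded

print(as_repr)
print(as_text)
```

Both forms still contain the href=" marker, so the link extraction in the next sections works on either one. Character positions and non-ASCII characters differ between the two.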

## Get a link from HTML code

Search engines discover other web pages by following the links contained in HTML code. Therefore we need a function that extracts the first link found after a given position in a web page and also tells us where that link ends (we'll use this information to extract the following links).

In [3]:
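def getLink(p,n): # this function gets the first link in p starting from position n
    startPosition=p.find('href="',n) # look for the next href=" marker
    if startPosition>-1:
        startPosition=startPosition+6 # skip the 6 characters of href=" to reach the url
        endPosition=p.find('"',startPosition)-1 # last character before the closing quote
        url=p[startPosition:endPosition+1]
        return endPosition,url
    else:
        return -1,"" # -1 signals that there are no more links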

import requests
page = requests.get("http://www.unibz.it/")
print(getLink(str(page.content),2000))

(2931, 'https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-57x57.png')
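
To see what the two returned values mean without downloading anything, here is a minimal sketch that runs the same function on a short, made-up HTML fragment (the fragment and the variable names first, second and third are assumptions for illustration). The returned position is the index of the last character of the link, so searching again from that position plus one finds the next link.

```python
def getLink(p, n):
    # same logic as the function above: first link in p starting from position n
    startPosition = p.find('href="', n)
    if startPosition > -1:
        startPosition = startPosition + 6
        endPosition = p.find('"', startPosition) - 1
        url = p[startPosition:endPosition + 1]
        return endPosition, url
    else:
        return -1, ""

# Hypothetical HTML fragment, used only for illustration
html = '<a href="a.html">A</a> <a href="b.html">B</a>'

first = getLink(html, 0)
second = getLink(html, first[0] + 1)  # resume right after the first link
third = getLink(html, second[0] + 1)  # nothing left to find

print(first)   # (14, 'a.html')
print(second)  # (37, 'b.html')
print(third)   # (-1, '')
```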


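## Get all the links from HTML code

Since getLink also returns the position where the link ends, we can call it repeatedly, each time starting right after the previous link, until it returns -1.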

In [15]:
def getLink(p,n): # this function gets the first link in p starting from position n
    startPosition=p.find('href="',n)
    if startPosition>-1:
        startPosition=startPosition+6
        endPosition=p.find('"',startPosition)-1
        url=p[startPosition:endPosition+1]
        return endPosition,url
    else:
        return -1,""

    
import requests
page = requests.get("http://www.unibz.it")
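html=str(page.content) # convert the page only once, outside the loop
n=0
while n>=0:
    n,url=getLink(html,n+1)
    if n>=0: # skip the empty url returned when there are no more links
        print(url)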

https://www.unibz.it/
https://www.unibz.it/de/
https://www.unibz.it/
https://www.unibz.it/it/
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-57x57.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-60x60.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-72x72.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-76x76.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-114x114.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-120x120.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-144x144.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-152x152.png
https://www.unibz.it//_resources/themes/unibz/images/fav/apple-touch-icon-180x180.png
https://www.unibz.it//_resources/themes/unibz/images/fav/favicon-32x32.png
https://www.unibz.it//_resources/themes/unibz/images/fav/favicon-
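
Printing the links is fine for a first look, but a search engine will eventually need to keep them. The following is a minimal sketch of the same loop that collects the links into a list. The helper name getAllLinks and the sample HTML string are assumptions introduced here, not part of the original notebook.

```python
def getLink(p, n):
    # same logic as above: first link in p starting from position n
    startPosition = p.find('href="', n)
    if startPosition > -1:
        startPosition = startPosition + 6
        endPosition = p.find('"', startPosition) - 1
        url = p[startPosition:endPosition + 1]
        return endPosition, url
    else:
        return -1, ""

def getAllLinks(p):
    # hypothetical helper: returns every link found in p, in order of appearance
    links = []
    n, url = getLink(p, 0)
    while n >= 0:
        links.append(url)
        n, url = getLink(p, n + 1)  # continue right after the previous link
    return links

# Made-up HTML, used only for illustration
sample = '<a href="https://example.org/">x</a><link href="/style.css"><a href="page2.html">y</a>'
allLinks = getAllLinks(sample)
print(allLinks)
```

Returning a list instead of printing makes it possible to process the links further, for instance to fix the relative ones.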

## Fixing internal links

One problem is that many links are internal, i.e. they do not start with http because they are written relative to the current page (for example, they refer to pages in the same folder).
<br>Therefore, we fix them by prepending the address of the current web page.

In [1]:
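def fixInternalLink(address,url):
    if url[0:4]=="http": # absolute link: nothing to fix
        return url
    else:
        if url.startswith("/") and address.endswith("/"):
            url=address[:-1]+url # drop one of the two slashes
        elif not url.startswith("/") and not address.endswith("/"):
            url=address+"/"+url # add the missing slash
        else:
            url=address+url
        return url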

            
def getLink(p,n): # this function gets the first link in p starting from position n
    startPosition=p.find('href="',n)
    if startPosition>-1:
        startPosition=startPosition+6
        endPosition=p.find('"',startPosition)-1
        url=p[startPosition:endPosition+1]
        return endPosition,url
    else:
        return -1,""

    
import requests
pageAddress="http://www.paolocoletti.it/test/test.html"
page = requests.get(pageAddress)

n=0
while n>=0:
    n,url=getLink(str(page.content),n+1)
    if n>=0: # just to be sure and avoid empty urls
        print(fixInternalLink(pageAddress,url))

http://www.paolocoletti.it/test/test2.html
http://www.paolocoletti.it/test/test4.html
http://www.paolocoletti.it/test/test.html/../images/CervelliInFuga_copertina.jpg
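
The last address in the output still contains test.html/../, because the file name of the current page remains in the path when the address is simply put in front. A possible alternative, which is not the approach used in this notebook, is the standard library function urllib.parse.urljoin, which resolves a link against the address of the page that contains it. In the sketch below only the last link comes from the output above; the other three are assumed examples.

```python
from urllib.parse import urljoin

pageAddress = "http://www.paolocoletti.it/test/test.html"

# Example link values; only the last one is taken from the output above
examples = [
    "test2.html",                              # same folder as the current page
    "/index.html",                             # hypothetical link relative to the site root
    "https://www.unibz.it/",                   # already absolute, left untouched
    "../images/CervelliInFuga_copertina.jpg",  # one folder up
]

resolved = [urljoin(pageAddress, url) for url in examples]
for address in resolved:
    print(address)
```

With urljoin, the file name test.html is dropped before the relative link is appended, and the ../ part is resolved instead of being kept in the address.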
