
Internet Access

In this section, we will discuss Internet access: fetching URLs, encoding data, and related topics.

Internet access

So far we have seen how to access files, directories, modules, and so on through programs. We can also access websites programmatically. For this, Python 3 provides a standard library module called urllib. Using this module we can access websites, download data, parse data, modify headers, and so on.

Below is an example of how to use urllib. First we import urllib.request. We then assign the object returned by opening the URL to a variable, on which we can call the read() method to read the data.

Example:

>>> import urllib.request
>>> a = urllib.request.urlopen('https://www.google.com/')
>>> print(a.read())

The above program prints the data served at the URL 'https://www.google.com/'.

Fetching URLs

To fetch URLs we use urllib.request. It is as follows:

import urllib.request

response = urllib.request.urlopen('http://python.org/')

html = response.read()

You can create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can call .read() on it, as shown below:

import urllib.request

req = urllib.request.Request('http://python.org/')

response = urllib.request.urlopen(req)

the_page = response.read()

Similarly, you can make an FTP request (in the above code, replace 'http' with 'ftp'). It is as follows:

req = urllib.request.Request('ftp://google.com/')
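A Request object can also carry extra headers. The sketch below builds a request with a browser-style User-Agent without actually sending it; the URL is from the examples above, while the header values are only illustrations:

```python
import urllib.request

# A Request can carry extra headers; the values here are illustrative.
req = urllib.request.Request(
    'http://python.org/',
    headers={'User-Agent': 'Mozilla/5.0 (tutorial example)'})
req.add_header('Accept-Language', 'en')

# Nothing has been sent yet; we can inspect the request locally.
print(req.get_method())              # a Request without a body is a GET
print(req.get_header('User-agent'))  # header names are normalized internally
```

Passing this req to urlopen() sends those headers along with the request.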

 Fetching URL data:

Through urllib we can fetch a URL's data. It is as follows:

>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
	print(f.read(500))

	
b'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->\n<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->\n\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqu'

If you change the count passed to the read() method, you can observe the following output.

>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
	print(f.read(526))

	
b'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->\n<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->\n\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">\n'

If you extend the count to 526, the output additionally includes the next 26 bytes.

As the python.org website uses UTF-8 encoding, as specified in its meta tag, we use the same encoding to decode the bytes object:

>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
	print(f.read(526).decode('utf-8'))

	
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">

It is also possible to achieve the same result without a with statement, as shown below:

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(123).decode('utf-8'))

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html

Data encoding in URL:

Data can also be passed in an HTTP GET request by encoding it in the URL itself. It is as follows:

import urllib.request
import urllib.parse

data = {}
data['Website'] = 'VMS'
data['Purpose'] = 'Online classes'
data['Languages'] = 'C', 'Python'

url_values = urllib.parse.urlencode(data)
print(url_values)

If you execute the above code it produces the following output:

Website=VMS&Purpose=Online+classes&Languages=%28%27C%27%2C+%27Python%27%29

Note: The tuple ('C', 'Python') is encoded as its string form, %28%27C%27%2C+%27Python%27%29. In Python versions before 3.7, dictionaries did not preserve insertion order, so the order of the parameters could change between executions (for example, Languages=...&Website=VMS&Purpose=... on one run and Purpose=...&Website=VMS&Languages=... on another); from Python 3.7 onwards, insertion order is preserved and the output above is deterministic.
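In practice, the encoded values are appended to a base URL after a '?' to form the GET request URL. A sketch (the base URL here is hypothetical): passing doseq=True to urlencode encodes each item of a sequence as its own parameter, instead of the stringified tuple seen above.

```python
import urllib.parse

data = {
    'Website': 'VMS',
    'Purpose': 'Online classes',
    'Languages': ['C', 'Python'],   # a sequence of values for one key
}

# doseq=True emits one key=value pair per sequence item
url_values = urllib.parse.urlencode(data, doseq=True)
full_url = 'http://www.example.com/search?' + url_values
print(full_url)
```

The resulting full_url can then be passed directly to urlopen().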

There are two methods of data transfer with URLs: GET and POST.

GET – you request a resource, with any data encoded into the URL itself.

POST – you send data to the server in the request body and receive a response based on it.
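As a sketch of the POST side (the URL is hypothetical, and the request is only constructed, not sent): urlencode the fields, encode them to bytes, and attach them as the request body. urlopen() sends a POST whenever the Request carries data.

```python
import urllib.parse
import urllib.request

# Encode the form fields and attach them as the request body.
params = urllib.parse.urlencode({'Website': 'VMS',
                                 'Purpose': 'Online classes'}).encode('ascii')
req = urllib.request.Request('http://www.example.com/submit', data=params)

print(req.get_method())  # a Request that carries data defaults to POST
# urllib.request.urlopen(req) would actually send the request
```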

Download website:

With urllib we can download a webpage's HTML in three lines of code:

>>> import urllib.request
>>> html = urllib.request.urlopen('https://google.com').read()
>>> print(html)

If you execute the above lines, the variable html will contain the webpage's data in HTML form. A web browser such as Google Chrome renders this data visually.

URLError:

Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a reason attribute, which is a tuple containing an error code and a text error message.

>>> import urllib.request, urllib.error
>>> req = urllib.request.Request("https://www.google.com")
>>> try:
	urllib.request.urlopen(req)
except urllib.error.URLError as e:
	print(e.reason)

	
[Errno 11004] getaddrinfo failed

If there is a network connection and the specified server is available, it produces output like the following (the REPL echoes the response object returned by urlopen):

>>> req=urllib.request.Request("https://www.google.com")
>>> try:
	urllib.request.urlopen(req)
except urllib.error.URLError as e:
	print(e.reason)


<http.client.HTTPResponse object at 0x0000000003792CC0>
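When handling failures, it is common to catch HTTPError (raised when the server responds with an error status) before URLError, since HTTPError is a subclass of URLError. A sketch, using a hypothetical fetch() helper and a .invalid hostname that can never resolve:

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return the page body, or None after printing why the fetch failed."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:    # server replied with an error code
        print('The server returned an error:', e.code)
    except urllib.error.URLError as e:     # could not reach the server at all
        print('Failed to reach the server:', e.reason)

result = fetch('http://nonexistent.invalid/')  # DNS lookup fails -> URLError
```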

Retrieving data into a file:

The following example shows how to retrieve a URL's data into a local file using urlretrieve() and read it back.

>>> import urllib.request
>>> local_filename, headers = urllib.request.urlretrieve('http://python.org/')
>>> html = open(local_filename)
>>> html.read()

We can close the file using html.close(). If we then call html.read() on the closed file, Python raises a ValueError:

>>> import urllib.request
>>> local_filename, headers = urllib.request.urlretrieve('http://python.org/')
>>> html = open(local_filename)
>>> html.close()
>>> html.read()
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    html.read()
ValueError: I/O operation on closed file.

To clean up temporary files that may have been left behind by previous calls to urlretrieve(), use urllib.request.urlcleanup().
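urlretrieve() also accepts an explicit filename argument, in which case the download goes to that path instead of an anonymous temporary file. A minimal sketch, using a file:// URL so it runs without a network connection (the same calls work for http URLs; the file names are illustrative):

```python
import os
import tempfile
import urllib.request

# Create a small local file to act as the 'remote' resource.
src = os.path.join(tempfile.gettempdir(), 'source.html')
with open(src, 'w') as f:
    f.write('<p>hello</p>')

# Retrieve the URL into a file we name, instead of an anonymous temp file.
dest = os.path.join(tempfile.gettempdir(), 'copy.html')
local_filename, headers = urllib.request.urlretrieve('file://' + src,
                                                     filename=dest)
with open(local_filename) as f:
    print(f.read())          # <p>hello</p>

urllib.request.urlcleanup()  # remove temp files urlretrieve may have left
```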

 
