Writing Web Scraped HTML to a File

I was working on a project for a client where I needed to scrape data from a web page. I wanted to save the page to a file so that I wouldn’t be hitting the server hosting the page each time I tested my code. I was using Python 3 and the Requests library. When I tried to write the page to a file, I ran into encoding issues, and the task turned out to be less straightforward than I had imagined.

My First Attempt

On my first try, I saved the text output of a request directly to a file.

import requests

url = 'http://www.example.com'
html = requests.get(url).text

with open('test.html', 'w') as test_file:
    test_file.write(html)

I was surprised to get this error.

Traceback (most recent call last):
  File ".\scrape.py", line 15, in <module>
    test_file.write(page.text)
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 93579: character maps to <undefined>

Strike Two

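Okay, so there’s an encoding error. Python 3 strings are sequences of Unicode characters, and Unicode assigns a number (a code point) to each character. To store text in a file, those code points have to be converted to bytes using an encoding. The traceback shows that Python picked cp1252, the default encoding for text files on my Windows machine, and cp1252 has no byte for the character '\u014d' (ō). The most popular encoding on the Web is UTF-8, which stores each character in one to four 8-bit bytes and can represent every Unicode character.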
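To make the difference between a code point and its encoded bytes concrete, here is a small sketch using the character from the traceback. The variable names are mine, chosen only for illustration.

```python
# 'ō' (U+014D) is the character named in the traceback above.
char = '\u014d'

# A code point is just the number Unicode assigns to the character.
code_point = ord(char)

# UTF-8 is one way to turn that number into bytes (two bytes here).
utf8_bytes = char.encode('utf-8')

# cp1252 has no byte for this character, so encoding with it fails.
try:
    char.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(hex(code_point), utf8_bytes, cp1252_ok)
```

The same UnicodeEncodeError is what a text-mode file raises when its encoding cannot represent a character being written.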

So, I decided I’d tell Python to encode my text using UTF-8.

import requests

url = 'http://www.example.com'
html = requests.get(url).text

with open('test.html', 'w') as test_file:
    page = str(html, encoding='utf-8')
    test_file.write(page)

Third Time’s a Charm?

Another surprise… I thought that calling str with an encoding argument would transform the text into a UTF-8 encoded string, yet I received the following traceback.

Traceback (most recent call last):
  File ".\scrape.py", line 15, in <module>
    page = str(html, encoding='utf-8')
TypeError: decoding str is not supported

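Apparently str doesn’t accept an encoding when its argument is already a str: the encoding argument is for decoding bytes into text, and html is text already. Okay… since encodings work at the byte level, let’s try a bytes object.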

import requests

url = 'http://www.example.com'
html = requests.get(url).text

with open('test.html', 'w') as test_file:
    page = str(bytes(html), encoding='utf-8')
    test_file.write(page)

Again, no bueno.

Traceback (most recent call last):
  File ".\scrape.py", line 15, in <module>
    page = str(bytes(html), encoding='utf-8')
TypeError: string argument without an encoding
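In hindsight, both TypeErrors come down to which direction each constructor converts in. The sketch below uses a made-up sample string (not the scraped page) to show the two directions side by side.

```python
text = 'T\u014dky\u014d'        # sample str, standing in for requests' .text
raw = b'T\xc5\x8dky\xc5\x8d'    # the same text as UTF-8 bytes

# str(x, encoding=...) decodes bytes into text; it rejects a str argument.
decoded = str(raw, encoding='utf-8')    # same as raw.decode('utf-8')
try:
    str(text, encoding='utf-8')
    str_error = None
except TypeError as exc:
    str_error = str(exc)

# bytes(x) with a str argument needs an encoding to know which bytes to emit.
try:
    bytes(text)
    bytes_error = None
except TypeError as exc:
    bytes_error = str(exc)

encoded = bytes(text, 'utf-8')          # same as text.encode('utf-8')

print(decoded == text, encoded == raw)
print(str_error)
print(bytes_error)
```

So str goes from bytes to text, bytes goes from text to bytes, and each needs the encoding only when it is actually converting.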

The Solution

I looked up the str class in the Python documentation, but unfortunately that didn’t help me much. I like Python, but IMHO the docs are a tad terse. While there, I followed the link to the built-in bytes type and discovered that it, too, takes an encoding argument. So I thought, let’s try that.

import requests

url = 'http://www.example.com'
html = requests.get(url).text

with open('test.html', 'w') as test_file:
    page = bytes(html, 'utf-8')
    test_file.write(page)

Oops…

Traceback (most recent call last):
  File ".\scrape.py", line 16, in <module>
    test_file.write(page)
TypeError: write() argument must be str, not bytes
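This error is about the file’s mode rather than the encoding: a file opened with 'w' is a text file and only accepts str, while a file opened with 'wb' accepts bytes. A minimal sketch of the difference follows. It uses a made-up sample string and a temporary directory so it runs without a network connection.

```python
import os
import tempfile

html = '<p>T\u014dky\u014d</p>'   # stand-in for requests.get(url).text
page = bytes(html, 'utf-8')       # same result as html.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'test.html')

# Text mode ('w') accepts only str, so passing bytes raises TypeError.
try:
    with open(path, 'w') as text_file:
        text_file.write(page)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Binary mode ('wb') accepts bytes and writes them unchanged.
with open(path, 'wb') as binary_file:
    binary_file.write(page)

with open(path, 'rb') as binary_file:
    saved = binary_file.read()

print(text_mode_error)
print(saved == page)
```

With Requests, the undecoded bytes of a response are also available as its content attribute, so writing that in binary mode is another option.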

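I was able to transform the HTML text into UTF-8 encoded bytes, but a file opened in text mode only accepts str. Wrapping the bytes in str(page) does run without an error, but it writes the printable representation of the bytes (b'...' with \x escapes) rather than the page itself. The simpler fix is to skip the manual conversion, write the original string, and tell open which encoding to use.

import requests

url = 'http://www.example.com'
html = requests.get(url).text

# Name the encoding explicitly; the platform default (cp1252 here) caused the original error.
with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(html)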

Finally, I was able to grab the page from the Web and write it to a file.
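A quick round-trip check is worth doing at this point, because str() applied to a bytes object returns its printable representation (b'...') rather than the decoded text. The sketch below assumes a made-up sample string and a temporary file, and it names the encoding when opening the file.

```python
import os
import tempfile

html = '<p>T\u014dky\u014d</p>'   # stand-in for requests.get(url).text
path = os.path.join(tempfile.mkdtemp(), 'test.html')

# Name the encoding explicitly instead of relying on the platform default.
with open(path, 'w', encoding='utf-8') as test_file:
    test_file.write(html)

# Read the file back with the same encoding and compare.
with open(path, encoding='utf-8') as test_file:
    round_trip = test_file.read()

# For comparison: str() on bytes gives the repr, not the decoded text.
as_repr = str(bytes(html, 'utf-8'))

print(round_trip == html)
print(as_repr)
```

If the text read back matches the original string, the saved file is a faithful copy of the page and can be reused for testing.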
