I was working on a project for a client where I needed to scrape data from a web page. I wanted to save the page to a file so that I wouldn’t hit the server hosting the page each time I tested my code. I was using Python 3 and the Requests library. When I tried to write the page to a file, I ran into encoding issues. The task was not as straightforward as I first imagined.
My First Attempt
On my first try, I tried to save the text output of a request directly to a file.
    import requests

    url = 'http://www.example.com'
    html = requests.get(url).text
    with open('test.html', 'w') as test_file:
        test_file.write(html)
I was surprised to get this error.
    Traceback (most recent call last):
      File ".\scrape.py", line 15, in <module>
        test_file.write(page.text)
      File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 93579: character maps to <undefined>
Strike Two
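Okay, so there’s an encoding error. Python represents text as Unicode, which assigns a number (a code point) to each character. To store text in a file, those code points have to be turned into bytes by an encoding (charset). I hadn’t specified one, so open() fell back to the platform default, which on my Windows machine was cp1252 (note cp1252.py in the traceback), and cp1252 has no byte for '\u014d'. The most popular encoding on the Web is UTF-8, which stores each character in one to four 8-bit bytes.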
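To make that concrete, here is a small sketch using the character from the traceback. It only needs the standard library, and the variable names are mine, chosen for illustration.

```python
# The character from the traceback: LATIN SMALL LETTER O WITH MACRON
char = '\u014d'

# Unicode assigns the character a code point (a plain number)...
code_point = ord(char)

# ...and an encoding such as UTF-8 turns that number into bytes.
utf8_bytes = char.encode('utf-8')

# cp1252 has no byte for this character, so encoding fails.
try:
    char.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(code_point)   # 333
print(utf8_bytes)   # b'\xc5\x8d'
print(cp1252_ok)    # False
```

One character, two bytes in UTF-8, and no representation at all in cp1252, which is the failure the traceback reports.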
So, I decided I’d tell Python to encode my text using UTF-8.
    import requests

    url = 'http://www.example.com'
    html = requests.get(url).text
    with open('test.html', 'w') as test_file:
        page = str(html, encoding='utf-8')
        test_file.write(page)
Third Time’s a Charm?
Another surprise… I thought that with str() I would be able to transform the Unicode text into a UTF-8 encoded string, yet I received the following traceback.
    Traceback (most recent call last):
      File ".\scrape.py", line 15, in <module>
        page = str(html, encoding='utf-8')
    TypeError: decoding str is not supported
Apparently str() with an encoding argument doesn’t work on a str object: that form decodes bytes into text, and my text was already decoded. Okay… since encodings work at the byte level, let’s try a bytes object.
    import requests

    url = 'http://www.example.com'
    html = requests.get(url).text
    with open('test.html', 'w') as test_file:
        page = str(bytes(html), encoding='utf-8')
        test_file.write(page)
Again, no bueno.
    Traceback (most recent call last):
      File ".\scrape.py", line 15, in <module>
        page = str(bytes(html), encoding='utf-8')
    TypeError: string argument without an encoding
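In hindsight, both TypeErrors come down to the direction of the conversion: going from str to bytes is encoding, and going from bytes to str is decoding. A minimal sketch of the round trip, using a made-up sample string rather than the real page:

```python
# Sample text containing the character that caused the original error.
text = 'T\u014dky\u014d'

# str -> bytes is *encoding*; the encoding has to be named.
encoded = text.encode('utf-8')

# bytes -> str is *decoding*; str(..., encoding=...) only accepts bytes here.
decoded = str(encoded, encoding='utf-8')
also_decoded = encoded.decode('utf-8')  # equivalent spelling

print(type(encoded).__name__)  # bytes
print(decoded == text)         # True
```

Passing a str to str(..., encoding=...) fails because there is nothing left to decode.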
The Solution
I looked up the str class in the Python manual, but unfortunately that didn’t help me much. I like Python, but IMHO its docs are a tad terse. While in the docs, I followed the link to the built-in bytes class. I discovered it too had an argument for encoding. So I thought, let’s try that.
    import requests

    url = 'http://www.example.com'
    html = requests.get(url).text
    with open('test.html', 'w') as test_file:
        page = bytes(html, 'utf-8')
        test_file.write(page)
Oops…
    Traceback (most recent call last):
      File ".\scrape.py", line 16, in <module>
        test_file.write(page)
    TypeError: write() argument must be str, not bytes
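I was able to transform the HTML text into UTF-8 encoded bytes, but I had opened test.html in text mode ('w'), and a text-mode file only accepts str. Wrapping the bytes in str() makes the error go away, but it writes the repr of the bytes (b'...' with escape sequences) instead of the page itself. The real fix is to open the file in binary mode ('wb') so that it takes the bytes as they are. That’s easy to do.
    import requests

    url = 'http://www.example.com'
    html = requests.get(url).text
    with open('test.html', 'wb') as test_file:  # binary mode accepts bytes
        page = bytes(html, 'utf-8')  # encode the text as UTF-8
        test_file.write(page)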
Finally, I was able to grab the page from the Web and write it to a file.
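Looking back, there is a shorter route that skips the manual conversion: open() takes an encoding argument, so the file object can do the encoding itself. Below is a sketch under that assumption. The helper name save_html and the sample string are mine, and the sample stands in for requests.get(url).text so the example runs without a network connection.

```python
import os
import tempfile


def save_html(html, path):
    """Write text to path as UTF-8, regardless of the platform default."""
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(html)


# Stand-in for requests.get(url).text (hypothetical sample content).
sample_html = '<html><body>T\u014dky\u014d</body></html>'

path = os.path.join(tempfile.mkdtemp(), 'test.html')
save_html(sample_html, path)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding='utf-8') as in_file:
    round_trip = in_file.read()

print(round_trip == sample_html)  # True
```

With Requests this would be called as save_html(requests.get(url).text, 'test.html'). Another option is to write the response's content attribute, which is already bytes, to a file opened in 'wb' mode.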