Lately I've found vocabulary.com, a site that offers a lot of knowledge. Let's see how we can automate reading new vocabulary by scraping data from the site into a PDF.
To The Point
Scraping data from vocabulary.com
First, we need to gather the links from vocabulary.com's Choose Your Words section.
A script like this should do it:
import requests
from bs4 import BeautifulSoup

domain_href = 'https://www.vocabulary.com/'
output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
bsobj = BeautifulSoup(output.text, 'html.parser')

distinct_links = {}
for link in bsobj.find_all('a'):
    href = link.get("href")
    if href and "chooseyourwords" in href:
        distinct_links[href] = domain_href + href

print(distinct_links.keys())
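The dict keyed by href deduplicates the links automatically, since a key can only appear once. A minimal illustration with made-up hrefs (the real anchors on the page will differ):

```python
# Hypothetical hrefs as they might appear in the page's <a> tags;
# duplicates are common because articles are linked more than once.
domain_href = 'https://www.vocabulary.com/'
hrefs = [
    'articles/chooseyourwords/affect-effect/',
    'articles/chooseyourwords/affect-effect/',
    'articles/chooseyourwords/allusion-illusion-delusion/',
    None,  # <a> tags without an href yield None from link.get("href")
]

distinct_links = {}
for href in hrefs:
    if href and "chooseyourwords" in href:
        distinct_links[href] = domain_href + href

print(len(distinct_links))  # 2
```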
After gathering the unique links, we need to disassemble each linked page and extract the article body:
def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    return bsobj.find('div', class_="articlebody")
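To see what `disassemble_vocabulary_page` extracts, here is the same `find` call run on a tiny stand-in document (the real article markup on vocabulary.com is richer, but the class name is the one used above):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for a scraped article page.
html = '''
<html><body>
  <div class="header">navigation</div>
  <div class="articlebody"><p>Affect or effect?</p></div>
</body></html>
'''

bsobj = BeautifulSoup(html, 'html.parser')
# find returns the first matching tag, or None if no div has that class.
body = bsobj.find('div', class_="articlebody")
print(body.text.strip())  # Affect or effect?
```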
The script in one piece looks like this:
import requests
from bs4 import BeautifulSoup


def get_vocabulary_links():
    domain_href = 'https://www.vocabulary.com/'
    output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
    bsobj = BeautifulSoup(output.text, 'html.parser')
    distinct_links = {}
    for link in bsobj.find_all('a'):
        href = link.get("href")
        if href and "chooseyourwords" in href:
            distinct_links[href] = domain_href + href
    return distinct_links.values()


def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    return bsobj.find('div', class_="articlebody")


if __name__ == "__main__":
    link_to_vocabulary = {}
    links = get_vocabulary_links()
    for link in links:
        link_to_vocabulary[link] = disassemble_vocabulary_page(link)
Saving HTML data into a PDF
Let's use pdfkit to create the PDF.
It relies on wkhtmltopdf, so first we need to install that:
sudo apt-get install wkhtmltopdf
Then pipenv install pdfkit makes the pdfkit library available in the environment.
We can convert the HTML that we have scraped with this function:
def html_to_pdf(link, data):
    pdfkit.from_string(data, "{}.pdf".format(link.split("chooseyourwords")[-1]))
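Note that the name taken straight from the split still contains slashes (e.g. '/affect-effect/'), which the filesystem would treat as directories. A small sketch (hypothetical helper name) of deriving a flat file name from an article URL:

```python
def link_to_filename(link):
    # Keep only the article slug after "chooseyourwords" and drop
    # the surrounding slashes so it is a valid flat file name.
    return "{}.pdf".format(link.split("chooseyourwords")[-1].replace("/", ""))

link = "https://www.vocabulary.com/articles/chooseyourwords/affect-effect/"
print(link_to_filename(link))  # affect-effect.pdf
```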
Source code
The output is not the prettiest yet, but we do have the data in a PDF.
import requests
import pdfkit
from bs4 import BeautifulSoup


def get_vocabulary_links():
    domain_href = 'https://www.vocabulary.com/'
    output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
    bsobj = BeautifulSoup(output.text, 'html.parser')
    distinct_links = {}
    for link in bsobj.find_all('a'):
        href = link.get("href")
        if href and "chooseyourwords" in href:
            distinct_links[href] = domain_href + href
    return distinct_links.values()


def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    return bsobj.find('div', class_="articlebody")


def html_to_pdf(link, data):
    file_name = link.split("chooseyourwords")[-1].replace("/", "")
    pdfkit.from_string(data, "{}.pdf".format(file_name))


if __name__ == "__main__":
    links = get_vocabulary_links()
    for link in links:
        html_to_pdf(link, disassemble_vocabulary_page(link).text)
Acknowledgements
Auto-promotion
Related links
- pdfkit 0.6.1 : Python Package Index
- GitHub - JazzCore/python-pdfkit: Wkhtmltopdf python wrapper to convert html to pdf
- Beautiful Soup Documentation — Beautiful Soup 4.4.0 documentation
- How to create PDF files in Python - Stack Overflow
- PyPDF2 Documentation — PyPDF2 1.26.0 documentation
- Automate the Boring Stuff with Python
Thanks!
That's it :) Comment, share or don't - up to you.
Any suggestions what I should blog about? Post me a comment in the box below or poke me at Twitter: @anselmos88.
See you in the next episode! Cheers!