Yesterday, while using the pelican_link_to_title plugin, I ran into a 'no title found' problem. Let's focus on that and fix it.
To The Point
The issue
Yesterday I wanted to use the pelican_link_to_title
plugin with a link to a file instead of an HTML page. I had forgotten that the plugin reads an HTML page, which causes an issue when the link points to a file.
The error I got when linking to a file was similar to this:
plugins/pelican_link_to_title/pelican_link_to_title.py", line 36, in read_page
title = soup.find("title").string
AttributeError: 'NoneType' object has no attribute 'string'
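The traceback makes sense once you know how BeautifulSoup behaves here: when the fetched content contains no <title> tag (for example, because it is a file rather than an HTML page), soup.find("title") returns None, and calling .string on None raises exactly this AttributeError. A minimal sketch reproducing the cause (the sample content is made up):

```python
from bs4 import BeautifulSoup

# Non-HTML content, e.g. the start of a downloaded file.
content = "%PDF-1.4 ... (file contents, not HTML)"
soup = BeautifulSoup(content, "html.parser")

tag = soup.find("title")
print(tag)  # None - there is no <title> tag to find

# This is the line that crashes in the plugin:
# title = tag.string  # AttributeError: 'NoneType' object has no attribute 'string'
```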
Fixing the problem
Originally the plugin was meant to fetch page titles, and it was designed for HTML pages. One day, while using it, I added a link to a file and the plugin crashed. The problem was conceptual rather than technical - it was in the idea behind the plugin.
To fix the problem we need to check whether the link actually points to an HTML page or not.
Checking if a link is an HTML page
While searching for a way to tell whether a page is HTML or not, I found this StackOverflow comment describing a solution: access the URL with a HEAD
request.
A HEAD request returns only metadata (the response headers) without the content, unlike a request with GET.
The HEAD-request part of the solution looks like this in the plugin:
r = requests.head(url_page)
if "text/html" in r.headers["content-type"]:
    html = requests.get(url_page).text
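Since the content-type check is just a substring test on a header value, it can be pulled out into a tiny helper that is easy to test without any network access. This is my own sketch - is_html_response is a hypothetical name, not part of the plugin:

```python
def is_html_response(headers):
    """Return True if the response headers declare HTML content."""
    # Header values look like "text/html; charset=utf-8",
    # so a substring check is enough here.
    return "text/html" in headers.get("content-type", "")

# Works with requests' headers object, or a plain dict for testing:
print(is_html_response({"content-type": "text/html; charset=utf-8"}))  # True
print(is_html_response({"content-type": "application/pdf"}))           # False
print(is_html_response({}))                                            # False
```

Note that requests exposes response headers through a case-insensitive dict, so looking up "content-type" works regardless of how the server capitalizes the header name.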
The fixed plugin source
# -*- coding: utf-8 -*-
""" This is a main script for pelican_link_to_title """
from pelican import signals
from bs4 import BeautifulSoup
import requests


def link_to_title_plugin(generator):
    """ Link_to_Title plugin """
    article_ahreftag = {}
    for article in generator.articles:
        soup = BeautifulSoup(article._content, 'html.parser')
        ahref_tag = soup.find_all('ahref')
        if ahref_tag:
            article_ahreftag[article] = (ahref_tag, soup)
    for article, (p_tags, soup) in article_ahreftag.items():
        for tag in p_tags:
            url_page = tag.string
            if url_page:
                if "http://" in url_page or "https://" in url_page:
                    tag.name = "a"
                    tag.string = read_page(url_page)
                    tag.attrs = {"href": url_page}
            else:
                continue
        article._content = str(soup).decode("utf-8")


def read_page(url_page):
    import redis
    redconn = redis.Redis(host='localhost', port=6379, db=0)
    found = redconn.get(url_page)
    if not found:
        header_response = requests.head(url_page)
        if "text/html" in header_response.headers["content-type"]:
            html = requests.get(url_page).text
            soup = BeautifulSoup(html, "html.parser")
            title = soup.find("title").string
            redconn.set(url_page, title)
            return title
        else:
            return get_non_html_page_title(url_page, header_response)
    else:
        return found


def get_non_html_page_title(url_page, header_response):
    file_str = url_page.split("/")[-1]
    file_ext = file_str.split(".")
    url_domain = url_page.split("//")[1].split("/")[0]
    if len(file_ext) > 1:
        # file with extension in url.
        return "Url to {} file: {} on domain: {}".format(file_ext[-1], file_str, url_domain)
    else:
        # no file with extension in url
        return "Url to: {}".format(url_page)


def register():
    """ Registers Plugin """
    signals.article_generator_finalized.connect(link_to_title_plugin)
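To see what the non-HTML fallback actually produces, here is the same URL-parsing logic from get_non_html_page_title isolated as a standalone function (the unused header_response argument is dropped, and the example URLs are made up):

```python
def fallback_title(url_page):
    """Build a descriptive title for a URL that is not an HTML page."""
    file_str = url_page.split("/")[-1]                   # last path segment
    file_ext = file_str.split(".")                       # split on dots to detect an extension
    url_domain = url_page.split("//")[1].split("/")[0]   # host part after the scheme
    if len(file_ext) > 1:
        # The last segment looks like a file with an extension.
        return "Url to {} file: {} on domain: {}".format(file_ext[-1], file_str, url_domain)
    # No extension found - fall back to the raw URL.
    return "Url to: {}".format(url_page)

print(fallback_title("https://example.com/files/report.pdf"))
# Url to pdf file: report.pdf on domain: example.com
print(fallback_title("https://example.com/downloads"))
# Url to: https://example.com/downloads
```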
Snippets
r = requests.head(url_page)
if "text/html" in r.headers["content-type"]:
    # this url_page serves text/html content.
Acknowledgements
Auto-promotion
Related links
- Python-requests: Check if URL is not HTML webpage - Stack Overflow
- List of HTTP header fields - Wikipedia
- HTTP headers - HTTP | MDN
Thanks!
That's it :) Comment, share or don't - up to you.
Any suggestions what I should blog about? Post me a comment in the box below or poke me at Twitter: @anselmos88.
See you in the next episode! Cheers!