Today let's focus on gathering results from Grammarly document checking in our mini-automation-project :)
This article is part of the Grammarly Selenium Pelican Automation mini-project
S0-E17/E30 :)
Grammarly Automation Gathering Report.
So Grammarly creates a report as a PDF. But that PDF is just a simple PDF rendering of an HTML page.
Why not disassemble that HTML page and get the results we need, instead of generating a PDF and gathering only part of the results?
Let's check whether our assumptions match reality :)
So we have a text; let's take this for our example:
A simple
Multiline
text
That wraps
after max
2 words.
Put this into Grammarly and gather the HTML output it generates.
As I've found out, Grammarly will only generate a PDF for premium users.
So let's focus on disassembling the HTML.
The script that gathers the output:
def test_simple_text_gather_html(self):
    """ This test is supposed to return html """
    page_login = GrammarlyLogin(self.driver)
    page_login.make_login('za2217279@mvrht.net', 'test123')
    page_new_doc = GrammarlyNewDocument(self.driver)
    page_new_doc.make_new_document("")
    page_doc = GrammarlyDocument(self.driver)
    text_to_put = "A simple \
Multiline \
text \
That wraps \
after max \
2 words. \
"
    page_doc.text = text_to_put
    self.sleep(10)
    print(self.driver.page_source)
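Since the next step is to parse this dump offline, it helps to persist the page source to a file instead of just printing it. A minimal sketch (the helper name save_page_source is my own, not part of the project; the filename matches the fixture used later):

```python
def save_page_source(html, filename="bs_output_test1.html"):
    """Persist the dumped page source so it can be parsed offline later."""
    # Hypothetical helper: write the driver's page_source string to disk.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(html)
```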
I've found that there is a unique div with the CSS class _adbfa1e6-editor-page-cardsCol
that does not change from document to document. Maybe that's just a way for the React app to know where to put the checking results? Either way, we have output!
But... the output is not perfect. Data gathered this way only gives "shape" information: not the exact location in the text where something is written badly, only the incorrect fragment itself and the proposed correction.
So instead of using Selenium for this sculpting, I'll use a Python HTML extractor, Beautiful Soup.
Let's make a source that will output HTML that can then be fed to Beautiful Soup.
Now the GrammarlyDocument source looks like this:
import time

from page_objects import PageObject, PageElement


class GrammarlyDocument(PageObject):
    title = PageElement(css='input[type="text"]')
    text = PageElement(id_='textarea')
    # This button below will only be visible for Grammarly premium users.
    score_button = PageElement(css='span[class="_ff9902-score"]')
    download_pdf_btn = PageElement(css='div[class="_d0e45e-button _d0e45e-medium"]')

    def put_title(self, title):
        self.title = title

    def put_text(self, text):
        self.text = text
        time.sleep(10)

    def get_page_source(self):
        # self.w is the underlying webdriver in the page_objects library
        return self.w.page_source
And the test:
def test_get_page_source(self):
    page_login = GrammarlyLogin(self.driver)
    page_login.make_login('za2217279@mvrht.net', 'test123')
    page_new_doc = GrammarlyNewDocument(self.driver)
    page_new_doc.make_new_document("")
    page_doc = GrammarlyDocument(self.driver)
    text_to_put = "A simple \n\
Multiline \n\
text \n\
That wraps \n\
after max \n\
2 words. \n\
"
    page_doc.put_text(text_to_put)
    actual_source = page_doc.get_page_source()
    self.assertTrue("<html" in actual_source and "</html" in actual_source)
Yeah, it's a very silly test, but for now it's sufficient.
Reverse-parsing the HTML
Let's start with what Beautiful Soup is.
It's a Python HTML extractor that is useful when you want to scrape data out of HTML, and it's widely used in web crawlers.
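As a quick taste of the API (the snippet and its tiny HTML fragment are made up for illustration):

```python
from bs4 import BeautifulSoup

# A tiny, fabricated HTML fragment to demonstrate the API.
html = '<div class="note"><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; .text flattens all its children.
div = soup.find('div', {'class': 'note'})
print(div.text)                              # firstsecond
print([p.text for p in div.find_all('p')])   # ['first', 'second']
```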
Now let's make a DocumentScraper that extracts data from the HTML for us, starting with at least the results data.
from bs4 import BeautifulSoup


class DocumentScraper(object):

    def __init__(self, html_source):
        self.bs = BeautifulSoup(html_source, "html.parser")

    def get_issue_div(self):
        # DIV with class=_adbfa1e6-editor-page-cardsCol
        return self.bs.find('div', {'class': '_adbfa1e6-editor-page-cardsCol'})

    def get_all_warnings(self):
        return self.get_issue_div().contents

    def get_all_warnings_texts(self):
        return [element.text for element in self.get_all_warnings()]

    def iterate_over_warnings(self):
        for inner_element in self.get_all_warnings():
            print(inner_element.text)
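To sanity-check this approach without a live Grammarly session, we can feed the same lookup a hand-written snippet that mimics the cards column (the HTML below is fabricated; only the class name comes from the real page):

```python
from bs4 import BeautifulSoup

# Fabricated stand-in for Grammarly's cards column; only the class name is real.
fake_html = """
<div class="_adbfa1e6-editor-page-cardsCol">
  <div>Incorrect spacingwraps after → wraps after</div>
</div>
"""

soup = BeautifulSoup(fake_html, "html.parser")
issue_div = soup.find('div', {'class': '_adbfa1e6-editor-page-cardsCol'})
# Filter out whitespace-only text nodes that .contents also yields.
warnings = [el.text for el in issue_div.contents if el.name is not None]
print(warnings)  # ['Incorrect spacingwraps after → wraps after']
```

Note that .contents also yields the whitespace between tags, which is why the snippet filters on el.name; on the real page the cards sit directly next to each other.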
And the simplest test, run on a pre-downloaded file (the output from Selenium):
# -*- coding: utf-8 -*-
import unittest

from document_scraper import DocumentScraper


class GrammarlyScrapingTests(unittest.TestCase):

    def setUp(self):
        filename = "bs_output_test1.html"
        with open(filename, 'r') as f:
            self.data_scrape1 = f.read()

    def test1(self):
        assert len(self.data_scrape1) == 21022
        scraper = DocumentScraper(self.data_scrape1)
        expected = ["Incorrect spacingwraps after → wraps after"]
        result = list(scraper.get_all_warnings_texts())
        assert result == expected


if __name__ == "__main__":
    unittest.main()
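The scraped warning text runs the category, the flagged fragment, and the suggestion together ("Incorrect spacingwraps after → wraps after"), but plain string operations can pull it apart. A rough sketch; the heuristic for stripping the category prefix only works when the flagged text reappears verbatim as the suggestion, as it does for this spacing fix:

```python
def parse_warning(raw):
    """Split a scraped warning like
    'Incorrect spacingwraps after → wraps after'
    into (category, flagged_text, suggestion)."""
    left, suggestion = raw.split(" → ", 1)
    # Heuristic: the category is whatever precedes the suggestion text,
    # assuming the flagged fragment equals the suggestion (true for spacing fixes).
    category = left[:-len(suggestion)] if left.endswith(suggestion) else ""
    return category, left[len(category):], suggestion

print(parse_warning("Incorrect spacingwraps after → wraps after"))
# ('Incorrect spacing', 'wraps after', 'wraps after')
```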
Part 3!
There is going to be a part 3 that will sum up this mini-project and make this draft accessible.
So stay tuned :)
Acknowledgements
- python - install Beautiful Soup using pip
- reading and writing files in python
- GUI and headless browser testing
Thanks!
That's it :) Comment, share or don't :)
If you have any suggestions what I should blog about in the next articles - please give me a hint :)
See you tomorrow! Cheers!