Creating Paperoni - Part 1 of X
Introduction⌗
In this first of, hopefully, many posts we will look at the thought process behind implementing Paperoni. While the article is based on a Rust project, the concepts covered here can be applied in making your own alternative to Paperoni.
This article is suitable for anyone with experience making web (HTTP) applications, especially if you have worked with web scrapers before.
The what⌗
Paperoni is a program that scrapes articles from the web and saves them in an EPUB file. Specifically, the EPUB presents the articles in a newspaper print format.
The why⌗
I found myself with a large number of unread articles on Pocket (close to 700) and thought that it might be easier to read them if they were bundled into a single downloadable file. Of course, this was a stupid idea, as reading all these articles one by one would be less of a burden. Nonetheless, the thought of converting web articles into a single downloadable file seemed like an interesting idea to work on. The task became even more interesting when I decided to make this single file a newspaper.
Why .epub and not (insert file format)?⌗
EPUB was chosen as the output format because, to put it simply, epub files are zipped files of XHTML, CSS and SVG. Less work would therefore be needed to convert HTML pages to a .epub file compared to other file formats.
The how⌗
The process of converting a web article to an epub can be generalised to 4 steps:
- Fetch a web page from a URL.
- Locate the article in the downloaded web page.
- Download the assets in that article section.
- Build an epub from the web page and its assets.
These 4 fairly simple steps are then repeated for every web page you want to download. Let us look at how to go about each of the 4 steps.
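Before diving in, here is a rough sketch of how these four steps might fit together, using the function names that appear later in this article. Note that build_epub is a hypothetical placeholder for step 4, the URL is just an example and error handling is left out:
#[async_std::main]
async fn main() {
    let client = surf::Client::new();
    // Step 1: fetch the web page as an HTML string
    let html_string = surf::get("https://example.com/some-article")
        .recv_string()
        .await
        .unwrap();
    // Step 2: locate the article in the DOM
    let article = get_article_content(html_string);
    // Step 3: download the assets (images) referenced in the article
    let img_links = get_img_links(article.clone());
    download_images(img_links, &client).await;
    // Step 4: build an epub from the article and its assets (hypothetical helper)
    build_epub(article);
}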
Fetch a web page from a URL⌗
Fetching a web page can be done in your favourite language using an HTTP client. In my case, I chose to use surf which is an asynchronous Rust HTTP client that uses async-std. Using an asynchronous HTTP client comes in handy in step 3 when you may need to download multiple assets concurrently.
Using surf, you can make a GET request and store the HTML response in a String as shown below:
#[async_std::main]
async fn main() {
    // Fetch the page and receive the response body as a String
    let html_response = surf::get("https://blog.hipstermojo.xyz/posts/redis-orm-preface/")
        .recv_string()
        .await
        .unwrap();
    dbg!(html_response);
}
In the code above, we make a GET request by providing a URL to the get function. Using the recv_string method, you get back the response body converted to a String. Ideally you will want to add proper error handling and not use unwrap().
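As a small illustration of that, here is a sketch of step 1 with the error propagated instead of unwrapped, assuming surf's error type is simply returned from main:
#[async_std::main]
async fn main() -> Result<(), surf::Error> {
    let html_response = surf::get("https://blog.hipstermojo.xyz/posts/redis-orm-preface/")
        .recv_string()
        .await?; // the ? operator propagates any request error
    dbg!(html_response);
    Ok(())
}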
This step is the simplest of all. In fact, it is so simple that some of the tools
you would use in step 2 implicitly do step 1 for you.
Aside from handling errors concerned with making a network request, that is all there is to step 1. Scroll down to “The Edge Cases” section for a more nuanced problem you may encounter.
Locate the article in the downloaded web page⌗
This is the step where most, if not all, of the work is done. Locating the article within an HTML file requires you, as the programmer, to search the DOM of the file.
The DOM (Document Object Model) is the interface used to access and modify parts of an HTML document. Elements in an HTML document are accessed from the DOM using their corresponding CSS selectors. Read more on this here.
A tool that is particularly good at this job is a web scraper. Languages like Python have a very good set of web scraping libraries. If you choose to use a web scraper, it takes care of doing step 1. In the case of Paperoni, I instead chose to use libraries to traverse and manipulate the HTML DOM, since it would be the only thing I would need from a web scraper anyway. I chose kuchiki for this.
There are a few caveats to take note of when using Kuchiki. First of all, the only usable example in the GitHub repo was written in January 2016. Secondly, references to DOM nodes are internally represented as an Rc<RefCell<T>> type. Though this is a common pattern in Rust, it is susceptible to memory leaks if you are not mindful of your borrows. The Rc<T> type is also not thread safe, so all your DOM work can only happen in a single thread.
The HTML string retrieved from step 1 is passed into a library like Kuchiki to be converted to an HTML DOM tree. With this DOM tree, you then need to figure out where the article is. You may be wondering, “Why not just use all the content in the <body> tag?”
The <body> element of an article page tends to be full of bloat. Most of the time, this bloat consists of advertisements, a whole comments section or article recommendations. Bloat is not a bad thing to have, but when you are downloading an article for possibly offline use, it is unnecessary.
Figuring out where the article actually is happens to be a very complicated task. The naive approach, currently used by Paperoni, is to look for the first <article> element in the DOM. I chose the <article> tag since it is used to semantically represent an independent composition on a web page.
Using Kuchiki the code looks something like this:
use kuchiki::{traits::TendrilSink, NodeRef};

fn get_article_content(html_str: String) -> NodeRef {
    // Parse the HTML string into a DOM tree
    let document = kuchiki::parse_html().one(html_str);
    // Grab the first <article> element and return its node
    document.select_first("article").unwrap().as_node().clone()
}
The document variable holds a NodeRef of the entire HTML document. You then pass a CSS selector to the select_first method, which returns a reference to the first DOM node that matches this selector; calling as_node on it and cloning gives you back a NodeRef.
The <article> tag is used to represent the main content on some websites, such as Medium and dev.to, so it can be used with some level of success.
However, this is a naive and unreliable strategy for 2 reasons:
- Some websites do not have an <article> tag at all and are still blogs. One of these websites is the one you’re reading right now.
- <article> tags can appear more than once, and there is a likelihood that the main content of the web page is either in a later <article> tag or simply not wrapped in an <article> tag at all. An <article> tag only refers to content that can be represented independently and is thus not reserved for blog content.
You could argue that such websites are not using semantic HTML for their content, but that happens to be most websites anyway. A much smarter solution is needed, and I have come up with a few ideas which I will share with you in part 2.
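Until then, a small stopgap you could try (an assumption on my part, not what Paperoni currently does) is to fall back to the <body> element whenever no <article> tag is found:
fn get_article_content(html_str: String) -> NodeRef {
    let document = kuchiki::parse_html().one(html_str);
    // Prefer the first <article>, otherwise fall back to the whole <body>
    document
        .select_first("article")
        .or_else(|_| document.select_first("body"))
        .unwrap()
        .as_node()
        .clone()
}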
Download the assets in that article section⌗
Once the root of the article is determined, the links to the assets are extracted from their DOM nodes. Currently, the only assets I saw necessary to download are images. Image links are found in the src attribute of <img> elements. After getting these links, you can begin downloading the images using your HTTP client or a web scraper.
In Rust, this can be done with Kuchiki and Surf as follows:
fn get_img_links(article_ref: NodeRef) -> Vec<String> {
    let mut links = Vec::new();
    // Iterate over every <img> element inside the article
    for img_ref in article_ref.select("img").unwrap() {
        // Keep the value of the src attribute only if it is not empty
        if let Some(img_url) = img_ref.attributes.borrow().get("src") {
            if !img_url.is_empty() {
                links.push(img_url.to_string())
            }
        }
    }
    links
}
async fn download_images(img_urls: Vec<String>, client: &Client) {
    for img_url in img_urls {
        let mut img_response = client.get(&img_url).await.unwrap();
        let img_content: Vec<u8> = img_response.body_bytes().await.unwrap();
        let img_ext = img_response.header("Content-Type").and_then(map_mime_to_ext).unwrap();
        let file_name = format!("{}{}", hash_str(&img_url), img_ext);
        let mut file = File::create(file_name).await.unwrap();
        file.write_all(&img_content).await.unwrap();
    }
}
You can read more about the implementation of get_img_links in “The Edge Cases” here.
download_images loops through the image URLs, downloads them and then stores each one in a Vec<u8> (vector of bytes) using the body_bytes method. The Content-Type header returned is used to determine the file extension of the image you just downloaded and is converted to an extension with map_mime_to_ext. The URL of the image is then hashed (using MD5 in the case of Paperoni) by the hash_str function and then concatenated with the extension to create the filename used for the locally stored image. The image URL is hashed so as to have a file name that does not contain URL characters such as the ? character, which would make file creation difficult. The File type here is provided by async-std, which is why we can use .await. The Client type is provided by surf for when you are making multiple HTTP requests.
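For completeness, here is a rough idea of what those two utilities could look like. These are sketches of my own for illustration (using the md5 crate and a small hard-coded MIME table), not Paperoni's exact implementations:
fn map_mime_to_ext(content_type: &surf::http::headers::HeaderValues) -> Option<String> {
    // Map the Content-Type header value to a file extension
    match content_type.last().as_str() {
        "image/png" => Some(".png".to_string()),
        "image/jpeg" => Some(".jpg".to_string()),
        "image/gif" => Some(".gif".to_string()),
        "image/svg+xml" => Some(".svg".to_string()),
        _ => None,
    }
}

fn hash_str(text: &str) -> String {
    // Hash the URL so the file name has no awkward characters like '?' or '/'
    format!("{:x}", md5::compute(text))
}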
But wait…⌗
As I mentioned earlier, using an asynchronous client can be efficient when downloading
multiple images concurrently. However, the implementation of download_images
actually
downloads images sequentially. This is not efficient and defeats the purpose of using
an asynchronous HTTP client. Instead, we need to change the code to download images concurrently.
So what should the code look like?
async fn download_images(img_urls: Vec<String>, client: &Client) {
    let mut async_tasks = Vec::with_capacity(img_urls.len());
    for img_url in img_urls {
        // Spawned tasks require 'static data, so each task gets its own clone of the client
        let client = client.clone();
        async_tasks.push(task::spawn(async move {
            let mut img_response = client.get(&img_url).await.unwrap();
            let img_content: Vec<u8> = img_response.body_bytes().await.unwrap();
            let img_ext = img_response.header("Content-Type").and_then(map_mime_to_ext).unwrap();
            let file_name = format!("{}{}", hash_str(&img_url), img_ext);
            let mut file = File::create(file_name).await.unwrap();
            file.write_all(&img_content).await.unwrap();
        }));
    }
    for async_task in async_tasks {
        async_task.await;
    }
}
The key difference here is in using the function task::spawn from async-std. This is used to spawn an asynchronous task similar to how threads are spawned; in fact, it also returns a JoinHandle. In this code, we create a vector of spawned asynchronous tasks and then await for them to finish. From my understanding, the work starts as soon as you call task::spawn and the equivalent of joining threads then happens in the second for loop when .await is used. This led to a significant speedup in my own code, from about 11s to 1.3s spent just on downloading (fetching and saving) images from this dev.to article.
The tokio runtime also has a similar function, tokio::spawn, but I have not tested it out yet. If you would prefer something else, there is the FuturesUnordered type in the futures crate, which can also be used to represent multiple futures that can complete in any order. If there is a more efficient way to go about this, I am more than happy to hear about it. I found this particular method of concurrency from this article that covers many ways of executing futures in async-std.
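For reference, here is a minimal sketch of the FuturesUnordered approach. download_image is a hypothetical helper that fetches and saves a single image; since nothing is spawned here, borrowing the client is fine:
use futures::stream::{FuturesUnordered, StreamExt};

async fn download_images_unordered(img_urls: Vec<String>, client: &Client) {
    // Collect one future per image; they are polled together and finish in any order
    let mut tasks: FuturesUnordered<_> = img_urls
        .iter()
        .map(|img_url| download_image(img_url, client))
        .collect();
    // Drive all the futures to completion
    while tasks.next().await.is_some() {}
}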
An important thing to know is that using task::spawn or even tokio::spawn also requires you to use data that has a 'static lifetime, which means borrowed data may not work as easily inside these functions.
After downloading images, you should ensure you update the sources of the article’s images. This process is a little different if you are using task::spawn, since you cannot update the DOM node inside the spawned task. As I mentioned before, the DOM node reference is not thread safe (it does not implement the Send marker trait), and all data used in task::spawn must be thread safe. You can instead perform the update in the second for loop by having each spawned task return the new image URL, as sketched below. You can look at the source code to see how I implemented it.
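To give a rough idea, the sketch below assumes you kept the <img> node references from step 2 in an img_refs vector, in the same order as the URLs, and that each spawned task returns the new local file name. The src attribute is then overwritten with Kuchiki's Attributes on the thread that owns the DOM:
for (img_ref, async_task) in img_refs.into_iter().zip(async_tasks) {
    // The spawned task hands back the local file name it saved the image under
    let new_src: String = async_task.await;
    // The DOM is only touched here, on the single thread that owns it
    img_ref.attributes.borrow_mut().insert("src", new_src);
}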
map_mime_to_ext and hash_str are simple utility functions and their implementation is not key to understanding the workings of Paperoni.
Build an epub from the web page and its assets⌗
Once you have all your images downloaded and your HTML article, it is time to piece them together in an epub. I used the epub-builder crate for this. All the content downloaded so far has no styling unless it was inlined so the final epub will be pretty barebones. Adding styling will come in later parts.
Creating an epub with just the HTML you extracted would look like this:
use epub_builder::{EpubBuilder, EpubContent, ZipLibrary};

#[async_std::main]
async fn main() {
    // Let us assume we already extracted content and downloaded images
    // epub-builder writes to a std::io::Write, so a std::fs::File is used here
    let mut out_file = std::fs::File::create("out.epub").unwrap();
    let html_content_buf: Vec<u8> = extract_content();
    let mut epub = EpubBuilder::new(ZipLibrary::new().unwrap()).unwrap();
    epub.add_content(EpubContent::new("content.xhtml", html_content_buf.as_slice())).unwrap();
    epub.generate(&mut out_file).unwrap();
}
Saving the content in an epub is worthy of its own dedicated article, so I will leave it at that.
Hopefully, you enjoyed this first part. You can find the code on GitHub.
This is very much a work in progress, so you will find most of the code on the dev branch.
The Edge Cases⌗
In Step 1⌗
While making a GET request is all you need to accomplish this step, you must be aware of URLs that have redirects. The most common reason for a website redirecting you is that your URL either has or lacks a trailing slash. A redirect is still an HTTP response (HTTP 301), and some HTTP clients will simply treat this response as the web page, when it is not, instead of making another request to the correct URL.
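If the client you chose does not follow redirects for you, a minimal sketch of handling one manually could look like this (the fetch_html name and single-hop logic are my own simplification):
async fn fetch_html(url: &str) -> String {
    let mut res = surf::get(url).await.unwrap();
    if res.status().is_redirection() {
        // Follow the Location header once to reach the actual page
        let location = res.header("Location").unwrap().last().as_str().to_string();
        res = surf::get(location).await.unwrap();
    }
    res.body_string().await.unwrap()
}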
In Step 3⌗
The get_img_links function loops through all elements matching the “img” selector and retrieves the value stored in the src attribute if it is not empty. It is quite possible for an img element to have an empty src attribute, for example when lazy loading is managed by the JavaScript on the page.