Introduction

In this first of, hopefully, many posts, we will look at the thought process behind implementing Paperoni. While the article is based on a Rust project, the concepts covered here can be applied to building your own alternative to Paperoni.

This article is suitable for anyone with experience making web (HTTP) applications, especially if you have worked with web scrapers before.

The what

Paperoni is a program that scrapes articles from the web and saves them to an epub file. Specifically, the epub presents the articles in a newspaper print format.

The why

I found myself with a large number of unread articles on Pocket (close to 700) and thought that it might be easier to read them if they were clustered into a single downloadable file. Of course, this was a stupid idea, as reading all these articles one by one would be less of a burden. Nonetheless, converting web articles into a single downloadable file seemed like an interesting idea to work on. The task became even more interesting when I decided to make this single file a newspaper.

Why .epub and not (insert file format)?

I chose EPUB as the output format because, to put it simply, epub files are zipped files of XHTML, CSS and SVG. Therefore, less work is needed to convert HTML pages to a .epub file compared to other file formats.

The how

The process of converting a web article to an epub can be generalised to 4 steps:

  1. Fetch a web page from a URL.
  2. Locate the article in the downloaded web page.
  3. Download the assets in that article section.
  4. Build an epub from the web page and its assets.

These 4 fairly simple steps are then repeated for every web page you want to download. Let us look at how to go about each of the 4 steps.

Fetch a web page from a URL

Fetching a web page can be done in your favourite language using an HTTP client. In my case, I chose to use surf which is an asynchronous Rust HTTP client that uses async-std. Using an asynchronous HTTP client comes in handy in step 3 when you may need to download multiple assets concurrently.

Using surf, you can make a GET request and store the HTML response in a String as shown below:

#[async_std::main]
async fn main() {
    // Fetch the page and read the response body into a String
    let html_response = surf::get("https://blog.hipstermojo.xyz/posts/redis-orm-preface/")
        .recv_string()
        .await
        .unwrap();
    dbg!(html_response);
}

In the code above, we make a GET request by providing a URL to the get function. Using the recv_string method, you get back the response body converted to a String. Ideally, you will want to add proper error handling and not use unwrap(). This step is the simplest of all. In fact, it is so simple that some of the tools you would use in step 2 implicitly do step 1 for you.
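As an example of that error handling point, here is a minimal sketch of the same request using the ? operator instead of unwrap(). This assumes you are happy returning surf::Error from main:

#[async_std::main]
async fn main() -> Result<(), surf::Error> {
    // Propagate request errors with `?` instead of panicking with unwrap()
    let html_response = surf::get("https://blog.hipstermojo.xyz/posts/redis-orm-preface/")
        .recv_string()
        .await?;
    println!("Fetched {} bytes of HTML", html_response.len());
    Ok(())
}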

Aside from handling errors concerned with making a network request, that is all there is to step 1. Scroll down to “The Edge Cases” section for a more nuanced problem you may encounter.

Locate the article in the downloaded web page

This is the step where most, if not all, of the work is done. Locating the article within an HTML file requires you, as the programmer, to search the DOM of the file.

The DOM (Document Object Model) is the interface used to access and modify parts of an HTML document. Elements in an HTML document are accessed from the DOM using their corresponding CSS selectors. Read more on this here.

A tool that is particularly good at this job is a web scraper. Languages like Python have a very good set of web scraping libraries. If you choose to use a web scraper, it takes care of doing step 1 for you. In the case of Paperoni, I instead chose to use a library to traverse and manipulate the HTML DOM, since that is the only thing I would need from a web scraper anyway. I chose kuchiki for this.

There are a few caveats to take note of when using Kuchiki. First of all, the only usable example in the GitHub repo was written in January 2016. Secondly, references to DOM nodes are internally represented as an Rc<RefCell<T>> type. Though this is a common pattern in Rust, it is susceptible to memory leaks if you are not mindful of your borrows. The Rc<T> type is also not thread safe, so all your DOM work can only happen on a single thread.

The HTML string retrieved from step 1 is passed into a library like Kuchiki to be converted to an HTML DOM tree. With this DOM tree, you then need to figure out where the article is. You may be wondering, “Why not just use all the content in the <body> tag?".

The <body> element of an article tends to be full of bloat. Most of the time, this bloat consists of advertisements, a whole comments section or article recommendations. Bloat is not inherently a bad thing, but when you are downloading an article for possibly offline reading, it is unnecessary.

Figuring out where the article actually is happens to be a very complicated task. The naive approach, currently used by Paperoni, is to look for the first <article> element in the DOM. I chose the <article> tag since it is used to semantically represent an independent composition on a web page.

Using Kuchiki the code looks something like this:

use kuchiki::{traits::TendrilSink, NodeRef};

fn get_article_content(html_str: String) -> NodeRef {
    let document = kuchiki::parse_html().one(html_str);
    // select_first returns element data, so clone its NodeRef to get an owned handle
    document.select_first("article").unwrap().as_node().clone()
}

The document variable holds a NodeRef of the entire HTML document. You then pass a CSS selector to the select_first method, which returns a reference to the first DOM node matching this selector; we clone its NodeRef so the function can return an owned handle.

The <article> tag is used to represent the main content on some websites, such as Medium and dev.to, so this approach can be used with some level of success. However, it is a naive and unreliable strategy for 2 reasons:

  1. Some websites do not have an <article> tag at all and are still blogs. One of these websites is the one you’re reading from right now.

  2. <article> tags can appear more than once, and there is a likelihood that the main content of the web page is either in one of the latter <article> tags or simply not wrapped in an <article> tag at all. An <article> tag only refers to content that can be independently represented and thus is not reserved for blog content.

You could argue that such websites are not using semantic HTML for their content, but that happens to describe most websites anyway. A much smarter solution is needed, and I have come up with a few ideas which I will share with you in part 2.

Download the assets in that article section

Once the root of the article is determined, the links to the assets are extracted from their DOM nodes. Currently, the only assets I deemed necessary to download are images. Image links are found in the src attribute of <img> elements. After getting these links, you can begin downloading the images using your HTTP client or a web scraper.

In Rust, this can be done with Kuchiki and Surf like so:

use async_std::fs::File;
use async_std::prelude::*;
use kuchiki::NodeRef;
use surf::Client;

fn get_img_links(article_ref: NodeRef) -> Vec<String> {
    let mut links = Vec::new();
    for img_ref in article_ref.select("img").unwrap() {
        // img_ref dereferences to the element's data, so its attributes can be read directly
        if let Some(img_url) = img_ref.attributes.borrow().get("src") {
            if !img_url.is_empty() {
                links.push(img_url.to_string())
            }
        }
    }
    links
}

async fn download_images(img_urls: Vec<String>, client: &Client) {
    for img_url in img_urls {
        let mut img_response = client.get(&img_url).await.unwrap();
        let img_content: Vec<u8> = img_response.body_bytes().await.unwrap();
        let img_ext = img_response
            .header("Content-Type")
            .and_then(map_mime_to_ext)
            .unwrap();
        let file_name = format!("{}{}", hash_str(&img_url), img_ext);
        let mut file = File::create(file_name).await.unwrap();
        file.write_all(&img_content).await.unwrap();
    }
}

You can read about the implementation of get_img_links in “The Edge Cases” here.

download_images loops through the image URLs, downloads each one and stores it in a Vec<u8> (a vector of bytes) using the body_bytes method. The Content-Type header of the response is used to determine the file extension of the image you just downloaded and is converted to an extension with map_mime_to_ext. The URL of the image is then hashed (using MD5 in the case of Paperoni) by the hash_str function and concatenated with the extension to create the file name used for the locally stored image. Hashing the URL gives a file name that does not contain URL characters, such as ?, which would make file creation difficult. The File type here is provided by async-std, which is why we can use .await. The Client type is provided by surf for when you are making multiple HTTP requests.

But wait…

As I mentioned earlier, using an asynchronous client can be efficient when downloading multiple images concurrently. However, the implementation of download_images above actually downloads images sequentially. This is not efficient and defeats the purpose of using an asynchronous HTTP client. Instead, we need to change the code to download the images concurrently. So what should the code look like?

use async_std::fs::File;
use async_std::prelude::*;
use async_std::task;
use surf::Client;

async fn download_images(img_urls: Vec<String>, client: &Client) {
    let mut async_tasks = Vec::with_capacity(img_urls.len());
    for img_url in img_urls {
        // Spawned tasks need owned ('static) data, so clone the client and move the URL in
        let client = client.clone();
        async_tasks.push(task::spawn(async move {
            let mut img_response = client.get(&img_url).await.unwrap();
            let img_content: Vec<u8> = img_response.body_bytes().await.unwrap();
            let img_ext = img_response
                .header("Content-Type")
                .and_then(map_mime_to_ext)
                .unwrap();
            let file_name = format!("{}{}", hash_str(&img_url), img_ext);
            let mut file = File::create(file_name).await.unwrap();
            file.write_all(&img_content).await.unwrap();
        }));
    }

    for async_task in async_tasks {
        async_task.await;
    }
}

The key difference here is in using the task::spawn function from async-std. This is used to spawn an asynchronous task similar to how threads are spawned; in fact, it also returns a JoinHandle. In this code, we create a vector of spawned asynchronous tasks and then await each of them. From my understanding, the work starts as soon as you call task::spawn, and the equivalent of joining threads then happens in the second for loop when .await is used. This led to a significant speedup in my own code, from about 11s to 1.3s spent just on downloading (fetching and saving) images from this dev.to article.

The tokio runtime has a similar function, tokio::spawn, but I have not tested it out yet. If you would prefer something else, there is the FuturesUnordered type in the futures crate, which can also be used to represent multiple futures that can complete in any order. If there is a more efficient way to go about this, I am more than happy to hear about it. I found this particular method of concurrency from this article that covers many ways of executing futures in async-std.
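For illustration, a rough sketch of that FuturesUnordered alternative might look like this (download_images_unordered is a hypothetical name and the file-saving logic is elided):

use futures::stream::{FuturesUnordered, StreamExt};
use surf::Client;

async fn download_images_unordered(img_urls: Vec<String>, client: &Client) {
    // Collect the request futures, then poll them to completion in whatever order they finish
    let mut downloads: FuturesUnordered<_> = img_urls
        .iter()
        .map(|img_url| client.get(img_url).recv_bytes())
        .collect();

    while let Some(result) = downloads.next().await {
        let img_content = result.unwrap();
        // ... determine the file name and write `img_content` to disk as before
    }
}

Since nothing is spawned here, the futures are free to borrow the client and the URLs, but they all run on the task that calls this function.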

An important thing to know is that using task::spawn (or tokio::spawn) requires the data used inside the task to have a 'static lifetime, which means borrowed data may not work as easily inside these functions.

After downloading the images, you should update the src attributes of the article’s images to point to the local files. This process is a little different if you are using task::spawn, since you cannot update the DOM node inside the spawned task. As I mentioned before, the DOM node reference is not thread safe (it does not implement the Send marker trait), and data moved into task::spawn must be thread safe. Instead, you can have the spawned task return the new image URL and do the update in the second for loop. You can look at the source code to see how I implemented it.
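As a rough sketch (not the actual implementation), assuming each spawned task returns a pair of the original URL and the new local file name, the second loop could look something like this:

for async_task in async_tasks {
    // Each task is assumed to return (original URL, local file name)
    let (img_url, file_name) = async_task.await;
    for img_ref in article_ref.select("img").unwrap() {
        let mut attributes = img_ref.attributes.borrow_mut();
        if attributes.get("src") == Some(img_url.as_str()) {
            // Point the <img> tag at the locally saved file
            attributes.insert("src", file_name.clone());
        }
    }
}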

map_mime_to_ext and hash_str are simple utility functions and their implementation is not key to understanding the workings of Paperoni.
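For completeness, here is roughly what they could look like, assuming the md5 crate for hashing and surf’s HeaderValues type for the header value (the actual implementations in Paperoni may differ):

use surf::http::headers::HeaderValues;

// Hash a URL into a hex string that is safe to use as a file name
fn hash_str(url: &str) -> String {
    format!("{:x}", md5::compute(url))
}

// Map a Content-Type header value to a file extension
fn map_mime_to_ext(content_type: &HeaderValues) -> Option<&'static str> {
    match content_type.last().as_str() {
        "image/png" => Some(".png"),
        "image/jpeg" => Some(".jpg"),
        "image/gif" => Some(".gif"),
        "image/svg+xml" => Some(".svg"),
        _ => None,
    }
}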

Build an epub from the web page and its assets

Once you have all your images downloaded and your HTML article, it is time to piece them together in an epub. I used the epub-builder crate for this. All the content downloaded so far has no styling unless it was inlined so the final epub will be pretty barebones. Adding styling will come in later parts.

Creating an epub with just the HTML you extracted would look something like this:

use epub_builder::{EpubBuilder, EpubContent, ZipLibrary};
use std::fs::File;

#[async_std::main]
async fn main() {
    // Let us assume we already extracted content and downloaded images
    let mut out_file = File::create("out.epub").unwrap();
    let html_content_buf: Vec<u8> = extract_content();
    let mut epub = EpubBuilder::new(ZipLibrary::new().unwrap()).unwrap();
    // add_content takes anything that implements Read, so a byte slice works here
    epub.add_content(EpubContent::new("content.xhtml", html_content_buf.as_slice()))
        .unwrap();
    epub.generate(&mut out_file).unwrap();
}
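To also bundle the images from step 3, epub-builder exposes an add_resource method that takes a path, a reader and a mime type. Extending the snippet above, that could look something like this (the hashed file name is just a placeholder):

// Register a downloaded image so the article's <img> tags can reference it by file name
epub.add_resource(
    "abc123.png",
    File::open("abc123.png").unwrap(),
    "image/png",
).unwrap();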

Saving the content to an epub is worthy of its own dedicated article, so I will leave it at that.

Hopefully, you enjoyed this first part. You can find the code on Github. This is very much a work in progress so you will find most of the code on the dev branch.

The Edge Cases

In Step 1

While making a GET request is all you need to accomplish this step, you must be aware of URLs that redirect. The most common reason a website redirects you is that your URL either has or lacks a trailing slash. A redirect is still an HTTP response (HTTP 301), so some HTTP clients will simply return that response as if it were the web page, instead of making another request to the correct URL.
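A rough sketch of guarding against this with surf (fetch_html is a hypothetical helper, and this assumes the client does not follow redirects on its own) could look like this:

use surf::StatusCode;

async fn fetch_html(url: &str) -> Result<String, surf::Error> {
    let mut res = surf::get(url).await?;
    // The body of a 301/302 is not the article; follow the Location header instead
    if res.status() == StatusCode::MovedPermanently || res.status() == StatusCode::Found {
        if let Some(location) = res.header("Location") {
            return surf::get(location.last().as_str()).recv_string().await;
        }
    }
    res.body_string().await
}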

In Step 3

The get_img_links function loops through all elements matching the “img” selector and retrieves the value stored in the src attribute if it is not empty. It is possible for an img element to have an empty src attribute, for example when lazy loading is handled by the JavaScript on the page.