"I'll read it later": the difficult fate of an offline collection of Internet pages

There are kinds of software that some people cannot live without, while others never suspect that such a thing exists or that anyone needs it. For many years, Macropool WebResearch was that kind of program for me: it let you save, read, and organize web pages in a sort of offline library. I'm sure many readers get by just fine with a collection of links, or with a browser plus a folder of saved documents. I, however, want to be able to at least mark documents as "read" or "favorite", move quickly from one text to another, and not depend on the availability of the Internet or of a particular site. Often the time to read comes exactly when there is no connection (on the road, for example), and links, unfortunately, tend to be short-lived.

Apparently, the authors of WebResearch had people like me in mind. The program was packed with features: cataloging by sections and tags, note editing, all kinds of export and import, and so on. However, around 2013 the project stopped being updated, and then the developer's website disappeared as well. I managed to keep using the program for a few more years, but first the browser plugins stopped working (they existed only for the IE and Firefox versions of the time), and then modern sites stopped rendering properly in the viewer, which was based on the old IE engine.

WebResearch main window, PC Week/RE #17 (575)

The Road of Disappointment

As soon as it became clear that a replacement could not be avoided, I began looking, without much hurry, for a decent alternative. It seemed to me that this would not be particularly difficult, since my wishes are extremely modest. I was prepared to get by with just a small subset of the WebResearch tools, including:

  • saving the HTML page from the browser using the extension;
  • at least minimal means of cataloging (renaming, organization of directories, labels);
  • (preferably) support for PDF documents;
  • any decent way to sync your collection with other devices.

To my surprise, I failed to find anything of the kind, although I honestly scoured the Internet and carefully studied about a dozen programs that matched the description (with the exception of Evernote, where the functionality that sounds similar is available only by subscription). As of today, only the TagSpaces and myBase projects come anywhere close to satisfying my wishes. Studying them is, generally speaking, of some cultural interest.

TagSpaces is a trendy, stylish organizer built on Electron, with a beautiful website, a responsive layout and, of course, a dark theme (how could it do without one). At the same time, the collection's table of contents, with its fashionable rounded icons, takes up half the screen yet fits about twenty items at most, while basic things like hotkey support or rendering of the document being viewed are clearly treated as an afterthought. As a result, documents are displayed crookedly, and working with the collection turns into a tedious, time-consuming series of mouse exercises.

Its opposite, myBase, comes from the late nineties: along with a purely functional interface, it offers an extremely rich set of settings and features. However, it uses the same old IE-based browser as its viewer (which already makes reading difficult), and all documents are stored in a monolithic database. If you put that database in a Dropbox folder, for example (there is still no other way to sync with other devices), then after the slightest change to the collection you have to wait for hundreds of megabytes to be uploaded to the server.

Turning Point

The rest of this note probably seems obvious to the reader: now we will be offered yet another reinvented wheel which, of course, is head and shoulders above every existing alternative. Well, sort of, but not quite. I really did lose patience with myBase and TagSpaces and sketched out my own document manager; I will give a link to it towards the end. However, this small personal project would not deserve a separate article by itself. I am writing mostly because it seemed interesting to share the experience gained along the way, including a number of unpleasant surprises I had not counted on.

Goals and objectives

To begin with, my life is rather hectic right now, and there is simply no time for full-fledged hobby projects. So from the very beginning I decided that I was ready to build my tool from whatever components came to hand, if that would speed things up. In addition, for the time being I am committing to implement only the absolute minimum of functionality that I cannot do without.

Data Format and Page Saving

How should web pages be stored on disk? Given the requirements above, the choice seemed small: either the "complete web page" format, that is, a main HTML file plus a folder of related resources, or MHTML. The first option immediately seemed less attractive: there is little joy in keeping a heap of files on disk from which you have to pick out the meaningful documents, filter out the noise when searching, and watch integrity when copying. When I tried working with TagSpaces, I had to re-save all my documents so that each resource folder's name started with a period: the system then treats such folders as "hidden" and does not display them.

In myBase this problem is hidden from view, since everything lives in the database, but in my case the principle of simplicity won: I really wanted to store everything as ordinary files on disk, so that I would not have to implement routine operations like copying, renaming, deleting, and synchronizing myself.

The MHTML format is going through hard times. The easy way to save MHTML was removed from Chrome this summer, and I don't even know how pages are supposed to be saved now. Clearly the capability has not disappeared entirely yet, and there are third-party extensions, but on the whole it is a bad sign. Saving as MHTML is also not supported in the Chromium Embedded Framework, which does not add optimism either.

In parallel, I began looking for an easy way to save pages from the browser to a specified folder. In the end, both problems were solved with little effort: I stumbled upon a wonderful project called SingleFile, which can store the contents of a web page in a single self-contained HTML file. It does this by converting all related resources to base64 and embedding them directly into the HTML. Of course, the file size grows as a result, and the markup looks somewhat cluttered, but on the whole the approach seemed reliable and simple to me, and I settled on it.
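
The embedding idea can be illustrated with a minimal Python sketch. This is not SingleFile's actual code: the function names to_data_uri and inline_images are made up for this illustration, and the sketch handles only local <img> tags with a double-quoted src attribute, whereas the real tool also deals with stylesheets, fonts, and other resources.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data: URI (hypothetical helper)."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with embedded data: URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        # Leave remote and already embedded resources untouched.
        if "://" in src or src.startswith("data:"):
            return match.group(0)
        target = base_dir / src
        if not target.is_file():
            return match.group(0)
        return f"{match.group(1)}{to_data_uri(target)}{match.group(3)}"

    # Assumption: attributes use double quotes; a real tool would parse the HTML.
    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

Base64 encodes every three bytes as four characters, so each embedded resource grows by roughly a third, which is the size penalty mentioned above.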

SingleFile comes both as a browser extension and as a command-line application. For now I just use the extension: it is quite convenient, except that you have to select the target folder manually every time you save. In the future I will probably try to adapt the command-line application to simplify this process. To call a third-party application from Chrome, you can use the External Application Button extension, which is another useful discovery of mine. By the way, the command-line application has already come in handy: with its help I converted a collection of folders and files from TagSpaces into a set of self-contained HTML documents.

Hassle with GUI and Browser

I found Python to be a good fit for all sorts of simple file and string operations, and since one of my work projects uses wxWidgets, choosing wxPython as the main framework looked logical.

Next, having seen the glitches with page display in other programs, I concluded that the only reliable way to deal with them is to build a viewer based on a modern browser, that is, Chrome or Firefox, into the program.

I must admit that the last time I had to do something like this was 15 years ago, and I did not expect any nasty surprises. It turned out that "just dropping a browser onto a form" is impossible: somehow, humanity has not managed to solve this task reliably and universally. Any GUI framework lets you put a list box or a button on a form and even produces cross-platform code, so it seemed to me that in 2019 displaying HTML should likewise be a problem solved everywhere.

It turned out that in wxWidgets, for example, the standard "browser" component is a cross-platform wrapper around the system "browser", which on Windows means Internet Explorer 7. The situation in Windows Forms is no better: fresher versions such as IE9 are available only through non-trivial registry manipulation. As you can see, I'm not the only one who has been busy with other things for the last 15 years: nothing has moved forward here either.

I was then left with a choice: change the framework or look for an alternative browser component. After some hesitation, I decided to try the second way first and quickly stumbled upon the CEF Python project: Python bindings for the Chromium Embedded Framework, designed specifically for embedding Chromium in Python applications.

Consider the situation: Python is one of the most popular programming languages in the world, and Chrome is essentially a monopolist in the browser market. Yet CEF Python is effectively kept alive by the energy of a single person, strength and health to him. Does nobody else need this?

However, CEF Python did not help me in the end. Even the basic wxWidgets integration example from the project repository is frankly buggy; I tinkered with it for quite a while but could not solve all the problems that came up. I won't go deeper into this topic, it hardly deserves it.

I studied the components based on the Chromium Embedded Framework in more detail and finally decided to try the C# version. Since I work on Windows almost all the time, the prospect of giving up cross-platform support did not bother me much.

After some inevitable fuss at the start, things went much faster: the combination of CefSharp and Windows Forms turned out to be winning, and I managed to solve most of the technical tasks without any problems.

About the untried

You can try to embed Firefox in a C# application using the GeckoFX component, but I can't say anything about it. The standard browser component of the Qt framework, QWebEngineView, is based on Chromium, so it will probably work just as well as CefSharp.

Fans of Qt may be tempted to comment that, had I taken Qt, I would have had no problems. Quite possibly so, but wxWidgets is, if not the first, then the second option to consider when choosing a GUI framework for applications in Python or C++. And in my humble opinion, something like a browser should be built into any reasonably mature GUI framework without having to jump through hoops.

weblibrary

Let's return, however, to my application with the working name weblibrary. Today it looks like this (drumroll):

[Screenshot: weblibrary main window]

Besides a clean and concise interface, only the most basic functions are implemented here:

  • Display any specified directory on the system as a document library.
  • View documents in a browser window. Navigate the list in the usual way (cursor keys, PgUp, PgDn, Home, End), and scroll in the browser with the Space and Shift + Space keys.
  • Rename documents.
  • Mark documents as read or favorite using hotkeys.
  • Sort documents by any field.
  • Refresh the application window on any change to the library folder.
  • Save window settings on exit.

All this may seem like trivial functionality, but TagSpaces, say, still cannot remember column widths; apparently, its authors have other priorities.

The status (read/favorite) is simply stored in the file name (for example, a read file doc.html is renamed to doc{R,S}.html). Synchronization as such is not implemented; I simply keep the library in Dropbox, since after all it is just a folder with files.
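
A naming scheme like this takes only a few lines to parse. The Python sketch below is an illustration and not the code of weblibrary, which is written in C#. It assumes that the flags are single capital letters in braces, separated by commas, right before the extension, and that R stands for "read" and S for "starred" (favorite); the article does not spell out the letters, and the names parse_status and with_status are hypothetical.

```python
import re
from pathlib import Path

# Assumed convention: "title{R,S}.html" (R = read, S = starred/favorite).
_STATUS = re.compile(r"(?P<title>.*?)(?:\{(?P<flags>[A-Z](?:,[A-Z])*)\})?", re.DOTALL)


def parse_status(filename: str) -> tuple[str, set[str]]:
    """Split a file name into its title and the set of status flags."""
    match = _STATUS.fullmatch(Path(filename).stem)
    flags = match.group("flags")
    return match.group("title"), set(flags.split(",")) if flags else set()


def with_status(filename: str, flags: set[str]) -> str:
    """Build the file name that carries exactly the given flags."""
    title, _ = parse_status(filename)
    marker = "{" + ",".join(sorted(flags)) + "}" if flags else ""
    return f"{title}{marker}{Path(filename).suffix}"


title, flags = parse_status("doc{R,S}.html")   # title == "doc", flags == {"R", "S"}
unread_name = with_status("doc{R,S}.html", {"S"})  # "doc{S}.html"
```

Marking a document then amounts to renaming the file to with_status(name, new_flags), which is why any file synchronization service carries the status along for free.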

My plans are to add simple things like moving and deleting files, and to implement marking with arbitrary tags. If anyone wants to help, I'll be glad.

Conclusions

Variety. As I said at the outset, it's amazing how different one person's toolbox can be from another. It's natural for me to use a tool like WebResearch, and I felt almost physically uncomfortable with its absence. At the same time, apparently, I have few like-minded people, otherwise there would be no problems with finding analogues. On the other hand, similar cases happen with much more mainstream software: for example, Microsoft is not going to update the desktop version of OneNote, so I have to use the 2016 version, and sooner or later I will have to move somewhere else as well.

Another surprising thing is how difficult it is to navigate the current landscape of libraries and frameworks. I rarely have to write desktop applications from start to finish at work, and I assumed that for my task (one window, three components, trivial interactions) literally any tool for any programming language would do: just pick anything and get it done within a few days.

It turned out that reality is much less benevolent, and you can run into a problem out of the blue. Say I have two splitters that can be used to resize the browser window. Restoring their positions on startup in wxWidgets is extremely difficult, because the system puts them back in their default positions after almost every event available to me, and I have to resort to all sorts of hacks to get what I want. Who would have guessed?

On the other hand, it is clear that in Windows Forms everything is geared towards "business interfaces". Almost everything that was required was available out of the box: saving / restoring application settings, and a convenient component interface (for example, I didn’t expect that the TreeView component could be asked for the full path from the root to any child element as a string), and non-trivial tools like a folder content change tracker.
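
Where no ready-made tracker is available, a similar effect can be approximated by polling. The Python sketch below is an illustration and not the code used in weblibrary: it swaps event-based tracking for a comparison of two directory snapshots, and the names snapshot and diff_snapshots are hypothetical.

```python
from pathlib import Path


def snapshot(folder: Path) -> dict[str, float]:
    """Map every HTML file name in the folder to its modification time."""
    return {p.name: p.stat().st_mtime for p in folder.glob("*.html")}


def diff_snapshots(old: dict[str, float], new: dict[str, float]):
    """Return sorted lists of added, removed, and modified file names."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, changed
```

An application would call snapshot on a timer, compare the result with the previous one, and refresh the document list whenever any of the three lists is non-empty. Polling is cruder than operating system notifications, but it behaves the same way on every platform.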

In any case, the time spent was not in vain, and the result can be considered satisfactory, so what more could you want from life, right?

Source: habr.com
