Telegram bot for a personalized collection of articles from Habr

If you are wondering "why?", there is an older article on the topic - Natural Geektimes - making the space cleaner.

There are a lot of articles; for subjective reasons I don't like some of them, while others, on the contrary, would be a pity to miss. I want to optimize this process and save my time.

The above article suggested an in-browser scripting approach, but I didn't really like it (even though I've used it before) for the following reasons:

  • You have to configure it separately for each browser on a computer / phone, if that is possible at all.
  • Rigid filtering by authors is not always convenient.
  • It does not solve the problem of authors whose articles I don't want to miss, even if they publish once a year.

The built-in site filtering by article rating is not always convenient, since highly specialized articles, for all their value, can receive a rather modest rating.

Initially, I wanted to generate an rss feed (or even several), leaving only the interesting things in it. But in the end it turned out that reading rss was not that convenient: in any case, to comment on an article, vote for it or add it to favorites, you have to go through the browser. So I wrote a telegram bot that sends interesting articles to me in a PM. Telegram itself makes beautiful previews of them, which, combined with information about the author / rating / views, looks quite informative.

Under the cut are the details: how the bot works, the writing process, and the technical decisions.

Briefly about the bot

Repository: https://github.com/Kright/habrahabr_reader

Telegram bot: https://t.me/HabraFilterBot

The user sets additional ratings for tags and authors. After that, a filter is applied to articles: the article's rating on Habr, the user's rating for its author, and the average of the user's ratings for its tags are added up. If the sum is greater than a user-specified threshold, the article passes the filter.
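The scoring described above fits in a few lines of Scala. A minimal sketch, with illustrative field names rather than the bot's actual ones:

```scala
// Sketch of the filter: the article's own Habr rating, the user's weight for
// its author, and the average of the user's weights for its tags are summed
// and compared against the user's threshold.
final case class Article(rating: Double, author: String, tags: Set[String])
final case class UserSettings(threshold: Double,
                              authorWeights: Map[String, Double],
                              tagWeights: Map[String, Double])

def passesFilter(a: Article, s: UserSettings): Boolean = {
  val authorScore = s.authorWeights.getOrElse(a.author, 0.0)
  val tagScores   = a.tags.toSeq.map(t => s.tagWeights.getOrElse(t, 0.0))
  val avgTagScore = if (tagScores.isEmpty) 0.0 else tagScores.sum / tagScores.size
  a.rating + authorScore + avgTagScore > s.threshold
}
```

Unknown authors and tags simply contribute zero, so an unconfigured user sees exactly what plain rating-threshold filtering would give.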

A side goal of writing the bot was to have fun and gain experience. In addition, I regularly reminded myself that I am not Google, and therefore many things are done as simply, even primitively, as possible. However, that did not stop the process of writing the bot from stretching out over three months.

It was summer outside

July was coming to an end, and I decided to write the bot. And not alone, but with a friend who was learning Scala and wanted to write something in it. The beginning looked promising - we would hack on the code "as a team", the task seemed easy, and I thought that in a couple of weeks or a month the bot would be ready.

Although I have been writing Scala code on and off for the past few years, usually nobody sees or reviews that code: pet projects, trying out ideas, data preprocessing, picking up concepts from FP. I was genuinely curious what writing code in a team looks like, because Scala code can be written in very different ways.

What could possibly go wrong? However, let's not get ahead of ourselves.
Everything that happened can be tracked through the commit history.

A friend created a repository on July 27th, but did nothing else, so I started writing code.

July 30

Briefly: I wrote the parsing of Habr's rss feed.

  • com.github.pureconfig for reading typesafe configs straight into case classes (it turned out to be very convenient)
  • scala-xml for reading xml: since I initially wanted to write my own handling of rss feeds, and an rss feed is xml, I used this library for parsing. That is how the rss parsing appeared.
  • scalatest for tests. Even for tiny projects, writing tests saves time - for example, when debugging the xml parsing, it is much easier to dump the feed into a file, write tests and fix the errors there. When a bug later appeared with parsing some strange html with invalid utf-8 characters, it again turned out to be more convenient to put it in a file and add a test.
  • actors from Akka. Objectively, they were not needed at all, but the project was written for fun and I wanted to try them. As a result, I'm ready to say that I liked them. You can look at the idea of OOP from another side - there are actors that exchange messages. More interestingly, you can (and should) write code assuming that a message may not arrive or may not be processed (generally speaking, messages should not get lost when everything runs on a single computer). At first I racked my brains and the code was a mess of actors subscribing to each other, but in the end I managed to come up with a rather simple and elegant architecture. The code inside each actor can be considered single-threaded; when an actor crashes, akka restarts it - the result is a rather fault-tolerant system.
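For illustration, here is roughly what parsing an rss feed boils down to. The bot itself uses scala-xml; this sketch substitutes the JDK's built-in DOM parser so it needs no external dependencies, and keeps only two fields per item:

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory

// An rss feed is plain xml: a <channel> holding <item> elements, each with
// <title>, <link> and so on. We pull out a minimal record per item.
final case class FeedItem(title: String, link: String)

def parseRss(xml: String): Seq[FeedItem] = {
  val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  val items = doc.getElementsByTagName("item")
  (0 until items.getLength).map { i =>
    val el = items.item(i).asInstanceOf[org.w3c.dom.Element]
    def text(tag: String): String =
      el.getElementsByTagName(tag).item(0).getTextContent
    FeedItem(text("title"), text("link"))
  }
}
```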

August 9

I added scala-scrapper to the project for parsing habr's html pages (to pull out information such as the article's rating, number of bookmarks, etc.).

And Cats. The Scala ones.

At the time I was reading a book about distributed databases and liked the idea of CRDTs (Conflict-free replicated data types, https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type), so I wrote a commutative semigroup type class for the information about an article on Habr.

The idea is actually very simple: we have counters that change monotonically. The number of views keeps growing, and so does the number of upvotes (as does, admittedly, the number of downvotes). If I have two versions of the information about an article, I can merge them into one - for each counter, the more up-to-date state wins.

Semigroup means that two objects with information about an article can be merged into one. Commutative means that you can merge both A + B and B + A, and the result does not depend on the order: either way, the newest version remains. And, by the way, the operation is also associative.

For example, by design, the rss feed after parsing gave slightly weaker information about an article - without metrics such as view counts. A dedicated actor then took the article info, fetched the html pages to refresh it, and merged the result with the old version.

Generally speaking, just as with akka, there was no real need for this: I could simply have stored an updateDate for each article and taken the newer one without any merging, but the road of adventure led me on.
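The merge can be sketched like this (the real class carries more fields; the names here are illustrative):

```scala
// Commutative semigroup over article stats: each counter only grows, so
// merging two versions means taking the maximum of every counter. The result
// does not depend on the order of the arguments, or on bracketing.
final case class ArticleStats(views: Int, upvotes: Int, downvotes: Int) {
  def merge(other: ArticleStats): ArticleStats =
    ArticleStats(
      views     = views     max other.views,
      upvotes   = upvotes   max other.upvotes,
      downvotes = downvotes max other.downvotes)
}
```

With this, the rss version of an article (which lacks view counts) and the html version can be combined without caring which one arrived first.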

August 12

I began to feel freer and, out of curiosity, made each chat a separate actor. In theory an actor weighs around 300 bytes and you can create millions of them, so this is a perfectly normal approach. The result, it seems to me, was quite an interesting design:

One actor was a bridge between the telegram server and the messaging system inside akka. It simply received messages and forwarded them to the right chat actor. The chat actor could send something back in response - and that was sent back to Telegram. What was very convenient was that this actor turned out to be as simple as possible and contained only the logic for responding to messages. By the way, information about every new article went to every chat, but I see no problem in that.
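Stripped of the akka specifics, the routing idea looks roughly like this sketch (the handler body is a placeholder, not the bot's real chat logic):

```scala
import scala.collection.mutable

// One handler per chat; the bridge only routes. In the real bot both sides
// are actors and the reply travels back as a message; here the reply is
// simply returned.
final class ChatHandler(chatId: Long) {
  def handle(text: String): Option[String] =
    if (text.startsWith("/")) Some(s"chat $chatId: got $text") else None
}

final class Bridge {
  private val chats = mutable.Map.empty[Long, ChatHandler]

  // Create the chat handler on first contact, then forward the message.
  def onMessage(chatId: Long, text: String): Option[String] =
    chats.getOrElseUpdate(chatId, new ChatHandler(chatId)).handle(text)
}
```

Because the bridge knows nothing about chat logic, a crash in one chat cannot take the routing down with it.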

In general, the bot was already working, responding to messages, storing a list of articles sent to the user, and I was already thinking that the bot was almost ready. I slowly added small features such as normalizing author names and tags (replacing "sd f" with "s_d_f").

There was just one small "but" left - the state was not saved anywhere.

Everything went wrong

You may have noticed that I wrote the bot mostly alone. Well, the second participant finally joined the development, and the following changes appeared in the code:

  • MongoDB appeared to store the state. At the same time, the project's logs broke, because for some reason Mongo started spamming them, and someone simply turned them off globally.
  • The telegram actor-bridge has changed beyond recognition and started parsing messages itself.
  • The chat actors were mercilessly cut out; instead, a single actor appeared that hid all the information about all the chats at once. This actor went to Mongo for every little thing. Well, yes, it is hard to broadcast updated article information to all the chat actors (we're like Google, millions of users are each waiting for a million articles in their chat), but going to Mongo on every chat update is apparently fine. As I understood much later, the working chat logic was also completely cut out and replaced with something that didn't work.
  • There was no trace left of the type classes.
  • Some kind of unhealthy logic appeared in the actors with their subscriptions to each other, leading to race conditions.
  • Data structures with fields of type Option[Int] turned into Int with magic default values like -1. Later I realized that mongoDB stores json and there is nothing wrong with keeping Option there, or at least parsing -1 as None; but at the time I didn't know that and took it on faith that "it has to be this way". That code was not written by me, and for the time being I didn't go in and change it.
  • I found out that my public IP address tends to change, and each time I had to add it to the Mongo whitelist. I ran the bot locally, while the database lived somewhere on the servers of MongoDB the company.
  • The normalization of tags and message formatting for the telegram suddenly disappeared. (Hmm, why would that be?)
  • I liked that the state of the bot is stored in an external database, and when restarted, it continues to work as if nothing had happened. However, that was the only plus.

The second person was in no particular hurry, and all these changes landed in one big pile in early September. I did not immediately appreciate the scale of the damage and started figuring out how the database worked, because I had never dealt with one before. Only later did I realize how much working code had been cut out and how many bugs had been added in return.

September

At first, I thought it would be useful to master Mongo and do everything properly. Then I slowly began to understand that organizing communication with a database is also an art, in which you can create races and simply make mistakes. For example, if the user sends two messages like /subscribe, then in response to each of them we will create a row in the table, because at the moment those messages are processed the user is not yet subscribed. I suspected that the communication with Mongo, as written, was not done in the best way. For example, the user's settings were created at the moment he subscribed. If he tried to change them before subscribing… the bot answered nothing, because the code in the actor went to the database for the settings, found nothing, and crashed. When I asked why not create the settings on demand, I was told there was no point changing them if the user hadn't subscribed… The message filtering system was somehow non-obvious, and even after a close look at the code I could not tell whether it worked as intended or contained an error.

There was no list of the articles sent to a chat; instead, it was suggested that I write it myself. That surprised me - in general, I was not against dragging all sorts of things into the project, but it would have been logical for the person who dragged them in to also wire them up. But no, the second participant seemed to forget about everything, yet said that a list inside the chat was supposedly a bad decision, and that one should make a table with events like "article y was sent to user x". Then, if the user asked for new articles, the bot would send a query to the database, select the user's events from it, get the list of new articles, filter them, send them to the user, and write the events about that back to the database.

The second participant drifted off towards abstractions in which the bot would receive not only articles from Habr, and send them not only to Telegram.

By the second half of September I had somehow implemented the events as a separate table. Not optimal, but at least the bot worked and started sending me articles again, and I slowly figured out what was going on in the code.

Now we can go back to the beginning and recall that the repository was not originally created by me. What could possibly go wrong? My pull request was rejected. It turned out that I had bad code, that I didn't know how to work in a team, and that I should be fixing bugs in the current crooked implementation rather than reworking it into a usable state.

I got upset, looked at the commit history and at the amount of code written. I looked at the pieces that had originally been written well and were then broken again…

Fuck it

I remembered the article You are not Google.

I thought that nobody really needs an idea without an implementation. I thought that I wanted a working bot that would run as a single copy on a single computer, as a simple java program. I know my bot will run for months without restarts, since I have written such bots before. If it does crash and fails to send a user yet another article, the sky will not fall and nothing catastrophic will happen.

Why do I need docker, mongoDB and the rest of the cargo cult of "serious" software if the code simply doesn't work, or works crookedly?

I forked the project and did everything as I wanted.

Around the same time, I changed jobs, and free time became sorely lacking. In the morning I would only fully wake up on the train; in the evening I came back late and didn't feel like doing anything. For a while I did nothing, then the desire to finish the bot won out, and I began slowly rewriting the code on the morning commute. I won't say it was productive: sitting in a shaking train with a laptop on your lap and peeking at Stack Overflow from your phone is not very convenient. However, the time spent writing code flew by completely unnoticed, and the project slowly began to move towards a working state.

Somewhere in the back of my mind a worm of doubt still wanted to use mongoDB, but I reasoned that besides the plus of "reliable" state storage, there were noticeable minuses:

  • The database becomes another point of failure.
  • The code becomes more complex, and I will write it longer.
  • The code becomes slow and inefficient, instead of changing the object in memory, the changes are sent to the database and pulled back if necessary.
  • The data structure is constrained by the peculiarities of the database - for instance, the requirement to store events in a separate table.
  • The free tier of Mongo has its own restrictions, and if you run into them, you will have to set up and run Mongo on something yourself.

I cut Mongo out; now the bot's state is simply kept in the program's memory and saved to a file as json from time to time. Perhaps people will write in the comments that I'm wrong and that this is exactly where a database should be used, etc. But this is my project; the file-based approach is as simple as possible, and it works transparently.
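The "state in memory, periodically dumped to disk" approach can be sketched as follows. Each snapshot goes to a fresh file, so a crash mid-write cannot corrupt the previous one; the helper names are mine, and the json serialization itself is elided:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Write each snapshot to a new timestamped file; the previous snapshots
// stay intact on disk. The json string comes from a real json library.
def saveSnapshot(dir: Path, json: String): Path = {
  Files.createDirectories(dir)
  val file = dir.resolve(s"state-${System.currentTimeMillis()}.json")
  Files.write(file, json.getBytes(StandardCharsets.UTF_8))
  file
}

// On startup, pick the newest snapshot (if any) to restore from.
def loadLatest(dir: Path): Option[String] = {
  import scala.jdk.CollectionConverters._
  val snapshots = Files.list(dir).iterator().asScala
    .filter(_.getFileName.toString.startsWith("state-")).toSeq
  if (snapshots.isEmpty) None
  else Some(new String(
    Files.readAllBytes(snapshots.maxBy(_.getFileName.toString)),
    StandardCharsets.UTF_8))
}
```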

I threw out the magic values like -1 and brought back proper Option fields, and returned the hash table of sent articles to the object with the chat information. I added removal of information about articles older than five days, so as not to store everything indefinitely. I brought logging into a working state - logs are written in reasonable quantities both to a file and to the console. I added several admin commands, such as saving the state or getting statistics like the number of users and articles.

I fixed a bunch of little things: for example, articles now show the number of views, likes, dislikes and comments as of the moment they passed the user's filter. In general, it's amazing how many little things had to be corrected. I kept a list, noted every "rough edge" there, and fixed them as far as possible.

For example, I added the ability to set all settings directly in one message:

/subscribe
/rating +20
/author a -30
/author s -20
/author p +9000
/tag scala 20
/tag akka 50

And another command, /settings, displays them in exactly this form, so you can take the text from it and send all your settings to a friend.
It seems like a trifle, but there are dozens of similar nuances.
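Parsing such a settings message reduces to line-by-line pattern matching. A hypothetical sketch - the real bot's grammar may differ in details:

```scala
import scala.util.Try

// One ADT case per command shown above; unknown lines parse to None.
sealed trait Command
case object Subscribe extends Command
final case class SetThreshold(value: Int) extends Command
final case class AuthorWeight(name: String, w: Int) extends Command
final case class TagWeight(name: String, w: Int) extends Command

def parseLine(line: String): Option[Command] = {
  def int(s: String): Option[Int] = Try(s.toInt).toOption // accepts "+20" too
  line.trim.split("\\s+").toList match {
    case "/subscribe" :: Nil           => Some(Subscribe)
    case "/rating" :: n :: Nil         => int(n).map(SetThreshold)
    case "/author" :: name :: n :: Nil => int(n).map(AuthorWeight(name, _))
    case "/tag" :: name :: n :: Nil    => int(n).map(TagWeight(name, _))
    case _                             => None
  }
}

// A whole settings message is just the valid lines, in order.
def parseMessage(text: String): Seq[Command] =
  text.linesIterator.flatMap(parseLine).toSeq
```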

I implemented article filtering as a simple linear model: the user can set additional ratings for authors and tags, as well as a threshold value. If the sum of the author's rating, the average rating of the tags, and the article's own rating is greater than the threshold, the article is shown to the user. You can either ask the bot for articles with the /new command, or subscribe to the bot and it will send articles to your PM at any time of the day.

Generally speaking, I had an idea for each article to pull out more signs (hubs, the number of comments, bookmarks, the dynamics of rating changes, the amount of text, pictures and code in the article, keywords), and show the user an ok / not ok vote under each article and for each user to train the model, but I became too lazy.

Besides, the logic would not be as transparent. As it is, I can manually set a rating of +9000 for patientZero and, with a threshold of +20, I am guaranteed to receive all of his articles (unless, of course, I set -100500 for some tags).

The final architecture turned out to be quite simple:

  1. An actor that stores the state of all chats and articles. It loads its state from a file on disk and saves it back from time to time, each time in a new file.
  2. An actor that visits the rss feed from time to time, learns about new articles, follows their links, parses the pages, and sends the articles to the first actor. In addition, it sometimes asks the first actor for the list of articles, selects those that are no older than three days but have not been updated for a while, and refreshes them.
  3. An actor that communicates with Telegram. I kept the message parsing entirely in here. Ideally I would like to split it in two, so that one part parses incoming messages and the other handles transport issues such as resending undelivered messages. For now there is no resending, and a message that fails with an error is simply lost (apart from a note in the logs), but so far this has not caused problems. Problems may arise if a lot of people subscribe to the bot and I hit the limit on sending messages.
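The selection rule in step 2 can be sketched as below. The three-day bound is from the text; the one-hour staleness window is my assumed stand-in for "not updated for a long time":

```scala
import java.time.{Duration, Instant}

// An article is worth refreshing while it is younger than three days but its
// stats have not been fetched for a while.
final case class TrackedArticle(published: Instant, lastUpdated: Instant)

def needsUpdate(a: TrackedArticle,
                now: Instant,
                staleAfter: Duration = Duration.ofHours(1)): Boolean = {
  val young = Duration.between(a.published, now).compareTo(Duration.ofDays(3)) < 0
  val stale = Duration.between(a.lastUpdated, now).compareTo(staleAfter) > 0
  young && stale
}
```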

What I liked is that thanks to akka, crashes of actors 2 and 3 generally do not affect the bot's operation. Perhaps some articles are not updated on time or some messages never reach Telegram, but akka restarts the actor and everything keeps working. I record that an article has been shown to a user only when the telegram actor reports that it has successfully delivered the message. The worst that can happen is that a message is sent several times (if it was delivered but the confirmation was somehow lost). In principle, if the first actor kept its state not in itself but in some database, it too could quietly crash and come back to life. I could also have tried akka persistence to restore actor state, but the current implementation suits me with its simplicity. Not that my code crashes often - on the contrary, I put quite a bit of effort into making that impossible - but shit happens, and the ability to split the program into isolated actor-pieces seemed genuinely convenient and practical to me.

I added circle-ci so that I find out right away when the code breaks - at minimum, when it stops compiling. Initially I wanted to add travis, but it only showed my own projects, without forks. In general, both can be used freely on open repositories.

Results

It's already November. The bot is written, I used it for the last two weeks and I liked it. If you have ideas for improvement - write. I see no point in monetizing it - let it just work and send interesting articles.

Link to the bot: https://t.me/HabraFilterBot
Github: https://github.com/Kright/habrahabr_reader

Small conclusions:

  • Even a small project can take a long time.
  • You are not Google. It makes no sense to shoot sparrows with a cannon. A simple solution can work just as well.
  • Pet projects are very good for experimenting with new technologies.
  • Telegram bots are written quite simply. If not for "teamwork" and experiments with technology, the bot would have been written in a week or two.
  • The actor model is an interesting thing that goes well with multithreading and code fault tolerance.
  • I think I felt firsthand why the open source community loves forks.
  • Databases are good because the state of the application ceases to depend on crashes / restarts of the application, but working with the database complicates the code and imposes restrictions on the data structure.

Source: habr.com
