Simple website failover (monitoring + dynamic DNS)

In this article I want to show how easily, and for free, you can set up failover for a website (or any other Internet service) by combining okerr monitoring with a dynamic DNS service. If anything goes wrong with the main site (from a “PHP Error” on the page, to running out of disk space, to a suspiciously low number of orders in an online store), new visitors are directed to a second (third, and so on) known-good server, or to a “Sorry” page that politely explains: “there is a problem, we already know about it, and we will fix it soon”. And in this case you really will know about it and will be able to start repairs.

Life with failover or without?

Until something goes wrong, it makes little difference. But when something does, life without failover often looks like this: you try to figure out quickly what the problem is, and it doesn't work out (backups won't restore, the software for some reason doesn't behave the way the documentation says, and so on). There is no time, the servers and sites are down, customers are calling, everyone is on edge. You patch things up quick and dirty, with duct tape, and somehow it starts up again and limps along on crutches. You tell yourself that when you have time you will look into it properly and redo everything nicely, but nothing is more permanent than a temporary fix.

Now, here is how it goes in the nice version, with failover:

  • A failure happens
  • The failure is detected automatically
  • An alert is sent
  • Traffic is switched to one of the backup servers
  • Calmly and without panic, the problem is investigated and fixed, and the server is put back into operation.

This scheme can, of course, have pitfalls of its own. Still, it is linear, each stage is simple and, most importantly, can be debugged separately, so the chance that the scheme fails is much lower, and all the steps can be automated and executed quickly (unlike the task of finding and fixing some unknown epic mess). Your plane has landed in a distant country, you turn on your phone and see a Telegram notification that the server has gone down, but everything is fine: the backup server has taken over, and you can continue your trip. You don't need to fly back or do repairs over SSH from the nearest cafe with WiFi. You can look into it whenever it is more convenient.

The future is already here!

Previously, the main thing that made failover unacceptable for many was the cost. You either had to buy expensive hardware (and bring in even more expensive specialists), or cobble together something complicated by following guides (I even came across a setup where two servers were additionally connected with a null-modem cable carrying a heartbeat, so that the spare server would find out at the right moment and take over). Now there are simpler and free ways. Even if all you have is a site with cats, you have no excuse for not having failover yet!

Besides, a failover scheme needs another server (maybe more than one). That used to be a big expense; now you can rent a VPS for pennies.

The most trusted cat site

As a practical illustration of the okerr + dynamic DNS solution, we launched our own website with cats, cat.okerr.com. We hate cats, so there won't be many of them. There are three servers in total. Each serves pretty much the same page (same template) but with a different kitten, so they are easy to tell apart, and each shows some technical info so you can see how the failover works. The page refreshes itself once a minute, but you can always click reload in the browser.

The technical information contains the line “status=OK”. From time to time the servers simulate problems and write status=ERR instead. The main server “crashes” at minute 20 of every hour (0:20, 1:20, 2:20, …), the spare (backup) server at minute 40. The last server (the “sorry” server) is always running. At minute 0 of every hour, the main and backup servers are “restored”.
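The schedule above can be written down as a small function. This is only a sketch of the described behaviour, not the actual code of the cat site; the role names `main`, `backup` and `sorry` and the function name are made up for illustration.

```python
def simulated_status(role: str, minute: int) -> str:
    """Return the status line a demo server shows at a given minute of the hour.

    Hypothetical role names: "main", "backup", "sorry".
    """
    if not 0 <= minute <= 59:
        raise ValueError("minute must be in 0..59")
    if role == "sorry":
        return "status=OK"  # the sorry server never fakes an error
    # Minute from which each role simulates a failure. Both roles are
    # "restored" when the hour rolls over to minute 0.
    fail_from = {"main": 20, "backup": 40}
    if role not in fail_from:
        raise ValueError(f"unknown role: {role}")
    return "status=ERR" if minute >= fail_from[role] else "status=OK"
```

So between minutes 20 and 39 only the main server is "down", and from minute 40 until the end of the hour only the sorry server reports status=OK.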


If you open the site and leave it in a tab, you will see that it never goes down (although each individual server periodically simulates a problem). When a server has a problem, the site simply “moves” between the live servers: the picture, the name and address of the server, and its role change. Sometimes you can catch the moment when status=ERR is shown (the problem has already started, but the failover has not kicked in yet), but the next refresh will show a page from a working server.

Failover on okerr + dynamic DNS

Let's see how it works under the hood. The task of failover is to ensure that the name cat.okerr.com always points to the IP address of a working server.
For each of the servers hosting our cat site, okerr has an indicator that checks its status once a minute.

[Screenshot: okerr indicator checking cat.okerr.com]

In this screenshot we see how the cat.okerr.com website is checked from the alpha.okerr.com server. The page must contain status=OK, and as we can see at the top, the indicator status is currently OK. When the server “breaks”, it will be ERR. (This is just one example of an indicator. okerr is a monitoring system, so you can use any type of indicator: for example, check free disk space or the number of new orders in the database, or even build logical indicators, for example with one error criterion at night and another during the day.)
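The idea of such a check fits in a few lines. The sketch below is not how okerr implements its indicators; it only shows the "page must contain status=OK" rule using the Python standard library, and the function names are made up for illustration.

```python
import urllib.request


def page_status(body: str, marker: str = "status=OK") -> str:
    """Return "OK" if the page text contains the marker, otherwise "ERR"."""
    return "OK" if marker in body else "ERR"


def check_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and evaluate it.

    Connection failures, timeouts and HTTP error codes count as ERR,
    because an unreachable server is just as bad as a broken page.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return "ERR"
    return page_status(body)
```

Keeping the text check separate from the network call makes the rule itself easy to test without a live server.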

In the project settings, we created a failover scheme with these indicators:

[Screenshot: failover scheme settings in the okerr project]

The scheme contains three indicators (three servers) with different priorities. The main server for the site is charlie; if it is down (unavailable, or the page lacks “status=OK”), bravo is used, and as a last resort alpha. The right side of the page shows the state of the DNS record on different DNS servers.
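The priority rule can be sketched as follows. The server names come from the scheme above; the IP addresses are placeholders from the documentation range, and the function is an illustration of the rule, not okerr's actual code.

```python
from typing import Optional

# Servers in priority order, highest first. The addresses are placeholders
# (documentation range, RFC 5737), not the real addresses of the cat site.
SERVERS = [
    ("charlie", "192.0.2.10"),
    ("bravo", "192.0.2.20"),
    ("alpha", "192.0.2.30"),
]


def pick_target(status: dict[str, str]) -> Optional[str]:
    """Return the address of the highest-priority server whose indicator is OK."""
    for name, address in SERVERS:
        if status.get(name) == "OK":  # a missing indicator counts as not OK
            return address
    return None  # nothing is healthy; the caller can leave the record as it is
```

Note that if bravo goes to ERR while charlie is still OK, the result does not change, so nothing needs to be updated in DNS.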

For those who noticed that cat.he.okerr.com is used: our scheme is slightly more complicated. Instead of changing the cat.okerr.com DNS record directly, we change cat.he.okerr.com (hosted at the dynamic DNS provider Hurricane Electric), while cat.okerr.com is a CNAME (alias) that never changes and always points to cat.he.okerr.com. We simply like Hurricane Electric better as a dynamic DNS provider, and it offers keys that manage a single record (rather than the entire zone), which we feel is safer. This way you don't have to give okerr a key or password that manages the entire domain, only one for a subdomain or a single record.
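okerr sends the update for you, but for illustration here is roughly what a manual update of a single record could look like. This is a sketch under assumptions: the endpoint `https://dyn.dns.he.net/nic/update` and the form fields `hostname`, `password` and `myip` are what Hurricane Electric's dynamic DNS is commonly documented to accept, so check their current documentation before relying on it. The key and the IP address are placeholders. The function only builds the request and does not send anything.

```python
import urllib.parse
import urllib.request

# Assumption: Hurricane Electric's dynamic DNS update endpoint.
HE_UPDATE_URL = "https://dyn.dns.he.net/nic/update"


def build_update_request(hostname: str, key: str, ip: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that sets one dynamic A record.

    `key` is the per-record key, so it cannot touch the rest of the zone.
    """
    form = urllib.parse.urlencode(
        {"hostname": hostname, "password": key, "myip": ip}
    ).encode("ascii")
    return urllib.request.Request(HE_UPDATE_URL, data=form, method="POST")
```

To actually send it you would pass the result to `urllib.request.urlopen(...)`. Because the key belongs to cat.he.okerr.com only, a leaked key cannot be used to change other records of the domain.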

From failure to recovery

Here is how the scheme works, step by step:

  1. A problem occurs (or is simulated) on the server
  2. The okerr sensor checks the status of each server once a minute and reports it to the project's main okerr server
  3. The corresponding server's indicator changes state from OK to ERR
  4. When an indicator changes state, the failover is recalculated: okerr works out which address should now be set, if a change is needed at all (for example, if the spare server dies while the main server is running, nothing changes)
  5. This address is sent to the dynamic DNS service. Once this stage is complete, you will see the status “synced” on the right
  6. Very soon (within seconds) the record reaches the DNS servers of your domain (for the cat site these are ns1-ns5.he.net)
  7. From this moment some users already land on the new live server. But not all DNS servers in the world have the updated record yet, and the old record may still be cached here and there. You can watch the data on public DNS servers “dance”, showing now the new value, now the old one. If you refresh the failover configuration page, okerr queries the DNS servers again.
  8. Once the data has stabilized and the old cached record has expired everywhere, 100% of requests go to the new server.
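The "dance" in the last two steps can be expressed as a number: the share of public resolvers that already return the new address. The sketch below only does the counting. The resolver addresses are well-known public DNS servers, while the answers and IP addresses are invented for the example; how you collect the answers (for instance with `dig @8.8.8.8 cat.okerr.com`) is left out.

```python
def propagation(answers: dict[str, str], new_ip: str) -> float:
    """Share of resolvers (0.0 to 1.0) that already return the new address.

    `answers` maps a resolver address to the IP it currently returns.
    """
    if not answers:
        raise ValueError("no resolver answers given")
    hits = sum(1 for ip in answers.values() if ip == new_ip)
    return hits / len(answers)


# Invented snapshot taken shortly after a switch from 192.0.2.10 to 192.0.2.20.
snapshot = {
    "8.8.8.8": "192.0.2.20",
    "1.1.1.1": "192.0.2.10",  # still serving the cached old record
    "9.9.9.9": "192.0.2.20",
    "208.67.222.222": "192.0.2.20",
}
print(propagation(snapshot, "192.0.2.20"))  # prints 0.75
```

When this value reaches 1.0 and stays there, the switch is complete for the resolvers you are watching.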

To shorten stage 7 (often the longest), set the TTL of the dynamic DNS record as low as possible. Services usually allow a minimum of 90-120 seconds, which is a perfectly reasonable compromise.
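A rough back-of-the-envelope estimate shows why the TTL matters. The sketch assumes the worst case is simply the sum of the check interval, the time to push the record to the authoritative DNS servers, and the TTL. The 5-second sync delay is an assumed figure standing in for "seconds", and the estimate ignores resolvers and browsers that cache longer than the TTL allows.

```python
def worst_case_switch_seconds(check_interval: int, ttl: int, sync_delay: int = 5) -> int:
    """Rough upper bound on how long a visitor may still be sent to the failed server.

    check_interval: seconds between indicator checks (60 for the cat site)
    ttl:            TTL of the dynamic DNS record, in seconds
    sync_delay:     assumed time to update the authoritative DNS servers
    """
    return check_interval + sync_delay + ttl


print(worst_case_switch_seconds(60, 90))   # prints 155
print(worst_case_switch_seconds(60, 120))  # prints 185
```

With a one-minute check and a 90-120 second TTL, the whole switch should fit into roughly three minutes even in the worst case.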

Additionally

All this can be set up in an evening (if you already have a second server). Both okerr and the dynamic DNS services are free. To get more checks in okerr and a shorter check period, you need to complete the training (from the profile page). Once you pass it, your level is raised immediately (20 indicators per hour + 1 fast, 10-minute one). If that is not enough, write to [email protected]; most likely the limits can be raised (so far this has always been possible, I have never refused anyone and have even offered it myself). I just don't want to promise everything to everyone up front, because I'm not sure there will be enough capacity to keep my word. But so far there are few users, so raising the limits is no problem.

To see what okerr can do, look at the presentation on the site. In general it is a monitoring system (Zabbix from the cloud), and failover is a nice extra feature. From the site you can also try the demo without registering.

When an indicator changes state, a notification is sent by email or Telegram. (We looked at what was going on here and concluded that Telegram seems to be the most reliable messenger. Thanks to RKN for the stress test!) With okerr configured properly, every notification means either “drop everything, this needs fixing!” or “all clear!”. There should be no unnecessary alerts from okerr (if there are, the setup needs adjusting). For example, on our cat site the alpha server is the last resort and never fakes an error. If it goes down, we need to know. But the other servers simulate errors all the time, so to avoid getting alerts several times an hour, their indicators are marked “silent”.
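The alerting rule described here boils down to one condition. The sketch below illustrates that rule only; the `Indicator` class and its `silent` field are hypothetical names, not okerr's data model.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    silent: bool = False  # silent indicators change state without alerting


def should_alert(indicator: Indicator, old: str, new: str) -> bool:
    """Alert only when a non-silent indicator really changes state."""
    return old != new and not indicator.silent


alpha = Indicator("alpha")                  # last resort, must never fail quietly
charlie = Indicator("charlie", silent=True)  # fakes errors every hour

print(should_alert(alpha, "OK", "ERR"))    # prints True
print(should_alert(charlie, "OK", "ERR"))  # prints False
```

A silent indicator still takes part in the failover calculation; it just does not wake anyone up.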

It also makes sense to set up a sorry server (on the cheapest hosting you can find) that either shows your apology page (in case all the main and backup servers are down) or redirects to a status page on okerr (for example, ours: cp.okerr.com/status/okerr) or on statuspage.io.

Source: habr.com
