Looking for a problem in the wrong place

This is a short story from real practice about how a small problem, well disguised by fault tolerance, turned into a headache.

The setup:

A small branch office has its own PBX (Asterisk + FreePBX) on desktop-grade hardware and an equally modest local terminal server running 1C, a file share, and a virtualized read-only domain controller. A MikroTik router provides Internet access. The branch is small, so that is enough for them.
It all started with monitoring (which, for lack of time and out of laziness, does not cover everything) reporting that one server in the branch, the one hosting the PBX, was overheating. While the local staff were dealing with that, the old machine hung and slightly corrupted its MySQL database.

Plenty of signs pointed to trouble, but not to this kind...

No big deal: the database was repaired, so everything should work. But the local staff complain that calls are being cut off. Fine, so something is wrong in FreePBX; I take a backup, restore it, and everything looks OK.
But the trouble is still there: the staff keep complaining that calls do not go through normally. Incoming calls seem fine, but when they place a call themselves, or call each other, there is a delay of several seconds. I start reading the voluminous and cryptic Asterisk and FreePBX logs, but I cannot see the problem in them. I recall an earlier issue with STUN and ICE that produced a similar delay. I disable all of it; the result is zero.

Despondency is a path to bad decisions:

I grow despondent. Many hours of poking at the PBX lead to nothing good; it is already late at night, and the problem is still not solved.
I leave the problem until the morning, hoping for a fresh mind. In the morning I make another poor decision: since the system is damaged (although the hang could hardly have been that destructive), I try to fix it by reinstalling all packages. The result is slightly above zero: the delay shrinks (not significantly, but it counts as progress).
I make yet another bad decision. The partial repair of the OS (and of the databases from the backup) brought little success, the root cause is still unclear, and a lot of time has already gone into the search, so I decide to act radically: wipe the OS and redeploy everything from scratch (fortunately, the automated process does this in reasonable time). I restore the FreePBX configuration from a copy. Another failure. Zero result!

Despair: the mind clouds over, and the decisions get even worse

I fall into despair. Very bad thoughts start to creep in: maybe the config in the backup is broken (it has happened to me before that a config stopped working after a series of updates, and I never found out why). Nothing else is left: I have to set everything up from scratch by hand. What a disgrace! The result is strictly zero, plus a lot of wasted time!

Acceptance is the path to awareness

In a desperate attempt to understand what is happening, I start studying the logs carefully, and I notice a pattern. A call to a single extension is delayed by exactly 5 seconds, and a call to a ring group of 3 extensions by 15! I start googling the call delay again, this time specifying the exact figure. And I stumble upon an answer I had already come across: people say the problem is DNS. But I know for sure there is no problem there; all addresses resolve!
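
The arithmetic behind that pattern can be sketched in a few lines. A delay that grows by a fixed amount per extension (5 s for one, 15 s for three) looks like one timeout per dialed target rather than load or a slow database. The function name `per_target_delay` and the tolerance are illustrative assumptions, not anything from Asterisk or FreePBX. As a side note, 5 seconds is also the default per-query timeout of the glibc resolver (the `timeout` option in `resolv.conf`), assuming the PBX host relies on it.

```python
from typing import Optional, Sequence, Tuple


def per_target_delay(
    observations: Sequence[Tuple[int, float]],
    tolerance: float = 0.5,
) -> Optional[float]:
    """Return the delay per dialed extension if it is constant, else None.

    observations: (number_of_extensions, observed_delay_seconds) pairs.
    tolerance: allowed deviation from the mean, in seconds (an assumption).
    """
    # Normalize each observation to "seconds of delay per extension".
    rates = [delay / count for count, delay in observations if count > 0]
    if not rates:
        return None
    mean = sum(rates) / len(rates)
    # A constant per-target cost suggests a fixed timeout is being hit
    # once for every extension that is dialed.
    if all(abs(rate - mean) <= tolerance for rate in rates):
        return mean
    return None


# The two measurements from the story: 1 extension and a group of 3.
print(per_target_delay([(1, 5.0), (3, 15.0)]))
```

A round, constant figure like this is a hint to search for configured timeouts (resolver, SIP, STUN) instead of reinstalling software.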

The obvious, yet incredible

With nothing else left to try, I pick up nslookup, and bingo (I wish I had done that right away)! The primary DNS server (the virtual machine with the domain controller) is down, and I had not noticed! Had there been only one DNS server, there would have been an outright error 😉
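
The same check can be scripted so that every configured DNS server is queried directly instead of through the system resolver, which silently fails over. Below is a minimal sketch using only the Python standard library; the function names `build_query` and `probe_dns_server` are hypothetical, and the packet is a bare-bones A-record query, not a full DNS client.

```python
import socket
import struct
import time
from typing import Optional


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length; a zero byte ends it.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def probe_dns_server(
    server: str, hostname: str, port: int = 53, timeout: float = 2.0
) -> Optional[float]:
    """Return the response time in seconds, or None if there was no valid reply."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        started = time.monotonic()
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return None
        elapsed = time.monotonic() - started
    # Any reply that echoes our query ID counts as "alive",
    # even if it carries an error code.
    if len(reply) < 2 or reply[:2] != query[:2]:
        return None
    return elapsed
```

Calling `probe_dns_server` once for each address listed in the resolver configuration (the addresses and the test hostname are yours to supply) shows immediately which server is silent, even while name resolution as a whole still appears to work.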

Conclusion

An elementary problem that monitoring could have caught (it is worth configuring for every node), masked by DNS failover, cost almost two working days spent on a silly situation. Laziness is to blame: setting up monitoring takes a minute, while looking for a problem where it does not exist takes two days.
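
As a rough illustration of what such a monitoring check could look like, the sketch below tests every DNS server individually, so that failover cannot hide a dead one. It is an assumption-laden example, not the author's setup: the names `tcp_port_open` and `check_resolvers` are hypothetical, the exit codes follow the common Nagios plugin convention, and an open TCP port 53 is a weaker signal than an actual DNS answer.

```python
import socket
from typing import Callable, Iterable, Tuple

# Exit codes in the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def tcp_port_open(server: str, port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the server's DNS port succeeds."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_resolvers(
    servers: Iterable[str],
    probe: Callable[[str], bool] = tcp_port_open,
) -> Tuple[int, str]:
    """Probe each DNS server separately and summarize the result."""
    servers = list(servers)
    if not servers:
        return CRITICAL, "CRITICAL: no DNS servers configured for the check"
    down = [server for server in servers if not probe(server)]
    if not down:
        return OK, f"OK: all {len(servers)} DNS servers answer"
    if len(down) < len(servers):
        # Resolution still works through the survivors, which is exactly
        # the situation that stays invisible without a per-server check.
        return WARNING, "WARNING: DNS servers down: " + ", ".join(down)
    return CRITICAL, "CRITICAL: all DNS servers down: " + ", ".join(down)
```

Whether a single dead server deserves a warning or a critical alert is a design choice; in the story above even the "harmless" case cost two days, so erring on the loud side seems reasonable.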



Source: habr.com
