The easier the task, the more often I make mistakes

The easier the task, the more often I make mistakes

This trivial task arose on one of the Fridays and was supposed to take 2-3 minutes. In general, as always.

A colleague asked me to fix the script on his server. I did it, handed it over to him and dropped it inadvertently: "Time is in a hurry for 5 minutes." His server, let him deal with the synchronization. Half an hour, an hour has passed, and he is still puffing and swearing quietly.

“Idiot! - I thought, switching to the server console - well, I'll come off for a couple more minutes.

Look ntp, rdate, sdwdate not installed timesyncd disabled and not running.

# timedatectl
      Local time: Sun 2019-08-25 20:44:39 +03
  Universal time: Sun 2019-08-25 17:44:39 UTC
        RTC time: Sun 2019-08-25 17:39:52
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: no
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

Here I will immediately note that the hardware time is correct: it will be easier to navigate further by it.

This is where the series of mistakes began.

The first mistake. self-confidence

Clack-clack…

# systemctl enable systemd-timesyncd.service && systemctl start systemd-timesyncd.service && ntpdate 0.ru.pool.ntp.org && timedatectl set-ntp on && timedatectl
25 Aug 21:00:10 ntpdate[28114]: adjust time server 195.210.189.106 offset -249.015251 sec
      Local time: Sun 2019-08-25 21:00:10 +03
  Universal time: Sun 2019-08-25 18:00:10 UTC
        RTC time: Sun 2019-08-25 18:00:10
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a

Everything is fine, the time is synchronized, the system coincides with the hardware. “Take it,” I said, and went back to my business.

“What do you take? - the colleague was indignant. “The time is past!”

The more you solve typical problems, the more your thinking becomes blinkered and you no longer think that the hundredth or thousandth situation will be different, but not this time.

# timedatectl
      Local time: Sun 2019-08-25 21:09:15 +03
  Universal time: Sun 2019-08-25 18:09:15 UTC
        RTC time: Sun 2019-08-25 18:05:04
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

The system time is wrong again.

Let's try again:

# ntpdate 0.ru.pool.ntp.org && timedatectl && sleep 1 && timedatectl
25 Aug 21:07:37 ntpdate[30350]: step time server 89.175.20.7 offset -249.220828 sec
      Local time: Sun 2019-08-25 21:07:37 +03
  Universal time: Sun 2019-08-25 18:07:37 UTC
        RTC time: Sun 2019-08-25 18:07:37
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
      Local time: Sun 2019-08-25 21:11:46 +03
  Universal time: Sun 2019-08-25 18:11:46 UTC
        RTC time: Sun 2019-08-25 18:07:37
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

Let's do it differently:

# date -s "2019-08-25 21:10:30" && date && sleep 1 && timedatectl
Sun Aug 25 21:10:30 +03 2019
Sun Aug 25 21:10:30 +03 2019
      Local time: Sun 2019-08-25 21:14:36 +03
  Universal time: Sun 2019-08-25 18:14:36 UTC
        RTC time: Sun 2019-08-25 18:10:30
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

And like this:

# hwclock --hctosys && timedatectl && sleep 1 && timedatectl
      Local time: Sun 2019-08-25 21:11:31 +03
  Universal time: Sun 2019-08-25 18:11:31 UTC
        RTC time: Sun 2019-08-25 18:11:31
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a
      Local time: Sun 2019-08-25 21:15:36 +03
  Universal time: Sun 2019-08-25 18:15:36 UTC
        RTC time: Sun 2019-08-25 18:11:32
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

The time is set for a fraction of a second, and immediately begins to "rush" again.

At the same time, in the logs, at the time of such a manual change, we see only system reports that the time has changed, respectively, in the right / wrong direction and occasionally Resyncing from systemd-timesyncd.

Aug 25 21:18:51 wisi systemd[1]: Time has been changed
Aug 25 21:18:51 wisi systemd-timesyncd[29258]: System time changed. Resyncing.
Aug 25 21:18:51 wisi systemd[1187]: Time has been changed
Aug 25 21:18:51 wisi systemd[1]: Time has been changed
Aug 25 21:18:51 wisi systemd[1187]: Time has been changed

here

# ps afx | grep "[1]187"
 1187 ?        Ss     0:02 /lib/systemd/systemd --user

At this moment, it was already necessary to look for the cause, but the brain for 18 years of administration has accumulated statistics of “time” errors and, out of habit, again blames synchronization.
Turn it off completely.

# timedatectl set-ntp off && systemctl stop systemd-timesyncd.service
# hwclock --hctosys && timedatectl && sleep 1 && timedatectl
      Local time: Sun 2019-08-25 21:25:40 +03
  Universal time: Sun 2019-08-25 18:25:40 UTC
        RTC time: Sun 2019-08-25 18:25:40
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: no
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a
      Local time: Sun 2019-08-25 21:29:31 +03
  Universal time: Sun 2019-08-25 18:29:31 UTC
        RTC time: Sun 2019-08-25 18:25:41
       Time zone: Europe/Minsk (+03, +0300)
     NTP enabled: no
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a

and in logs

Aug 25 21:25:40 wisi systemd[1]: Time has been changed
Aug 25 21:25:40 wisi systemd[1187]: Time has been changed
Aug 25 21:29:30 wisi systemd[1]: Time has been changed
Aug 25 21:29:30 wisi systemd[1187]: Time has been changed

Resyncing disappeared and the rest of the logs are pristine.

Checking the conclusions Tcpdump on port 123 on all interfaces. There are no requests, but time still “runs away”.

Second error. Rush

There is an hour left until the end of the working week, and I don’t want to leave for the weekend with a trifling unsolved problem (ignore the time in the code, the article was written in the following days).
And here again, instead of looking for the cause, I began to try to come up with an explanation for the result. I say “invent” because no matter how logical the explanations for the result are, this is the wrong approach to solving the problem.

This server is streaming and converts the DVB-S2 stream to IP. Timestamps are present in the DVB-S stream, so receivers, multiplexers, scramblers and televisions often use them to synchronize the system clock. Drivers for DVB-S boards are compiled into the kernel, so the fastest way to ensure that the DVB-S2 stream is removed is to disable the cables coming from the "plates". Fortunately, the server is behind the wall, so be it.

Of course, if the logs had what should be there, this would not have happened, but more on that, again, at the end of the article.

Well, since we have already removed all satellite signals, we will also remove terrestrial ones - along the way, we pull out all the network cables. The server becomes cut off from the outside world and works completely autonomously, but the system clock is still in a hurry.

The work week is over, and the date / time issue itself is not critical on it, so you can just go home, but here I make a new mistake.

Error three. advisers

Never! Never ask questions on forums and general purpose (a la stackoverflow) sites if the answer to it requires more than learning Google's first page and reading one man page.

You will be sent back to google, read the same man and popularly explain the rules of the forum / site, but will not give an answer.

Here are some objective factors:

  • no one but you can know the problem as well;
  • no one can test in the same conditions as you

and subjective:

  • you may not give out all the input for solving the problem, because you have already come up with the “right” direction and state the essence of the issue, focusing on it;
  • the foreman (moderator, old-timer, admin) is always right, if the foreman is wrong ... well, you know ...

If during the response comments you remained within the framework of censorship, then you have strong nerves.

Solution

No need to divide tasks into simple and complex.

We stop relying on our experience, statistics, advisers and start not to “explain” the end result, but to consistently look for the cause.

Since someone sets the time, the corresponding system call must occur.

As in software documentation, the best docks are source codes, so in system administration, the best assistant is an audit, in our case auditd.

moment of doubtI ran through the manas, but was not completely sure that the time in Linux can only be set clock_settime и settimeofday, so for the first test I chose all the "suitable" calls:

# man syscalls | col | grep -F '(2)' | grep -vE '(:|;)' | grep -E '(time|date|clock)' | sed "s/(2).*//" | xargs -I SYSCALL echo "-S SYSCALL " | xargs echo
-S adjtimex -S clock_adjtime -S clock_getres -S clock_gettime -S clock_nanosleep -S clock_settime -S futimesat -S getitimer -S gettimeofday -S mq_timedreceive -S mq_timedsend -S rt_sigtimedwait -S s390_runtime_instr -S setitimer -S settimeofday -S stime -S time -S timer_create -S timer_delete -S timer_getoverrun -S timer_gettime -S timer_settime -S timerfd_create -S timerfd_gettime -S timerfd_settime -S times -S utime -S utimensat -S utimes

and discarding s390_runtime_instr, stime, timerfd_create, of which auditctl did not recognize, initially launched the audit in the form:

auditctl -a exit,always -S adjtimex -S clock_adjtime -S clock_getres -S clock_nanosleep -S clock_settime -S futimesat -S getitimer -S gettimeofday -S mq_timedreceive -S mq_timedsend -S rt_sigtimedwait -S semtimedop -S setitimer -S settimeofday -S time -S timer_create -S timer_delete -S timer_getoverrun -S timer_gettime -S timer_settime -S timerfd_gettime -S timerfd_settime -S times -S utime -S utimensat -S utimes

After making sure that there are no other logs in the places of interest to me syscalls besides these two, I used only them further.

Start auditing system calls clock_settime и settimeofday and try to change the date:

# auditctl -a exit,always -S clock_settime -S settimeofday && date -s "2019-08-22 12:10:00" && sleep 5 && auditctl -D

A five second delay is added to make sure our "parasite" corrects the time.

We look at the report:

# aureport -s -i

Syscall Report
=======================================
# date time syscall pid comm auid event
=======================================
Warning - freq is non-zero and incremental flushing not selected.
1. 08/22/2019 12:10:00 settimeofday 3088 chkcache_proces root 479630
2. 08/26/2019 09:37:06 clock_settime 1538 date root 479629

Here we see our data and unknown to us chkcache_processes. It ended up in the report above because aureport sorted the output by date when converted from a binary view, and the event happened at the time we set date -s "2019-08-22 12:10:00".
Who gave birth to it?

# ausearch -sc settimeofday --comm "chkcache_proces"
----
time->Thu Aug 22 12:10:00 2019
type=PROCTITLE msg=audit(1566465000.000:479630): proctitle="/usr/local/bin/oscam"
type=SYSCALL msg=audit(1566465000.000:479630): arch=c000003e syscall=164 success=yes exit=0 a0=7fde0dfc6e60 a1=0 a2=136cf a3=713ba56 items=0 ppid=3081 pid=3088 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts20 ses=68149 comm="chkcache_proces" exe="/usr/local/bin/oscam" key=(null)

/usr/local/bin/oscam Our parasite has been found. Despite its "malicious" behavior, you can not refuse the conditional access system, but still I would like to know if oscam, WTF?

The answer is quickly found in sources:

#if defined(CLOCKFIX)
if (tv.tv_sec > lasttime.tv_sec || (tv.tv_sec == lasttime.tv_sec && tv.tv_usec >= lasttime.tv_usec)) // check for time issues!
{
  lasttime = tv; // register this valid time
}
  else
{
  tv = lasttime;
  settimeofday(&tv, NULL); // set time back to last known valid time
  //fprintf(stderr, "*** WARNING: BAD TIME AFFECTING WHOLE OSCAM ECM HANDLING, SYSTEMTIME SET TO LAST KNOWN VALID TIME **** n");
}

How cute does it look here? commented out line warning...

Source: habr.com

Add a comment