Falling Down the Rabbit Hole: The story of one varnish reboot error - Part 1

Ghostinushanka, after pounding on the buttons for the previous 20 minutes as if his life depended on it, turns to me with a semi-wild expression in his eyes and a sly grin - "Dude, I think I understand."

“Look here,” he says, pointing to one of the characters on the screen, “I bet my red hat that if we add here what I just sent you” - pointing to another section of code - “the error is no longer will be displayed."

A little puzzled and tired, I change the sed statement we've been working on for a while, save the file, and run systemctl varnish reload. The error message has disappeared...

“The emails I exchanged with the candidate,” my colleague continued, as his smirk morphs into a genuine smile full of joy, “It suddenly dawned on me that this is exactly the same problem!”

How did it all start

The article assumes an understanding of how bash, awk, sed and systemd work. Knowledge of varnish is preferred but not required.
Timestamps in snippets have been changed.
Written with Ghostinushanka.
This text is a translation of the original published in English two weeks ago; translation boyikoden.

The sun shines through the panoramic windows on another warm autumn morning, a cup of freshly brewed caffeinated drink rests to the side of the keyboard, a favorite symphony of sounds plays in the headphones over the rustle of mechanical keyboards, and the first entry in the list of backlog tickets on the kanban board playfully glows with the fateful title “Investigate varnishreload sh: echo: I/O error in staging” (Investigate “varnishreload sh: echo: I/O error” in staging). When it comes to varnish, there are no and can't be any mistakes, even if they don't result in any problems, as in this case.

For those who are not familiar with varnishreload, this is a simple shell script used to reload the configuration varnish - also called VCL.

As the title of the ticket suggests, the error occurred on one of the servers in the stage, and since I was confident that varnish's routing in the stage was working properly, I assumed that this would be a minor mistake. So, just a message that got into an already closed output stream. I take a ticket for myself, in full confidence that I will mark it ready in less than 30 minutes, pat myself on the shoulder for clearing the board of the next junk and get back to more important things.

Crashing into a wall at 200 km/h

Opening the file varnishreload, on one of the servers running Debian Stretch, I saw a shell script less than 200 lines long.

Running through the script, I didn't see anything that could cause problems when running it multiple times directly from the terminal.

After all, this is a stage, even if it breaks, no one will complain, well ... not too much. I run the script and see what will be written out to the terminal, but the errors are no longer visible.

A couple more runs to make sure I can't reproduce the error without some extra effort, and I start to figure out how to change this script and make it still throw an error.

Can the script block STDOUT (using > &-)? Or STDERR? Neither worked in the end.

Obviously systemd changes the run environment in some way, but how, and why?
I turn on vim and edit varnishreload, adding set -x right under the shebang, hoping that debugging the output of the script will shed some light.

The file is fixed, so I reload varnish and see that the change completely broke everything ... The exhaust is a complete mess, with tons of C-like code in it. Even scrolling in the terminal is not enough to find where it starts. I am completely confused. Can debug mode affect the work of programs run in a script? No, bullshit. Bug in the shell? Several possible scenarios are flying in my head like cockroaches in different directions. A cup of caffeine-full drink empties instantly, a quick trip to the kitchen for a resupply and… let's go. I open the script and take a closer look at the shebang: #!/bin/sh.

/bin/sh - this is just a bash symlink, so the script is interpreted in POSIX-compatible mode, right? It wasn't there! The default shell on Debian is dash, which is exactly what refers /bin/sh.

# ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Jan 24  2017 /bin/sh -> dash

For the sake of trial, I changed the shebang to #!/bin/bash, removed set -x and tried again. Finally, on subsequent reload of varnish, a tolerable error appeared in the output:

Jan 01 12:00:00 hostname varnishreload[32604]: /usr/sbin/varnishreload: line 124: echo: write error: Broken pipe
Jan 01 12:00:00 hostname varnishreload[32604]: VCL 'reload_20190101_120000_32604' compiled

Line 124, here it is!

114 find_vcl_file() {
115         VCL_SHOW=$(varnishadm vcl.show -v "$VCL_NAME" 2>&1) || :
116         VCL_FILE=$(
117                 echo "$VCL_SHOW" |
118                 awk '$1 == "//" && $2 == "VCL.SHOW" {print; exit}' | {
119                         # all this ceremony to handle blanks in FILE
120                         read -r DELIM VCL_SHOW INDEX SIZE FILE
121                         echo "$FILE"
122                 }
123         ) || :
124
125         if [ -z "$VCL_FILE" ]
126         then
127                 echo "$VCL_SHOW" >&2
128                 fail "failed to get the VCL file name"
129         fi
130
131         echo "$VCL_FILE"
132 }

But as it turned out, line 124 is rather empty and of no interest. I could only assume that the error occurred as part of a multiline starting on line 116.
What is finally written to the variable VCL_FILE as a result of executing the above sub-shell?

At the beginning, it sends the contents of the variable VLC_SHOW, created on line 115, to the next command through the pipe. And then what happens there?

First, it uses varnishadm, which is part of the varnish installation package, to configure varnish without restarting.

subcommand vcl.show -v is used to output the entire VCL configuration specified in ${VCL_NAME}, to STDOUT.

To display the currently active VCL configuration, as well as several previous versions of varnish's routing configurations still in memory, you can use the varnishadm vcl.list, the output of which will be similar to the following:

discarded   cold/busy       1 reload_20190101_120000_11903
discarded   cold/busy       2 reload_20190101_120000_12068
discarded   cold/busy       16 reload_20190101_120000_12259
discarded   cold/busy       16 reload_20190101_120000_12299
discarded   cold/busy       28 reload_20190101_120000_12357
active      auto/warm       32 reload_20190101_120000_12397
available   auto/warm       0 reload_20190101_120000_12587

Variable value ${VCL_NAME} set in another part of the script varnishreload to the name of the currently active VCL, if any. In this case it will be “reload_20190101_120000_12397”.

Okay, variable. ${VCL_SHOW} contains the complete configuration for varnish, so far clear. Now I finally understand why dash output with set -x turned out to be so broken - it included the contents of the resulting configuration.

It is important to understand that a complete VCL configuration can often be cobbled together from multiple files. C-style comments are used to define where one configuration file has been included in another, and that's exactly what the whole line of code snippet below is about.
The syntax for comments describing included files has the following format:

// VCL.SHOW <NUM> <NUM> <FILENAME>

The numbers in this context are not important, we are interested in the file name.

So what happens in the swamp of commands that starts on line 116?
Let's face it.
The command consists of four parts:

  1. Simple echo, which displays the value of the variable ${VCL_SHOW}
    echo "$VCL_SHOW"
  2. awk, which looks for a line (record), where the first field, after splitting the text, will be “//”, and the second one will be “VCL.SHOW”.
    Awk will write out the first line that matches these patterns and then stop processing immediately.

    awk '$1 == "//" && $2 == "VCL.SHOW" {print; exit}'
  3. A block of code that stores field values ​​separated by spaces into five variables. The fifth variable FILE receives the rest of the line. Finally, the last echo writes out the contents of the variable ${FILE}.
    { read -r DELIM VCL_SHOW INDEX SIZE FILE; echo "$FILE" }
  4. Since all steps 1 to 3 are enclosed in a sub-shell, the output of the value $FILE will be written to a variable VCL_FILE.

As the comment on line 119 suggests, this serves the sole purpose of reliably handling cases where the VCL will refer to files with whitespace characters in their names.

I have commented out the original processing logic for ${VCL_FILE} and tried to change the sequence of commands, but it did not lead to anything. Everything worked cleanly for me, and in the case of starting the service, it gave an error.

It seems that the error is simply not reproducible when running the script manually, while the estimated 30 minutes have already ended six times and, in addition, a higher priority task has appeared, pushing the rest of the cases aside. The rest of the week was filled with a variety of tasks and was only slightly diluted with a talk on sed and an interview with the candidate. Error problem in varnishreload irretrievably lost in the sands of time.

Your so-called sed-fu... actually... rubbish

The following week had one fairly free day, so I decided to take up this ticket again. I hoped that in my brain, some background process all this time was looking for a solution to this problem, and this time I will definitely understand what's wrong.

Since last time just changing the code did not help, I just decided to rewrite it starting from the 116th line. In any case, the existing code was silly. And there is absolutely no need to use read.

Looking at the error again:
sh: echo: broken pipe - in this command, echo is in two places, but I suspect that the first is the more likely culprit (well, or at least an accomplice). Awk doesn't inspire confidence either. And in case it really is awk | {read; echo} the design leads to all these problems, why not replace it? This one-line command does not use all the features of awk, and even this extra read in appendage.

Since last week there was a report on sedI wanted to try my newly acquired skills and simplify echo | awk | { read; echo} into a more understandable echo | sed. While this is definitely not the best approach to catch the bug, I thought I'd at least try my sed-fu and maybe learn something new about the problem. Along the way, I asked my colleague, the sed talk writer, to help me come up with a more efficient sed script.

I dropped the content varnishadm vcl.show -v "$VCL_NAME" to a file so I can focus on writing the sed script without any of the hassle of service restarts.

A brief description of exactly how sed handles input can be found in his GNU manual. In the sed sources, the symbol n explicitly specified as a line separator.

In several passes, and with the advice of my colleague, we wrote a sed script that gave the same result as the entire original line 116.

Below is a sample file with input data:

> cat vcl-example.vcl
Text
// VCL.SHOW 0 1578 file with 3 spaces.vcl
More text
// VCL.SHOW 0 1578 file.vcl
Even more text
// VCL.SHOW 0 1578 file with TWOspaces.vcl
Final text

It may not be obvious from the description above, but we are only interested in the first comment // VCL.SHOW, and there can be several of them in the input data. This is why the original awk terminates after the first match.

# шаг первый, вывести только строки с комментариями
# используя возможности sed, определяется символ-разделитель с помощью конструкции '#' вместо обычно используемого '/', за счёт этого не придётся экранировать косые в искомом комментарии
# определяется регулярное выражение “// VCL.SHOW”, для поиска строк с определенным шаблоном
# флаг -n позаботится о том, чтобы sed не выводил все входные данные, как он это делает по умолчанию (см. ссылку выше)
# -E позволяет использовать расширенные регулярные выражения
> cat vcl-processor-1.sed
#// VCL.SHOW#p
> sed -En -f vcl-processor-1.sed vcl-example.vcl
// VCL.SHOW 0 1578 file with 3 spaces.vcl
// VCL.SHOW 0 1578 file.vcl
// VCL.SHOW 0 1578 file with TWOspaces.vcl

# шаг второй, вывести только имя файла
# используя команду “substitute”, с группами внутри регулярных выражений, отображается только нужная группa
# и это делается только для совпадений, ранее описанного поиска
> cat vcl-processor-2.sed
#// VCL.SHOW# {
    s#.* [0-9]+ [0-9]+ (.*)$#1#
    p
}
> sed -En -f vcl-processor-2.sed vcl-example.vcl
file with 3 spaces.vcl
file.vcl
file with TWOspaces.vcl

# шаг третий, получить только первый из результатов
# как и в случае с awk, добавляется немедленное завершения после печати первого найденного совпадения
> cat vcl-processor-3.sed
#// VCL.SHOW# {
    s#.* [0-9]+ [0-9]+ (.*)$#1#
    p
    q
}
> sed -En -f vcl-processor-3.sed vcl-example.vcl
file with 3 spaces.vcl

# шаг четвертый, схлопнуть всё в однострочник, используя двоеточия для разделения команд
> sed -En -e '#// VCL.SHOW#{s#.* [0-9]+ [0-9]+ (.*)$#1#p;q;}' vcl-example.vcl
file with 3 spaces.vcl

So the contents of the varnishreload script would look something like this:

VCL_FILE="$(echo "$VCL_SHOW" | sed -En '#// VCL.SHOW#{s#.*[0-9]+ [0-9]+ (.*)$#1#p;q;};')"

The above logic can be summarized as follows:
If string matches regular expression // VCL.SHOW, then greedily devour the text that includes both numbers in that line, and save whatever is left after this operation. Issue the stored value and end the program.

Simple, isn't it?

We were happy with the sed script and the fact that it replaces all of the original code. All my tests gave the desired results, so I changed the “varnishreload” on the server and ran again systemctl reload varnish. Filthy mistake echo: write error: Broken pipe laughed in our face again. A winking cursor was waiting for a new command to be entered in the dark void of the terminal...

Source: habr.com

Add a comment