How I recovered data in an unknown format from a magnetic tape

Background

As a fan of retro hardware, I once bought a ZX Spectrum+ from a seller in the UK. Along with the computer itself, I received several audio cassettes with games (in the original packaging, with instructions), as well as programs recorded on cassettes without any particular labeling. Surprisingly, the data on the 40-year-old cassettes read well, and I was able to load almost all of the games and programs from them.


However, on some cassettes I found recordings that were clearly not made by a ZX Spectrum. They sounded completely different and, unlike Spectrum recordings, did not start with the short BASIC loader that usually precedes programs and games.

For a while this haunted me: I really wanted to know what was hidden in them. If the audio signal could be read as a sequence of bytes, one could search those bytes for characters or anything else that hints at the origin of the signal. A kind of retro archaeology.

Now that I have come all the way and look at the labels on the cassettes themselves, I smile, because the answer was right in front of my eyes all this time.

(Spoiler: if you want to keep the intrigue to the end, skip the next paragraph.)

The label of the left cassette carries the name of the TRS-80 computer and, just below it, the name of the manufacturer: "Manufactured by Radio Shack in USA".

Comparison of audio signals

First of all, let's digitize the audio recordings: first the unknown one, then, for comparison, a typical recording from the ZX Spectrum.


In both cases the recording begins with a so-called pilot tone, a sound of a single frequency (in the first recording it is very short, under 1 second, but distinguishable). The pilot tone signals to the computer that it should prepare to receive data. As a rule, each computer recognizes only "its own" pilot tone, by waveform and frequency.

The shape of the signal itself deserves a mention. On the ZX Spectrum, for example, it is rectangular:

[Figure: ZX Spectrum signal with the pilot tone and the sync pulse]

When a pilot tone is detected, the ZX Spectrum displays alternating red and cyan stripes on the screen border, indicating that the signal has been recognized. The pilot tone ends with a sync pulse, which tells the computer to start receiving data. The sync pulse is shorter than both the pilot-tone pulses and the subsequent data pulses (see figure).

After the sync pulse is received, the computer registers each rise and fall of the signal and measures the time between them. If the duration is below a certain threshold, bit 0 is written to memory, otherwise bit 1 (on the Spectrum the shorter pulses encode 0 and the longer ones 1). The bits are collected into bytes, and the process repeats until N bytes have been received. The number N is usually taken from the header of the file being loaded. The loading sequence is as follows:

  1. pilot tone
  2. header (fixed length), containing the size of the data to load (N), the file name and the file type
  3. pilot tone
  4. the data itself

To make sure the data has been loaded correctly, the ZX Spectrum reads the last byte, the so-called parity byte, which is computed when the file is saved by XOR-ing all bytes of the written data. When reading a file, the computer computes the parity byte from the received data and, if the result differs from the stored one, displays the error message "R Tape loading error". Strictly speaking, the computer can issue this message even earlier if, while reading, it fails to recognize a pulse (the pulse is missing or its duration falls outside certain bounds).
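The XOR parity check described above can be illustrated with a few lines of Ruby. This is only a sketch of the idea, not the ROM routine; the helper names xor_parity and block_valid? are made up for the example.

```ruby
# XOR all bytes together; this yields the parity byte for a block of data.
def xor_parity(bytes)
  bytes.reduce(0) { |acc, byte| acc ^ byte }
end

# A block that already ends with its parity byte XORs to zero,
# because any value XOR-ed with itself gives 0.
def block_valid?(bytes_with_parity)
  xor_parity(bytes_with_parity).zero?
end

data = [0x01, 0x02, 0x04]
parity = xor_parity(data)                  # computed when saving
puts block_valid?(data + [parity])         # intact block
puts block_valid?([0x01, 0x02, 0x05, parity]) # one corrupted byte
```

A single corrupted byte always changes the XOR, which is why this simple check catches most tape read errors.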

Now let's see what the unknown signal looks like:

[Figure: waveform of the pilot tone in the unknown recording]

This is the pilot tone. The waveform is significantly different, but it is clear that the signal consists of short pulses repeating at a fixed frequency. At a sampling rate of 44100 Hz, the distance between the "peaks" is approximately 48 samples (which corresponds to a frequency of ~918 Hz). Let's remember this figure.

Now let's look at the fragment with the data:

[Figure: waveform of a data fragment in the unknown recording]

If you measure the distance between individual pulses, it turns out that the distance between "long" pulses is still ~48 samples, and between short ones ~24. Looking ahead a little, I'll say that in the end it turned out that "reference" pulses with a frequency of 918 Hz follow continuously, from the beginning to the end of the file. It can be assumed that during data transmission an additional pulse between two reference pulses means bit 1, and its absence means bit 0.
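To make this hypothesis concrete, here is a small Ruby sketch that turns a list of gaps between consecutive pulses (in samples) into bits. It is an illustration only and not the reader described later in the article; the helper name bits_from_gaps and the tolerance of 8 samples are assumptions.

```ruby
# Hypothetical helper: decode gaps between consecutive pulses into bits.
# One full gap (~48 samples) means no data pulse between two reference
# pulses, so the bit is 0. Two consecutive half gaps (~24 + ~24) mean a
# data pulse sat between the reference pulses, so the bit is 1.
def bits_from_gaps(gaps, full: 48, tolerance: 8)
  bits = []
  i = 0
  while i < gaps.size
    if (gaps[i] - full).abs <= tolerance
      bits << 0
      i += 1
    elsif i + 1 < gaps.size && (gaps[i] + gaps[i + 1] - full).abs <= tolerance
      bits << 1
      i += 2
    else
      raise ArgumentError, "unexpected gap #{gaps[i]} at index #{i}"
    end
  end
  bits
end

p bits_from_gaps([48, 24, 24, 47, 25, 23])
```

Real tape data is noisier than such a gap list, which is why the article's reader searches for amplitude maxima in a window instead of relying on exact pulse positions.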

What about the sync pulse? Let's look at the beginning of the data:

[Figure: waveform at the transition from the pilot tone to the data]

The pilot tone ends and the data begins immediately. A little later, after analyzing several different recordings, I found that the first data byte is always the same (10100101b, A5h). Presumably the computer starts reading data once it has received this byte.

Note also the shift of the first reference pulse immediately after the last 1 in the sync byte. I discovered it much later, while developing the recognition program, when the data at the beginning of the file could not be read reliably.

Now let's try to describe the algorithm that will process the audio file and load the data.

Loading data

First, let's make a few assumptions so as not to complicate the algorithm:

  1. We consider only files in WAV format;
  2. The audio file must begin with the pilot tone and must not contain silence at the beginning;
  3. The source file must have a sampling rate of 44100 Hz. In this case the distance of 48 samples between reference pulses is already known and does not need to be computed programmatically;
  4. The sample format can be anything (8/16 bit/floating point), since it can be converted to the desired one when reading;
  5. We assume that the source file is normalized in amplitude, which should stabilize the result.

The reading algorithm will be as follows:

  1. Read the file into memory, converting the sample format to 8 bits along the way;
  2. Determine the position of the first pulse in the audio data. To do this, find the number of the sample with the maximum amplitude. For simplicity, we determine it once by hand. Store it in the prev_pos variable;
  3. Add 48 to the position of the last pulse (pos := prev_pos + 48);
  4. Since adding 48 does not guarantee that we land exactly on the next reference pulse (tape defects, unstable tape transport, etc.), the position pos needs to be corrected. To do this, take a small window of data (pos-8;pos+8) and find the maximum amplitude in it. The position of that maximum is stored in pos. Here 8 = 48/6 is an experimentally obtained constant that ensures we find the correct maximum without picking up other pulses that may be nearby. In very bad cases, when the distance between pulses is much smaller or greater than 48, a forced search for the pulse could be implemented, but I will not describe that within this article;
  5. The previous step should also check that a reference pulse has been found at all. Simply looking for a maximum does not guarantee that the window actually contains a pulse. In my latest implementation of the reader, I check the difference between the maximum and minimum amplitude in the window and, if it exceeds a certain limit, treat the pulse as present. There is also the question of what to do when the reference pulse is not found. There are 2 options: either the data has ended and silence follows, or this should be treated as a read error. However, we omit this to simplify the algorithm;
  6. The next step is to detect the data pulse (bit 0 or 1). Take the middle of the segment (prev_pos;pos), middle_pos := (prev_pos+pos)/2, and compute the maximum and minimum amplitude in the neighborhood (middle_pos-8;middle_pos+8). If the difference between them is greater than 10, write bit 1 to the result, otherwise 0. The value 10 is a constant obtained empirically;
  7. Store the current position in prev_pos (prev_pos := pos);
  8. Repeat from step 3 until the entire file has been read;
  9. The resulting bit sequence must be stored as a set of bytes. Since the sync byte was not taken into account when reading, the number of bits may not be a multiple of 8, and the required bit offset is also unknown. In the first implementation of the algorithm I did not know about the sync byte and therefore simply saved 8 files with different bit offsets. One of them contained the correct data. In the final algorithm I simply remove all bits up to A5h, which immediately gives the correct output file.
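The pulse-presence check from step 5, which the simplified listing in this article leaves out, might look roughly like this. It is a sketch under assumptions: the helper name pulse_in_window? is invented, and the threshold of 10 is simply borrowed from step 6, since the text only says "a certain limit".

```ruby
# Hypothetical helper: decide whether the window around `center`
# contains a pulse by comparing the spread of amplitudes to a threshold.
# `samples` is an array of 8-bit values (0-255).
def pulse_in_window?(samples, center, radius: 8, threshold: 10)
  from = [center - radius, 0].max
  to = [center + radius, samples.size - 1].min
  return false if from > to # window lies entirely outside the data

  window = samples[from..to]
  window.max - window.min > threshold
end

silence = Array.new(100, 128)
with_pulse = silence.dup
with_pulse[52] = 200

puts pulse_in_window?(silence, 50)    # flat signal, no pulse
puts pulse_in_window?(with_pulse, 50) # spike inside the window
```

When the check fails for a reference pulse, the caller still has to decide between "end of data" and "read error", as discussed in step 5.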

The algorithm in Ruby, for those interested
I chose Ruby for the program because it is the language I use most of the time. It is not a high-performance option, but maximum reading speed is not a goal here.

# Uses the 'wavefile' gem
require 'wavefile'

reader = WaveFile::Reader.new('input.wav')
samples = []
format = WaveFile::Format.new(:mono, :pcm_8, 44100)

# Read the WAV file, converting it to mono, 8 bit
# The samples array will consist of bytes with values 0-255
reader.each_buffer(10000) do |buffer|
  samples += buffer.convert(format).samples
end

# Position of the first pulse (substitute the manually determined value for 0)
prev_pos = 0
# Distance between reference pulses
distance = 48
# Radius of the neighborhood used to search for a local maximum
delta = distance / 6
# Bits are accumulated as a string of "0" and "1"
bits = ""

loop do
  # Estimate the position of the next reference pulse
  pos = prev_pos + distance

  # Leave the loop when the data has run out
  break if pos + delta >= samples.size

  # Correct pos by locating the maximum on the segment [pos - delta;pos + delta]
  (pos - delta..pos + delta).each { |p| pos = p if samples[p] > samples[pos] }

  # Find the middle of the segment [prev_pos;pos]
  middle_pos = (prev_pos + pos) / 2

  # Take the neighborhood around the middle
  sample = samples[middle_pos - delta..middle_pos + delta]

  # The bit is "1" if the difference between the maximum and minimum values on the segment exceeds 10
  bit = sample.max - sample.min > 10
  bits += bit ? "1" : "0"

  # Step 7 of the algorithm: the pulse just found becomes the previous one
  # (without this assignment the loop would never advance)
  prev_pos = pos
end

# Find the sync byte and replace all preceding bits with 256 zero bits (per the format specification)
bits.sub!(/\A[01]*?10100101/, ("0" * 256) + "10100101")

# Save the output file in binary mode, packing the bits into bytes
File.binwrite "output.cas", [bits].pack("B*")
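For comparison, the first approach mentioned in step 9 (trying all 8 bit offsets before the sync byte was known) could be sketched as follows. The helper name offset_variants and the output file names are hypothetical.

```ruby
# Hypothetical helper: pack the same bit string into bytes at each of the
# 8 possible bit offsets. Trailing bits that do not fill a whole byte are
# dropped so that every variant contains only complete bytes.
def offset_variants(bits)
  (0...8).map do |offset|
    shifted = bits[offset..-1] || ""
    usable = shifted.size - shifted.size % 8
    [shifted[0, usable]].pack("B*")
  end
end

# Three stray bits, then the sync byte A5h and the letter "A" (41h)
bits = "000" + "10100101" + "01000001"
variants = offset_variants(bits)
p variants[3].bytes

# Each variant could then be saved and inspected by hand, for example:
# variants.each_with_index { |v, i| File.binwrite("output_#{i}.bin", v) }
```

Only one of the 8 variants is aligned to the byte boundaries of the original data; here it is the one at offset 3.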

Result

After trying several variants of the algorithm and constants, I was lucky to get something extremely interesting:

[Figure: contents of the recovered file]

So, judging by the character strings, this is a program for plotting graphs. However, there are no keywords in the program text: all keywords are encoded as single bytes (each with a value > 80h). Now we need to find out which computer from the 80s could save programs in this format.
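This kind of inspection can be done with a short script that pulls printable ASCII runs out of the binary file and counts the bytes at or above 80h, which cannot be plain 7-bit ASCII text and are therefore candidates for keyword tokens. This is a sketch for illustration; the helper names and the sample bytes are made up, and no particular token table is assumed.

```ruby
# Hypothetical helper: extract runs of printable ASCII characters
# (at least min_len long) from binary data, similar to the Unix
# `strings` tool.
def printable_runs(data, min_len = 4)
  data.b.scan(/[\x20-\x7E]{#{min_len},}/)
end

# Hypothetical helper: count bytes with the high bit set (>= 80h).
def high_byte_count(data)
  data.each_byte.count { |byte| byte >= 0x80 }
end

# Made-up sample: two low bytes, one high byte, a quoted string, a terminator
sample = [0x0A, 0x00, 0xB2, 0x22, *"PLOT GRAPH".bytes, 0x22, 0x00].pack("C*")
p printable_runs(sample)
p high_byte_count(sample)
```

Applied to a recovered file (read with File.binread), readable strings mixed with many high bytes are a hint that the content is a tokenized BASIC program.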

In fact, it looks very much like a BASIC program. The ZX Spectrum stores programs in memory and saves them to tape in roughly the same format. Just in case, I checked the keywords against its keyword table. Predictably, the result was negative.

I also checked the BASIC keywords of the Atari and Commodore 64 computers, which were popular at the time, and of several others for which I could find documentation, but to no avail: my knowledge of the varieties of retro computers was not that broad.

Then I decided to go down the list, and my eyes fell on the name of the manufacturer, Radio Shack, and of the TRS-80 computer. These were the very names written on the labels of the cassettes lying on my desk! I had not known these names before and was not familiar with the TRS-80 computer, so I had assumed that Radio Shack was a manufacturer of audio cassettes, like BASF, Sony or TDK, and that TRS-80 was the playback duration. Why not?

The Tandy/Radio Shack TRS-80 computer

It is very likely that the audio recording in question, which I gave as an example at the beginning of the article, was made on such a computer:

[Figure: TRS-80 computer]

It turned out that this computer and its variants (Model I/Model III/Model IV, etc.) were very popular in their time (not in Russia, of course). Notably, they also used the Z80 processor. A lot of information about this computer can be found on the Internet. In the 80s, information about it was distributed in magazines. Today there are several emulators of these computers for different platforms.

I downloaded the emulator trs80gp and for the first time was able to see how this computer worked. Of course, the computer did not support color output, and the screen resolution was only 128x48 pixels, but there were many extensions and modifications that could increase the screen resolution. There were also many operating systems for this computer and many implementations of the BASIC language (which, unlike on the ZX Spectrum, in some models was not even "flashed" into ROM, so any variant could be loaded from a floppy disk, just like the OS itself).

I also found a utility for converting audio recordings to the CAS format supported by emulators, but for some reason it could not read the recordings from my cassettes.

Having figured out the CAS file format (which turned out to be simply a bit-by-bit copy of the tape data I already had on hand, apart from the header with the sync byte), I made a few changes to my program and got a working CAS file as output, which ran in the emulator (TRS-80 Model III):
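A quick sanity check of a generated CAS file could look like the sketch below. It assumes only what the program in this article produces, namely a byte-aligned leader of zero bytes followed by the sync byte A5h; the helper name cas_sync_offset is hypothetical, and the check does not validate anything after the sync byte.

```ruby
# Hypothetical helper: return the index of the sync byte (A5h) if the
# data starts with a run of zero bytes immediately followed by A5h,
# otherwise nil. `bytes` is an array of integers 0-255.
def cas_sync_offset(bytes)
  index = bytes.index { |byte| byte != 0 }
  index if index && bytes[index] == 0xA5
end

# 256 zero bits are 32 zero bytes, followed by the sync byte and some data
image = [0] * 32 + [0xA5, 0x55]
p cas_sync_offset(image)

# For a real file: cas_sync_offset(File.binread("output.cas").bytes)
```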

[Figure: the recovered program running in the emulator]

I packaged the latest version of the conversion utility, which automatically detects the first pulse and the distance between reference pulses, as a Ruby gem; the source code is available on GitHub.

Conclusion

The path I traveled turned out to be a fascinating journey into the past, and I am glad that in the end I found the answer. Among other things, I:

  • Figured out the format in which the ZX Spectrum saves data, and studied the ROM routines for saving data to and reading it from audio cassettes
  • Got acquainted with the TRS-80 computer and its variants, studied the operating system, looked at sample programs and even had the chance to debug in machine code (all the Z80 mnemonics are familiar to me, after all)
  • Wrote a full-fledged utility for converting audio recordings to the CAS format, which can read data that the "official" utility fails to recognize

Source: habr.com
