Data recovery from XtraDB tables without structure file using ibd file byte parsing

Backstory

It so happened that a server was hit by a ransomware virus which, by a "lucky chance", left the .ibd files (the raw data files of InnoDB tables) partially intact, while completely encrypting the .frm files (the table structure files). The .ibd files could in turn be divided into:

  • tables recoverable with standard tools and guides; for such cases there are already excellent write-ups;
  • partially encrypted tables, mostly large ones, for which (as I understand it) the attackers did not have enough RAM for full encryption;
  • and fully encrypted tables, which cannot be recovered.

It was possible to determine which group a table belongs to simply by opening it in any text editor with the right encoding (UTF-8 in my case) and scanning the file for readable text fields, for example:

[Screenshot: readable text fields in a partially intact .ibd file]

Also, at the beginning of the file you can normally see a large run of zero bytes, and viruses that use a block cipher algorithm (the most common kind) usually encrypt these runs as well, which makes the damage easy to spot.

[Screenshot: zero bytes at the beginning of an intact .ibd file]

In my case, the attackers left a signature of 4 bytes (1, 0, 0, 0) at the end of each encrypted file, which made the task easier. A short script was enough to find the unencrypted files:

import os

def opened(path):
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            yield full

for full_path in opened(r"C:\somepath"):
    with open(full_path, "rb") as f:
        f.seek(-4, os.SEEK_END)      # jump straight to the 4-byte tail
        tail = f.read(4)
    if tail != bytes([1, 0, 0, 0]):  # no ransomware signature: file is intact
        print(full_path)

Thus it was possible to find the files of the first type. The second type implies a lot of manual work, but what had already been found was enough. Everything would have been fine, but you need to know the exact table structure, and (of course) it turned out that I had to work with a frequently changed table. No one remembered whether a field type had been changed or a new column added.

Unfortunately, searching the depths of the internet did not help with such a case, which is why this article was written.

Get to the point

There is a table structure from 3 months ago that does not match the current one (possibly by one field, possibly by more). Table structure:

CREATE TABLE `table_1` (
    `id` INT(11),
    `date` DATETIME,
    `description` TEXT,
    `id_point` INT(11),
    `id_user` INT(11),
    `date_start` DATETIME,
    `date_finish` DATETIME,
    `photo` INT(1),
    `id_client` INT(11),
    `status` INT(1),
    `lead__time` TIME,
    `sendstatus` TINYINT(4)
);

The fields to be extracted:

  • id_point int(11);
  • id_user int(11);
  • date_start DATETIME;
  • date_finish DATETIME.

For recovery, a byte-by-byte analysis of the .ibd file is used, followed by translating the bytes into a more readable form. Since analyzing the int and datetime types is enough to find the required fields, only these are described in the article, although I will sometimes refer to other data types, which may help in other similar incidents.

Problem 1: the fields with DATETIME and TEXT types contained NULL values, and NULLs are simply skipped in the file, so in my case it was not possible to determine the structure to restore from them alone. In the new columns the default value was NULL, and part of the transactions could have been lost because of the innodb_flush_log_at_trx_commit = 0 setting, so additional time would have to be spent to determine the structure.

Problem 2: note that rows deleted via DELETE will still be present in the .ibd file, but ALTER TABLE does not rewrite their structure. As a result, the row structure can vary from the beginning of the file to its end. If you run OPTIMIZE TABLE often, you are unlikely to encounter this problem.

Note: the DBMS version affects how the data is stored, and this example may not work for other major versions. In my case the Windows build of MariaDB 10.1.24 was used. Also, although in MariaDB you nominally work with InnoDB tables, they are in fact XtraDB, so the method is not directly applicable to InnoDB in MySQL.

File analysis

In Python, the bytes type exposes raw file contents as a sequence of integers in the range 0 to 255. The file can be inspected in this form, but for convenience you can convert the byte string into a plain list of numbers (list(example_byte_array)). Either representation is suitable for the analysis.
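A minimal illustration:

```python
data = bytes([128, 0, 75, 108])   # four raw bytes
as_list = list(data)              # [128, 0, 75, 108]
first = data[0]                   # indexing bytes yields an int: 128
```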

Looking through several .ibd files, you can find the following:

[Screenshot: recurring "infimum" and "supremum" markers in the file]

Moreover, if you split the file by these keywords, you get mostly evenly sized blocks of data. We will use infimum as the delimiter.

table = table.split("infimum".encode())
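For context, a self-contained sketch of the split; the byte string here is a synthetic stand-in, since a real run would read the raw file with open(path, "rb").read():

```python
# Synthetic stand-in for raw .ibd bytes (not a real page layout)
table = b"index-area...infimum\x00\x02row-one-bytesinfimumrow-two-bytes"
blocks = table.split("infimum".encode())
# blocks[0] holds everything before the first marker;
# blocks[1:] are the data blocks described below
```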

An interesting observation: for tables with a small amount of data, there is a counter of the number of rows in the block between infimum and supremum.

[Screenshot: test table with 1 row]

[Screenshot: test table with 2 rows]

The block table[0] can be skipped. After reviewing it, I could not find any raw table data in it; most likely, this block is used to store indexes and keys.
Starting with table[1] and converting it into a numeric array, you can already notice some patterns, namely:

[Screenshot: recurring byte patterns starting with 128 in the numeric array]

These are int values stored in the row. The first byte indicates whether the number is positive or negative (InnoDB flips the sign bit, so 128 means non-negative). In my case all numbers are positive, and the value can be recovered from the remaining 3 bytes using the following function. Script:

def find_int(val: str):  # example: '128, 1, 2, 3'
    # InnoDB stores INT big-endian with the sign bit flipped,
    # so (first byte - 128) restores the sign; for non-negative
    # values the first byte is always 128
    b = [int(v) for v in val.split(", ")]
    return (b[0] - 128) * 256**3 + b[1] * 256**2 + b[2] * 256 + b[3]

For example, 128, 0, 0, 1 decodes to 1, and 128, 0, 75, 108 to 19308.
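The same decoding can be written with the standard int.from_bytes; a sketch (the name find_int_alt is mine), which also covers negative values under the sign-bit-flip assumption:

```python
def find_int_alt(raw: bytes) -> int:
    # InnoDB flips the sign bit of stored integers, so subtracting
    # 2**31 from the big-endian value restores the signed number
    return int.from_bytes(raw, "big") - 2**31

print(find_int_alt(bytes([128, 0, 75, 108])))  # 19308
```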
The table had an auto-increment primary key, and it can also be found here:

[Screenshot: auto-increment primary key values in the byte stream]

Comparing the data from the test tables revealed that a DATETIME value consists of 5 bytes starting with 153 (most likely marking year ranges). Since the DATETIME range is '1000-01-01' to '9999-12-31', I think the leading byte may vary, but in my case the data falls within the period from 2016 to 2019, so we will assume that 5 bytes are enough.

To determine the time without seconds, the following functions were written. Script:

day_ = lambda x: x % 64 // 2  # {x,x,X,x,x }

def hour_(x1, x2):  # {x,x,X1,X2,x}
    if x1 % 2 == 0:
        return x2 // 16
    elif x1 % 2 == 1:
        return x2 // 16 + 16
    else:
        raise ValueError

min_ = lambda x1, x2: (x1 % 16) * 4 + (x2 // 64)  # {x,x,x,X1,X2}
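The three helpers above extract bit fields from the packed value one byte at a time. As a cross-check, here is a sketch that decodes all five bytes at once, under the assumption that MariaDB uses the MySQL 5.6 packed temporal layout (a sign bit, 17 bits holding year*13 + month, then 5 bits day, 5 bits hour, 6 bits minute, 6 bits second); treat this as a hypothesis to verify against your own data:

```python
def decode_datetime_bits(b) -> tuple:
    # Assumed layout (MySQL 5.6 packed DATETIME, 40 bits total):
    # sign(1) | year*13+month(17) | day(5) | hour(5) | minute(6) | second(6)
    v = int.from_bytes(bytes(b), "big")
    ym = (v >> 22) & 0x1FFFF
    return (ym // 13, ym % 13,          # year, month
            (v >> 17) & 0x1F,           # day
            (v >> 12) & 0x1F,           # hour
            (v >> 6) & 0x3F,            # minute
            v & 0x3F)                   # second
```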

For the year and month, I could not come up with a cleanly working function, so I had to hard-code the values. Script:

ym_list = {'2016, 1': '153, 152, 64', '2016, 2': '153, 152, 128', 
           '2016, 3': '153, 152, 192', '2016, 4': '153, 153, 0',
           '2016, 5': '153, 153, 64', '2016, 6': '153, 153, 128', 
           '2016, 7': '153, 153, 192', '2016, 8': '153, 154, 0', 
           '2016, 9': '153, 154, 64', '2016, 10': '153, 154, 128', 
           '2016, 11': '153, 154, 192', '2016, 12': '153, 155, 0',
           '2017, 1': '153, 155, 128', '2017, 2': '153, 155, 192', 
           '2017, 3': '153, 156, 0', '2017, 4': '153, 156, 64',
           '2017, 5': '153, 156, 128', '2017, 6': '153, 156, 192',
           '2017, 7': '153, 157, 0', '2017, 8': '153, 157, 64',
           '2017, 9': '153, 157, 128', '2017, 10': '153, 157, 192', 
           '2017, 11': '153, 158, 0', '2017, 12': '153, 158, 64', 
           '2018, 1': '153, 158, 192', '2018, 2': '153, 159, 0',
           '2018, 3': '153, 159, 64', '2018, 4': '153, 159, 128', 
           '2018, 5': '153, 159, 192', '2018, 6': '153, 160, 0',
           '2018, 7': '153, 160, 64', '2018, 8': '153, 160, 128',
           '2018, 9': '153, 160, 192', '2018, 10': '153, 161, 0', 
           '2018, 11': '153, 161, 64', '2018, 12': '153, 161, 128',
           '2019, 1': '153, 162, 0', '2019, 2': '153, 162, 64', 
           '2019, 3': '153, 162, 128', '2019, 4': '153, 162, 192', 
           '2019, 5': '153, 163, 0', '2019, 6': '153, 163, 64',
           '2019, 7': '153, 163, 128', '2019, 8': '153, 163, 192',
           '2019, 9': '153, 164, 0', '2019, 10': '153, 164, 64', 
           '2019, 11': '153, 164, 128', '2019, 12': '153, 164, 192',
           '2020, 1': '153, 165, 64', '2020, 2': '153, 165, 128',
           '2020, 3': '153, 165, 192','2020, 4': '153, 166, 0', 
           '2020, 5': '153, 166, 64', '2020, 6': '153, 166, 128',
           '2020, 7': '153, 166, 192', '2020, 8': '153, 167, 0', 
           '2020, 9': '153, 167, 64','2020, 10': '153, 167, 128',
           '2020, 11': '153, 167, 192', '2020, 12': '153, 168, 0'}

def year_month(x1, x2):  # {x,X,X,x,x }

    for key, value in ym_list.items():
        key = [int(k) for k in key.replace("'", "").split(", ")]
        value = [int(v) for v in value.split(", ")]
        if x1 == value[1] and x2 // 64 == value[2] // 64:
            return key
    return 0, 0

I am sure that with some time spent, this shortcoming could be corrected.
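In fact, under the packed-format assumption (17 bits containing year*13 + month after a sign bit), the lookup table can be replaced with arithmetic. A sketch taking the first three bytes of the value (the function name and signature are mine; verify it against ym_list before relying on it):

```python
def year_month_calc(b0, b1, b2):
    # Reassemble the 18 leading bits from the first three bytes,
    # drop the sign bit, then split year*13 + month
    ym = ((b0 << 10) | (b1 << 2) | (b2 >> 6)) & 0x1FFFF
    return ym // 13, ym % 13
```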
Next, a function that returns a datetime object from a string. Script:

def find_data_time(val:str):
    val = [int(v) for v in val.split(", ")]
    day = day_(val[2])
    hour = hour_(val[2], val[3])
    minutes = min_(val[3], val[4])
    year, month = year_month(val[1], val[2])
    return datetime(year, month, day, hour, minutes)

It was possible to detect frequently repeated sequences of the form int, int, datetime, datetime, which seems to be exactly what is needed. Moreover, such a sequence never occurs twice within one row.

Using a regular expression, we find the necessary data:

fined = re.findall(r'128, \d*, \d*, \d*, 128, \d*, \d*, \d*, 153, 1[3-6]\d, \d*, \d*, \d*, 153, 1[3-6]\d, \d*, \d*, \d*', int_array)

Note that this expression cannot detect NULL values in the required fields, but in my case that is not critical. Then we iterate over the matches in a loop. Script:

result = []
for val in fined:
    pre_result = []
    bd_int = re.findall(r"128, \d*, \d*, \d*", val)
    bd_date = re.findall(r"153, 1[3-6]\d, \d*, \d*, \d*", val)
    for it in bd_int:
        pre_result.append(find_int(it))
    for bd in bd_date:
        pre_result.append(find_data_time(bd))
    result.append(pre_result)
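A quick self-contained check of the search expression on a synthetic numeric-array string (all byte values below are invented for illustration):

```python
import re

# One imitation row fragment: two ints (leading 128) followed by
# two 5-byte DATETIME groups (leading 153); values are made up
int_array = ("128, 0, 0, 7, 128, 0, 0, 42, "
             "153, 158, 202, 119, 64, 153, 158, 202, 123, 0")
pattern = (r'128, \d*, \d*, \d*, 128, \d*, \d*, \d*, '
           r'153, 1[3-6]\d, \d*, \d*, \d*, 153, 1[3-6]\d, \d*, \d*, \d*')
matches = re.findall(pattern, int_array)
print(len(matches))  # 1
```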

The data in the result array is exactly the data we need.

P.S.

I understand that this method will not suit everyone, but the main goal of the article is to prompt action rather than to solve all of your problems. I think the most correct decision would be to start studying the MariaDB source code, but due to limited time the current method seemed the fastest.

In some cases, after analyzing the file, you can determine the approximate structure and restore it using one of the standard methods from the links above. That will be much more correct and cause fewer problems.

Source: habr.com
