Something about inodes

From time to time, with an eye to moving to the CRS, I interview at various large companies, mainly in St. Petersburg and Moscow, for DevOps positions. I have noticed that many companies (including good ones, such as Yandex) ask two similar questions:

  • what is an inode
  • what can cause a disk write error (or, for example, why disk space can run out; the essence is the same)

As often happens, I was sure I knew this topic well, but as soon as I started explaining it, gaps in my knowledge surfaced. To systematize what I know, fill in those gaps, and avoid embarrassing myself again, I am writing this article; maybe it will be useful to someone else.

I'll start "from the bottom", i.e. from the hard disk (leaving aside flash drives, SSDs, and other modern storage; for our purposes, take any old 20 or 80 GB disk, where the block size is 512 bytes).

A hard disk cannot address its space byte by byte; instead it is divided into blocks. Block numbering starts from 0. (This addressing scheme is called LBA; details here: en.wikipedia.org/wiki/LBA)
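To make the addressing concrete, here is a tiny sketch (the helper name is mine, not from any tool): an LBA block number maps to a byte offset on disk simply by multiplying by the sector size.

```python
# Hypothetical helper: map an LBA block number to a byte offset,
# assuming the classic 512-byte sector size discussed above.
SECTOR_SIZE = 512

def lba_to_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset on disk where the given LBA block starts."""
    return lba * sector_size

print(lba_to_offset(0))  # block 0 starts at byte 0
print(lba_to_offset(1))  # block 1 starts at byte 512
```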

[Figure: LBA blocks (HDD level), the partition, and filesystem blocks]

As the figure shows, I marked the LBA blocks as the HDD level. By the way, you can check the physical block size of your disk like this:

root@ubuntu:/home/serp# blockdev --getpbsz /dev/sdb
512

The level above represents a partition, one for the whole disk (again, for simplicity). Two partition table formats are most common: msdos (MBR) and GPT. msdos is the older format and supports disks up to 2 TiB; GPT is newer and, with its 64-bit block addresses, can address about 9.4 ZB of 512-byte blocks. In our case the partition table is of the msdos type, as the figure shows; the partition starts at block 1, while block 0 is used for the MBR.
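Those limits fall straight out of the arithmetic: an MBR partition table stores block addresses in 32-bit fields, GPT in 64-bit fields. A quick check:

```python
# MBR stores LBA addresses as 32-bit numbers, GPT as 64-bit numbers;
# multiplying each address space by the 512-byte sector size gives the
# maximum addressable disk size for each format.
SECTOR = 512

mbr_limit = 2**32 * SECTOR  # 32-bit LBA
gpt_limit = 2**64 * SECTOR  # 64-bit LBA

print(mbr_limit // 2**40, "TiB")  # prints: 2 TiB
print(gpt_limit // 2**70, "ZiB")  # prints: 8 ZiB (~9.4 ZB)
```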

On the first partition I created an ext2 file system, which has a default block size of 4096 bytes, also shown in the figure. You can view the filesystem's block size like this:

root@ubuntu:/home/serp# tune2fs -l /dev/sdb1
tune2fs 1.42.9 (4-Feb-2014)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          a600bf40-f660-41f6-a3e6-96c303995479
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536
Block count:              261888
Reserved block count:     13094
Free blocks:              257445
Free inodes:              65525
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      63
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Fri Aug  2 15:02:13 2019
Last mount time:          n/a
Last write time:          Fri Aug  2 15:02:14 2019
Mount count:              0
Maximum mount count:      -1
Last checked:             Fri Aug  2 15:02:13 2019
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      c0155456-ad7d-421f-afd1-c898746ccd76

The parameter we need is "Block size".
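Since a file's data always occupies a whole number of these blocks, even a one-byte file consumes a full 4096-byte block. A sketch of that arithmetic (the helper name is mine):

```python
# How many 4096-byte filesystem blocks are needed to store a file's
# data? Integer ceiling division; even 1 byte consumes a whole block.
BLOCK_SIZE = 4096  # the "Block size" value from tune2fs above

def blocks_needed(file_size):
    """Number of filesystem data blocks needed for file_size bytes."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

print(blocks_needed(1))     # prints: 1
print(blocks_needed(4096))  # prints: 1
print(blocks_needed(4097))  # prints: 2
```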

Now the fun part: how is the file /home/serp/testfile read? A file consists of one or more filesystem blocks that store its data. Knowing only the file name, how do we find it? Which blocks should we read?

This is where inodes come in. The ext2 filesystem has a table containing information about all inodes. The number of inodes in ext2 is fixed when the filesystem is created; see the "Inode count" parameter in the tune2fs output above (65536 in our case). An inode contains exactly what we need: the list of filesystem blocks belonging to a file. But how do we find the inode number for a given file name?

The mapping of names to inode numbers is stored in a directory, and a directory in ext2 is a special type of file, which means it has an inode number of its own. To break this vicious circle, the root directory is assigned a fixed inode number: 2. Let's look at the contents of inode number 2:

root@ubuntu:/# debugfs /dev/sdb1
debugfs 1.42.9 (4-Feb-2014)
debugfs:  stat <2>

Inode: 2   Type: directory    Mode:  0755   Flags: 0x0
Generation: 0    Version: 0x00000000:00000002
User:     0   Group:     0   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 3   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5d43cb51:16b61bcc -- Fri Aug  2 16:34:09 2019
 atime: 0x5d43c247:b704301c -- Fri Aug  2 15:55:35 2019
 mtime: 0x5d43cb51:16b61bcc -- Fri Aug  2 16:34:09 2019
crtime: 0x5d43b5c6:00000000 -- Fri Aug  2 15:02:14 2019
Size of extra inode fields: 28
BLOCKS:
(0):579
TOTAL: 1

As you can see, the directory we need occupies block number 579. In it we will find the inode number of the home directory, and so on down the chain, until we reach the inode number of the requested file in the serp directory. If anyone wants to verify that the number is correct and that the block really holds this information, it is easy to check:

root@ubuntu:/# dd if=/dev/sdb1 of=/home/serp/dd_image bs=4096 count=1 skip=579
1+0 records in
1+0 records out
4096 bytes (4,1 kB) copied, 0,000184088 s, 22,3 MB/s
root@ubuntu:/# hexdump -c /home/serp/dd_image

In the output, you can read the names of the files in the directory.
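The same chain of lookups can be observed from userspace without dd. As a sketch (using /usr/bin as an example path; any existing absolute path works), Python's os module exposes the inode number the kernel resolved for each component, mirroring the directory-by-directory walk described above:

```python
# Walk a path component by component, printing the inode number of each
# directory along the way; this mirrors how the kernel resolves a name,
# starting from the root directory (inode 2 on ext* filesystems).
import os

def inode_chain(path):
    """Return [(path, inode)] for every component of an absolute path."""
    chain = [("/", os.stat("/").st_ino)]
    cur = "/"
    for part in os.path.normpath(path).strip("/").split("/"):
        cur = os.path.join(cur, part)
        chain.append((cur, os.stat(cur).st_ino))
    return chain

for name, ino in inode_chain("/usr/bin"):
    print(ino, name)
```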

This brings us to the main question: for what reasons can a write error occur?

Naturally, this happens when there are no free filesystem blocks left. What can be done? Besides the obvious "delete something unnecessary", remember that the ext2, ext3, and ext4 filesystems have a "Reserved block count". In the listing above it is 13094 blocks. These blocks are writable only by the root user, but if you need to resolve the issue quickly, you can, as a temporary measure, make them available to everyone, freeing up some space:

root@ubuntu:/mnt# tune2fs -m 0 /dev/sdb1
tune2fs 1.42.9 (4-Feb-2014)
Setting reserved blocks percentage to 0% (0 blocks)

That is, by default 5% of your disk space is not writable by ordinary users, and given the size of modern disks, that can be hundreds of gigabytes.
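The reserved blocks are visible from userspace, too: in the statvfs structure, f_bfree counts all free blocks, while f_bavail excludes the root-reserved ones. A sketch querying the root filesystem:

```python
# f_bfree  = free blocks, including those reserved for root
# f_bavail = free blocks available to unprivileged users
# Their difference is the reserved space that tune2fs -m controls.
import os

st = os.statvfs("/")
reserved_bytes = (st.f_bfree - st.f_bavail) * st.f_frsize
print("free (total):     ", st.f_bfree * st.f_frsize // 2**20, "MiB")
print("free (non-root):  ", st.f_bavail * st.f_frsize // 2**20, "MiB")
print("reserved for root:", reserved_bytes // 2**20, "MiB")
```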

What else can happen? It is also possible that free blocks remain but the inodes have run out. This usually happens when the filesystem holds a large number of files smaller than the filesystem block size. Considering that each file or directory consumes one inode, and we have 65536 of them in total (for this filesystem), the situation is entirely realistic. It is clearly visible in the output of the df command:

serp@ubuntu:~$ df -hi
Filesystem     Inodes IUsed IFree IUse% Mounted on
udev             493K   480  492K    1% /dev
tmpfs            493K   425  493K    1% /run
/dev/xvda1       512K  240K  273K   47% /
none             493K     2  493K    1% /sys/fs/cgroup
none             493K     2  493K    1% /run/lock
none             493K     1  493K    1% /run/shm
none             493K     2  493K    1% /run/user
/dev/xvdc1       320K  4,1K  316K    2% /var
/dev/xvdb1        64K   195   64K    1% /home
/dev/xvdh1       4,0M  3,1M  940K   78% /var/www
serp@ubuntu:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2,0G  4,0K  2,0G   1% /dev
tmpfs           395M  620K  394M   1% /run
/dev/xvda1      7,8G  2,9G  4,6G  39% /
none            4,0K     0  4,0K   0% /sys/fs/cgroup
none            5,0M     0  5,0M   0% /run/lock
none            2,0G     0  2,0G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdc1      4,8G  2,6G  2,0G  57% /var
/dev/xvdb1      990M  4,0M  919M   1% /home
/dev/xvdh1       63G   35G   25G  59% /var/www

As you can clearly see for the /var/www partition, the percentage of used filesystem blocks (59%) and the percentage of used inodes (78%) differ greatly.
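The numbers that df -hi prints come from the same statvfs call: f_files is the total inode count and f_ffree the free ones. A filesystem is effectively full when either blocks or inodes are exhausted:

```python
# f_files = total inodes, f_ffree = free inodes; compare inode usage
# with block usage to spot the "plenty of space, no inodes" failure mode.
import os

st = os.statvfs("/")
if st.f_files:  # some filesystems (e.g. btrfs) report 0 here
    iuse = 100 * (st.f_files - st.f_ffree) / st.f_files
    print("inode use: %.1f%%" % iuse)
buse = 100 * (st.f_blocks - st.f_bfree) / st.f_blocks
print("block use: %.1f%%" % buse)
```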

If you do run out of inodes, I cannot offer any magic spells, because there are none (if I am wrong, let me know). So, for partitions where small files multiply, you should choose the filesystem wisely. For example, btrfs cannot run out of inodes, since new ones are created dynamically as needed.

Source: habr.com
