Ordering Requests to Accelerate Disk I/O

April 04, 2021

In the earlier post I showed how accessing data on an SSD in parallel can greatly improve read performance. However, that technique is not very effective for data stored on spinning drives. In some cases parallel access can even deteriorate performance significantly. Fortunately, there exists a class of optimizations that can strongly help with HDDs: request ordering. By requesting data in proper order, the disk seek latency can be reduced by an order of magnitude. Since I introduced that optimization in fclones 0.9, fclones became the fastest duplicate file finder I know of.

HDD Performance Characteristics

Spinning drives, contrary to SSDs, have significant access latency, limiting the effective number of IO operations per second they can serve. The disk access latency is mostly comprised of:

seek latency – the time needed to position the head on the desired track,
rotational latency – the time needed to wait for the disk to rotate so that the right sector is under the head,

The seek latency is higher the more distance the head has to move to get to the right track. The typical average seek latency advertised by manufacturers are about 5-12 ms. The average rotational latency is equal to the time needed for the disk plates to do half of the turn. In case of a 7200 RPM drive, this equals to 60/7200/2 = 4.2 ms. Overall the total average latency can be about 10 ms, and worst-case latency more than 20 ms.

Now if we want to process a bunch of tiny files placed randomly on the disk, and we access them in random order, we should not expect to process vastly more than about 100 per second (you might get lucky though, that some of your files would be located close to each other which may improve it a bit).

This back-of-the envelope calculation holds pretty well in the real world. Here is an example iostat output while searching for duplicates on a 7200 RPM HDD with an old version of fclones (0.8):

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             127,40       655,20         0,00       3276          0
sdb               0,60         0,00        67,20          0        336

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             135,00       659,20         0,00       3296          0
sdb              26,00         0,00       174,40          0        872

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             132,60       669,60         0,00       3348          0
sdb               0,40         0,00         8,00          0         40

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             127,40       683,20         0,00       3416          0
sdb               0,40         2,40        28,00         12        140

We can see the number of transactions per second (tps) is slightly more than 100. That’s a ridiculously low number considering SSDs can handle tens or even hundreds thousands random accesses per second. The data rate is also essentially “killed” by latency. This drive can read at more than 100 MB/s continuously, but here we get a rate in the range of hundreds of kilobytes.

Unfortunately, even in 2021 you may still have plenty of gigabytes in small files lying around on older HDDs. Does it mean you’re doomed to hours of waiting if you ever want to process all of them (e.g. search, backup or deduplicate)?

Basic Request Ordering

In [1] the authors presented some nice techniques of improving performance by sorting I/O requests before sending them to the operating system for execution. I implemented them in fclones and the results are truly amazing!

Up to version 0.8, fclones processed files in the order dictated by their size, because that was the order naturally obtained from the first phase of grouping. As you may expect, it turns out, file size isn’t correlated with the physical location of a file at all. Hence, the performance on HDD was actually worse than as if the files were processed in the order obtained from scanning the directory tree. At least, when processing files in the order returned by the directory listing, there are high chances they were saved at the similar time (e.g. as a result of a directory copy operation) and are actually placed very close to each other. And indeed, some alternative programs like fdupes or rdfind outperformed fclones on HDD, despite not really doing anything special to speed up disk access.

One of the first ideas I tried from the paper was to reorder the files by their inode identifiers. This was quite easy, because the inode identifiers were available already in the file metadata structures in order to properly detect hard-links. Honestly, I wasn’t expecting much improvement from this technique, as theoretically the inode number of a file has nothing to do with the physical data location. In practice though, there seems to be a lot of correlation. This technique alone worked like a charm, despite some minor added cost of sorting the entries!

Fig.1: Time to find duplicates among 50k+ files stored on a 7200 RPM drive

Ordering by Physical Data Location

We can do better. Some file systems, like Linux EXT4, offer an API for fetching information about file extents: FIEMAP ioctl. We can use this API to get a data structure that contains information on the physical placement of the file data. Then, the physical placement of the beginning of the data can be used to sort the files so that we can process all files in a single sweep. A great news is that this API is also available for non-root users.

Using FIEMAP in Rust is easy, because there is already a Rust crate for that: fiemap. The relevant fragment of fclones code looks like this:

#[cfg(target_os = "linux")]
pub fn get_physical_file_location(path: &Path) -> io::Result<Option<u64>> {
    let mut extents = fiemap::fiemap(&path.to_path_buf())?;
    match extents.next() {
        Some(fe) => Ok(Some(fe?.fe_physical)),
        None => Ok(None),
    }
}

I was initially worried that an additional system call for every file would add some initial cost, canceling the gains from the access ordering. Fortunately it turned out the cost was really low – 50k files could be queried for extents in less than a second! I guess that the fact the metadata for the files were already queried in an earlier stage, so all the required information was already in the cache. Fig. 2 shows that despite the higher number of system calls, the total time of the task decreased even more, down to about 19 seconds! This is over 10x faster than the earlier release.

Fig.2: Impact of physical data ordering on time to find duplicates among 50k+ files stored on a 7200 RPM drive

The number of transactions per second and throughput reported by iostat also went up considerably. Many files are read now in a single disk plate turn.

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda            2424,40     11605,60         0,00      58028          0
sdb               1,00         4,80        11,20         24         56

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda            2388,20     10436,80         0,00      52184          0
sdb               6,60       356,80        38,40       1784        192

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda            2397,00     11188,00         0,00      55940          0
sdb               3,20        80,80        56,80        404        284

Impact of Parallelism

Before introducing the reordering, I’ve found that issuing requests in parallel for reading small (4-64 kB) chunks of data improved speed. The operating system definitely made a good use of knowing some files in advance and reordered accesses by itself. Is it still the case after we order the reads? Maybe giving the operating system a bit more requests in advance could still save some time? I thought the system could technically work on fetching the next file while the app is still processing the earlier one.

Unfortunately, at least on my system, this seems to not work as I thought. Fetching files in parallel degraded performance a bit (Fig. 3). The effect wasn’t as huge as for sequential access of big files, but big enough that I changed the defaults in fclones 0.9.1 to now use always a single-thread per HDD device.

Fig.3: Impact of parallelism on performance of ordered disk access

Summary

The order of file I/O requests has a tremendous impact on I/O performance on spinning drives. If your application needs to process a batch of small files, make sure you request them in the same order as their physical placement on disk. If you can’t do it because your file system or your operating system does not provide physical block placement information, at least sort the files by their identifiers. If you’re lucky, the identifiers would be highly correlated with the physical placement of data, and such ordering would still do some magic.

Please let me know in the comments if you tried this and how big improvements you’ve got.

References

C. Lunde, H. Espeland, H. Stensland, and P. Halvorsen, “Improving File Tree Traversal Performance by Scheduling I/O Operations in User space,” Dec. 2009, pp. 145–152, doi: 10.1109/PCCC.2009.5403829.