Performance

Disk IO

Should files be read from multiple threads, even when the disk is the bottleneck? By having multiple concurrent reads, the operating system might be able to optimize the disk access pattern, and schedule reads more efficiently for higher throughput. Let’s measure.

Disk Cache  Threads  Time (seconds)
Cold        64       106.632476927
Cold        64       106.155341479
Cold        64       104.968957864
Warm        64       0.065452067
Warm        64       0.065966143
Warm        64       0.067338459
Cold        6        109.032390370
Cold        6        108.156613210
Cold        6        110.175107966
Warm        6        0.056552910
Warm        6        0.051793717
Warm        6        0.057326269
Warm        6        0.056153033
Cold        1        131.265989187
Cold        1        130.512200200
Cold        1        130.496186066
Warm        1        0.145899503
Warm        1        0.140550669
Warm        1        0.140376767
Warm        1        0.146533344

This is for roughly 11500 files, with program output redirected to /dev/null. The single-threaded numbers were taken at commit 4a5982ceb94b6a3dc575abce3c47e148dd28aa9f; the multi-threaded numbers at commit cc06a48af7c8ea8b8b647443db1a26f77374f9e4.

Conclusion: multithreaded ingestion is advantageous, both when indexing from disk and when indexing from memory, so there must be a CPU-bound part as well. (On my system, for my workload, that is.) The next question is how many threads to use. 64 threads probably already captures all of the parallel gains, and the returns diminish quickly. It could be worth tuning the number of threads for running time with a warm disk cache; that would likely also be close to optimal for a cold cache.
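For reference, the multi-threaded ingestion measured here boils down to a queue of discovered paths feeding a pool of worker threads. The following is only a sketch of that shape, assuming the crossbeam_channel crate and a placeholder for the per-file work; it is not the actual Musium code.

use std::path::PathBuf;
use std::thread;

fn ingest(paths: Vec<PathBuf>, n_threads: usize) {
    // Bounded queue of paths; the capacity here is arbitrary.
    let (tx, rx) = crossbeam_channel::bounded::<PathBuf>(128);

    let workers: Vec<_> = (0..n_threads)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || {
                for path in rx {
                    // Placeholder for the real work: read the file and
                    // extract the metadata that goes into the index.
                    let _ = std::fs::read(&path);
                }
            })
        })
        .collect();

    for path in paths {
        tx.send(path).unwrap();
    }

    // Dropping the only sender closes the channel, so the workers exit.
    drop(tx);
    for worker in workers {
        worker.join().unwrap();
    }
}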

Some more results after reducing the thread queue size and doing non-blocking pushes, to keep the queue sizes more even (a sketch of the non-blocking push follows the table):

Disk Cache  Threads  Queue Size  Time (seconds)
Cold        128      16          105.591609927
Cold        512      0           97.509055644
Cold        512      0           96.345510293
Cold        128      1           94.403741744
Cold        128      0           85.897972147
Cold        64       0           82.595254011
Cold        64       0           83.793832797
Cold        48       0           80.877349368
Cold        32       0           80.913407455
Cold        24       0           82.893433723
Cold        16       0           83.807142608
Cold        16       0           83.967152892
Warm        128      16          0.075636796
Warm        128      1           0.072041480
Warm        128      0           0.075571860
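The non-blocking push can be sketched roughly like this: the walker distributes paths over small bounded per-worker queues with try_send, and only falls back to a blocking send when every queue is full, which keeps the queues evenly filled without stalling the walker. The per-worker-queue layout and the names below are assumptions for illustration (again using crossbeam_channel), not necessarily the actual implementation.

use std::path::PathBuf;

use crossbeam_channel::{Sender, TrySendError};

// Push a path to one of the per-worker queues without blocking if possible.
// `senders[i]` is the bounded queue feeding worker `i`; a queue size of 0
// makes it a rendezvous channel, like the “queue size 0” rows above.
fn push_balanced(senders: &[Sender<PathBuf>], start: usize, path: PathBuf) {
    let mut pending = path;
    for i in 0..senders.len() {
        let target = &senders[(start + i) % senders.len()];
        match target.try_send(pending) {
            // Some queue had room, and we never blocked.
            Ok(()) => return,
            // This queue is full, try the next worker's queue.
            Err(TrySendError::Full(p)) => pending = p,
            // The worker is gone; in this sketch we simply drop the path.
            Err(TrySendError::Disconnected(_)) => return,
        }
    }
    // Every queue was full: fall back to a blocking send.
    let _ = senders[start % senders.len()].send(pending);
}

The walker would call this with a rotating start index for every path it discovers.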

And without queues or channels, protecting the directory iterator with a mutex instead (a sketch follows the table):

Disk Cache  Threads  Time (seconds)
Cold        48       83.731602753
Cold        48       83.806947689
Cold        24       81.919455988
Cold        24       80.765494864
Cold        12       82.537088779
Cold        12       83.135829488
Warm        48       0.056744610
Warm        24       0.059594100
Warm        24       0.054264233
Warm        12       0.056491306
Warm        12       0.056685518
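The mutex variant has roughly the following shape: the worker threads share the directory walker itself behind a mutex, and each worker holds the lock only long enough to pull the next path. A sketch assuming the walkdir crate, with the per-file work again a placeholder:

use std::path::PathBuf;
use std::sync::Mutex;
use std::thread;

fn ingest(root: &str, n_threads: usize) {
    // The directory iterator itself is the shared work source.
    let walker = Mutex::new(walkdir::WalkDir::new(root).into_iter());

    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| loop {
                // Hold the lock only while advancing the iterator.
                let entry = match walker.lock().unwrap().next() {
                    Some(Ok(entry)) => entry,
                    Some(Err(_)) => continue,
                    None => break,
                };
                let path: PathBuf = entry.into_path();
                // Placeholder for the real work: read and index the file.
                let _ = std::fs::read(&path);
            });
        }
    });
}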

Precollect

At commit c6c611be9179d939dc5646dc43ab8bdf5ddc2962, with 24 threads, discovered paths are first collected into a vec, and the index is constructed by iterating over the paths in that vec. Is this the right thing to do, or should we put the paths iterator in a mutex directly? (Both variants are sketched below.) Measurement setup:

echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium ~/music

Note that the server was disabled, so the program terminates after indexing. Also, these results are not comparable to the previous numbers, as the library has grown and more data is processed. Furthermore, I did not redirect stdout to /dev/null in this case, but for a cold disk cache that does not make much of a difference anyway.
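For clarity, the “vec precollect” variant first walks the entire tree into a vector and only then lets the worker threads drain it, while the “iter” variants hand the walkdir iterator itself to the workers, as in the mutex sketch above. A sketch of the precollect shape; the atomic index and the placeholder work are assumptions for illustration:

use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn ingest_precollect(root: &str, n_threads: usize) {
    // Phase 1: discover all paths up front.
    let paths: Vec<PathBuf> = walkdir::WalkDir::new(root)
        .into_iter()
        .filter_map(|entry| entry.ok())
        .map(|entry| entry.into_path())
        .collect();

    // Phase 2: workers claim paths by bumping a shared index; no directory
    // walking happens during this phase, only reading and indexing.
    let next = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n_threads {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                match paths.get(i) {
                    // Placeholder for the real work.
                    Some(path) => { let _ = std::fs::read(path); }
                    None => break,
                }
            });
        }
    });
}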

Strategy            Time (seconds)
Vec precollect 1    91.870704962
Vec precollect 1    90.106878818
Vec precollect 1    90.031705480
Vec precollect 2    86.926306901
Vec precollect 2    86.876997701
Vec precollect 2    89.131675265
Iter, double alloc  93.370680604
Iter, double alloc  93.180283609
Iter, double alloc  93.259494622
Iter, single alloc  94.026253229
Iter, single alloc  94.147137607
Iter, single alloc  94.352803977

Note that I did upgrade Walkdir when switching from vector precollect to the iterator-based version, so the comparison may be unfair. The data collected before the switch is labelled “Vec precollect 1”; the version after upgrading to Walkdir 2.1.4 is labelled “Vec precollect 2”. Furthermore, Walkdir 2.1.4 requires copying the path (labelled “double alloc”). I made a small change to the crate to be able to avoid the copy and extra allocation (labelled “single alloc”).

Counterintuitively, copying the path returned by the iterator is faster than not copying it. It might have something to do with ordering: perhaps spending more time inside the iterator lock is actually a good thing? Or maybe I should collect more data, and this is just a statistical fluctuation. Just storing the paths is definitely faster if the copy is avoided:

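# Elapsed times in seconds for only storing the discovered paths,
# with (copy) and without (noncopy) the extra copy of the path.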
copy    <- c(0.023223363, 0.022365082, 0.022318216, 0.022584837,
             0.020660742, 0.023839308, 0.022084252, 0.021812114,
             0.022180668, 0.019982074, 0.020979151, 0.023186709,
             0.024758619, 0.022889618, 0.024148854, 0.024708654)
noncopy <- c(0.022403112, 0.021863389, 0.019650964, 0.020984869,
             0.021901483, 0.021376926, 0.021668108, 0.021504715,
             0.023730031, 0.021861766, 0.021060567, 0.021986531,
             0.022680138, 0.019719019, 0.020053399, 0.021137137)
t.test(copy, noncopy)

#     Welch Two Sample t-test
#
# data:  copy and noncopy
# t = 2.6055, df = 28.297, p-value = 0.01447
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  0.000242829 0.002024684
# sample estimates:
#  mean of x  mean of y
# 0.02260764 0.02147388

So it is preferable to read many paths at once before processing them, perhaps due to better branch prediction. The gains are big enough that the extra allocations and reallocations for storing the PathBufs in a vec are totally worth it. It might be even better, then, to alternate between scanning paths and processing them, to reduce peak memory usage, but let’s not worry about that at this point.

Fadvise

Command:

echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium cache /pool/music /pool/volatile/covers dummy

Measurements were performed with the disks spinning. If the disks needed to spin up first, I restarted the measurement as soon as they were spinning.

Baseline, commit bcb01aac03b72c6250823d44d2b4dd71887e387c:

Disk Cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   142.662233261        3.283129000          8.579975000
Cold        15931   147.348811539        3.236058000          8.641414000
Cold        15931   145.916103563        3.376106000          8.547039000
Warm        15931   0.346267741          0.987189000          0.427480000
Warm        15931   0.369951824          0.886352000          0.523628000
Warm        15931   0.372806305          0.929290000          0.480558000

Open files first, read later, commit 0f2d00be7ef2009fe19af79ae02ac29d11c766cf:

Disk Cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   200.334320084        4.513103000          10.766578000
Warm        15931   0.835945466          2.131593000          2.420703000
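The idea behind this variant is to open (and advise) many files before reading any of them, so the kernel can start readahead and reorder the disk accesses. As an illustration only, here is a sketch using posix_fadvise with POSIX_FADV_WILLNEED through the libc crate; the actual commits may do this differently:

use std::fs::File;
use std::io::Read;
use std::os::unix::io::AsRawFd;
use std::path::PathBuf;

// Open all files first and tell the kernel that their contents will be
// needed soon, then do the actual reads in a second pass.
fn read_all(paths: &[PathBuf]) -> std::io::Result<Vec<Vec<u8>>> {
    let mut files = Vec::with_capacity(paths.len());
    for path in paths {
        let file = File::open(path)?;
        // Advise readahead for the whole file (offset 0, length 0 = to EOF).
        unsafe {
            libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_WILLNEED);
        }
        files.push(file);
    }

    // By the time we read, much of the data is hopefully in the page cache.
    let mut contents = Vec::with_capacity(files.len());
    for mut file in files {
        let mut buf = Vec::new();
        file.read_to_end(&mut buf)?;
        contents.push(buf);
    }
    Ok(contents)
}

Opening every file up front also means keeping thousands of file descriptors open; presumably the “frontier” variant below advises and reads within a limited window of files instead.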

Use “frontier” read pattern, commit 64371ff0aa834add77185531bae7160cfd6134ad:

Disk Cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   148.444013742        4.524398000          10.423234000
Cold        15931   147.144940804        4.670934000          10.321421000
Warm        15931   1.134759625          2.831797000          4.271261000
Warm        15931   1.204304762          3.183732000          4.562911000

After changing the IO queue size (and also tuning the internal queue a bit), this could be brought down to 93 seconds, which suggests that the win is really more in the IO patterns. For the warm case, simpler is probably better.

$ cat /sys/block/sd{b,c,d}/queue/nr_requests
4
4
4
$ echo 2048 | sudo tee /sys/block/sd{b,c,d}/queue/nr_requests
2048

Just regular blocking IO again, but tuning the disk queue size, commit a1ac4d2100f6d0bd61f0be99fc31b7da260be6af:

Disk Cache  Tracks  nr_requests  Threads  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   2048         24       91.411324389         3.086841000          6.717432000
Cold        15931   2048         24       93.145262004         3.103839000          7.002719000
Cold        15931   2048         128      72.864833181         2.616870000          5.983099000
Cold        15931   2048         128      73.701789903         2.642680000          5.953727000
Cold        15931   2048         256      72.805136103         2.503932000          5.793247000
Cold        15931   2048         256      72.108359036         2.625222000          5.811989000
Warm        15931   2048         24       0.320772413          0.952222000          0.421805000
Warm        15931   2048         24       0.359001663          0.986530000          0.390287000
Warm        15931   2048         128      0.370368452          0.927427000          0.480482000
Warm        15931   2048         128      0.454763022          0.930003000          0.776343000
Warm        15931   2048         256      0.410941427          0.922298000          0.711005000
Warm        15931   2048         256      0.365117437          0.928634000          0.472226000