# Performance
## Disk IO
Should files be read from multiple threads, even when the disk is the bottleneck? With multiple reads in flight concurrently, the operating system might be able to optimize the disk access pattern and schedule reads more efficiently for higher throughput. Let’s measure.
Disk Cache | Threads | Time (seconds) |
---|---|---|
Cold | 64 | 106.632476927 |
Cold | 64 | 106.155341479 |
Cold | 64 | 104.968957864 |
Warm | 64 | 0.065452067 |
Warm | 64 | 0.065966143 |
Warm | 64 | 0.067338459 |
Cold | 6 | 109.032390370 |
Cold | 6 | 108.156613210 |
Cold | 6 | 110.175107966 |
Warm | 6 | 0.056552910 |
Warm | 6 | 0.051793717 |
Warm | 6 | 0.057326269 |
Warm | 6 | 0.056153033 |
Cold | 1 | 131.265989187 |
Cold | 1 | 130.512200200 |
Cold | 1 | 130.496186066 |
Warm | 1 | 0.145899503 |
Warm | 1 | 0.140550669 |
Warm | 1 | 0.140376767 |
Warm | 1 | 0.146533344 |
This is for roughly 11500 files. Program output was redirected to `/dev/null`. The single-threaded measurements were taken at commit `4a5982ceb94b6a3dc575abce3c47e148dd28aa9f`, the multi-threaded ones at commit `cc06a48af7c8ea8b8b647443db1a26f77374f9e4`.
Conclusion: multithreaded ingestion is advantageous, both when indexing from disk and when indexing from memory. So there must be a CPU-bound part as well. (On my system, for my workload, that is.) The next question, then, is how many threads to use. 64 threads probably already captures all of the parallel gains, and the returns diminish quickly. It could be worth optimizing the number of threads for running time with a warm disk cache; that would likely also be close to optimal for a cold cache.
Some more results after reducing the thread queue size and doing non-blocking pushes, to keep the queue sizes more even:
Disk Cache | Threads | Queue Size | Time (seconds) |
---|---|---|---|
Cold | 128 | 16 | 105.591609927 |
Cold | 512 | 0 | 97.509055644 |
Cold | 512 | 0 | 96.345510293 |
Cold | 128 | 1 | 94.403741744 |
Cold | 128 | 0 | 85.897972147 |
Cold | 64 | 0 | 82.595254011 |
Cold | 64 | 0 | 83.793832797 |
Cold | 48 | 0 | 80.877349368 |
Cold | 32 | 0 | 80.913407455 |
Cold | 24 | 0 | 82.893433723 |
Cold | 16 | 0 | 83.807142608 |
Cold | 16 | 0 | 83.967152892 |
Warm | 128 | 16 | 0.075636796 |
Warm | 128 | 1 | 0.072041480 |
Warm | 128 | 0 | 0.075571860 |
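In code, the queue-based fan-out looks roughly like the sketch below. This is not the actual Musium code: the music directory, `index_file`, and the dispatch policy (round-robin with a non-blocking `try_send`, falling back to a blocking `send` only when every queue is full) are assumptions for illustration, using the `walkdir` crate and `std::sync::mpsc::sync_channel` for the bounded per-worker queues.

```rust
// A sketch only; requires the walkdir crate (walkdir = "2") in Cargo.toml.
use std::path::{Path, PathBuf};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;

/// Hypothetical stand-in for the real per-file indexing work.
fn index_file(path: &Path) {
    println!("{}", path.display());
}

/// Hand `path` to a worker, preferring a non-blocking push so that no
/// single queue runs far ahead of the others.
fn dispatch(senders: &[SyncSender<PathBuf>], next: &mut usize, mut path: PathBuf) {
    let n = senders.len();
    // Try every worker once without blocking.
    for i in 0..n {
        let idx = (*next + i) % n;
        match senders[idx].try_send(path) {
            Ok(()) => {
                *next = idx + 1;
                return;
            }
            Err(TrySendError::Full(p)) => path = p,
            Err(TrySendError::Disconnected(_)) => panic!("worker exited unexpectedly"),
        }
    }
    // Every queue was full: fall back to a blocking send.
    senders[*next % n].send(path).unwrap();
    *next += 1;
}

fn main() {
    let n_threads = 24;
    let queue_size = 1;

    let mut senders = Vec::new();
    let mut workers = Vec::new();
    for _ in 0..n_threads {
        let (tx, rx) = sync_channel::<PathBuf>(queue_size);
        senders.push(tx);
        workers.push(thread::spawn(move || {
            // The worker blocks until a path arrives, and stops once the
            // sending side is dropped.
            for path in rx {
                index_file(&path);
            }
        }));
    }

    // The main thread discovers paths and hands them out to the workers.
    let mut next = 0;
    for entry in walkdir::WalkDir::new("/home/user/music") {
        let entry = entry.expect("failed to read directory entry");
        if entry.file_type().is_file() {
            dispatch(&senders, &mut next, entry.into_path());
        }
    }

    // Dropping the senders closes the queues, which makes the workers stop.
    drop(senders);
    for worker in workers {
        worker.join().unwrap();
    }
}
```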
And without queues or channels, protecting the directory iterator with a mutex instead:
Disk Cache | Threads | Time (seconds) |
---|---|---|
Cold | 48 | 83.731602753 |
Cold | 48 | 83.806947689 |
Cold | 24 | 81.919455988 |
Cold | 24 | 80.765494864 |
Cold | 12 | 82.537088779 |
Cold | 12 | 83.135829488 |
Warm | 48 | 0.056744610 |
Warm | 24 | 0.059594100 |
Warm | 24 | 0.054264233 |
Warm | 12 | 0.056491306 |
Warm | 12 | 0.056685518 |
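A minimal sketch of this mutex-protected iterator variant, again with a placeholder directory and `index_file`: every worker locks the shared `walkdir` iterator just long enough to pull the next entry, and does the actual file IO outside the lock. Compared to the channel version there are no queues to size, and the lock is only contended for the short time it takes to advance the iterator.

```rust
// A sketch only; requires the walkdir crate (walkdir = "2") in Cargo.toml.
use std::path::Path;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the real per-file indexing work.
fn index_file(path: &Path) {
    println!("{}", path.display());
}

fn main() {
    let n_threads = 24;

    // Share the directory iterator itself instead of a queue of paths.
    let walker = walkdir::WalkDir::new("/home/user/music").into_iter();
    let walker = Arc::new(Mutex::new(walker));

    let mut workers = Vec::new();
    for _ in 0..n_threads {
        let walker = Arc::clone(&walker);
        workers.push(thread::spawn(move || loop {
            // The guard is dropped at the end of this statement, so the
            // file IO below happens outside the lock.
            let entry = match walker.lock().unwrap().next() {
                Some(result) => result.expect("failed to read directory entry"),
                None => break,
            };
            if entry.file_type().is_file() {
                index_file(entry.path());
            }
        }));
    }

    for worker in workers {
        worker.join().unwrap();
    }
}
```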
## Precollect
At commit `c6c611be9179d939dc5646dc43ab8bdf5ddc2962`, with 24 threads. First collecting discovered paths into a vec, and constructing the index by iterating over the paths in the vec. Is this the right thing to do, or should we put the paths iterator in a mutex directly? Measurement setup:
```
echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium ~/music
```
Note that the server was disabled, so the program terminates after indexing. Also, these results are not comparable to the previous numbers, as the library has grown and more data is processed. Furthermore, I did not redirect stdout to `/dev/null` in this case, but for a cold disk cache that does not make much of a difference anyway.
Precollect | Time (seconds) |
---|---|
Vec precollect 1 | 91.870704962 |
Vec precollect 1 | 90.106878818 |
Vec precollect 1 | 90.031705480 |
Vec precollect 2 | 86.926306901 |
Vec precollect 2 | 86.876997701 |
Vec precollect 2 | 89.131675265 |
Iter, double alloc | 93.370680604 |
Iter, double alloc | 93.180283609 |
Iter, double alloc | 93.259494622 |
Iter, single alloc | 94.026253229 |
Iter, single alloc | 94.147137607 |
Iter, single alloc | 94.352803977 |
Note that I did upgrade Walkdir when switching from the vector precollect to the iterator-based version, so the comparison may be unfair. The data collected before the switch is labelled “Vec precollect 1”; the version after upgrading to Walkdir 2.1.4 is labelled “Vec precollect 2”. Furthermore, Walkdir 2.1.4 requires copying the path (labelled “double alloc”). I made a small change to the crate to be able to avoid the copy and extra allocation (labelled “single alloc”).
Counterintuitively, copying the path returned by the iterator is faster than not copying it. It might have something to do with ordering; perhaps spending more time inside the iterator lock is actually a good thing? Or maybe I should collect more data, and this is just a statistical fluctuation. Just storing the paths, on the other hand, is definitely faster when the copy is avoided:
```r
copy <- c(0.023223363, 0.022365082, 0.022318216, 0.022584837,
          0.020660742, 0.023839308, 0.022084252, 0.021812114,
          0.022180668, 0.019982074, 0.020979151, 0.023186709,
          0.024758619, 0.022889618, 0.024148854, 0.024708654)
noncopy <- c(0.022403112, 0.021863389, 0.019650964, 0.020984869,
             0.021901483, 0.021376926, 0.021668108, 0.021504715,
             0.023730031, 0.021861766, 0.021060567, 0.021986531,
             0.022680138, 0.019719019, 0.020053399, 0.021137137)
t.test(copy, noncopy)

#   Welch Two Sample t-test
#
# data: copy and noncopy
# t = 2.6055, df = 28.297, p-value = 0.01447
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#   0.000242829 0.002024684
# sample estimates:
#  mean of x  mean of y
# 0.02260764 0.02147388
```
So it is preferable to read many paths at once before processing them, perhaps due to better branch prediction. The gains are so big that the extra allocations and reallocations for storing the `PathBuf`s in a vec are totally worth it. It might be even better to alternate between scanning paths and processing them, to reduce peak memory usage, but let’s not worry about that at this point.
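For reference, the vec-precollect variant boils down to something like the sketch below (the directory is a placeholder, and the second phase stands in for the actual multi-threaded indexing). In current versions of the `walkdir` crate, `DirEntry::into_path` hands back the already-allocated `PathBuf`, which corresponds to the “single alloc” case; `entry.path().to_path_buf()` corresponds to “double alloc”.

```rust
// A sketch only; requires the walkdir crate (walkdir = "2") in Cargo.toml.
use std::path::PathBuf;

fn main() {
    // Phase 1: collect all paths up front. `into_path` reuses the PathBuf
    // that walkdir already allocated, so no extra copy is made.
    let paths: Vec<PathBuf> = walkdir::WalkDir::new("/home/user/music")
        .into_iter()
        .map(|entry| entry.expect("failed to read directory entry"))
        .filter(|entry| entry.file_type().is_file())
        .map(|entry| entry.into_path())
        .collect();

    // Phase 2: build the index from the collected paths; in the real
    // program this is what the worker threads iterate over.
    for path in &paths {
        println!("{}", path.display());
    }
}
```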
## Fadvise
Commands:

```
echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium cache /pool/music /pool/volatile/covers dummy
```
Measurements were performed with disks spinning. If the disks needed to spin up first, I restarted the measurement as soon as the disk was spinning.
Baseline, commit `bcb01aac03b72c6250823d44d2b4dd71887e387c`:
Disk Cache | Tracks | Wall time (seconds) | User time (seconds) | Sys time (seconds) |
---|---|---|---|---|
Cold | 15931 | 142.662233261 | 3.283129000 | 8.579975000 |
Cold | 15931 | 147.348811539 | 3.236058000 | 8.641414000 |
Cold | 15931 | 145.916103563 | 3.376106000 | 8.547039000 |
Warm | 15931 | 0.346267741 | 0.987189000 | 0.427480000 |
Warm | 15931 | 0.369951824 | 0.886352000 | 0.523628000 |
Warm | 15931 | 0.372806305 | 0.929290000 | 0.480558000 |
Open files first, read later, commit `0f2d00be7ef2009fe19af79ae02ac29d11c766cf`:
Disk Cache | Tracks | Wall time (seconds) | User time (seconds) | Sys time (seconds) |
---|---|---|---|---|
Cold | 15931 | 200.334320084 | 4.513103000 | 10.766578000 |
Warm | 15931 | 0.835945466 | 2.131593000 | 2.420703000 |
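For context, an “open files first, read later” approach built on fadvise would presumably look something like the sketch below: open every file, tell the kernel up front that the whole file is about to be read, and only then do the actual reads. The file list and helper function are made up for illustration and this is not the actual Musium code; it uses `posix_fadvise` with `POSIX_FADV_WILLNEED` from the `libc` crate.

```rust
// A sketch only; requires the libc crate in Cargo.toml, Linux-only.
use std::fs::File;
use std::io::Read;
use std::os::unix::io::AsRawFd;

/// Tell the kernel we intend to read the entire file soon, so it can start
/// fetching the data in the background in whatever order suits the disk.
fn advise_will_need(file: &File) {
    let result = unsafe {
        // Offset 0 with length 0 means "the whole file".
        libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_WILLNEED)
    };
    assert_eq!(result, 0, "posix_fadvise failed");
}

fn main() -> std::io::Result<()> {
    // Hypothetical file list; in practice this comes from the directory walk.
    let paths = ["/pool/music/a.flac", "/pool/music/b.flac"];

    // Open everything and announce the reads first ...
    let files = paths
        .iter()
        .map(|path| File::open(path))
        .collect::<std::io::Result<Vec<File>>>()?;
    for file in &files {
        advise_will_need(file);
    }

    // ... then read, hopefully hitting data the kernel already fetched.
    for mut file in files {
        let mut buffer = Vec::new();
        file.read_to_end(&mut buffer)?;
    }

    Ok(())
}
```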
Use “frontier” read pattern, commit `64371ff0aa834add77185531bae7160cfd6134ad`:
Disk Cache | Tracks | Wall time (seconds) | User time (seconds) | Sys time (seconds) |
---|---|---|---|---|
Cold | 15931 | 148.444013742 | 4.524398000 | 10.423234000 |
Cold | 15931 | 147.144940804 | 4.670934000 | 10.321421000 |
Warm | 15931 | 1.134759625 | 2.831797000 | 4.271261000 |
Warm | 15931 | 1.204304762 | 3.183732000 | 4.562911000 |
After changing the IO queue size (and also tuning the internal queue a bit), this could be brought down to 93 seconds, which suggests that the win really lies in the IO pattern. For the warm case, simpler is probably better.
```
$ cat /sys/block/sd{b,c,d}/queue/nr_requests
4
4
4
$ echo 2048 | sudo tee /sys/block/sd{b,c,d}/queue/nr_requests
2048
```
Just regular blocking IO again, but tuning the disk queue size, commit `a1ac4d2100f6d0bd61f0be99fc31b7da260be6af`:
Disk Cache | Tracks | nr_requests | Threads | Wall time (seconds) | User time (seconds) | Sys time (seconds) |
---|---|---|---|---|---|---|
Cold | 15931 | 2048 | 24 | 91.411324389 | 3.086841000 | 6.717432000 |
Cold | 15931 | 2048 | 24 | 93.145262004 | 3.103839000 | 7.002719000 |
Cold | 15931 | 2048 | 128 | 72.864833181 | 2.616870000 | 5.983099000 |
Cold | 15931 | 2048 | 128 | 73.701789903 | 2.642680000 | 5.953727000 |
Cold | 15931 | 2048 | 256 | 72.805136103 | 2.503932000 | 5.793247000 |
Cold | 15931 | 2048 | 256 | 72.108359036 | 2.625222000 | 5.811989000 |
Warm | 15931 | 2048 | 24 | 0.320772413 | 0.952222000 | 0.421805000 |
Warm | 15931 | 2048 | 24 | 0.359001663 | 0.986530000 | 0.390287000 |
Warm | 15931 | 2048 | 128 | 0.370368452 | 0.927427000 | 0.480482000 |
Warm | 15931 | 2048 | 128 | 0.454763022 | 0.930003000 | 0.776343000 |
Warm | 15931 | 2048 | 256 | 0.410941427 | 0.922298000 | 0.711005000 |
Warm | 15931 | 2048 | 256 | 0.365117437 | 0.928634000 | 0.472226000 |