668 nanoseconds on a network disk
A while ago I was frantically searching for a way to leave Heroku’s database model and go somewhere else. Since we’ve been running a cluster on Hetzner, I decided to check it out. I picked a Hetzner CCX33: 8 vCPUs, 32 GB RAM, 240 GB NVMe. And… I ran fio on the disk to find out whether the “ssd” was actually a local SSD or network-attached storage. I needed the former, because high IOPS was a priority.
I tested random 8 KB reads, with --direct=1 to bypass the page cache, iodepth=32, numjobs=4, and the libaio engine:
read: IOPS=325k
clat: p1=668ns, p50=123us, p99=1.3ms
325,000 IOPS. The p50 of 123 microseconds looks great - it’s faster than a local NVMe I have at home, which never answered a single read below 61 microseconds.
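For anyone who wants to reproduce the read test, here is a rough sketch of how the invocation could be assembled. The flags for block size, direct I/O, engine, queue depth and job count come from the description above. The target path, file size and runtime are assumptions, since the post does not state them.

```python
import shlex

def fio_randread_args(target: str, size: str = "4G", runtime_s: int = 60) -> list[str]:
    """Build a fio argv for the random-read test described in the post.

    target, size and runtime_s are assumptions; the post does not give them.
    """
    return [
        "fio",
        "--name=randread-8k",      # job name, arbitrary
        f"--filename={target}",    # hypothetical test file on the disk under test
        f"--size={size}",          # assumed working-set size
        "--rw=randread",           # random reads
        "--bs=8k",                 # 8 KB blocks
        "--direct=1",              # O_DIRECT: bypass the OS page cache
        "--ioengine=libaio",       # Linux native asynchronous I/O
        "--iodepth=32",            # up to 32 in-flight I/Os per job
        "--numjobs=4",             # 4 jobs, so up to 128 in flight in total
        f"--runtime={runtime_s}",  # assumed duration in seconds
        "--time_based",            # keep running until the runtime elapses
        "--group_reporting",       # one combined report instead of one per job
    ]

if __name__ == "__main__":
    # Prints a copy-pasteable shell command; the path is a placeholder.
    print(shlex.join(fio_randread_args("/mnt/data/fio.test")))
```

The working-set size matters more than it looks: a file small enough to fit in a cache below the OS will flatter the read numbers.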
Look at p1 though: 668 nanoseconds. How can this be possible?
--direct=1 bypasses the OS page cache, so this is not Linux serving me a cached page. The read left the kernel. And still it came back in 668ns. No physical disk, flash or spinning, answers in 668ns. My home NVMe has a floor around 61 microseconds, about 90x slower. Sub-microsecond is RAM. So some layer below the OS - the storage backend, a controller, a cache in front of the actual device - held that page in memory and answered before any disk was touched.
That is the thing about “network-attached storage”. It is not one disk. It is a system, and that system has its own cache. Some of my reads never reached a disk at all. The p50 of 123us is probably the cache too, or close to it. The p99 of 1.3ms is where I think we start paying for the network.
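One way to sanity-check that reading of the percentiles is Little's law: requests in flight = throughput x latency. The sketch below assumes all 4 x 32 = 128 queue slots stayed full for the whole run, which the post does not confirm, so treat the result as a rough estimate of the mean rather than a measurement.

```python
def implied_mean_latency_us(iops: float, iodepth: int, numjobs: int) -> float:
    """Little's law: in_flight = throughput * latency, so latency = in_flight / throughput.

    Assumes every job keeps its queue full for the whole run.
    """
    in_flight = iodepth * numjobs
    return in_flight / iops * 1_000_000  # seconds -> microseconds

mean_us = implied_mean_latency_us(iops=325_000, iodepth=32, numjobs=4)
print(f"implied mean read latency: {mean_us:.0f} us")  # implied mean read latency: 394 us
```

Under that assumption the mean lands around 394us, roughly three times the 123us median. A mean that far above the median is what a distribution looks like when most reads are cheap and a slow tail carries most of the cost.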
Then the write side, same machine. Random 8 KB writes, but with --fsync=1 so each write has to become durable before the next:
write: IOPS=10.2k
clat: p50=12.1ms, p99=16.2ms
12 milliseconds for an 8 KB write to become durable.
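To make "durable before the next" concrete, here is a minimal sketch of the same idea at the syscall level: write a block, call fsync, and only then start the next one. This is not what fio does internally. It appends sequentially through the page cache at queue depth 1, and the function name, block count and path are hypothetical. It is only meant to show where the wait happens.

```python
import os
import statistics
import tempfile
import time

def time_durable_writes(path: str, count: int = 200, block_size: int = 8192) -> list[float]:
    """Write `count` blocks, fsync after each one, return per-write latency in microseconds."""
    block = os.urandom(block_size)
    latencies_us = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter_ns()
            os.write(fd, block)
            os.fsync(fd)  # returns only once the kernel reports the data as flushed to storage
            latencies_us.append((time.perf_counter_ns() - start) / 1000)
    finally:
        os.close(fd)
    return latencies_us

if __name__ == "__main__":
    # The default temp directory may not live on the disk you care about;
    # point the path at the volume under test instead.
    with tempfile.TemporaryDirectory() as tmp:
        lat = time_durable_writes(os.path.join(tmp, "fsync.test"))
    print(f"p50 = {statistics.median(lat):.0f} us over {len(lat)} writes")
```

On a volume where the flush has to be acknowledged by a remote backend, the time inside `os.fsync` is where a round trip like the 12ms above would show up.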
The reads hide their cost. The writes show it. A read can be answered by whatever cache is closest. A durable write cannot - it has to cross the network and wait for the backend to acknowledge the data is safe, and that round trip is 12ms at the median. Same disk, same machine, and the numbers sit two to four orders of magnitude apart: the 12.1ms median write is about 100x the 123us median read and about 18,000x the 668ns fastest reads. The difference is entirely whether someone is allowed to lie to you about where the data is.
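How big the gap is depends on which read number you hold the write against. A small sketch of the arithmetic, using only the latencies quoted in the fio output above:

```python
import math

# Latencies quoted in the post, converted to nanoseconds.
READ_P1_NS = 668
READ_P50_NS = 123_000        # 123 us
WRITE_P50_NS = 12_100_000    # 12.1 ms

def gap(slow_ns: int, fast_ns: int) -> tuple[float, float]:
    """Return (ratio, orders of magnitude) between two latencies."""
    ratio = slow_ns / fast_ns
    return ratio, math.log10(ratio)

for label, fast_ns in [("median read", READ_P50_NS), ("p1 read", READ_P1_NS)]:
    ratio, orders = gap(WRITE_P50_NS, fast_ns)
    print(f"median write vs {label}: {ratio:,.0f}x ({orders:.1f} orders of magnitude)")
# median write vs median read: 98x (2.0 orders of magnitude)
# median write vs p1 read: 18,114x (4.3 orders of magnitude)
```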
A local NVMe does not have this gap. Read and write latencies sit much closer together because there is no network in the middle and no backend cache pretending the disk is faster than it is.
I never got to run this cluster in production. But the fio run told me what I needed: this disk is fast when it can cache, and the durability cost lives on the write path, hidden from any read benchmark you might run to evaluate it.