<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Production</title><link>https://frn.sh/c/production/</link><description>Recent content in Production</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright © Fernando Simões.</copyright><lastBuildDate>Sat, 21 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://frn.sh/c/production/index.xml" rel="self" type="application/rss+xml"/><item><title>Where did 400 MiB go?</title><link>https://frn.sh/pmem/</link><pubDate>Sat, 21 Mar 2026 00:00:00 +0000</pubDate><guid>https://frn.sh/pmem/</guid><description>I restarted all 60+ pods of a Node.js websocket app earlier today. Every single pod was sitting at ~330 MiB of memory - except one, which was double the rest, at 640 MiB.
This is a StatefulSet. When I built the cluster, I estimated each pod&amp;rsquo;s footprint: ~198 MiB base, plus ~25 MiB per websocket. With 30 websockets per pod, that&amp;rsquo;s roughly 950 MiB. I was wrong about the per-websocket cost - it&amp;rsquo;s lower than 25 MiB in practice.</description></item><item><title>Between select and disk</title><link>https://frn.sh/iops/</link><pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate><guid>https://frn.sh/iops/</guid><description>We had a Postgres incident this week. Heroku timeouts, multiple queries running for 30+ minutes, and IOPS pinned at the provisioned limit.
I knew I needed a better index, but I wanted to understand what &amp;ldquo;reading from disk&amp;rdquo; actually means first. How many layers of caching sit between a SELECT and the storage device?
Three.
First: shared buffers, Postgres&amp;rsquo; own cache living in process memory. If the page is there, we need no system call - just a memory read.</description></item><item><title>108,725 forks</title><link>https://frn.sh/tforks/</link><pubDate>Thu, 11 Dec 2025 00:00:00 +0000</pubDate><guid>https://frn.sh/tforks/</guid><description>First week at a new job. A colleague was showing me around our Grafana dashboards, just routine monitoring of the bare-metal machines. One caught my eye: a machine with 32 GB of RAM and a top-of-the-line processor was hitting 90% CPU. A few containers running, no alerts, and nobody had reported anything.
I found a process with the command bash startup.sh that had been running for 28 minutes.
I straced it for a few minutes:</description></item><item><title>Sigterm a D state process</title><link>https://frn.sh/sigterm/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://frn.sh/sigterm/</guid><description>Load average hit 12 on a 2 vCPU machine during a production incident. My first thought was that CPU must be the bottleneck - 12 is 6x the core count.
But it wasn&amp;rsquo;t.
Linux load average counts three things: processes running on a CPU, processes waiting in the run queue, and processes in uninterruptible sleep - D state. From the kernel source:
The global load average is an exponentially decaying average of nr_running + nr_uninterruptible.</description></item></channel></rss>