Sounds to me like etcd has data handoff consistency issues and is just flying by the seat of its pants: in other words, the moment the IO subsystem hiccups, it just shits the bed?
AntonFriberg•about 5 hours ago
Came across this at work as well during the early days of Kubernetes. At that time all VM storage was entirely backed by NFS or network-attached iSCSI; there was no local disk at all. We noticed intermittent issues that caused the kube API to stop responding, but nothing serious.
Then all of a sudden there was a longer-lasting outage where etcd did not really recover on its own. The kube API buffered requests but eventually crashed due to OOM.
The issue was due to the Kubernetes distro we picked (Rancher), which runs a separate cluster in front of the deployed ones for the control plane. Large changes happening in the underlying clusters needed to be sent to the control plane. After we hit the latency threshold it started failing at regular intervals, and the control plane started drifting more and more and needing more and more changes at startup, until it could not recover.
Solving it took some time and confidence: manually cleaning up large amounts of unused data in the underlying etcd instances so that they did not cause upstream errors.
It was later, during the post-mortem investigation, that I understood the Raft algorithm and the storage latency issue. Convincing the company to install local disks took some time, but I could correlate kube API issues with disk latency by setting up robust monitoring. Fun times!
The requirements are well documented nowadays! https://etcd.io/docs/v3.1/op-guide/performance/
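If you want to reproduce that kind of correlation, etcd itself exposes the etcd_disk_wal_fsync_duration_seconds histogram, which is the usual thing to chart next to kube API errors. A cruder standalone check is to time write+fsync directly on the disk backing the data directory; here is a minimal Go sketch (the /var/lib/etcd path, the 8 KiB write size, and the iteration count are just illustrative assumptions, not anything from the story above):

```go
// Minimal fsync-latency probe (illustrative sketch, not etcd tooling):
// repeatedly write a small block and fsync it on the disk that backs the
// etcd data directory, then report the worst observed latency.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.CreateTemp("/var/lib/etcd", "fsync-probe-*") // hypothetical data dir
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 8*1024) // roughly WAL-entry-sized writes
	var worst time.Duration
	for i := 0; i < 100; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // the fsync etcd's WAL waits on
			panic(err)
		}
		if d := time.Since(start); d > worst {
			worst = d
		}
	}
	fmt.Printf("worst write+fsync latency over 100 rounds: %v\n", worst)
}
```

The performance docs linked above expect the 99th-percentile fsync latency to stay under roughly 10ms; on NFS- or iSCSI-backed VMs it is the latency tail that bites.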
> etcd is a strongly consistent, distributed key-value store, and that consistency comes at a cost: it is extraordinarily sensitive to I/O latency. etcd uses a write-ahead log and relies on fsync calls completing within tight time windows. When storage is slow, even intermittently, etcd starts missing its internal heartbeat and election deadlines. Leader elections fail. The cluster loses quorum. Pods that depend on the API server start dying.
This seems REALLY bad for reliability? I guess the idea is that it's better to have things not respond to requests than to lose data, but the outcome described in the article is pretty nasty.
It seems like the solution they arrived at was to "fix" this at the filesystem level by making fsync no longer deliver reliability, which seems like a pretty clumsy solution. I'm surprised they didn't find some way to make etcd more tolerant of slow storage. I'd be wary of turning off filesystem-level reliability at the risk of later running Postgres or something on the same system and experiencing data loss, when what I wanted was just for Kubernetes or whatever to stop falling over.
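To make the quoted failure mode concrete: etcd's defaults are a 100ms heartbeat interval and a 1000ms election timeout, and, roughly speaking, the leader has to get its WAL append (write plus fsync) done inside that budget. The following toy Go model is not etcd's code, just an illustration of how one long fsync stall crosses those deadlines:

```go
// Toy illustration of the failure mode in the quoted passage (not etcd code):
// a "leader" that must finish its WAL write+fsync before it can heartbeat.
// Timings mirror etcd's defaults: 100ms heartbeat interval, 1s election timeout.
package main

import (
	"fmt"
	"time"
)

const (
	heartbeatInterval = 100 * time.Millisecond // etcd default --heartbeat-interval
	electionTimeout   = 1 * time.Second        // etcd default --election-timeout
)

func main() {
	last := time.Now()
	for _, fsyncStall := range []time.Duration{
		5 * time.Millisecond,    // healthy SSD
		300 * time.Millisecond,  // intermittent hiccup
		1500 * time.Millisecond, // bad stall on network-backed storage
	} {
		time.Sleep(fsyncStall) // stand-in for the WAL append + fsync
		elapsed := time.Since(last)
		switch {
		case elapsed > electionTimeout:
			fmt.Printf("fsync stalled %v: election timeout blown, followers elect a new leader\n", fsyncStall)
		case elapsed > heartbeatInterval:
			fmt.Printf("fsync stalled %v: heartbeat deadline missed\n", fsyncStall)
		default:
			fmt.Printf("fsync stalled %v: fine\n", fsyncStall)
		}
		last = time.Now()
	}
}
```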
PunchyHamster•about 7 hours ago
> This seems REALLY bad for reliability? I guess the idea is that it's better to have things not respond to requests than to lose data, but the outcome described in the article is pretty nasty.
It is. Because if it really starts to crap out above 100ms, just a small hiccup in the network-attached storage of the VM it is running on can set it off.
But it's not as simple as that: if you have multiple nodes and one starts to lag, kicking it out is the only way to keep the latency manageable.
A better solution would be to keep a cluster-wide disk latency average and only kick a node that is slow and much slower than the other nodes; that would also auto-tune to slow setups, like someone running it on some spare HDDs in a homelab.
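As a rough sketch of that heuristic (hypothetical, not something etcd implements; the shouldEvict function, thresholds, and latency numbers are made up for illustration): evict a member only when it is slow in absolute terms and also far above the mean of its peers.

```go
// Hypothetical eviction heuristic sketched from the comment above: a member
// is evicted only if its fsync latency is both above an absolute floor and
// well above the mean of its peers, so a uniformly slow cluster is left alone.
package main

import (
	"fmt"
	"time"
)

const (
	absoluteFloor  = 100 * time.Millisecond // never evict below this
	relativeCutoff = 3.0                    // evict only if > 3x the peer mean
)

func shouldEvict(node time.Duration, peers []time.Duration) bool {
	var sum time.Duration
	for _, p := range peers {
		sum += p
	}
	peerMean := sum / time.Duration(len(peers))
	return node > absoluteFloor && float64(node) > relativeCutoff*float64(peerMean)
}

func main() {
	// Uniformly slow homelab on spare HDDs: nobody gets kicked.
	fmt.Println(shouldEvict(150*time.Millisecond,
		[]time.Duration{140 * time.Millisecond, 145 * time.Millisecond})) // false

	// One member far behind otherwise fast peers: kick it.
	fmt.Println(shouldEvict(400*time.Millisecond,
		[]time.Duration{5 * time.Millisecond, 6 * time.Millisecond})) // true
}
```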
denysvitali•about 10 hours ago
Yes, wouldn't their fix likely make etcd not consistent anymore since there's no guarantee that the data was persisted on disk?
weiliddat•about 9 hours ago
Yes, but they wrote that it's for a demo and it's fine if they lose the last few seconds in the event of an unexpected system shutdown.
Also, for prod, etcd recommends you run on SSDs to minimize the variance of fsync/write latencies.
ahoka•about 8 hours ago
Getting into an inconsistent state does not just mean “losing a few seconds”.
justincormack•about 9 hours ago
Yes, they totally missed the point of the fsync...
_ananos_•about 5 hours ago
Well, the actual issue (IMHO) is that this meta-orchestrator (Karmada) needs quorum even for a single-node cluster.
The purpose of the demo wasn't to show consistency, but to describe the policy-driven decision/mechanism.
What hit us in the first place (and I think this is what we should fix) is the fact that a brand-new NUC-like machine, with a relatively new software stack for spawning VMs (Incus/ZFS, etc.), behaves so badly that it can produce such hiccups in disk IO access...
justsomehnguy•about 2 hours ago
They used a desktop platform with an unknown SSD and with ZFS. There's a chance that with at least a proper SSD they wouldn't even have gotten into trouble in the first place.
_ananos_•about 5 hours ago
Well, indeed -- we should have found the proper parameters to make etcd wait for quorum (again, I'm stressing that it's a single-node cluster -- banging my head to understand who else needs to coordinate with the single node...)
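The knobs that do exist for this are etcd's --heartbeat-interval and --election-timeout flags (both in milliseconds, defaulting to 100 and 1000). Whether raising them would have been enough here is an open question, but a relaxed single-member setup would look roughly like the sketch below; the data dir and values are illustrative, and in practice the flags would just go on the etcd command line or static pod manifest rather than a Go wrapper.

```go
// Rough sketch (illustrative values and paths): start a single-member etcd
// with relaxed Raft timing so a transient fsync stall is less likely to be
// treated as a dead leader. --heartbeat-interval and --election-timeout are
// real etcd flags, both in milliseconds.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("etcd",
		"--data-dir", "/var/lib/etcd-demo", // hypothetical data dir
		"--heartbeat-interval", "500", // 5x the 100ms default
		"--election-timeout", "5000", // keeps the usual 10x ratio to the heartbeat
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("etcd exited: %v", err)
	}
}
```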
api•about 6 hours ago
That’s a design issue in etcd.
landl0rd•about 7 hours ago
CAP theorem goes brrr. This is CP. ZooKeeper gives you AP. Postgres (k3s/kine translation layer) gives you roughly CA, and CP-ish with synchronous streaming replication.
If you run this on single-tenant boxes that are set up carefully (ideally no multi-tenant vCPUs, low network RTT, fast CPU, low swappiness, `ionice` it to high I/O priority, `performance` over `ondemand` governor, XFS), it scales really nicely and you shouldn't run into this.
So there are cases where you actually do want this. A lot of k8s setups would be better served by just hitting Postgres, sure, and don't need the big fancy toy with lots of sharp edges. It's got a raison d'être, though. Also, you just boot slow nodes and run a lot.
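For what it's worth, a couple of the host settings mentioned above are easy to sanity-check from the standard Linux procfs/sysfs paths; a small illustrative preflight sketch (not part of etcd or Kubernetes, and it assumes Linux with cpufreq available):

```go
// Illustrative host preflight (assumes Linux; not etcd or Kubernetes tooling):
// read swappiness and the cpu0 frequency governor from procfs/sysfs and show
// them next to the values suggested above.
package main

import (
	"fmt"
	"os"
	"strings"
)

func read(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "unknown (" + err.Error() + ")"
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Printf("vm.swappiness = %s (want it low)\n", read("/proc/sys/vm/swappiness"))
	fmt.Printf("cpu0 governor = %s (want \"performance\")\n",
		read("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"))
}
```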