3/31/2025 at 1:50:33 AM
Before Bluestore, we ran Ceph on ZFS with the ZFS Intent Log (ZIL) on NVDIMM (basically non-volatile RAM backed by a battery). The performance was extremely good. Today, we run Bluestore on ZVOLs on the same setup, and if the zpool is a "hybrid" pool we put the Ceph OSD databases on an all-NVMe zpool. Ceph WAL wants a disk slice for each OSD, so we skip the Ceph WAL and instead consolidate incoming writes on the ZIL/SLOG on NVDIMM.
by acidmath
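A minimal sketch of the OSD layout described above, assuming hypothetical pool and device names:

    # A sparse ZVOL on the hybrid pool to back one Bluestore OSD:
    zfs create -s -V 16T tank/osd-0

    # OSD on the ZVOL, with its RocksDB on the all-NVMe zpool;
    # note there is no --block.wal, since the pool's NVDIMM-backed
    # ZIL/SLOG absorbs incoming writes instead:
    ceph-volume lvm create --data /dev/zvol/tank/osd-0 \
      --block.db /dev/zvol/nvmepool/osd-0-db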
3/31/2025 at 5:06:06 AM
Why Ceph on ZVOLs and not bare disks?
by nightfly
3/31/2025 at 9:59:57 AM
In the servers we have only 16 GB to 64 GB of NVDIMM, depending on the density of the NVDIMM modules and how many slots are populated with them. Whatever the raw NVDIMM capacity is, usable capacity is half of that, because we mirror the contents for physical redundancy (losing a transaction would be fatal to our business). NVMe is amazing, but not everything should be on NVMe; petabyte-scale object storage, for example, does not need to be all-NVMe (which is super pricey).

In newer DDR5 servers where we can't get NVDIMM, the alternative battery-backed RAM options leave us with even less to work with.
Where we have HDDs or SATA/SAS SSDs in counts in the hundreds, we still want the performance improvements provided by a WAL (or a functional equivalent such as ZIL/SLOG) on NVDIMM, plus some layer-2 caching (where layer-1 is RAM) on NVMe.
Ceph OSDs want a dedicated WAL device. Some places use OpenCAS to make "hybrid" devices out of HDDs by pairing them with SSDs, where the SSD can accelerate reads for that HDD and the Ceph OSD goes on the logical OpenCAS device. OpenCAS is really great, but the devices acting as the "caching layer" often end up underutilized.
By placing "big" Ceph OSDs on ZVOLs, we don't have individual disk slices for the WAL (or equivalent) or individual disks for layer-2 read caching, but rather a consolidated layer in the form of the ZFS Intent Log on a "separate log" (SLOG) device on NVDIMM, and another consolidated layer in the ZFS pool's L2ARC (level-2 adaptive replacement cache).
The ZVOLs are striped across multiple relatively large RAIDz3 arrays. Yeah, it's "less efficient" in some ways, but the tradeoff is worth it for us.
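A rough sketch of that pool topology (disk counts, device names, and sizes are illustrative, not our real ones):

    # Several RAIDz3 vdevs striped into one pool, with a consolidated
    # mirrored NVDIMM SLOG and NVMe L2ARC instead of per-OSD slices:
    zpool create tank \
      raidz3 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
      raidz3 /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl \
      log mirror /dev/pmem0 /dev/pmem1 \
      cache /dev/nvme0n1 /dev/nvme1n1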
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#devices
https://open-cas.com/
by acidmath
3/31/2025 at 4:01:21 PM
Do you have any recommendations or warnings about running Ceph clusters?
by __turbobrew__
3/31/2025 at 5:31:02 PM
Find people who understand it. I've seen epic failures when things grow: you lose a DC and hell rains on you. It's not magic; you will need people who get it (source: an unstable cluster of a few petabytes where I work).
by Agingcoder
3/31/2025 at 6:52:30 PM
Just off the top of my head:

Run Ceph on https://rook.io/ ; don't bother with Cephadm. Running Rook provides very helpful guard rails.

Put the logs for Ceph Rook into Elasticsearch+Kibana on its own small (three- or four-node) dedicated Ceph Rook cluster.

Which Kubernetes distro this runs on matters more than anything.
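A minimal Rook sketch along these lines (assumes the standard operator install in the rook-ceph namespace; the Ceph image version and storage selection are illustrative):

    # cluster.yaml -- apply with: kubectl apply -f cluster.yaml
    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        image: quay.io/ceph/ceph:v18   # illustrative release
      dataDirHostPath: /var/lib/rook
      mon:
        count: 3
      storage:
        useAllNodes: true
        useAllDevices: true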
Recently we have been looking at https://www.parseable.com/ instead of Elasticsearch+Kibana. And somewhat recently we had started moving things from Elasticsearch+Kibana to OpenSearch+OpenSearch Dashboards due to the license change.
The requirement outlined by the Ceph documentation to dedicate layer-1 paths to Ceph replication (they can be the same switches, but must be different ports) is not about "performance" but about normal functionality.
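Concretely, that is the dedicated cluster network; a sketch with hypothetical subnets:

    # Client traffic on one network, replication/recovery on its own links:
    ceph config set global public_network 10.10.0.0/24
    ceph config set global cluster_network 10.20.0.0/24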
If you have any pointed questions feel free to email "section two thirty audit@mail2tor dot com" (where "two thirty" are the three digits rather than spelled out).
by acidmath
3/31/2025 at 9:16:51 PM
I already set things up with Rook as we are super heavily invested in Kubernetes, and things are working well so far. I built out a test cluster to 1 PiB and was able to push more than a terabit/second through the cluster, which was good.

I also set up topology-aware replication so PGs can be spread across racks/datacenters.
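The topology-aware part is roughly the following (rule and pool names are illustrative):

    # Spread replicas across racks instead of hosts:
    ceph osd crush rule create-replicated rack-spread default rack
    ceph osd pool set mypool crush_rule rack-spread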
My main worry now is disaster recovery. From what I have seen, object recovery is quite manual if you lose any. I would like to write some scripts so we can bulk mark objects which we know are actually lost.
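Something like this sketch is what I have in mind (the health-output parsing is illustrative and version-dependent; review each PG before marking anything lost):

    # Find PGs reporting unfound objects, inspect them, then declare them lost.
    for pg in $(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}'); do
      ceph pg "$pg" list_unfound               # review what would be given up on
      ceph pg "$pg" mark_unfound_lost revert   # or 'delete' if no prior version exists
    done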
We already have a Loki setup, so Ceph logs just get put in there.
by __turbobrew__
4/1/2025 at 3:34:51 AM
> object recovery is quite manual if you lose any

When I read this I think "but you should never lose an object". Do you mean like the underlying data chunks Ceph stores? Can you elaborate on this part? I know some of the teams I work with do things in unorthodox ways and we tend to operate on different assumptions than others.
> so pg’s can be spread across racks/datacenters.
Some Ceph pools come to mind (this was a while ago; I'm sure they're still running, though) where the erasure coding was done across cabinet rows and each cabinet row was on its own power distribution. I don't know exactly how the power worked, but I was told rather forwardly that those specific Ceph pools' failure domains aligned with the datacenter's failure domains.
> We already have a loki setup
Nice. We have logs go into S3, and then anyone who prefers a particular tool is welcome to load whatever sets of logs from S3, within the resource limits set for whatever K8s namespace they work with. Originally, keeping logs append-only in S3 was for compliance, but we also wanted to limit team members by RAM quota rather than by tools, in line with the "people over tools over process" DevOps maxim.
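The quota side is plain Kubernetes; a sketch with hypothetical names and numbers:

    # quota.yaml -- apply with: kubectl apply -f quota.yaml
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-ram-quota
      namespace: team-a        # hypothetical team namespace
    spec:
      hard:
        requests.memory: 48Gi  # cap the team by RAM, not by choice of tools
        limits.memory: 64Gi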
by acidmath
4/1/2025 at 7:28:28 PM
> Do you mean like the underlying data chunks Ceph stores

Say I 3x replicate data across racks and I have 3 concurrent rack failures where the stars align and I lose data. What do I do? I may want to make the tradeoff to have lower durability (say replicas are located within the same networking pod) for better performance due to lower latency between replicas. In that case maybe I am fine losing data once in a blue moon.
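For what it's worth, that tradeoff can be expressed in CRUSH directly (bucket, rule, and pool names are hypothetical; "pod" is one of the stock CRUSH bucket types, and the pod1 bucket would have to exist in the CRUSH map):

    # Keep all replicas on separate hosts inside one networking pod:
    ceph osd crush rule create-replicated pod1-local pod1 host
    ceph osd pool set lowlat-pool crush_rule pod1-local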
by __turbobrew__
4/2/2025 at 6:24:41 PM
You mean where three different disks (or the storage underneath the Ceph OSDs), each in a separate disk silo, fail?

In some earlier Ceph clusters I was responsible for, I set replication to four or even five.
There was a point where I wrote an internal memo to business leaders explaining that had I not set that replication to four or five we would have been unable to meet business goals (we would have lost the contracts keeping us afloat during critical stages), but now at scale the real dollar cost of that redundancy was showing. In that memo I explained what networking we needed vs what we had, what people we needed vs what we had, and so on. I eventually got the networking gear the company needed in place and was able to hire the kind of people we needed.
For the Ceph clusters I am chiefly responsible for today, we have Ceph pools in each availability zone doing erasure coding (four or more parity chunks), and a small service makes sure objects are copied between the distanced datacenters (availability zones).
While I do get a bit of a boner (of the Hank Hill on propane kinda vibe) over how the erasure coding is distributed across cabinet rows, to be fair it is a bit of an optimization.
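That erasure-coding setup is roughly the following (the k/m values and names are illustrative; ours differ):

    # Four parity chunks, with the failure domain aligned to cabinet rows:
    ceph osd erasure-code-profile set row-ec k=8 m=4 crush-failure-domain=row
    ceph osd pool create objects 256 256 erasure row-ec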
> to have lower durability (say replicas are located within the same networking pod)
A network segment should never, ever go down. I know in some places this is "optimistic". I can say for us it's not "optimistic", because we don't allow the Ethernet protocol to "hop" anywhere: layer-2 broadcast domains either stop at the top-of-rack switch and the rest is layer-3, or we are using InfiniBand. Stable Ethernet networks are a thing, but not using Ethernet beyond where Ethernet belongs avoids so much risk.
100% network uptime is achievable.
In some places, more critical nodes are connected to three InfiniBand switches, with two uninterruptible power supplies dedicated to each InfiniBand switch. Maybe that's "excessive", but within the last year we had to contend with a switch failure. Then, a week later, a datacenter provider (one that will tell you they never ever EVER lose power) for one of our availability zones was unable to provide power for over two full days.
I don't think we're ever going to have downtime or lose data. In other places this would be "optimistic", but for this deployment I have access to the resources I need to achieve the thing. The only thing that has me concerned about downtime is the current WW3 we are already in heating up.
by acidmath