1/14/2025 at 7:08:59 AM
After years in the making, ZFS RAIDZ expansion is finally here. Major features added in this release (rough command sketches below):
- RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.
- Fast Dedup: A major performance upgrade to the original OpenZFS deduplication functionality.
- Direct IO: Allows bypassing the ARC for reads/writes, improving performance in scenarios like NVMe devices where caching may hinder efficiency.
- JSON: Optional JSON output for the most used commands.
- Long names: Support for file and directory names up to 1023 characters.
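A rough sketch of the commands involved (pool, vdev, and device names here are made up, and the exact flags are worth double-checking against the 2.3 man pages):

    # RAIDZ expansion: attach one more disk to an existing raidz vdev
    zpool attach tank raidz1-0 /dev/sdd

    # JSON output for scripting
    zpool status -j | jq .

    # Long names must be enabled per dataset
    zfs create -o longname=on tank/archive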
by scrp
1/14/2025 at 7:16:27 PM
> RAIDZ Expansion: Add new devices to an existing RAIDZ pool, increasing storage capacity without downtime.
More specifically:
> A new device (disk) can be attached to an existing RAIDZ vdev
by eatbitseveryday
1/15/2025 at 1:01:06 AM
So if I'm running Proxmox on ZFS and NVMe drives, will I be better off enabling Direct IO when 2.3 gets rolled out? What are the use cases for it?
by cromka
1/16/2025 at 9:27:48 PM
Direct IO is useful for databases and other applications that do their own disk caching. Without knowing what you run in Proxmox, no one will be able to tell you whether it's beneficial.
by 0x457
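As a sketch for something like a PostgreSQL dataset (names are made up; direct is the new per-dataset property, primarycache is a long-standing one):

    zfs set direct=always tank/pgdata          # bypass the ARC for reads/writes on this dataset
    zfs set primarycache=metadata tank/pgdata  # optionally cache only metadata in the ARC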
1/15/2025 at 11:45:59 PM
I would guess for very high performance NVMe drives.
by Saris
1/14/2025 at 7:36:25 AM
The first 4 seem like really big deals.
by jdboyd
1/14/2025 at 8:40:34 AM
The fifth is a big deal too, once you consider non-ASCII names.
by snvzz
1/14/2025 at 8:58:39 PM
Could someone show a legit reason to use 1000-character filenames? Seems to me, when filenames are long like that, they are actually capturing several KEYS that can be easily searched via ls & regexes, e.g. 2025-Jan-14-1258.93743_Experiment-2345_Gas-Flow-375.3_etc_etc.dat
But to me this stuff should be in metadata. It's just that we don't have great tools for grepping the metadata.
Heck, the original Macintosh File System (MFS) had no true hierarchical subdirectories. The illusion of subdirectories was created by embedding folder-like names into the filenames themselves, in what was really a flat filesystem.
This was done by using colons (:) as separators in filenames. A file named Folder:Subfolder:File would appear to belong to a subfolder within a folder. This was entirely a user interface convention managed by the Finder. Internally, MFS stored all files in a flat namespace, with no actual directory hierarchy in the filesystem structure.
So, there is 'utility' in "overloading the filename space". But...
by GeorgeTirebiter
1/14/2025 at 9:50:45 PM
> Could someone show a legit reason to use 1000-character filenames?
1023-byte names can mean fewer than 250 characters once Unicode and UTF-8 are involved. Add Unicode normalization, which might "expand" some characters into two or more combining characters, deliberate use of combining characters, emoji, and rare characters, and you end up with many "characters" taking more than 4 bytes. A single "country flag" character will usually be 8 bytes, most emoji will be at least 4 bytes, a skin tone modifier adds another 4 bytes, etc.
this ' ' takes 27 bytes in my terminal, '' takes 28, another combo I found is 35 bytes.
And that's before you even get to long titles in, say, CJK or other less common scripts - an early manuscript of a somewhat successful Japanese novel has a non-normalized filename of 119 bytes, and that's nowhere close to the actually long titles someone might reasonably have on disk. A random find on the internet easily turns up a book title that takes over 300 bytes in non-normalized UTF-8.
P.S. The proper title of "Robinson Crusoe", if used as a filename, takes at least 395 bytes...
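If you want to check the arithmetic yourself, a quick sketch (bash printf with \U escapes; wc -c counts bytes):

    printf '\U0001F1EF\U0001F1F5' | wc -c   # regional-indicator "flag": two 4-byte codepoints = 8 bytes
    printf '\U0001F44B\U0001F3FD' | wc -c   # emoji plus skin-tone modifier = 8 bytes
    printf '吾輩は猫である' | wc -c          # 7 kana/kanji at 3 bytes each = 21 bytes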
by p_l
1/14/2025 at 10:24:19 PM
Hah. Apparently HN eradicated the carefully pasted complex Unicode emoji. The first was "man+woman kissing" with a skin tone modifier, then there were a few flags.
by p_l
1/14/2025 at 7:29:06 AM
But I presume it is still not possible to remove a vdev.
by cm2187
1/14/2025 at 8:57:56 AM
That was added a while ago: https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
It works by making a read-only copy of the vdev being removed inside the free space of the remaining vdevs. The existing vdev is then removed. Data can still be read from the copy, while new writes go to the real vdevs, and the space held by the copy is gradually reclaimed as the old data stops being needed.
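A minimal sketch of that flow, assuming a hypothetical pool named tank with a mirror vdev to evacuate (removal is not supported for raidz vdevs, as noted further down the thread):

    zpool remove tank mirror-1   # start evacuating the vdev; its data is copied into the remaining vdevs
    zpool status tank            # shows evacuation progress and, afterwards, an "indirect" vdev
    zpool remove -s tank         # cancels an in-progress removal if needed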
by ryao
1/14/2025 at 9:00:43 AM
Although "Top-level vdevs can only be removed if the primary pool storage does not contain a top-level raidz vdev, all top-level vdevs have the same sector size, and the keys for all encrypted datasets are loaded."by lutorm
1/14/2025 at 9:02:52 AM
I forgot we still did not have that last bit implemented. However, it is less important now that we have expansion.
by ryao
1/15/2025 at 6:06:36 AM
> However, it is less important now that we have expansion.
Not really sure if that's true. They seem like two different/distinct use cases, though there's probably some small overlap.
by justinclift
1/14/2025 at 11:47:47 AM
And in my case all the vdevs are raidz.
by cm2187
1/14/2025 at 7:35:28 AM
Is this possible elsewhere (re: other filesystems)?
by mustache_kimono
1/14/2025 at 7:40:09 AM
It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.
by cm2187
1/14/2025 at 8:02:01 AM
IIUC the ask (I have a hard time wrapping my head around ZFS vernacular), btrfs allows this, at least in some cases. If you can convince btrfs balance not to use the device you want to remove, it will simply rebalance data onto the other devices, and then you can btrfs device remove it.
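Roughly, with placeholder device and mountpoint (btrfs device remove itself migrates any remaining chunks off the disk before detaching it):

    btrfs device remove /dev/sdd /mnt/data   # relocate its chunks to the other devices, then drop it
    btrfs filesystem show /mnt/data          # the device should no longer be listed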
by lloeki
1/14/2025 at 7:44:17 AM
> It is possible with windows storage space (remove drive from a pool) and mdadm/lvm (remove disk from a RAID array, remove volume from lvm), which to me are the two major alternatives. Don't know about unraid.
Perhaps I am misunderstanding you, but you can offline and remove drives from a ZFS pool.
Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?
by mustache_kimono
1/14/2025 at 8:02:21 AM
So for instance I have a ZFS pool with 3 HDD data vdevs and 2 SSD special vdevs. I want to convert the two SSD vdevs into a single one (or possibly remove one of them). From what I read, the only way to do that is to destroy the entire pool and recreate it (it's in a server in a datacentre, and I don't want to reupload that much data).
In Windows, you can set a disk for removal, and as long as the other disks have enough space and are compatible with the virtual disks (e.g. you need at least 5 disks if you have parity with number of columns = 5), it will rebalance the blocks onto the other disks until you can safely remove the disk. If you use thin provisioning, you can also change your mind about the settings of a virtual disk, create a new one on the same pool, and move the data from one to the other.
Mdadm/lvm will do the same, albeit with more of a pain in the arse, as RAID requires resilvering not just the occupied space but also the free space, so it takes a lot more time and IO than it should.
It's one of my beefs with ZFS: there are lots of no-return decisions. That, and I ran into some race conditions loading a ZFS array on boot with NVMe drives on Ubuntu. They seem to not be ready in time, resulting in randomly degraded arrays. Fixed by loading the pool with a delay.
by cm2187
1/14/2025 at 12:39:00 PM
My understanding is that ZFS does virtual <-> physical translation in the vdev layer, i.e. all block references in ZFS contain a (vdev, vblock) tuple, and the vdev knows how to translate that virtual block offset into actual on-disk block offset(s).
This kinda implies that you can't actually remove data vdevs, because in practice you can't rewrite all references. You also can't do offline deduplication without rewriting references (i.e. actually touching the files in the filesystem). And that's why ZFS can't deduplicate snapshots after the fact.
On the other hand, reshaping a vdev is possible, because that "just" requires shuffling the vblock -> physical block associations inside the vdev.
by formerly_proven
1/14/2025 at 1:32:55 PM
There is a clever trick that is used to make top level removal work. The code will make the vdev readonly. Then it will copy its contents into free space on other vdevs (essentially, the contents will be stored behind the scenes in a file). Finally, it will redirect reads on that vdev into the stored copy. This indirection allows you to remove the vdev. It is not implemented for raid-z at present though.
by ryao
1/14/2025 at 2:14:18 PM
Though the vdev itself still exists after doing that? It just happens to be backed by, essentially, a "file" in the pool, instead of the original physical block devices, right?
by formerly_proven
1/14/2025 at 2:53:36 PM
Yes.
by ryao
1/14/2025 at 10:18:04 AM
The man page says that your example is doable with zpool remove: https://openzfs.github.io/openzfs-docs/man/master/8/zpool-re...
by ryao
1/14/2025 at 8:18:32 AM
> Do you mean WSS and mdadm/lvm will allow an automatic live rebalance and then reconfigure the drive topology?
mdadm can convert RAID-5 to a larger or smaller RAID-5, RAID-6 to a larger or smaller RAID-6, RAID-5 to RAID-6 or the other way around, RAID-0 to a degraded RAID-5, and many other fairly reasonable operations, while the array is online, resistant to power loss and the like.
I wrote the first version of this md code in 2005 (against kernel 2.6.13), and Neil Brown rewrote and mainlined it at some point in 2006. ZFS is… a bit late to the party.
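For reference, the kind of reshapes being described, with placeholder array and disk names (some conversions want a backup file for the critical section):

    # Grow a 4-disk RAID-5 onto a fifth disk
    mdadm --add /dev/md0 /dev/sde
    mdadm --grow /dev/md0 --raid-devices=5

    # Convert RAID-5 to RAID-6 (needs one extra disk for the second parity)
    mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/root/md0.backup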
by Sesse__
1/14/2025 at 9:03:47 AM
Doing this with the on-disk data in a Merkle tree is much harder than doing it on more conventional forms of storage.
By the way, what does MD do when there is corrupt data on disk that makes it impossible to know what the correct reconstruction is during a reshape operation? ZFS will know what file was damaged and proceed with the undamaged parts. ZFS might even be able to repair the damaged data from ditto blocks. I don't know what the MD behavior is, but its options for handling this are likely far more limited.
by ryao
1/14/2025 at 9:06:59 AM
Well, then they made a design choice in their RAID implementation that made fairly reasonable things hard.
I don't know what md does if the parity doesn't match up, no. (I've never ever had that happen, in more than 25 years of pretty heavy md use on various disks.)
by Sesse__
1/14/2025 at 9:21:48 AM
I am not sure if reshaping is a reasonable thing. It is not so reasonable in other fields. In civil engineering, if you build a bridge and then want more lanes, you usually build a new bridge rather than reshape the existing one. The idea of reshaping a bridge while cars are using it would sound insane there, yet that is what people want from storage stacks.
Reshaping traditional storage stacks does not consider all of the ways things can go wrong. Handling all of them well is hard, if not impossible, to do in traditional RAID. There is a long history of hardware analogs to MD RAID killing parity arrays when they encounter silent corruption that makes it impossible to know what is supposed to be stored there. There is also the case where things are corrupted such that there is a valid reconstruction, but the reconstruction silently produces something wrong.
Reshaping certainly is easier to do with MD RAID, but the feature has the trade off that edge cases are not handled well. For most people, I imagine that risk is fine until it bites them. Then it is not fine anymore. ZFS made an effort to handle all of the edge cases so that they do not bite people and doing that took time.
by ryao
1/14/2025 at 9:46:36 AM
> I am not sure if reshaping is a reasonable thing.
Yet people are celebrating when ZFS adds it. Was it all for nothing?
by Sesse__
1/14/2025 at 10:05:34 AM
People wanted it, but it was very hard to do safely. While ZFS now can do it safely, many other storage solutions cannot.
Those corruption issues I mentioned, where the RAID controller has no idea what to do, affect far more than just reshaping. They affect traditional RAID arrays when disks die and when patrol scrubs are done. I have not tested MD RAID on edge cases lately, but the last time I did, I found MD RAID ignored corruption whenever possible. It would not detect corruption in normal operation because it assumed all data blocks are good unless SMART said otherwise. Thus, it would randomly serve bad data from corrupted mirror members and always serve bad data from RAID 5/6 members whenever the data blocks were corrupted. This was particularly tragic on RAID 6, where MD RAID is hypothetically able to detect and correct the corruption if it tried. Doing that would come with such a huge performance overhead that it is clear why it was not done.
Getting back to reshaping, while I did not explicitly test it, I would expect that unless a disk is missing or disappears during a reshape, MD RAID would ignore any corruption that can be detected using parity and assume all data blocks are good just like it does in normal operation. It does not make sense for MD RAID to look for corruption during a reshape operation, since not only would it be slower, but even if it finds corruption, it has no clue how to correct the corruption unless RAID 6 is used, there are no missing/failed members and the affected stripe does not have any read errors from SMART detecting a bad sector that would effectively make it as if there was a missing disk.
You could do your own tests. You should find that ZFS gracefully handles the edge cases where the wrong thing sits in a spot where something important should be, while MD RAID does not. MD RAID is a reimplementation of a technology from the 1960s. If 1960s storage technology handled these edge cases well, Sun Microsystems would not have made ZFS to get away from older technologies.
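One way to run such a test yourself is with throwaway file-backed vdevs; this is only a sketch (arbitrary paths, never point it at real disks):

    truncate -s 1G /tmp/d1 /tmp/d2
    zpool create testpool mirror /tmp/d1 /tmp/d2
    cp /some/test/file /testpool/
    zpool export testpool
    # silently corrupt a few MB in the middle of one mirror side
    dd if=/dev/urandom of=/tmp/d1 bs=1M count=4 seek=200 conv=notrunc
    zpool import -d /tmp testpool
    zpool scrub testpool
    zpool status -v testpool   # reports checksum errors and repairs them from the intact copy

The interesting comparison is doing the equivalent to one leg of an MD mirror and seeing which copy gets served back.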
by ryao
1/15/2025 at 6:10:15 AM
> While ZFS now can do it safely ...
It's the first release with the code, so "safely" might not be the right description until a few point releases happen. ;)
by justinclift
1/15/2025 at 5:31:43 PM
It was in development for 8 years. I think it is safe, but time will tell.
by ryao
1/14/2025 at 4:24:45 PM
I’ve experienced bit rot on md. It was not fun, and the tooling was of approximately no help recovering.
by amluto
1/14/2025 at 8:41:49 AM
Storage Spaces doesn't dedicate a drive to a single purpose. It operates in chunks (256 MB, I think), so one drive can simultaneously be part of a mirror, a RAID-5, and a RAID-0. This allows fully using drives of various sizes. Choosing to remove a drive will cause it to redistribute the chunks to the other available drives, without going offline.
by TiredOfLife
1/14/2025 at 2:11:11 PM
And as a user it seems to me to be the most elegant design. The quality of the implementation (parity write performance in particular) is another matter.
by cm2187
1/15/2025 at 10:25:02 AM
btrfs has supported online adding and removing of devices to the pool from the start.
by pantalaimon
1/14/2025 at 7:45:04 AM
Bcachefs allows it.
by c45y
1/14/2025 at 8:00:12 AM
Cool, I just have to wait for it to be stable enough for daily use with mission-critical data. I am personally optimistic about bcachefs, but incredibly pessimistic about changing filesystems.
by eptcyka
1/14/2025 at 10:15:29 AM
It seems easier to copy data to a new ZFS pool if you need to remove RAID-Z top level vdevs. Another possibility is to just wait for someone to implement it in ZFS. ZFS already has top level vdev removal for other types of vdevs. Support for top level raid-z vdev removal just needs to be implemented on top of that.
by ryao
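The copy route is essentially a recursive snapshot plus a replication send; pool names here are placeholders:

    zfs snapshot -r oldpool@migrate
    zfs send -R oldpool@migrate | zfs receive -F newpool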
1/14/2025 at 9:19:51 AM
Btrfs
by unixhero
1/14/2025 at 9:26:33 AM
Except you shouldn't use btrfs for any parity-based RAID if you value your data at all. In fact, I'm not aware of any vendor that has shipped btrfs with parity-based RAID; they all resort to btrfs on md.
by tw04
1/14/2025 at 9:07:42 PM
How well tested is this in combination with encryption?
Is the ZFS team handling encryption as a first class priority at all?
ZFS on Linux inherited a lot of fame from ZFS on Solaris, but everyone using it in production should study the issue tracker very well for a realistic impression of the situation.
by BodyCulture
1/14/2025 at 9:17:47 PM
The main issue with encryption is occasional attempts by a certain (specific) Linux kernel developer to lock ZFS out of access to advanced instruction set extensions (far from the only weird idea from that developer).
The way ZFS encryption is layered, the features should be pretty much orthogonal to each other, but I'll admit there's something lacking in ZFS native encryption (though, in my experience, mainly in the upper-layer tooling rather than the actual on-disk encryption parts).
by p_l
1/15/2025 at 5:55:37 AM
These are actually wrappers around CPU instructions, so what ZFS does is implement its own equivalents. This does not affect encryption (beyond the inconvenience that we did not have SIMD acceleration for a while on certain architectures).
by ryao
1/15/2025 at 2:37:57 AM
> occasional attempts by a certain (specific) Linux kernel developer
Can we please refer to them by the actual name?
by snvzz
1/15/2025 at 2:40:16 AM
Greg Kroah-Hartman.
by E39M5S62
1/15/2025 at 5:54:07 AM
The new features should interact fine with encryption. They are implemented at different parts of ZFS' internal stack.
There have been many man hours put into investigating bug reports involving encryption and some fixes were made. Unfortunately, something appears to be going wrong when non-raw sends of encrypted datasets are received by another system:
https://github.com/openzfs/zfs/issues/12014
I do not believe anyone has figured out what is going wrong there. It has not been for lack of trying. Raw sends from encrypted datasets appear to be fine.
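For anyone following along, the distinction looks roughly like this (dataset and host names are made up):

    # raw send: ciphertext goes over the wire as-is and the receiver never needs the key (the case reported as fine)
    zfs send -w tank/secure@snap | ssh backup zfs receive pool/secure

    # non-raw send of an encrypted dataset: data is decrypted on the sender, which is the path
    # where the issue 12014 reports come from
    zfs send tank/secure@snap | ssh backup zfs receive pool/secure-plain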
by ryao