-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmlb99EACgkQxWXV+ddt
WDtJfQ//cppHAHSxb3NNDGXDiKx4ccCp9CWiOF7z+BTFfngsNGvbs2FzKFnYI2f3
dT/DlPV8uBgVX3uYL3ZI1na/5MShXvS+sajIRhz3woyKBb2shVqVnFmfA8A3pKf6
3Dfm6FWrJHGCgV28Oi5pbg/UQeTAHAmA2aPLYJKRnNwIq8pSSzDWRCVNFfYrt4o2
7UUW1PzasZ7tuqL55HcwzuXjVTYr/t3puLjq+ydVfGSJSZlmlMd3pnZXz8S7/BC6
jVQGOT6nK9SWCnfXD9plqqr4CY+ThJZJNSdhVTwfVxkxVHmEBWfqfhAToqZaLKX9
co3rXvvZyIQf5KeHMmtbb2P736zaAcKb7G41liRN7EZg/gOsROE+UziYRkTg+Xyg
rztTksc913DsuHj19sZhIgcKRcym2h57wyZyt7vYAdsv9uksLUgKUo3U9CiTbEsb
8d/vgt1e3+ELoVcc+xVZSSGRDVzvZnxVmRHQV2dAtIXK34FXzqCDeKnFG0wsjqtF
Kw6bV93cXLohfcB7fPPBdAHzVN89kfUXTBT8mrri7HnjSnZTJNeHrGpcRNNQ76BT
8RL6gSP32Mpo9HZOYYhl1Xj2hRonRiJrUQAb6x9CY1MMUP2vwVvVBUVj2NAohWdM
vAYwRQDigw92RoKIYvHu+X+E5PXgX2AQ9NV8qiL79od+A7NFLgY=
=hmbc
-----END PGP SIGNATURE-----
Merge tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential deadlock due to mismatching transaction states when
waiting for the current transaction
- fix squota accounting with nested snapshots
- fix quota inheritance of qgroups with multiple parent qgroups
- fix NULL inode pointer in evict tracepoint
- fix writes beyond end of file on systems with 64K page size and 4K
block size
- fix logging of inodes after exchange rename
- fix use after free when using ref_tracker feature
- space reservation fixes
* tag 'for-6.19-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix reservation leak in some error paths when inserting inline extent
btrfs: do not free data reservation in fallback from inline due to -ENOSPC
btrfs: fix use-after-free warning in btrfs_get_or_create_delayed_node()
btrfs: always detect conflicting inodes when logging inode refs
btrfs: fix beyond-EOF write handling
btrfs: fix deadlock in wait_current_trans() due to ignored transaction type
btrfs: fix NULL dereference on root when tracing inode eviction
btrfs: qgroup: update all parent qgroups when doing quick inherit
btrfs: fix qgroup_snapshot_quick_inherit() squota bug
If we fail to allocate a path or join a transaction, we return from
__cow_file_range_inline() without freeing the reserved qgroup data,
resulting in a leak. Fix this by ensuring we call btrfs_qgroup_free_data()
in such cases.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to create an inline extent due to -ENOSPC, we will attempt to
go through the normal COW path, reserve an extent, create an ordered
extent, etc. However we were always freeing the reserved qgroup data,
which is wrong since we will use data. Fix this by freeing the reserved
qgroup data in __cow_file_range_inline() only if we are not doing the
fallback (ret is <= 0).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, btrfs_get_or_create_delayed_node() set the delayed_node's
refcount before acquiring the root->delayed_nodes lock.
Commit e8513c012de7 ("btrfs: implement ref_tracker for delayed_nodes")
moved refcount_set inside the critical section, which means there is
no longer a memory barrier between setting the refcount and setting
btrfs_inode->delayed_node.
Without that barrier, the stores to node->refs and
btrfs_inode->delayed_node may become visible out of order. Another
thread can then read btrfs_inode->delayed_node and attempt to
increment a refcount that hasn't been set yet, leading to a
refcounting bug and a use-after-free warning.
The fix is to move refcount_set back to where it was to take
advantage of the implicit memory barrier provided by lock
acquisition.
Because the allocations now happen outside of the lock's critical
section, they can use GFP_NOFS instead of GFP_ATOMIC.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511262228.6dda231e-lkp@intel.com
Fixes: e8513c012de7 ("btrfs: implement ref_tracker for delayed_nodes")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After rename exchanging (either with the rename exchange operation or
regular renames in multiple non-atomic steps) two inodes and at least
one of them is a directory, we can end up with a log tree that contains
only of the inodes and after a power failure that can result in an attempt
to delete the other inode when it should not because it was not deleted
before the power failure. In some case that delete attempt fails when
the target inode is a directory that contains a subvolume inside it, since
the log replay code is not prepared to deal with directory entries that
point to root items (only inode items).
1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
same parent directory;
2) We have a file (inode C) under directory "dir1" (inode A);
3) We have a subvolume inside directory "dir2" (inode B);
4) All these inodes were persisted in a past transaction and we are
currently at transaction N;
5) We rename the file (inode C), so at btrfs_log_new_name() we update
inode C's last_unlink_trans to N;
6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
so after the exchange "dir1" is inode B and "dir2" is inode A.
During the rename exchange we call btrfs_log_new_name() for inodes
A and B, but because they are directories, we don't update their
last_unlink_trans to N;
7) An fsync against the file (inode C) is done, and because its inode
has a last_unlink_trans with a value of N we log its parent directory
(inode A) (through btrfs_log_all_parents(), called from
btrfs_log_inode_parent()).
8) So we end up with inode B not logged, which now has the old name
of inode A. At copy_inode_items_to_log(), when logging inode A, we
did not check if we had any conflicting inode to log because inode
A has a generation lower than the current transaction (created in
a past transaction);
9) After a power failure, when replaying the log tree, since we find that
inode A has a new name that conflicts with the name of inode B in the
fs tree, we attempt to delete inode B... this is wrong since that
directory was never deleted before the power failure, and because there
is a subvolume inside that directory, attempting to delete it will fail
since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
to deal with dir items that point to roots instead of inodes.
When that happens the mount fails and we get a stack trace like the
following:
[87.2314] BTRFS info (device dm-0): start tree-log replay
[87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
[87.2332] ------------[ cut here ]------------
[87.2338] BTRFS: Transaction aborted (error -2)
[87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
[87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G W 6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
[87.2489] Tainted: [W]=WARN
[87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2538] Code: c0 89 04 24 (...)
[87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
[87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
[87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
[87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
[87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
[87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
[87.2618] FS: 00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
[87.2629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
[87.2648] Call Trace:
[87.2651] <TASK>
[87.2654] btrfs_unlink_inode+0x15/0x40 [btrfs]
[87.2661] unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
[87.2669] check_item_in_log+0x1ea/0x2c0 [btrfs]
[87.2676] replay_dir_deletes+0x16b/0x380 [btrfs]
[87.2684] fixup_inode_link_count+0x34b/0x370 [btrfs]
[87.2696] fixup_inode_link_counts+0x41/0x160 [btrfs]
[87.2703] btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
[87.2711] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
[87.2719] open_ctree+0x10bb/0x15f0 [btrfs]
[87.2726] btrfs_get_tree.cold+0xb/0x16c [btrfs]
[87.2734] ? fscontext_read+0x15c/0x180
[87.2740] ? rw_verify_area+0x50/0x180
[87.2746] vfs_get_tree+0x25/0xd0
[87.2750] vfs_cmd_create+0x59/0xe0
[87.2755] __do_sys_fsconfig+0x4f6/0x6b0
[87.2760] do_syscall_64+0x50/0x1220
[87.2764] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[87.2770] RIP: 0033:0x7f7b9625f4aa
[87.2775] Code: 73 01 c3 48 (...)
[87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
[87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
[87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
[87.2877] </TASK>
[87.2882] ---[ end trace 0000000000000000 ]---
[87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
[87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
[87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
[87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
[87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
[87.2929] item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
[87.2929] inode generation 9 transid 10 size 0 nbytes 0
[87.2929] block group 0 mode 40755 links 1 uid 0 gid 0
[87.2929] rdev 0 sequence 7 flags 0x0
[87.2929] atime 1765464494.678070921
[87.2929] ctime 1765464494.686606513
[87.2929] mtime 1765464494.686606513
[87.2929] otime 1765464494.678070921
[87.2929] item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
[87.2929] index 4 name_len 4
[87.2929] item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
[87.2929] dir log end 2
[87.2929] item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
[87.2929] dir log end 18446744073709551615
[87.2930] item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
[87.2930] location key (258 1 0) type 1
[87.2930] transid 10 data_len 0 name_len 3
[87.2930] item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
[87.2930] inode generation 9 transid 10 size 0 nbytes 0
[87.2930] block group 0 mode 100644 links 1 uid 0 gid 0
[87.2930] rdev 0 sequence 2 flags 0x0
[87.2930] atime 1765464494.678456467
[87.2930] ctime 1765464494.686606513
[87.2930] mtime 1765464494.678456467
[87.2930] otime 1765464494.678456467
[87.2930] item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
[87.2930] index 3 name_len 3
[87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
[87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
[87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr
So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
-c "truncate 16k" $mnt/foobar
umount $mnt
This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):
btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
Note for file offset 60K, we're inserting a file extent without any
data checksum.
Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE]
The sequence happens like this:
- Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0 16K 32K 48K 64K
|/////////////| |//| |//|
- Truncate the inode to 16K
Which will trigger writeback through:
btrfs_setsize()
|- truncate_setsize()
| Now the inode size is set to 16K
|
|- btrfs_truncate()
|- btrfs_wait_ordered_range() for [16K, u64(-1)]
|- btrfs_fdatawrite_range() for [16K, u64(-1)}
|- extent_writepage() for folio 0
|- writepage_delalloc()
| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
|
|- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled
differently:
- Submit write for range [0, 16K)
As they are still inside the inode size (16K).
- Mark OE [32K, 36K) as truncated
Since we only call btrfs_lookup_first_ordered_range() once, which
returned the first OE after file offset 16K.
- Mark all OEs inside range [16K, 64K) as finished
Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its
truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data
checksum, and we insert the file extent as usual.
This is the root cause of file extent at 60K to be inserted without
any data checksum.
- Clear dirty flags for range [16K, 64K)
It is the function btrfs_folio_clear_dirty() which searches and clears
any dirty blocks inside that range.
[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.
At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.
But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.
Later commit 18de34daa7c6 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When wait_current_trans() is called during start_transaction(), it
currently waits for a blocked transaction without considering whether
the given transaction type actually needs to wait for that particular
transaction state. The btrfs_blocked_trans_types[] array already defines
which transaction types should wait for which transaction states, but
this check was missing in wait_current_trans().
This can lead to a deadlock scenario involving two transactions and
pending ordered extents:
1. Transaction A is in TRANS_STATE_COMMIT_DOING state
2. A worker processing an ordered extent calls start_transaction()
with TRANS_JOIN
3. join_transaction() returns -EBUSY because Transaction A is in
TRANS_STATE_COMMIT_DOING
4. Transaction A moves to TRANS_STATE_UNBLOCKED and completes
5. A new Transaction B is created (TRANS_STATE_RUNNING)
6. The ordered extent from step 2 is added to Transaction B's
pending ordered extents
7. Transaction B immediately starts commit by another task and
enters TRANS_STATE_COMMIT_START
8. The worker finally reaches wait_current_trans(), sees Transaction B
in TRANS_STATE_COMMIT_START (a blocked state), and waits
unconditionally
9. However, TRANS_JOIN should NOT wait for TRANS_STATE_COMMIT_START
according to btrfs_blocked_trans_types[]
10. Transaction B is waiting for pending ordered extents to complete
11. Deadlock: Transaction B waits for ordered extent, ordered extent
waits for Transaction B
This can be illustrated by the following call stacks:
CPU0 CPU1
btrfs_finish_ordered_io()
start_transaction(TRANS_JOIN)
join_transaction()
# -EBUSY (Transaction A is
# TRANS_STATE_COMMIT_DOING)
# Transaction A completes
# Transaction B created
# ordered extent added to
# Transaction B's pending list
btrfs_commit_transaction()
# Transaction B enters
# TRANS_STATE_COMMIT_START
# waiting for pending ordered
# extents
wait_current_trans()
# waits for Transaction B
# (should not wait!)
Task bstore_kv_sync in btrfs_commit_transaction waiting for ordered
extents:
__schedule+0x2e7/0x8a0
schedule+0x64/0xe0
btrfs_commit_transaction+0xbf7/0xda0 [btrfs]
btrfs_sync_file+0x342/0x4d0 [btrfs]
__x64_sys_fdatasync+0x4b/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Task kworker in wait_current_trans waiting for transaction commit:
Workqueue: btrfs-syno_nocow btrfs_work_helper [btrfs]
__schedule+0x2e7/0x8a0
schedule+0x64/0xe0
wait_current_trans+0xb0/0x110 [btrfs]
start_transaction+0x346/0x5b0 [btrfs]
btrfs_finish_ordered_io.isra.0+0x49b/0x9c0 [btrfs]
btrfs_work_helper+0xe8/0x350 [btrfs]
process_one_work+0x1d3/0x3c0
worker_thread+0x4d/0x3e0
kthread+0x12d/0x150
ret_from_fork+0x1f/0x30
Fix this by passing the transaction type to wait_current_trans() and
checking btrfs_blocked_trans_types[cur_trans->state] against the given
type before deciding to wait. This ensures that transaction types which
are allowed to join during certain blocked states will not unnecessarily
wait and cause deadlocks.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When evicting an inode the first thing we do is to setup tracing for it,
which implies fetching the root's id. But in btrfs_evict_inode() the
root might be NULL, as implied in the next check that we do in
btrfs_evict_inode().
Hence, we either should set the ->root_objectid to 0 in case the root is
NULL, or we move tracing setup after checking that the root is not
NULL. Setting the rootid to 0 at least gives us the possibility to trace
this call even in the case when the root is NULL, so that's the solution
taken here.
Fixes: 1abe9b8a138c ("Btrfs: add initial tracepoint support for btrfs")
Reported-by: syzbot+d991fea1b4b23b1f6bf8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d991fea1b4b23b1f6bf8
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug that if a subvolume has multi-level parent qgroups, and
is able to do a quick inherit, only the direct parent qgroup got
updated:
mkfs.btrfs -f -O quota $dev
mount $dev $mnt
btrfs subv create $mnt/subv1
btrfs qgroup create 1/100 $mnt
btrfs qgroup create 2/100 $mnt
btrfs qgroup assign 1/100 2/100 $mnt
btrfs qgroup assign 0/256 1/100 $mnt
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
1/100 16.00KiB 16.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
btrfs subv snap -i 1/100 $mnt/subv1 $mnt/snap1
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
0/257 16.00KiB 16.00KiB 1/100 snap1
1/100 32.00KiB 32.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
# Note that 2/100 is not updated, and qgroup numbers are inconsistent
umount $mnt
[CAUSE]
If the snapshot source subvolume belongs to a parent qgroup, and the new
snapshot target is also added to the new same parent qgroup, we allow a
quick update without marking qgroup inconsistent.
But that quick update only update the parent qgroup, without checking if
there is any more parent qgroups.
[FIX]
Iterate through all parent qgroups during the quick inherit.
Reported-by: Boris Burkov <boris@bur.io>
Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
qgroup_snapshot_quick_inherit() detects conditions where the snapshot
destination would land in the same parent qgroup as the snapshot source
subvolume. In this case we can avoid costly qgroup calculations and just
add the nodesize of the new snapshot to the parent.
However, in the case of squotas this is actually a double count, and
also an undercount for deeper qgroup nestings.
The following annotated script shows the issue:
btrfs quota enable --simple "$mnt"
# Create 2-level qgroup hierarchy
btrfs qgroup create 2/100 "$mnt" # Q2 (level 2)
btrfs qgroup create 1/100 "$mnt" # Q1 (level 1)
btrfs qgroup assign 1/100 2/100 "$mnt"
# Create base subvolume
btrfs subvolume create "$mnt/base" >/dev/null
base_id=$(btrfs subvolume show "$mnt/base" | grep 'Subvolume ID:' | awk '{print $3}')
# Create intermediate snapshot and add to Q1
btrfs subvolume snapshot "$mnt/base" "$mnt/intermediate" >/dev/null
inter_id=$(btrfs subvolume show "$mnt/intermediate" | grep 'Subvolume ID:' | awk '{print $3}')
btrfs qgroup assign "0/$inter_id" 1/100 "$mnt"
# Create working snapshot with --inherit (auto-adds to Q1)
# src=intermediate (in only Q1)
# dst=snap (inheriting only into Q1)
# This double counts the 16k nodesize of the snapshot in Q1, and
# undercounts it in Q2.
btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null
snap_id=$(btrfs subvolume show "$mnt/snap" | grep 'Subvolume ID:' | awk '{print $3}')
# Fully complete snapshot creation
sync
# Delete working snapshot
# Q1 and Q2 will lose the full snap usage
btrfs subvolume delete "$mnt/snap" >/dev/null
# Delete intermediate and remove from Q1
# Q1 and Q2 will lose the full intermediate usage
btrfs qgroup remove "0/$inter_id" 1/100 "$mnt"
btrfs subvolume delete "$mnt/intermediate" >/dev/null
# Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...)
# Trigger cleaner, wait for deletions
mount -o remount,sync=1 "$mnt"
btrfs subvolume sync "$mnt" "$snap_id"
btrfs subvolume sync "$mnt" "$inter_id"
# Remove Q1 from Q2
# Frees 16k more from Q2, underflowing it to 16EiB
btrfs qgroup remove 1/100 2/100 "$mnt"
# And show the bad state:
btrfs qgroup show -pc "$mnt"
Qgroupid Referenced Exclusive Parent Child Path
-------- ---------- --------- ------ ----- ----
0/5 16.00KiB 16.00KiB - - <toplevel>
0/256 16.00KiB 16.00KiB - - base
1/100 16.00KiB 16.00KiB - - <0 member qgroups>
2/100 16.00EiB 16.00EiB - - <0 member qgroups>
Fix this by simply not doing this quick inheritance with squotas.
I suspect that it is also wrong in normal qgroups to not recurse up the
qgroup tree in the quick inherit case, though other consistency checks
will likely fix it anyway.
Fixes: b20fe56cd285 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-12-16 22:09:52 +01:00
7 changed files with 65 additions and 38 deletions
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.