From 1aff297ffb925ee49299f86af5fcdf854cb48c5f Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Sun, 7 Dec 2025 14:53:20 +1030 Subject: [PATCH 1/6] btrfs: avoid access-beyond-folio for bs > ps encoded writes [POTENTIAL BUG] If the system page size is 4K and fs block size is 8K, and max_inline mount option is set to 6K, we can inline a 6K sized data extent. Then a encoded write submitted a compressed extent which is at file offset 0, and the compressed length is 6K, which is allowed to be inlined. Now a read beyond page boundary is triggered inside write_extent_buffer() from insert_inline_extent(). [CAUSE] Currently the function __cow_file_range_inline() can only accept a single folio. For regular compressed write path, we always allocate the compressed folios using the minimal order matching the block size, thus the @compressed_folio should always cover a full fs block thus it is fine. But for encoded writes, they allocate page size folios, this means we can hit a case where the compressed data is smaller than block size but still larger than page size, in that case __cow_file_range_inline() will be called with @compressed_size larger than a page. In that case we will trigger a read beyond the folio inside insert_inline_extent(). Thankfully this is not that common, as the default max_inline is only 2048 bytes, smaller than PAGE_SIZE, and bs > ps support is still experimental. [FIX] We need to either allow insert_inline_extent() to accept a page array to properly support such case, or reject such inline extent. The latter is a much simpler solution, and considering bs > ps will stay as a corner case and non-default max_inline will be even rarer, I don't think we really need to fulfill such niche. So just reject any inline extent that's larger than PAGE_SIZE, and add an extra ASSERT() to insert_inline_extent() to catch such beyond-boundary access. Fixes: ec20799064c8 ("btrfs: enable encoded read/write/send for bs > ps cases") Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/inode.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 03337fa7a61c..bfb8e98d8c02 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -480,13 +480,15 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, ASSERT(size <= sectorsize); /* - * The compressed size also needs to be no larger than a sector. - * That's also why we only need one page as the parameter. + * The compressed size also needs to be no larger than a page. + * That's also why we only need one folio as the parameter. */ - if (compressed_folio) + if (compressed_folio) { ASSERT(compressed_size <= sectorsize); - else + ASSERT(compressed_size <= PAGE_SIZE); + } else { ASSERT(compressed_size == 0); + } if (compressed_size && compressed_folio) cur_size = compressed_size; @@ -573,6 +575,18 @@ static bool can_cow_file_range_inline(struct btrfs_inode *inode, if (offset != 0) return false; + /* + * Even for bs > ps cases, cow_file_range_inline() can only accept a + * single folio. + * + * This can be problematic and cause access beyond page boundary if a + * page sized folio is passed into that function. + * And encoded write is doing exactly that. + * So here limits the inlined extent size to PAGE_SIZE. + */ + if (size > PAGE_SIZE || compressed_size > PAGE_SIZE) + return false; + /* Inline extents are limited to sectorsize. */ if (size > fs_info->sectorsize) return false; From 8731f2c50b0b1d2b58ed5b9671ef2c4bdc2f8347 Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Tue, 16 Dec 2025 14:51:52 +0000 Subject: [PATCH 2/6] btrfs: release path before initializing extent tree in btrfs_read_locked_inode() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In btrfs_read_locked_inode() we are calling btrfs_init_file_extent_tree() while holding a path with a read locked leaf from a subvolume tree, and btrfs_init_file_extent_tree() may do a GFP_KERNEL allocation, which can trigger reclaim. This can create a circular lock dependency which lockdep warns about with the following splat: [6.1433] ====================================================== [6.1574] WARNING: possible circular locking dependency detected [6.1583] 6.18.0+ #4 Tainted: G U [6.1591] ------------------------------------------------------ [6.1599] kswapd0/117 is trying to acquire lock: [6.1606] ffff8d9b6333c5b8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1625] but task is already holding lock: [6.1633] ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1646] which lock already depends on the new lock. [6.1657] the existing dependency chain (in reverse order) is: [6.1667] -> #2 (fs_reclaim){+.+.}-{0:0}: [6.1677] fs_reclaim_acquire+0x9d/0xd0 [6.1685] __kmalloc_cache_noprof+0x59/0x750 [6.1694] btrfs_init_file_extent_tree+0x90/0x100 [6.1702] btrfs_read_locked_inode+0xc3/0x6b0 [6.1710] btrfs_iget+0xbb/0xf0 [6.1716] btrfs_lookup_dentry+0x3c5/0x8e0 [6.1724] btrfs_lookup+0x12/0x30 [6.1731] lookup_open.isra.0+0x1aa/0x6a0 [6.1739] path_openat+0x5f7/0xc60 [6.1746] do_filp_open+0xd6/0x180 [6.1753] do_sys_openat2+0x8b/0xe0 [6.1760] __x64_sys_openat+0x54/0xa0 [6.1768] do_syscall_64+0x97/0x3e0 [6.1776] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1784] -> #1 (btrfs-tree-00){++++}-{3:3}: [6.1794] lock_release+0x127/0x2a0 [6.1801] up_read+0x1b/0x30 [6.1808] btrfs_search_slot+0x8e0/0xff0 [6.1817] btrfs_lookup_inode+0x52/0xd0 [6.1825] __btrfs_update_delayed_inode+0x73/0x520 [6.1833] btrfs_commit_inode_delayed_inode+0x11a/0x120 [6.1842] btrfs_log_inode+0x608/0x1aa0 [6.1849] btrfs_log_inode_parent+0x249/0xf80 [6.1857] btrfs_log_dentry_safe+0x3e/0x60 [6.1865] btrfs_sync_file+0x431/0x690 [6.1872] do_fsync+0x39/0x80 [6.1879] __x64_sys_fsync+0x13/0x20 [6.1887] do_syscall_64+0x97/0x3e0 [6.1894] entry_SYSCALL_64_after_hwframe+0x76/0x7e [6.1903] -> #0 (&delayed_node->mutex){+.+.}-{3:3}: [6.1913] __lock_acquire+0x15e9/0x2820 [6.1920] lock_acquire+0xc9/0x2d0 [6.1927] __mutex_lock+0xcc/0x10a0 [6.1934] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1944] btrfs_evict_inode+0x20b/0x4b0 [6.1952] evict+0x15a/0x2f0 [6.1958] prune_icache_sb+0x91/0xd0 [6.1966] super_cache_scan+0x150/0x1d0 [6.1974] do_shrink_slab+0x155/0x6f0 [6.1981] shrink_slab+0x48e/0x890 [6.1988] shrink_one+0x11a/0x1f0 [6.1995] shrink_node+0xbfd/0x1320 [6.1002] balance_pgdat+0x67f/0xc60 [6.1321] kswapd+0x1dc/0x3e0 [6.1643] kthread+0xff/0x240 [6.1965] ret_from_fork+0x223/0x280 [6.1287] ret_from_fork_asm+0x1a/0x30 [6.1616] other info that might help us debug this: [6.1561] Chain exists of: &delayed_node->mutex --> btrfs-tree-00 --> fs_reclaim [6.1503] Possible unsafe locking scenario: [6.1110] CPU0 CPU1 [6.1411] ---- ---- [6.1707] lock(fs_reclaim); [6.1998] lock(btrfs-tree-00); [6.1291] lock(fs_reclaim); [6.1581] lock(&delayed_node->mutex); [6.1874] *** DEADLOCK *** [6.1716] 2 locks held by kswapd0/117: [6.1999] #0: ffffffffa4ab8ce0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x195/0xc60 [6.1294] #1: ffff8d998344b0e0 (&type->s_umount_key#40){++++}- {3:3}, at: super_cache_scan+0x37/0x1d0 [6.1596] stack backtrace: [6.1183] CPU: 11 UID: 0 PID: 117 Comm: kswapd0 Tainted: G U 6.18.0+ #4 PREEMPT(lazy) [6.1185] Tainted: [U]=USER [6.1186] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023 [6.1187] Call Trace: [6.1187] [6.1189] dump_stack_lvl+0x6e/0xa0 [6.1192] print_circular_bug.cold+0x17a/0x1c0 [6.1194] check_noncircular+0x175/0x190 [6.1197] __lock_acquire+0x15e9/0x2820 [6.1200] lock_acquire+0xc9/0x2d0 [6.1201] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1204] __mutex_lock+0xcc/0x10a0 [6.1206] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1208] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1211] ? __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1213] __btrfs_release_delayed_node.part.0+0x39/0x2f0 [6.1215] btrfs_evict_inode+0x20b/0x4b0 [6.1217] ? lock_acquire+0xc9/0x2d0 [6.1220] evict+0x15a/0x2f0 [6.1222] prune_icache_sb+0x91/0xd0 [6.1224] super_cache_scan+0x150/0x1d0 [6.1226] do_shrink_slab+0x155/0x6f0 [6.1228] shrink_slab+0x48e/0x890 [6.1229] ? shrink_slab+0x2d2/0x890 [6.1231] shrink_one+0x11a/0x1f0 [6.1234] shrink_node+0xbfd/0x1320 [6.1236] ? shrink_node+0xa2d/0x1320 [6.1236] ? shrink_node+0xbd3/0x1320 [6.1239] ? balance_pgdat+0x67f/0xc60 [6.1239] balance_pgdat+0x67f/0xc60 [6.1241] ? finish_task_switch.isra.0+0xc4/0x2a0 [6.1246] kswapd+0x1dc/0x3e0 [6.1247] ? __pfx_autoremove_wake_function+0x10/0x10 [6.1249] ? __pfx_kswapd+0x10/0x10 [6.1250] kthread+0xff/0x240 [6.1251] ? __pfx_kthread+0x10/0x10 [6.1253] ret_from_fork+0x223/0x280 [6.1255] ? __pfx_kthread+0x10/0x10 [6.1257] ret_from_fork_asm+0x1a/0x30 [6.1260] This is because: 1) The fsync task is holding an inode's delayed node mutex (for a directory) while calling __btrfs_update_delayed_inode() and that needs to do a search on the subvolume's btree (therefore read lock some extent buffers); 2) The lookup task, at btrfs_lookup(), triggered reclaim with the GFP_KERNEL allocation done by btrfs_init_file_extent_tree() while holding a read lock on a subvolume leaf; 3) The reclaim triggered kswapd which is doing inode eviction for the directory inode the fsync task is using as an argument to btrfs_commit_inode_delayed_inode() - but in that call chain we are trying to read lock the same leaf that the lookup task is holding while calling btrfs_init_file_extent_tree() and doing the GFP_KERNEL allocation. Fix this by calling btrfs_init_file_extent_tree() after we don't need the path anymore and release it in btrfs_read_locked_inode(). Reported-by: Thomas Hellström Link: https://lore.kernel.org/linux-btrfs/6e55113a22347c3925458a5d840a18401a38b276.camel@linux.intel.com/ Fixes: 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set") Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/inode.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bfb8e98d8c02..5ea1c392bbc7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4046,11 +4046,6 @@ static int btrfs_read_locked_inode(struct btrfs_inode *inode, struct btrfs_path btrfs_set_inode_mapping_order(inode); cache_index: - ret = btrfs_init_file_extent_tree(inode); - if (ret) - goto out; - btrfs_inode_set_file_extent_range(inode, 0, - round_up(i_size_read(vfs_inode), fs_info->sectorsize)); /* * If we were modified in the current generation and evicted from memory * and then re-read we need to do a full sync since we don't have any @@ -4137,6 +4132,20 @@ cache_acl: btrfs_ino(inode), btrfs_root_id(root), ret); } + /* + * We don't need the path anymore, so release it to avoid holding a read + * lock on a leaf while calling btrfs_init_file_extent_tree(), which can + * allocate memory that triggers reclaim (GFP_KERNEL) and cause a locking + * dependency. + */ + btrfs_release_path(path); + + ret = btrfs_init_file_extent_tree(inode); + if (ret) + goto out; + btrfs_inode_set_file_extent_range(inode, 0, + round_up(i_size_read(vfs_inode), fs_info->sectorsize)); + if (!maybe_acls) cache_no_acl(vfs_inode); From 30bcf4e824aa37d305502f52e1527c7b1eabef3d Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Thu, 18 Dec 2025 15:15:28 +1030 Subject: [PATCH 3/6] btrfs: only enforce free space tree if v1 cache is required for bs < ps cases [BUG] Since the introduction of btrfs bs < ps support, v1 cache was never on the plan due to its hard coded PAGE_SIZE usage, and the future plan to properly deprecate it. However for bs < ps cases, even if 'nospace_cache,clear_cache' mount option is specified, it's never respected and free space tree is always enabled: mkfs.btrfs -f -O ^bgt,fst $dev mount $dev $mnt -o clear_cache,nospace_cache umount $mnt btrfs ins dump-super $dev ... compat_ro_flags 0x3 ( FREE_SPACE_TREE | FREE_SPACE_TREE_VALID ) ... This means a different behavior compared to bs >= ps cases. [CAUSE] The forcing usage of v2 space cache is done inside btrfs_set_free_space_cache_settings(), however it never checks if we're even using space cache but always enabling v2 cache. [FIX] Instead unconditionally enable v2 cache, only forcing v2 cache if the old v1 cache is required. Now v2 space cache can be properly disabled on bs < ps cases: mkfs.btrfs -f -O ^bgt,fst $dev mount $dev $mnt -o clear_cache,nospace_cache umount $mnt btrfs ins dump-super $dev ... compat_ro_flags 0x0 ... Fixes: 9f73f1aef98b ("btrfs: force v2 space cache usage for subpage mount") Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/super.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 1999533b52be..9f8546498818 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -736,14 +736,12 @@ bool btrfs_check_options(const struct btrfs_fs_info *info, */ void btrfs_set_free_space_cache_settings(struct btrfs_fs_info *fs_info) { - if (fs_info->sectorsize < PAGE_SIZE) { + if (fs_info->sectorsize < PAGE_SIZE && btrfs_test_opt(fs_info, SPACE_CACHE)) { + btrfs_info(fs_info, + "forcing free space tree for sector size %u with page size %lu", + fs_info->sectorsize, PAGE_SIZE); btrfs_clear_opt(fs_info->mount_opt, SPACE_CACHE); - if (!btrfs_test_opt(fs_info, FREE_SPACE_TREE)) { - btrfs_info(fs_info, - "forcing free space tree for sector size %u with page size %lu", - fs_info->sectorsize, PAGE_SIZE); - btrfs_set_opt(fs_info->mount_opt, FREE_SPACE_TREE); - } + btrfs_set_opt(fs_info->mount_opt, FREE_SPACE_TREE); } /* From cefd80925180a85c818e18c2876911b002a595fd Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Thu, 18 Dec 2025 15:15:29 +1030 Subject: [PATCH 4/6] btrfs: force free space tree for bs > ps cases [BUG] Currently we only enforcing the free space tree for bs < ps cases, but with the recently added bs > ps support, we lack the free space tree enforcing, causing explicit v1 cache mount option to fail on bs > ps cases: # mount -o space_cache=v1 /dev/test/scratch1 /mnt/btrfs/ mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/mapper/test-scratch1, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call. # dmesg -t | tail -n7 BTRFS: device fsid ac14a6fa-4ec9-449e-aec9-7d1777bfdc06 devid 1 transid 11 /dev/mapper/test-scratch1 (253:3) scanned by mount (2849) BTRFS info (device dm-3): first mount of filesystem ac14a6fa-4ec9-449e-aec9-7d1777bfdc06 BTRFS info (device dm-3): using crc32c checksum algorithm BTRFS warning (device dm-3): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS warning (device dm-3): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2 BTRFS warning (device dm-3): v1 space cache is not supported for page size 4096 with sectorsize 8192 BTRFS error (device dm-3): open_ctree failed: -22 [FIX] Just enable the same free space tree for bs > ps cases, aligning the behavior to bs < ps cases. Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/super.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 9f8546498818..af56fdbba65d 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -736,7 +736,7 @@ bool btrfs_check_options(const struct btrfs_fs_info *info, */ void btrfs_set_free_space_cache_settings(struct btrfs_fs_info *fs_info) { - if (fs_info->sectorsize < PAGE_SIZE && btrfs_test_opt(fs_info, SPACE_CACHE)) { + if (fs_info->sectorsize != PAGE_SIZE && btrfs_test_opt(fs_info, SPACE_CACHE)) { btrfs_info(fs_info, "forcing free space tree for sector size %u with page size %lu", fs_info->sectorsize, PAGE_SIZE); From 530e3d4af566ca44807d79359b90794dea24c4f3 Mon Sep 17 00:00:00 2001 From: Suchit Karunakaran Date: Fri, 19 Dec 2025 22:44:34 +0530 Subject: [PATCH 5/6] btrfs: fix NULL pointer dereference in do_abort_log_replay() Coverity reported a NULL pointer dereference issue (CID 1666756) in do_abort_log_replay(). When btrfs_alloc_path() fails in replay_one_buffer(), wc->subvol_path is NULL, but btrfs_abort_log_replay() calls do_abort_log_replay() which unconditionally dereferences wc->subvol_path when attempting to print debug information. Fix this by adding a NULL check before dereferencing wc->subvol_path in do_abort_log_replay(). Fixes: 2753e4917624 ("btrfs: dump detailed info and specific messages on log replay failures") Reviewed-by: Filipe Manana Signed-off-by: Suchit Karunakaran Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/tree-log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5831754bb01c..2d9d38b82daa 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -190,7 +190,7 @@ static void do_abort_log_replay(struct walk_control *wc, const char *function, btrfs_abort_transaction(wc->trans, error); - if (wc->subvol_path->nodes[0]) { + if (wc->subvol_path && wc->subvol_path->nodes[0]) { btrfs_crit(fs_info, "subvolume (root %llu) leaf currently being processed:", btrfs_root_id(wc->root)); From 2bb83bc42be6280d9bc363b8fbcd6fdab690d16d Mon Sep 17 00:00:00 2001 From: Mark Harmstone Date: Fri, 19 Dec 2025 18:15:28 +0000 Subject: [PATCH 6/6] btrfs: show correct warning if can't read data reloc tree If a filesystem is missing its data reloc tree, we get something like this in dmesg: BTRFS warning (device loop11): failed to read root (objectid=4): -2 BTRFS error (device loop11): open_ctree failed: -2 objectid is BTRFS_DEV_TREE_OBJECTID, but this should actually be the value of BTRFS_DATA_RELOC_TREE_OBJECTID. btrfs_read_roots() prints location.objectid on failure, but this isn't set when reading the data reloc tree. Set location.objectid to the correct value on failure, so that the error message makes sense. Reviewed-by: Qu Wenruo Signed-off-by: Mark Harmstone Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 89149fac804c..d8ca5b6e88e0 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2255,6 +2255,7 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info) BTRFS_DATA_RELOC_TREE_OBJECTID, true); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) { + location.objectid = BTRFS_DATA_RELOC_TREE_OBJECTID; ret = PTR_ERR(root); goto out; }