Linus Torvalds
68ad2095ca io_uring-6.19-20260109
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlhS1wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphGvD/9NR0RzWZdM0DwfbK4kyzfmQPCSs1kkqQF4
 LECsSc3B7OrJ/4yX27CiWNRlGdHWpmrOc8mtlAiUv+eArpoBmatjfn1UZACN0u/t
 CC0/ZXeYA6NQ8vnbnQZk+guHE7r9K66EFPFvMcEWmGGQ/CUBUKt1gkDkB1gD8qBp
 pdI/A+tZujCQA3XuyCE+qc5GJX+cFXqEx06GRDGQ+UnnAsJmSrtbUtZrEZsladMj
 16dclDfOX2X0bu9+P42rSkV2IrjwddNntDsLyF933uaayAJX9HbTTWxxw/mWr2Bt
 Be0Xh1+FniVAQFFM7qFQRqrWqecKrnh6RKg9lufWiuB4d5rq7eJT1xGb+IXX6xlr
 j/Lwbi8UkjpJmG1xnyWtk9oDQK4h+7p5MvCgCSqLrp1rY8nYT1CeCEzt1OJjeVWj
 cqm2hhkEUcioCz4gTHU8PBRxhfd4PRr/GBwZJ4jBFBeFTip1vy9kAn94Afrk/VLH
 HAreWZtsNsTDTF9cUSXyKDHYR9uGSi8NpZSEEV8dUaAKYpYBNSIutX5uqT9NBs8y
 3TJ0NrhlpVJIGwa5XtwKli62CXNsibQlNbnsd092+zvkiAAiYUiLNRaQnt3MGctk
 4eBp0sWTUkHxxKO3njUIDXZPB5g9jZvJxpjqK+V0CYzVdeimw2qs8fBQJvnYPtNw
 k3C0E2aJKQ==
 =ObKS
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:
 "A single fix for a regression introduced in 6.15, where a failure to
  wake up idle io-wq workers at ring exit will wait for the timeout to
  expire.

  This isn't normally noticeable, as the exit is async.

  But if a parent task creates a thread that sets up a ring and uses
  requests that cause io-wq threads to be created, and the parent task
  then waits for the thread to exit, it can take 5 seconds for that
  pthread_join() to succeed, as the child thread is waiting for its
  own children to exit.

  On top of that, just a basic cleanup as well"

* tag 'io_uring-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/io-wq: remove io_wq_for_each_worker() return value
  io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
2026-01-09 15:21:15 -10:00
Jens Axboe
e4fdbca2dc io_uring/io-wq: remove io_wq_for_each_worker() return value
The only use of this helper is to iterate over all of the workers, and
hence all callers pass in a func that always returns false to do that.
As none of the callers use the return value, get rid of it.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 15:39:20 -07:00
Jens Axboe
e0392a10c9 io_uring/io-wq: fix incorrect io_wq_for_each_worker() termination logic
A previous commit added this helper and had it terminate when false is
returned from the handler. That is backwards: the loop should abort
when true is returned.

Fix this up by having io_wq_for_each_worker() keep iterating as long
as false is returned, and only abort if true is returned.
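
As a minimal sketch, the corrected iteration contract looks like the
following; the list layout, callback type, and function body are
simplified stand-ins for illustration, not the actual io-wq code:

    #include <stdbool.h>

    struct io_worker {
            struct io_worker *next;
    };

    /* callback returns true to stop iterating, false to continue */
    typedef bool (*worker_fn)(struct io_worker *worker, void *data);

    static bool for_each_worker_sketch(struct io_worker *head,
                                       worker_fn fn, void *data)
    {
            struct io_worker *w;

            for (w = head; w; w = w->next) {
                    if (fn(w, data))
                            return true;    /* callback asked to stop */
            }
            return false;                   /* visited every worker */
    }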

Cc: stable@vger.kernel.org
Fixes: 751eedc4b4b7 ("io_uring/io-wq: move worker lists to struct io_wq_acct")
Reported-by: Lewis Campbell <info@lewiscampbell.tech>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-05 14:37:33 -07:00
Linus Torvalds
509b5b1152 io_uring-6.19-20260102
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlX7O8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkzbD/0SoEnTZ+jlbJojq6eAFtYAU3ial6sRdKC9
 15+WqlsMN5MHoV/xLMqHGxofpxCyXMmZZSPholWaUIiGJDcf4Q4olFFDTAgZPZYk
 XxpN9KeE4/n17eFXe+TE/D172MVM0gt9QbJFoV+TLyayrGiB5QyocH6Vg4FoWvjr
 YvyicIRE3SLiBQ8zdfPC4SR28VBE3LKZxjZJxr2HQjJQw4O4/+gKkYz7upACc4Xk
 qN3JioIayuM3hrqcBSm7P0t4tlTCYHZvcGr7WI26CV6hcHD7j7N9jOVPZb4ce8et
 GIYwASYx4FTPrzAebQXXNL39RjoSeaRa/ppcdFHbT9ZZkI9yY9g3umg3kEml8RkF
 DFFwmPxlz2RuRLs+KdZ4UjLRf14W5qYlcThN7bgpTH4H0XUeDzT7HI9BiXBC7gjl
 p0Z1Y3NPAzMxil48ZPpopJxmQGcBIC8fMnDT0KVpvuILrN3ME0TMg82lQ2X/eTwf
 S/oPLebqqWy4N8Ff5x+GYmWxZvFEOxmO0AoSSiN3nlZ1skNqRlpMISTsFJXy+luq
 V31d0cLBfrWL9MNTE+yjLNT/5pc1l+HgVLxdoxCioEKWXXdB27YEDlh0CVNtjZ9j
 /ZVMJcZhzRBUvWLUvzQrtY65m0I8h6XYJAr7TXbbsL70yFAsgQmBUZPklqe6eijy
 HFYYO4vnJg==
 =KgoB
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fixes from Jens Axboe:

 - Remove the dead length argument from io_uring_validate_mmap_request()

 - Use GFP_NOWAIT for overflow CQEs on legacy ring setups rather than
   GFP_ATOMIC, which makes it play nicer with memcg limits

 - Fix a potential circular locking issue with tctx node removal and
   exec based cancelations

* tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
  io_uring/tctx: add separate lock for list of tctx's in ctx
  io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings
2026-01-02 12:07:55 -08:00
Caleb Sander Mateos
70eafc7430 io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
io_uring_validate_mmap_request() doesn't use its size_t sz argument, so
remove it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:16:48 -07:00
Jens Axboe
5623eb1ed0 io_uring/tctx: add separate lock for list of tctx's in ctx
ctx->tctx_list holds the tasks using this ring, and it's currently
protected by the normal ctx->uring_lock. However, this can cause a
circular locking issue, as reported by syzbot, where cancelations off
exec end up needing to remove an entry from this list:

======================================================
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
------------------------------------------------------
syz.0.9999/12287 is trying to acquire lock:
ffff88805851c0a8 (&ctx->uring_lock){+.+.}-{4:4}, at: io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179

but task is already holding lock:
ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline]
ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sig->cred_guard_mutex){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
       proc_pid_attr_write+0x547/0x630 fs/proc/base.c:2837
       vfs_write+0x27e/0xb30 fs/read_write.c:684
       ksys_write+0x145/0x250 fs/read_write.c:738
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (sb_writers#3){.+.+}-{0:0}:
       percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline]
       percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline]
       __sb_start_write include/linux/fs/super.h:19 [inline]
       sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125
       mnt_want_write+0x41/0x90 fs/namespace.c:499
       open_last_lookups fs/namei.c:4529 [inline]
       path_openat+0xadd/0x3dd0 fs/namei.c:4784
       do_filp_open+0x1fa/0x410 fs/namei.c:4814
       io_openat2+0x3e0/0x5c0 io_uring/openclose.c:143
       __io_issue_sqe+0x181/0x4b0 io_uring/io_uring.c:1792
       io_issue_sqe+0x165/0x1060 io_uring/io_uring.c:1815
       io_queue_sqe io_uring/io_uring.c:2042 [inline]
       io_submit_sqe io_uring/io_uring.c:2320 [inline]
       io_submit_sqes+0xbf4/0x2140 io_uring/io_uring.c:2434
       __do_sys_io_uring_enter io_uring/io_uring.c:3280 [inline]
       __se_sys_io_uring_enter+0x2e0/0x2b60 io_uring/io_uring.c:3219
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&ctx->uring_lock){+.+.}-{4:4}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237
       lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
       io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179
       io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195
       io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646
       io_uring_task_cancel include/linux/io_uring.h:24 [inline]
       begin_new_exec+0x10ed/0x2440 fs/exec.c:1131
       load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010
       search_binary_handler fs/exec.c:1669 [inline]
       exec_binprm fs/exec.c:1701 [inline]
       bprm_execve+0x92e/0x1400 fs/exec.c:1753
       do_execveat_common+0x510/0x6a0 fs/exec.c:1859
       do_execve fs/exec.c:1933 [inline]
       __do_sys_execve fs/exec.c:2009 [inline]
       __se_sys_execve fs/exec.c:2004 [inline]
       __x64_sys_execve+0x94/0xb0 fs/exec.c:2004
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  &ctx->uring_lock --> sb_writers#3 --> &sig->cred_guard_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sig->cred_guard_mutex);
                               lock(sb_writers#3);
                               lock(&sig->cred_guard_mutex);
  lock(&ctx->uring_lock);

 *** DEADLOCK ***

1 lock held by syz.0.9999/12287:
 #0: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline]
 #0: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733

stack backtrace:
CPU: 0 UID: 0 PID: 12287 Comm: syz.0.9999 Tainted: G             L      syzkaller #0 PREEMPT(full)
Tainted: [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_circular_bug+0x2e2/0x300 kernel/locking/lockdep.c:2043
 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175
 check_prev_add kernel/locking/lockdep.c:3165 [inline]
 check_prevs_add kernel/locking/lockdep.c:3284 [inline]
 validate_chain kernel/locking/lockdep.c:3908 [inline]
 __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237
 lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868
 __mutex_lock_common kernel/locking/mutex.c:614 [inline]
 __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776
 io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179
 io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195
 io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646
 io_uring_task_cancel include/linux/io_uring.h:24 [inline]
 begin_new_exec+0x10ed/0x2440 fs/exec.c:1131
 load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010
 search_binary_handler fs/exec.c:1669 [inline]
 exec_binprm fs/exec.c:1701 [inline]
 bprm_execve+0x92e/0x1400 fs/exec.c:1753
 do_execveat_common+0x510/0x6a0 fs/exec.c:1859
 do_execve fs/exec.c:1933 [inline]
 __do_sys_execve fs/exec.c:2009 [inline]
 __se_sys_execve fs/exec.c:2004 [inline]
 __x64_sys_execve+0x94/0xb0 fs/exec.c:2004
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff3a8b8f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ff3a9a97038 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 00007ff3a8de5fa0 RCX: 00007ff3a8b8f749
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000200000000400
RBP: 00007ff3a8c13f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ff3a8de6038 R14: 00007ff3a8de5fa0 R15: 00007ff3a8f0fa28
 </TASK>

Add a separate lock just for the tctx_list, tctx_lock. This can nest
under ->uring_lock, where necessary, and be used separately for list
manipulation. For the cancelation off exec side, this removes the
need to grab ->uring_lock, hence fixing the circular locking
dependency.
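
As a hedged userspace sketch of the split-lock pattern (the type and
field names are illustrative stand-ins, not the real io_uring
structures):

    #include <pthread.h>

    struct tctx_node {
            struct tctx_node *next;
    };

    struct ring_ctx {
            pthread_mutex_t uring_lock; /* broad lock for ring state */
            pthread_mutex_t tctx_lock;  /* narrow lock: tctx_list only */
            struct tctx_node *tctx_list;
    };

    /*
     * List removal takes only the narrow tctx_lock, so the cancel-on-
     * exec path no longer acquires uring_lock and the lockdep cycle
     * (cred_guard_mutex -> sb_writers -> uring_lock) is broken. Where
     * both locks are needed, uring_lock is taken first and tctx_lock
     * nests under it.
     */
    static void del_tctx_node_sketch(struct ring_ctx *ctx,
                                     struct tctx_node *node)
    {
            struct tctx_node **p;

            pthread_mutex_lock(&ctx->tctx_lock);
            for (p = &ctx->tctx_list; *p; p = &(*p)->next) {
                    if (*p == node) {
                            *p = node->next;
                            break;
                    }
            }
            pthread_mutex_unlock(&ctx->tctx_lock);
    }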

Reported-by: syzbot+b0e3b77ffaa8a4067ce5@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-01 08:16:40 -07:00
Alexandre Negrel
fc5ff25009 io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings
Allocate the overflowing CQE with GFP_NOWAIT instead of GFP_ATOMIC. This
change causes allocations to fail earlier in out-of-memory situations,
rather than being deferred. Using GFP_ATOMIC allows a process to exceed
its memory limits.
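
For illustration, the change reduces to the flag used at the allocation
site; a hedged kernel-style fragment with the surrounding overflow
handling elided:

    struct io_overflow_cqe *ocqe;

    /*
     * GFP_ATOMIC may dip into emergency reserves and let a task blow
     * past its memcg limit; GFP_NOWAIT likewise never sleeps, but
     * fails fast under memory pressure instead.
     */
    ocqe = kmalloc(sizeof(*ocqe), GFP_NOWAIT);
    if (!ocqe)
            return false;   /* overflow entry could not be queued */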

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220794
Signed-off-by: Alexandre Negrel <alexandre@negrel.dev>
Link: https://lore.kernel.org/io-uring/20251229201933.515797-1-alexandre@negrel.dev/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-30 18:49:31 -07:00
Linus Torvalds
4079a38693 io_uring-6.19-20251226
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlOro8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprhhEADWKzgb1tqALkgjSNfzA9mEBpxt6Xp6++7x
 4lwUnnX/m4C1ILphAD4ii7YY2hm7xjRBaUeqH2OifznxXnAw1Rbhjr9GJdKqq43y
 1HWIzz48wP9wn0LI0tXkHA2AZR3Rs0IbtHZ+87gHA0HVoSNxqBIM9DKuFbbKDFNq
 P2ef4WoPZekThIRKXM05bkeVZuiMekMWpC5YUKPpm+XzPOE0rOtsPDypbmGVN/hT
 vpTjtAFEEDh3zTmmjd/Egk9fMJJMBMhr7E5us8VrUHFtNyZ6DmGCHOKAcJj3iJyT
 /BEOHQ1Yu/COXHlqj1kDefmoD/glsCnZFeGAXQu45CaU/ZU8jcm6Yy04hRbd9FLu
 QT6ycA84N0HOqoe1+UYnTy8/vqseCo5E6e9thPak6CWLsZA6Xc/8LcXZbNhe6fEW
 m9ldoQRnLnPkmiX3+gtEyjb+fa8DjYxzq/7ab37khIjdaJxesqFWkZ2Z+M2GKzVq
 9Qvth2ribWq0xtm9vZ0jb07JSss/Q+lQMKGlZ3vlvG2CODB5ll7Gq18LAxS7xoO3
 ciA+tcrMgTZcw9R7Ha5gbqkjpKZbwqA1malsNUo4YNRI/S4d6UU74nEVuyyAcsTv
 czFeU+d2nffwp1HV94ipNrCR2adKIQ1ESRxLTDNKt9hcFDlL9tHB//Kr28BO123V
 SrDcDR3Slw==
 =vmEn
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix for a bug that can cause a leak of the filename with
  IORING_OP_OPENAT, if direct descriptors are asked for and O_CLOEXEC
  has been set in the request flags"

* tag 'io_uring-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: fix filename leak in __io_openat_prep()
2025-12-26 11:34:38 -08:00
Prithvi Tambewagh
b14fad5553 io_uring: fix filename leak in __io_openat_prep()
__io_openat_prep() allocates a struct filename using getname(). However,
when the file is to be installed into the fixed file table and the
O_CLOEXEC flag is set, the function returns early. At that point, the
request doesn't have the REQ_F_NEED_CLEANUP flag set, so the memory for
the newly allocated struct filename is never freed, causing a memory
leak.

Fix this by setting REQ_F_NEED_CLEANUP for the request immediately after
the successful getname() call, so that when the request is torn down,
the filename is cleaned up along with the other resources needing
cleanup.
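
A hedged sketch of the resulting ordering; the field layout is a
simplified stand-in for the actual prep code:

    static int openat_prep_sketch(struct io_kiocb *req,
                                  const char __user *fname)
    {
            req->open.filename = getname(fname);
            if (IS_ERR(req->open.filename))
                    return PTR_ERR(req->open.filename);

            /* set before any early return, so teardown frees it */
            req->flags |= REQ_F_NEED_CLEANUP;

            /* ... fixed-file + O_CLOEXEC checks may return early ... */
            return 0;
    }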

Reported-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00e61c43eb5e4740438f
Tested-by: syzbot+00e61c43eb5e4740438f@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Fixes: b9445598d8c6 ("io_uring: openat directly into fixed fd table")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-25 07:58:33 -07:00
Linus Torvalds
d245b2e53e io_uring-6.19-20251218
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmlEvc8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuG5D/4wvdxdfrYL0cM8A18dS+Ngl5kCRsWWulV1
 mWyOcPilyUQ266g879nD3YFJW/C1CbCYhvrX446agKnFMbGyQNmt5OyiXQUEDS/W
 LlgCWSn3Rum8a5i3m92zvgp24cZI9DSWnuq0GnwcEvMauNsRkgowZ6hy89Mtthvy
 DUCimJawGCxd++yuN2LChe356WpiT10ed9calfCEDdq4iOuYKSMNJRtUFl0jaFUG
 0GlWPnxaRjuukn1So3/6XiVRtBoC5ZPCApMitG+Pu5AfIeBz4LfHyZS5VpOm1GSo
 hZm7PEOHiW4Jt6xiyZH38l9gBZuqUZ3w1HDJItGVvioYWrPHV1ZwUezkNEfdlSkj
 oiH4oujIFt6P+yRCbhpETvYizaS68y/EseKbyjPg5eSJE3UCquMD5jKhCdESm16l
 gZIxbiT7EgT19ivFeFcT10G7sQZ/JphdkS/q9HXuR1tVWSHIm+8x90zmVo5Dli8r
 Tio/uJrAEOD0fepPNFv7CwUwIie7nipIdxG/mt+bDBH8DMAii85pW+J/1jwl4OHh
 b3Z8Kq9ZVi9x2bKJ6ai87DhH2+nbVGGlSvIU+fQ8hKeTIloYsXO4wycZxNheheRo
 bXiS0rmxpQvErhHRY2FNTyqXd5qLYJZzdUBRf1thNTNzoDzlFLLQzMAwK26Zdvce
 6p4NiDtgaQ==
 =heZQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fix from Jens Axboe:
 "Just a single fix this week, for an issue with the calculation of the
  number of segments in the ublk kbuf import path"

* tag 'io_uring-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: fix nr_segs calculation in io_import_kbuf
2025-12-20 09:38:56 -08:00
huang-jl
114ea9bbaf io_uring: fix nr_segs calculation in io_import_kbuf
io_import_kbuf() calculates nr_segs incorrectly when iov_offset is
non-zero after iov_iter_advance(). It doesn't account for the partial
consumption of the first bvec.

The problem occurs when the following conditions are met:
1. The UBLK_F_AUTO_BUF_REG feature of ublk is used.
2. The kernel registers the buffer into the io_uring on the ublk
   server's behalf.
3. The ublk server then sends an IO request using the registered buffer
   in the io_uring, reading/writing a fuse-based filesystem with
   O_DIRECT.

From a userspace perspective, the ublk server thread is blocked in the
kernel, and a "soft lockup" warning appears in the kernel dmesg.

When ublk registers a buffer with mixed-size bvecs like [4K]*6 + [12K]
and a request partially consumes a bvec, the next request's nr_segs
calculation uses bvec->bv_len instead of (bv_len - iov_offset).

This causes fuse_get_user_pages() to loop forever because nr_segs
indicates fewer pages than actually needed.

Specifically, the infinite loop happens at:
fuse_get_user_pages()
  -> iov_iter_extract_pages()
    -> iov_iter_extract_bvec_pages()
Since nr_segs is miscalculated, iov_iter_extract_bvec_pages() returns
when it finds that i->nr_segs is zero, and iov_iter_extract_pages() then
returns zero. However, fuse_get_user_pages() still has not received
enough data/pages, causing the infinite loop.

Example:
  - Bvecs: [4K, 4K, 4K, 4K, 4K, 4K, 12K, ...]
  - Request 1: 32K at offset 0, uses 6*4K + 8K of the 12K bvec
  - Request 2: 32K at offset 32K
    - iov_offset = 8K (8K already consumed from 12K bvec)
    - Bug: calculates using 12K, not (12K - 8K) = 4K
    - Result: nr_segs too small, infinite loop in fuse_get_user_pages.

Fix by accounting for iov_offset when calculating the first segment's
available length.
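
A hedged userspace sketch of the corrected count, using a simplified
bvec type; only the first-segment adjustment matters:

    struct bvec_sketch { unsigned int bv_len; };

    static unsigned int count_segs(const struct bvec_sketch *bv,
                                   unsigned int nbv,
                                   unsigned int iov_offset,
                                   unsigned long len)
    {
            unsigned int i, nr_segs = 0;

            for (i = 0; i < nbv && len; i++) {
                    /* first bvec is shortened by what was consumed */
                    unsigned long avail = bv[i].bv_len -
                                          (i == 0 ? iov_offset : 0);

                    len -= avail < len ? avail : len;
                    nr_segs++;
            }
            return nr_segs;
    }

With the 12K bvec and iov_offset = 8K from the example above, the first
segment now contributes 4K rather than 12K, so nr_segs matches the
number of pages the iterator will actually produce.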

Fixes: b419bed4f0a6 ("io_uring/rsrc: ensure segments counts are correct on kbuf buffers")
Signed-off-by: huang-jl <huang-jl@deepseek.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17 07:35:42 -07:00
Linus Torvalds
14df4eb7e7 io_uring-6.19-20251211
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk7RvYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr7UD/4qqOmmEq92nPJ9v9zeebrTCtO5Hah1HY11
 D1Cc7KRqt2F6Jy1oN+LzQc1jXguLnQCxMuIRp1f3waTC/j4v4qfjjba+no/CZnLp
 jwC0UWEeXx8wVx3RkAg3qJJzixGEvJ4pWPYQCYp29LkCkzLL47y7VTBoVPxHiSog
 MpRGhTWjdFOLX4C/czm6kwpZM/2LkaG6dDpjZFIZm3inHTbP6QGBWdS+Q3VRzAJn
 YodteJifva4cefcEhP8zZeWLmzHMJLz0cQgnrhYG9michaCNoqM1y3TThvf6h3Ls
 8ivwD+0PxzoQ32yImUfSOsqlmn+EI0kevaUVvN74vPT4d5RUxSCOR27z9fPxwI60
 JE2xpPN5i6CICc8c1tJJDYDHvElM1C7cueKfr/PTusOvmiI8W+n3VTMl8rVuo25d
 2k3ATGq3/E9yfx+/yu3jsjPVYPV52byPsDJ0vs3ixipxS4CdLlvwa3rIvyyrOatl
 Z6gFhbtTmB5LExSGfI/v+D6Lv7Pe1j6AHfo1RY64D0HR7FmnPMn/95WgTAXr28to
 vUnOWQ+XjFE4DzLaOZvHV170QajHh2oCBICOlkEdRWc1ubQJ96zTRpco74xGNcw6
 Z4+3BxwotR205vbrQwiEIKaGfHDWK72hLdObX2wuPuiPZltchSHb4zZAdmavlUBV
 e2QzsEzaJQ==
 =ZPJt
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring fix from Jens Axboe:
 "Single fix for io_uring headed to stable, fixing an issue introduced
  with the min_wait support earlier this year, where SQPOLL didn't get
  correctly woken if an event arrived once the event waiting has
  finished the min_wait portion.

  As we already have regression tests for this added and people
  reporting new failures there, let's get this one flushed out
  so it can bubble back down to stable as well"

* tag 'io_uring-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring: fix min_wait wakeups for SQPOLL
2025-12-12 22:01:32 +12:00
Jens Axboe
e15cb2200b io_uring: fix min_wait wakeups for SQPOLL
Using min_wait, two timeouts are given:

1) The min_wait timeout, within which up to 'wait_nr' events are
   waited for.
2) The overall long timeout, which is entered if no events are generated
   in the min_wait window.

If the min_wait has expired, any event being posted must wake the task.
For SQPOLL, that isn't the case, as it won't trigger the io_has_work()
condition: it will have already processed the task_work that happened
when the event was posted. As a result, an event that triggers after
min_wait doesn't always wake the waiting application; instead, it waits
until the overall timeout has expired. This can be shown with a test
case that has a 1 second min_wait and a 5 second overall wait, even if
an event triggers after 1.5 seconds:

axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring
info: MIN_TIMEOUT supported: true, features: 0x3ffff
info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4
info: 1 cqes in 5000.2ms

where the expected result should be:

axboe@m2max-kvm /d/iouring-mre (master)> zig-out/bin/iouring
info: MIN_TIMEOUT supported: true, features: 0x3ffff
info: Testing: min_wait=1000ms, timeout=5s, wait_nr=4
info: 1 cqes in 1500.3ms

When the min_wait timeout triggers, reset the number of completions
needed to wake the task. This should ensure that any future events will
wake the task, regardless of how many events it originally wanted to
wait for.
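
A hedged sketch of the behavioral change; the waiter type and field are
illustrative, not the kernel's actual wait-queue state:

    struct waiter_sketch {
            unsigned int events_needed; /* completions needed to wake */
    };

    /* called when the min_wait timer fires before the batch target is met */
    static void min_wait_expired(struct waiter_sketch *w)
    {
            /*
             * Drop the wake threshold to a single event so any later
             * completion wakes the task, rather than sleeping out the
             * full overall timeout waiting for the original wait_nr.
             */
            w->events_needed = 1;
    }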

Reported-by: Tip ten Brink <tip@tenbrinkmeijs.com>
Cc: stable@vger.kernel.org
Fixes: 1100c4a2656d ("io_uring: add support for batch wait timeout")
Link: https://github.com/axboe/liburing/issues/1477
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09 16:54:12 -07:00
Linus Torvalds
cfd4039213 io_uring-6.19-20251208
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk3KXIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo91EACGlORRzg4FJXox8DcItdOQsZGFIqCXts9p
 SVtbxV6sPdsHwRB/xGTzHWP2iWUjA4+i5l3n4mt8vzGAmQU50gtdaIsJEMq7SOfB
 nJW0wNi905qcLihOfTpQ/2xpE5Am/iWPavFkAqOF7qo6GlS7aN47TIaHCPmAm3Nx
 Kla2XMDnneFhl8xCdnJHaLrzyD94xlArywG5UPjkgFGCmLEu2ZE6T9ivq86DHQZJ
 Ujy3ueMO/7SErfoDbY4I/gPs4ONxBaaieKycuyljQQB3n6sj15EBNB0TMDPA/Rwx
 Aq4WD/MC48titpxV2BT9RKCjYvJ4wsBww4uFLkCTKDlFCRH0pqclzgtd2iB46kge
 tj9KfTS9tkLBp9steMcw45FStu0iiHBwqqTcqUr1q/wzIPbPAQ/L/Mu6AlUOheW/
 MmedhtPP22IShpkKYWSv923P2Qp2HhKa6LtoKJzxOK9rb6yoYvHl0zEQlKbWtPgq
 lpGzjbBoCtjqwlQKTpcH8diwaZ/fafrIP4h80Hg1pRiQEwzBgDpA3/N0EcfigkmU
 2IgyH3k6F9v/IgyVPkpzNh4w6hrr9RnxVA8yaf2ItkfWKwajWJAtPLUBuING8qqa
 3xg1MZ27NS6gUKEdCEy/mAaz8Vt2SGRUc3szHYrZHy7OFEW94WoiKAYK9qsZXGzX
 ms2VldIiQA==
 =Mbok
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:
 "Followup set of fixes for io_uring for this merge window. These are
  either later fixes, or cleanups that don't make sense to defer. This
  pull request contains:

   - Fix for a recent regression in io-wq worker creation

   - Tracing cleanup

   - Use READ_ONCE/WRITE_ONCE consistently for ring mapped kbufs. Mostly
     for documentation purposes, indicating that they are shared with
     userspace

   - Fix for POLL_ADD losing a completion, if the request is updated and
     is now triggerable - eg, if POLLIN is set with the update, and the
     polled file is readable

   - In conjunction with the above fix, also unify how poll wait queue
     entries are deleted with the head update. We had 3 different spots
     doing both the list deletion and head write, with one of them
     nicely documented. Abstract that into a helper and use it
     consistently

   - Small series from Joanne fixing an issue with buffer cloning, and
     cleaning up the arg validation"

* tag 'io_uring-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/poll: unify poll waitqueue entry and list removal
  io_uring/kbuf: use WRITE_ONCE() for userspace-shared buffer ring fields
  io_uring/kbuf: use READ_ONCE() for userspace-mapped memory
  io_uring/rsrc: fix lost entries after cloned range
  io_uring/rsrc: rename misleading src_node variable in io_clone_buffers()
  io_uring/rsrc: clean up buffer cloning arg validation
  io_uring/trace: rename io_uring_queue_async_work event "rw" field
  io_uring/io-wq: always retry worker create on ERESTART*
  io_uring/poll: correctly handle io_poll_add() return value on update
2025-12-09 09:07:28 +09:00
Linus Torvalds
4482ebb297 block-6.19-20251208
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmk3KZsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkNYD/91yqAJeehx2Heq3dWj9L8hDuETQelj/g9j
 gtZCiriAPy+bb1/BmWjK+BmvjtBt+g3a4Cwi6tVj4F1zoE46IPeLhO+2iJTEBiBq
 AhRtEf/MFXFK3qUnTpEnS8w3CtsXejOTB81VQ+6BysSu+B708m/1AQHv2HocZ37R
 jivrzfCsEdBr+ISwYw/EG5KcDBVTFo/JdXIhs7k4Z8bBfa3P5ye4EhKjORtgbFNU
 5nXb78SZoWNCZF143YV++9MpZc3M2jzkzrk1CTLsUHhOxWg4T/6wTXfPGZc/W4m8
 UBhs03u/gMJnKHhlZd4kpZWDito1TQZTdY2f5sBsysRQqeT7bwDK/1xiQ1nllZiP
 oYbeD6t65yMAlELwNFXo7y/DNcS2VLBMvChIX6p1gweEzyf23YneoHYyN5agEQlN
 9C4EdcYzZRt0DwtHlIRtKvDk2LZzkJAcLau3D6ahU/DPLOawyWZKmvGiU+sSyJjF
 bEIO5c/+MLqkAgLAGaFgA4twFF1aYH9ssmJerDxprarkf1jtlOBLvUQ391Gtb5Hd
 B1yugmIgEwLbCFzhk9FlCtv2nQcWRCElnaeqv+Lv+xCBVPGCLm2qIHoTqmvHZPCd
 GbN/h0XLdgUboYPCFWVAX72/4K/cv+fQQcb+a7tiq6vMKcgJ/2I1szFGpFqz7azB
 hyiK0v3x2g==
 =r1xa
 -----END PGP SIGNATURE-----

Merge tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull block updates from Jens Axboe:
 "Followup set of fixes and updates for block for the 6.19 merge window.

  NVMe had some last-minute debates which led to dropping some patches
  from that tree, which is why the initial PR didn't have NVMe included.
  It's here now. This pull request contains:

   - NVMe pull request via Keith:
       - Subsystem usage cleanups (Max)
       - Endpoint device fixes (Shin'ichiro)
       - Debug statements (Gerd)
       - FC fabrics cleanups and fixes (Daniel)
       - Consistent alloc API usages (Israel)
       - Code comment updates (Chu)
       - Authentication retry fix (Justin)

   - Fix a memory leak in the discard ioctl code, if the task is being
     interrupted by a signal at just the wrong time

   - Zoned write plugging fixes

   - Add ioctls for persistent reservations

   - Enable per-cpu bio caching by default

   - Various little fixes and tweaks"

* tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits)
  nvme-fabrics: add ENOKEY to no retry criteria for authentication failures
  nvme-auth: use kvfree() for memory allocated with kvcalloc()
  nvmet-tcp: use kvcalloc for commands array
  nvmet-rdma: use kvcalloc for commands and responses arrays
  nvme: fix typo error in nvme target
  nvmet-fc: use pr_* print macros instead of dev_*
  nvmet-fcloop: remove unused lsdir member.
  nvmet-fcloop: check all request and response have been processed
  nvme-fc: check all request and response have been processed
  block: fix memory leak in __blkdev_issue_zero_pages
  block: fix comment for op_is_zone_mgmt() to include RESET_ALL
  block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs
  blk-mq: Abort suspend when wakeup events are pending
  blk-mq: add blk_rq_nr_bvec() helper
  block: add IOC_PR_READ_RESERVATION ioctl
  block: add IOC_PR_READ_KEYS ioctl
  nvme: reject invalid pr_read_keys() num_keys values
  scsi: sd: reject invalid pr_read_keys() num_keys values
  block: enable per-cpu bio cache by default
  block: use bio_alloc_bioset for passthru IO by default
  ...
2025-12-09 08:53:24 +09:00
Linus Torvalds
7203ca412f Significant patch series in this merge are as follows:
- The 10 patch series "__vmalloc()/kvmalloc() and no-block support" from
   Uladzislau Rezki reworks the vmalloc() code to support non-blocking
   allocations (GFP_ATOMIC, GFP_NOWAIT).
 
 - The 2 patch series "ksm: fix exec/fork inheritance" from xu xin fixes
   a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited
   across fork/exec.
 
 - The 4 patch series "mm/zswap: misc cleanup of code and documentations"
   from SeongJae Park does some light maintenance work on the zswap code.
 
 - The 5 patch series "mm/page_owner: add debugfs files 'show_handles'
   and 'show_stacks_handles'" from Mauricio Faria de Oliveira enhances the
   /sys/kernel/debug/page_owner debug feature.  It adds unique identifiers
   to differentiate the various stack traces so that userspace monitoring
   tools can better match stack traces over time.
 
 - The 2 patch series "mm/page_alloc: pcp->batch cleanups" from Joshua
   Hahn makes some minor alterations to the page allocator's per-cpu-pages
   feature.
 
 - The 2 patch series "Improve UFFDIO_MOVE scalability by removing
   anon_vma lock" from Lokesh Gidra addresses a scalability issue in
   userfaultfd's UFFDIO_MOVE operation.
 
 - The 2 patch series "kasan: cleanups for kasan_enabled() checks" from
   Sabyrzhan Tasbolatov performs some cleanup in the KASAN code.
 
 - The 2 patch series "drivers/base/node: fold node register and
   unregister functions" from Donet Tom cleans up the NUMA node handling
   code a little.
 
 - The 4 patch series "mm: some optimizations for prot numa" from Kefeng
   Wang provides some cleanups and small optimizations to the NUMA
   allocation hinting code.
 
 - The 5 patch series "mm/page_alloc: Batch callers of
   free_pcppages_bulk" from Joshua Hahn addresses long lock hold times at
   boot on large machines.  These were causing (harmless) softlockup
   warnings.
 
 - The 2 patch series "optimize the logic for handling dirty file folios
   during reclaim" from Baolin Wang removes some now-unnecessary work from
   page reclaim.
 
 - The 10 patch series "mm/damon: allow DAMOS auto-tuned for per-memcg
   per-node memory usage" from SeongJae Park enhances the DAMOS auto-tuning
   feature.
 
 - The 2 patch series "mm/damon: fixes for address alignment issues in
   DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan fixes DAMON_LRU_SORT
   and DAMON_RECLAIM with certain userspace configuration.
 
 - The 15 patch series "expand mmap_prepare functionality, port more
   users" from Lorenzo Stoakes enhances the new(ish)
   file_operations.mmap_prepare() method and ports additional callsites
   from the old ->mmap() over to ->mmap_prepare().
 
 - The 8 patch series "Fix stale IOTLB entries for kernel address space"
   from Lu Baolu fixes a bug (and possible security issue on non-x86) in
   the IOMMU code.  In some situations the IOMMU could be left hanging onto
   a stale kernel pagetable entry.
 
 - The 4 patch series "mm/huge_memory: cleanup __split_unmapped_folio()"
   from Wei Yang cleans up and optimizes the folio splitting code.
 
 - The 5 patch series "mm, swap: misc cleanup and bugfix" from Kairui
   Song implements some cleanups and a minor fix in the swap discard code.
 
 - The 8 patch series "mm/damon: misc documentation fixups" from SeongJae
   Park does as advertised.
 
 - The 9 patch series "mm/damon: support pin-point targets removal" from
   SeongJae Park permits userspace to remove a specific monitoring target
   in the middle of the current targets list.
 
 - The 2 patch series "mm: MISC follow-up patches for linux/pgalloc.h"
   from Harry Yoo implements a couple of cleanups related to mm header file
   inclusion.
 
 - The 2 patch series "mm/swapfile.c: select swap devices of default
   priority round robin" from Baoquan He improves the selection of swap
   devices for NUMA machines.
 
 - The 3 patch series "mm: Convert memory block states (MEM_*) macros to
   enums" from Israel Batista changes the memory block labels from macros
   to enums so they will appear in kernel debug info.
 
 - The 3 patch series "ksm: perform a range-walk to jump over holes in
   break_ksm" from Pedro Demarchi Gomes addresses an inefficiency when KSM
   unmerges an address range.
 
 - The 22 patch series "mm/damon/tests: fix memory bugs in kunit tests"
   from SeongJae Park fixes leaks and unhandled malloc() failures in DAMON
   userspace unit tests.
 
 - The 2 patch series "some cleanups for pageout()" from Baolin Wang
   cleans up a couple of minor things in the page scanner's
   writeback-for-eviction code.
 
 - The 2 patch series "mm/hugetlb: refactor sysfs/sysctl interfaces" from
   Hui Zhu moves hugetlb's sysfs/sysctl handling code into a new file.
 
 - The 9 patch series "introduce VM_MAYBE_GUARD and make it sticky" from
   Lorenzo Stoakes makes the VMA guard regions available in /proc/pid/smaps
   and improves the mergeability of guarded VMAs.
 
 - The 2 patch series "mm: perform guard region install/remove under VMA
   lock" from Lorenzo Stoakes reduces mmap lock contention for callers
   performing VMA guard region operations.
 
 - The 2 patch series "vma_start_write_killable" from Matthew Wilcox
   starts work in permitting applications to be killed when they are
   waiting on a read_lock on the VMA lock.
 
 - The 11 patch series "mm/damon/tests: add more tests for online
   parameters commit" from SeongJae Park adds additional userspace testing
   of DAMON's "commit" feature.
 
 - The 9 patch series "mm/damon: misc cleanups" from SeongJae Park does
   that.
 
 - The 2 patch series "make VM_SOFTDIRTY a sticky VMA flag" from Lorenzo
   Stoakes addresses the possible loss of a VMA's VM_SOFTDIRTY flag when
   that VMA is merged with another.
 
 - The 16 patch series "mm: support device-private THP" from Balbir Singh
   introduces support for Transparent Huge Page (THP) migration in zone
   device-private memory.
 
 - The 3 patch series "Optimize folio split in memory failure" from Zi
   Yan optimizes folio split operations in the memory failure code.
 
 - The 2 patch series "mm/huge_memory: Define split_type and consolidate
   split support checks" from Wei Yang provides some more cleanups in the
   folio splitting code.
 
 - The 16 patch series "mm: remove is_swap_[pte, pmd]() + non-swap
   entries, introduce leaf entries" from Lorenzo Stoakes cleans up our
   handling of pagetable leaf entries by introducing the concept of
   'software leaf entries', of type softleaf_t.
 
 - The 4 patch series "reparent the THP split queue" from Muchun Song
   reparents the THP split queue to its parent memcg.  This is in
   preparation for addressing the long-standing "dying memcg" problem,
   wherein dead memcg's linger for too long, consuming memory resources.
 
 - The 3 patch series "unify PMD scan results and remove redundant
   cleanup" from Wei Yang does a little cleanup in the hugepage collapse
   code.
 
 - The 6 patch series "zram: introduce writeback bio batching" from
   Sergey Senozhatsky improves zram writeback efficiency by introducing
   batched bio writeback support.
 
 - The 4 patch series "memcg: cleanup the memcg stats interfaces" from
   Shakeel Butt cleans up our handling of the interrupt safety of some
   memcg stats.
 
 - The 4 patch series "make vmalloc gfp flags usage more apparent" from
   Vishal Moola cleans up vmalloc's handling of incoming GFP flags.
 
 - The 6 patch series "mm: Add soft-dirty and uffd-wp support for RISC-V"
   from Chunyan Zhang teaches soft dirty and userfaultfd write protect
   tracking to use RISC-V's Svrsw60t59b extension.
 
 - The 5 patch series "mm: swap: small fixes and comment cleanups" from
   Youngjun Park fixes a small bug and cleans up some of the swap code.
 
 - The 4 patch series "initial work on making VMA flags a bitmap" from
   Lorenzo Stoakes starts work on converting the vma struct's flags to a
   bitmap, so we stop running out of them, especially on 32-bit.
 
 - The 2 patch series "mm/swapfile: fix and cleanup swap list iterations"
   from Youngjun Park addresses a possible bug in the swap discard code and
   cleans things up a little.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaTEb0wAKCRDdBJ7gKXxA
 jjfIAP94W4EkCCwNOupnChoG+YWw/JW21anXt5NN+i5svn1yugEAwzvv6A+cAFng
 o+ug/fyrfPZG7PLp2R8WFyGIP0YoBA4=
 =IUzS
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

  "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
     Rework the vmalloc() code to support non-blocking allocations
     (GFP_ATOMIC, GFP_NOWAIT)

  "ksm: fix exec/fork inheritance" (xu xin)
     Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
     inherited across fork/exec

  "mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
     Some light maintenance work on the zswap code

  "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
     Enhance the /sys/kernel/debug/page_owner debug feature by adding
     unique identifiers to differentiate the various stack traces so
     that userspace monitoring tools can better match stack traces over
     time

  "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
     Minor alterations to the page allocator's per-cpu-pages feature

  "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
     Address a scalability issue in userfaultfd's UFFDIO_MOVE operation

  "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)

  "drivers/base/node: fold node register and unregister functions" (Donet Tom)
     Clean up the NUMA node handling code a little

  "mm: some optimizations for prot numa" (Kefeng Wang)
     Cleanups and small optimizations to the NUMA allocation hinting
     code

  "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
     Address long lock hold times at boot on large machines. These were
     causing (harmless) softlockup warnings

  "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
     Remove some now-unnecessary work from page reclaim

  "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
     Enhance the DAMOS auto-tuning feature

  "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
     Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
     configuration

  "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
     Enhance the new(ish) file_operations.mmap_prepare() method and port
     additional callsites from the old ->mmap() over to ->mmap_prepare()

  "Fix stale IOTLB entries for kernel address space" (Lu Baolu)
     Fix a bug (and possible security issue on non-x86) in the IOMMU
     code. In some situations the IOMMU could be left hanging onto a
     stale kernel pagetable entry

  "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
     Clean up and optimize the folio splitting code

  "mm, swap: misc cleanup and bugfix" (Kairui Song)
     Some cleanups and a minor fix in the swap discard code

  "mm/damon: misc documentation fixups" (SeongJae Park)

  "mm/damon: support pin-point targets removal" (SeongJae Park)
     Permit userspace to remove a specific monitoring target in the
     middle of the current targets list

  "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
     A couple of cleanups related to mm header file inclusion

  "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
     Improve the selection of swap devices for NUMA machines

  "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
     Change the memory block labels from macros to enums so they will
     appear in kernel debug info

  "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
     Address an inefficiency when KSM unmerges an address range

  "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
     Fix leaks and unhandled malloc() failures in DAMON userspace unit
     tests

  "some cleanups for pageout()" (Baolin Wang)
     Clean up a couple of minor things in the page scanner's
     writeback-for-eviction code

  "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
     Move hugetlb's sysfs/sysctl handling code into a new file

  "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
     Make the VMA guard regions available in /proc/pid/smaps and
     improves the mergeability of guarded VMAs

  "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
     Reduce mmap lock contention for callers performing VMA guard region
     operations

  "vma_start_write_killable" (Matthew Wilcox)
     Start work on permitting applications to be killed when they are
     waiting on a read_lock on the VMA lock

  "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
     Add additional userspace testing of DAMON's "commit" feature

  "mm/damon: misc cleanups" (SeongJae Park)

  "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
     Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
     VMA is merged with another

  "mm: support device-private THP" (Balbir Singh)
     Introduce support for Transparent Huge Page (THP) migration in zone
     device-private memory

  "Optimize folio split in memory failure" (Zi Yan)

  "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
     Some more cleanups in the folio splitting code

  "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
     Clean up our handling of pagetable leaf entries by introducing the
     concept of 'software leaf entries', of type softleaf_t

  "reparent the THP split queue" (Muchun Song)
     Reparent the THP split queue to its parent memcg. This is in
     preparation for addressing the long-standing "dying memcg" problem,
     wherein dead memcg's linger for too long, consuming memory
     resources

  "unify PMD scan results and remove redundant cleanup" (Wei Yang)
     A little cleanup in the hugepage collapse code

  "zram: introduce writeback bio batching" (Sergey Senozhatsky)
     Improve zram writeback efficiency by introducing batched bio
     writeback support

  "memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
     Clean up our handling of the interrupt safety of some memcg stats

  "make vmalloc gfp flags usage more apparent" (Vishal Moola)
     Clean up vmalloc's handling of incoming GFP flags

  "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
     Teach soft dirty and userfaultfd write protect tracking to use
     RISC-V's Svrsw60t59b extension

  "mm: swap: small fixes and comment cleanups" (Youngjun Park)
     Fix a small bug and clean up some of the swap code

  "initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
     Start work on converting the vma struct's flags to a bitmap, so we
     stop running out of them, especially on 32-bit

  "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
     Address a possible bug in the swap discard code and clean things
     up a little

[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu:
  register device memory for poison handling") because it looks
  broken to me, I've asked for clarification   - Linus ]

* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
  mm: fix vma_start_write_killable() signal handling
  mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
  mm/swapfile: fix list iteration when next node is removed during discard
  fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
  mm/kfence: add reboot notifier to disable KFENCE on shutdown
  memcg: remove inc/dec_lruvec_kmem_state helpers
  selftests/mm/uffd: initialize char variable to Null
  mm: fix DEBUG_RODATA_TEST indentation in Kconfig
  mm: introduce VMA flags bitmap type
  tools/testing/vma: eliminate dependency on vma->__vm_flags
  mm: simplify and rename mm flags function for clarity
  mm: declare VMA flags by bit
  zram: fix a spelling mistake
  mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
  mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
  pagemap: update BUDDY flag documentation
  mm: swap: remove scan_swap_map_slots() references from comments
  mm: swap: change swap_alloc_slow() to void
  mm, swap: remove redundant comment for read_swap_cache_async
  mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
  ...
2025-12-05 13:52:43 -08:00
Jens Axboe
55d57b3bcc io_uring/poll: unify poll waitqueue entry and list removal
For some cases, the order in which the waitq entry list and head
writing happens is important; for others it doesn't really matter.
But it's somewhat confusing to have them spread out over the file.

Abstract the nicely documented code in io_pollfree_wake() into a
helper, and use that helper consistently rather than having other call
sites manually do the same thing. While at it, correct a comment's
function name as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-05 10:23:28 -07:00
Joanne Koong
a4c694bfc2 io_uring/kbuf: use WRITE_ONCE() for userspace-shared buffer ring fields
buf->addr and buf->len reside in memory shared with userspace. They
should be written with WRITE_ONCE() to guarantee atomic stores and
prevent tearing or other unsafe compiler optimizations.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Cc: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-05 09:52:02 -07:00
Caleb Sander Mateos
78385c7299 io_uring/kbuf: use READ_ONCE() for userspace-mapped memory
The struct io_uring_buf elements in a buffer ring are in a memory region
accessible from userspace. A malicious/buggy userspace program could
therefore write to them at any time, so they should be accessed with
READ_ONCE() in the kernel. Commit 98b6fa62c84f ("io_uring/kbuf: always
use READ_ONCE() to read ring provided buffer lengths") already switched
the reads of the len field to READ_ONCE(). Do the same for bid and addr.
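
A hedged kernel-style fragment of the access discipline that this
commit and the WRITE_ONCE counterpart above converge on; buf points at
a struct io_uring_buf entry in the shared ring, and new_addr/new_len
are placeholders:

    /* loads from userspace-shared memory: no tearing or re-reading */
    u64 addr = READ_ONCE(buf->addr);
    u32 len  = READ_ONCE(buf->len);
    u16 bid  = READ_ONCE(buf->bid);

    /* stores back into the shared ring (e.g. recycling an entry) */
    WRITE_ONCE(buf->addr, new_addr);
    WRITE_ONCE(buf->len, new_len);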

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Cc: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:46:23 -07:00
Joanne Koong
525916ce49 io_uring/rsrc: fix lost entries after cloned range
When cloning with node replacements (IORING_REGISTER_DST_REPLACE),
destination entries after the cloned range are not copied over.

Add logic to copy them over to the new destination table.

Fixes: c1329532d5aa ("io_uring/rsrc: allow cloning with node replacements")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:46:13 -07:00
Joanne Koong
e29af2aba2 io_uring/rsrc: rename misleading src_node variable in io_clone_buffers()
The variable holds nodes from the destination ring's existing buffer
table. In io_clone_buffers(), the term "src" is used to refer to the
source ring.

Rename to node for clarity.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:46:13 -07:00
Joanne Koong
b8201b50e4 io_uring/rsrc: clean up buffer cloning arg validation
Get rid of some redundant checks and move the src arg validation to
before the buffer table allocation, which simplifies error handling.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 15:46:13 -07:00
Fengnan Chang
48f22f8093 block: enable per-cpu bio cache by default
Since commit 12e4e8c7ab59 ("io_uring/rw: enable bio caches for
IRQ rw"), bio_put() has been safe in both task and irq context, and
bio_alloc_bioset() is safe in task context with no callers in irq
context, so we can enable the per-cpu bio cache by default.

Benchmarked with t/io_uring and ext4+nvme:
taskset -c 6 /root/fio/t/io_uring  -p0 -d128 -b4096 -s1 -c1 -F1 -B1 -R1
-X1 -n1 -P1  /mnt/testfile
base IOPS is 562K, patch IOPS is 574K. The CPU usage of bio_alloc_bioset
decrease from 1.42% to 1.22%.

The worst case is allocating a bio on CPU A but freeing it on CPU B;
still using t/io_uring and ext4+nvme:
base IOPS is 648K, patch IOPS is 647K.

Also tested fio on ext4/xfs with libaio/sync/io_uring against null_blk
and nvme; no obvious performance regression.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:19:24 -07:00
Caleb Sander Mateos
34c78b8610 io_uring/io-wq: always retry worker create on ERESTART*
If a task has a pending signal when create_io_thread() is called,
copy_process() will return -ERESTARTNOINTR. io_should_retry_thread()
will request a retry of create_io_thread() up to WORKER_INIT_LIMIT = 3
times. If all retries fail, the io_uring request will fail with
ECANCELED.
Commit 3918315c5dc ("io-wq: backoff when retrying worker creation")
added a linear backoff to allow the thread to handle its signal before
the retry. However, a thread receiving frequent signals may get unlucky
and have a signal pending at every retry. Since the userspace task
doesn't control when it receives signals, there's no easy way for it to
prevent the create_io_thread() failure due to pending signals. The task
may also lack the information necessary to regenerate the canceled SQE.
So always retry the create_io_thread() on the ERESTART* errors,
analogous to what a fork() syscall would do. EAGAIN can occur due to
various persistent conditions such as exceeding RLIMIT_NPROC, so respect
the WORKER_INIT_LIMIT retry limit for EAGAIN errors.
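
A hedged sketch of the resulting retry policy; this is close to, but
not verbatim, the kernel helper, and the retry counter field name is
illustrative:

    static bool should_retry_create(struct io_worker *worker, long err)
    {
            switch (err) {
            case -ERESTARTSYS:
            case -ERESTARTNOINTR:
            case -ERESTARTNOHAND:
                    /* a signal was pending: always retry, like fork() */
                    return true;
            }
            /* EAGAIN may be persistent (e.g. RLIMIT_NPROC): bound it */
            return err == -EAGAIN &&
                   worker->init_retries++ < WORKER_INIT_LIMIT;
    }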

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:18:02 -07:00
Jens Axboe
84230ad2d2 io_uring/poll: correctly handle io_poll_add() return value on update
When the core of io_uring was updated to handle completions
consistently and with fixed return codes, the POLL_REMOVE opcode
with updates got slightly broken. If a POLL_ADD is pending and
POLL_REMOVE is then used to update the events of that request, and that
update causes the POLL_ADD to now trigger, then that completion is lost
and a CQE is never posted.

Additionally, ensure that if an update does cause an existing POLL_ADD
to complete, the completion value isn't overwritten with -ECANCELED.
In that case, whatever value io_poll_add() set should just be
retained.

Cc: stable@vger.kernel.org
Fixes: 97b388d70b53 ("io_uring: handle completions in the core")
Reported-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com
Tested-by: syzbot+641eec6b7af1f62f2b99@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-04 07:18:01 -07:00
Linus Torvalds
0abcfd8983 for-6.19/io_uring-20251201
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmktsm0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiLvD/0dptgeJyLHKchOtRHzi/UvtM/EuNFKJrvI
 LBWCyIMjygxsVfPR41Lave9SE3UpcavF8Mg/EddasTci8VlMcDF8zPxWLb289Lz2
 tkp/wOVuyYmDhNXKmKNW59NOPTd0NosEJFTZI4VhMudwx+UtAHELJGfBWW5hRyQB
 Md+UwZ2+J9HbYd19mToaDFxz7jpIPLEE4BYUGtljveRUdpnxhyFGGUS2+CQXZt/5
 lnRvJmmEv4nSGH9ZRksix1xnV6KvJM0UwYQhrWvXhgwyiKu47zG7ONpd39KqoaRw
 Fw+6zZd0t7nyyuZkk15cKNnBLnjilnsCzmdcPq0Cuvkmbf6y1hlhEQQTGWXTKfJx
 zCZxEZcnCC4wL0CBQjZjS38AEMfH2p76M/36+NTWtlYCibY7qUtd9ndpUr49BYGo
 o4qfT0HMpI1PHuUvpZwpMcf4OX5qvtLmavT9vt78uqmtM+Aryzzuy3bI3S2SGjNe
 if/cNHnZc8Z06hUqdEit5NW+lYzj642AoF/j7qH9ADDH+VXRWaCdK/iI8tPaEpDV
 Rw6j442eVugS5tDPoTjdO8jsJ9+OCNNV1t/Jxy+Or+zrGdq7lfg4mnzEia1/izy5
 8MnSubRy6LEd+I5PnK/9y9mPIwFMIFgULi+mUjucAhJjRj5beiG74eR6+jBAdyp1
 GhFvN6fwdw==
 =4g/f
 -----END PGP SIGNATURE-----

Merge tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Unify how task_work cancelations are detected, placing it in the
   task_work running state rather than needing to check the task state

 - Series cleaning up and moving the cancelation code to where it
   belongs, in cancel.c

 - Cleanup of waitid and futex argument handling

 - Add support for mixed sized SQEs. 6.18 added support for mixed sized
   CQEs, improving flexibility and efficiency of workloads that need big
   CQEs. This adds similar support for SQEs, where the occasional need
   for a 128b SQE doesn't necessitate having all SQEs be 128b in size

 - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx
   features are available. And both return the ring size information to
   help with allocation size calculation for user provided rings like
   IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER

 - Zcrx updates for 6.19. It includes a bunch of small patches,
   IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing
   zcrx between multiple io_uring instances

 - Series cleaning up ring initializations, notably deduplicating ring
   size and offset calculations. It also moves most of the checking
   before doing any allocations, making the code simpler

 - Add support for getsockname and getpeername, which is mostly a
   trivial hookup after a bit of refactoring on the networking side

 - Various fixes and cleanups

* tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring: Introduce getsockname io_uring cmd
  socket: Split out a getsockname helper for io_uring
  socket: Unify getsockname and getpeername implementation
  io_uring/query: drop unused io_handle_query_entry() ctx arg
  io_uring/kbuf: remove obsolete buf_nr_pages and update comments
  io_uring/register: use correct location for io_rings_layout
  io_uring/zcrx: share an ifq between rings
  io_uring/zcrx: add io_fill_zcrx_offsets()
  io_uring/zcrx: export zcrx via a file
  io_uring/zcrx: move io_zcrx_scrub() and dependencies up
  io_uring/zcrx: count zcrx users
  io_uring/zcrx: add sync refill queue flushing
  io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
  io_uring/zcrx: elide passing msg flags
  io_uring/zcrx: use folio_nr_pages() instead of shift operation
  io_uring/zcrx: convert to use netmem_desc
  io_uring/query: introduce rings info query
  io_uring/query: introduce zcrx query
  io_uring: move cq/sq user offset init around
  io_uring: pre-calculate scq layout
  ...
2025-12-03 18:58:57 -08:00
Linus Torvalds
1b5dd29869 vfs-6.19-rc1.fd_prepare.fs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZwAKCRCRxhvAZXjc
 op0AAP4oNVJkFyvgKoPos5K2EGFB8M7merGhpYtsOoeg8UK6OwD/UySQErHsXQDR
 sUDDa5uFOhfrkcfM8REtAN4wF8p5qAc=
 =QgFD
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fd prepare updates from Christian Brauner:
 "This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
  common pattern of get_unused_fd_flags() + create file + fd_install()
  that is used extensively throughout the kernel and currently requires
  cumbersome cleanup paths.

  FD_ADD() - For simple cases where a file is installed immediately:

      fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
      if (fd < 0)
          vfio_device_put_registration(device);
      return fd;

  FD_PREPARE() - For cases requiring access to the fd or file, or
  additional work before publishing:

      FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
      if (fdf.err) {
          fput(sync_file->file);
          return fdf.err;
      }

      data.fence = fd_prepare_fd(fdf);
      if (copy_to_user((void __user *)arg, &data, sizeof(data)))
          return -EFAULT;

      return fd_publish(fdf);

  The primitives are centered around struct fd_prepare. FD_PREPARE()
  encapsulates all allocation and cleanup logic and must be followed by
  a call to fd_publish() which associates the fd with the file and
  installs it into the caller's fdtable. If fd_publish() isn't called,
  both are deallocated automatically. FD_ADD() is a shorthand that does
  fd_publish() immediately and never exposes the struct to the caller.

  I've implemented this in a way that it's compatible with the cleanup
  infrastructure while also being usable separately. IOW, it's centered
  around struct fd_prepare which is aliased to class_fd_prepare_t and so
  we can make use of all the basic guard infrastructure"

* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
  io_uring: convert io_create_mock_file() to FD_PREPARE()
  file: convert replace_fd() to FD_PREPARE()
  vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
  tty: convert ptm_open_peer() to FD_ADD()
  ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
  media: convert media_request_alloc() to FD_PREPARE()
  hv: convert mshv_ioctl_create_partition() to FD_ADD()
  gpio: convert linehandle_create() to FD_PREPARE()
  pseries: port papr_rtas_setup_file_interface() to FD_ADD()
  pseries: convert papr_platform_dump_create_handle() to FD_ADD()
  spufs: convert spufs_gang_open() to FD_PREPARE()
  papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
  spufs: convert spufs_context_open() to FD_PREPARE()
  net/socket: convert __sys_accept4_file() to FD_ADD()
  net/socket: convert sock_map_fd() to FD_ADD()
  net/kcm: convert kcm_ioctl() to FD_PREPARE()
  net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
  secretmem: convert memfd_secret() to FD_ADD()
  memfd: convert memfd_create() to FD_ADD()
  bpf: convert bpf_token_create() to FD_PREPARE()
  ...
2025-12-01 17:32:07 -08:00
Linus Torvalds
1885cdbfbb vfs-6.19-rc1.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZAAKCRCRxhvAZXjc
 ooCXAQCwzX2GS/55QHV6JXBBoNxguuSQ5dCj91ZmTfHzij0xNAEAhKEBw7iMGX72
 c2/x+xYf+Pc6mAfxdus5RLMggqBFPAk=
 =jInB
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull iomap updates from Christian Brauner:
 "FUSE iomap Support for Buffered Reads:

    This adds iomap support for FUSE buffered reads and readahead. This
    enables granular uptodate tracking with large folios so only
    non-uptodate portions need to be read. Also fixes a race condition
    with large folios + writeback cache that could cause data corruption
    on partial writes followed by reads.

     - Refactored iomap read/readahead bio logic into helpers
     - Added caller-provided callbacks for read operations
     - Moved buffered IO bio logic into new file
     - FUSE now uses iomap for read_folio and readahead

  Zero Range Folio Batch Support:

    Add folio batch support for iomap_zero_range() to handle dirty
    folios over unwritten mappings. Fix raciness issues where dirty data
    could be lost during zero range operations.

     - filemap_get_folios_tag_range() helper for dirty folio lookup
     - Optional zero range dirty folio processing
     - XFS fills dirty folios on zero range of unwritten mappings
     - Removed old partial EOF zeroing optimization

  DIO Write Completions from Interrupt Context:

    Restore pre-iomap behavior where pure overwrite completions run
    inline rather than being deferred to workqueue. Reduces context
    switches for high-performance workloads like ScyllaDB.

     - Removed unused IOCB_DIO_CALLER_COMP code
     - Error completions always run in user context (fixes zonefs)
     - Reworked REQ_FUA selection logic
     - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP

  Buffered IO Cleanups:

    Some performance and code clarity improvements:

     - Replace manual bitmap scanning with find_next_bit()
     - Simplify read skip logic for writes
     - Optimize pending async writeback accounting
     - Better variable naming
     - Documentation for iomap_finish_folio_write() requirements

  Misaligned Vectors for Zoned XFS:

    Enables sub-block aligned vectors in XFS always-COW mode for zoned
    devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.

  Bug Fixes:

     - Allocate s_dio_done_wq for async reads (fixes syzbot report after
       error completion changes)
     - Fix iomap_read_end() for already uptodate folios (regression fix)"

* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
  iomap: allocate s_dio_done_wq for async reads as well
  iomap: fix iomap_read_end() for already uptodate folios
  iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
  iomap: support write completions from interrupt context
  iomap: rework REQ_FUA selection
  iomap: always run error completions in user context
  fs, iomap: remove IOCB_DIO_CALLER_COMP
  iomap: use find_next_bit() for uptodate bitmap scanning
  iomap: use find_next_bit() for dirty bitmap scanning
  iomap: simplify when reads can be skipped for writes
  iomap: simplify ->read_folio_range() error handling for reads
  iomap: optimize pending async writeback accounting
  docs: document iomap writeback's iomap_finish_folio_write() requirement
  iomap: account for unaligned end offsets when truncating read range
  iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
  xfs: support sub-block aligned vectors in always COW mode
  iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
  xfs: error tag to force zeroing on debug kernels
  iomap: remove old partial eof zeroing optimization
  xfs: fill dirty folios on zero range of unwritten mappings
  ...
2025-12-01 08:14:00 -08:00
Christian Brauner
6fb1022918
io_uring: convert io_create_mock_file() to FD_PREPARE()
Link: https://patch.msgid.link/20251123-work-fd-prepare-v4-45-b6efa1706cfd@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-28 12:42:36 +01:00
Gabriel Krisman Bertazi
5d24321e4c io_uring: Introduce getsockname io_uring cmd
Introduce a socket-specific io_uring_cmd to support
getsockname/getpeername via io_uring.  I made this an io_uring_cmd
instead of a new operation to avoid polluting the command namespace with
what is exclusively a socket operation.  In addition, since we don't
need to conform to existing interfaces, this merges getsockname and
getpeername into a single operation, since the implementation is pretty
much the same.

This has been frequently requested, for instance at [1] and more
recently in the project Discord channel. The main use-case is to support
fixed socket file descriptors.

[1] https://github.com/axboe/liburing/issues/1356
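
As a rough illustration, issuing the command from userspace could look
like the sketch below, built on liburing's generic socket cmd prep.
The SOCKET_URING_OP_GETSOCKNAME opcode name and the argument mapping
are assumptions for illustration, not necessarily the final uapi:

    /* hedged sketch; opcode name and optval/optlen mapping assumed */
    struct sockaddr_storage addr;
    int addrlen = sizeof(addr);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_cmd_sock(sqe, SOCKET_URING_OP_GETSOCKNAME, sockfd,
                           0, 0, &addr, addrlen);
    io_uring_submit(&ring);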

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 13:45:23 -07:00
Caleb Sander Mateos
1e93de9205 io_uring/query: drop unused io_handle_query_entry() ctx arg
io_handle_query_entry() doesn't use its struct io_ring_ctx *ctx
argument. So remove it from the function and its callers.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26 09:37:10 -07:00
Pavel Begunkov
f6dc5a3619 io_uring: fix mixed cqe overflow handling
I started to see zcrx data corruption. That turned out to be due to
the CQ tail pointing to a stale entry which happened to be from a zcrx
request, i.e. the tail was incremented without the CQE memory being
changed.

The culprit is __io_cqring_overflow_flush() passing "cqe32=true"
to io_get_cqe_overflow() for non-mixed CQE32 setups, which only
expects it to be set for mixed 32B CQEs and not for SETUP_CQE32.

The fix is slightly hacky; long term, it's better to unify the mixed
and CQE32 handling.
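
A hedged sketch of the described condition; both predicates below are
placeholders rather than kernel identifiers:

    /* illustrative only: pass cqe32 == true to io_get_cqe_overflow()
     * solely for 32B entries on IORING_SETUP_CQE_MIXED rings, never
     * just because the ring was set up with IORING_SETUP_CQE32 */
    bool cqe32 = ring_is_mixed && entry_is_32b;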

Fixes: e26dca67fde19 ("io_uring: add support for IORING_SETUP_CQE_MIXED")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-25 07:03:45 -07:00
Christoph Hellwig
f9f8514999
fs, iomap: remove IOCB_DIO_CALLER_COMP
This was added by commit 099ada2c8726 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a89 ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work.  Remove all the related code that sat unused for 2 years.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-25 10:22:18 +01:00
Jens Axboe
f6041803a8 io_uring/net: ensure vectored buffer node import is tied to notification
When support for vectored registered buffers was added, the import
itself ended up using 'req' rather than the notification io_kiocb,
sr->notif.
For non-vectored imports, sr->notif is correctly used. This is important
as the lifetime of the two may be different. Use the correct io_kiocb
for the vectored buffer import.
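
In pattern form, a hedged sketch of the change; the helper's argument
list here approximates the net.c code and isn't a literal diff:

    /* charge the buffer node to the notification, not to 'req',
     * since the notification may outlive the submitting request */
    ret = io_import_reg_vec(ITER_SOURCE, &kmsg->msg.msg_iter,
                            sr->notif, &kmsg->vec, msg.msg_iovlen,
                            issue_flags);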

Cc: stable@vger.kernel.org
Fixes: 23371eac7d9a ("io_uring/net: implement vectored reg bufs for zctx")
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-463332873@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-24 10:59:02 -07:00
Joanne Koong
84692a1519 io_uring/kbuf: remove obsolete buf_nr_pages and update comments
The buf_nr_pages field in io_buffer_list was previously used to
determine whether the buffer list uses ring-provided buffers or classic
provided buffers. This is now determined by checking the IOBL_BUF_RING
flag.

Remove the buf_nr_pages field and update related comments.
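
The replacement check is a plain flag test; a hedged one-liner:

    /* ring-provided buffers are now identified by the flag alone */
    if (bl->flags & IOBL_BUF_RING)
        /* ring-mapped buffer path */;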

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 13:23:05 -07:00
Jens Axboe
46447367a5 io_uring/cmd_net: fix wrong argument types for skb_queue_splice()
If timestamp retrieval needs to be retried and the local list of
SKBs already has entries, then it's spliced back into the socket
queue. However, the arguments for the splice helper are transposed,
causing the splice to go in exactly the wrong direction, into the
on-stack list. Fix that up.
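
To make the direction concrete, a hedged sketch (variable names are
illustrative): skb_queue_splice(list, head) splices @list into @head,
so the on-stack list must be the first argument when handing entries
back:

    /* wrong: drains the socket queue into the on-stack list */
    skb_queue_splice(&sk->sk_error_queue, &local_list);

    /* right: returns the locally collected SKBs to the socket */
    skb_queue_splice(&local_list, &sk->sk_error_queue);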

Cc: stable@vger.kernel.org
Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-462435176@google.com>
Fixes: 9e4ed359b8ef ("io_uring/netcmd: add tx timestamping cmd support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-20 11:40:15 -07:00
Jens Axboe
f779ac0b87 io_uring/register: use correct location for io_rings_layout
A previous commit consolidated the ring size and offset calculations
into io_prepare_config(), but missed updating io_register_resize_rings()
to use the calculated values. As a result, it ended up using on-stack
uninitialized values, and hence either failed to validate the size
correctly, or just failed resizing because the sizes were random.

This caused failures in the liburing regression tests:

[...]
Running test resize-rings.t
resize=-7
test_basic 3000 failed
Test resize-rings.t failed with ret 1
Running test resize-rings.t /dev/sda
resize=-7
test_basic 3000 failed
Test resize-rings.t failed with ret 1
Running test resize-rings.t /dev/nvme1n1
resize=-7
test_basic 3000 failed
Test resize-rings.t failed with ret 1
Running test resize-rings.t /dev/dm-0
resize=-7
test_basic 3000 failed
Test resize-rings.t failed with ret 1

because io_create_region() would return -E2BIG due to uninitialized
reg->size values.

Adjust the struct io_rings_layout rl pointer to point to the correct
location, and remove the (now dead) __rl on stack struct.
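
In pattern form, the fix looks roughly like the sketch below; the
config field holding the layout is an assumption for illustration:

    /* before: rl pointed at a stack struct nothing ever filled in */
    struct io_rings_layout __rl, *rl = &__rl;

    /* after: point at the layout io_prepare_config() computed */
    struct io_rings_layout *rl = &cfg.rl;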

Fixes: eb76ff6a6829 ("io_uring: pre-calculate scq layout")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-18 19:30:45 -07:00
Ryan Roberts
9ac09bb9fe mm: consistently use current->mm in mm_get_unmapped_area()
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() /
arch_get_unmapped_area_topdown(), both of which search current->mm for
some free space.  Neither takes an mm_struct - they implicitly operate
on current->mm.

But the wrapper takes an mm_struct and uses it to decide whether to search
bottom up or top down.  All callers pass in current->mm for this, so
everything is working consistently.  But it feels like an accident waiting
to happen; eventually someone will call that function with a different mm,
expecting to find free space in it, but what gets returned is free space
in the current mm.

So let's simplify by removing the parameter and having the wrapper use
current->mm to decide which end to start at.  Now everything is consistent
and self-documenting.
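
A minimal sketch of the wrapper after this change, assuming the
topdown decision keys off MMF_TOPDOWN and with the callee argument
lists trimmed for brevity:

    unsigned long
    mm_get_unmapped_area(struct file *file, unsigned long addr,
                         unsigned long len, unsigned long pgoff,
                         unsigned long flags)
    {
        /* always operate on current->mm, matching the callees */
        struct mm_struct *mm = current->mm;

        if (test_bit(MMF_TOPDOWN, &mm->flags))
            return arch_get_unmapped_area_topdown(file, addr, len,
                                                  pgoff, flags);
        return arch_get_unmapped_area(file, addr, len, pgoff, flags);
    }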

Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:27:57 -08:00
David Wei
00d9148127 io_uring/zcrx: share an ifq between rings
Add a way to share an ifq from a source ring that is real (i.e. bound to a
HW RX queue) with other rings. This is done by passing a new flag
IORING_ZCRX_IFQ_REG_IMPORT in the registration struct
io_uring_zcrx_ifq_reg, alongside the fd of an exported zcrx ifq.
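
A hedged registration sketch; only the flag and struct names above
come from this patch, the field carrying the exported fd is
hypothetical:

    struct io_uring_zcrx_ifq_reg reg = {
        .flags = IORING_ZCRX_IFQ_REG_IMPORT,
        /* fd from ZCRX_CTRL_EXPORT on the source ring; the field
         * name here is hypothetical */
        .zcrx_fd = exported_fd,
    };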

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
David Wei
0926f94ab3 io_uring/zcrx: add io_fill_zcrx_offsets()
Add a helper io_fill_zcrx_offsets() that sets the constant offsets in
struct io_uring_zcrx_offsets returned to userspace.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
d7af80b213 io_uring/zcrx: export zcrx via a file
Add an option to wrap a zcrx instance in a file and expose it to
userspace. Currently, users can't do anything meaningful with the file,
but it'll be used in a following patch to import it into another
instance. It's implemented as a new op called ZCRX_CTRL_EXPORT for the
IORING_REGISTER_ZCRX_CTRL registration opcode.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
David Wei
742cb2e14e io_uring/zcrx: move io_zcrx_scrub() and dependencies up
In preparation for adding zcrx ifq exporting and importing, move
io_zcrx_scrub() and its dependencies up in the file to be closer to
io_close_queue().

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
39c9676f78 io_uring/zcrx: count zcrx users
zcrx tries to detach ifq / terminate page pools when the io_uring ctx
owning it is being destroyed. There will be multiple io_uring instances
attached to it in the future, so add a separate counter to track the
users. Note, refs can't be reused for this purpose, as they are only
used to prevent zcrx and ring destruction, and are also used by page
pools to keep it alive.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
475eb39b00 io_uring/zcrx: add sync refill queue flushing
Add a zcrx interface via IORING_REGISTER_ZCRX_CTRL that forces the
kernel to flush / consume entries from the refill queue. Just as with
the IORING_REGISTER_ZCRX_REFILL attempt, the motivation is to address
cases where the refill queue becomes full, and the user can't return
buffers and needs to stash them. It's still a slow path, and the user
should size refill queue appropriately, but it should be helpful for
handling temporary traffic spikes and other unpredictable conditions.

The interface is simpler compared to ZCRX_REFILL as it doesn't need
temporary refill entry arrays and gives natural batching, whereas
ZCRX_REFILL requires even more user logic to be somewhat efficient.

Also, add a structure for the operation. It's not currently used but
can serve for future improvements like limiting the number of buffers to
process, etc.
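
Invoking the flush could look roughly like the hedged sketch below;
the ctrl struct layout and the ZCRX_CTRL_FLUSH_RQ subcode name are
assumptions for illustration:

    struct io_uring_zcrx_ctrl ctrl = {
        .zcrx_id = zcrx_id,               /* which ifq to flush */
        .op = ZCRX_CTRL_FLUSH_RQ,         /* hypothetical subcode */
    };

    /* raw register syscall; a typed liburing helper may come later */
    syscall(__NR_io_uring_register, ring_fd,
            IORING_REGISTER_ZCRX_CTRL, &ctrl, 1);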

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
d663976dad io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL
It would be annoying, and take a fair amount of boilerplate code, to
implement each new zcrx feature as a separate io_uring register opcode.
Introduce IORING_REGISTER_ZCRX_CTRL, which will multiplex such calls to
zcrx. Note, there are no real users of the opcode in this patch.
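
Conceptually it's a small demultiplexer; a hedged sketch of the shape,
with placeholder subcode and handler names since this patch adds no
real users:

    switch (ctrl->op) {
    case ZCRX_CTRL_PLACEHOLDER:          /* hypothetical subcode */
        return io_zcrx_ctrl_placeholder(ctx, ctrl);
    default:
        return -EOPNOTSUPP;
    }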

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
1b8b5d0316 io_uring/zcrx: elide passing msg flags
zcrx sqe->msg_flags has never been defined and is checked to be zero,
so it doesn't need to be a MSG_* bitmask. Keep the flags undefined,
don't mix them with MSG_DONTWAIT, and don't pass them into
io_zcrx_recv() as they're ignored anyway.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pedro Demarchi Gomes
a0169c3a62 io_uring/zcrx: use folio_nr_pages() instead of shift operation
folio_nr_pages() is a faster helper for getting the number of pages in
a folio when NR_PAGES_IN_LARGE_FOLIO is enabled.
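
In pattern form, the conversion is a one-liner (a hedged sketch, not
the literal diff):

    /* before: derive the page count from the folio size */
    nr = folio_size(folio) >> PAGE_SHIFT;

    /* after: read the count via the dedicated helper */
    nr = folio_nr_pages(folio);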

Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
f0243d2b86 io_uring/zcrx: convert to use netmem_desc
Convert zcrx to struct netmem_desc, and use struct net_iov::desc to
access its fields instead of the struct net_iov inner union aliases.
zcrx only directly reads niov->pp, so with this patch it doesn't depend
on the union anymore.
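
In pattern form, the access change is (hedged sketch, not the literal
diff):

    /* before: through the struct net_iov union alias */
    pp = niov->pp;

    /* after: through the embedded netmem_desc */
    pp = niov->desc.pp;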

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Byungchul Park <byungchul@sk.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:19:37 -07:00
Pavel Begunkov
4aaa9bc4d5 io_uring/query: introduce rings info query
Same problem as with zcrx in the previous patch: the user needs to know
the SQ/CQ header sizes to allocate memory before setup for user
provided rings, i.e. IORING_SETUP_NO_MMAP. However, that information is
only returned after registration, hence the user is left guessing
kernel implementation details.

Return the header size and alignment, split into two fields with the
same motivation: to allow the user to know the real structure size
without alignment, in case there will be more flexible placement
schemes in the future.
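
With that, sizing user provided ring memory becomes a straightforward
computation; a hedged sketch with assumed query field names:

    /* CQ ring bytes for IORING_SETUP_NO_MMAP: aligned header plus
     * the CQE array; q.* field names are illustrative */
    size_t cq_bytes = ALIGN(q.cq_hdr_size, q.cq_hdr_align) +
                      cq_entries * sizeof(struct io_uring_cqe);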

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:17:36 -07:00
Pavel Begunkov
2647e2ecc0 io_uring/query: introduce zcrx query
Add a new query type IO_URING_QUERY_ZCRX returning some basic
information about the interface to the user, including the allowed
flags for areas and registration, and the supported
IORING_REGISTER_ZCRX_CTRL subcodes.

There is also a chicken-and-egg problem with user provided refill
queue memory, where offset and size information is returned after
registration, but to properly allocate the memory you need to know it
beforehand, which is why userspace currently has to guess the RQ header
size and severely overestimate it. Return the size information. It's
split into "size" and "alignment" fields because for the default
placement modes the user is interested in the aligned size, however if
more flexible placement gets supported, it'll only need to know the
actual header size.
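
The refill queue sizing then follows the same pattern; a hedged sketch
with assumed query field names:

    /* aligned RQ header followed by the refill entries;
     * q.* field names are illustrative */
    size_t rq_bytes = ALIGN(q.rq_hdr_size, q.rq_hdr_align) +
                      nentries * sizeof(struct io_uring_zcrx_rqe);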

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-13 11:17:36 -07:00