mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-01-15 19:12:15 +00:00
8530 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e5ac874257 |
block-6.16-20250718
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmh6ZU8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmRdD/9Q6I5VC13uVjbrXLA3R4d+gLsDzcVv3lIp
ps9HBz1s5yXIP9hb68pnIu6H+SGKyzd83Uqst/74+NzQWAuDaWO9ydT1DLu5bpHS
Q1qjA1seIbhPRi184wXSqjr3OgaX0rNdzOkWL/PKQ0dHFx54adrXiu3qoSWvBQYg
YUrMvFFmNN7gQdTagburM+g4RXRWqhqcn0FJfyb1IX90gQNVCv8JKY2NkJbF9SIM
rlAQoZefoiX+5Fo8dGIutaZRZ1X04lIv9S5oXxzgw/4xhUtGrVfL1mwSCS9twQtp
5r2v7dcUqCxZ1pwHJazMene/Y5540ycZR3KgMsh8Ggxs9is1GbzbrXMe3gdDogTR
10k9X1C0NLkeQ6h12kcX9TlMuN4jbBRbXsNQQnTd0XEvMUVxggRcg3j/TQ/+W5Uj
eEMmWKbZD1PZsxqxqKJ8T0NzNY5JdZYdRLo+4lrp3Lw2b3o1cyUQ2pONGNRzEClj
4iHWQuopbB5AV3jo9lxOrZD8tZywDNjNFYFz7aTQ5OXIA98lbAM5/0NXcExvk047
5FAjzo0dbfHVFX3jfPwTUifxFXZ3nDJSBBO2y6tKvllwZU6f/gIMSPCbeP5yQsdW
jwGv5IBRBvLj7RDChSFo14KQdupNhBmIfru6huZwtT8vj4IHXRkaRAgMFqE23owx
4HLPoGR5Ww==
=hRsE
-----END PGP SIGNATURE-----
Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe changes via Christoph:
- revert the cross-controller atomic write size validation
that caused regressions (Christoph Hellwig)
- fix endianness of command word printout in
nvme_log_err_passthru() (John Garry)
- fix callback lock for TLS handshake (Maurizio Lombardi)
- fix misaccounting of nvme-mpath inflight I/O (Yu Kuai)
- fix inconsistent RCU list manipulation in
nvme_ns_add_to_ctrl_list() (Zheng Qixing)
- Fix for a kobject leak in queue unregistration
- Fix for loop async file write start/end handling
* tag 'block-6.16-20250718' of git://git.kernel.dk/linux:
loop: use kiocb helpers to fix lockdep warning
nvmet-tcp: fix callback lock for TLS handshake
nvme: fix misaccounting of nvme-mpath inflight I/O
nvme: revert the cross-controller atomic write size validation
nvme: fix endianness of command word prints in nvme_log_err_passthru()
nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()
block: fix kobject leak in blk_unregister_queue
|
||
|
|
c4706c5058 |
loop: use kiocb helpers to fix lockdep warning
The lockdep tool can report a circular lock dependency warning in the loop
driver's AIO read/write path:
```
[ 6540.587728] kworker/u96:5/72779 is trying to acquire lock:
[ 6540.593856] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop]
[ 6540.603786]
[ 6540.603786] but task is already holding lock:
[ 6540.610291] ff110001b5968440 (sb_writers#9){.+.+}-{0:0}, at: loop_process_work+0x11a/0xf70 [loop]
[ 6540.620210]
[ 6540.620210] other info that might help us debug this:
[ 6540.627499] Possible unsafe locking scenario:
[ 6540.627499]
[ 6540.634110] CPU0
[ 6540.636841] ----
[ 6540.639574] lock(sb_writers#9);
[ 6540.643281] lock(sb_writers#9);
[ 6540.646988]
[ 6540.646988] *** DEADLOCK ***
```
This patch fixes the issue by using the AIO-specific helpers
`kiocb_start_write()` and `kiocb_end_write()`. These functions are
designed to be used with a `kiocb` and manage write sequencing
correctly for asynchronous I/O without introducing the problematic
lock dependency.
The `kiocb` is already part of the `loop_cmd` struct, so this change
also simplifies the completion function `lo_rw_aio_do_completion()` by
using the `iocb` from the `cmd` struct directly, instead of retrieving
the loop device from the request queue.
Fixes: 39d86db34e41 ("loop: add file_start_write() and file_end_write()")
Cc: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250716114808.3159657-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
40f92e79b0 |
block-6.16-20250710
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhwb8QQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpuDTEAC5J4noilx4TRpKQ0gp3cF9KHvB2wAD0ry8
1Y45lccZGrdUyWBnB7KyvIKUHt4MVk5Lw4d3vkv1Shx6XesW35hbCOI2W7UPsMsL
nEBYJcroNNKlTlx9TJazVs0xmjF6G7JwaYXD6CVNLkjAQXxdeGst2Or15vhD4soz
3nmwFAyP3sEU7ESRNZ53UaNaM2KW0BBNef+jcFn9MOdSZcilePY7ckh74JzCc9Oc
GIcH0eTRDdfPi3TteLu/2VMNjpogX+9LY41r3laSKwSgEcYmj+pPFLuqjU6A82hg
dT8FWJLR+GuUWTs9B7FuWWpmk7uwOPrIadSQo2DcTdiBSvBYuGv+0BPIxq1kfykn
cUjresj49q2hNAjBK71iEDycZR+W+pn864r1mJg+8pASoKKyNX2/3iTvQj57RwFO
phoICyxr37WxCYQMcTXYwPcYD8BnF7mTJIDYDFti4BY1w/dUwlSvsbfI9Zk1rxIH
VumZzML0nhfTbEsq+QVOTZ6bq0hn71EmVONLamM1LdaoAh6PZU2CMduJuPjEgCNz
I73xzc4MlshOZYBidiq++1yFnRX64pB6jPi2omu31PjXMd1ZZaZWENh0OGFp4zHX
8yCmJoWTs8BXy5v74tJWxMTvShfMlqFBuTQlRexh1kala4IdngQrZjO6vBU2pw2C
4orH43oFdA==
=m0St
-----END PGP SIGNATURE-----
Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD changes via Yu:
- fix UAF due to stack memory used for bio mempool (Jinchao)
- fix raid10/raid1 nowait IO error path (Nigel and Qixing)
- fix kernel crash from reading bitmap sysfs entry (Håkon)
- Fix for a UAF in the nbd connect error path
- Fix for blocksize being bigger than pagesize, if THP isn't enabled
* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
block: reject bs > ps block devices when THP is disabled
nbd: fix uaf in nbd_genl_connect() error path
md/md-bitmap: fix GPF in bitmap_get_stats()
md/raid1,raid10: strip REQ_NOWAIT from member bios
raid10: cleanup memleak at raid10_make_request
md/raid1: Fix stack memory use after return in raid1_reshape
|
||
|
|
aa9552438e |
nbd: fix uaf in nbd_genl_connect() error path
There is a use-after-free issue in nbd: block nbd6: Receive control failed (result -104) block nbd6: shutting down sockets ================================================================== BUG: KASAN: slab-use-after-free in recv_work+0x694/0xa80 drivers/block/nbd.c:1022 Write of size 4 at addr ffff8880295de478 by task kworker/u33:0/67 CPU: 2 UID: 0 PID: 67 Comm: kworker/u33:0 Not tainted 6.15.0-rc5-syzkaller-00123-g2c89c1b655c0 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Workqueue: nbd6-recv recv_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:408 [inline] print_report+0xc3/0x670 mm/kasan/report.c:521 kasan_report+0xe0/0x110 mm/kasan/report.c:634 check_region_inline mm/kasan/generic.c:183 [inline] kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_dec include/linux/atomic/atomic-instrumented.h:592 [inline] recv_work+0x694/0xa80 drivers/block/nbd.c:1022 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400 kthread+0x3c2/0x780 kernel/kthread.c:464 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> nbd_genl_connect() does not properly stop the device on certain error paths after nbd_start_device() has been called. This causes the error path to put nbd->config while recv_work continue to use the config after putting it, leading to use-after-free in recv_work. This patch moves nbd_start_device() after the backend file creation. Reported-by: syzbot+48240bab47e705c53126@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68227a04.050a0220.f2294.00b5.GAE@google.com/T/ Fixes: 6497ef8df568 ("nbd: provide a way for userspace processes to identify device backends") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250612132405.364904-1-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
1880df2cf4 |
block-6.16-20250704
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhn80AQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphdSD/93OEB7MwxhhzhaU9U0eiYRPlXcV9+nRMKI
kSjPM/JFdGsiUGcEBvNvSNqJCpxQTytv+1JTPO4KhQ4hjiGDnuuaw51h7Ro3uRlp
75Up2uWnh9RaVRCABJQnHVd6zizij0RFHJYwlYlIXkGVQ6vqmaGz1Y4GAeGD4Jw+
iokVENz4uH9n5Zn3oruvufZk+uffZ++Sr4Vqtq3hVJ78ZWOV+iLXzHJSCmEnWSQL
QptFP+MDSd9o0ej5bKLDP6kG4xIvMkBl9JY+Y2QH+Rev5Jroc26GmTcgwbRTkXDi
hHQgilwmq4LkMyTGDaH2M7BlXoJlAhnWt7/2da9yr6ygLwHoD9LU2ALgGBKgb0r9
E/YrM2ioEC8lkKUGgalX9JReXTExGBvNeaKixi+CoNKDXMauEbJUNkSOH6kfstRo
5QCdn5g9l0Bf6qKBBmAnfty5mDtw9F3mowefxv2DFAPebXD+2I2FyIuafC5LedlE
llsC77t2vBBKOAqL+WXypyYKTKAxMSk9NRO4FFkF9OFDdJIruofHXy0Nsi8aHLV7
defzDrr9y1plYHqjMzJy8VfLvv+2YDrmkldBgcfxMRBWfetD3XIOGCmpBFmdOcgx
FUqviNDc7Yr2LyDwMdIPfS8ZqmAdmB198/c7UrRdiZe/QyB7tMeeo1vzeCw3XF3n
srEJ1bJLxA==
=1VG9
-----END PGP SIGNATURE-----
Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe fixes via Christoph:
- fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
- fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
- refresh visible attrs after being checked (Eugen Hristev)
- fix suspicious RCU usage warning in the multipath code (Geliang Tang)
- correctly account for namespace head reference counter (Nilay Shroff)
- Fix for a regression introduced in ublk in this cycle, where it would
attempt to queue a canceled request.
- brd RCU sleeping fix, also introduced in this cycle. Bare bones fix,
should be improved upon for the next release.
* tag 'block-6.16-20250704' of git://git.kernel.dk/linux:
brd: fix sleeping function called from invalid context in brd_insert_page()
ublk: don't queue request if the associated uring_cmd is canceled
nvme-multipath: fix suspicious RCU usage warning
nvme-pci: refresh visible attrs after being checked
nvmet: fix memory leak of bio integrity
nvme: correctly account for namespace head reference counter
nvme: Fix incorrect cdw15 value in passthru error logging
|
||
|
|
0d519bb0de |
brd: fix sleeping function called from invalid context in brd_insert_page()
__xa_cmpxchg() is called with rcu_read_lock(), and it will allocate
memory if necessary.
Fix the problem by moving rcu_read_lock() after __xa_cmpxchg(), meanwhile,
it still should be held before xa_unlock(), prevent returned page to be
freed by concurrent discard.
Fixes: bbcacab2e8ee ("brd: avoid extra xarray lookups on first write")
Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
01ed88aea5 |
ublk: don't queue request if the associated uring_cmd is canceled
Commit 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task")
need to dereference `io->cmd` for checking if the IO can be added to current
batch, see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However,
`io->cmd` may become invalid after the uring_cmd is canceled.
Fixes it by only allowing to queue this IO in case that ublk_prep_req()
returns `BLK_STS_OK`, when 'io->cmd' is guaranteed to be valid.
Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
e540341508 |
block-6.16-20250626
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhd4zsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmZgEACOk81RNf8WGNQf4/parSENzebWNj9W+fKD
RDhWxwBAquT2VzkF8Iu6wbteVbP9A8yq4BagbD079OWrr0iV8NgWA5y1GyqdER6N
upe2ZtBlY7RR4F1FerpSGqRBbhWYejNojSr073ea8mmx5Yl0BbHz5aKKmzWGbUYO
lveYPgCeL4dD7kfPeINiamhicLudyAGdqqYpG+/wriefwaVhTgCe+4aQ6pEwftRT
utqCzrpUnxrmXS4TFXiWd4u3iVNwPhzcMyUrgkK1yTM7mWIqp8QyHzfF4Acbh/T3
RN/8d5OCfYmamlRvDUCl3FXWukkdGtBrA4m51mhUIzRJ9Np9IiSHdd2UTDgGqSeG
2NSjLtmdDQvtVXeuqBs56os7e3DFx42LZuceqbGWaTQ4VC4QE+Xz+n2ZENx/hWFZ
/lixcIBdxt6iqjveJuBJeXW6UqaR+Hz4hpSigZU69DMQzrKm65bSoMdOvyn5b0bU
GtlPusSnfgpsSe/H41Lm7SLBePiGXMJvhujzlkWW5cnUUl+yRUQhTO206kQJkbV1
XUMs8Syow15gjQaXI9KiAq+MMUuUwOvXmptMyYQ1NjFy16yzhJ8QOhJilJLWfLdT
SqsLyXn1kG2EdcPmXHJRthIgVmQ+uORy2JB1wAomyjJj9a16wJYhgCGDjrl4mocl
9LpjfnyMsA==
=ln4w
-----END PGP SIGNATURE-----
Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fixes for ublk:
- fix C++ narrowing warnings in the uapi header
- update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
- fix for the ublk ->queue_rqs() implementation, limiting a batch
to just the specific task AND ring
- ublk_get_data() error handling fix
- sanity check more arguments in ublk_ctrl_add_dev()
- selftest addition
- NVMe pull request via Christoph:
- reset delayed remove_work after reconnect
- fix atomic write size validation
- Fix for a warning introduced in bdev_count_inflight_rw() in this
merge window
* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
block: fix false warning in bdev_count_inflight_rw()
ublk: sanity check add_dev input for underflow
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
ublk: setup ublk_io correctly in case of ublk_get_data() failure
ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
ublk: fix narrowing warnings in UAPI header
selftests: ublk: don't take same backing file for more than one ublk devices
ublk: build batch from IOs in same io_ring_ctx and io task
|
||
|
|
969127bf07 |
ublk: sanity check add_dev input for underflow
Add additional checks that queue depth and number of queues are non-zero. Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
4c8a951787 |
ublk: setup ublk_io correctly in case of ublk_get_data() failure
If ublk_get_data() fails, -EIOCBQUEUED is returned and the current command
becomes ASYNC. And the only reason is that mapping data can't move on,
because of no enough pages or pending signal, then the current ublk request
has to be requeued.
Once the request need to be requeued, we have to setup `ublk_io` correctly,
including io->cmd and flags, otherwise the request may not be forwarded to
ublk server successfully.
Fixes: 9810362a57cb ("ublk: don't call ublk_dispatch_req() for NEED_GET_DATA")
Reported-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGVVp+VN9QcpHUz_0nasFf5q9i1gi8H8j-G-6mkBoqa3TyjRHA@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/r/20250624104121.859519-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
524346e9d7 |
ublk: build batch from IOs in same io_ring_ctx and io task
ublk_queue_cmd_list() dispatches the whole batch list by scheduling task
work via the tail request's io_uring_cmd, this way is fine even though
more than one io_ring_ctx are involved for this batch since it is just
one running context.
However, the task work handler ublk_cmd_list_tw_cb() takes `issue_flags`
of tail uring_cmd's io_ring_ctx for completing all commands. This way is
wrong if any uring_cmd is issued from different io_ring_ctx.
Fixes it by always building batch IOs from same io_ring_ctx and io task
because ublk_dispatch_req() does validate task context, and IO needs to
be aborted in case of running from fallback task work context.
For typical per-queue or per-io daemon implementation, this way shouldn't
make difference from performance viewpoint, because single io_ring_ctx is
taken in each daemon for normal use case.
Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250625022554.883571-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
75f5f23f87 |
block-6.16-20250619
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhUFmcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiaZEACPwWddJeq0LZ/Km4UeyJCmDxQRgZ87uZxl DtTwmcpxabpl+eXG8d7jiem6d8T0v/XbugiqP7CmeRMQWsJ1pmBC/vvtyFTQsWNf WzD5iQp4IoBRuEzGzWVB7aso6hpOPTmClydko16+bDuKseTBla8/2iaTcIijs/64 isfmMKUAtRRVqFHAvj5yifjMV17vfPbb/kYAV11YYR+STTijxa8lM9rW32QrxTM9 oIByKBEECFJY+dmlN4VAOez0fGuufuh4IO+5YitRmMdwKipcXUF0H74CEmM8lefU b2smqwoyAz81Z+g1tAlc2kNsLsYBbCjJ+3uis00RGrK23AFY7ucg3wXRuBic01Mb sF3HFwo1t1ekIolH2TqDRMFyYxBQwJ9y8u8lN9w66ubLdtEUrjlnP1fhgHw7Ctj7 7Wdgs+UAssxWE2Mj9JGz9stQrz3IfROrWy8p7LZs5IMEOR04spAackky+ehHhTPb Bp53d+du8UZkH4lwkiN2AnxfPL789ndHUPLs5Mo3NMbC4pduiHJGtJ8SozWwJ1GU slpinkg+hw2IMxuPDfboayZGpPkiQzMIbnhumcXPd4OCzhZupl1n8bEHKVeCVTfm bPJ6J7tQ2LVwFKe1SFVX/7OBger0n13joBuroi0mDa2R2Nve90oAO0AndR475y5q QQfWcKLfqw== =wuVQ -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250619' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Two fixes for aoe which fixes issues dating back to when this driver was converted to blk-mq - Fix for ublk, checking for valid queue depth and count values before setting up a device * tag 'block-6.16-20250619' of git://git.kernel.dk/linux: ublk: santizize the arguments from userspace when adding a device aoe: defer rexmit timer downdev work to workqueue aoe: clean device rq_list in aoedev_downdev() |
||
|
|
8c84728558 |
ublk: santizize the arguments from userspace when adding a device
Sanity check the values for queue depth and number of queues
we get from userspace when adding a device.
Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Fixes: 62fe99cef94a ("ublk: add read()/write() support for ublk char device")
Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
cffc873d68 |
aoe: defer rexmit timer downdev work to workqueue
When aoe's rexmit_timer() notices that an aoe target fails to respond to commands for more than aoe_deadsecs, it calls aoedev_downdev() which cleans the outstanding aoe and block queues. This can involve sleeping, such as in blk_mq_freeze_queue(), which should not occur in irq context. This patch defers that aoedev_downdev() call to the aoe device's workqueue. Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
7f90d45e57 |
aoe: clean device rq_list in aoedev_downdev()
An aoe device's rq_list contains accepted block requests that are waiting to be transmitted to the aoe target. This queue was added as part of the conversion to blk_mq. However, the queue was not cleaned out when an aoe device is downed which caused blk_mq_freeze_queue() to sleep indefinitely waiting for those requests to complete, causing a hang. This fix cleans out the queue before calling blk_mq_freeze_queue(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665 Fixes: 3582dd291788 ("aoe: convert aoeblk to blk-mq") Signed-off-by: Justin Sanders <jsanders.devel@gmail.com> Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com Tested-By: Valentin Kleibel <valentin@vrvis.at> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
f713ffa363 |
block-6.16-20250614
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhNaUIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvUOD/0WlwBN8WxAA+rzUXo42QJ3W+XruQ+VhQdx Hs/DBEH6KZji86ZVzoJOIwsdlSL2/6PxRIZVqwr3Q8aYnNedUnsjcD4frNBl76EA wFfPttjL7DcaOvVhY0n37IrQNmaeQC1R1O2JxhiWTBzNoNf2iWj84vSSgbgcfDVR trfhRvEwRgmAy037/72pUFYN+JRlv80D03SGfWTQtp6/qq+AA/z5XqWdg9I/opVM 7+H5GoWHfPSG0wQo+Dms3mHV4zm5tOOfMmGIR2o4DoueKgMNgnUXRT8dc7DDBsqV 0moKRHKbTbeN1fz3zqcko0Mp1gq+62hF/eXppQSeJMpMuAbcxaA+/ZFv7Ho9ZwYF jJwcp0O5e8XbRHFqrYWysKGKSvYfvTjr08X70+QFzm9ZJaGCtJYd2ceUNmyO2p6s m54gUnPq5d3nABbpCkAdP5sAv0yVV5idIoezCHIaBYQv8qPpKDrdHHXTQY/VX05x VBGmg9hUZSDMiGkR1d4oKTBayehuWVIpyczhy65KbAfoBA62hAl+aAldkpvLRo1r gKsrMSGP/H6zBU/IRaMGc/bnEnP6zFkn5vxnGwpDcD2tdJn0g+yEjIvJSXrmGJ0w lwzqYd3/vhFPmaEDxE3PyOOGBVCOPqGic+Y6OEIuHA3p2HFO3bsh6+64+iqls/so EmiHPp7n5g== =N1zM -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for a deadlock on queue freeze with zoned writes - Fix for zoned append emulation - Two bio folio fixes, for sparsemem and for very large folios - Fix for a performance regression introduced in 6.13 when plug insertion was changed - Fix for NVMe passthrough handling for polled IO - Document the ublk auto registration feature - loop lockdep warning fix * tag 'block-6.16-20250614' of git://git.kernel.dk/linux: nvme: always punt polled uring_cmd end_io work to task_work Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists block: Fix bvec_set_folio() for very large folios bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP block: use plug request list tail for one-shot backmerge attempt block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) loop: move lo_set_size() out of queue freeze |
||
|
|
4bb08cf974 |
loop: move lo_set_size() out of queue freeze
It isn't necessary to freeze queue for updating disk size given submit_bio() doesn't grab queue usage counter for checking eod. Also many driver won't freeze queue for calling set_capacity_and_notify(). Move lo_set_size() out of queue freeze for fixing many lockdep warning report. Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/ Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
41cb08555c |
treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com |
||
|
|
6d8854216e |
block-6.16-20250606
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC7/UQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgps+6D/9BOhkMyMkUF9LAev4PBNE+x3aftjl7Y1AY
EHv2vozb4nDwXIaalG4qGUhprz2+z+hqxYjmnlOAsqbixhcSzKK5z9rjxyDka776
x03vfvKXaXZUG7XN7ENY8sJnLx4QJ0nh4+0gzT9yDyq2vKvPFLEKweNOxKDKCSbE
31vGoLFwjltp74hX+Qrnj1KMaTLgvAaV0eXKWlbX7Iiw6GFVm200zb27gth6U8bV
WQAmjSkFQ0daHtAWmXIVy7hrXiCqe8D6YPKvBXnQ4cfKVbgG0HHDuTmQLpKGzfMi
rr24MU5vZjt6OsYalceiTtifSUcf/I2+iFV7HswOk9kpOY5A2ylsWawRP2mm4PDI
nJE3LaSTRpEvs5kzPJ2kr8Zp4/uvF6ehSq8Y9w52JekmOzxusLcRcswezaO00EI0
32uuK+P505EGTcCBTrEdtaI6k7zzQEeVoIpxqvMhNRG/s5vzvIV3eVrALu2HSDma
P3paEdx7PwJla3ndmdChfh1vUR3TW3gWoZvoNCVmJzNCnLEAScTS2NsiQeEjy8zs
20IGsrRgIqt9KR8GZ2zj1ZOM47Cg0dIU3pbbA2Ja71wx4TYXJCSFFRK7mzDtXYlY
BWOix/Dks8tk118cwuxnT+IiwmWDMbDZKnygh+4tiSyrs0IszeekRADLUu03C0Ve
Dhpljqf3zA==
=gs32
-----END PGP SIGNATURE-----
Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
|
||
|
|
3719a04a80 |
pci-v6.16-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmhAa9EUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vyA3w//aX8d73z/xVxkYLMN/6XQA5fdmd4d
Dv4n0Pjf0WCMKbsgRCdXEYLvcHV8VhH5iCR/b2UsFm9LjxSIRuqE5XosY3bNhrHn
xVKEh2prq2XZOibWrFkJ+RZ0FF7Ogq1Uy5gUBbBHbE1q1byZzrOALaF3FWGaDIZQ
6QLLAFtd3UtqOOUu8J8P9N15uFR8gunyfuM9U7TLMcy4B8txk6T6m/9xAWtRURuJ
I6WN8lO+g8Nl2mL9m27+wyWiVT3tKqoMwp8rVtym/L5JQOmHycYhn0WQAr2dPCMs
Xbgmoeei0je7mZvk5btpt68NAKQ3ZnCVkxbbINBkUxAjI0dbI6h37EhW18ShYVUk
CCo4fmaFtwP8qNN9tSvDN8vZdGB44fN5tIz4lmGzKk5gt+oV50RC/APrzC+PJBQ0
+2SdDVKj71Gr2H1VnI6uLB7oQ+tp7TOdhg+DGV4bdc6QFnsM+BpKWRq5f1UQcau/
XVDmorM/2t6z0DNktAv3NFwSodUjk1loWESr/pRBH1AqAWZTK98PWIg97XYsal59
zbJ3dLrnCqUNozeVgjtZo1LWD2FZaVTvhq2NY7D+QPpnMGhFUhHxNliZUXiQa1q4
boI2hEFdu3IQP/OC2a1zGJyMRLU43d5rhZ1U5xQSVtM0c3lgCY7rn/t26LymQVPA
SYdg2jBcnhe6gXo=
=eWJw
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Print the actual delay time in pci_bridge_wait_for_secondary_bus()
instead of assuming it was 1000ms (Wilfred Mallawa)
- Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
devices', which broke resume from system sleep on AMD platforms and
has been fixed by other commits (Lukas Wunner)
Resource management:
- Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
and unnecessary (Philipp Stanner)
- Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
and related flags since all uses have been removed (Philipp
Stanner)
- Rework devres 'request' functions so they are no longer 'hybrid',
i.e., their behavior no longer depends on whether
pcim_enable_device or pci_enable_device() was used, and remove
related code (Philipp Stanner)
- Warn (not BUG()) about failure to assign optional resources (Ilpo
Järvinen)
Error handling:
- Log the DPC Error Source ID only when it's actually valid (when
ERR_FATAL or ERR_NONFATAL was received from a downstream device)
and decode into bus/device/function (Bjorn Helgaas)
- Determine AER log level once and save it so all related messages
use the same level (Karolina Stolarek)
- Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
Errors (Karolina Stolarek)
- Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
controls on interval and burst count, to avoid flooding logs and
RCU stall warnings (Jon Pan-Doh)
Power management:
- Increment PM usage counter when probing reset methods so we don't
try to read config space of a powered-off device (Alex Williamson)
- Set all devices to D0 during enumeration to ensure ACPI opregion is
connected via _REG (Mario Limonciello)
Power control:
- Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
the filename paths. Retain old deprecated symbols for
compatibility, except for the pwrctrl slot driver
(PCI_PWRCTRL_SLOT) (Johan Hovold)
- When unregistering pwrctrl, cancel outstanding rescan work before
cleaning up data structures to avoid use-after-free issues (Brian
Norris)
Bandwidth control:
- Simplify link bandwidth controller by replacing the count of Link
Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
flag (Ilpo Järvinen)
- Update the Link Speed after retraining, since the Link Speed may
have changed (Ilpo Järvinen)
PCIe native device hotplug:
- Ignore Presence Detect Changed caused by DPC.
pciehp already ignores Link Down/Up events caused by DPC, but on
slots using in-band presence detect, DPC causes a spurious Presence
Detect Changed event (Lukas Wunner)
- Ignore Link Down/Up caused by Secondary Bus Reset.
On hotplug ports using in-band presence detect, the reset causes a
Presence Detect Changed event, which mistakenly caused teardown and
re-enumeration of the device. Drivers may need to annotate code
that resets their device (Lukas Wunner)
Virtualization:
- Add an ACS quirk for Loongson Root Ports that don't advertise ACS
but don't allow peer-to-peer transactions between Root Ports; the
quirk allows each Root Port to be in a separate IOMMU group (Huacai
Chen)
Endpoint framework:
- For fixed-size BARs, retain both the actual size and the possibly
larger size allocated to accommodate iATU alignment requirements
(Jerome Brunet)
- Simplify ctrl/SPAD space allocation and avoid allocating more space
than needed (Jerome Brunet)
- Correct MSI-X PBA offset calculations for DesignWare and Cadence
endpoint controllers (Niklas Cassel)
- Align the return value (number of interrupts) encoding for
pci_epc_get_msi()/pci_epc_ops::get_msi() and
pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)
- Align the nr_irqs parameter encoding for
pci_epc_set_msi()/pci_epc_ops::set_msi() and
pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)
Common host controller library:
- Convert pci-host-common to a library so platforms that don't need
native host controller drivers don't need to include these helper
functions (Manivannan Sadhasivam)
Apple PCIe controller driver:
- Extract ECAM bridge creation helper from pci_host_common_probe() to
separate driver-specific things like MSI from PCI things (Marc
Zyngier)
- Dynamically allocate RID-to_SID bitmap to prepare for SoCs with
varying capabilities (Marc Zyngier)
- Skip ports disabled in DT when setting up ports (Janne Grunau)
- Add t6020 compatible string (Alyssa Rosenzweig)
- Add T602x PCIe support (Hector Martin)
- Directly set/clear INTx mask bits because T602x dropped the
accessors that could do this without locking (Marc Zyngier)
- Move port PHY registers to their own reg items to accommodate
T602x, which moves them around; retain default offsets for existing
DTs that lack phy%d entries with the reg offsets (Hector Martin)
- Stop polling for core refclk, which doesn't work on T602x and the
bootloader has already done anyway (Hector Martin)
- Use gpiod_set_value_cansleep() when asserting PERST# in probe
because we're allowed to sleep there (Hector Martin)
Cadence PCIe controller driver:
- Drop a runtime PM 'put' to resolve a runtime atomic count underflow
(Hans Zhang)
- Make the cadence core buildable as a module (Kishon Vijay Abraham I)
- Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
loadable drivers when they are removed (Siddharth Vadapalli)
Freescale i.MX6 PCIe controller driver:
- Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
(Richard Zhu)
- Remove redundant dw_pcie_wait_for_link() from
imx_pcie_start_link(); since the DWC core does this, imx6 only
needs it when retraining for a faster link speed (Richard Zhu)
- Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)
- Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
some cases, the controller can't exit 'L23 Ready' through Beacon or
PERST# deassertion (Richard Zhu)
- Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
GT/s, causing timeouts in L1 (Richard Zhu)
- Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)
- Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)
Mobiveil PCIe controller driver:
- Return bool (not int) for link-up check in
mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
Zhang)
NVIDIA Tegra194 PCIe controller driver:
- Create debugfs directory for 'aspm_state_cnt' only when
CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
Zhang)
Qualcomm PCIe controller driver:
- Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
equalization presets (Krishna Chaitanya Chundru)
- Read Maximum Link Width from the Link Capabilities register if DT
lacks 'num-lanes' property (Krishna Chaitanya Chundru)
- Add Physical Layer 64 GT/s Capability ID and register offsets for
8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
Chundru)
- Add generic dwc support for configuring lane equalization presets
(Krishna Chaitanya Chundru)
- Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)
Renesas R-Car PCIe controller driver:
- Describe endpoint BAR 4 as being fixed size (Jerome Brunet)
- Document how to obtain R-Car V4H (r8a779g0) controller firmware
(Yoshihiro Shimoda)
Rockchip PCIe controller driver:
- Reorder rockchip_pci_core_rsts because
reset_control_bulk_deassert() deasserts in reverse order, to fix a
link training regression (Jensen Huang)
- Mark RK3399 as being capable of raising INTx interrupts (Niklas
Cassel)
Rockchip DesignWare PCIe controller driver:
- Check only PCIE_LINKUP, not LTSSM status, to determine whether the
link is up (Shawn Lin)
- Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
for Root Complex and Endpoint modes (Shawn Lin)
- Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
resets non-sticky registers (Shawn Lin)
- Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
(Diederik de Haas)
Synopsys DesignWare PCIe controller driver:
- Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
more robust; this will not affect the intended link width if all
lanes are functional (Wenbin Yao)
- Return bool (not int) for link-up check in dw_pcie_ops.link_up()
and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
tegra194, uniphier, visconti (Hans Zhang)
- Add debugfs support for exposing DWC device-specific PTM context
(Manivannan Sadhasivam)
TI J721E PCIe driver:
- Make j721e buildable as a loadable and removable module (Siddharth
Vadapalli)
- Fix j721e host/endpoint dependencies that result in link failures
in some configs (Arnd Bergmann)
Device tree bindings:
- Add qcom DT binding for 'global' interrupt (PCIe controller and
link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
sc7280, sc8180x sdm845, sm8150, sm8250, sm8350 (Manivannan
Sadhasivam)
- Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)
- Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)
- Correct indentation and style of examples in brcm,stb-pcie,
cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
(Krzysztof Kozlowski)
- Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
armada8k from text to schema DT bindings (Rob Herring)
- Remove obsolete .txt DT bindings for content that has been moved to
schemas (Rob Herring)
- Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
and IPQ9574 (Varadarajan Narayanan)
- Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)
- Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
PolarFire may be configured that way (Conor Dooley)
Miscellaneous:
- Drop 'pci' suffix from intel_mid_pci.c filename to match similar
files (Andy Shevchenko)
- All platforms with PCI have an MMU, so add PCI Kconfig dependency
on MMU to simplify build testing and avoid inadvertent build
regressions (Arnd Bergmann)
- Update Krzysztof Wilczyński's email address in MAINTAINERS
(Krzysztof Wilczyński)
- Update Manivannan Sadhasivam's email address in MAINTAINERS
(Manivannan Sadhasivam)"
* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
MAINTAINERS: Update Manivannan Sadhasivam email address
PCI: j721e: Fix host/endpoint dependencies
PCI: j721e: Add support to build as a loadable module
PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
PCI: cadence: Add support to build pcie-cadence library as a kernel module
MAINTAINERS: Update Krzysztof Wilczyński email address
PCI: Remove unnecessary linesplit in __pci_setup_bridge()
PCI: WARN (not BUG()) when we fail to assign optional resources
PCI: Remove unused pci_printk()
PCI: qcom: Replace PERST# sleep time with proper macro
PCI: dw-rockchip: Replace PERST# sleep time with proper macro
PCI: host-common: Convert to library for host controller drivers
PCI/ERR: Remove misleading TODO regarding kernel panic
PCI: cadence: Remove duplicate message code definitions
PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
PCI: cadence-ep: Correct PBA offset in .set_msix() callback
...
|
||
|
|
fd1f847350 |
- The 2 patch series "zram: support algorithm-specific parameters" from
Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - The 5 patch series "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - The 5 patch series "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - The 2 patch series "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - The 2 patch series "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - The 2 patch series "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - The 4 patch series "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDzA9wAKCRDdBJ7gKXxA js8sAP9V3COg+vzTmimzP3ocTkkbbIJzDfM6nXpE2EQ4BR3ejwD+NsIT2ZLtTF6O LqAZpgO7ju6wMjR/lM30ebCq5qFbZAw= =oruw -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ... |
||
|
|
dc75a0d93b |
zram: support deflate-specific params
Introduce support of algorithm specific parameters in algorithm_params device attribute. The expected format is algorithm.param=value. For starters, add support for deflate.winbits parameter. Link: https://lkml.kernel.org/r/20250514024825.1745489-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a5ade2e9fa |
zram: rename ZCOMP_PARAM_NO_LEVEL
Patch series "zram: support algorithm-specific parameters". This patchset adds support for algorithm-specific parameters. For now, only deflate-specific winbits can be configured, which fixes deflate support on some s390 setups. This patch (of 2): Use more generic name because this will be default "un-set" value for more params in the future. Link: https://lkml.kernel.org/r/20250514024825.1745489-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20250514024825.1745489-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
00c010e130 |
- The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox
simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - The 8 patch series "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - The 3 patch series "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - The 2 patch series "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - The 8 patch series "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - The 6 patch series "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - The 3 patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - The 2 patch series "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - The 3 patch series "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - The 3 patch series "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - The 12 patch series "Always call constructor for kernel page tables" from Kevin Brodsky "ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables". This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - The 9 patch series "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - The 3 patch series "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - The 4 patch series "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - The 6 patch series "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - The 3 patch series "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - The 3 patch series ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - The 7 patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - The 5 patch series "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - The 2 patch series "vmscan: enforce mems_effective during demotion" from Gregory Price "changes reclaim to respect cpuset.mems_effective during demotion when possible". because "presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated." "This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently." - The 2 patch series ""Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - The 3 patch series "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - The 4 patch series "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - The 17 patch series "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - The 7 patch series "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - The 2 patch series "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - The 2 patch series "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - The 4 patch series "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - The 6 patch series "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - The 2 patch series "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - The 8 patch series "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - The 3 patch series "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - The 4 patch series "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - The 6 patch series "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is "yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents." - The 7 patch series "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - The 4 patch series "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDt5qgAKCRDdBJ7gKXxA ju6XAP9nTiSfRz8Cz1n5LJZpFKEGzLpSihCYyR6P3o1L9oe3mwEAlZ5+XAwk2I5x Qqb/UGMEpilyre1PayQqOnct3aSL9Ao= =tYYm -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ... |
||
|
|
ab03a61c66 |
ublk: have a per-io daemon instead of a per-queue daemon
Currently, ublk_drv associates to each hardware queue (hctx) a unique task (called the queue's ubq_daemon) which is allowed to issue COMMIT_AND_FETCH commands against the hctx. If any other task attempts to do so, the command fails immediately with EINVAL. When considered together with the block layer architecture, the result is that for each CPU C on the system, there is a unique ublk server thread which is allowed to handle I/O submitted on CPU C. This can lead to suboptimal performance under imbalanced load generation. For an extreme example, suppose all the load is generated on CPUs mapping to a single ublk server thread. Then that thread may be fully utilized and become the bottleneck in the system, while other ublk server threads are totally idle. This issue can also be addressed directly in the ublk server without kernel support by having threads dequeue I/Os and pass them around to ensure even load. But this solution requires inter-thread communication at least twice for each I/O (submission and completion), which is generally a bad pattern for performance. The problem gets even worse with zero copy, as more inter-thread communication would be required to have the buffer register/unregister calls to come from the correct thread. Therefore, address this issue in ublk_drv by allowing each I/O to have its own daemon task. Two I/Os in the same queue are now allowed to be serviced by different daemon tasks - this was not possible before. Imbalanced load can then be balanced across all ublk server threads by having the ublk server threads issue FETCH_REQs in a round-robin manner. As a small toy example, consider a system with a single ublk device having 2 queues, each of depth 4. A ublk server having 4 threads could issue its FETCH_REQs against this device as follows (where each entry is the qid,tag pair that the FETCH_REQ targets): ublk server thread: T0 T1 T2 T3 0,0 0,1 0,2 0,3 1,3 1,0 1,1 1,2 This setup allows for load that is concentrated on one hctx/ublk_queue to be spread out across all ublk server threads, alleviating the issue described above. Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers can use to essentially test for the presence of this change and tailor their behavior accordingly. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-1-e9d3b119336a@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
39d86db34e |
loop: add file_start_write() and file_end_write()
file_start_write() and file_end_write() should be added around ->write_iter().
Recently we switch to ->write_iter() from vfs_iter_write(), and the
implied file_start_write() and file_end_write() are lost.
Also we never add them for dio code path, so add them back for covering
both.
Cc: Jeff Moyer <jmoyer@redhat.com>
Fixes: f2fed441c69b ("loop: stop using vfs_iter_{read,write} for buffered I/O")
Fixes: bc07c10a3603 ("block: loop: support DIO & AIO")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250527153405.837216-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
6f59de9bc0 |
for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
Y6NDJw5+kQ==
=1PNK
-----END PGP SIGNATURE-----
Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
||
|
|
b465ae7b25 |
ublk: add feature UBLK_F_QUIESCE
Add feature UBLK_F_QUIESCE, which adds control command `UBLK_U_CMD_QUIESCE_DEV` for quiescing device, then device state can become `UBLK_S_DEV_QUIESCED` or `UBLK_S_DEV_FAIL_IO` finally from ublk_ch_release() with ublk server cooperation. This feature can help to support to upgrade ublk server application by shutting down ublk server gracefully, meantime keep ublk block device persistent during the upgrading period. The feature is only available for UBLK_F_USER_RECOVERY. Suggested-by: Yoav Cohen <yoav@nvidia.com> Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250522163523.406289-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
a11a722298 |
block-6.15-20250522
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgve4UQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplcrD/9S58Q2NGNppHdxOb0kv5uNQFBnyKTGaAOV yo/XT5lshRhc9C9cZyMEjtSVwfLqEYGyaqrX/4i85t1jnJ4ih1x97B4cZEhx0wJE Tfcyzt8ugspaiN2j+eIouft9C5tcDUw6RWjMjJUX5zSH5e+pbxz3/Ibpk2PZjKZO iAKa4APHyPnaRiWnjgEkT5GJKjI1Fhi1G4ORTXP1VoI03K6bRKmsbLDoMaIZsBe3 uoBO0RV2NJqIfLk6J5xB1mhBt1prQrXd6f4SpZj0CPyUrnqqH9zankOKoi8vnM2N L/xNmdfLFevdk2QfeVSQrQlLilMlvXp8KJu2fuexCR78sqySoXpnb9Mjh58AWlCD czs8ltYIzHci/rTR7p6oBqF8DznUc5tLwEX7wI2rbIJ5CgxnxhnIZYDTyd7xct92 4pIefgU4YHTCpCY3JFnOggh6dlP9YlzCvjB61SvZPl7OrY2+E+ly7Xdz1Uo5Url/ gQaE3z0iajLefiPn0sTJgoI93A8H20ctGGsS9p0LyjWskmd9iCPG1eKbOSCXJ5Zw 1EJEeM0giXgWZ8Ks2h9NaODuoSkrNldWQmJxcitR0jprfmKZj4eh4qfgVlzdPzWH dn4j51I8jgUrjlehGPA3zRrGBqxSQyGMR9RtqCNjCXEJMiJNT9M1TiSB/Y6p4D8f NcH4/SJVLQ== =IqsP -----END PGP SIGNATURE----- Merge tag 'block-6.15-20250522' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for a regression with setting up loop on a file system without ->write_iter() - Fix for an nvme sysfs regression * tag 'block-6.15-20250522' of git://git.kernel.dk/linux: nvme: avoid creating multipath sysfs group under namespace path devices loop: don't require ->write_iter for writable files in loop_configure |
||
|
|
914e0dc508 |
ublk: run auto buf unregisgering in same io_ring_ctx with registering
UBLK_F_AUTO_BUF_REG requires that the buffer registered automatically
is unregistered in same `io_ring_ctx`, so check it explicitly.
Document this requirement for UBLK_F_AUTO_BUF_REG.
Drop WARN_ON_ONCE() which is triggered from userspace code path.
Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522152043.399824-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
5234f2c3e3 |
ublk: remove io argument from ublk_auto_buf_reg_fallback()
The argument has been unused since the function was added, so remove it. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250521160720.1893326-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
9172dbf3a6 |
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
If ublk_set_auto_buf_reg() fails, we need to unlock and return,
otherwise `ub->mutex` is leaked.
Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250521025502.71041-2-ming.lei@redhat.com
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
53f427e794 |
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
For UBLK_F_AUTO_BUF_REG, buffer is registered to uring_cmd context automatically with the provided buffer index. User may provide one wrong buffer index, or the specified buffer is registered by application already. Add UBLK_AUTO_BUF_REG_FALLBACK for supporting to auto buffer registering fallback by completing the uring_cmd and telling ublk server the register failure via UBLK_AUTO_BUF_REG_FALLBACK, then ublk server still can register the buffer from userspace. So we can provide reliable way for supporting auto buffer register. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
99c1e4eb6a |
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically to local io_uring context with provided buffer index. Add UAPI structure `struct ublk_auto_buf_reg` for holding user parameter to register request buffer automatically, one 'flags' field is defined, and there is still 32bit available for future extension, such as, adding one io_ring FD field for registering buffer to external io_uring. `struct ublk_auto_buf_reg` is populated from ublk uring_cmd's sqe->addr, and all existing ublk commands are data-less, so it is just fine to reuse sqe->addr for this purpose. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
9e6b4756b3 |
ublk: prepare for supporting to register request buffer automatically
UBLK_F_SUPPORT_ZERO_COPY requires ublk server to issue explicit buffer register/unregister uring_cmd for each IO, this way is not only inefficient, but also introduce dependency between buffer consumer and buffer register/ unregister uring_cmd, please see tools/testing/selftests/ublk/stripe.c in which backing file IO has to be issued one by one by IOSQE_IO_LINK. Prepare for adding feature UBLK_F_AUTO_BUF_REG for addressing the existing zero copy limitation: - register request buffer automatically to ublk uring_cmd's io_uring context before delivering io command to ublk server - unregister request buffer automatically from the ublk uring_cmd's io_uring context when completing the request - io_uring will unregister the buffer automatically when uring is exiting, so we needn't worry about accident exit For using this feature, ublk server has to create one sparse buffer table Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
b1c3b4695a |
ublk: convert to refcount_t
Convert to refcount_t and prepare for supporting to register bvec buffer automatically, which needs to initialize reference counter as 2, and kref doesn't provide this interface, so convert to refcount_t. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250520045455.515691-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
355341e435 |
loop: don't require ->write_iter for writable files in loop_configure
Block devices can be opened read-write even if they can't be written to
for historic reasons. Remove the check requiring file->f_op->write_iter
when the block devices was opened in loop_configure. The call to
loop_check_backing_file just below ensures the ->write_iter is present
for backing files opened for writing, which is the only check that is
actually needed.
Fixes: f5c84eff634b ("loop: Add sanity check for read/write_iter")
Reported-by: Christian Hesse <mail@eworm.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250520135420.1177312-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
6462c247b2 |
block-6.15-20250515
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgmVlYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpg3FD/9KaCQn3VOI6NBMSgNP2FRjGhHbyza6kjN8
Q1g0wVTEESt5jBDoNuahA1aTevhBT9qDNG3d7PPwg0OEsE/WW0/RBMe16W9VlDz5
ph3S+nNLaBXQ6aZgH/qyH8eM+CEdoV2FhCUaikuQ25mft6BeonYGTEHVmdnfAg0i
er7XnBl/lYNSzdzy1bEbGb0N7Lheasa386Z9oMevCtqgObG/XIFvc2otxxx1QmJ9
iVcJJSnHf1Y9oqYGjebmqRoYnp9d3KEeAA1lMamlshVyX3DcZcXqMe0xEBZflGoG
+nLUQ9Qjk38Azp+FOqQh/D4Bs/jXzThSsizzTwgBmHA5Pu5dCC0ZyPSrSX8Aahd6
Yf4awgwFBH7vAvIbf8GMDloNxa6IAuZRzDig/fgqF2thSJJBghA7HNl92oz3BvoY
KsnpdxPa8EtrTc0n5WNDj9/0m1msDfzlgLUBgSTM4N0fPlEkDYzLnBSi7d5jq8K4
3lLqmZ/YhBeyzx37pTHLzE3rax/gZ6yLxyxliHrN/F9wSZRbc/PpMu1TdaSwElyk
F5VyLMOMHA4PPpqkNdn1TgPd2GQ/uqDkNA0oO7wstBQ5rBuHN4RmItpQBq3lov+Z
JyntB5th//2IFEVWAso6Ct67hqUAC8+JNmkTMpRzLaKgX4qr2ZBeGJIx445dhh+z
C99L/zA3XA==
=s90l
-----END PGP SIGNATURE-----
Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fixes for atomic writes (Alan Adamson)
- fixes for polled CQs in nvmet-epf (Damien Le Moal)
- fix for polled CQs in nvme-pci (Keith Busch)
- fix compile on odd configs that need to be forced to inline
(Kees Cook)
- one more quirk (Ilya Guterman)
- Fix for missing allocation of an integrity buffer for some cases
- Fix for a regression with ublk command cancelation
* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
ublk: fix dead loop when canceling io command
nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
nvme: all namespaces in a subsystem must adhere to a common atomic write size
nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
nvmet: pci-epf: improve debug message
nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
nvmet: pci-epf: do not fall back to using INTX if not supported
nvmet: pci-epf: clear completion queue IRQ flag on delete
nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
nvme-pci: make nvme_pci_npages_prp() __always_inline
block: always allocate integrity buffer when required
|
||
|
|
dd24f87f65 |
ublk: fix dead loop when canceling io command
Commit:
f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and
ublk_cancel_cmd")
adds a request state check in ublk_cancel_cmd(), and if the request is
started, skips canceling this uring_cmd.
However, the current uring_cmd may be in ACTIVE state, without block
request coming to the uring command. Meantime, if the cached request in
tag_set.tags[tag] has been delivered to ublk server and reycycled, then
this uring_cmd can't be canceled.
ublk requests are aborted in ublk char device release handler, which
depends on canceling all ACTIVE uring_cmd. So it causes a dead loop.
Fix this issue by not taking a stale request into account when canceling
uring_cmd in ublk_cancel_cmd().
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/
Fixes: f40139fde527 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250515162601.77346-1-ming.lei@redhat.com
[axboe: rewording of commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
bbcacab2e8 |
brd: avoid extra xarray lookups on first write
The xarray can return the previous entry at a location. Use this fact to simplify the brd code when there is no existing page at a location. This also slighly improves the handling of racy discards as we now always have a page under RCU protection by the time we are ready to copy the data. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250507060700.3929430-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
cf42d4cccf |
zram: modernize writeback interface
The writeback interface supports a page_index=N parameter which performs
writeback of the given page. Since we rarely need to writeback just one
single page, the typical use case involves a number of writeback calls,
each performing writeback of one page:
echo page_index=100 > zram0/writeback
...
echo page_index=200 > zram0/writeback
echo page_index=500 > zram0/writeback
...
echo page_index=700 > zram0/writeback
One obvious downside of this is that it increases the number of syscalls.
Less obvious, but a significantly more important downside, is that when
given only one page to post-process zram cannot perform an optimal target
selection. This becomes a critical limitation when writeback_limit is
enabled, because under writeback_limit we want to guarantee the highest
memory savings hence we first need to writeback pages that release the
highest amount of zsmalloc pool memory.
This patch adds page_indexes=LOW-HIGH parameter to the writeback
interface:
echo page_indexes=100-200 page_indexes=500-700 > zram0/writeback
This gives zram a chance to apply an optimal target selection strategy on
each iteration of the writeback loop.
We also now permit multiple page_index parameters per call (previously
zram would recognize only one page_index) and a mix or single pages and
page ranges:
echo page_index=42 page_index=99 page_indexes=100-200 \
page_indexes=500-700 > zram0/writeback
Apart from that the patch also unifies parameters passing and resembles
other "modern" zram device attributes (e.g. recompression), while the old
interface used a mixed scheme: values-less parameters for mode and a
key=value format for page_index. We still support the "old" value-less
format for compatibility reasons.
[senozhatsky@chromium.org: simplify parse_page_index() range checks, per Brian]
nk: https://lkml.kernel.org/r/20250404015327.2427684-1-senozhatsky@chromium.org
[sozhatsky@chromium.org: fix uninitialized variable in zram_writeback_slots(), per Dan]
nk: https://lkml.kernel.org/r/20250409112611.1154282-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250327015818.4148660-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
56e5a103a7 |
zsmalloc: prefer the the original page's node for compressed data
Currently, zsmalloc, zswap's and zram's backend memory allocator, does not enforce any policy for the allocation of memory for the compressed data, instead just adopting the memory policy of the task entering reclaim, or the default policy (prefer local node) if no such policy is specified. This can lead to several pathological behaviors in multi-node NUMA systems: 1. Systems with CXL-based memory tiering can encounter the following inversion with zswap/zram: the coldest pages demoted to the CXL tier can return to the high tier when they are reclaimed to compressed swap, creating memory pressure on the high tier. 2. Consider a direct reclaimer scanning nodes in order of allocation preference. If it ventures into remote nodes, the memory it compresses there should stay there. Trying to shift those contents over to the reclaiming thread's preferred node further *increases* its local pressure, and provoking more spills. The remote node is also the most likely to refault this data again. This undesirable behavior was pointed out by Johannes Weiner in [1]. 3. For zswap writeback, the zswap entries are organized in node-specific LRUs, based on the node placement of the original pages, allowing for targeted zswap writeback for specific nodes. However, the compressed data of a zswap entry can be placed on a different node from the LRU it is placed on. This means that reclaim targeted at one node might not free up memory used for zswap entries in that node, but instead reclaiming memory in a different node. All of these issues will be resolved if the compressed data go to the same node as the original page. This patch encourages this behavior by having zswap and zram pass the node of the original page to zsmalloc, and have zsmalloc prefer the specified node if we need to allocate new (zs)pages for the compressed data. Note that we are not strictly binding the allocation to the preferred node. We still allow the allocation to fall back to other nodes when the preferred node is full, or if we have zspages with slots available on a different node. This is OK, and still a strict improvement over the status quo: 1. On a system with demotion enabled, we will generally prefer demotions over compressed swapping, and only swap when pages have already gone to the lowest tier. This patch should achieve the desired effect for the most part. 2. If the preferred node is out of memory, letting the compressed data going to other nodes can be better than the alternative (OOMs, keeping cold memory unreclaimed, disk swapping, etc.). 3. If the allocation go to a separate node because we have a zspage with slots available, at least we're not creating extra immediate memory pressure (since the space is already allocated). 3. While there can be mixings, we generally reclaim pages in same-node batches, which encourage zspage grouping that is more likely to go to the right node. 4. A strict binding would require partitioning zsmalloc by node, which is more complicated, and more prone to regression, since it reduces the storage density of zsmalloc. We need to evaluate the tradeoff and benchmark carefully before adopting such an involved solution. [1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/ [senozhatsky@chromium.org: coding-style fixes] Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Gregory Price <gourry@gourry.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org> [zram, zsmalloc] Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zswap/zsmalloc] Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cc9f0629ca |
block-6.15-20250509
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgeCdgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprW3EACoZmPgLRnt19f/u+veGKnJknGMkxgyph44
MryIe6kmJPnZ8CUAq+RloHtObAsMDbNmvcfwvp4Aeg1he7D6yNgCTHyamGRhtwR0
ydGS/eawYUibp5757TIegj26jZOEwWKkVaytX60N0rnRLLkr6IFXgA9JBhxpkF95
YYDW5+6nr7Z95ZNprSpvA5Q59tbv+HAGoKX2mmmj6sWwXCwx9eHEK0//bnOe6HzA
3mYGpsFOIgOZ/DFj33jscSygsuyQRevh/i4TxuNZjDm+AD9TZZwAbspxKPVml+pT
aVQ6oRQoGzcPDaoMTTUAwxrMmMaVI3oDFibTDofldDjRbvPI96wR4Cs1orCBHrsE
5xOynjyp6Yrr+RBCbA0i6vIdhb5P/oDsXQ4/qxLgdz46aWwXl7kxd6JhdjDXrmXm
3k9Kou/GHNGbB2qhYU+qASRAWj+hieEEgS31ppCu44q0hh8pbynlUCnl3Di0hHsJ
CIgjzOtdLb9sCJm68mpZGCkx71B2RX2TTZRhtCUHZNuAr2vqynPcza2+gtTIYJkN
uxtYeguKwKzWTJkB5l4hvscH8GVAKY4hFpILT7CjAMs6Rn/i9mCtnuWlDdGbxQdb
JQjxbKM5C/XNdcMG/3oDWaTrg1beFPSk87N9dCNx2A87zLxtLi71OpIScPFdlP9y
XN6QgJbh2g==
=sdd1
-----END PGP SIGNATURE-----
Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- Fix for a regression in this series for loop and read/write iterator
handling
- zone append block update tweak
- remove a broken IO priority test
- NVMe pull request via Christoph:
- unblock ctrl state transition for firmware update (Daniel
Wagner)
* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
block: remove test of incorrect io priority level
nvme: unblock ctrl state transition for firmware update
block: only update request sector if needed
loop: Add sanity check for read/write_iter
|
||
|
|
a216081323 |
rnbd-srv: use bio_add_virt_nofail
Use the bio_add_virt_nofail to add a single kernel virtual address to a bio as that can't fail. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
af78428ed3 |
block: remove the q argument from blk_rq_map_kern
Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
a26a339a65 |
brd: fix discard end sector
brd_do_discard() just aligned start sector to page, this can only work
if the discard size if at least one page. For example:
blkdiscard /dev/ram0 -o 5120 -l 1024
In this case, size = (1024 - (8192 - 5120)), which is a huge value.
Fix the problem by round_down() the end sector.
Fixes: 9ead7efc6f3f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
d4099f8893 |
brd: fix aligned_sector from brd_do_discard()
The calculation is just wrong, fix it by round_up().
Fixes: 9ead7efc6f3f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
0e8acffc1b |
brd: protect page with rcu
Currently, after fetching the page by xa_load() in IO path, there is no
protection and page can be freed concurrently by discard:
cpu0
brd_submit_bio
brd_do_bvec
page = brd_lookup_page
cpu1
brd_submit_bio
brd_do_discard
page = __xa_erase()
__free_page()
// page UAF
Fix the problem by protecting page with rcu.
Meanwhile, if page is already freed, also prevent BUG_ON() by skipping
the write, and user will get zero data later if there is no page.
Fixes: 9ead7efc6f3f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
||
|
|
e96ee7e1de |
ublk: consolidate UBLK_IO_FLAG_OWNED_BY_SRV checks
Every ublk I/O command except UBLK_IO_FETCH_REQ checks that the ublk_io has UBLK_IO_FLAG_OWNED_BY_SRV set. Consolidate the separate checks into a single one in __ublk_ch_uring_cmd(), analogous to those for UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_NEED_GET_DATA. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505172624.1121839-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
f5c84eff63 |
loop: Add sanity check for read/write_iter
Some file systems do not support read_iter/write_iter, such as selinuxfs
in this issue.
So before calling them, first confirm that the interface is supported and
then call it.
It is releavant in that vfs_iter_read/write have the check, and removal
of their used caused szybot to be able to hit this issue.
Fixes: f2fed441c69b ("loop: stop using vfs_iter__{read,write} for buffered I/O")
Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|