mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
synced 2026-01-11 17:10:13 +00:00
25175 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
66cd8e9cb8 |
slab fixes for 6.18-rc3
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmj7Wb0bFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiaXgoH/RvZHUHzvACGosCOb2uc YVKIkxmPd3Feww+0oyQox+/BrBdri1eZ93Vfsl/udX7ZT5NBwC/pWT40SJUvyO+M fICYdoM/f+Qkknja7FSVuQXho252JPZYZ37erFeB3J+lardHyPOMMjWX2YnRQZvW EUwyKUH31DJnVa2IN9R1tT64ulY79KGmeMiSdQxaQc8SSM5idZ7/sQHjz4UcbOh1 kLsh3EQOgvVegzefLiTgKkcwT8Lry92X3bjeOrdvTqInDZj1aZWMDlQSFb31+XbS 1aeCrl3PpnD8EeEo7MIVmDzEAM4c2z4tYzpOVDa8rglkL7IRLwpK7t4DzK+zZhqq e0Q= =Y2Sq -----END PGP SIGNATURE----- Merge tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Two fixes for race conditions in obj_exts allocation (Hao Ge) - Fix for slab accounting imbalance due to deferred slab decativation (Vlastimil Babka) * tag 'slab-for-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: Fix obj_ext mistakenly considered NULL due to race condition slab: fix slab accounting imbalance due to defer_deactivate_slab() slab: Avoid race on slab->obj_exts in alloc_slab_obj_exts |
||
|
|
7f434e1d9a |
slab: Fix obj_ext mistakenly considered NULL due to race condition
If two competing threads enter alloc_slab_obj_exts(), and the one that
allocates the vector wins the cmpxchg(), the other thread that failed
allocation mistakenly assumes that slab->obj_exts is still empty due to
its own allocation failure. This will then trigger warnings with
CONFIG_MEM_ALLOC_PROFILING_DEBUG checks in the subsequent free path.
Therefore, let's check the result of cmpxchg() to see if marking the
allocation as failed was successful. If it wasn't, check whether the
winning side has succeeded its allocation (it might have been also
marking it as failed) and if yes, return success.
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Link: https://patch.msgid.link/20251023143313.1327968-1-hao.ge@linux.dev
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
eecd7cb641 |
slab: fix slab accounting imbalance due to defer_deactivate_slab()
Since commit af92793e52c3 ("slab: Introduce kmalloc_nolock() and
kfree_nolock().") there's a possibility in alloc_single_from_new_slab()
that we discard the newly allocated slab if we can't spin and we fail to
trylock. As a result we don't perform inc_slabs_node() later in the
function. Instead we perform a deferred deactivate_slab() which can
either put the unacounted slab on partial list, or discard it
immediately while performing dec_slabs_node(). Either way will cause an
accounting imbalance.
Fix this by not marking the slab as frozen, and using free_slab()
instead of deactivate_slab() for non-frozen slabs in
free_deferred_objects(). For CONFIG_SLUB_TINY, that's the only possible
case. By not using discard_slab() we avoid dec_slabs_node().
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Link: https://patch.msgid.link/20251023-fix-slab-accounting-v2-1-0e62d50986ea@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
0f3ad9c610 |
17 hotfixes. 12 are cc:stable and 14 are for MM.
There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaPkz/QAKCRDdBJ7gKXxA js6eAQCdnA10LouzzVdqA+HuYh206z8qE2KGsGKpUGDfJv40uAEA4ZbxYrMJmwhU MXFn7Czphh/NOfFFCnrDnOlAFH7MmQc= =hc/P -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: csky: abiv2: adapt to new folio flags field mm/damon/core: use damos_commit_quota_goal() for new goal commit mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme hugetlbfs: move lock assertions after early returns in huge_pmd_unshare() vmw_balloon: indicate success when effectively deflating during migration mm/damon/core: fix list_add_tail() call on damon_call() mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap mm: prevent poison consumption when splitting THP ocfs2: clear extent cache after moving/defragmenting extents mm: don't spin in add_stack_record when gfp flags don't allow dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC mm/damon/sysfs: dealloc commit test ctx always mm/damon/sysfs: catch commit test ctx alloc failure hung_task: fix warnings caused by unaligned lock pointers |
||
|
|
7eca961dd7 |
mm/damon/core: use damos_commit_quota_goal() for new goal commit
When damos_commit_quota_goals() is called for adding new DAMOS quota goals of DAMOS_QUOTA_USER_INPUT metric, current_value fields of the new goals should be also set as requested. However, damos_commit_quota_goals() is not updating the field for the case, since it is setting only metrics and target values using damos_new_quota_goal(), and metric-optional union fields using damos_commit_quota_goal_union(). As a result, users could see the first current_value parameter that committed online with a new quota goal is ignored. Users are assumed to commit the current_value for DAMOS_QUOTA_USER_INPUT quota goals, since it is being used as a feedback. Hence the real impact would be subtle. That said, this is obviously not intended behavior. Fix the issue by using damos_commit_quota_goal() which sets all quota goal parameters, instead of damos_commit_quota_goal_union(), which sets only the union fields. Link: https://lkml.kernel.org/r/20251014001846.279282-1-sj@kernel.org Fixes: 1aef9df0ee90 ("mm/damon/core: commit damos_quota_goal->nid") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7071537159 |
mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme
Currently, damon_destroy_scheme() only cleans up the filter list but leaves ops_filter untouched, which could lead to memory leaks when a scheme is destroyed. This patch ensures both filter and ops_filter are properly freed in damon_destroy_scheme(), preventing potential memory leaks. Link: https://lkml.kernel.org/r/20251014084225.313313-1-lienze@kylinos.cn Fixes: ab82e57981d0 ("mm/damon/core: introduce damos->ops_filters") Signed-off-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cec944dd32 |
hugetlbfs: move lock assertions after early returns in huge_pmd_unshare()
When hugetlb_vmdelete_list() processes VMAs during truncate operations, it
may encounter VMAs where huge_pmd_unshare() is called without the required
shareable lock. This triggers an assertion failure in
hugetlb_vma_assert_locked().
The previous fix in commit dd83609b8898 ("hugetlbfs: skip VMAs without
shareable locks in hugetlb_vmdelete_list") skipped entire VMAs without
shareable locks to avoid the assertion. However, this prevented pages
from being unmapped and freed, causing a regression in
fallocate(PUNCH_HOLE) operations where pages were not freed immediately,
as reported by Mark Brown.
Instead of checking locks in the caller or skipping VMAs, move the lock
assertions in huge_pmd_unshare() to after the early return checks. The
assertions are only needed when actual PMD unsharing work will be
performed. If the function returns early because sz != PMD_SIZE or the
PMD is not shared, no locks are required and assertions should not fire.
This approach reverts the VMA skipping logic from commit dd83609b8898
("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
while moving the assertions to avoid the assertion failure, keeping all
the logic within huge_pmd_unshare() itself and allowing page unmapping and
freeing to proceed for all VMAs.
Link: https://lkml.kernel.org/r/20251014113344.21194-1-kartikey406@gmail.com
Fixes: dd83609b8898 ("hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list")
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://syzkaller.appspot.com/bug?extid=f26d7c75c26ec19790e7
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: <syzbot+f26d7c75c26ec19790e7@syzkaller.appspotmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
c3fa5b1bfd |
mm/damon/core: fix list_add_tail() call on damon_call()
Each damon_ctx maintains callback requests using a linked list
(damon_ctx->call_controls). When a new callback request is received via
damon_call(), the new request should be added to the list. However, the
function is making a mistake at list_add_tail() invocation: putting the
new item to add and the list head to add it before, in the opposite order.
Because of the linked list manipulation implementation, the new request
can still be reached from the context's list head. But the list items
that were added before the new request are dropped from the list.
As a result, the callbacks are unexpectedly not invocated. Worse yet, if
the dropped callback requests were dynamically allocated, the memory is
leaked. Actually DAMON sysfs interface is using a dynamically allocated
repeat-mode callback request for automatic essential stats update. And
because the online DAMON parameters commit is using a non-repeat-mode
callback request, the issue can easily be reproduced, like below.
# damo start --damos_action stat --refresh_stat 1s
# damo tune --damos_action stat --refresh_stat 1s
The first command dynamically allocates the repeat-mode callback request
for automatic essential stat update. Users can see the essential stats
are automatically updated for every second, using the sysfs interface.
The second command calls damon_commit() with a new callback request that
was made for the commit. As a result, the previously added repeat-mode
callback request is dropped from the list. The automatic stats refresh
stops working, and the memory for the repeat-mode callback request is
leaked. It can be confirmed using kmemleak.
Fix the mistake on the list_add_tail() call.
Link: https://lkml.kernel.org/r/20251014205939.1206-1-sj@kernel.org
Fixes: 004ded6bee11 ("mm/damon: accept parallel damon_call() requests")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0e59f47c15 |
mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap
Commit b714ccb02a76 ("mm/mremap: complete refactor of move_vma()")
mistakenly introduced a new behaviour - clearing the VM_ACCOUNT flag of
the old mapping when a mapping is mremap()'d with the MREMAP_DONTUNMAP
flag set.
While we always clear the VM_LOCKED and VM_LOCKONFAULT flags for the old
mapping (the page tables have been moved, so there is no data that could
possibly be locked in memory), there is no reason to touch any other VMA
flags.
This is because after the move the old mapping is in a state as if it were
freshly mapped. This implies that the attributes of the mapping ought to
remain the same, including whether or not the mapping is accounted.
Link: https://lkml.kernel.org/r/20251013165836.273113-1-lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Fixes: b714ccb02a76 ("mm/mremap: complete refactor of move_vma()")
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
6ed8bfd24c |
slab: Avoid race on slab->obj_exts in alloc_slab_obj_exts
If two competing threads enter alloc_slab_obj_exts() and one of them
fails to allocate the object extension vector, it might override the
valid slab->obj_exts allocated by the other thread with
OBJEXTS_ALLOC_FAIL. This will cause the thread that lost this race and
expects a valid pointer to dereference a NULL pointer later on.
Update slab->obj_exts atomically using cmpxchg() to avoid
slab->obj_exts overrides by racing threads.
Thanks for Vlastimil and Suren's help with debugging.
Fixes: f7381b911640 ("slab: mark slab->obj_exts allocation failures unconditionally")
Cc: <stable@vger.kernel.org>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://patch.msgid.link/20251021010353.1187193-1-hao.ge@linux.dev
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
86f54f9b6c |
slab: reset slab->obj_ext when freeing and it is OBJEXTS_ALLOC_FAIL
If obj_exts allocation failed, slab->obj_exts is set to OBJEXTS_ALLOC_FAIL,
But we do not clear it when freeing the slab. Since OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS currently share the same bit position, during the
release of the associated folio, a VM_BUG_ON_FOLIO() check in
folio_memcg_kmem() is triggered because the OBJEXTS_ALLOC_FAIL flag was
not cleared, causing it to be interpreted as a kmem folio (non-slab)
with MEMCG_OBJEXTS_DATA flag set, which is invalid because
MEMCG_OBJEXTS_DATA is supposed to be set only on slabs.
Another problem that predates sharing the OBJEXTS_ALLOC_FAIL and
MEMCG_DATA_OBJEXTS bits is that on configurations with
is_check_pages_enabled(), the non-cleared bit in page->memcg_data will
trigger a free_page_is_bad() failure "page still charged to cgroup"
When freeing a slab, we clear slab->obj_exts if the obj_ext array has
been successfully allocated. So let's clear it also when the allocation
has failed.
Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Fixes: 7612833192d5 ("slab: Reuse first bit for OBJEXTS_ALLOC_FAIL")
Link: https://lore.kernel.org/all/20251015141642.700170-1-hao.ge@linux.dev/
Cc: <stable@vger.kernel.org>
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
841a8bfcba |
mm: prevent poison consumption when splitting THP
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered
by an in-userspace #MC necessitates splitting the THP. The splitting
process employs a mechanism, implemented in
try_to_map_unused_to_zeropage(), which reads the pages in the THP to
identify zero-filled pages. However, reading the pages in the THP results
in a second in-kernel #MC, occurring before the initial memory_failure()
completes, ultimately leading to a kernel panic. See the kernel panic
call trace on the two #MCs.
First Machine Check occurs // [1]
memory_failure() // [2]
try_to_split_thp_page()
split_huge_page()
split_huge_page_to_list_to_order()
__folio_split() // [3]
remap_page()
remove_migration_ptes()
remove_migration_pte()
try_to_map_unused_to_zeropage() // [4]
memchr_inv() // [5]
Second Machine Check occurs // [6]
Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused THP to zeropage.
[5] Re-access pages in the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a panic kernel.
In Step[2], memory_failure() sets the poisoned flag on the page in the THP
by TestSetPageHWPoison() before calling try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing to the
poisoned page in the THP during zeropage identification, while continuing
to scan unaffected pages in the THP for possible zeropage mapping. This
prevents a second in-kernel #MC that would cause kernel panic in Step[4].
Thanks to Andrew Zaborowski for his initial work on fixing this issue.
Link: https://lkml.kernel.org/r/20251015064926.1887643-1-qiuxu.zhuo@intel.com
Link: https://lkml.kernel.org/r/20251011075520.320862-1-qiuxu.zhuo@intel.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
c83aab85e1 |
mm: don't spin in add_stack_record when gfp flags don't allow
syzbot was able to find the following path: add_stack_record_to_list mm/page_owner.c:182 [inline] inc_stack_record_count mm/page_owner.c:214 [inline] __set_page_owner+0x2c3/0x4a0 mm/page_owner.c:333 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851 prep_new_page mm/page_alloc.c:1859 [inline] get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858 alloc_pages_nolock_noprof+0x94/0x120 mm/page_alloc.c:7554 Don't spin in add_stack_record_to_list() when it is called from *_nolock() context. Link: https://lkml.kernel.org/r/CAADnVQK_8bNYEA7TJYgwTYR57=TTFagsvRxp62pFzS_z129eTg@mail.gmail.com Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reported-by: syzbot+8259e1d0e3ae8ed0c490@syzkaller.appspotmail.com Reported-by: syzbot+665739f456b28f32b23d@syzkaller.appspotmail.com Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
139e7a572a |
mm/damon/sysfs: dealloc commit test ctx always
The damon_ctx for testing online DAMON parameters commit inputs is deallocated only when the test fails. This means memory is leaked for every successful online DAMON parameters commit. Fix the leak by always deallocating it. Link: https://lkml.kernel.org/r/20251003201455.41448-3-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f0c5118ebb |
mm/damon/sysfs: catch commit test ctx alloc failure
Patch series "mm/damon/sysfs: fix commit test damon_ctx [de]allocation". DAMON sysfs interface dynamically allocates and uses a damon_ctx object for testing if given inputs for online DAMON parameters update is valid. The object is being used without an allocation failure check, and leaked when the test succeeds. Fix the two bugs. This patch (of 2): The damon_ctx for testing online DAMON parameters commit inputs is used without its allocation failure check. This could result in an invalid memory access. Fix it by directly returning an error when the allocation failed. Link: https://lkml.kernel.org/r/20251003201455.41448-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251003201455.41448-2-sj@kernel.org Fixes: 4c9ea539ad59 ("mm/damon/sysfs: validate user inputs from damon_sysfs_commit_input()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
df90f6cd29 |
slab: fix clearing freelist in free_deferred_objects()
defer_free() links pending objects using the slab's freelist offset
which is fine as they are not free yet. free_deferred_objects() then
clears this pointer to avoid confusing the debugging consistency checks
that may be enabled for the cache.
However, with CONFIG_SLAB_FREELIST_HARDENED, even the NULL pointer needs
to be encoded appropriately using set_freepointer(), otherwise it's
decoded as something else and triggers the consistency checks, as found
by the kernel test robot.
Use set_freepointer() to prevent the issue.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202510101652.7921fdc6-lkp@intel.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
6bb71f0fe5 |
slab hotfix for 6.18-rc1
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmjqGjcbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiaREIH/1qSp0ucBkrwjPKP0YZq 7BIR4Lx/+o77emeV/vX2qhJnsXELEn7uRlb2vb8tbebobqgQc3SisQMqtJ4Uz2Er /ymYNRUKmkT4frQZwD81TjCP2rSoy1Zlyegs8Cy4UuAb4ixfWItqJA9BEBrDSkJ2 kpLWcQX9SPMAszq8s0dXiAIvOuNoHIQyVExyOkzLiUkwtbUbsG1gl6sy80W9pZ8+ VktCL+LMczcVYtdUt2e8DCZWFpihwwA8hFKLxv9icfOlkDrhHRrhK7rL+DnLVpgF Lt4XdQGouxo3MgwcCOxITlKx7BidJcfwVJ6tiElURm4L7Df+A+ESCzu7OVAyxvEU 5AY= =7fxN -----END PGP SIGNATURE----- Merge tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: "A NULL pointer deref hotfix" * tag 'slab-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: fix barn NULL pointer dereference on memoryless nodes |
||
|
|
ae13bd2310 |
Just one series here - Mike Rappoport has taught KEXEC handover to
preserve vmalloc allocations across handover. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaOmDWAAKCRDdBJ7gKXxA jh+MAQDUPBj3mFm238CXI5DC1gJ3ETe3NJjJvfzIjLs51c+dFgD+PUuvDA0GUtKH LCl6T+HJXh2FgGn1F2Kl/0hwPtEvuA4= =HYr7 -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more updates from Andrew Morton: "Just one series here - Mike Rappoport has taught KEXEC handover to preserve vmalloc allocations across handover" * tag 'mm-nonmm-stable-2025-10-10-15-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/test_kho: use kho_preserve_vmalloc instead of storing addresses in fdt kho: add support for preserving vmalloc allocations kho: replace kho_preserve_phys() with kho_preserve_pages() kho: check if kho is finalized in __kho_preserve_order() MAINTAINERS, .mailmap: update Umang's email address |
||
|
|
971370a88c |
7 hotfixes. All 7 are cc:stable and all 7 are for MM.
All singletons, please see the changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaOmCTwAKCRDdBJ7gKXxA jtQZAQC9sRd+4LNYothoXlY9avKNYR4YvN3ogiIFJqHwiENu/QD/ec/57KME9dA4 H4SqK/49Rs/tVCYmkPTO7IWRmxo9/AA= =h9On -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. All 7 are cc:stable and all 7 are for MM. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-10-15-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: hugetlb: avoid soft lockup when mprotect to large memory area fsnotify: pass correct offset to fsnotify_mmap_perm() mm/ksm: fix flag-dropping behavior in ksm_madvise mm/damon/vaddr: do not repeat pte_offset_map_lock() until success mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage mm/thp: fix MTE tag mismatch when replacing zero-filled subpages memcg: skip cgroup_file_notify if spinning is not allowed |
||
|
|
fd6db58867 |
slab: fix barn NULL pointer dereference on memoryless nodes
Phil reported a boot failure once sheaves become used in commits
59faa4da7cd4 ("maple_tree: use percpu sheaves for maple_node_cache") and
3accabda4da1 ("mm, vma: use percpu sheaves for vm_area_struct cache"):
BUG: kernel NULL pointer dereference, address: 0000000000000040
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 21 UID: 0 PID: 818 Comm: kworker/u398:0 Not tainted 6.17.0-rc3.slab+ #5 PREEMPT(voluntary)
Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.26.0 07/30/2025
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Call Trace:
<TASK>
? srso_return_thunk+0x5/0x5f
? vm_area_alloc+0x1e/0x60
kmem_cache_alloc_noprof+0x4ec/0x5b0
vm_area_alloc+0x1e/0x60
create_init_stack_vma+0x26/0x210
alloc_bprm+0x139/0x200
kernel_execve+0x4a/0x140
call_usermodehelper_exec_async+0xd0/0x190
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork+0xf0/0x110
? __pfx_call_usermodehelper_exec_async+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in:
CR2: 0000000000000040
---[ end trace 0000000000000000 ]---
RIP: 0010:__pcs_replace_empty_main+0x44/0x1d0
Code: ec 08 48 8b 46 10 48 8b 76 08 48 85 c0 74 0b 8b 48 18 85 c9 0f 85 e5 00 00 00 65 48 63 05 e4 ee 50 02 49 8b 84 c6 e0 00 00 00 <4c> 8b 68 40 4c 89 ef e8 b0 81 ff ff 48 89 c5 48 85 c0 74 1d 48 89
RSP: 0018:ffffd2d10950bdb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a775dab74b0 RCX: 00000000ffffffff
RDX: 0000000000000cc0 RSI: ffff8a6800804000 RDI: ffff8a680004e300
RBP: ffffd2d10950be40 R08: 0000000000000060 R09: ffffffffb9367388
R10: 00000000000149e8 R11: ffff8a6f87a38000 R12: 0000000000000cc0
R13: 0000000000000cc0 R14: ffff8a680004e300 R15: 00000000000000c0
FS: 0000000000000000(0000) GS:ffff8a77a3541000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000040 CR3: 0000000e1aa24000 CR4: 00000000003506f0
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0x36a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: Fatal exception ]---
And noted "this is an AMD EPYC 7401 with 8 NUMA nodes configured such
that memory is only on 2 of them."
# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 8 16 24 32 40 48 56 64 72 80 88
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 2 10 18 26 34 42 50 58 66 74 82 90
node 1 size: 31584 MB
node 1 free: 30397 MB
node 2 cpus: 4 12 20 28 36 44 52 60 68 76 84 92
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 6 14 22 30 38 46 54 62 70 78 86 94
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 1 9 17 25 33 41 49 57 65 73 81 89
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus: 3 11 19 27 35 43 51 59 67 75 83 91
node 5 size: 32214 MB
node 5 free: 31625 MB
node 6 cpus: 5 13 21 29 37 45 53 61 69 77 85 93
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 7 15 23 31 39 47 55 63 71 79 87 95
node 7 size: 0 MB
node 7 free: 0 MB
Linus decoded the stacktrace to get_barn() and get_node() and determined
that kmem_cache->node[numa_mem_id()] is NULL.
The problem is due to a wrong assumption that memoryless nodes only
exist on systems with CONFIG_HAVE_MEMORYLESS_NODES, where numa_mem_id()
points to the nearest node that has memory. SLUB has been allocating its
kmem_cache_node structures only on nodes with memory and so it does with
struct node_barn.
For kmem_cache_node, get_partial_node() checks if get_node() result is
not NULL, which I assumed was for protection from a bogus node id passed
to kmalloc_node() but apparently it's also for systems where
numa_mem_id() (used when no specific node is given) might return a
memoryless node.
Fix the sheaves code the same way by checking the result of get_node()
and bailing out if it's NULL. Note that cpus on such memoryless nodes
will have degraded sheaves performance, which can be improved later,
preferably by making numa_mem_id() work properly on such systems.
Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves")
Reported-and-tested-by: Phil Auld <pauld@redhat.com>
Closes: https://lore.kernel.org/all/20251010151116.GA436967@pauld.westford.csb/
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-%3Dwg1xK%2BBr%3DFJ5QipVhzCvq7uQVPt5Prze6HDhQQ%3DQD_BcQ@mail.gmail.com/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
f6db358dea |
slab fixes for 6.18-rc1
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmjngFEbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYia2DMH+wdy8KaYLVstuYYI3AuX 7lilQCqEw7yKNrU2pS/yCOGRg2a+X5qFkMSRji9S1htALuyu8VS2X2MZrqgE0Q1y oUKtZZuEaX80SPd2XhwpmsvCIo4V+bLtokHl2SsMD9tEV1AxKnO8UO5zMvxeqz8a 8XEfiGHU4oAm678hLhnQovA7akgn37uKhzFglevStwRVYELhxv5RNEoH6QsVKjr4 H7oTXQoYL4T+H84VDPXd8c26g3rhJ7fP1t0AthsZypOMoiiHDW08J03Qw9UQT+Hc 8NqgTgltczEvm0WExQ4xnCq6QZbwQLaPoulGFxP1iE4Ii0HLqU9SDkhv8i4fymBI Fbk= =1Zqd -----END PGP SIGNATURE----- Merge tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Fixes for several corner cases in error paths and debugging options, related to the new kmalloc_nolock() functionality (Kuniyuki Iwashima, Ran Xiaokai) * tag 'slab-for-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: Don't call lockdep_unregister_key() for immature kmem_cache. slab: Fix using this_cpu_ptr() in preemptible context slab: Add allow_spin check to eliminate kmemleak warnings |
||
|
|
f52ce0ea90 |
mm: hugetlb: avoid soft lockup when mprotect to large memory area
When calling mprotect() to a large hugetlb memory area in our customer's workload (~300GB hugetlb memory), soft lockup was observed: watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [t2_new_sysv:126916] CPU: 98 PID: 126916 Comm: t2_new_sysv Kdump: loaded Not tainted 6.17-rc7 Hardware name: GIGACOMPUTING R2A3-T40-AAV1/Jefferson CIO, BIOS 5.4.4.1 07/15/2025 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mte_clear_page_tags+0x14/0x24 lr : mte_sync_tags+0x1c0/0x240 sp : ffff80003150bb80 x29: ffff80003150bb80 x28: ffff00739e9705a8 x27: 0000ffd2d6a00000 x26: 0000ff8e4bc00000 x25: 00e80046cde00f45 x24: 0000000000022458 x23: 0000000000000000 x22: 0000000000000004 x21: 000000011b380000 x20: ffff000000000000 x19: 000000011b379f40 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc875e0aa5e2c x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffc01ce7a5c00 x4 : 00000000046cde00 x3 : fffffc0000000000 x2 : 0000000000000004 x1 : 0000000000000040 x0 : ffff0046cde7c000 Call trace: mte_clear_page_tags+0x14/0x24 set_huge_pte_at+0x25c/0x280 hugetlb_change_protection+0x220/0x430 change_protection+0x5c/0x8c mprotect_fixup+0x10c/0x294 do_mprotect_pkey.constprop.0+0x2e0/0x3d4 __arm64_sys_mprotect+0x24/0x44 invoke_syscall+0x50/0x160 el0_svc_common+0x48/0x144 do_el0_svc+0x30/0xe0 el0_svc+0x30/0xf0 el0t_64_sync_handler+0xc4/0x148 el0t_64_sync+0x1a4/0x1a8 Soft lockup is not triggered with THP or base page because there is cond_resched() called for each PMD size. Although the soft lockup was triggered by MTE, it should be not MTE specific. The other processing which takes long time in the loop may trigger soft lockup too. So add cond_resched() for hugetlb to avoid soft lockup. Link: https://lkml.kernel.org/r/20250929202402.1663290-1-yang@os.amperecomputing.com Fixes: 8f860591ffb2 ("[PATCH] Enable mprotect on huge pages") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
28bba2c293 |
fsnotify: pass correct offset to fsnotify_mmap_perm()
fsnotify_mmap_perm() requires a byte offset for the file about to be mmap'ed. But it is called from vm_mmap_pgoff(), which has a page offset. Previously the conversion was done incorrectly so let's fix it, being careful not to overflow on 32-bit platforms. Discovered during code review. Link: https://lkml.kernel.org/r/20251003155238.2147410-1-ryan.roberts@arm.com Fixes: 066e053fe208 ("fsnotify: add pre-content hooks on mmap()") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Cc: Amir Goldstein <amir73il@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b93af2cc8e |
mm/damon/vaddr: do not repeat pte_offset_map_lock() until success
DAMON's virtual address space operation set implementation (vaddr) calls pte_offset_map_lock() inside the page table walk callback function. This is for reading and writing page table accessed bits. If pte_offset_map_lock() fails, it retries by returning the page table walk callback function with ACTION_AGAIN. pte_offset_map_lock() can continuously fail if the target is a pmd migration entry, though. Hence it could cause an infinite page table walk if the migration cannot be done until the page table walk is finished. This indeed caused a soft lockup when CPU hotplugging and DAMON were running in parallel. Avoid the infinite loop by simply not retrying the page table walk. DAMON is promising only a best-effort accuracy, so missing access to such pages is no problem. Link: https://lkml.kernel.org/r/20250930004410.55228-1-sj@kernel.org Fixes: 7780d04046a2 ("mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Xinyu Zheng <zhengxinyu6@huawei.com> Closes: https://lore.kernel.org/20250918030029.2652607-1-zhengxinyu6@huawei.com Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9658d698a8 |
mm/rmap: fix soft-dirty and uffd-wp bit loss when remapping zero-filled mTHP subpage to shared zeropage
When splitting an mTHP and replacing a zero-filled subpage with the shared zeropage, try_to_map_unused_to_zeropage() currently drops several important PTE bits. For userspace tools like CRIU, which rely on the soft-dirty mechanism for incremental snapshots, losing the soft-dirty bit means modified pages are missed, leading to inconsistent memory state after restore. As pointed out by David, the more critical uffd-wp bit is also dropped. This breaks the userfaultfd write-protection mechanism, causing writes to be silently missed by monitoring applications, which can lead to data corruption. Preserve both the soft-dirty and uffd-wp bits from the old PTE when creating the new zeropage mapping to ensure they are correctly tracked. Link: https://lkml.kernel.org/r/20250930081040.80926-1-lance.yang@linux.dev Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1ce6473d17 |
mm/thp: fix MTE tag mismatch when replacing zero-filled subpages
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace. Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer. KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging. As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage. Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical. [1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com Link: https://lkml.kernel.org/r/20250922021458.68123-1-lance.yang@linux.dev Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com> Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@mediatek.com Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: andrew.yang <andrew.yang@mediatek.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Charlie Jenkins <charlie@rivosinc.com> Cc: Chinwen Chang <chinwen.chang@mediatek.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fcc0669c5a |
memcg: skip cgroup_file_notify if spinning is not allowed
Generally memcg charging is allowed from all the contexts including NMI where even spinning on spinlock can cause locking issues. However one call chain was missed during the addition of memcg charging from any context support. That is try_charge_memcg() -> memcg_memory_event() -> cgroup_file_notify(). The possible function call tree under cgroup_file_notify() can acquire many different spin locks in spinning mode. Some of them are cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's just skip cgroup_file_notify() from memcg charging if the context does not allow spinning. Alternative approach was also explored where instead of skipping cgroup_file_notify(), we defer the memcg event processing to irq_work [1]. However it adds complexity and it was decided to keep things simple until we need more memcg events with !allow_spinning requirement. Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1] Link: https://lkml.kernel.org/r/20250922220203.261714-1-shakeel.butt@linux.dev Fixes: 3ac4638a734a ("memcg: make memcg_rstat_updated nmi safe") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Closes: https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/ Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peilin Ye <yepeilin@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8375b76517 |
kho: replace kho_preserve_phys() with kho_preserve_pages()
to make it clear that KHO operates on pages rather than on a random physical address. The kho_preserve_pages() will be also used in upcoming support for vmalloc preservation. Link: https://lkml.kernel.org/r/20250921054458.4043761-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
16d1ba7c96 |
dma-mapping fixes for Linux 6.18:
- two small fixes for the recently performed code refactoring (Shigeru Yoshida) and missing handling of direction parameter in DMA debug code (Petr Tesarik) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaOTg+QAKCRCJp1EFxbsS RDFkAQCxV0khAeYDuiPdub+9d5XGVGTjBxG1ErYsvDbbsZ3QpAEAhheuAdbMBYU1 kOwmGuBUY32d0cMz0/4BfKbIzuQs9wE= =aE76 -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two small fixes for the recently performed code refactoring (Shigeru Yoshida) and missing handling of direction parameter in DMA debug code (Petr Tesarik)" * tag 'dma-mapping-6.18-2025-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: fix direction in dma_alloc direction traces kmsan: fix kmsan_handle_dma() to avoid false positives |
||
|
|
f7dfa0f31b |
slub: Don't call lockdep_unregister_key() for immature kmem_cache.
syzbot reported the lockdep splat below in __kmem_cache_release(). [0] The problem is that __kmem_cache_release() could be called from do_kmem_cache_create() before init_kmem_cache_cpus() registers the lockdep key. Let's perform lockdep_unregister_key() only when init_kmem_cache_cpus() has been done, which we can determine by checking s->cpu_slab [0]: WARNING: CPU: 1 PID: 6128 at kernel/locking/lockdep.c:6606 lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606 Modules linked in: CPU: 1 UID: 0 PID: 6128 Comm: syz.4.21 Not tainted syzkaller #0 PREEMPT_{RT,(full)} Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:lockdep_unregister_key+0x2ca/0x310 kernel/locking/lockdep.c:6606 Code: 50 e4 0f 48 3b 44 24 10 0f 84 26 fe ff ff e8 bd cd 17 09 e8 e8 ce 17 09 41 f7 c7 00 02 00 00 74 bd fb 40 84 ed 75 bc eb cd 90 <0f> 0b 90 e9 19 ff ff ff 90 0f 0b 90 e9 2a ff ff ff 48 c7 c7 d0 ac RSP: 0018:ffffc90003e870d0 EFLAGS: 00010002 RAX: eb1525397f5bdf00 RBX: ffff88803c121148 RCX: 1ffff920007d0dfc RDX: 0000000000000000 RSI: ffffffff8acb1500 RDI: ffffffff8b1dd0e0 RBP: 00000000ffffffea R08: ffffffff8eb5aa37 R09: 1ffffffff1d6b546 R10: dffffc0000000000 R11: fffffbfff1d6b547 R12: 0000000000000000 R13: ffff88814d1b8900 R14: 0000000000000000 R15: 0000000000000203 FS: 00007f773f75e6c0(0000) GS:ffff88812712f000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffdaea3af52 CR3: 000000003a5ca000 CR4: 00000000003526f0 Call Trace: <TASK> __kmem_cache_release+0xe3/0x1e0 mm/slub.c:7696 do_kmem_cache_create+0x74e/0x790 mm/slub.c:8575 create_cache mm/slab_common.c:242 [inline] __kmem_cache_create_args+0x1ce/0x330 mm/slab_common.c:340 nfsd_file_cache_init+0x1d6/0x530 fs/nfsd/filecache.c:816 nfsd_startup_generic fs/nfsd/nfssvc.c:282 [inline] nfsd_startup_net fs/nfsd/nfssvc.c:377 [inline] nfsd_svc+0x393/0x900 fs/nfsd/nfssvc.c:786 nfsd_nl_threads_set_doit+0x84a/0x960 fs/nfsd/nfsctl.c:1639 genl_family_rcv_msg_doit+0x212/0x300 net/netlink/genetlink.c:1115 genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline] genl_rcv_msg+0x60e/0x790 net/netlink/genetlink.c:1210 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x846/0xa10 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:742 ____sys_sendmsg+0x508/0x820 net/socket.c:2630 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2684 __sys_sendmsg net/socket.c:2716 [inline] __do_sys_sendmsg net/socket.c:2721 [inline] __se_sys_sendmsg net/socket.c:2719 [inline] __x64_sys_sendmsg+0x1a1/0x260 net/socket.c:2719 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f77400eeec9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f773f75e038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f7740345fa0 RCX: 00007f77400eeec9 RDX: 0000000000008004 RSI: 0000200000000180 RDI: 0000000000000006 RBP: 00007f7740171f91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f7740346038 R14: 00007f7740345fa0 R15: 00007ffce616f8d8 </TASK> [alexei.starovoitov@gmail.com: simplify the fix] Link: https://lore.kernel.org/all/20251007052534.2776661-1-kuniyu@google.com/ Fixes: 83382af9ddc3 ("slab: Make slub local_(try)lock more precise for LOCKDEP") Reported-by: syzbot+a6f4d69b9b23404bbabf@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e4a3d1.a00a0220.298cc0.0471.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
83d59d81b2 |
slab: Fix using this_cpu_ptr() in preemptible context
defer_free() maybe called in preemptible context, this will trigger the
below warning message:
BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
caller is defer_free+0x1b/0x60
Call Trace:
<TASK>
dump_stack_lvl+0xac/0xc0
check_preemption_disabled+0xbe/0xe0
defer_free+0x1b/0x60
kfree_nolock+0x1eb/0x2b0
alloc_slab_obj_exts+0x356/0x390
__alloc_tagging_slab_alloc_hook+0xa0/0x300
__kmalloc_cache_noprof+0x1c4/0x5c0
__set_page_owner+0x10d/0x1c0
post_alloc_hook+0x84/0xf0
get_page_from_freelist+0x73b/0x1380
__alloc_frozen_pages_noprof+0x110/0x2c0
alloc_pages_mpol+0x44/0x140
alloc_slab_page+0xac/0x150
allocate_slab+0x78/0x3a0
___slab_alloc+0x76b/0xed0
__slab_alloc.constprop.0+0x5a/0xb0
__kmalloc_noprof+0x3dc/0x6d0
__list_lru_init+0x6c/0x210
alloc_super+0x3b6/0x470
sget_fc+0x5f/0x3a0
get_tree_nodev+0x27/0x90
vfs_get_tree+0x26/0xc0
vfs_kern_mount.part.0+0xb6/0x140
kern_mount+0x24/0x40
init_pipe_fs+0x4f/0x70
do_one_initcall+0x62/0x2e0
kernel_init_freeable+0x25b/0x4b0
kernel_init+0x1a/0x1c0
ret_from_fork+0x290/0x2e0
ret_from_fork_asm+0x11/0x20
</TASK>
Disable preemption in defer_free() and also defer_deactivate_slab() to
make it safe.
[vbabka@suse.cz: disable preemption instead of using raw_cpu_ptr() per
the discussion ]
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Link: https://lore.kernel.org/r/20250930083402.782927-1-ranxiaokai627@163.com
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
7a405dbb0f |
Only two patch series in this pull request:
- The 3 patch series "mm/memory_hotplug: fixup crash during uevent
handling" from Hannes Reinecke which fixes a race which was causing udev
to trigger a crash in the memory hotplug code.
- The 2 patch series "mm_slot: following fixup for usage of
mm_slot_entry()" from Wei Yang adds some touchups to the just-merged
mm_slot changes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaOBhBQAKCRDdBJ7gKXxA
jteZAQDpU3imBB416Bj5/RxtcafTaT7yCEmi7yHajtM+NZW9dAEAnu72pCIXpDgB
p6HytsUndzGJqUmbHQA2cZGdWPfyFAs=
=ZNaf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-10-03-16-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
"Only two patch series in this pull request:
- "mm/memory_hotplug: fixup crash during uevent handling" from Hannes
Reinecke fixes a race that was causing udev to trigger a crash in
the memory hotplug code
- "mm_slot: following fixup for usage of mm_slot_entry()" from Wei
Yang adds some touchups to the just-merged mm_slot changes"
* tag 'mm-stable-2025-10-03-16-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/khugepaged: use KMEM_CACHE()
mm/ksm: cleanup mm_slot_entry() invocation
Documentation/mm: drop pxx_mkdevmap() descriptions from page table helpers
mm: clean up is_guard_pte_marker()
drivers/base: move memory_block_add_nid() into the caller
mm/memory_hotplug: activate node before adding new memory blocks
drivers/base/memory: add node id parameter to add_memory_block()
|
||
|
|
b41048485e |
mm/mm_init: simplify deferred initialization of struct pages
Refactor and simplify deferred initialization of the memory map. Beside the negative diffstat it gives 3ms (55ms vs 58ms) reduction in the initialization of deferred pages on single node system with 64GiB of RAM. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmjg5awQHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kcgsB/0fnrzF18rUvPWPeCMm3D8taDl74U8qTOp7 RcxaFa+EQZJZmYp3QkBGs/LPYPAuOn1uUlJny9dPqk3XSg//yxSdTH35++mvCL3s gTOGEsckQPGzavD30VX9XJ0BXNwVNbpfZcH6Ga9dGY25cGG24Zi9Uln6qx8f6fu8 /qV92aQaNL27mLVjbb4h/c2yNbmQHdwFa42OObTPLqNPp5O8UXK3/Bk1gZ/3jDdX DPWq5P56S0QichAmcFnPfEUjUzCRSzJdDJsoGQuD2G4Ves/Z2BH7Avik1LxM5kQw dVl47icFi0esBEe41WQQu18u/2eRDqw9Urz3MZ3v/aGiCh7Oq3NQ =GDe0 -----END PGP SIGNATURE----- Merge tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull mm-init update from Mike Rapoport: "Simplify deferred initialization of struct pages Refactor and simplify deferred initialization of the memory map. Beside the negative diffstat it gives 3ms (55ms vs 58ms) reduction in the initialization of deferred pages on single node system with 64GiB of RAM" * tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: drop for_each_free_mem_pfn_range_in_zone_from() mm/mm_init: drop deferred_init_maxorder() mm/mm_init: deferred_init_memmap: use a job per zone mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone() |
||
|
|
a498d59c46 |
dma-mapping updates for Linux 6.18:
- refactoring of DMA mapping API to physical addresses as the primary interface instead of page+offset parameters; this gets much closer to Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and is supporting memdesc project which seeks to substantially transform how struct page works; an advantage of this approach is the possibility of introducing DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the common paths, what in turn lets to use recently introduced dma_iova_link() API to map PCI P2P MMIO without creating struct page; developped by Leon Romanovsky and Jason Gunthorpe - minor clean-up by Petr Tesarik and Qianfeng Rong -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSrngzkoBtlA8uaaJ+Jp1EFxbsSRAUCaNugqAAKCRCJp1EFxbsS RBvDAQCEd4P6pz6ROQHf5BfiF5J1db2H6bWsFLjajx3KfNWf8gD+P0eQ0hTzLrcd zuSKZTivviOiyjXlt/9GOaXXPnmTwA0= =b0nZ -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - Refactoring of DMA mapping API to physical addresses as the primary interface instead of page+offset parameters This gets much closer to Matthew Wilcox's long term wish for struct-pageless IO to cacheable DRAM and is supporting memdesc project which seeks to substantially transform how struct page works. An advantage of this approach is the possibility of introducing DMA_ATTR_MMIO, which covers existing 'dma_map_resource' flow in the common paths, what in turn lets to use recently introduced dma_iova_link() API to map PCI P2P MMIO without creating struct page Developped by Leon Romanovsky and Jason Gunthorpe - Minor clean-up by Petr Tesarik and Qianfeng Rong * tag 'dma-mapping-6.18-2025-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: kmsan: fix missed kmsan_handle_dma() signature conversion mm/hmm: properly take MMIO path mm/hmm: migrate to physical address-based DMA mapping API dma-mapping: export new dma_*map_phys() interface xen: swiotlb: Open code map_resource callback dma-mapping: implement DMA_ATTR_MMIO for dma_(un)map_page_attrs() kmsan: convert kmsan_handle_dma to use physical addresses dma-mapping: convert dma_direct_*map_page to be phys_addr_t based iommu/dma: implement DMA_ATTR_MMIO for iommu_dma_(un)map_phys() iommu/dma: rename iommu_dma_*map_page to iommu_dma_*map_phys dma-mapping: rename trace_dma_*map_page to trace_dma_*map_phys dma-debug: refactor to use physical addresses for page mapping iommu/dma: implement DMA_ATTR_MMIO for dma_iova_link(). dma-mapping: introduce new DMA attribute to indicate MMIO memory swiotlb: Remove redundant __GFP_NOWARN dma-direct: clean up the logic in __dma_direct_alloc_pages() |
||
|
|
c14bdcc9f2 |
mm/khugepaged: use KMEM_CACHE()
We got some late review commits during review of commit b4c9ffb54b32
("mm/khugepaged: remove definition of struct khugepaged_mm_slot"). No
need to keep the old cache name "khugepaged_mm_slot", let's simply use
KMEM_CACHE().
Link: https://lkml.kernel.org/r/20251001091900.20041-3-richard.weiyang@gmail.com
Fixes: b4c9ffb54b32 ("mm/khugepaged: remove definition of struct khugepaged_mm_slot")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: SeongJae Park <sj@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Kiryl Shutsemau <kas@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
df6879a748 |
mm/ksm: cleanup mm_slot_entry() invocation
Patch series "mm_slot: following fixup for usage of mm_slot_entry()", v2.
We got some late review commits during review of "mm_slot: fix the usage of
mm_slot_entry()" in [1].
This patch (of 2):
We got some late review commits during review of commit 08498be43ee6
("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL"). Let's
reduce the indentation level and make the code easier to follow by using
gotos to a new label.
Link: https://lkml.kernel.org/r/20251001091900.20041-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20251001091900.20041-2-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20250927004539.19308-1-richard.weiyang@gmail.com [1]
Fixes: 08498be43ee6 ("mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Kiryl Shutsemau <kirill@shutemov.name>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
89be2815f4 |
mm: clean up is_guard_pte_marker()
Let's simplify the implementation. The current code is redundant as it effectively expands to: is_swap_pte(pte) && is_pte_marker_entry(...) && // from is_pte_marker() is_pte_marker_entry(...) // from is_guard_swp_entry() While a modern compiler could likely optimize this away, let's have clean code and not rely on it. Link: https://lkml.kernel.org/r/20250924045830.3817-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b8179af120 |
mm/memory_hotplug: activate node before adding new memory blocks
The sysfs attributes for memory blocks require the node ID to be set and initialized, so move the node activation before adding new memory blocks. This also has the nice side effect that the BUG_ON() can be converted into a WARN_ON() as we now can handle registration errors. Link: https://lkml.kernel.org/r/20250729064637.51662-3-hare@kernel.org Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
070a542f08 |
NFS Client Updates for Linux 6.18
New Features:
* Add a Kconfig option to redirect dfprintk() to the trace buffer
* Enable use of the RWF_DONTCACHE flag on the NFS client
* Add striped layout handling to pNFS flexfiles
* Add proper localio handling for READ and WRITE O_DIRECT
Bugfixes:
* Handle NFS4ERR_GRACE errors during delegation recall
* Fix NFSv4.1 backchannel max_resp_sz verification check
* Fix mount hang after CREATE_SESSION failure
* Fix d_parent->d_inode locking in nfs4_setup_readdir()
Other Cleanups and Improvements:
* Improvements to write handling tracepoints
* Fix a few trivial spelling mistakes
* Cleanups to the rpcbind cleanup call sites
* Convert the SUNRPC xdr_buf to use a scratch folio instead of scratch page
* Remove unused NFS_WBACK_BUSY() macro
* Remove __GFP_NOWARN flags
* Unexport rpc_malloc() and rpc_free()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmjgOEkACgkQ18tUv7Cl
QOuX7RAA33AUq4NBxzDOgz4u4eNU/a8z2AazRgAtfmPVLTitrx/kqcfVEtHAdHFi
cWkN2SO+TVzxIGOrudqNyjV2cjfUJV4ZJBkNY6lJvxPNAH27Dk9P2iMF12QYtHOq
qyrwqoUQcyBkmtpgFUyHzydA4J17JDl5A7I/tOkro3ZfV4gmYAUwVdS+VtJoosLp
7FnXv+W5FBWkfKrIT+vPyiBqxl0gZXmzUkJK2lG9m9NvE2Jk2MbPFyhdUEA5JybJ
akNLdBnFwNWw2rLulSqs68ZbCGz6NY634q1Z+ZsRJ907ZdBqJ7zIBFv4yc/bMpZm
Q9kh1M0OyvK0MlLRFe3efOLxRoN9nJPd+kuaw9eYw5V57Jrwj6QGV4nud2C8nzs8
iB+LuJli+FRCeD84SY8NnMFKpXphHCeMXcBMRMsLTOSotJZFithO95+w1pKlK64A
lxY1JXOQYelwJZxfhGPovwac4t1arpDjsumRlTmq12KaQnM3Z1gR2PUgeLxEPHQM
f6gEiN9KDOhW/gZrFQxNs2hVAH68RDKpWxeR2XeVJlJYf37Hgh8bNGEiURi3G57n
ED7tFbK9lzHVFR07hiP3Cvzop4z2mzadgHo+1vzdXmZZQA4gc4MSFfszFLCnQopw
LEb7RqpVVXtQb+f7A+LuD+a2rLEnW+gTf6iLqCR8hAB5k1xmcYQ=
=8wnU
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Add a Kconfig option to redirect dfprintk() to the trace buffer
- Enable use of the RWF_DONTCACHE flag on the NFS client
- Add striped layout handling to pNFS flexfiles
- Add proper localio handling for READ and WRITE O_DIRECT
Bugfixes:
- Handle NFS4ERR_GRACE errors during delegation recall
- Fix NFSv4.1 backchannel max_resp_sz verification check
- Fix mount hang after CREATE_SESSION failure
- Fix d_parent->d_inode locking in nfs4_setup_readdir()
Other Cleanups and Improvements:
- Improvements to write handling tracepoints
- Fix a few trivial spelling mistakes
- Cleanups to the rpcbind cleanup call sites
- Convert the SUNRPC xdr_buf to use a scratch folio instead of
scratch page
- Remove unused NFS_WBACK_BUSY() macro
- Remove __GFP_NOWARN flags
- Unexport rpc_malloc() and rpc_free()"
* tag 'nfs-for-6.18-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (46 commits)
NFS: add basic STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
nfs/localio: add tracepoints for misaligned DIO READ and WRITE support
nfs/localio: add proper O_DIRECT support for READ and WRITE
nfs/localio: refactor iocb initialization
nfs/localio: refactor iocb and iov_iter_bvec initialization
nfs/localio: avoid issuing misaligned IO using O_DIRECT
nfs/localio: make trace_nfs_local_open_fh more useful
NFSD: filecache: add STATX_DIOALIGN and STATX_DIO_READ_ALIGN support
sunrpc: unexport rpc_malloc() and rpc_free()
NFSv4/flexfiles: Add support for striped layouts
NFSv4/flexfiles: Update layout stats & error paths for striped layouts
NFSv4/flexfiles: Write path updates for striped layouts
NFSv4/flexfiles: Commit path updates for striped layouts
NFSv4/flexfiles: Read path updates for striped layouts
NFSv4/flexfiles: Update low level helper functions to be DS stripe aware.
NFSv4/flexfiles: Add data structure support for striped layouts
NFSv4/flexfiles: Use ds_commit_idx when marking a write commit
NFSv4/flexfiles: Remove cred local variable dependency
nfs4_setup_readdir(): insufficient locking for ->d_parent->d_inode dereferencing
NFS: Enable use of the RWF_DONTCACHE flag on the NFS client
...
|
||
|
|
6238729bfc |
fuse update for 6.18
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCaNuYpQAKCRDh3BK/laaZ PBF6APoDWzheST2Gw3LPkU03xF6Yn+ZMP0p3L676yYBb07LrWQD8CBsfOkMei6W/ L+/GaVAbpIsuCQ6GTPE/6HTLWevU2Q0= =kR5v -----END PGP SIGNATURE----- Merge tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Extend copy_file_range interface to be fully 64bit capable (Miklos) - Add selftest for fusectl (Chen Linxuan) - Move fuse docs into a separate directory (Bagas Sanjaya) - Allow fuse to enter freezable state in some cases (Sergey Senozhatsky) - Clean up writeback accounting after removing tmp page copies (Joanne) - Optimize virtiofs request handling (Li RongQing) - Add synchronous FUSE_INIT support (Miklos) - Allow server to request prune of unused inodes (Miklos) - Fix deadlock with AIO/sync release (Darrick) - Add some prep patches for block/iomap support (Darrick) - Misc fixes and cleanups * tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits) fuse: move CREATE_TRACE_POINTS to a separate file fuse: move the backing file idr and code into a new source file fuse: enable FUSE_SYNCFS for all fuseblk servers fuse: capture the unique id of fuse commands being sent fuse: fix livelock in synchronous file put from fuseblk workers mm: fix lockdep issues in writeback handling fuse: add prune notification fuse: remove redundant calls to fuse_copy_finish() in fuse_notify() fuse: fix possibly missing fuse_copy_finish() call in fuse_notify() fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h> fuse: remove fuse_readpages_end() null mapping check fuse: fix references to fuse.rst -> fuse/fuse.rst fuse: allow synchronous FUSE_INIT fuse: zero initialize inode private data fuse: remove unused 'inode' parameter in fuse_passthrough_open virtio_fs: fix the hash table using in virtio_fs_enqueue_req() mm: remove BDI_CAP_WRITEBACK_ACCT fuse: use default writeback accounting virtio_fs: Remove redundant spinlock in virtio_fs_request_complete() fuse: remove unneeded offset assignment when filling write pages ... |
||
|
|
e95e2d3f34 |
slab: Add allow_spin check to eliminate kmemleak warnings
In slab_post_alloc_hook(), kmemleak check is skipped when
gfpflags_allow_spinning() returns false since commit af92793e52c3
("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Therefore, unconditionally calling kmemleak_not_leak() in
alloc_slab_obj_exts() would trigger the following warning:
kmemleak: Trying to color unknown object at 0xffff8881057f5000 as Grey
Call Trace:
alloc_slab_obj_exts+0x1b5/0x370
__alloc_tagging_slab_alloc_hook+0x9f/0x2d0
__kmalloc_cache_noprof+0x1c4/0x5c0
__set_page_owner+0x10d/0x1c0
post_alloc_hook+0x84/0xf0
get_page_from_freelist+0x73b/0x1380
__alloc_frozen_pages_noprof+0x110/0x2c0
alloc_pages_mpol+0x44/0x140
alloc_slab_page+0xac/0x150
allocate_slab+0x78/0x3a0
___slab_alloc+0x76b/0xed0
__slab_alloc.constprop.0+0x5a/0xb0
Add the allow_spin check in alloc_slab_obj_exts() to eliminate the above
warning.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20250930063831.782815-1-ranxiaokai627@163.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
||
|
|
ab220548db |
kmsan: fix kmsan_handle_dma() to avoid false positives
KMSAN reports an uninitialized value issue in dma_map_phys()[1]. This is a false positive caused by the way the virtual address is handled in kmsan_handle_dma(). Fix it by translating the physical address to a virtual address using phys_to_virt(). [1] BUG: KMSAN: uninit-value in dma_map_phys+0xdc5/0x1060 dma_map_phys+0xdc5/0x1060 dma_map_page_attrs+0xcf/0x130 e1000_xmit_frame+0x3c51/0x78f0 dev_hard_start_xmit+0x22f/0xa30 sch_direct_xmit+0x3b2/0xcf0 __dev_queue_xmit+0x3588/0x5e60 neigh_resolve_output+0x9c5/0xaf0 ip6_finish_output2+0x24e0/0x2d30 ip6_finish_output+0x903/0x10d0 ip6_output+0x331/0x600 mld_sendpack+0xb4a/0x1770 mld_ifc_work+0x1328/0x19b0 process_scheduled_works+0xb91/0x1d80 worker_thread+0xedf/0x1590 kthread+0xd5c/0xf00 ret_from_fork+0x1f5/0x4c0 ret_from_fork_asm+0x1a/0x30 Uninit was created at: __kmalloc_cache_noprof+0x8f5/0x16b0 syslog_print+0x9a/0xef0 do_syslog+0x849/0xfe0 __x64_sys_syslog+0x97/0x100 x64_sys_call+0x3cf8/0x3e30 do_syscall_64+0xd9/0xfa0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Bytes 0-89 of 90 are uninitialized Memory access of size 90 starts at ffff8880367ed000 CPU: 1 UID: 0 PID: 1552 Comm: kworker/1:2 Not tainted 6.17.0-next-20250929 #26 PREEMPT(none) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014 Workqueue: mld mld_ifc_work Fixes: 6eb1e769b2c1 ("kmsan: convert kmsan_handle_dma to use physical addresses") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251002051024.3096061-1-syoshida@redhat.com |
||
|
|
8804d970fa |
Summary of significant series in this pull request:
- The 3 patch series "mm, swap: improve cluster scan strategy" from
Kairui Song improves performance and reduces the failure rate of swap
cluster allocation.
- The 4 patch series "support large align and nid in Rust allocators"
from Vitaly Wool permits Rust allocators to set NUMA node and large
alignment when perforning slub and vmalloc reallocs.
- The 2 patch series "mm/damon/vaddr: support stat-purpose DAMOS" from
Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets
for virtual address spaces for ops-level DAMOS filters.
- The 3 patch series "execute PROCMAP_QUERY ioctl under per-vma lock"
from Suren Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps.
- The 2 patch series "mm/mincore: minor clean up for swap cache
checking" from Kairui Song performs some cleanup in the swap code.
- The 11 patch series "mm: vm_normal_page*() improvements" from David
Hildenbrand provides code cleanup in the pagemap code.
- The 5 patch series "add persistent huge zero folio support" from
Pankaj Raghav provides a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero.
- The 3 patch series "kho: fixes and cleanups" from Mike Rapoport adds a
few touchups to the recently added Kexec Handover feature.
- The 10 patch series "mm: make mm->flags a bitmap and 64-bit on all
arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To
end the constant struggle with space shortage on 32-bit conflicting with
64-bit's needs.
- The 2 patch series "mm/swapfile.c and swap.h cleanup" from Chris Li
cleans up some swap code.
- The 7 patch series "selftests/mm: Fix false positives and skip
unsupported tests" from Donet Tom fixes a few things in our selftests
code.
- The 7 patch series "prctl: extend PR_SET_THP_DISABLE to only provide
THPs when advised" from David Hildenbrand "allows individual processes
to opt-out of THP=always into THP=madvise, without affecting other
workloads on the system".
It's a long story - the [1/N] changelog spells out the considerations.
- The 11 patch series "Add and use memdesc_flags_t" from Matthew Wilcox
gets us started on the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc.
- The 3 patch series "Tiny optimization for large read operations" from
Chi Zhiling improves the efficiency of the pagecache read path.
- The 5 patch series "Better split_huge_page_test result check" from Zi
Yan improves our folio splitting selftest code.
- The 2 patch series "test that rmap behaves as expected" from Wei Yang
adds some rmap selftests.
- The 3 patch series "remove write_cache_pages()" from Christoph Hellwig
removes that function and converts its two remaining callers.
- The 2 patch series "selftests/mm: uffd-stress fixes" from Dev Jain
fixes some UFFD selftests issues.
- The 3 patch series "introduce kernel file mapped folios" from Boris
Burkov introduces the concept of "kernel file pages". Using these
permits btrfs to account its metadata pages to the root cgroup, rather
than to the cgroups of random inappropriate tasks.
- The 2 patch series "mm/pageblock: improve readability of some
pageblock handling" from Wei Yang provides some readability improvements
to the page allocator code.
- The 11 patch series "mm/damon: support ARM32 with LPAE" from SeongJae
Park teaches DAMON to understand arm32 highmem.
- The 4 patch series "tools: testing: Use existing atomic.h for
vma/maple tests" from Brendan Jackman performs some code cleanups and
deduplication under tools/testing/.
- The 2 patch series "maple_tree: Fix testing for 32bit compiles" from
Liam Howlett fixes a couple of 32-bit issues in
tools/testing/radix-tree.c.
- The 2 patch series "kasan: unify kasan_enabled() and remove
arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN
arch-specific initialization code into a common arch-neutral
implementation.
- The 3 patch series "mm: remove zpool" from Johannes Weiner removes
zspool - an indirection layer which now only redirects to a single thing
(zsmalloc).
- The 2 patch series "mm: task_stack: Stack handling cleanups" from
Pasha Tatashin makes a couple of cleanups in the fork code.
- The 37 patch series "mm: remove nth_page()" from David Hildenbrand
makes rather a lot of adjustments at various nth_page() callsites,
eventually permitting the removal of that undesirable helper function.
- The 2 patch series "introduce kasan.write_only option in hw-tags" from
Yeoreum Yun creates a KASAN read-only mode for ARM, using that
architecture's memory tagging feature. It is felt that a read-only mode
KASAN is suitable for use in production systems rather than debug-only.
- The 3 patch series "mm: hugetlb: cleanup hugetlb folio allocation"
from Kefeng Wang does some tidying in the hugetlb folio allocation code.
- The 12 patch series "mm: establish const-correctness for pointer
parameters" from Max Kellermann makes quite a number of the MM API
functions more accurate about the constness of their arguments. This
was getting in the way of subsystems (in this case CEPH) when they
attempt to improving their own const/non-const accuracy.
- The 7 patch series "Cleanup free_pages() misuse" from Vishal Moola
fixes a number of code sites which were confused over when to use
free_pages() vs __free_pages().
- The 3 patch series "Add Rust abstraction for Maple Trees" from Alice
Ryhl makes the mapletree code accessible to Rust. Required by nouveau
and by its forthcoming successor: the new Rust Nova driver.
- The 2 patch series "selftests/mm: split_huge_page_test:
split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and
some cleanups to the thp selftesting code.
- The 14 patch series "mm, swap: introduce swap table as swap cache
(phase I)" from Chris Li and Kairui Song is the first step along the
path to implementing "swap tables" - a new approach to swap allocation
and state tracking which is expected to yield speed and space
improvements. This patchset itself yields a 5-20% performance benefit
in some situations.
- The 3 patch series "Some ptdesc cleanups" from Matthew Wilcox utilizes
the new memdesc layer to clean up the ptdesc code a little.
- The 3 patch series "Fix va_high_addr_switch.sh test failure" from
Chunyu Hu fixes some issues in our 5-level pagetable selftesting code.
- The 2 patch series "Minor fixes for memory allocation profiling" from
Suren Baghdasaryan addresses a couple of minor issues in relatively new
memory allocation profiling feature.
- The 3 patch series "Small cleanups" from Matthew Wilcox has a few
cleanups in preparation for more memdesc work.
- The 2 patch series "mm/damon: add addr_unit for DAMON_LRU_SORT and
DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in
furtherance of supporting arm highmem.
- The 2 patch series "selftests/mm: Add -Wunreachable-code and fix
warnings" from Muhammad Anjum adds that compiler check to selftests code
and fixes the fallout, by removing dead code.
- The 10 patch series "Improvements to Victim Process Thawing and OOM
Reaper Traversal Order" from zhongjinji makes a number of improvements
in the OOM killer: mainly thawing a more appropriate group of victim
threads so they can release resources.
- The 5 patch series "mm/damon: misc fixups and improvements for 6.18"
from SeongJae Park is a bunch of small and unrelated fixups for DAMON.
- The 7 patch series "mm/damon: define and use DAMON initialization
check function" from SeongJae Park implement reliability and
maintainability improvements to a recently-added bug fix.
- The 2 patch series "mm/damon/stat: expose auto-tuned intervals and
non-idle ages" from SeongJae Park provides additional transparency to
userspace clients of the DAMON_STAT information.
- The 2 patch series "Expand scope of khugepaged anonymous collapse"
from Dev Jain removes some constraints on khubepaged's collapsing of
anon VMAs. It also increases the success rate of MADV_COLLAPSE against
an anon vma.
- The 2 patch series "mm: do not assume file == vma->vm_file in
compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards
removal of file_operations.mmap(). This patchset concentrates upon
clearing up the treatment of stacked filesystems.
- The 6 patch series "mm: Improve mlock tracking for large folios" from
Kiryl Shutsemau provides some fixes and improvements to mlock's tracking
of large folios. /proc/meminfo's "Mlocked" field became more accurate.
- The 2 patch series "mm/ksm: Fix incorrect accounting of KSM counters
during fork" from Donet Tom fixes several user-visible KSM stats
inaccuracies across forks and adds selftest code to verify these
counters.
- The 2 patch series "mm_slot: fix the usage of mm_slot_entry" from Wei
Yang addresses some potential but presently benign issues in KSM's
mm_slot handling.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaN3cywAKCRDdBJ7gKXxA
jtaPAQDmIuIu7+XnVUK5V11hsQ/5QtsUeLHV3OsAn4yW5/3dEQD/UddRU08ePN+1
2VRB0EwkLAdfMWW7TfiNZ+yhuoiL/AA=
=4mhY
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
||
|
|
24d9e8b3c9 |
slab updates for 6.18
-----BEGIN PGP SIGNATURE----- iQFPBAABCAA5FiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmja74IbFIAAAAAABAAO bWFudTIsMi41KzEuMTEsMiwyAAoJELvgsHXSRYiacR4H/04aBsr7LZnTJVeZLQwK HKoOwXBqiQyqPdjKXGKnp7Mh9gRp2W3V11VsYTuDJNUS+Vz5YXW0z8cRnUfZ3SYs l+GZC3vZeAy2EVJE1U6Mb673hU8vziI80IO2q/tGzaj9a+wC3L0lemc+YFQTwG+u pMtt8zU2vHRjgkx8TNNqJBBOLLDV+RzIl8pqXVnh4eju6x6ZdreGnjXaePYMdjG0 fXLf9XwIeWREqbfeOCEOB50Ts71kkdiOeskwnJyfCTDT8WTu3zC/dICqfh66e3Gg 8hQKvMsuKpm/FwbtgdB0WvaDjENH6PmY+ubLYVxwvNpcsTSqfe0IYGm+HpUP+TPf m+Y= =w+JL -----END PGP SIGNATURE----- Merge tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - A new layer for caching objects for allocation and free via percpu arrays called sheaves. The aim is to combine the good parts of SLAB (lower-overhead and simpler percpu caching, compared to SLUB) without the past issues with arrays for freeing remote NUMA node objects and their flushing. It also allows more efficient kfree_rcu(), and cheaper object preallocations for cases where the exact number of objects is unknown, but an upper bound is. Currently VMAs and maple nodes are using this new caching, with a plan to enable it for all caches and remove the complex SLUB fastpath based on cpu (partial) slabs and this_cpu_cmpxchg_double(). (Vlastimil Babka, with Liam Howlett and Pedro Falcato for the maple tree changes) - Re-entrant kmalloc_nolock(), which allows opportunistic allocations from NMI and tracing/kprobe contexts. Building on prior page allocator and memcg changes, it will result in removing BPF-specific caches on top of slab (Alexei Starovoitov) - Various fixes and cleanups. (Kuan-Wei Chiu, Matthew Wilcox, Suren Baghdasaryan, Ye Liu) * tag 'slab-for-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (40 commits) slab: Introduce kmalloc_nolock() and kfree_nolock(). slab: Reuse first bit for OBJEXTS_ALLOC_FAIL slab: Make slub local_(try)lock more precise for LOCKDEP mm: Introduce alloc_frozen_pages_nolock() mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock(). locking/local_lock: Introduce local_lock_is_locked(). maple_tree: Convert forking to use the sheaf interface maple_tree: Add single node allocation support to maple state maple_tree: Prefilled sheaf conversion and testing tools/testing: Add support for prefilled slab sheafs maple_tree: Replace mt_free_one() with kfree() maple_tree: Use kfree_rcu in ma_free_rcu testing/radix-tree/maple: Hack around kfree_rcu not existing tools/testing: include maple-shim.c in maple.c maple_tree: use percpu sheaves for maple_node_cache mm, vma: use percpu sheaves for vm_area_struct cache tools/testing: Add support for changes to slab for sheaves slab: allow NUMA restricted allocations to use percpu sheaves tools/testing/vma: Implement vm_refcnt reset slab: skip percpu sheaves for remote object freeing ... |
||
|
|
07fdad3a93 |
Networking changes for 6.18.
Core & protocols
----------------
- Improve drop account scalability on NUMA hosts for RAW and UDP sockets
and the backlog, almost doubling the Pps capacity under DoS.
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance by
an additional 50%, even more under extreme conditions.
- Add support for PSP encryption of TCP connections; this mechanism has
some similarities with IPsec and TLS, but offers superior HW offloads
capabilities.
- Ongoing work to support Accurate ECN for TCP. AccECN allows more than
one congestion notification signal per RTT and is a building block for
Low Latency, Low Loss, and Scalable Throughput (L4S).
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath.
- Refactor skb deferral free to better scale on large multi-NUMA hosts,
this improves TCP and UDP RX performances significantly on such HW.
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds.
- Improve handling of setups with a large number of nexthop, making dump
operating scaling linearly and avoiding unneeded synchronize_rcu() on
delete.
- Improve bridge handling of VLAN FDB, storing a single entry per bridge
instead of one entry per port; this makes the dump order of magnitude
faster on large switches.
- Restore IP ID correctly for encapsulated packets at GSO segmentation
time, allowing GRO to merge packets in more scenarios.
- Improve netfilter matching performance on large sets.
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting recent
TCP autotuning changes.
- Allow bridges to redirect to a backup port when the bridge port is
administratively down.
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups.
- Add RCU safety to dst->dev, closing a lot of possible races.
- A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing
code duplication.
- Supports pulling data from an skb frag into the linear area of an XDP
buffer.
Things we sprinkled into general kernel code
--------------------------------------------
- Generate netlink documentation from YAML using an integrated
YAML parser.
Driver API
----------
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection.
- Introduce API for fetching the DMA device for a given queue, allowing
TCP zerocopy RX on more H/W setups.
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath.
- Add a new dedicated ethtool callback enabling drivers to provide the
number of RX rings directly, improving efficiency and clarity in RX
ring queries and RSS configuration.
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause.
- Support for DPLL phase offset exponential moving average, controlling
the average smoothing factor.
Device drivers
--------------
- Add a new Huawei driver for 3rd gen NIC (hinic3).
- Add a new SpacemiT driver for K1 ethernet MAC.
- Add a generic abstraction for shared memory communication devices
(dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages to
improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4): support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k_
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmjdBdsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkWnAP/2Jz0ExnYMDxLxkkKqMte+3JNeF/KNjB
zVIslYN/5ekkd1TMXyDLFlWHqUOUMSF9+cK5ZRMHG6/lzOueSzFuuuDFsWs6Kf2f
q7Q9MzlXhR9++YCsdES1uS5x3PPjMInzo2ZivMFa6fUDLLFSzeAOKL9k+RS0EggU
VlXv2Wkm73R0O6KAssgDsHke9cnNz+F0DzhQ0S3qkyZF9tS5NrDeUl7fZ47XZgwb
ZuXdEzKmTTepo2XvZGxJoF2D7nekWFlHhLjEPpDJtET19nwhukCry41/FplJwlKR
dWsYkqLkrOEQKFQ+q++/5c3BgFOG+IrQLbP5bGQF1tZRMl4Dqp9PDxK5fKJfccbS
0VY3Y2qWOSYejT2Ji2kEHR5E5rPyFm3Y5A4Rnpz0yEHj14vL2v4zf7CZRkCyyDfD
doIZXFGkM0+N7QeXLEN833cje9zjaXuP9GAE+2bb+wAWAZAqof4KX8JgHh+y5Rwm
pvUtvFxmEtntlMwNBap8aT3FquGtfZncU8pzAE5kvWjuMvyF0NVWiE5rB2kSQM/X
NLmUdvDyjwwJqthXAJG0fl+a0mNJ/kOAqSOKJDJrfKDGWa+ovwY0iY06SpK0wIbO
Wz7tpMk5MSlYXW8xWMlbyhvvU6T9xuoQ2KV4QTdMxc6Ir3sNX6YkQr+gjQjxB0gx
ST2QF6lZeWFh
=w2Kz
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core & protocols:
- Improve drop account scalability on NUMA hosts for RAW and UDP
sockets and the backlog, almost doubling the Pps capacity under DoS
- Optimize the UDP RX performance under stress, reducing contention,
revisiting the binary layout of the involved data structs and
implementing NUMA-aware locking. This improves UDP RX performance
by an additional 50%, even more under extreme conditions
- Add support for PSP encryption of TCP connections; this mechanism
has some similarities with IPsec and TLS, but offers superior HW
offloads capabilities
- Ongoing work to support Accurate ECN for TCP. AccECN allows more
than one congestion notification signal per RTT and is a building
block for Low Latency, Low Loss, and Scalable Throughput (L4S)
- Reorganize the TCP socket binary layout for data locality, reducing
the number of touched cachelines in the fastpath
- Refactor skb deferral free to better scale on large multi-NUMA
hosts, this improves TCP and UDP RX performances significantly on
such HW
- Increase the default socket memory buffer limits from 256K to 4M to
better fit modern link speeds
- Improve handling of setups with a large number of nexthop, making
dump operating scaling linearly and avoiding unneeded
synchronize_rcu() on delete
- Improve bridge handling of VLAN FDB, storing a single entry per
bridge instead of one entry per port; this makes the dump order of
magnitude faster on large switches
- Restore IP ID correctly for encapsulated packets at GSO
segmentation time, allowing GRO to merge packets in more scenarios
- Improve netfilter matching performance on large sets
- Improve MPTCP receive path performance by leveraging recently
introduced core infrastructure (skb deferral free) and adopting
recent TCP autotuning changes
- Allow bridges to redirect to a backup port when the bridge port is
administratively down
- Introduce MPTCP 'laminar' endpoint that con be used only once per
connection and simplify common MPTCP setups
- Add RCU safety to dst->dev, closing a lot of possible races
- A significant crypto library API for SCTP, MPTCP and IPv6 SR,
reducing code duplication
- Supports pulling data from an skb frag into the linear area of an
XDP buffer
Things we sprinkled into general kernel code:
- Generate netlink documentation from YAML using an integrated YAML
parser
Driver API:
- Support using IPv6 Flow Label in Rx hash computation and RSS queue
selection
- Introduce API for fetching the DMA device for a given queue,
allowing TCP zerocopy RX on more H/W setups
- Make XDP helpers compatible with unreadable memory, allowing more
easily building DevMem-enabled drivers with a unified XDP/skbs
datapath
- Add a new dedicated ethtool callback enabling drivers to provide
the number of RX rings directly, improving efficiency and clarity
in RX ring queries and RSS configuration
- Introduce a burst period for the health reporter, allowing better
handling of multiple errors due to the same root cause
- Support for DPLL phase offset exponential moving average,
controlling the average smoothing factor
Device drivers:
- Add a new Huawei driver for 3rd gen NIC (hinic3)
- Add a new SpacemiT driver for K1 ethernet MAC
- Add a generic abstraction for shared memory communication
devices (dibps)
- Ethernet high-speed NICs:
- nVidia/Mellanox:
- Use multiple per-queue doorbell, to avoid MMIO contention
issues
- support adjacent functions, allowing them to delegate their
SR-IOV VFs to sibling PFs
- support RSS for IPSec offload
- support exposing raw cycle counters in PTP and mlx5
- support for disabling host PFs.
- Intel (100G, ice, idpf):
- ice: support for SRIOV VFs over an Active-Active link
aggregate
- ice: support for firmware logging via debugfs
- ice: support for Earliest TxTime First (ETF) hardware offload
- idpf: support basic XDP functionalities and XSk
- Broadcom (bnxt):
- support Hyper-V VF ID
- dynamic SRIOV resource allocations for RoCE
- Meta (fbnic):
- support queue API, zero-copy Rx and Tx
- support basic XDP functionalities
- devlink health support for FW crashes and OTP mem corruptions
- expand hardware stats coverage to FEC, PHY, and Pause
- Wangxun:
- support ethtool coalesce options
- support for multiple RSS contexts
- Ethernet virtual:
- Macsec:
- replace custom netlink attribute checks with policy-level
checks
- Bonding:
- support aggregator selection based on port priority
- Microsoft vNIC:
- use page pool fragments for RX buffers instead of full pages
to improve memory efficiency
- Ethernet NICs consumer, and embedded:
- Qualcomm: support Ethernet function for IPQ9574 SoC
- Airoha: implement wlan offloading via NPU
- Freescale
- enetc: add NETC timer PTP driver and add PTP support
- fec: enable the Jumbo frame support for i.MX8QM
- Renesas (R-Car S4):
- support HW offloading for layer 2 switching
- support for RZ/{T2H, N2H} SoCs
- Cadence (macb): support TAPRIO traffic scheduling
- TI:
- support for Gigabit ICSS ethernet SoC (icssm-prueth)
- Synopsys (stmmac): a lot of cleanups
- Ethernet PHYs:
- Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS
driver
- Support bcm63268 GPHY power control
- Support for Micrel lan8842 PHY and PTP
- Support for Aquantia AQR412 and AQR115
- CAN:
- a large CAN-XL preparation work
- reorganize raw_sock and uniqframe struct to minimize memory
usage
- rcar_canfd: update the CAN-FD handling
- WiFi:
- extended Neighbor Awareness Networking (NAN) support
- S1G channel representation cleanup
- improve S1G support
- WiFi drivers:
- Intel (iwlwifi):
- major refactor and cleanup
- Broadcom (brcm80211):
- support for AP isolation
- RealTek (rtw88/89) rtw88/89:
- preparation work for RTL8922DE support
- MediaTek (mt76):
- HW restart improvements
- MLO support
- Qualcomm/Atheros (ath10k):
- GTK rekey fixes
- Bluetooth drivers:
- btusb: support for several new IDs for MT7925
- btintel: support for BlazarIW core
- btintel_pcie: support for _suspend() / _resume()
- btintel_pcie: support for Scorpious, Panther Lake-H484 IDs"
* tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits)
net: stmmac: Add support for Allwinner A523 GMAC200
dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible
Revert "Documentation: net: add flow control guide and document ethtool API"
octeontx2-pf: fix bitmap leak
octeontx2-vf: fix bitmap leak
net/mlx5e: Use extack in set rxfh callback
net/mlx5e: Introduce mlx5e_rss_params for RSS configuration
net/mlx5e: Introduce mlx5e_rss_init_params
net/mlx5e: Remove unused mdev param from RSS indir init
net/mlx5: Improve QoS error messages with actual depth values
net/mlx5e: Prevent entering switchdev mode with inconsistent netns
net/mlx5: HWS, Generalize complex matchers
net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs
selftests/net: add tcp_port_share to .gitignore
Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set"
net: add NUMA awareness to skb_attempt_defer_free()
net: use llist for sd->defer_list
net: make softnet_data.defer_count an atomic
selftests: drv-net: psp: add tests for destroying devices
selftests: drv-net: psp: add test for auto-adjusting TCP MSS
...
|
||
|
|
feafee2845 |
arm64 updates for 6.18
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace.
- Extend Spectre workarounds to cover additional Arm CPU variants.
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of mapping
granularity (BBML2_NOABORT).
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY).
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled.
- Tidy up the types used in our early MMU setup code.
- Rework rodata= for closer parity with the behaviour on x86.
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases.
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features.
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths.
- Re-enable warnings when building the 32-bit vDSO object.
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU.
- Support for Hisilicon's MN and NoC PMUs.
- Support for Fujitsu's Uncore PMU.
- Support for SPE's extended event filtering feature.
- Preparatory work to enable data source filtering in SPE.
- Support for multiple lanes in the DWC PCIe PMU.
- Support for i.MX94 in the IMX DDR PMU driver.
- MAINTAINERS update (Thank you, Yicong).
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test.
- Support nolibc in GCS tests.
- Extend SVE ptrace test to pass unsupported regsets and invalid vector
lengths.
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition.
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1.
- Sync TCR_EL1 definition with the latest Arm ARM (L.b).
- Be stricter about the input fed into our AWK sysreg generator script.
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree.
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls.
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4.
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM guests
can reliably query the underlying CPU types from the VMM.
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmjWlMIQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNIVYB/sHaFYmToEaK4ldMfTn4ZQ8yr9ZuTMLr/ji
zOm7wULP08wKIcp6t+1N6e7A8Wb3YSErxTywc4b9MqctIel1WvBMewjJd38xb2qO
hFCuRuWemfyt6Rw/EGMCP54yueLzKbnN4q+/Aks8jedWtS+AXY6McGCJOzMjuECL
zKb68Hqqb09YQ2b2BPSAiU9g42rotiIYppLaLbEdxUnw/eqfEN3wSrl92/LuBlqH
r2wZDnJwA5q8Iy9HWlnhg4Jax0Cs86jSJPIQLB6v3WWhc3HR49pm5cqYzYkUht9L
nA6eaWJiBl/+k3S+ftPKNzU8n04r15lqyNZE2FYfRk+0BPSacJOf
=kO15
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's good stuff across the board, including some nice mm
improvements for CPUs with the 'noabort' BBML2 feature and a clever
patch to allow ptdump to play nicely with block mappings in the
vmalloc area.
Confidential computing:
- Add support for accepting secrets from firmware (e.g. ACPI CCEL)
and mapping them with appropriate attributes.
CPU features:
- Advertise atomic floating-point instructions to userspace
- Extend Spectre workarounds to cover additional Arm CPU variants
- Extend list of CPUs that support break-before-make level 2 and
guarantee not to generate TLB conflict aborts for changes of
mapping granularity (BBML2_NOABORT)
- Add GCS support to our uprobes implementation.
Documentation:
- Remove bogus SME documentation concerning register state when
entering/exiting streaming mode.
Entry code:
- Switch over to the generic IRQ entry code (GENERIC_IRQ_ENTRY)
- Micro-optimise syscall entry path with a compiler branch hint.
Memory management:
- Enable huge mappings in vmalloc space even when kernel page-table
dumping is enabled
- Tidy up the types used in our early MMU setup code
- Rework rodata= for closer parity with the behaviour on x86
- For CPUs implementing BBML2_NOABORT, utilise block mappings in the
linear map even when rodata= applies to virtual aliases
- Don't re-allocate the virtual region between '_text' and '_stext',
as doing so confused tools parsing /proc/vmcore.
Miscellaneous:
- Clean-up Kconfig menuconfig text for architecture features
- Avoid redundant bitmap_empty() during determination of supported
SME vector lengths
- Re-enable warnings when building the 32-bit vDSO object
- Avoid breaking our eggs at the wrong end.
Perf and PMUs:
- Support for v3 of the Hisilicon L3C PMU
- Support for Hisilicon's MN and NoC PMUs
- Support for Fujitsu's Uncore PMU
- Support for SPE's extended event filtering feature
- Preparatory work to enable data source filtering in SPE
- Support for multiple lanes in the DWC PCIe PMU
- Support for i.MX94 in the IMX DDR PMU driver
- MAINTAINERS update (Thank you, Yicong)
- Minor driver fixes (PERF_IDX2OFF() overflow, CMN register offsets).
Selftests:
- Add basic LSFE check to the existing hwcaps test
- Support nolibc in GCS tests
- Extend SVE ptrace test to pass unsupported regsets and invalid
vector lengths
- Minor cleanups (typos, cosmetic changes).
System registers:
- Fix ID_PFR1_EL1 definition
- Fix incorrect signedness of some fields in ID_AA64MMFR4_EL1
- Sync TCR_EL1 definition with the latest Arm ARM (L.b)
- Be stricter about the input fed into our AWK sysreg generator
script
- Typo fixes and removal of redundant definitions.
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code.
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
arm64: cpufeature: Remove duplicate asm/mmu.h header
arm64: Kconfig: Make CPU_BIG_ENDIAN depend on BROKEN
perf/dwc_pcie: Fix use of uninitialized variable
arm/syscalls: mark syscall invocation as likely in invoke_syscall
Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
Documentation: hisi-pmu: Fix of minor format error
drivers/perf: hisi: Add support for L3C PMU v3
drivers/perf: hisi: Refactor the event configuration of L3C PMU
drivers/perf: hisi: Extend the field of tt_core
drivers/perf: hisi: Extract the event filter check of L3C PMU
drivers/perf: hisi: Simplify the probe process of each L3C PMU version
drivers/perf: hisi: Export hisi_uncore_pmu_isr()
drivers/perf: hisi: Relax the event ID check in the framework
perf: Fujitsu: Add the Uncore PMU driver
arm64: map [_text, _stext) virtual address range non-executable+read-only
arm64/sysreg: Update TCR_EL1 register
arm64: Enable vmalloc-huge with ptdump
arm64: cpufeature: add Neoverse-V3AE to BBML2 allow list
arm64: errata: Apply workarounds for Neoverse-V3AE
arm64: cputype: Add Neoverse-V3AE definitions
...
|
||
|
|
263e777ee3 |
vfs-6.18-rc1.writeback
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQigAKCRCRxhvAZXjc
ojqsAQCI9BOJV953LWPr9fmmFVz78wA4kTxkFFDAxwH/0iYtmQD8DORngPfJERI1
FwTY/gk2N744vUpbbcSM/KOnM3ss8wE=
=B3rD
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs writeback updates from Christian Brauner:
"This contains work adressing lockups reported by users when a systemd
unit reading lots of files from a filesystem mounted with the lazytime
mount option exits.
With the lazytime mount option enabled we can be switching many dirty
inodes on cgroup exit to the parent cgroup. The numbers observed in
practice when systemd slice of a large cron job exits can easily reach
hundreds of thousands or millions.
The logic in inode_do_switch_wbs() which sorts the inode into
appropriate place in b_dirty list of the target wb however has linear
complexity in the number of dirty inodes thus overall time complexity
of switching all the inodes is quadratic leading to workers being
pegged for hours consuming 100% of the CPU and switching inodes to the
parent wb.
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with lazytime mount option
MNT=/mnt/
echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j ++ )); do
mkdir $MNT/dir$j
for (( i = 0; i < $FILES; i++ )); do
echo "foo" >$MNT/dir$j/file$i
done
touch -a -t 202501010000 $MNT/dir$j/file*
done
wait
echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches
echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j ++ )); do
cat /mnt/dir$j/file* >/dev/null &
done
wait
echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
- Avoiding contention on the wb->list_lock when switching inodes by
running a single work item per wb and managing a queue of items
switching to the wb
- Allowing rescheduling when switching inodes over to a different
cgroup to avoid softlockups
- Maintaining the b_dirty list ordering instead of sorting it"
* tag 'vfs-6.18-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
writeback: Add tracepoint to track pending inode switches
writeback: Avoid excessively long inode switching times
writeback: Avoid softlockup when switching many inodes
writeback: Avoid contention on wb->list_lock when switching inodes
|
||
|
|
b7ce6fa90f |
vfs-6.18-rc1.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaNZQMQAKCRCRxhvAZXjc
omNLAQCgrwzd9sa1JTlixweu3OAxQlSEbLuMpEv7Ztm+B7Wz0AD9HtwPC44Kev03
GbMcB2DCFLC4evqYECj6IG7NBmoKsAs=
=1ICf
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
||
|
|
ca74b8cada |
Merge series "slab: Re-entrant kmalloc_nolock()"
From the cover letter [1]: This patch set introduces kmalloc_nolock() which is the next logical step towards any context allocation necessary to remove bpf_mem_alloc and get rid of preallocation requirement in BPF infrastructure. In production BPF maps grew to gigabytes in size. Preallocation wastes memory. Alloc from any context addresses this issue for BPF and other subsystems that are forced to preallocate too. This long task started with introduction of alloc_pages_nolock(), then memcg and objcg were converted to operate from any context including NMI, this set completes the task with kmalloc_nolock() that builds on top of alloc_pages_nolock() and memcg changes. After that BPF subsystem will gradually adopt it everywhere. Link: https://lore.kernel.org/all/20250909010007.1660-1-alexei.starovoitov@gmail.com/ [1] |
||
|
|
b912061924 |
Merge series "SLUB percpu sheaves"
This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). The plan is therefore that it will be later enabled for
all kmem caches and replace the complicated cpu (partial) slabs code.
Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays magazines like the original Bonwick paper. The per-NUMA-node
cache of sheaves is thus called "barn".
This caching may seem similar to the arrays we had in SLAB, but there
are some important differences:
- deals differently with NUMA locality of freed objects, thus there are
no per-node "shared" arrays (with possible lock contention) and no
"alien" arrays that would need periodical flushing
- instead, freeing remote objects (which is rare) bypasses the sheaves
- percpu sheaves thus contain only local objects (modulo rare races
and local node exhaustion)
- NUMA restricted allocations and strict_numa mode is still honoured
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
for guaranteed and efficient allocations in a restricted context, when
the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)
The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of local double cmpxchg,
thanks to local_trylock() it becomes a preempt_disable() and no atomic
operations. Same for freeing, which is otherwise a local double cmpxchg
only for short term allocations (so the same slab is still active on the
same cpu when freeing the object) and a more costly locked double
cmpxchg otherwise.
- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
separate percpu sheaf and only submit the whole sheaf to call_rcu()
when full. After the grace period, the sheaf can be used for
allocations, which is more efficient than freeing and reallocating
individual slab objects (even with the batching done by kfree_rcu()
implementation itself). In case only some cpus are allowed to handle rcu
callbacks, the sheaf can still be made available to other cpus on the
same node via the shared barn. The maple_node cache uses kfree_rcu() and
thus can benefit from this.
Note: this path is currently limited to !PREEMPT_RT
- Preallocation support. A prefilled sheaf can be privately borrowed to
perform a short term operation that is not allowed to block in the
middle and may need to allocate some objects. If an upper bound (worst
case) for the number of allocations is known, but only much fewer
allocations actually needed on average, borrowing and returning a sheaf
is much more efficient then a bulk allocation for the worst case
followed by a bulk free of the many unused objects. Maple tree write
operations should benefit from this.
- Compatibility with slub_debug. When slub_debug is enabled for a cache,
we simply don't create the percpu sheaves so that the debugging hooks
(at the node partial list slowpaths) are reached as before. The same
thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
reusing the (ineffective) paths for requests exceeding the cache's
sheaf_capacity. This is in line with the existing approach where
debugging bypasses the fast paths and SLUB_TINY preferes memory
savings over performance.
The above is adapted from the cover letter [1], which contains also
in-kernel microbenchmark results showing the lower overhead of sheaves.
Results from Suren Baghdasaryan [2] using a mmap/munmap microbenchmark
also show improvements.
Results from Sudarsan Mahendran [3] using will-it-scale show both
benefits and regressions, probably due to overall noisiness of those
tests.
Link: https://lore.kernel.org/all/20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz/ [1]
Link: https://lore.kernel.org/all/CAJuCfpEQ%3DRUgcAvRzE5jRrhhFpkm8E2PpBK9e9GhK26ZaJQt%3DQ@mail.gmail.com/ [2]
Link: https://lore.kernel.org/all/20250913000935.1021068-1-sudarsanm@google.com/ [3]
|