1
0
mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git synced 2026-01-11 17:10:13 +00:00
Dongsheng Yang 1d57628ff9 dm-pcache: add persistent cache target in device-mapper
This patch introduces dm-pcache, a new DM target that places a DAX-
capable persistent-memory device in front of any slower block device and
uses it as a high-throughput, low-latency  cache.

Design highlights
-----------------
- DAX data path – data is copied directly between DRAM and the pmem
  mapping, bypassing the block layer’s overhead.

- Segmented, crash-consistent layout
  - all layout metadata are dual-replicated CRC-protected.
  - atomic kset flushes; key replay on mount guarantees cache integrity
    even after power loss.

- Striped multi-tree index
  - Multi‑tree indexing for high parallelism.
  - overlap-resolution logic ensures non-intersecting cached extents.

- Background services
  - write-back worker flushes dirty keys in order, preserving backing-device
    crash consistency. This is important for checkpoint in cloud storage.
  - garbage collector reclaims clean segments when utilisation exceeds a
    tunable threshold.

- Data integrity – optional CRC32 on cached payload; metadata always protected.

Comparison with existing block-level caches
---------------------------------------------------------------------------------------------------------------------------------
| Feature                          | pcache (this patch)             | bcache                       | dm-writecache             |
|----------------------------------|---------------------------------|------------------------------|---------------------------|
| pmem access method               | DAX                             | bio (block I/O)              | DAX                       |
| Write latency (4 K rand-write)   | ~5 µs                           | ~20 µs                       | ~5 µs                     |
| Concurrency                      | multi subtree index             | global index tree            | single tree + wc_lock     |
| IOPS (4K randwrite, 32 numjobs)  | 2.1 M                           | 352 K                        | 283 K                     |
| Read-cache support               | YES                             | YES                          | NO                        |
| Deployment                       | no re-format of backend         | backend devices must be      | no re-format of backend   |
|                                  |                                 | reformatted                  |                           |
| Write-back ordering              | log-structured;                 | no ordering guarantee        | no ordering guarantee     |
|                                  | preserves app-IO-order          |                              |                           |
| Data integrity checks            | metadata + data CRC(optional)   | metadata CRC only            | none                      |
---------------------------------------------------------------------------------------------------------------------------------

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-08-25 15:25:29 +02:00

203 lines
6.3 KiB
ReStructuredText
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

.. SPDX-License-Identifier: GPL-2.0
=================================
dm-pcache — Persistent Cache
=================================
*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*
This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in `drivers/md/dm-pcache/`.
Quick feature summary
=====================
* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated (`PCACHE_META_INDEX_MAX
== 2`) and protected with CRC+sequence numbers.
* *Multi-tree indexing* (indexing trees sharded by logical address) for high PMem parallelism
* Pure *DAX path* I/O no extra BIO round-trips
* *Log-structured write-back* that preserves backend crash-consistency
Constructor
===========
::
pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]
========================= ====================================================
``cache_dev`` Any DAX-capable block device (``/dev/pmem0``…).
All metadata *and* cached blocks are stored here.
``backing_dev`` The slow block device to be cached.
``cache_mode`` Optional, Only ``writeback`` is accepted at the
moment.
``data_crc`` Optional, default to ``false``
* ``true`` store CRC32 for every cached entry
and verify on reads
* ``false`` skip CRC (faster)
========================= ====================================================
Example
-------
.. code-block:: shell
dmsetup create pcache_sdb --table \
"0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
Status line
===========
``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:
::
<sb_flags> <seg_total> <cache_segs> <segs_used> \
<gc_percent> <cache_flags> \
<key_head_seg>:<key_head_off> \
<dirty_tail_seg>:<dirty_tail_off> \
<key_tail_seg>:<key_tail_off>
Field meanings
--------------
=============================== =============================================
``sb_flags`` Super-block flags (e.g. endian marker).
``seg_total`` Number of physical *pmem* segments.
``cache_segs`` Number of segments used for cache.
``segs_used`` Segments currently allocated (bitmap weight).
``gc_percent`` Current GC high-water mark (0-90).
``cache_flags`` Bit 0 DATA_CRC enabled
Bit 1 INIT_DONE (cache initialised)
Bits 2-5 cache mode (0 == WB).
``key_head`` Where new key-sets are being written.
``dirty_tail`` First dirty key-set that still needs
write-back to the backing device.
``key_tail`` First key-set that may be reclaimed by GC.
=============================== =============================================
Messages
========
*Change GC trigger*
::
dmsetup message <dev> 0 gc_percent <0-90>
Theory of operation
===================
Sub-devices
-----------
==================== =========================================================
backing_dev Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev DAX device; must expose direct-access memory.
==================== =========================================================
Segments and key-sets
---------------------
* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimit clean/dirty and live/dead ksets.
Write-back
----------
Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.
Garbage collection
------------------
GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
CRC verification
----------------
If ``data_crc is enabled`` dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.
Failure handling
================
* *pmem media errors* all metadata copies are read with
``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation.
* *Cache full* if no free segment can be found, writes return ``-EBUSY``;
dm-pcache retries internally (request deferral).
* *System crash* on attach, the driver replays ksets from *key_tail* to
rebuild the in-core trees; every segments generation guards against
use-after-free keys.
Limitations & TODO
==================
* Only *write-back* mode; other modes planned.
* Only FIFO cache invalidate; other (LRU, ARC...) planned.
* Table reload is not supported currently.
* Discard planned.
Example workflow
================
.. code-block:: shell
# 1. Create devices
dmsetup create pcache_sdb --table \
"0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"
# 2. Put a filesystem on top
mkfs.ext4 /dev/mapper/pcache_sdb
mount /dev/mapper/pcache_sdb /mnt
# 3. Tune GC threshold to 80 %
dmsetup message pcache_sdb 0 gc_percent 80
# 4. Observe status
watch -n1 'dmsetup status pcache_sdb'
# 5. Shutdown
umount /mnt
dmsetup remove pcache_sdb
``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!