Bug 1392052 - Add a section/chapter on configuring persistent memory: NVDIMM
Summary: Add a section/chapter on configuring persistent memory: NVDIMM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: doc-Storage_Administration_Guide
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: 7.4
Assignee: Apurva Bhide
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1425467 1550714
 
Reported: 2016-11-04 16:34 UTC by Jeff Moyer
Modified: 2019-03-06 01:05 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1550714
Environment:
Last Closed: 2018-06-26 04:19:07 UTC
Target Upstream Version:



Description Jeff Moyer 2016-11-04 16:34:25 UTC
Document URL: 

Section Number and Name: 

Describe the issue: 

Suggestions for improvement: 

We need to document how administrators take advantage of persistent memory.  I will provide draft text.

Additional information:

Comment 2 Jeff Moyer 2016-12-15 22:36:33 UTC
I expect I will make some revisions to this text.  For example, it may be worth adding background information on sector mode.  Milan, you may also want to look at the blog post I wrote for more information.  Feel free to incorporate parts of that as you see fit.

https://developers.redhat.com/blog/2016/12/05/configuring-and-using-persistent-memory-rhel-7-3/

I also gave a talk at Red Hat Summit:
https://rh2016.smarteventscloud.com/connect/fileDownload/session/B589159303269827CAA72E8363E3F6B6/SS42192_Moyer.pdf

Persistent Memory (NVDIMMs)

Persistent memory, sometimes called storage class memory, can be
thought of as a cross between memory and storage. It shares two
properties with memory. First, it is byte addressable, meaning it can
be accessed using CPU load and store instructions, as opposed to the
read() or write() system calls required for accessing traditional
block-based storage. Second, pmem has performance characteristics
similar to DRAM, meaning it has very low access latencies (measured
in the tens to hundreds of nanoseconds). In addition to these
beneficial memory-like properties, the contents of persistent memory
are preserved when the power is off, just as with storage.

NVDIMMs can be grouped into interleave sets just like normal DRAM.
An interleave set is like a RAID 0 (stripe) across multiple DIMMs.
For DRAM, memory is often configured this way to increase
performance, and NVDIMMs likewise benefit from increased performance
when configured into interleave sets. However, there is another
advantage: interleaving can be used to combine multiple smaller
NVDIMMs into one larger logical device. (Note that interleave sets
are configured via the system BIOS or UEFI firmware.) When a system
with NVDIMMs boots, the operating system creates a pmem device node
for each interleave set. For example, if two 8GiB NVDIMMs make up an
interleave set, the system will create a single device node
(/dev/pmem0) of size 16GiB.
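
For example, on a hypothetical system with a single such interleave
set, the resulting device node might be listed as follows (output
illustrative):

# ls /dev/pmem*
/dev/pmem0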

Persistent memory can be used in one of two access modes. The first
is sector mode, which presents the storage as a fast block device.
This is useful for legacy applications that have not been modified to
use persistent memory, and for applications that wish to use the full
I/O stack, including device-mapper.

The second mode is memory mode.  A pmem device configured in memory
mode is able to support direct access programming as described by the
SNIA NVM Programming Model specification.  When this mode of access
is used, I/O bypasses the kernel's storage stack.  Because of this,
device-mapper drivers cannot be used.

NDCTL

Configuring persistent memory is done via the ndctl utility.
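
If the ndctl package is not already installed, install it first (a
sketch; the package name ndctl is an assumption matching what RHEL
ships):

# yum install ndctl

Here is example output from the 'ndctl list' command: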

# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"raw",
    "size":17179869184,
    "blockdev":"pmem1"
  },
  {
    "dev":"namespace0.0",
    "mode":"raw",
    "size":17179869184,
    "blockdev":"pmem0"
  }
]

As this output shows, unconfigured namespaces appear in "raw" mode.
In this example, there are two interleave sets, each 16GiB in size.
The namespace identifier (the "dev" field) is used to refer to the
NVDIMM namespace in subsequent ndctl operations.

Configuring PMEM for use as a Block Device (Legacy Mode)
 
To use persistent memory as a fast block device, the namespace needs
to be configured in sector mode.

# ndctl create-namespace -f -e namespace1.0 -m sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "size":17162027008,
  "uuid":"029caa76-7be3-4439-8890-9c2e374bcc76",
  "sector_size":4096,
  "blockdev":"pmem1s"
}

In this example, namespace1.0 was reconfigured to sector mode.
Notice that the block device name changed from pmem1 to pmem1s.  This
device can be used the same way as any other block device on the
system: it can be partitioned, formatted with a file system,
configured as part of a software RAID set, used as the cache device
for dm-cache, and so on.
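
For example, the sector-mode device can be formatted and mounted like
any other block device (the file system type and mount point here are
only illustrative):

# mkfs -t xfs /dev/pmem1s
# mkdir -p /mnt/pmem-sector
# mount /dev/pmem1s /mnt/pmem-sector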

Configuring PMEM for Direct Access (DAX)

Direct access requires the namespace to be configured in memory mode.

# ndctl create-namespace -f -e namespace0.0 -m memory -M mem
{
  "dev":"namespace0.0",
  "mode":"memory",
  "size":17177772032,
  "uuid":"e6944638-46aa-4e06-a722-0b3f16a5acbf",
  "blockdev":"pmem0"
}

Here, we've converted namespace0.0 to a memory mode namespace.  The
"-M mem" argument instructs ndctl to put the operating system data
structures used for DMA in system DRAM.  In order to perform DMA, the
kernel requires a data structure for each page in the memory region,
at an overhead of 64 bytes per 4KiB page.  For small devices, this
overhead easily fits in DRAM: the 16GiB namespace in this example
requires only 256MiB for page structures.  Because the NVDIMM is
small (and expensive), storing the kernel's page tracking data
structures in DRAM, as "-M mem" requests, makes sense.  Future NVDIMM
devices could be TiBs in size, and for those devices the memory
required for the page tracking data structures may exceed the amount
of DRAM in the system (1TiB of pmem requires 16GiB just for page
structures).  In such cases, it makes more sense to specify "-M dev",
which stores the data structures in the persistent memory itself.
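
For such a large device, the same create-namespace command would be
used with "-M dev" instead (a sketch; the namespace name is
illustrative):

# ndctl create-namespace -f -e namespace0.0 -m memory -M dev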

Once the namespace is configured in memory mode, it is ready for a
file system.  In RHEL 7.3, both ext4 and xfs have been modified to
support persistent memory as a Technology Preview.  File system
creation requires no special arguments.  However, in order to get DAX
functionality, the file system must be mounted with the "dax" mount
option.

# mkfs -t xfs /dev/pmem0
# mount -o dax /dev/pmem0 /mnt/pmem/
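
To make the mount persistent across reboots, a matching /etc/fstab
entry can be added (same device and mount point as above, shown here
only as a sketch):

/dev/pmem0  /mnt/pmem  xfs  dax  0 0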

And that’s it!  Now applications wishing to make use of pmem can
create files in /mnt/pmem, open them, and mmap them for direct access.
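
To confirm that DAX is in effect, check the active mount options
(output abbreviated and illustrative):

# mount | grep pmem
/dev/pmem0 on /mnt/pmem type xfs (rw,relatime,...,dax)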

Note that there are constraints that must be met when creating
partitions on a pmem device that will be used for direct access.
Partitions must be aligned on page boundaries.  On Intel platforms,
that means at least 4KiB alignment for the start and end of the
partition, but 2MiB is the preferred alignment.  By default, parted
aligns partitions on 1MiB boundaries, which satisfies the 4KiB
minimum but not the preferred 2MiB alignment, so partition boundaries
should be specified explicitly when 2MiB alignment is wanted.
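
As a sketch, the following parted commands create a single partition
with the preferred 2MiB alignment (the device name and the 100% end
point are illustrative; align-check verifies the result against the
device's reported optimum):

# parted -s /dev/pmem0 mklabel gpt
# parted -s /dev/pmem0 mkpart primary 2MiB 100%
# parted -s /dev/pmem0 align-check optimal 1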

Comment 19 Fujitsu kernel team 2017-11-17 09:36:14 UTC
Hi Jeff and Marek,

Could you add the following topics to the document?

- How to configure an NVDIMM:
  - To use it like a regular HDD/SSD
  - To use it in Device DAX mode

- NVML usage. I think the URL of the upstream documentation should be
  included in the document.

- Troubleshooting. For example, when an NVDIMM is broken, how to
  replace it.

Regards,
Masayoshi Mizuma

Comment 20 Jeff Moyer 2017-11-30 21:19:37 UTC
(In reply to fj-lsoft-kernel-it from comment #19)
> Hi Jeff and Marek,
> 
> Could you add the following topics to the document?
> 
> - How to configure an NVDIMM:
>   - To use it like a regular HDD/SSD
>   - To use it in Device DAX mode

This will be included.

> - NVML usage. I think the URL of the upstream documentation should
>   be included in the document.

I'm not sure what our policy is for linking to external documentation, but I will look into it.

> - Troubleshooting. For example, when an NVDIMM is broken, how to
>   replace it.

I will see what we can document in a vendor-agnostic way, but parts of this process will be vendor-specific.

