Document URL:

Section Number and Name:

Describe the issue:

Suggestions for improvement:
We need to document how administrators take advantage of persistent memory. I will provide draft text.

Additional information:
I expect I will make some revisions to this text. For example, it may be worth adding background information on sector mode. Milan, you may also want to look at the blog post I wrote for more information. Feel free to incorporate parts of that as you see fit.
https://developers.redhat.com/blog/2016/12/05/configuring-and-using-persistent-memory-rhel-7-3/

I also gave a talk at Red Hat Summit:
https://rh2016.smarteventscloud.com/connect/fileDownload/session/B589159303269827CAA72E8363E3F6B6/SS42192_Moyer.pdf

Persistent Memory (NVDIMMs)

Persistent memory, sometimes called storage class memory, can be thought of as a cross between memory and storage. It shares a couple of properties with memory. First, it is byte addressable, meaning it can be accessed using CPU load and store instructions, as opposed to the read() or write() system calls required for accessing traditional block-based storage. Second, pmem has performance characteristics similar to DRAM, meaning it has very low access latencies (measured in the tens to hundreds of nanoseconds). In addition to these beneficial memory-like properties, the contents of persistent memory are preserved when the power is off, just as with storage.

NVDIMMs can be grouped into interleave sets just like normal DRAM. An interleave set is like a RAID 0 (stripe) across multiple DIMMs. For DRAM, memory is often configured this way to increase performance. NVDIMMs also benefit from increased performance when configured into interleave sets. There is another advantage as well: interleaving can be used to combine multiple smaller NVDIMMs into one larger logical device. (Note that interleave sets are configured via the system BIOS or UEFI firmware.)

When a system with NVDIMMs boots, the operating system creates a pmem device node for each interleave set. For example, if two 8GiB NVDIMMs make up an interleave set, the system creates a single device node (/dev/pmem0) of size 16GiB.

Persistent memory can be used in one of two access modes. The first is sector mode, which presents the storage as a fast block device. This is useful for legacy applications (applications that have not been modified to make use of persistent memory), or for applications that wish to make use of the full I/O stack, including device-mapper. The second is memory mode. A pmem device configured in memory mode supports direct access programming as described by the SNIA NVM Programming Model specification. When this mode of access is used, I/O bypasses the kernel's storage stack. Because of this, device-mapper drivers cannot be used.

NDCTL

Persistent memory is configured with the ndctl utility. Here is example output from the 'ndctl list' command:

# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"raw",
    "size":17179869184,
    "blockdev":"pmem1"
  },
  {
    "dev":"namespace0.0",
    "mode":"raw",
    "size":17179869184,
    "blockdev":"pmem0"
  }
]

As described above, unconfigured storage shows up in "raw" mode. In this example, there are two interleave sets, each 16GiB in size. The namespace identifier is used to identify the NVDIMM for ndctl operations.

Configuring PMEM for use as a Block Device (Legacy Mode)

To use persistent memory as a fast block device, configure the namespace in sector mode:

# ndctl create-namespace -f -e namespace1.0 -m sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "size":17162027008,
  "uuid":"029caa76-7be3-4439-8890-9c2e374bcc76",
  "sector_size":4096,
  "blockdev":"pmem1s"
}

In this example, namespace1.0 was reconfigured to sector mode.
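To confirm the change, 'ndctl list' can be run again. (The output below is illustrative rather than captured from a real system; exact sizes and UUIDs will differ.)

# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"sector",
    "size":17162027008,
    "uuid":"029caa76-7be3-4439-8890-9c2e374bcc76",
    "sector_size":4096,
    "blockdev":"pmem1s"
  },
  {
    "dev":"namespace0.0",
    "mode":"raw",
    "size":17179869184,
    "blockdev":"pmem0"
  }
]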
Notice that the block device name changed from pmem1 to pmem1s. This device can be used the same way as any other block device on the system: it can be partitioned, you can create a file system on it, it can be configured as part of a software RAID set, it can be the cache device for dm-cache, and so on.

Configuring PMEM for Direct Access (DAX)

Direct access requires the namespace to be configured in memory mode:

# ndctl create-namespace -f -e namespace0.0 -m memory -M mem
{
  "dev":"namespace0.0",
  "mode":"memory",
  "size":17177772032,
  "uuid":"e6944638-46aa-4e06-a722-0b3f16a5acbf",
  "blockdev":"pmem0"
}

Here, we've converted namespace0.0 to a memory mode namespace. The "-M mem" argument instructs ndctl to place the operating system data structures used for DMA in system DRAM. In order to perform DMA, the kernel requires a data structure for each page in the memory region. The overhead of this data structure is 64 bytes per 4KiB page. For small devices, the overhead easily fits in DRAM (for example, this 16GiB namespace requires only 256MiB for page structures). Given that the NVDIMM is small (and expensive), it makes sense to store the kernel's page tracking data structures in DRAM. That is what "-M mem" indicates.

Future NVDIMM devices could be TiBs in size. For those devices, the amount of memory required to store the page tracking data structures may exceed the amount of DRAM in the system (1TiB of pmem requires 16GiB just for page structures). In such cases, it makes more sense to specify "-M dev", which stores the data structures in the persistent memory itself.

After the namespace is configured in memory mode, it is ready for a file system. In RHEL 7.3, both ext4 and xfs have been modified to support persistent memory as a Technology Preview. File system creation requires no special arguments. However, in order to get DAX functionality, the file system must be mounted with the "dax" mount option:

# mkfs -t xfs /dev/pmem0
# mount -o dax /dev/pmem0 /mnt/pmem/

And that's it! Applications wishing to make use of pmem can now create files in /mnt/pmem, open them, and mmap them for direct access.

Note that there are constraints that must be met when creating partitions on a pmem device that will be used for direct access. Partitions must be aligned on page boundaries. On Intel platforms, that means at least 4KiB alignment for the start and end of the partition, but 2MiB is the preferred alignment. By default, parted aligns partitions on 1MiB boundaries.
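For illustration, here is one way to request a 2MiB-aligned partition explicitly with parted. This is a sketch rather than part of the draft above; the device name, label type, and partition size are placeholders.

# parted -s /dev/pmem0 mklabel gpt
# parted -s /dev/pmem0 mkpart primary 2MiB 100%
# parted /dev/pmem0 unit MiB print

Starting the partition at an explicit 2MiB offset keeps it on a 2MiB boundary, and printing the partition table in MiB units makes it easy to verify the start offset.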
Hi Jeff and Marek,

Could you add the following topics to the document?

- How to configure NVDIMM:
  - to use it like a usual HDD/SSD
  - to use it as Device DAX
- NVML usage. I think the URL of the upstream documentation should be included in the document.
- Troubleshooting. For example, when an NVDIMM is broken, how to replace the broken NVDIMM.

Regards,
Masayoshi Mizuma
(In reply to fj-lsoft-kernel-it from comment #19)
> Hi Jeff and Marek,
>
> Could you add the following topics to the document?
>
> - How to configure NVDIMM:
>   - to use it like a usual HDD/SSD
>   - to use it as Device DAX

This will be included.

> - NVML usage. I think the URL of the upstream documentation should be
>   included in the document.

I'm not sure what our policy is for linking to external documentation, but I will look into it.

> - Troubleshooting. For example, when an NVDIMM is broken, how to
>   replace the broken NVDIMM.

I will see what we can document in a vendor-agnostic way, but parts of this process will be vendor specific.