Description of problem:
Boot guest with nvdimm device backed by /dev/dax on host, guest call trace.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-1.el7
kernel-3.10.0-891.el7.x86_64

How reproducible:
always

Steps to Reproduce:
1. Specify memmap=4G!2G on host kernel command line, reboot host
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-891.el7.x86_64 root=/dev/mapper/rhel_dell--per730--39-root ro rd.lvm.lv=rhel_dell-per730-39/root rd.lvm.lv=rhel_dell-per730-39/swap console=ttyS0,115200n81 LANG=en_US.UTF-8 memmap=4G!2G

2. On host
# ls -lh /dev/pmem*
brw-rw---- 1 root disk 259, 0 May 22 01:42 /dev/pmem0
brw-rw---- 1 root disk 259, 1 May 22 01:42 /dev/pmem4
brw-rw---- 1 root disk 259, 2 May 22 01:42 /dev/pmem5

# ndctl list
[
  {
    "dev":"namespace5.0",
    "mode":"memory",
    "size":2147483648,
    "blockdev":"pmem5",
    "numa_node":0
  },
  {
    "dev":"namespace4.0",
    "mode":"memory",
    "size":13565952,
    "blockdev":"pmem4",
    "numa_node":0
  },
  {
    "dev":"namespace0.0",
    "mode":"memory",
    "size":268435456,
    "blockdev":"pmem0",
    "numa_node":0
  }
]

# ndctl create-namespace -m dax -e namespace5.0 -f -v -a 4096
{
  "dev":"namespace5.0",
  "mode":"dax",
  "size":"2014.00 MiB (2111.83 MB)",
  "uuid":"8435e12d-af6a-4f38-8b41-c721f8e1e82f",
  "daxregion":{
    "id":5,
    "size":"2014.00 MiB (2111.83 MB)",
    "align":4096,
    "devices":[
      {
        "chardev":"dax5.0",
        "size":"2014.00 MiB (2111.83 MB)"
      }
    ]
  },
  "numa_node":0
}

3. Boot guest with /dev/dax5.0
# /usr/libexec/qemu-kvm -m 10G,slots=20,maxmem=40G -smp 32 -M pc,nvdimm\
    -object memory-backend-file,mem-path=/dev/dax5.0,size=2G,id=mem0,share=on\
    -device nvdimm,id=dimm0,memdev=mem0 \
    rhel76-64-virtio-scsi.qcow2 -monitor stdio -vnc :0 \
    -netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
    -serial unix:/tmp/console,server,nowait

Actual results:
Guest call trace:
[ 29.349612] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 23s! [systemd-udevd:848]
[ 29.355097] Modules linked in: nd_pmem dax_pmem nd_btt device_dax ppdev sg i2c_piix4 joydev pcspkr parport_pc parport nfit libnvdimm ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic sr_mod cdrom crct10dif_common ata_generic pata_acpi bochs_drm drm_kms_helper 8021q garp mrp stp llc syscopyarea sysfillrect sysimgblt virtio_net fb_sys_fops ttm drm ata_piix scsi_transport_iscsi libata virtio_pci virtio_ring serio_raw floppy virtio dm_mirror dm_region_hash dm_log dm_mod
[ 29.364899] CPU: 16 PID: 848 Comm: systemd-udevd Not tainted 3.10.0-891.el7.x86_64 #1
[ 29.366832] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
[ 29.368469] task: ffff92ce6a2d5ee0 ti: ffff92ce68350000 task.ti: ffff92ce68350000
[ 29.370299] RIP: 0010:[<ffffffffabb5e6ad>] [<ffffffffabb5e6ad>] memcpy+0xd/0x110
[ 29.372221] RSP: 0018:ffff92ce68353918 EFLAGS: 00010246
[ 29.373809] RAX: ffff92ce66356000 RBX: fffff954ca98d580 RCX: 0000000000000200
[ 29.375635] RDX: 0000000000000000 RSI: ffffb4653fff0000 RDI: ffff92ce66356000
[ 29.377466] RBP: ffff92ce68353998 R08: 0000000000000000 R09: 00000000003fff80
[ 29.379328] R10: ffff92ce66356000 R11: 0000000000000000 R12: ffff92ce682f92d8
[ 29.381184] R13: 000000007fff0000 R14: 0000000000001000 R15: 0000000000000000
[ 29.382982] FS: 00007fa55b0d08c0(0000) GS:ffff92ce75200000(0000) knlGS:0000000000000000
[ 29.385008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 29.386701] CR2: 000000008a794010 CR3: 00000002a8214000 CR4: 00000000000006e0
[ 29.388535] Call Trace:
[ 29.389809] [<ffffffffc061a1c9>] ? pmem_do_bvec+0x99/0x2a0 [nd_pmem]
[ 29.391517] [<ffffffffc061a739>] pmem_rw_page+0x39/0x60 [nd_pmem]
[ 29.393454] [<ffffffffaba5bb1f>] bdev_read_page+0x7f/0xa0
[ 29.397394] [<ffffffffaba6190f>] do_mpage_readpage+0x57f/0x740
[ 29.400190] [<ffffffffaba5b730>] ? I_BDEV+0x10/0x10
[ 29.402738] [<ffffffffab9971ae>] ? __add_to_page_cache_locked+0xee/0x190
[ 29.405571] [<ffffffffab9a5ede>] ? lru_cache_add+0xe/0x10
[ 29.408168] [<ffffffffaba61be2>] mpage_readpages+0x112/0x170
[ 29.410815] [<ffffffffaba5b730>] ? I_BDEV+0x10/0x10
[ 29.413075] [<ffffffffaba5b730>] ? I_BDEV+0x10/0x10
[ 29.414720] [<ffffffffaba5bfed>] blkdev_readpages+0x1d/0x20
[ 29.416311] [<ffffffffab9a3daf>] __do_page_cache_readahead+0x1cf/0x260
[ 29.418077] [<ffffffffab9a4349>] force_page_cache_readahead+0x99/0xe0
[ 29.419778] [<ffffffffab9a4427>] page_cache_sync_readahead+0x97/0xb0
[ 29.421508] [<ffffffffab998052>] generic_file_aio_read+0x2c2/0x790
[ 29.423191] [<ffffffffaba5c42c>] blkdev_aio_read+0x4c/0x70
[ 29.424777] [<ffffffffaba1cd93>] do_sync_read+0x93/0xe0
[ 29.426699] [<ffffffffaba1d7bf>] vfs_read+0x9f/0x170
[ 29.428871] [<ffffffffaba1e68f>] SyS_read+0x7f/0xf0
[ 29.430395] [<ffffffffabf336e1>] ? system_call_after_swapgs+0xae/0x146
[ 29.432084] [<ffffffffabf33795>] system_call_fastpath+0x1c/0x21
[ 29.433595] [<ffffffffabf336e1>] ? system_call_after_swapgs+0xae/0x146
[ 29.435232] Code: d3 ff 0f ae e8 0f 31 48 c1 e2 20 89 c0 48 09 c2 48 31 d3 e9 7b ff ff ff 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c

Expected results:
Guest boot successfully.

Additional info:
QE found that if the backend size is set to <=2014M on the QEMU command line, the guest works well without a call trace.

    -object memory-backend-file,mem-path=/dev/dax5.0,size=*2014M*,id=mem0,share=on\
    -device nvdimm,id=dimm0,memdev=mem0

Apparently, the size of the device changed after the mode was changed in step 2.

> {
>   "dev":"namespace5.0",
>   "mode":"memory",
>   "size":2147483648,     ------> 2G
>   "blockdev":"pmem5",
>   "numa_node":0
> },

> {
>   "dev":"namespace5.0",
>   "mode":"dax",
>   "size":"2014.00 MiB (2111.83 MB)",    -------> 2014M
>   "uuid":"8435e12d-af6a-4f38-8b41-c721f8e1e82f",
>   "daxregion":{
>     "id":5,
>     "size":"2014.00 MiB (2111.83 MB)",
>     "align":4096,
>     "devices":[
>       {
>         "chardev":"dax5.0",
>         "size":"2014.00 MiB (2111.83 MB)"
>       }
>     ]
>   },
>   "numa_node":0
> }

So the case is: when the size set for the nvdimm backend is larger than the devdax device (2G requested, 2014M available), the guest shows a call trace and hits a soft lockup.
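The size mismatch above can be checked with a little arithmetic. The following is only a minimal sketch, not part of QEMU or ndctl: the helper names `parse_ndctl_size` and `backend_size_fits` are hypothetical, and it assumes the human-readable size format shown in the create-namespace output ("2014.00 MiB (2111.83 MB)"). Because that format is rounded to two decimals, the result may be slightly off for sizes that are not a whole number of MiB.

```python
import re

# Binary unit multipliers, matching the "MiB"-style units in the output above.
_UNITS = {"KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30, "TiB": 1 << 40}


def parse_ndctl_size(size):
    """Return bytes from an int or a string like '2014.00 MiB (2111.83 MB)'.

    Hypothetical helper: only the leading binary-unit value is parsed.
    """
    if isinstance(size, int):
        return size
    m = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(KiB|MiB|GiB|TiB)\b", size)
    if m is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def backend_size_fits(requested_bytes, device_size):
    """True if the requested backend size does not exceed the device size."""
    return requested_bytes <= parse_ndctl_size(device_size)


dax_size = "2014.00 MiB (2111.83 MB)"   # devdax size from step 2
requested = 2 * (1 << 30)               # size=2G from step 3

print(parse_ndctl_size(dax_size))              # 2111832064
print(requested - parse_ndctl_size(dax_size))  # 35651584 (34 MiB too large)
print(backend_size_fits(requested, dax_size))  # False
```

With size=2014M the check passes, which matches what QE observed.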
Two separate items:

1. Unfortunately QEMU cannot easily detect the correct size value or even display an error if the size is incorrect. I will let Intel engineers know about this issue so they can decide how to solve it.

2. End users must run QEMU via libvirt. For RHEL we should document that "ndctl list" must be checked for the exact size of the devdax character device. This way users can avoid hitting this soft lockup.
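As an illustration of item 2, the exact size could be taken from the ndctl JSON and pasted into the backend definition. This is a rough sketch under stated assumptions: the function name `devdax_backend_option` is hypothetical, the sample JSON is a made-up combination of the two outputs in this bug (dax-mode structure, with sizes as integer bytes as in the plain "ndctl list" output), and it assumes QEMU reads a size= value without a suffix as bytes.

```python
import json


def devdax_backend_option(ndctl_json, chardev, backend_id="mem0"):
    """Build a memory-backend-file option string sized to a devdax device.

    Hypothetical helper. Expects ndctl-style JSON where each dax-mode
    namespace has a "daxregion" with a "devices" list, and sizes are
    integer byte counts.
    """
    namespaces = json.loads(ndctl_json)
    if isinstance(namespaces, dict):      # a single namespace object
        namespaces = [namespaces]
    for ns in namespaces:
        for dev in ns.get("daxregion", {}).get("devices", []):
            if dev.get("chardev") != chardev:
                continue
            size = dev["size"]
            if not isinstance(size, int):
                raise ValueError(f"size is not in bytes: {size!r}")
            return (f"memory-backend-file,mem-path=/dev/{chardev},"
                    f"size={size},id={backend_id},share=on")
    raise LookupError(f"{chardev} not found in ndctl output")


# Assumed sample: structure from this bug, sizes rewritten as bytes.
SAMPLE = """
[
  {
    "dev": "namespace5.0",
    "mode": "dax",
    "size": 2111832064,
    "daxregion": {
      "id": 5,
      "size": 2111832064,
      "align": 4096,
      "devices": [ { "chardev": "dax5.0", "size": 2111832064 } ]
    },
    "numa_node": 0
  }
]
"""

print("-object " + devdax_backend_option(SAMPLE, "dax5.0"))
```

For libvirt users the same byte count would go into the domain XML rather than a command line; that mapping is not shown here.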
Upstream discussion on querying the size of devdax devices so that QEMU can refuse incorrect values: https://lists.gnu.org/archive/html/qemu-devel/2019-02/msg00523.html
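For illustration only, one way such a size query could look from userspace is sketched below. This is not the code from the upstream thread. It assumes the kernel exposes the device size in bytes at /sys/dev/char/MAJOR:MINOR/size for devdax character devices; the function names and the `sysfs_root` parameter are hypothetical.

```python
import os
import stat


def devdax_size_from_sysfs(major, minor, sysfs_root="/sys"):
    """Read the size attribute for a character device from sysfs.

    Assumption: <sysfs_root>/dev/char/<major>:<minor>/size holds the
    device size in bytes as a decimal string.
    """
    path = os.path.join(sysfs_root, "dev", "char", f"{major}:{minor}", "size")
    with open(path) as f:
        return int(f.read().strip())


def devdax_size(dev_path, sysfs_root="/sys"):
    """Return the size in bytes of a devdax device such as /dev/dax5.0."""
    st = os.stat(dev_path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{dev_path} is not a character device")
    return devdax_size_from_sysfs(os.major(st.st_rdev),
                                  os.minor(st.st_rdev),
                                  sysfs_root)


def check_backend_size(dev_path, requested_bytes):
    """Raise if the requested backend size exceeds the device size."""
    actual = devdax_size(dev_path)
    if requested_bytes > actual:
        raise ValueError(
            f"size {requested_bytes} exceeds {dev_path} size {actual}")


if __name__ == "__main__" and os.path.exists("/dev/dax5.0"):
    # Only meaningful on a host set up as in this bug.
    check_backend_size("/dev/dax5.0", 2 * (1 << 30))
```

A check like this would let the management layer reject size=2G up front instead of the guest hitting the soft lockup.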
Hi Ademar,

May I know why this bz was moved to RHEL-AV? We already have bug1669053 for fast train. This bug was reported on rhel7; do you mean it won't be fixed for rhel7?

Thanks.