Bug 2156536

Summary: [RHEL9.2] nvme_user_space_driver Bus error (core dumped) when installing guest
Product: Red Hat Enterprise Linux 9
Component: qemu-kvm
qemu-kvm sub component: NVMe
Version: 9.2
Hardware: aarch64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: low
Keywords: Triaged
Target Milestone: rc
Target Release: ---
Flags: pm-rhel: mirror+
Reporter: Zhenyu Zhang <zhenyzha>
Assignee: Eric Auger <eric.auger>
QA Contact: Zhenyu Zhang <zhenyzha>
CC: alexander.lougovski, alex.williamson, aliang, cohuck, coli, eric.auger, gshan, jinzhao, juzhang, kwolf, lijin, sgarzare, stefanha, vgoyal, virt-maint, xuwei, yanghliu, yihyu, zhguo
Doc Type: If docs needed, set a value
Clone Of:
: 2158686 (view as bug list)
Bug Blocks: 2158686
Last Closed: 2023-02-09 08:37:27 UTC

Description Zhenyu Zhang 2022-12-27 11:34:37 UTC
Description of problem:
Installing a guest on an NVMe device through QEMU's NVMe userspace driver fails:
the guest hangs, and qemu-img subsequently hits a Bus error (core dumped).

Version-Release number of selected component (if applicable):
Test Environment:
Host Distro: RHEL-9.2.0-20221226.0
Host Kernel: 5.14.0-226.el9.aarch64
Qemu-kvm: qemu-kvm-7.2.0-2.el9.aarch64
Guest Kernel: 5.14.0-226.el9.aarch64
edk2: edk2-aarch64-20221207gitfff6d81270b5-1.el9.noarch

How reproducible:
100%

Steps to Reproduce:
1. Set up the host NVMe device for VFIO
[root@ampere-hr350a-07 ~]# modprobe vfio-pci
[root@ampere-hr350a-07 ~]# modinfo vfio-pci
filename:       /lib/modules/5.14.0-226.el9.aarch64/kernel/drivers/vfio/pci/vfio-pci.ko.xz
description:    VFIO PCI - User Level meta-driver
author:         Alex Williamson <alex.williamson>
license:        GPL v2
rhelversion:    9.2
srcversion:     A326814B1A0356BEFB22207
alias:          vfio_pci:v*d*sv*sd*bc*sc*i*
depends:        vfio-pci-core
intree:         Y
name:           vfio_pci
vermagic:       5.14.0-226.el9.aarch64 SMP preempt mod_unload modversions aarch64
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key
sig_key:        0D:4E:DB:4A:00:71:20:EC:E4:2A:9C:E4:91:88:40:D5:1A:DB:FC:A3
sig_hashalgo:   sha256
signature:      9B:4F:5A:25:FC:7F:37:2E:C0:20:19:1B:A5:E7:06:DF:D2:3C:89:B4:
		04:73:44:5C:F4:6B:0B:F4:A7:74:3B:E8:24:E3:AB:E2:0A:C1:34:B4:
		E1:25:F9:64:51:94:74:AA:EE:64:50:B2:3B:09:FF:4D:DA:1C:9D:AF:
		58:2B:84:96:03:05:8F:91:F8:94:9F:79:4B:09:A9:E5:79:5E:B2:08:
		B9:AF:BF:F9:61:C2:F9:EC:A3:38:F3:9E:30:BE:96:04:40:1B:08:8C:
		0C:AF:22:1A:EE:FF:9A:AC:5E:90:E4:AC:97:37:06:DB:E2:A2:2F:38:
		32:2E:09:0A:3A:DE:82:86:AE:10:2D:63:35:95:9B:E4:3C:FA:DB:C7:
		4E:E8:9D:96:D0:47:25:2F:8C:5A:78:05:6F:77:3D:84:7D:D0:02:7D:
		18:6F:84:1B:95:3B:2E:AE:FC:B7:FF:55:6E:12:11:AC:5C:3E:D3:B6:
		D9:B1:E2:CB:C2:02:37:AA:B8:F2:47:4F:83:11:13:94:5B:6D:5D:F5:
		88:7B:96:34:41:F1:07:B3:FC:F1:E7:FF:52:A0:6B:85:19:CA:DE:E5:
		0C:EE:77:F6:BB:B7:6E:70:11:A6:AF:A8:EE:64:6B:B6:A1:61:1A:BA:
		4D:D1:D4:DC:F8:98:B1:88:7B:B5:8B:5D:A2:17:99:B6:D1:CC:7A:49:
		19:71:F7:7F:2C:F7:1D:9C:6B:65:F7:37:87:D0:31:0A:2A:52:26:50:
		59:31:8C:13:9A:9F:E0:75:0F:BA:45:54:AD:AE:65:D8:04:55:88:FA:
		19:0D:94:E2:6F:94:65:8A:57:AB:97:D7:E9:5A:AE:B4:64:37:17:3D:
		C7:2F:FC:E4:B9:CE:FF:B4:E0:47:99:51:53:22:AC:82:5C:80:BB:6B:
		D3:D0:41:B5:37:F9:3D:D6:57:AC:95:A6:BD:34:71:FD:C1:8D:EB:07:
		87:8E:AC:67:9B:69:57:EF:94:4F:81:7C:93:F0:DC:82:31:D1:79:43:
		BD:FA:21:6B
parm:           ids:Initial PCI IDs to add to the vfio driver, format is "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" and multiple comma separated entries can be specified (string)
parm:           nointxmask:Disable support for PCI 2.3 style INTx masking.  If this resolves problems for specific devices, report lspci -vvvxxx to linux-pci.org so the device can be fixed automatically via the broken_intx_masking flag. (bool)
parm:           disable_idle_d3:Disable using the PCI D3 low power state for idle, unused devices (bool)
parm:           enable_sriov:Enable support for SR-IOV configuration.  Enabling SR-IOV on a PF typically requires support of the userspace PF driver, enabling VFs without such support may result in non-functional VFs or PF. (bool)
parm:           disable_denylist:Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users. (bool)
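
As the ids parameter description above suggests, the vendor:device pair (identified with lspci -n below) can alternatively be passed at module load time instead of through new_id (an alternative sketch):

# modprobe vfio-pci ids=8086:0a54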

# lspci | grep -i 'Non-Volatile'
0004:01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]
0005:01:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller]

# lspci -n -s 0005:01:00.0
0005:01:00.0 0108: 8086:0a54

# echo 0005:01:00.0 > /sys/bus/pci/devices/0005:01:00.0/driver/unbind
# echo 8086 0a54 > /sys/bus/pci/drivers/vfio-pci/new_id
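
After echoing the vendor/device ID into new_id, it is worth confirming that the controller actually rebound (a quick check, assuming the usual sysfs layout):

# readlink /sys/bus/pci/devices/0005:01:00.0/driver

The symlink should now end in .../drivers/vfio-pci rather than .../drivers/nvme.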

2. Create a qcow2 image on the NVMe device
[root@ampere-hr350a-07 ~]# /usr/bin/qemu-img create -f qcow2 nvme://0005:01:00.0/1 20G
Formatting 'nvme://0005:01:00.0/1', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=21474836480 lazy_refcounts=off refcount_bits=16
[root@ampere-hr350a-07 ~]# qemu-img info -U nvme://0005:01:00.0/1
image: nvme://0005:01:00.0/1
file format: qcow2
virtual size: 20 GiB (21474836480 bytes)
disk size: unavailable
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
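
The nvme:// URI has the form nvme://<domain:bus:slot.function>/<namespace>. As an extra sanity check, the image can be read back through the same userspace driver with qemu-io (a sketch; nothing else may hold the device open at the same time):

# qemu-io -f qcow2 -c 'read 0 64k' nvme://0005:01:00.0/1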

3. Install a guest on the image
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1'  \
-sandbox on  \
-blockdev node-name=file_aavmf_code,driver=file,filename=/usr/share/edk2/aarch64/QEMU_EFI-silent-pflash.raw,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_code,driver=raw,read-only=on,file=file_aavmf_code \
-blockdev node-name=file_aavmf_vars,driver=file,filename=/root/avocado/data/avocado-vt/avocado-vt-vm1_rhel920-aarch64_nvme_direct_raw_VARS.fd,auto-read-only=on,discard=unmap \
-blockdev node-name=drive_aavmf_vars,driver=raw,read-only=off,file=file_aavmf_vars \
-machine virt,gic-version=host,memory-backend=mem-machine_mem,pflash0=drive_aavmf_code,pflash1=drive_aavmf_vars \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
-device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0  \
-nodefaults \
-device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 \
-device virtio-gpu-pci,bus=pcie-root-port-1,addr=0x0 \
-m 253952 \
-object '{"qom-type": "memory-backend-ram", "size": 266287972352, "id": "mem-machine_mem"}'  \
-smp 16,maxcpus=16,cores=8,threads=1,clusters=1,sockets=2  \
-cpu 'host' \
-chardev socket,wait=off,server=on,path=/var/tmp/monitor-qmpmonitor1,id=qmp_id_qmpmonitor1  \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,wait=off,server=on,path=/var/tmp/monitor-catch_monitor,id=qmp_id_catch_monitor  \
-mon chardev=qmp_id_catch_monitor,mode=control  \
-serial unix:'/var/tmp/serial-serial0',server=on,wait=off \
-device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
-device qemu-xhci,id=usb1,bus=pcie-root-port-2,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev '{"node-name": "nvme_image1", "driver": "nvme", "auto-read-only": true, "discard": "unmap", "device": "0005:01:00.0", "namespace": 1, "cache": {"direct": false, "no-flush": false}}' \
-blockdev '{"node-name": "drive_image1", "driver": "raw", "read-only": false, "cache": {"direct": false, "no-flush": false}, "file": "nvme_image1"}' \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device '{"driver": "virtio-blk-pci", "id": "image1", "drive": "drive_image1", "bootindex": 1, "write-cache": "off", "bus": "pcie-root-port-3", "addr": "0x0"}' \
-device pcie-root-port,id=pcie-root-port-4,port=0x4,addr=0x1.0x4,bus=pcie.0,chassis=5 \
-device virtio-net-pci,mac=9a:91:18:a0:16:a9,rombar=0,id=idsTPBvn,netdev=idsuVgf9,bus=pcie-root-port-4,addr=0x0  \
-netdev tap,id=idsuVgf9,vhost=on \
-device pcie-root-port,id=pcie-root-port-5,port=0x5,addr=0x1.0x5,bus=pcie.0,chassis=6 \
-device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-5", "addr": "0x0"}' \
-blockdev '{"node-name": "file_cd1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/iso/linux/RHEL-9.2.0-20221227.0-aarch64-dvd1.iso", "cache": {"direct": true, "no-flush": false}}' \
-blockdev '{"node-name": "drive_cd1", "driver": "raw", "read-only": true, "cache": {"direct": true, "no-flush": false}, "file": "file_cd1"}' \
-device '{"driver": "scsi-cd", "id": "cd1", "drive": "drive_cd1", "bootindex": 2, "write-cache": "on"}' \
-blockdev '{"node-name": "file_unattended", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/rhel920-aarch64/ks.iso", "cache": {"direct": true, "no-flush": false}}' \
-blockdev '{"node-name": "drive_unattended", "driver": "raw", "read-only": true, "cache": {"direct": true, "no-flush": false}, "file": "file_unattended"}' \
-device '{"driver": "scsi-cd", "id": "unattended", "drive": "drive_unattended", "bootindex": 3, "write-cache": "on"}'  \
-kernel '/home/kvm_autotest_root/images/rhel920-aarch64/vmlinuz'  \
-append 'inst.sshd inst.repo=cdrom inst.ks=cdrom:/ks.cfg net.ifnames=0 console=ttyAMA0,38400'  \
-initrd '/home/kvm_autotest_root/images/rhel920-aarch64/initrd.img'  \
-vnc :20  \
-rtc base=utc,clock=host,driftfix=slew  \
-no-shutdown \
-enable-kvm \
-monitor stdio 
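
For readers who only care about the userspace NVMe part, the command line above reduces to this fragment (a sketch; cache and write-cache options abbreviated, all other devices as in the full command):

-blockdev '{"node-name": "nvme_image1", "driver": "nvme", "device": "0005:01:00.0", "namespace": 1}' \
-blockdev '{"node-name": "drive_image1", "driver": "raw", "file": "nvme_image1"}' \
-device virtio-blk-pci,drive=drive_image1,bootindex=1,bus=pcie-root-port-3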

Actual results:
The guest hangs during installation:
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel: INFO: task dd:2693 blocked for more than 122 seconds.
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:       Not tainted 5.14.0-226.el9.aarch64 #1
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel: task:dd              state:D stack:    0 pid: 2693 ppid:  2686 flags:0x00000204
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel: Call trace:
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __switch_to+0xc8/0x110
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __schedule+0x220/0x520
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  schedule+0x60/0xe0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  io_schedule+0x40/0x60
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  bit_wait_io+0x1c/0x6c
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __wait_on_bit+0x9c/0x190
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  out_of_line_wait_on_bit+0x84/0xb0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __block_write_begin_int+0x218/0x4cc
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  block_write_begin+0x54/0xa0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  blkdev_write_begin+0x30/0x40
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  generic_perform_write+0xac/0x1e0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __generic_file_write_iter+0xe4/0x1a0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  blkdev_write_iter+0xfc/0x180
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  new_sync_write+0xd4/0x160
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  vfs_write+0x1a0/0x1ec
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  ksys_write+0x6c/0xf4
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  __arm64_sys_write+0x20/0x2c
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  invoke_syscall.constprop.0+0x7c/0xd0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  el0_svc_common.constprop.0+0x154/0x160
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  do_el0_svc+0x2c/0xb0
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  el0_svc+0x38/0x190
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  el0t_64_sync_handler+0x9c/0x120
2022-12-27 02:07:47: 192.168.122.209 Dec 27 06:10:45 localhost.localdomain kernel:  el0t_64_sync+0x174/0x178

Then qemu-img on the host hits a Bus error (core dumped):
[root@ampere-hr350a-07 home]# echo "/home/core" > /proc/sys/kernel/core_pattern
[root@ampere-hr350a-07 home]# ulimit -c unlimited
[root@ampere-hr350a-07 home]# qemu-img info nvme://0005:01:00.0/1
Bus error (core dumped)
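
With the core pattern set above, the dump lands in /home/core; a backtrace can then be pulled out with gdb (assuming matching debuginfo for qemu-img is installed):

# gdb -batch -ex bt /usr/bin/qemu-img /home/core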

Comment 18 Stefan Hajnoczi 2023-01-10 18:49:33 UTC
(Just out of interest, this is not high priority.)

SIGBUS from this line:

  cap = le64_to_cpu(regs->cap);

This 64-bit field is 4KB aligned:

  regs = 0xffff87629000

  typedef struct QEMU_PACKED NvmeBar {
      uint64_t    cap;
      ...

The mapping function call looks okay to me as well:

    regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
                                 PROT_READ | PROT_WRITE, errp);

I'm not sure why there is a SIGBUS.

Eric: Any ideas how the VFIO BAR mapping could raise a SIGBUS on a memory load?

Comment 20 Eric Auger 2023-01-16 07:46:57 UTC
(In reply to Stefan Hajnoczi from comment #18)
> (Just out of interest, this is not high priority.)
> 
> SIGBUS from this line:
> 
>   cap = le64_to_cpu(regs->cap);
> 
> This 64-bit field is 4KB aligned:
> 
>   regs = 0xffff87629000
> 
>   typedef struct QEMU_PACKED NvmeBar {
>       uint64_t    cap;
>       ...
> 
> The mapping function call looks okay to me as well:
> 
>     regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
>                                  PROT_READ | PROT_WRITE, errp);
> 
> I'm not sure why there is a SIGBUS.
> 
> Eric: Any ideas how the VFIO BAR mapping could raise a SIGBUS on a memory
> load?

Stefan, one case that comes to mind is if memory access is disabled on the PCI device. We hit that case on arm in the past, but I don't remember the exact circumstances.

https://lists.archive.carbon60.com/linux/kernel/3714001
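
One quick way to check that theory from the host is to look at the Memory Space Enable bit (bit 1) of the device's COMMAND register with setpci (a diagnostic sketch only; vfio manages this bit itself while it owns the device):

# setpci -s 0005:01:00.0 COMMAND

If 0x2 is clear in the returned value, MMIO loads through the mapped BAR can fault; the bit can be set back with a masked write:

# setpci -s 0005:01:00.0 COMMAND=0x2:0x2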

Comment 26 Eric Auger 2023-02-09 08:12:34 UTC
Hi Zhenyu,
sure, please close it. It would be best to run this kind of test as part of our VFIO test plan.
Thanks

Eric

Comment 27 Zhenyu Zhang 2023-02-09 08:37:27 UTC
OK, we will keep this test scenario on the aarch64 platform.

Closing this bug as CURRENTRELEASE, per comments 24-26.