Bug 1542494 - VMs with NVMe devices passed through sometimes fail to be launched
Summary: VMs with NVMe devices passed through sometimes fail to be launched
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: pre-dev-freeze
Assignee: Alex Williamson
QA Contact: CongLi
URL:
Whiteboard: aos-scalability-39
Keywords: Reopened
Depends On:
Blocks:
 
Reported: 2018-02-06 13:03 UTC by jmencak
Modified: 2018-10-30 08:38 UTC (History)
31 users

Clone Of:
Last Closed: 2018-10-30 08:37:27 UTC


Attachments (Terms of Use)
Nova compute log file from the physical host, where the instance failed to launch. (294.37 KB, text/plain)
2018-02-06 13:03 UTC, jmencak
no flags
lspci -vvv of "Samsung Electronics Co Ltd NVMe SSD Controller" (3.84 KB, text/plain)
2018-02-10 10:22 UTC, jmencak
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3083 None None None 2018-10-30 08:38 UTC

Description jmencak 2018-02-06 13:03:01 UTC
Created attachment 1392047 [details]
Nova compute log file from the physical host, where the instance failed to launch.

Description of problem:
On OSP11, VMs with NVMe devices passed through sometimes fail to be launched.

Version-Release number of selected component (if applicable):
OSP11

How reproducible:
Always 

Steps to Reproduce:
1. Create a small OSP11 environment with at least one baremetal host that allows NVMe disk passthrough.
2. Create an openstack server/VM with the NVMe device; this should work, and the VM should land on the baremetal host containing the passed-through NVMe device.
3. "openstack server delete" the newly created VM.
4. Repeat step 2.  This will likely fail.  Nova log attached (56700a77-c88e-40ce-9748-ab8373a06bf1).

Actual results:
The VM fails to be created.  A work-around seems to be a reboot of the physical host and a 5-10 minute wait before retrying.

Expected results:
The VM is created successfully with no workarounds.

Additional info:
[stack@b08-h02-r620 jm]$ openstack server show 56700a77-c88e-40ce-9748-ab8373a06bf1
+-------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                               | Value                                                                                                                                                 |
+-------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                                                                                                |
| OS-EXT-AZ:availability_zone         |                                                                                                                                                       |
| OS-EXT-SRV-ATTR:host                | None                                                                                                                                                  |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                                                                                                                                  |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000130b                                                                                                                                     |
| OS-EXT-STS:power_state              | NOSTATE                                                                                                                                               |
| OS-EXT-STS:task_state               | None                                                                                                                                                  |
| OS-EXT-STS:vm_state                 | error                                                                                                                                                 |
| OS-SRV-USG:launched_at              | None                                                                                                                                                  |
| OS-SRV-USG:terminated_at            | None                                                                                                                                                  |
| accessIPv4                          |                                                                                                                                                       |
| accessIPv6                          |                                                                                                                                                       |
| addresses                           |                                                                                                                                                       |
| config_drive                        |                                                                                                                                                       |
| created                             | 2018-02-06T12:08:17Z                                                                                                                                  |
| fault                               | {u'message': u'No valid host was found. There are not enough hosts available.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-          |
|                                     | packages/nova/conductor/manager.py", line 528, in build_instances\n    context, request_spec, filter_properties)\n  File "/usr/lib/python2.7/site-    |
|                                     | packages/nova/conductor/manager.py", line 599, in _schedule_instances\n    hosts = self.scheduler_client.select_destinations(context, spec_obj)\n     |
|                                     | File "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 371, in wrapped\n    return func(*args, **kwargs)\n  File "/usr/lib/python2.7   |
|                                     | /site-packages/nova/scheduler/client/__init__.py", line 51, in select_destinations\n    return self.queryclient.select_destinations(context,          |
|                                     | spec_obj)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/__init__.py", line 37, in __run_method\n    return getattr(self.instance,   |
|                                     | __name)(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/client/query.py", line 32, in select_destinations\n    return       |
|                                     | self.scheduler_rpcapi.select_destinations(context, spec_obj)\n  File "/usr/lib/python2.7/site-packages/nova/scheduler/rpcapi.py", line 129, in        |
|                                     | select_destinations\n    return cctxt.call(ctxt, \'select_destinations\', **msg_args)\n  File "/usr/lib/python2.7/site-                               |
|                                     | packages/oslo_messaging/rpc/client.py", line 169, in call\n    retry=self.retry)\n  File "/usr/lib/python2.7/site-                                    |
|                                     | packages/oslo_messaging/transport.py", line 97, in _send\n    timeout=timeout, retry=retry)\n  File "/usr/lib/python2.7/site-                         |
|                                     | packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send\n    retry=retry)\n  File "/usr/lib/python2.7/site-                                |
|                                     | packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send\n    raise result\n', u'created': u'2018-02-06T12:08:45Z'}                        |
| flavor                              | large-pci (a27f603b-ef23-4d17-8302-7c324b660711)                                                                                                      |
| hostId                              |                                                                                                                                                       |
| id                                  | 56700a77-c88e-40ce-9748-ab8373a06bf1                                                                                                                  |
| image                               | ocp-3.9-rhel (8a9cccd9-71b4-49e7-b500-b2ab4e1d92ff)                                                                                                   |
| key_name                            | id_rsa_perf                                                                                                                                           |
| name                                | jm-pci-pass-01                                                                                                                                        |
| project_id                          | 508d3035cc2b4c9f9d154302afd03ff2                                                                                                                      |
| properties                          |                                                                                                                                                       |
| status                              | ERROR                                                                                                                                                 |
| updated                             | 2018-02-06T12:08:50Z                                                                                                                                  |
| user_id                             | a042dcdcf9e84e78a9e96620814b1a4c                                                                                                                      |
| volumes_attached                    |                                                                                                                                                       |
+-------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+

Comment 1 jmencak 2018-02-08 14:51:37 UTC
Some additional logs from libvirtd:

./messages:Feb  6 12:05:42 overcloud-1029pcompute-1 libvirtd: 2018-02-06 12:05:42.743+0000: 3387: error : virPCIGetHeaderType:3115 : internal error: Unknown PCI header type '127'
./messages:Feb  6 12:05:42 overcloud-1029pcompute-1 libvirtd: 2018-02-06 12:05:42.744+0000: 3387: error : virHostdevReAttachPCIDevices:1054 : Failed to reset PCI device: internal error: Unknown PCI header type '127'
./messages:Feb  6 12:07:28 overcloud-1029pcompute-1 libvirtd: 2018-02-06 12:07:28.398+0000: 3386: error : virPCIGetHeaderType:3115 : internal error: Unknown PCI header type '127'

Comment 2 Daniel Berrange 2018-02-08 14:55:38 UTC
Based on that error message, it could well be a libvirt bug. At the least, we probably need someone who understands PCI to investigate it.

Comment 3 jmencak 2018-02-08 20:26:16 UTC
Trying to narrow this down a bit.  When "lspci -vn" for the NVMe device on the hypervisor shows:

01:00.0 0108: 144d:a804 (prog-if 02 [NVM Express])
        Subsystem: 144d:a801
        Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
        Memory at 9d100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158] Power Budgeting <?>
        Capabilities: [168] #19
        Capabilities: [188] Latency Tolerance Reporting
        Capabilities: [190] L1 PM Substates
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

OpenStack doesn't have a problem launching a VM with the NVMe device passed through.  This is the case after a fresh reboot.  When the VM is deleted (openstack server delete), this is what I see on the hypervisor "lspci -vn" for the NVMe device:

01:00.0 0108: 144d:a804 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

Any subsequent attempts to create a VM with the device in this state will fail.

Comment 5 Alex Williamson 2018-02-09 21:36:42 UTC
(In reply to jmencak from comment #1)
> Some additional logs from libvirtd:
> 
> ./messages:Feb  6 12:05:42 overcloud-1029pcompute-1 libvirtd: 2018-02-06
> 12:05:42.743+0000: 3387: error : virPCIGetHeaderType:3115 : internal error:
> Unknown PCI header type '127'
> ./messages:Feb  6 12:05:42 overcloud-1029pcompute-1 libvirtd: 2018-02-06
> 12:05:42.744+0000: 3387: error : virHostdevReAttachPCIDevices:1054 : Failed
> to reset PCI device: internal error: Unknown PCI header type '127'
> ./messages:Feb  6 12:07:28 overcloud-1029pcompute-1 libvirtd: 2018-02-06
> 12:07:28.398+0000: 3386: error : virPCIGetHeaderType:3115 : internal error:
> Unknown PCI header type '127'

Header type is an 8-bit field with bit 7 reserved for identifying multi-function devices and the remaining bits used for defining the header layout of the remainder of config space.  Anything above a value of 2 here is undefined.  A value of 127 indicates that -1 is read from PCI config space, and as with private comment 3, we can see that the device no longer responds to config accesses.  Often when this occurs, it means that the device failed to return after reset.  There's not enough information provided to determine what sort of reset vfio might be attempting on the device.  Please provide lspci -vvv for the device.  144d:a804 is a Samsung "NVMe SSD Controller SM961/PM961".  We have other reports of Samsung NVMe controllers also failing with assignment; see bug 1500825.  I have a Samsung EVO 960 NVMe en route for testing; I wish I'd seen this bug previously to opt specifically for an SM961/PM961 device.
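To illustrate the interpretation above, here is a minimal sketch (not libvirt's actual code) of decoding the header type byte at config-space offset 0x0E; the value 127 (0x7f) is what remains after masking the multi-function bit off the all-ones (0xff) value a non-responding device returns:

```python
# Illustrative sketch only -- not libvirt's implementation.
MULTI_FUNCTION_BIT = 0x80  # bit 7: device is multi-function
LAYOUT_MASK = 0x7F         # remaining bits: config-space header layout

def decode_header_type(byte_value):
    """Return (is_multi_function, layout); raise on undefined layouts."""
    multi = bool(byte_value & MULTI_FUNCTION_BIT)
    layout = byte_value & LAYOUT_MASK
    # Only layouts 0 (endpoint), 1 (PCI-PCI bridge), 2 (CardBus) are defined.
    if layout > 2:
        raise ValueError("Unknown PCI header type '%d'" % layout)
    return multi, layout

# A healthy single-function endpoint reads 0x00:
print(decode_header_type(0x00))   # (False, 0)

# A device that stopped responding returns 0xFF (-1) for every config read;
# masking off bit 7 leaves 0x7F = 127, matching libvirt's error message:
try:
    decode_header_type(0xFF)
except ValueError as e:
    print(e)                      # Unknown PCI header type '127'
```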

Comment 6 jmencak 2018-02-10 10:22 UTC
Created attachment 1394172 [details]
lspci -vvv of "Samsung Electronics Co Ltd NVMe SSD Controller"

Comment 7 jmencak 2018-02-10 11:10:35 UTC
Unchecking the private comment #3.

I've managed to work around the problem with the PCI device by first stopping the VMs with "openstack server stop <server_id>" and then "openstack server delete <server_id>".  That way the PCI device never enters the "!!! Unknown header type 7f" state.  Similarly, with "virsh shutdown <server_id>" I've never seen the PCI device enter the "failed" state; however, I did see it enter that state every time I simply ran "virsh destroy <server_id> --graceful".

However, this bug is likely about more than just the issues with the PCI device.  "openstack server delete <server_id>" can leave virsh instances on the computes either "shut off" or "running", depending on whether they were first stopped by "openstack server stop <server_id>" or not, while the nova database has already been updated with "vm_state" == "deleted" (which is clearly not the case).

Comment 8 jmencak 2018-02-10 20:12:47 UTC
Even though there are available hosts with valid/available PCI devices to be passed through (and the nova DB confirms that):

MariaDB [nova]> select compute_node_id, address, product_id, vendor_id, status from pci_devices;
+-----------------+--------------+------------+-----------+-----------+
| compute_node_id | address      | product_id | vendor_id | status    |
+-----------------+--------------+------------+-----------+-----------+
|              14 | 0000:01:00.0 | a804       | 144d      | allocated |
|              17 | 0000:01:00.0 | a804       | 144d      | available |
|              20 | 0000:01:00.0 | a804       | 144d      | allocated |
+-----------------+--------------+------------+-----------+-----------+
3 rows in set (0.00 sec)

pci_passthrough_filter.py can fail with

2018-02-10 20:00:21.122 235395 INFO nova.filters [req-ccb2389d-9e1c-4db7-8caa-0a3a9ec00812 a042dcdcf9e84e78a9e96620814b1a4c 508d3035cc2b4c9f9d154302afd03ff2 - - -] Filter PciPassthroughFilter
 returned 0 hosts
2018-02-10 20:00:21.124 235395 INFO nova.filters [req-ccb2389d-9e1c-4db7-8caa-0a3a9ec00812 a042dcdcf9e84e78a9e96620814b1a4c 508d3035cc2b4c9f9d154302afd03ff2 - - -] Filtering removed all hosts for the request with instance ID '83fd30cd-99ab-44d6-ba0c-760145bd43a8'. Filter results: ['RetryFilter: (start: 35, end: 35)', 'AvailabilityZoneFilter: (start: 35, end: 35)', 'RamFilter: (start: 35, end: 35)', 'DiskFilter: (start: 35, end: 35)', 'ComputeFilter: (start: 35, end: 17)', 'ComputeCapabilitiesFilter: (start: 17, end: 17)', 'ImagePropertiesFilter: (start: 17, end: 17)', 'ServerGroupAntiAffinityFilter: (start: 17, end: 17)', 'ServerGroupAffinityFilter: (start: 17, end: 17)', 'PciPassthroughFilter: (start: 17, end: 0)']

This is why I say there are likely more problems than just the issues with the PCI device itself.

Comment 9 Jiri Denemark 2018-02-12 17:04:42 UTC
This doesn't seem to be a libvirt bug given that lspci shows the same result. However, it's possible that libvirt does something it shouldn't do, or doesn't do something it should, which could cause this.

From comment 7 it looks like the way a domain is killed matters: virsh shutdown vs. virsh destroy.

What is "openstack server delete" doing? Is it just deleting the domain from openstack's database, or is it actually killing it in some way? Comment 7 suggests it doesn't actually kill the domain, but shouldn't the domain remain running in this case?

Comment 10 jmencak 2018-02-12 18:34:13 UTC
(In reply to Jiri Denemark from comment #9)
> This doesn't seem to be a libvirt bug given that lspci shows the same
> result.
Same result?  I submitted the "lspci -vvv" result of a functional/healthy PCI device.  Here is "lspci -vvv" of the "failed" PCI device, after "virsh destroy <server_id>":
[root@overcloud-1029pcompute-2 heat-admin]# lspci -vvv -s 0000:01:
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

> However, it's possible that libvirt does something it shouldn't do
> or doen't do something it should do which could cause this.
> 
> From comment 7 it looks like the way a domain is killed matters: virsh
> shutdown vs. virsh destroy.
Seems that way, that's why I wrote that.

> What is "openstack server delete" doing? Is it just deleting the domain from
> openstack's database or is it actually killing it in some way. Comment 7
> suggests it doesn't actually kill the domain, but shouldn't the domain
> remain running in this case?
Both of these questions need to be answered by an OpenStack developer.

Comment 11 Jiri Denemark 2018-02-13 09:17:19 UTC
(In reply to jmencak from comment #10)
> > This doesn't seem to be a libvirt bug given that lspci shows the same
> > result.
> Same result?  I submitted the "lspci -vvv" result of a functional/healthy
> PCI device.  Here is "lspci -vvv" of the "failed" PCI device, after "virsh
> destroy <server_id>":
> [root@overcloud-1029pcompute-2 heat-admin]# lspci -vvv -s 0000:01:
> 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD
> Controller SM961/PM961 (rev ff) (prog-if ff)
>         !!! Unknown header type 7f
>         Kernel driver in use: vfio-pci
>         Kernel modules: nvme

Exactly. Libvirt complains "Failed to reset PCI device: internal error: Unknown PCI header type '127'", which is consistent with what lspci reports. That is, both libvirt and lspci report the same result.

Comment 13 Laine Stump 2018-02-19 17:35:35 UTC
Note that even though libvirt is reporting "Failed to reset PCI device", it actually never would have tried to reset the device, even if it had seen a correct device type in the PCI config. Libvirt's virPCIDeviceReset() function is called unconditionally for every assigned PCI device, and it does several preliminary checks before deciding whether or not to reset the PCI device itself. It happens that the very first check examines the device type (in the device's PCI config in sysfs) to see if it's an endpoint device (if it's not an endpoint device, that's considered an error, since libvirt only assigns endpoint devices), and the *last* check is to see if the device is bound to vfio-pci and, if so, skip reset and return success (since vfio handles any necessary device resets as appropriate). So libvirt just doesn't get a chance to *skip* device reset.

libvirt could re-order these checks, looking for use of vfio-pci *first*, and avoid this one particular error, but that would just delay the failure report, not actually prevent it.
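The check ordering described above can be modeled abstractly (a hypothetical Python sketch, not libvirt's C code): with the endpoint check first, a dead device trips the header-type error before the vfio-pci short-circuit is ever reached, and reordering merely avoids that one error message:

```python
# Hypothetical model of the check ordering in virPCIDeviceReset()
# described above -- not libvirt's actual code.

def reset_current_order(dev):
    # First check: header type must identify an endpoint device (<= 2).
    if dev["header_type"] > 2:
        raise RuntimeError(
            "Failed to reset PCI device: internal error: "
            "Unknown PCI header type '%d'" % dev["header_type"])
    # ... other preliminary checks would go here ...
    # Last check: devices bound to vfio-pci skip the reset entirely,
    # since vfio performs any needed resets itself.
    if dev["driver"] == "vfio-pci":
        return "skipped"
    return "reset"

def reset_reordered(dev):
    # Reordered variant: check for vfio-pci *first*.
    if dev["driver"] == "vfio-pci":
        return "skipped"
    if dev["header_type"] > 2:
        raise RuntimeError(
            "Unknown PCI header type '%d'" % dev["header_type"])
    return "reset"

dead_vfio_dev = {"header_type": 127, "driver": "vfio-pci"}

# Current order: the endpoint check runs first and reports the error.
try:
    reset_current_order(dead_vfio_dev)
except RuntimeError as e:
    print(e)

# Reordered: this particular error is avoided, but the device is still
# broken, so the failure would just surface somewhere else later.
print(reset_reordered(dead_vfio_dev))  # skipped
```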

Also, in terms of interaction with vfio and the device, this happens after the qemu process has been terminated, and before any other device-related action by libvirt. If lspci doesn't show the invalid device type while the guest is running, then my suspicion is that the failure is caused by whatever action is taken (by qemu->vfio->kernel) as the qemu process is terminating. I would wager that the problem (invalid device type in PCI config after terminating the guest) can be reproduced with a straight qemu command, with no libvirt involved.

Comment 14 jmencak 2018-02-19 19:04:11 UTC
I can confirm that lspci doesn't show the invalid device type while the guest is running.

Comment 15 Laine Stump 2018-02-19 22:49:31 UTC
Comment 7 leads me to believe that the NVMe device may be somehow dependent on the guest OS properly shutting down / releasing the device (which of course should never be the case).

I think the best course of action is to send this BZ to the kernel, but before that I can think of one experiment that could conclusively rule out libvirt as the source/trigger of any misbehavior, and also possibly help to narrow down the scope of investigation for the kernel person who looks at it:

Once a guest is running: 1) stop libvirtd.service, then 2) manually kill the qemu process, then 3) look at lspci for the device. (Since libvirtd isn't running, any change will be due to qemu/vfio/kernel releasing the device from the guest.)

Comment 20 Alex Williamson 2018-06-01 16:18:35 UTC
There's no obvious way to make progress on this without a reproducer; the NVMe endpoint appears to be failing to return from reset.  This might be a hardware problem with the original device, or related to the topology of the original system.  Re-open if a new reproducer is found.

Comment 21 Alex Krzos 2018-07-27 14:38:02 UTC
(In reply to Alex Williamson from comment #20)
> There's no obvious way to make progress on this without a reproducer, the
> NVMe endpoint appears to be failing to return from reset.  This might be a
> hardware problem with the original device or related to the topology of the
> original system.  Re-open if a new reproducer is found.

Hey Alex,

I can reproduce the issue in our staging environment.  Most recently we deployed OSP13 with RHEL 7.5 and kernel 3.10.0-862.3.3.el7.x86_64 and still witnessed the issue.

libvirt/nova versions:
libvirt-3.9.0-14.el7_5.5.x86_64
libvirt-client-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-config-network-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-config-nwfilter-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-interface-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-lxc-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-network-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-secret-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.5.x86_64
libvirt-daemon-kvm-3.9.0-14.el7_5.5.x86_64
libvirt-libs-3.9.0-14.el7_5.5.x86_64
libvirt-python-3.9.0-1.el7.x86_64
openstack-nova-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-console-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-migration-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-novncproxy-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-placement-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001141.el7ost.noarch
puppet-nova-12.4.0-3.el7ost.noarch
python-nova-17.0.3-0.20180420001141.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch


In order to reproduce on our small staging cluster I did the following:

Disabled all but one compute via nova service-disable (overcloud-compute-0 was left enabled)

Created a pci passthrough flavor (openstack flavor create --ram 16128 --disk 96 --vcpus 4 --property "pci_passthrough:alias"="nvme:1" m4.xlarge-pci)

Created an instance using the pci passthrough flavor (openstack server create --image centos7 --flavor m4.xlarge-pci --key-name akrzos --security-group ci_security_group --nic net-id=ci_network akrzos )

Verified that said instance had the passed-through nvme device:

(overcloud) [stack@b04-h02-1029p ~]$ ssh -i akrzos.pem centos@10.12.80.16
[centos@akrzos ~]$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda     253:0    0   96G  0 disk 
└─vda1  253:1    0   96G  0 part /
nvme0n1 259:0    0  477G  0 disk 


Then deleted the instance with (openstack server delete akrzos)

Attempting to boot a new pci passthrough instance fails:

(overcloud) [stack@b04-h02-1029p ~]$ openstack server create --image centos7 --flavor m4.xlarge-pci --key-name akrzos --security-group ci_security_group --nic net-id=ci_network akrzos 
...
(overcloud) [stack@b04-h02-1029p ~]$ openstack server list
+--------------------------------------+--------+--------+----------+---------+---------------+
| ID                                   | Name   | Status | Networks | Image   | Flavor        |
+--------------------------------------+--------+--------+----------+---------+---------------+
| 6c4c8227-082d-4417-ab69-d41dd8dd1350 | akrzos | ERROR  |          | centos7 | m4.xlarge-pci |
+--------------------------------------+--------+--------+----------+---------+---------------+



While doing this I checked in on the compute node's lspci output regarding the nvme device:

Before booting the first instance:
[stack@b04-h02-1029p ansible]$ ansible -i hosts.com compute -u heat-admin -b  -m'shell' -a'lspci -vvv -s 0000:01:'

overcloud-compute-0 | SUCCESS | rc=0 >>
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at 9d100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Power Budgeting <?>
        Capabilities: [168 v1] #19
        Capabilities: [188 v1] Latency Tolerance Reporting
                Max snoop latency: 71680ns
                Max no snoop latency: 71680ns
        Capabilities: [190 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

While the first instance was booted and running:

[stack@b04-h02-1029p ansible]$ ansible -i hosts.com compute -u heat-admin -b  -m'shell' -a'lspci -vvv -s 0000:01:'

overcloud-compute-0 | SUCCESS | rc=0 >>
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at 9d100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Power Budgeting <?>
        Capabilities: [168 v1] #19
        Capabilities: [188 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [190 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

After deleting the first instance:

[stack@b04-h02-1029p ansible]$ ansible -i hosts.com compute -u heat-admin -b  -m'shell' -a'lspci -vvv -s 0000:01:' 

overcloud-compute-0 | SUCCESS | rc=0 >>
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel modules: nvme

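The "(rev ff) (prog-if ff)" and "!!! Unknown header type 7f" output above is the signature of a device whose config space reads back all-ones after the failed reset. A minimal, purely illustrative Python helper for spotting that state in lspci output (the function name is my own, not part of any existing tooling):

```python
def device_in_error_state(lspci_output: str) -> bool:
    """True if `lspci -vvv` output shows the all-ones config space
    ("rev ff" / unknown header type 7f) seen after the failed reset."""
    return "(rev ff)" in lspci_output or "Unknown header type 7f" in lspci_output


# Sample taken from the output above:
broken = (
    "01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd "
    "NVMe SSD Controller SM961/PM961 (rev ff) (prog-if ff)\n"
    "        !!! Unknown header type 7f\n"
    "        Kernel modules: nvme\n"
)
print(device_in_error_state(broken))  # True
```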

If I stop the instance first with "openstack server stop akrzos", I can then create a new instance that uses the passed-through NVMe.

I can provide you access to this testbed to verify; however, it will take us some time to make sure it is set up to reproduce cleanly for you.

Comment 22 Alex Williamson 2018-07-27 15:21:00 UTC
(In reply to Alex Krzos from comment #21)
> (In reply to Alex Williamson from comment #20)
>
> I can reproduce the issue in our staging environment.  Most recently we
> deployed OSP13 with RHEL 7.5 and kernel 3.10.0-862.3.3.el7.x86_64 and still
> witnessed the issue.

Hi Alex,

Can you verify that the vendor:device ID of your NVMe drive is 144d:a804? It should be shown by 'lspci -nnks 1:00.0'.  I found a reproducible case of this myself recently and have a patch pending upstream that disables the NVMe controller before reset on this device.  If it matches, I'll provide a backport to the 7.5 host kernel that I'd appreciate testing.  Thanks

Comment 23 Alex Krzos 2018-07-27 15:31:24 UTC
(In reply to Alex Williamson from comment #22)
> (In reply to Alex Krzos from comment #21)
> > (In reply to Alex Williamson from comment #20)
> >
> > I can reproduce the issue in our staging environment.  Most recently we
> > deployed OSP13 with RHEL 7.5 and kernel 3.10.0-862.3.3.el7.x86_64 and still
> > witnessed the issue.
> 
> Hi Alex,
> 
> Can you verify the vendor:device ID of your NVMe drive is 144d:a804, this
> should be found in 'lspci -nnks 1:00.0'.  I found a reproducible case of
> this myself recently and have a patch pending upstream that disables the
> nvme controller before reset on this device.  If it matches, I'll provide a
> backport to 7.5 host kernel that I'd appreciate testing.  Thanks

The vendor:device ID is 144d:a804 according to the first line of output; however, the second line shows [144d:a801], so I am unsure which is the correct vendor:device ID.


[root@overcloud-compute-0 ~]# lspci -nnks 1:00.0
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]
        Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
        Kernel driver in use: nvme
        Kernel modules: nvme
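For reference, the two bracketed IDs above can be told apart mechanically: the ID on the device line is the vendor:device pair, while the Subsystem line carries a separate subsystem ID. A small sketch of that parsing (my own helper, not part of lspci):

```python
import re


def pci_ids(lspci_nnks_output: str):
    """Extract (vendor:device, subsystem) IDs from `lspci -nnks` output.

    The [xxxx:xxxx] on the first (device) line is the vendor:device ID
    that matters here; the Subsystem line carries a separate ID."""
    ids = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", lspci_nnks_output)
    # First match comes from the device line, second from the Subsystem line.
    return ids[0], (ids[1] if len(ids) > 1 else None)


sample = (
    "01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd "
    "NVMe SSD Controller SM961/PM961 [144d:a804]\n"
    "\tSubsystem: Samsung Electronics Co Ltd Device [144d:a801]\n"
)
print(pci_ids(sample))  # ('144d:a804', '144d:a801')
```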

Comment 24 Alex Williamson 2018-07-27 16:20:00 UTC
(In reply to Alex Krzos from comment #23)
>
> vendor:device ID is 144d:a804 according to the first line of output however
> the second line shows [144d:a801] so I am unsure of which is the correct
> vendor:Device ID

Just the first line is all we need; that's a match.  Please try this kernel on the VM host and let me know if it resolves the issue:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17387051

Comment 25 Alex Williamson 2018-07-31 18:12:39 UTC
(In reply to Alex Williamson from comment #24)
> (In reply to Alex Krzos from comment #23)
> >
> > vendor:device ID is 144d:a804 according to the first line of output however
> > the second line shows [144d:a801] so I am unsure of which is the correct
> > vendor:Device ID
> 
> Just the first line is all we need, that's a match.  Please try this kernel
> on the VM host and let me know if it resolves the issue:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17387051

The previous scratch build has already expired; were you able to test, or do I need to resubmit the build?

Comment 26 CongLi 2018-08-01 06:59:15 UTC
(In reply to Alex Williamson from comment #22)
> (In reply to Alex Krzos from comment #21)
> > (In reply to Alex Williamson from comment #20)
> >
> > I can reproduce the issue in our staging environment.  Most recently we
> > deployed OSP13 with RHEL 7.5 and kernel 3.10.0-862.3.3.el7.x86_64 and still
> > witnessed the issue.
> 
> Hi Alex,
> 
> Can you verify the vendor:device ID of your NVMe drive is 144d:a804, this
> should be found in 'lspci -nnks 1:00.0'.  I found a reproducible case of
> this myself recently and have a patch pending upstream that disables the
> nvme controller before reset on this device.  If it matches, I'll provide a
> backport to 7.5 host kernel that I'd appreciate testing.  Thanks

Hi Alex,

Does it mean this issue is only triggered with '1:00.0'?

Thanks.

Comment 27 Alex Krzos 2018-08-01 11:19:03 UTC
(In reply to Alex Williamson from comment #25)
> (In reply to Alex Williamson from comment #24)
> > (In reply to Alex Krzos from comment #23)
> > >
> > > vendor:device ID is 144d:a804 according to the first line of output however
> > > the second line shows [144d:a801] so I am unsure of which is the correct
> > > vendor:Device ID
> > 
> > Just the first line is all we need, that's a match.  Please try this kernel
> > on the VM host and let me know if it resolves the issue:
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17387051
> 
> The previous scratch build has already expired, were you able to test or do
> I need to resubmit the build?

We have not yet had time to fit this testing in.  We do have a small staging environment which I can provide for you this week to test that kernel on.  Send me a direct message and I can get you access today.

Comment 28 Alex Williamson 2018-08-01 14:07:48 UTC
(In reply to CongLi from comment #26)
> Hi Alex,
> 
> Does it mean this issue only triggered with '1:00.0' ?

Hi CongLi,

The issue is only triggered with Samsung NVMe controllers; this one happens to be installed at address 1:00.0, but the address is not a fundamental component of the issue, AFAIK.  Thanks
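Since the slot address is incidental, a host can be scanned for any affected device rather than hard-coding 1:00.0. A hedged sketch that filters `lspci -nn`-style output for the 144d:a804 ID (the helper name is mine, not from any existing tool):

```python
import re

AFFECTED_ID = "144d:a804"  # Samsung SM961/PM961 vendor:device ID, per this BZ


def affected_addresses(lspci_nn_output: str):
    """PCI addresses of devices whose vendor:device ID matches AFFECTED_ID,
    regardless of where they sit on the bus."""
    addrs = []
    for line in lspci_nn_output.splitlines():
        # The last [xxxx:xxxx] on a device line is the vendor:device ID
        # (earlier brackets like [0108] are the class code, no colon).
        ids = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", line)
        if ids and ids[-1] == AFFECTED_ID:
            addrs.append(line.split()[0])
    return addrs


sample = (
    "00:1b.0 PCI bridge [0604]: Intel Corporation Device [8086:a340]\n"
    "01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd "
    "NVMe SSD Controller SM961/PM961 [144d:a804]\n"
)
print(affected_addresses(sample))  # ['01:00.0']
```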

Comment 29 Alex Williamson 2018-08-01 14:11:15 UTC
(In reply to Alex Krzos from comment #27)
> 
> We have not had the time to work testing this in.  We do have a small
> staging environment which I can provide for you this week to test that
> kernel on.  Send me a direct message and I can get you access today.

If you can distill the test case to a QEMU command line or libvirt XML, I can offer to verify the kernel; otherwise, since you are more familiar with the test environment, you are a better candidate to verify.  I'll build a new kernel and put it in a more stable location for testing at your leisure.

Comment 30 Alex Williamson 2018-08-01 19:13:02 UTC
(In reply to Alex Williamson from comment #29)
> (In reply to Alex Krzos from comment #27)
> > 
> > We have not had the time to work testing this in.  We do have a small
> > staging environment which I can provide for you this week to test that
> > kernel on.  Send me a direct message and I can get you access today.
> 
> If you can distill the test case to a QEMU commandline or libvirt XML, I
> would offer to verify the kernel, otherwise I would kindly request that you
> are more familiar with the test environment and a better candidate to
> verify.  I'll build a new kernel and put it in a more stable location for
> testing at your leisure.

Here's a new build of the kernel for testing:

http://people.redhat.com/~alwillia/bz1542494/

Comment 31 Alex Krzos 2018-08-03 13:43:40 UTC
(In reply to Alex Williamson from comment #30)
> (In reply to Alex Williamson from comment #29)
> > (In reply to Alex Krzos from comment #27)
> > > 
> > > We have not had the time to work testing this in.  We do have a small
> > > staging environment which I can provide for you this week to test that
> > > kernel on.  Send me a direct message and I can get you access today.
> > 
> > If you can distill the test case to a QEMU commandline or libvirt XML, I
> > would offer to verify the kernel, otherwise I would kindly request that you
> > are more familiar with the test environment and a better candidate to
> > verify.  I'll build a new kernel and put it in a more stable location for
> > testing at your leisure.
> 
> Here's a new build of the kernel for testing:
> 
> http://people.redhat.com/~alwillia/bz1542494/

Thanks,

I installed the kernel on a single hypervisor in the OpenStack cloud and then disabled all of the other hypervisors.  I then ran through a serial set of commands to boot an instance with PCI passthrough and then delete it, five times.  This was run serially since there is only a single NVMe on the compute node; thus the sequence resembled: boot instance w/ NVMe, delete instance, boot instance w/ NVMe, delete instance.

At the conclusion, the NVMe on the compute node never showed the error condition on the hypervisor, and the instances never ended up in an error state either.

Output and commands here:
https://gist.github.com/akrzos/b5cf6bf5b8492868180fabf791b2ad88
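The serial boot/delete sequence above can be sketched roughly as follows; the flavor, image, and server names are placeholders (not taken from the report), and the CLI runner is injected so the shape of the loop can be exercised without an OpenStack cloud:

```python
def cycle_instance(run, iterations=5, name="nvme-test"):
    """Boot and delete a passthrough instance `iterations` times in series.

    `run` is any callable taking an argv list, e.g. subprocess.check_call
    on a real deployment; flavor/image names below are placeholders."""
    for _ in range(iterations):
        run(["openstack", "server", "create", "--flavor", "nvme-passthrough",
             "--image", "rhel7", "--wait", name])
        run(["openstack", "server", "delete", "--wait", name])


# Dry run with a recording runner instead of actually invoking the CLI:
calls = []
cycle_instance(calls.append, iterations=2)
print(len(calls))  # 4 calls: create, delete, create, delete
```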

I believe this shows that the patched kernel is not affected.  I am installing the kernel on the rest of the compute nodes for more exhaustive testing and to avoid the error condition in this testbed.

I also wanted to point out that running the same sequence on un-patched hypervisors results in the second instance being unable to boot, with the following output on the un-patched compute nodes, as previously discussed in this BZ:
[root@overcloud-compute-1 ~]# lspci -vvv -s 0000:01:
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci
        Kernel modules: nvme


Thanks AlexW for the patch; it seems this fixes our issue.

Comment 33 Bruno Meneguele 2018-08-30 16:18:35 UTC
Considering that the patchset posted on rhkernel-list also solves another BZ (BZ 1592654) that has an exception flag requested, shouldn't this BZ also request the flag so that the entire patchset has a chance to get integrated?

Comment 34 Bruno Meneguele 2018-09-01 02:12:31 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 36 Bruno Meneguele 2018-09-03 19:47:47 UTC
Patch(es) available on kernel-3.10.0-942.el7

Comment 38 CongLi 2018-09-10 05:53:25 UTC
1. Reproduced this issue on: kernel-3.10.0-940.el7.x86_64

Steps:
1) install a guest on nvme
2) reboot guest

qemu:
(qemu) qemu-kvm: vfio_err_notifier_handler(0000:02:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

host dmesg:
[61345.741741] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[61345.748574] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[61345.758777] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[61345.767151] pcieport 0000:00:1b.0:    [ 0] Receiver Error        
[61347.641675] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[61347.648520] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[61347.658729] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[61347.667108] pcieport 0000:00:1b.0:    [ 0] Receiver Error        
[61348.042293] vfio_ecap_init: 0000:02:00.0 hiding ecap 0x19@0x168
[61348.048240] vfio_ecap_init: 0000:02:00.0 hiding ecap 0x1e@0x190
[61348.203471] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[61348.210291] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[61348.220494] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[61348.228880] pcieport 0000:00:1b.0:    [ 0] Receiver Error        
 

2. Verified this bug on: kernel-3.10.0-944.el7.x86_64
The guest could boot up and reboot successfully.
host dmesg:
[257072.707914] vfio-pci 0000:02:00.0: enabling device (0400 -> 0402)
[257072.814894] vfio_ecap_init: 0000:02:00.0 hiding ecap 0x19@0x168
[257072.820944] vfio_ecap_init: 0000:02:00.0 hiding ecap 0x1e@0x190
[257085.907046] vfio-pci 0000:02:00.0: irq 126 for MSI/MSI-X
[257086.114261] vfio-pci 0000:02:00.0: irq 126 for MSI/MSI-X

nvme info:
02:00.0 0108: 144d:a804 (prog-if 02 [NVM Express])
	Subsystem: 144d:a801
	Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
	Memory at a4100000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [158] Power Budgeting <?>
	Capabilities: [168] #19
	Capabilities: [188] Latency Tolerance Reporting
	Capabilities: [190] L1 PM Substates
	Kernel driver in use: vfio-pci
	Kernel modules: nvme
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
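The "Kernel driver in use" line above can also be checked programmatically via sysfs, which is handy when scripting this verification. A sketch assuming the standard /sys/bus/pci layout (the helper itself is mine, not from any existing tool):

```python
import os


def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Name of the driver bound to PCI device `bdf` (e.g. '0000:02:00.0'),
    or None if no driver is bound.  While passed through, the NVMe device
    should report 'vfio-pci'; once released back to the host, 'nvme'."""
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        return None
    # The 'driver' entry is a symlink into /sys/bus/pci/drivers/<name>.
    return os.path.basename(os.readlink(link))
```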

Comment 41 errata-xmlrpc 2018-10-30 08:37:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083
