Bug 1592654 - [NVMe Device Assignment] Guest fails to reboot from an assigned NVMe device that has the OS installed on it
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Alex Williamson
QA Contact: Evan McNabb
Duplicates: 1601843
 
Reported: 2018-06-19 05:26 UTC by CongLi
Modified: 2018-10-30 09:24 UTC (History)

Fixed In Version: kernel-3.10.0-942.el7
Last Closed: 2018-10-30 09:22:49 UTC


Attachments
NVMe - no bootable device (21.26 KB, image/png), 2018-06-19 05:26 UTC, CongLi
seabios.log (9.79 KB, text/plain), 2018-07-19 10:17 UTC, CongLi


Links
Red Hat Product Errata RHSA-2018:3083 (last updated 2018-10-30 09:24:14 UTC)

Description CongLi 2018-06-19 05:26:33 UTC
Created attachment 1452813
NVMe - no bootable device

Description of problem:
The guest fails to reboot when it was booted from a passed-through NVMe device that has the OS installed on it.

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.12.0-3.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest directly from the NVMe drive (RHEL 7.5 installed on the NVMe drive):
    -device vfio-pci,id=nvme,host=06:00.0,bootindex=0 \
2. Reboot the guest.

Actual results:
Guest reboot failed with 'No bootable device'.

Expected results:
The guest reboots successfully.

Additional info:
1. The guest reboots successfully with:
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -serial unix:/tmp/console,server,nowait \
    -device sga \
    -machine graphics=off \

2. QEMU command line:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -machine pc  \
    -nodefaults  \
    -vga cirrus  \
    -device vfio-pci,id=nvme,host=06:00.0,bootindex=0 \
    -device virtio-net-pci,mac=9a:ff:00:01:02:03,id=idB3A32U,vectors=4,netdev=idWLzEQC,bus=pci.0,addr=0x5  \
    -netdev tap,id=idWLzEQC,vhost=on \
    -m 15360  \
    -smp 12,maxcpus=12,cores=6,threads=1,sockets=2  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -no-shutdown \
    -enable-kvm \
    -serial tcp:0:3333,nowait,server \
    -monitor stdio \

3. NVMe info:
# lspci -v -s 06:00.0
06:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation DC P3700 SSD
	Physical Slot: 4
	Flags: fast devsel, IRQ 81, NUMA node 0
	Memory at 902fc000 (64-bit, non-prefetchable) [disabled] [size=16K]
	Expansion ROM at 90200000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI-X: Enable- Count=32 Masked-
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Virtual Channel
	Capabilities: [180] Power Budgeting <?>
	Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [270] Device Serial Number 55-cd-2e-41-4e-91-0c-43
	Capabilities: [2a0] #19
	Kernel driver in use: vfio-pci
	Kernel modules: nvme

Comment 2 CongLi 2018-06-19 06:10:47 UTC
seabios-bin-1.11.0-2.el7.noarch
seabios-1.11.0-2.el7.x86_64

Comment 3 Alex Williamson 2018-06-19 13:00:15 UTC
Please explain the test scenario further. Is the failure that the guest boots correctly from NVMe on the initial boot of the VM but fails when the guest is rebooted?

Comment 4 CongLi 2018-06-20 06:25:48 UTC
(In reply to Alex Williamson from comment #3)
> Please explain the test scenario further, is the failure that the guest
> boots correctly from NVMe on the initial boot of the VM but fails when the
> guest is rebooted?

Yes, the guest boots successfully from NVMe on the initial boot of the VM, but the reboot fails.

Comment 5 CongLi 2018-06-20 06:29:13 UTC
(In reply to CongLi from comment #4)
> (In reply to Alex Williamson from comment #3)
> > Please explain the test scenario further, is the failure that the guest
> > boots correctly from NVMe on the initial boot of the VM but fails when the
> > guest is rebooted?
> 
> Yes, guest could boot successfully from NVMe on the initial boot of the VM
> but reboot failed.

Boot from NVMe on the initial boot of the VM -> reboot: fails
Boot from NVMe on the initial boot of the VM -> shutdown -> boot the guest again from NVMe: succeeds

Comment 6 Alex Williamson 2018-07-18 23:46:34 UTC
[Cc +Laszlo - maybe another NVMe reset timeout issue]

Hi Cong Li,

Upstream SeaBIOS does have native NVMe support, but we de-configure it for the RHEL7 build of the package.  That means that for the device you're testing, we're using the device provided PCI option ROM driver under SeaBIOS.  I'm reminded of some comments Laszlo has made in the past about NVMe devices requiring a lengthy reset period on edk2 or else the built-in UEFI driver would fail to attach to the device.  Without disassembling the option ROM driver or perhaps tracing the device interaction through vfio, we don't really know how the ROM supplied driver handles waiting for reset to complete.  The Samsung NVMe drive I have on hand also does not support an option ROM, so cannot support boot without a driver built into the firmware.  Could you please test whether this issue reproduces with the SeaBIOS NVMe driver enabled using this package:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17242643

It's probably advisable to add 'rombar=0' to the vfio-pci device options for QEMU to mask the physical device ROM.  Thanks,

Alex

Comment 7 CongLi 2018-07-19 07:25:09 UTC
(In reply to Alex Williamson from comment #6)
> [Cc +Laszlo - maybe another NVMe reset timeout issue]
> 
> Hi Cong Li,
> 
> Upstream SeaBIOS does have native NVMe support, but we de-configure it for
> the RHEL7 build of the package.  That means that for the device you're
> testing, we're using the device provided PCI option ROM driver under
> SeaBIOS.  I'm reminded of some comments Laszlo has made in the past about
> NVMe devices requiring a lengthy reset period on edk2 or else the built-in
> UEFI driver would fail to attach to the device.  Without disassembling the
> option ROM driver or perhaps tracing the device interaction through vfio, we
> don't really know how the ROM supplied driver handles waiting for reset to
> complete.  The Samsung NVMe drive I have on hand also does not support an
> option ROM, so cannot support boot without a driver built into the firmware.
> Could you please test whether this issue reproduces with the SeaBIOS NVMe
> driver enabled using this package:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17242643
> 
> It's probably advisable to add 'rombar=0' to the vfio-pci device options for
> QEMU to mask the physical device ROM.  Thanks,
> 
> Alex


The guest still could not boot with the package above:
seabios-bin-1.11.0-2.el7.nvme.noarch
seabios-1.11.0-2.el7.nvme.x86_64
Command line:
    -device vfio-pci,id=nvme,host=06:00.0,rombar=0 \


NVMe device info:
06:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation DC P3700 SSD
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 32, NUMA node 0
	Memory at 902fc000 (64-bit, non-prefetchable) [size=16K]
	Expansion ROM at 90200000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI-X: Enable- Count=32 Masked-
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Virtual Channel
	Capabilities: [180] Power Budgeting <?>
	Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [270] Device Serial Number 55-cd-2e-41-4e-91-0c-43
	Capabilities: [2a0] #19
	Kernel driver in use: vfio-pci
	Kernel modules: nvme

Thanks.

Comment 9 Laszlo Ersek 2018-07-19 08:41:05 UTC
For me, this is the curious bit:

(In reply to CongLi from comment #0)
> Additional info:
> 1. Guest reboot successfully with:
>     -chardev pty,id=charserial0 \
>     -device isa-serial,chardev=charserial0,id=serial0 \
>     -serial unix:/tmp/console,server,nowait \
>     -device sga \
>     -machine graphics=off \
> 
> 2. Qemu CML:
> /usr/libexec/qemu-kvm \
>     [...]
>     -vga cirrus  \
>     [...]

IIUC, if we remove Cirrus from the picture (implying that we remove the Cirrus *oprom*, "vgabios-cirrus.bin", from the picture), the reboot works.

Does the issue reproduce with guest GPU models different from Cirrus?

It might also help to capture a SeaBIOS debug log (with high log-level setting) and compare between normal boot and reboot.

Comment 10 CongLi 2018-07-19 10:15:47 UTC
(In reply to Laszlo Ersek from comment #9)
> For me, this is the curious bit:
> 
> (In reply to CongLi from comment #0)
> > Additional info:
> > 1. Guest reboot successfully with:
> >     -chardev pty,id=charserial0 \
> >     -device isa-serial,chardev=charserial0,id=serial0 \
> >     -serial unix:/tmp/console,server,nowait \
> >     -device sga \
> >     -machine graphics=off \
> > 
> > 2. Qemu CML:
> > /usr/libexec/qemu-kvm \
> >     [...]
> >     -vga cirrus  \
> >     [...]
> 
> IIUC, if we remove Cirrus from the picture (implying that we remove the
> Cirrus *oprom*, "vgabios-cirrus.bin", from the picture), the reboot works.
> 
> Does the issue reproduce with guest GPU models different from Cirrus?

It also reproduces with vga=std, with the same 'No bootable device' message.

> 
> It might also help to capture a SeaBIOS debug log (with high log-level
> setting) and compare between normal boot and reboot.

Will be attached.

Comment 11 CongLi 2018-07-19 10:17:12 UTC
Created attachment 1459969
seabios.log

Comment 13 Laszlo Ersek 2018-07-19 12:59:29 UTC
Comparing the normal SeaBIOS boot log and the SeaBIOS reboot log, Alex's
suspicion is confirmed (funnily enough, I've by now totally forgotten about
the discussion he refers to in comment 6). Splitting the attachment into
"normal boot" and "reboot" parts, and comparing them, we notice:

>  Found 0 lpt ports
>  Found 0 serial ports
>  PS2 keyboard initialized
> -Searching bootorder for: /pci@i0cf8/*@3
> +WARNING - Timeout at nvme_wait_csts_rdy:484!
>  All threads complete.
>  Scan for option roms
>  Running option rom at c980:0003

and

>  Searching bootorder for: /pci@i0cf8/*@5
>  Searching bootorder for: /rom@genroms/kvmvapic.bin
>  Searching bootorder for: HALT
> -drive 0x000f6080: PCHS=0/0/0 translation=lba LCHS=1024/255/63 s=781422768
>  Running option rom at ca80:0003
> -Space available for UMB: cd000-eb800, f4520-f6080
> -Returned 94208 bytes of ZoneHigh
> +Space available for UMB: cd000-eb800, f4520-f60c0
> +Returned 131072 bytes of ZoneHigh

and

>  Booting from Hard Disk...
> -Booting from 0000:7c00
> -[...]
> +Boot failed: could not read the boot disk

The main discovery is (I guess) "Timeout at nvme_wait_csts_rdy:484".

(The message "could not read the boot disk" is emitted from

    // Read sector
    struct bregs br;
    memset(&br, 0, sizeof(br));
    br.flags = F_IF;
    br.dl = bootdrv;
    br.es = bootseg;
    br.ah = 2;
    br.al = 1;
    br.cl = 1;
    call16_int(0x13, &br);

    if (br.flags & F_CF) {
        printf("Boot failed: could not read the boot disk\n\n");
        return;
    }

but I guess it's quite expected that the "read sector" BIOS service fails,
after the disk cannot be initialized / enumerated in the first place.)

This physical NVMe device
- either cannot be re-set, as an assigned device, through a VM reset,
- or it needs a lot more time to recover/settle than *both* SeaBIOS's builtin
  driver *and* its own option ROM would expect. It could be useful to involve
  the vendor (Intel).

That's my take anyway.
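
For illustration, the kind of wait that times out in nvme_wait_csts_rdy can be sketched as a bounded poll of the controller's CSTS.RDY bit. This is not the SeaBIOS code: the accessor callbacks, the fake_ctrl model, and the name wait_csts_rdy are hypothetical. Only the register offset (0x1c) and the bit position come from the NVMe specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVME_REG_CSTS  0x1c       /* Controller Status register offset */
#define NVME_CSTS_RDY  (1u << 0)  /* CSTS.RDY: controller ready */

/* Hypothetical accessors supplied by the caller (not SeaBIOS APIs). */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t offset);
typedef void (*delay_ms_fn)(void *ctx, unsigned ms);

/*
 * Poll CSTS.RDY in 1 ms steps until it equals want_ready or the
 * timeout budget is used up. Returns true on success, false on timeout.
 */
bool wait_csts_rdy(void *ctx, reg_read_fn rd, delay_ms_fn delay,
                   bool want_ready, unsigned timeout_ms)
{
    unsigned waited = 0;

    for (;;) {
        uint32_t csts = rd(ctx, NVME_REG_CSTS);

        if (((csts & NVME_CSTS_RDY) != 0) == want_ready)
            return true;
        if (waited >= timeout_ms)
            return false;
        delay(ctx, 1);
        waited++;
    }
}

/* Simulated controller (assumption): ready once ready_at_ms has passed. */
struct fake_ctrl {
    unsigned now_ms;
    unsigned ready_at_ms;
};

uint32_t fake_read(void *ctx, uint32_t offset)
{
    struct fake_ctrl *c = ctx;

    if (offset != NVME_REG_CSTS)
        return 0;
    return c->now_ms >= c->ready_at_ms ? NVME_CSTS_RDY : 0;
}

void fake_delay(void *ctx, unsigned ms)
{
    struct fake_ctrl *c = ctx;

    c->now_ms += ms;  /* advance simulated time instead of sleeping */
}
```

With a simulated controller that needs 300 ms to become ready, a 100 ms budget times out while a 500 ms budget succeeds, which is the shape of failure the "needs more time to settle" hypothesis would predict.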

Comment 14 Alex Williamson 2018-07-19 23:24:49 UTC
Laszlo, thanks for getting involved and for your analysis on this.  The previous conversation about this was a reply to one of our status reports many months ago, I only had a spark of a memory sufficient to search my inbox ;)

So, I'm not able to reproduce this with my Samsung NVMe drive, but I did manage to find a case where the drive goes fatal if I quit the VM at the right point and we only ever read -1 from config space from that point on until the host is reset.  I did log in to Cong Li's system and found the issue to be readily reproducible there.

Hoping to tackle both issues, I wrote a device specific PCI reset quirk for NVMe.  Effectively, it reads the NVMe capability register to determine if NVM subsystem reset is available (neither controller supports this), reads the NVMe config register to test if the controller is enabled, disabling it and waiting for the ready status to reach the proper state before ultimately issuing the reset, which is still a PCIe FLR since neither supports the aforementioned subsystem reset.
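
The reset sequence described above can be sketched roughly as follows. This is a simplified model and not the proposed kernel patch: struct nvme_dev, sim_tick, and both function names are hypothetical stand-ins for MMIO access and the real FLR, and giving up when the controller never goes idle is a choice made only for this sketch. The bit positions (CAP.NSSRS, CC.EN, CSTS.RDY) are taken from the NVMe specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVME_CAP_NSSRS  (1ull << 36)  /* CAP: NVM subsystem reset supported */
#define NVME_CC_EN      (1u << 0)     /* CC: controller enable */
#define NVME_CSTS_RDY   (1u << 0)     /* CSTS: controller ready */

/* Hypothetical device model standing in for register access and FLR. */
struct nvme_dev {
    uint64_t cap;               /* Controller Capabilities */
    uint32_t cc;                /* Controller Configuration */
    uint32_t csts;              /* Controller Status */
    unsigned polls_until_idle;  /* simulated time for RDY to clear */
    bool flr_issued;            /* set when the (simulated) FLR happens */
};

/* Simulation step: once CC.EN is clear, RDY drops after a countdown. */
static void sim_tick(struct nvme_dev *d)
{
    if (!(d->cc & NVME_CC_EN) && (d->csts & NVME_CSTS_RDY)) {
        if (d->polls_until_idle == 0)
            d->csts &= ~NVME_CSTS_RDY;
        else
            d->polls_until_idle--;
    }
}

/* True if the controller advertises NVM subsystem reset. */
bool nvme_has_subsystem_reset(const struct nvme_dev *d)
{
    return (d->cap & NVME_CAP_NSSRS) != 0;
}

/*
 * If the controller is enabled, disable it and wait for CSTS.RDY to
 * clear, then issue the FLR. Returns false (without resetting) if the
 * controller does not go idle within max_polls polls.
 */
bool nvme_disable_then_flr(struct nvme_dev *d, unsigned max_polls)
{
    if (d->cc & NVME_CC_EN) {
        unsigned n = 0;

        d->cc &= ~NVME_CC_EN;
        while (d->csts & NVME_CSTS_RDY) {
            if (n++ >= max_polls)
                return false;
            sim_tick(d);
        }
    }
    d->flr_issued = true;  /* a real quirk would trigger PCIe FLR here */
    return true;
}
```

The point of the ordering is that the function-level reset only happens once the controller has acknowledged the disable, so the device is quiescent when it is reset.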

This seems to resolve the issue with Samsung (yay!), but has no effect on Intel (boo!).  There must be something we can learn from the fact that the Intel device works so reliably on the first boot when the VM is instantiated.  More investigation is required and I'll also look to see if there's any published errata for the Intel device.   Cong Li, can the system remain available for use Friday (UTC-6)?  Thanks

Comment 15 Alex Williamson 2018-07-20 04:49:37 UTC
[Cc +Marc-André as some of these patches are his]

In the process of enabling tracing, I built upstream QEMU on the target system and was no longer able to reproduce the issue.  Some experimentation showed that using upstream bios.bin with qemu-kvm-rhev also resolved the issue.  This turned out to be a little bit of a red herring because I later identified that upstream is built with CONFIG_USE_SMM while downstream is not, which toggles the failure (I suppose this is an alternate fix).  However, even with SMM disabled like downstream, rel-1.11.0 exhibits the problem while master does not.  A quick bisect later and I arrive at the fix:

96060ad tpm: Wait for interface startup when probing

This is within a series of changes in the tpm space, so for a clean backport I included:

96060ad tpm: Wait for interface startup when probing
559b3e2 tpm: Refactor duplicated wait code in tis_wait_sts() & crb_wait_reg()
9c6e73b tpm: add TPM CRB device support
a197e20 tpm: use get_tpm_version() callback
c75d45a tpm: generalize init_timeout()
8694c3b x86: add readq()

The target system is currently installed with this build, from brew:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17262519

I've also found that the nvme driver is already included in bios-256k.bin, which is what we were already using, so no special nvme enabling is required and it works regardless of whether we enable the option ROM or not (some uncertainty about which is being used though).

I have no idea why this resolves the issue at this point, Marc-André any thoughts?  I also don't know if the device specific reset noted in comment 14 is providing any benefit for this device (a test for tomorrow).

Comment 17 Marc-Andre Lureau 2018-07-20 10:13:41 UTC
(In reply to Alex Williamson from comment #15)
> [Cc +Marc-André as some of these patches are his]
> problem while master does not.  A quick bisect later and I arrive at the fix:
> 
> 96060ad tpm: Wait for interface startup when probing
> 
> I have no idea why this resolves the issue at this point, Marc-André any
> thoughts?

It looks like a timing issue. Stephen's commit added a 750000us (750 ms) timeout during TPM probing.

Comment 18 Alex Williamson 2018-07-20 13:56:51 UTC
(In reply to Marc-Andre Lureau from comment #17)
> 
> It looks like a timing issue. Stephen commit added a 750000us timeout during
> TPM probing.

Aha, so if the only fix is the timing, maybe we can assume that's also why enabling SMM resolves it and perhaps we can extrapolate that the reason the initial boot works reliably is simply the additional delay incurred by performing the DMA mappings during device initialization.  I'll experiment with a stall after the device is reset.  I hope there's some status we can check on the device to avoid an arbitrary stall.

Comment 19 Alex Williamson 2018-07-20 14:46:56 UTC
Confirmed, adding a 250ms delay post reset makes reboots reliable :facepalm:

I assume whatever hack/quirk fixes this won't be in seabios, but either kernel or qemu, moving back to qemu-kvm-rhev for now.

Comment 20 Alex Williamson 2018-07-23 18:34:08 UTC
*** Bug 1601843 has been marked as a duplicate of this bug. ***

Comment 21 Alex Williamson 2018-07-23 18:42:25 UTC
The best place to resolve this seems like a device specific reset quirk in the host PCI subsystem covering all NVMe class devices, disabling the controller prior to FLR for all devices and an additional post-reset delay for particularly troublesome devices, like this Intel DC P3700.  Moving to kernel/PCI and cc'ing Myron.
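
A rough sketch of how such a quirk might select devices: match on the NVMe class code, then look up an extra post-reset delay for known-slow devices. The table layout and function names are hypothetical and do not reflect the actual kernel patch. The Intel device ID shown is an assumption, since this report does not state it. The 250 ms value is the delay from comment 19, and the class code corresponds to the "prog-if 02 [NVM Express]" shown in the lspci output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CLASS_STORAGE_EXPRESS  0x010802u  /* mass storage, NVM, NVMe */
#define PCI_VENDOR_ID_INTEL        0x8086u

struct pci_id {
    uint16_t vendor;
    uint16_t device;
    uint32_t class_code;  /* 24-bit base class, subclass, prog-if */
};

/* Hypothetical table of devices needing extra settle time after reset. */
struct reset_delay_entry {
    uint16_t vendor;
    uint16_t device;
    unsigned delay_ms;
};

static const struct reset_delay_entry reset_delay_table[] = {
    /* Device ID 0x0953 is assumed for the DC P3700; verify before use. */
    { PCI_VENDOR_ID_INTEL, 0x0953, 250 },
};

/* Every NVMe-class device gets the disable-before-FLR treatment. */
bool needs_nvme_reset_quirk(const struct pci_id *id)
{
    return id->class_code == PCI_CLASS_STORAGE_EXPRESS;
}

/* Extra delay after reset, or 0 if the device is not listed. */
unsigned post_reset_delay_ms(const struct pci_id *id)
{
    size_t n = sizeof(reset_delay_table) / sizeof(reset_delay_table[0]);

    if (!needs_nvme_reset_quirk(id))
        return 0;
    for (size_t i = 0; i < n; i++) {
        const struct reset_delay_entry *e = &reset_delay_table[i];

        if (e->vendor == id->vendor && e->device == id->device)
            return e->delay_ms;
    }
    return 0;
}
```

Keeping the delay in a per-device table means well-behaved NVMe devices pay no extra reset latency, while only the listed devices take the fixed stall.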

Comment 24 Alex Williamson 2018-08-03 15:31:42 UTC
Latest (v3) proposed upstream patch series:

https://lkml.org/lkml/2018/7/24/708

Comment 26 Bruno Meneguele 2018-09-01 02:12:44 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 28 Bruno Meneguele 2018-09-03 19:48:01 UTC
Patch(es) available on kernel-3.10.0-942.el7

Comment 30 CongLi 2018-09-04 03:00:48 UTC
The original issue has been fixed in the latest kernel: 3.10.0-943.el7.x86_64.

The guest reboots successfully with '-device vfio-pci,id=nvme,host=06:00.0'.

# lspci -v -s 06:00.0
06:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation DC P3700 SSD
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 32, NUMA node 0
	Memory at 902fc000 (64-bit, non-prefetchable) [size=16K]
	Expansion ROM at 90200000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Virtual Channel
	Capabilities: [180] Power Budgeting <?>
	Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [270] Device Serial Number 55-cd-2e-41-4e-91-0c-43
	Capabilities: [2a0] #19
	Kernel driver in use: vfio-pci
	Kernel modules: nvme

Comment 31 Evan McNabb 2018-09-04 07:58:01 UTC
(In reply to CongLi from comment #30)
> The original issue has been fixed in latest kernel: 3.10.0-943.el7.x86_64.
> 
> Guest could reboot successfully via '-device vfio-pci,id=nvme,host=06:00.0'.

Thanks for testing! I'll set to VERIFIED, but let us know if there is any other testing that needs to be executed.

Comment 33 errata-xmlrpc 2018-10-30 09:22:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083

