Bug 1508821
| Summary: | [Q35] host crash when boot up with pf device | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | jingzhao <jinzhao> |
| Component: | qemu-kvm-rhev | Assignee: | Alex Williamson <alex.williamson> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Yanan Fu <yfu> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.5 | CC: | chayang, jinzhao, juzhang, knoel, michen, virt-maint, yfu |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-12-13 05:10:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
The host firmware escalated an error to fatal... Is this reproducible on other systems, including those from other vendors? Are there logs in the BIOS or DRAC that expose any more details of the fault?

1. X710 and 82599ES info:

[root@dell-per730-29 home]# ethtool -i p6p1
driver: i40e
version: 1.6.27-k
firmware-version: 5.02 0x80002400 17.5.9
expansion-rom-version:
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@dell-per730-29 home]# ethtool -i p5p1
driver: ixgbe
version: 5.1.0-k-rh7.5
firmware-version: 0x546c0001
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

2. Test Result info:

a. X710:
1) (seabios) pcie.0 -- X710 nic: host didn't crash
2) (seabios) pcie.0 -- root port -- X710: host crash
3) (ovmf) pcie.0 -- root port -- X710: host crash

b. 82599ES:
1) (seabios) pcie.0 -- root port -- 82599ES: host didn't crash
2) (ovmf) pcie.0 -- root port -- 82599ES: host crash

3. version:

[root@dell-per730-29 ~]# rpm -qa |grep OVMF
OVMF-20171011-1.git92d07e48907f.el7.noarch
[root@dell-per730-29 ~]# rpm -qa |grep seabios
seabios-bin-1.10.2-5.el7.noarch
[root@dell-per730-29 ~]# uname -r
3.10.0-766.el7.x86_64
[root@dell-per730-29 ~]# rpm -qa |grep qemu-kvm-rhev
qemu-kvm-rhev-debuginfo-2.10.0-3.el7.x86_64
qemu-kvm-rhev-2.10.0-3.el7.x86_6

Tried an 82576 nic and the host didn't crash.

[root@hp-dl585g7-05 home]# ethtool -i ens8f0
driver: igb
version: 5.4.0-k
firmware-version: 1.2.1
expansion-rom-version:
bus-info: 0000:44:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@hp-dl585g7-05 home]# lspci -vvv -s 44:00.0
44:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
    Subsystem: Intel Corporation Gigabit ET Dual Port Server Adapter
    Physical Slot: 8
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 46
    NUMA node: 7
    Region 0: Memory at fdfe0000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fd800000 (32-bit, non-prefetchable) [size=4M]
    Region 2: I/O ports at 5000 [disabled] [size=32]
    Region 3: Memory at fd7f0000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Address: 0000000000000000 Data: 0000
        Masking: 00000000 Pending: 00000000
    Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00002000
    Capabilities: [a0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [140 v1] Device Serial Number 90-e2-ba-ff-ff-05-63-5e
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 128, stride: 2, Device ID: 10ca
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 00000000fd7d0000 (64-bit, non-prefetchable)
        Region 3: Memory at 00000000fd7b0000 (64-bit, non-prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Kernel driver in use: igb
    Kernel modules: igb

seabios and ovmf:
q35: pcie.0 -- pcie-root-port -- 82576 nic: host didn't crash

Is this a regression? APEI errors usually indicate hardware errors. An X710 may be prone to generating PCIe bus errors where the platform may implement a firmware first error handling policy, which can decide the error is fatal. If this is not a regression on this system, we're going to need to defer it. The workaround for a customer encountering such issues would be to assign a VF rather than the PF.

(In reply to Alex Williamson from comment #6)
> Is this a regression? APEI errors usually indicate hardware errors. An
> X710 may be prone to generating PCIe bus errors where the platform may
> implement a firmware first error handling policy, which can decide the error
> is fatal. If this is not a regression on this system, we're going to need
> to defer it. The workaround for a customer encountering such issues would
> be to assign a VF rather than the PF.

Hi Alex,

It's not a regression. I hit the issue with qemu-kvm-rhev-2.9.0-16.el7_4.10.x86_64, kernel-3.10.0-693.12.1.el7.x86_64 and OVMF-20170228-5.gitc325e41585e3.el7.noarch.rpm when using the X710 (pcie.0 - pcie-root-port - x710).

Thanks
Jing

Thank you, moving to 7.6

Double-checked the issue:
1. Can reproduce the issue according to comment 0.
2. Tested against the latest version and didn't reproduce it.

(qemu)
[root@dell-per730-28 home]# uname -r
3.10.0-820.el7.x86_64
[root@dell-per730-28 home]# rpm -qa |grep qemu-kvm-rhev
qemu-kvm-rhev-debuginfo-2.10.0-12.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64
[root@dell-per730-28 home]# rpm -qa |grep OVMF
OVMF-20171011-4.git92d07e48907f.el7.noarch

So QE thinks this can be closed as CURRENTRELEASE; re-open it if it is hit again.

Thanks
Jing
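For anyone trying the suggested workaround (assigning a VF rather than the PF), it can help to know which PCI addresses the VFs will appear at before passing one to `vfio-pci`. The sketch below is only an illustration, not part of the original report: it assumes the usual SR-IOV rule that the n-th VF's routing ID is the PF's routing ID plus the "VF offset" plus (n-1) times the "stride", using the values that `lspci -vvv` prints in the SR-IOV capability. The function names are hypothetical. VFs themselves are typically enabled by writing a count to the PF's `sriov_numvfs` attribute in sysfs while the PF is bound to its host driver.

```python
def parse_bdf(bdf):
    """Split 'DDDD:BB:DD.F' into integer (domain, bus, device, function)."""
    domain, bus, devfn = bdf.split(":")
    dev, fn = devfn.split(".")
    return int(domain, 16), int(bus, 16), int(dev, 16), int(fn, 16)


def vf_addresses(pf_bdf, first_vf_offset, vf_stride, num_vfs):
    """Return the expected PCI addresses of the first num_vfs VFs.

    first_vf_offset and vf_stride are the 'VF offset' and 'stride'
    values from the PF's SR-IOV capability as shown by lspci.
    """
    domain, bus, dev, fn = parse_bdf(pf_bdf)
    # 16-bit routing ID: bus in the high byte, device/function in the low byte.
    pf_rid = (bus << 8) | (dev << 3) | fn
    addresses = []
    for n in range(num_vfs):
        rid = pf_rid + first_vf_offset + n * vf_stride
        addresses.append(
            "%04x:%02x:%02x.%x" % (domain, rid >> 8, (rid >> 3) & 0x1F, rid & 0x7)
        )
    return addresses


# XL710 PF from this report: VF offset 16, stride 1
print(vf_addresses("0000:04:00.0", 16, 1, 2))
# 82576 PF from this report: VF offset 128, stride 2
print(vf_addresses("0000:44:00.0", 128, 2, 2))
```

One of the resulting addresses would then replace `host=04:00.0` in the `-device vfio-pci,...` option, assuming the VF has been bound to `vfio-pci` on the host.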
Created attachment 1346951
detailed crash message

Description of problem:
host crash when boot up with pf device

Version-Release number of selected component (if applicable):
[root@dell-per730-29 127.0.0.1-2017-11-02-05:45:37]# uname -r
3.10.0-766.el7.x86_64
[root@dell-per730-29 127.0.0.1-2017-11-02-05:45:37]# rpm -qa |grep qemu-kvm-rhev
qemu-kvm-rhev-debuginfo-2.10.0-3.el7.x86_64
qemu-kvm-rhev-2.10.0-3.el7.x86_64
[root@dell-per730-29 127.0.0.1-2017-11-02-05:45:37]# rpm -qa |grep OVMF
OVMF-20171011-1.git92d07e48907f.el7.noarch

How reproducible:
3/3

Steps to Reproduce:
1. Boot up guest with pf device [1]
2. check the host status

Actual results:
host crash and system reboot

[ 136.278486] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
[ 136.287714] {1}[Hardware Error]: event severity: fatal
[ 136.293448] {1}[Hardware Error]: Error 0, type: fatal
[ 136.299182] {1}[Hardware Error]: section_type: PCIe error
[ 136.305397] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 136.312135] {1}[Hardware Error]: version: 1.16
[ 136.317286] {1}[Hardware Error]: command: 0x0406, status: 0x0010
[ 136.324181] {1}[Hardware Error]: device_id: 0000:04:00.1
[ 136.330299] {1}[Hardware Error]: slot: 0
[ 136.334867] {1}[Hardware Error]: secondary_bus: 0x00
[ 136.340600] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1583
[ 136.347979] {1}[Hardware Error]: class_code: 000002
[ 136.353615] Kernel panic - not syncing: Fatal hardware error!

Expected results:
guest boot up successfully

Additional info:

1. X710 nic info:

[root@dell-per730-29 yiwei]# lspci -vvv -s 04:00.0
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
    Subsystem: Intel Corporation Ethernet Converged Network Adapter XL710-Q2
    Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 53
    NUMA node: 0
    Region 0: Memory at 91000000 (64-bit, prefetchable) [disabled] [size=16M]
    Region 3: Memory at 92808000 (64-bit, prefetchable) [disabled] [size=32K]
    Expansion ROM at 92b00000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Address: 0000000000000000 Data: 0000
        Masking: 00000000 Pending: 00000000
    Capabilities: [70] MSI-X: Enable- Count=129 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00001000
    Capabilities: [a0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <2us, L1 <16us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [e0] Vital Product Data
        Product Name: XL710 40GbE Controller
        Read-only fields:
            [V0] Vendor specific: FFV17.5.3
            [PN] Part number: KF46X
            [MN] Manufacture ID: 31 30 32 38
            [V1] Vendor specific: DSV1028VPDR.VER2.0
            [V3] Vendor specific: DTINIC
            [V4] Vendor specific: DCM10010380C521010380C512020380C523020380C514030380C525030380C516040380C527040380C518050380C529050380C51A060380C52B060380C51C070380C52D070380C51E080380C52F080380C5
            [V5] Vendor specific: NPY2
            [V6] Vendor specific: PMTA
            [V7] Vendor specific: NMVIntel Corp
            [V8] Vendor specific: L1D0
            [RV] Reserved: checksum good, 1 byte(s) reserved
        Read/write fields:
            [Y1] System specific: CCF1
        End
    Capabilities: [100 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
    Capabilities: [140 v1] Device Serial Number 88-90-15-ff-ff-fe-fd-3c
    Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
        ARICap: MFVC- ACS-, Next Function: 1
        ARICtl: MFVC- ACS-, Function Group: 0
    Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
        IOVCap: Migration-, Interrupt Message Number: 000
        IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
        IOVSta: Migration-
        Initial VFs: 64, Total VFs: 64, Number of VFs: 0, Function Dependency Link: 00
        VF offset: 16, stride: 1, Device ID: 154c
        Supported Page Size: 00000553, System Page Size: 00000001
        Region 0: Memory at 0000000092400000 (64-bit, prefetchable)
        Region 3: Memory at 0000000092910000 (64-bit, prefetchable)
        VF Migration: offset: 00000000, BIR: 0
    Capabilities: [1a0 v1] Transaction Processing Hints
        Device specific mode supported
        No steering table available
    Capabilities: [1b0 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [1d0 v1] #19
    Kernel driver in use: vfio-pci
    Kernel modules: i40e

2. [1] qemu command line

/usr/libexec/qemu-kvm \
-M q35 \
-rtc base=utc \
-m 2G \
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/home/yiwei/OVMF_VARS.fd,if=pflash,format=raw,unit=1 \
-serial unix:/tmp/serial0,server,nowait \
-smp 2,sockets=2,cores=1,threads=1 \
-enable-kvm \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-k en-us \
-global isa-debugcon.iobase=0x402 \
-boot menu=on \
-qmp tcp:0:6666,server,nowait \
-vnc :1 \
-vga qxl \
-drive file=/home/yiwei/ovmf-guest.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive-scsi-disk0,id=scsi-disk0,bootindex=0 \
-device pcie-root-port,id=root2,slot=2 \
-device vfio-pci,host=04:00.0,id=pf-3,bus=pcie.0 \
-monitor stdio \
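
When comparing crashes across hosts or NICs, it may be convenient to pull the fields of the APEI record shown under "Actual results" into a structure. The following is a minimal sketch, not something used in the original report: `parse_apei_record` is a hypothetical helper that only assumes the `[Hardware Error]: key: value` line shape seen in the log above. Values are kept as lists because the kernel prints `device_id` twice (once as the PCI address, once as the numeric device ID).

```python
import re

# Matches kernel log lines such as:
#   [ 136.287714] {1}[Hardware Error]: event severity: fatal
LINE_RE = re.compile(r"\[Hardware Error\]:\s*(.+)$")

SAMPLE = """\
[ 136.287714] {1}[Hardware Error]: event severity: fatal
[ 136.299182] {1}[Hardware Error]: section_type: PCIe error
[ 136.305397] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 136.324181] {1}[Hardware Error]: device_id: 0000:04:00.1
[ 136.340600] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x1583
[ 136.353615] Kernel panic - not syncing: Fatal hardware error!
"""


def parse_apei_record(text):
    """Collect 'key: value' pairs from '[Hardware Error]:' lines.

    Returns a dict mapping each key to the list of values seen for it,
    in order of appearance. Lines without the marker are ignored.
    """
    fields = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        # One log line may carry several comma-separated pairs.
        for part in match.group(1).split(","):
            key, sep, value = part.partition(":")
            if sep:
                fields.setdefault(key.strip(), []).append(value.strip())
    return fields


record = parse_apei_record(SAMPLE)
print(record["event severity"], record["device_id"])
```

Run against the log in this report, the first `device_id` value is `0000:04:00.1`, that is, function 1 of the adapter, while the QEMU command line assigns function 0 (`host=04:00.0`).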