Bug 1464053

Summary:	[ppc64le] Guest file system shut down after trigger EEH for 6 times
Product:	Red Hat Enterprise Linux 7	Reporter:	Zhengtong <zhengtli>
Component:	qemu-kvm-rhev	Assignee:	David Gibson <dgibson>
Status:	CLOSED NOTABUG	QA Contact:	Zhengtong <zhengtli>
Severity:	high	Docs Contact:
Priority:	high
Version:	7.4	CC:	dgibson, hannsj_uhl, knoel, lvivier, qzhang, virt-maint, zhengtli
Target Milestone:	rc
Target Release:	7.4
Hardware:	ppc64le
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-29 04:11:58 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1473046

Description Zhengtong 2017-06-22 10:46:13 UTC

Description of problem:
Guest file system shut down after trigger EEH for 6 times

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-10.el7
Host: 3.10.0-681.el7.ppc64le
Guest: 3.10.0-682.el7.ppc64le
SLOF: SLOF-20170303-4.git66d250e.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Assign a Nic device/Mellanox device/USB controller device to guest
2. After guest boot up, trigger EEH from host with the command like this(this is for Mellanox CX-4 device):
# echo 0x0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct;lspci -ns 0004:01:00.0
3. wait in the guest for the device recovering
4. After the device recovered, repeat step 2 - 3 for another 5 times

Actual results:
Guest filesystem shutdown . most of the linux command won't work:
[root@dhcp113-11 ~]# ls
ls
-bash: /bin/ls: Input/output error

Expected results:
The device don't recoverred as expected. but the guest should works well.

Additional info:
Guest bootcmd:
[root@ibm-p8-garrison-06 SR-IOV]# cat boot.sh 
/usr/libexec/qemu-kvm \
    -name vfio-liuzt \
    -machine pseries,accel=kvm,usb=off \
    -m 8192 \
    -realtime mlock=off \
    -smp 8 \
    -no-user-config \
    -nodefaults \
    -chardev socket,id=charmonitor,path=monitor,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=utc \
    -boot strict=on \
    -device pci-ohci,id=usb,bus=pci.0,addr=0x1 \
    -device virtio-scsi-pci,id=scsi0,addr=0x2 \
    -drive file=/home/rhel74-ppc64le-virtio-scsi.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,werror=stop,rerror=stop \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
    -chardev socket,id=charserial0,path=serial,server,nowait \
    -device spapr-vty,chardev=charserial0,reg=0x30000000 \
    -device usb-kbd,id=input0 \
    -device usb-mouse,id=input1 \
    -netdev tap,id=tap0 \
    -device virtio-net-pci,netdev=tap0,id=net0 \
    -vnc 0.0.0.0:1 \
    -k en-us \
    -global spapr-nvram.reg=0x3000 \
    -qmp tcp:0:4444,server,nowait \
    -device vfio-pci,host=0004:01:00.0,id=pf1,addr=0xa \
    -monitor stdio

Comment 2 Zhengtong 2017-06-22 10:48:04 UTC

I copied the related kernel message here, after the 6th time recoverd:

[  176.633375] EEH: PE location: N/A, PHB location: N/A
[  176.633408] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-682.el7.ppc64le #1
[  176.633456] Call Trace:
[  176.633477] [c0000001fffebad0] [c00000000001b1c0] show_stack+0x80/0x330 (unreliable)
[  176.633538] [c0000001fffebb80] [c0000000009e63fc] dump_stack+0x30/0x44
[  176.633585] [c0000001fffebba0] [c00000000003fa78] eeh_check_failure+0x5d8/0x630
[  176.633640] [c0000001fffebc50] [c000000000510894] ioread32be+0x114/0x1c0
[  176.633824] [c0000001fffebcc0] [d0000000015b1f1c] poll_health+0x3c/0x280 [mlx5_core]
[  176.633882] [c0000001fffebd50] [c0000000000f43e0] call_timer_fn+0x60/0x170
[  176.633932] [c0000001fffebdf0] [c0000000000f692c] run_timer_softirq+0x2bc/0x3c0
[  176.633987] [c0000001fffebea0] [c0000000000ea374] __do_softirq+0x154/0x380
[  176.634036] [c0000001fffebf90] [c000000000028bec] call_do_softirq+0x14/0x24
[  176.634085] [c0000000011ff970] [c000000000014d10] do_softirq+0x120/0x170
[  176.634133] [c0000000011ff9b0] [c0000000000ea8f4] irq_exit+0x1e4/0x1f0
[  176.634180] [c0000000011ff9f0] [c000000000022d94] timer_interrupt+0xa4/0xe0
[  176.634229] [c0000000011ffa20] [c000000000002914] decrementer_common+0x114/0x180
[  176.634288] --- Exception: 901 at plpar_hcall_norets+0x8c/0xdc
[  176.634288]     LR = shared_cede_loop+0xb8/0xd0
[  176.634354] [c0000000011ffd10] [0000000040000000] 0x40000000 (unreliable)
[  176.634405] [c0000000011ffd80] [c0000000007b30bc] cpuidle_idle_call+0x11c/0x3c0
[  176.634459] [c0000000011ffdf0] [c00000000009e0a8] pseries_lpar_idle+0x18/0x60
[  176.634508] [c0000000011ffe50] [c00000000001bb98] arch_cpu_idle+0x68/0x160
[  176.634558] [c0000000011ffe80] [c000000000169ce0] cpu_startup_entry+0x170/0x1e0
[  176.634614] [c0000000011ffee0] [c00000000000caec] rest_init+0x9c/0xb0
[  176.634663] [c0000000011fff00] [c000000000cc3e68] start_kernel+0x4c4/0x4e0
[  176.634708] [c0000000011fff90] [c000000000009b6c] start_here_common+0x20/0xa8
[  176.634794] mlx5_core 0000:00:0a.0: assert_var[0] 0xffffffff
[  176.634844] mlx5_core 0000:00:0a.0: assert_var[1] 0xffffffff
[  176.634892] mlx5_core 0000:00:0a.0: assert_var[2] 0xffffffff
[  176.634939] mlx5_core 0000:00:0a.0: assert_var[3] 0xffffffff
[  176.634986] mlx5_core 0000:00:0a.0: assert_var[4] 0xffffffff
[  176.635034] mlx5_core 0000:00:0a.0: assert_exit_ptr 0xffffffff
[  176.635080] mlx5_core 0000:00:0a.0: assert_callra 0xffffffff
[  176.635128] mlx5_core 0000:00:0a.0: fw_ver 15.4095.65535
[  176.635167] mlx5_core 0000:00:0a.0: hw_id 0xffffffff
[  176.635207] mlx5_core 0000:00:0a.0: irisc_index 255
[  176.635255] mlx5_core 0000:00:0a.0: synd 0xff: unrecognized error
[  176.635303] mlx5_core 0000:00:0a.0: ext_synd 0xffff
[  176.635345] EEH: Detected PCI bus error on PHB#0-PE#1
[  176.635354] mlx5_core 0000:00:0a.0: health_care:192:(pid 72): handling bad device here
[  176.635377] mlx5_core 0000:00:0a.0: mlx5_handle_bad_state:151:(pid 72): Expected to see disabled NIC but it is has invalid value 3
[  176.635379] mlx5_core 0000:00:0a.0: mlx5_pci_err_detected was called
[  176.635380] mlx5_core 0000:00:0a.0: mlx5_enter_error_state:120:(pid 72): start
[  176.635391] mlx5_0:mlx5_ib_event:2473:(pid 72): warning: event on port 0
[  176.635393] mlx5_core 0000:00:0a.0: mlx5_enter_error_state:127:(pid 72): end
[  176.635454] ib_srpt disabling MAD processing failed.
[  176.635713] EEH: PHB#0-PE#1 has failed 6 times in the
[  176.635713] last hour and has been permanently disabled.
[  176.635713] Please try reseating or replacing it.
[  176.635988] EEH: of node=0000:00:0a:0
[  176.636018] EEH: PCI device/vendor: 101315b3
[  176.636064] EEH: PCI cmd/status register: 00100546
[  176.636095] EEH: PCI-E capabilities and status follow:
[  176.636208] EEH: PCI-E 00: 00024810 10008fe2 0000295e 0043f103 
[  176.636310] EEH: PCI-E 10: 11030000 00000000 00000000 00000000 
[  176.636348] EEH: PCI-E 20: 00000000 
[  176.636371] EEH: PCI-E AER capability register set follows:
[  176.636465] EEH: PCI-E AER 00: 18010001 00000000 00000000 00062010 
[  176.636559] EEH: PCI-E AER 10: 00000000 00002000 000001e4 00000000 
[  176.636654] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[  176.636708] EEH: PCI-E AER 30: 00000000 00000000 
[  176.636746] mlx5_core 0000:00:0a.0: mlx5_pci_err_detected was called
[  176.800869] mlx5_0:wait_for_async_commands:674:(pid 72): done with all pending requests
[  176.807085] mlx5_core 0000:00:0a.0: mlx5_unload_one: interface is down, NOP
[  176.807474] mlx5_core 0000:00:0a.0: mlx5_unload_one: interface is down, NOP
[  176.807615] iommu: Removing device 0000:00:0a.0 from group 0
[  176.812313] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  176.976284] iommu: Removing device 0000:00:02.0 from group 0
[  176.976407] ohci-pci 0000:00:01.0: remove, state 4
[  176.976447] usb usb1: USB disconnect, device number 1
[  176.976478] usb 1-1: USB disconnect, device number 2
[  177.162790] usb 1-2: USB disconnect, device number 3
[  177.254292] ohci-pci 0000:00:01.0: USB bus 1 deregistered
[  177.255417] iommu: Removing device 0000:00:01.0 from group 0
[  177.322544] Buffer I/O error on dev dm-0, logical block 1295208, lost async page write
[  177.323458] Buffer I/O error on dev dm-0, logical block 3381939, lost async page write
[  177.373048] iommu: Removing device 0000:00:00.0 from group 0


[root@dhcp113-11 ~]# ^[[A
ibstat
[root@dhcp113-11 ~]# ^[[A
ibstat
[root@dhcp113-11 ~]# [  184.072909] XFS (dm-0): metadata I/O error: block 0x10ff3e7 ("xlog_iodone") error 5 numblks 64
[  184.072977] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1200 of file fs/xfs/xfs_log.c.  Return address = 0xd000000001d32e4c
[  184.073125] XFS (dm-0): Log I/O Error Detected.  Shutting down filesystem
[  184.073159] XFS (dm-0): Please umount the filesystem and rectify the problem(s)

Comment 3 Zhengtong 2017-06-22 10:52:37 UTC

raw analysis: 
The system drive controller was removed too, after the target EEH testing device was offline for ever in 6th time testing.


I didn't hit this issue with this qemu command:

/usr/libexec/qemu-kvm \
...
    -device spapr-pci-host-bridge,id=br1,index=1 \
    -device virtio-scsi-pci,id=scsi0,bus=br1.0,addr=0x2 \
...
    -device spapr-pci-host-bridge,id=br2,index=2 \
    -device vfio-pci,host=0004:01:00.0,id=pf1,bus=br2.0 addr=0xa \

In this scenario, the virtio-scsi-pci controller won't be in the same iommu_group with vfio device.

Comment 4 Laurent Vivier 2017-06-22 12:04:31 UTC

According to comment 3, it really looks like a duplicate of BZ1463490

Comment 5 David Gibson 2017-08-29 04:11:58 UTC

Yes, this is basically that after enough EEH failures the entire guest IOMMU group - which means the entire vPHB is disabled.  In this setup this kills the system drive as well.

The only real solution to this is to put the system drive and passthrough devices on different vPHBs for isolation.  This work is mostly complete in libvirt.

At the kernel and qemu level this is behaving as expected.