Bug 1562750 - VM doesn't boot from HD

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Liran Rotenberg <lrotenbe>
Assignee: Fam Zheng <famz>
QA Contact: aihua liang <aliang>
Docs Contact:
CC: aliang, areis, asanders, chayang, chhu, coli, cshao, dgilbert, dyuan, famz, fweimer, huzhao, jiyan, juzhang, knoel, kraxel, lersek, lmen, lrotenbe, mavital, michal.skrivanek, michen, mkalinin, mkarg, mrezanin, ngu, pbonzini, pmatyas, qiyuan, ratamir, stefanha, virt-maint, weiwang, xuwei, xuzhang, yaniwang, ycui, ylavi
Target Milestone: rc
Target Release: ---
Keywords: Automation, AutomationBlocker, Reopened, ZStream
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.12.0-11.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-01 11:07:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1619128, 1623085
Description: Liran Rotenberg, 2018-04-02 11:18:18 UTC
Hi, Liran. I tried several scenarios in pure libvirt, but I am not sure I hit this bug, because in test scenario S2 below the VM cannot even start ("IOThreads not available for bus scsi target sda"). Could you please provide more details about the disk configuration? Thanks in advance.

When I checked 'engine.log' I found the XML configuration below. Note that I did not find a disk on the 'scsi' bus, only on the 'virtio' bus.

# cat engine.log
<disk type="file" device="cdrom" snapshot="no">
    <driver name="qemu" type="raw" error_policy="report"/>
    <source file="" startupPolicy="optional"/>
    <target dev="hdc" bus="ide"/>
    <readonly/>
    <alias name="ua-acb27e03-ae91-4876-8c26-c75092d9be36"/>
    <address bus="1" controller="0" unit="0" type="drive" target="0"/>
</disk>
<disk snapshot="no" type="file" device="disk">
    <target dev="vda" bus="virtio"/>
    <source file="/rhev/data-center/3ba182ae-3a1b-4c96-b742-41ca091cfc5e/7b10e8fe-c6bb-496a-8873-80b1067629db/images/d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c/2653a8c8-6aa6-4466-aa85-0ba5e58af129"/>
    <driver name="qemu" iothread="1" io="threads" type="qcow2" error_policy="stop" cache="none"/>
    <alias name="ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c"/>
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci"/>
    <boot order="1"/>
    <serial>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</serial>
</disk>

Then I tried the following scenarios in pure libvirt.

Versions:
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
libvirt-3.9.0-14.virtcov.el7_5.2.x86_64
kernel-3.10.0-862.el7.x86_64

S1: Use the XML above and start the VM.

# virsh dumpxml test1 --inactive |grep "<disk" -A10
<disk type='file' device='disk' snapshot='no'>
    <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
    <source file='/var/lib/libvirt/images/optionalRHEL-7.5-x86_64-latest.qcow2'/>
    <target dev='sda' bus='virtio'/>
    <serial>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</serial>
    <boot order='1'/>
    <alias name='ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>
<disk type='file' device='cdrom'>
    <driver name='qemu' type='raw' error_policy='report'/>
    <source startupPolicy='optional'/>
    <target dev='hdc' bus='ide'/>
    <readonly/>
    <alias name='ua-acb27e03-ae91-4876-8c26-c75092d9be36'/>
    <address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>

# virsh start test1
Domain test1 started

# virsh console test1
Connected to domain test1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 Beta (Maipo)
Kernel 3.10.0-855.el7.x86_64 on an x86_64

localhost login: root
Password:
[root@localhost ~]#

S2: Use the XML above, but replace the 'virtio' bus with the 'scsi' bus.
# virsh dumpxml test1 --inactive |grep "<disk" -A10
<disk type='file' device='disk' snapshot='no'>
    <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
    <source file='/var/lib/libvirt/images/optionalRHEL-7.5-x86_64-latest.qcow2'/>
    <target dev='sda' bus='scsi'/>
    <serial>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</serial>
    <boot order='1'/>
    <alias name='ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c'/>
    <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='cdrom'>
    <driver name='qemu' type='raw' error_policy='report'/>
    <source startupPolicy='optional'/>
    <target dev='hdc' bus='ide'/>
    <readonly/>
    <alias name='ua-acb27e03-ae91-4876-8c26-c75092d9be36'/>
    <address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>

# virsh start test1
error: Failed to start domain test1
error: unsupported configuration: IOThreads not available for bus scsi target sda

Sorry, that was the wrong way to enable an iothread for virtio-scsi; updated test steps follow. I tried maybe more than 20 times and could always log in to the guest.

S2 (updated): 'virtio-scsi' bus disk with the iothread enabled on the controller.

# virsh dumpxml test1 --inactive
<disk type='file' device='disk' snapshot='no'>
    <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads'/>
    <source file='/var/lib/libvirt/images/optionalRHEL-7.5-x86_64-latest.qcow2'/>
    <target dev='sda' bus='scsi'/>
    <serial>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</serial>
    <boot order='1'/>
    <alias name='ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c'/>
    <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
<disk type='file' device='cdrom'>
    <driver name='qemu' type='raw' error_policy='report'/>
    <source startupPolicy='optional'/>
    <target dev='hdc' bus='ide'/>
    <readonly/>
    <alias name='ua-acb27e03-ae91-4876-8c26-c75092d9be36'/>
    <address type='drive' controller='0' bus='1' target='0' unit='0'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'>
    <driver iothread='1'/>
    <alias name='ua-46657f70-84ff-4a4f-9ff2-345441b34e7a'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</controller>

# virsh start test1
Domain test1 started

# virsh console test1
Connected to domain test1
Escape character is ^]

Red Hat Enterprise Linux Server 7.5 Beta (Maipo)
Kernel 3.10.0-855.el7.x86_64 on an x86_64

localhost login: root
Password:
[root@localhost ~]#

Could you please upload an engine.log with the complete dumpxml for a 'virtio-scsi' bus disk? (I can only find the 'virtio' bus in the current engine.log.)

Created attachment 1416719 [details]
engine dumpxml
Using the original XML from 'engine.log' I can hit this bug: sometimes the VM boots successfully and sometimes it fails, roughly half and half.

Steps: define the VM from the XML file and then start it.

# diff bug.xml originalfromengine.xml
3c3
< <name>bug</name>
---
> <name>golden_env_mixed_virtio_3</name>
37,39d36
< <graphics type='vnc' port='-1' autoport='yes'>
< <listen type='address'/>
< </graphics>
42c39
< <source mode="bind" path="/var/lib/libvirt/qemu/1"/>
---
> <source mode="bind" path="/var/lib/libvirt/qemu/channels/3e685cf8-ea27-4a87-9c51-0f7edb550e64.ovirt-guest-agent.0"/>
46c43
< <source mode="bind" path="/var/lib/libvirt/qemu/2"/>
---
> <source mode="bind" path="/var/lib/libvirt/qemu/channels/3e685cf8-ea27-4a87-9c51-0f7edb550e64.org.qemu.guest_agent.0"/>
84a82
> <listen type="network" network="vdsm-ovirtmgmt"/>
92c90
< <source bridge="br0"/>
---
> <source bridge="ovirtmgmt"/>
109c107
< <source file="/var/lib/libvirt/images/optionalRHEL-7.5-x86_64-latest.qcow2"/>
---
> <source file="/rhev/data-center/c73ab90f-ae2e-4289-bfd5-076a20c1501f/3bcff1ce-0131-4a5b-bddc-59240b1e6b36/images/48800e5d-1cd3-461b-b690-ed3a9684fe62/7ce2a429-5e5c-4d90-8556-46db52aa1987"/>

# virsh define bug.xml
Domain bug defined from bug.xml

# virsh start bug
Domain bug started

As for the result, sometimes the VM just hangs at 'Booting from Hard Disk' and sometimes I can log in to the VM. The following attachment 'bug.tar.gz' includes: bug.xml, originalfromengine.xml, failedboot.png and successfullyboot.png.

Created attachment 1417002 [details]
attachment for comment7

I was not able to reproduce this issue and Jing Yan is not able to reproduce it anymore either. However, the limited logs we have for the last domain start which was reported as the one that hit this bug show that libvirt started QEMU with the specified disk as expected, and QEMU did not report any errors accessing the disk. In other words, it looks like some magic issue between QEMU and SeaBIOS.
vdsm.log: 2018-04-02 11:44:55,592+0300 INFO (vm/cadb924c) [virt.vm] (vmId='cadb924c-9eff-4f96-a4d9-42d053d62da9') <?xml version='1.0' encoding='utf-8'?> <domain xmlns:ns0="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0" type="kvm"> <name>golden_env_mixed_virtio_1_0</name> <uuid>cadb924c-9eff-4f96-a4d9-42d053d62da9</uuid> <memory>1048576</memory> <currentMemory>1048576</currentMemory> <iothreads>1</iothreads> <maxMemory slots="16">4194304</maxMemory> <vcpu current="1">16</vcpu> <sysinfo type="smbios"> <system> <entry name="manufacturer">oVirt</entry> <entry name="product">RHEV Hypervisor</entry> <entry name="version">7.5-8.el7</entry> <entry name="serial">4C4C4544-0056-3010-8056-C8C04F563232</entry> <entry name="uuid">cadb924c-9eff-4f96-a4d9-42d053d62da9</entry> </system> </sysinfo> <clock adjustment="0" offset="variable"> <timer name="rtc" tickpolicy="catchup" /> <timer name="pit" tickpolicy="delay" /> <timer name="hpet" present="no" /> </clock> <features> <acpi /> </features> <cpu match="exact"> <model>Conroe</model> <topology cores="1" sockets="16" threads="1" /> <numa> <cell cpus="0" memory="1048576" /> </numa> </cpu> <cputune /> <devices> <input bus="ps2" type="mouse" /> <channel type="unix"> <target name="ovirt-guest-agent.0" type="virtio" /> <source mode="bind" path="/var/lib/libvirt/qemu/channels/cadb924c-9eff-4f96-a4d9-42d053d62da9.ovirt-guest-agent.0" /> </channel> <channel type="unix"> <target name="org.qemu.guest_agent.0" type="virtio" /> <source mode="bind" path="/var/lib/libvirt/qemu/channels/cadb924c-9eff-4f96-a4d9-42d053d62da9.org.qemu.guest_agent.0" /> </channel> <graphics autoport="yes" passwd="*****" passwdValidTo="1970-01-01T00:00:01" port="-1" tlsPort="-1" type="spice"> <channel mode="secure" name="main" /> <channel mode="secure" name="inputs" /> <channel mode="secure" name="cursor" /> <channel mode="secure" name="playback" /> <channel mode="secure" name="record" /> <channel mode="secure" name="display" /> <channel mode="secure" name="smartcard" /> <channel mode="secure" name="usbredir" /> <listen network="vdsm-ovirtmgmt" type="network" /> </graphics> <controller index="0" model="piix3-uhci" type="usb"> <address bus="0x00" domain="0x0000" function="0x2" slot="0x01" type="pci" /> </controller> <rng model="virtio"> <backend model="random">/dev/urandom</backend> <alias name="ua-70d0edb3-4dc7-48d5-be0a-30109d5c50a6" /> </rng> <controller index="0" model="virtio-scsi" type="scsi"> <driver iothread="1" /> <alias name="ua-97cc068a-eafa-4c6b-a1e1-d61ffbcd6524" /> </controller> <controller index="0" type="ide"> <address bus="0x00" domain="0x0000" function="0x1" slot="0x01" type="pci" /> </controller> <memballoon model="virtio"> <stats period="5" /> <alias name="ua-a925c27a-880d-43a1-adf3-b398968fd8f8" /> <address bus="0x00" domain="0x0000" function="0x0" slot="0x08" type="pci" /> </memballoon> <sound model="ich6"> <alias name="ua-bd10e3e9-bb2d-4eac-a56b-944122b21d8d" /> <address bus="0x00" domain="0x0000" function="0x0" slot="0x04" type="pci" /> </sound> <video> <model heads="1" ram="65536" type="qxl" vgamem="16384" vram="8192" /> <alias name="ua-d6bf25f1-27c1-4977-ba25-29041fea2c1a" /> <address bus="0x00" domain="0x0000" function="0x0" slot="0x02" type="pci" /> </video> <controller index="0" ports="16" type="virtio-serial"> <alias name="ua-dfd83dc8-909c-4c5f-9af2-4d546abe82db" /> <address bus="0x00" domain="0x0000" function="0x0" slot="0x06" type="pci" /> </controller> <channel type="spicevmc"> <target name="com.redhat.spice.0" type="virtio" /> </channel> 
<disk device="cdrom" snapshot="no" type="file"> <address bus="1" controller="0" target="0" type="drive" unit="0" /> <source file="" startupPolicy="optional" /> <target bus="ide" dev="hdc" /> <readonly /> <driver error_policy="report" name="qemu" type="raw" /> <alias name="ua-acb27e03-ae91-4876-8c26-c75092d9be36" /> </disk> <disk device="disk" snapshot="no" type="file"> <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci" /> <source file="/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__4__nfs__0/7b10e8fe-c6bb-496a-8873-80b1067629db/images/d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c/2653a8c8-6aa6-4466-aa85-0ba5e58af129" /> <target bus="virtio" dev="vda" /> <serial>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</serial> <boot order="1" /> <driver cache="none" error_policy="stop" io="threads" iothread="1" name="qemu" type="qcow2" /> <alias name="ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c" /> </disk> <interface type="bridge"> <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci" /> <mac address="00:1a:4a:16:88:c9" /> <model type="virtio" /> <source bridge="ovirtmgmt" /> <filterref filter="vdsm-no-mac-spoofing" /> <alias name="ua-e5ae5ef0-eb1e-41bf-ba4b-277c95d52131" /> </interface> </devices> <pm> <suspend-to-disk enabled="no" /> <suspend-to-mem enabled="no" /> </pm> <os> <type arch="x86_64" machine="pc-i440fx-rhel7.5.0">hvm</type> <smbios mode="sysinfo" /> </os> <metadata> <ns0:qos /> <ovirt-vm:vm> <minGuaranteedMemoryMb type="int">1024</minGuaranteedMemoryMb> <clusterVersion>4.2</clusterVersion> <ovirt-vm:custom /> <ovirt-vm:device mac_address="00:1a:4a:16:88:c9"> <ovirt-vm:custom /> </ovirt-vm:device> <ovirt-vm:device devtype="disk" name="vda"> <ovirt-vm:poolID>3ba182ae-3a1b-4c96-b742-41ca091cfc5e</ovirt-vm:poolID> <ovirt-vm:volumeID>2653a8c8-6aa6-4466-aa85-0ba5e58af129</ovirt-vm:volumeID> <ovirt-vm:imageID>d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c</ovirt-vm:imageID> <ovirt-vm:domainID>7b10e8fe-c6bb-496a-8873-80b1067629db</ovirt-vm:domainID> </ovirt-vm:device> <launchPaused>false</launchPaused> <resumeBehavior>auto_resume</resumeBehavior> </ovirt-vm:vm> </metadata> </domain> qemu.log: 2018-04-02 08:44:56.272+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.2 (Red Hat, Inc. 
<http://bugzilla.redhat.com/bugzilla>, 2018-03-20-14:43:16, x86-041.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.1), hostname: lynx12.lab.eng.tlv2.redhat.com LC_ALL=C \ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \ QEMU_AUDIO_DRV=spice \ /usr/libexec/qemu-kvm \ -name guest=golden_env_mixed_virtio_1_0,debug-threads=on \ -S \ -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-66-golden_env_mixed_vir/master-key.aes \ -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,dump-guest-core=off \ -cpu Conroe \ -m size=1048576k,slots=16,maxmem=4194304k \ -realtime mlock=off \ -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \ -object iothread,id=iothread1 \ -numa node,nodeid=0,cpus=0,mem=1024 \ -uuid cadb924c-9eff-4f96-a4d9-42d053d62da9 \ -smbios 'type=1,manufacturer=oVirt,product=RHEV Hypervisor,version=7.5-8.el7,serial=4C4C4544-0056-3010-8056-C8C04F563232,uuid=cadb924c-9eff-4f96-a4d9-42d053d62da9' \ -no-user-config \ -nodefaults \ -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-66-golden_env_mixed_vir/monitor.sock,server,nowait \ -mon chardev=charmonitor,id=monitor,mode=control \ -rtc base=2018-04-02T08:44:55,driftfix=slew \ -global kvm-pit.lost_tick_policy=delay \ -no-hpet \ -no-shutdown \ -global PIIX4_PM.disable_s3=1 \ -global PIIX4_PM.disable_s4=1 \ -boot strict=on \ -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ -device virtio-scsi-pci,iothread=iothread1,id=ua-97cc068a-eafa-4c6b-a1e1-d61ffbcd6524,bus=pci.0,addr=0x5 \ -device virtio-serial-pci,id=ua-dfd83dc8-909c-4c5f-9af2-4d546abe82db,max_ports=16,bus=pci.0,addr=0x6 \ -drive if=none,id=drive-ua-acb27e03-ae91-4876-8c26-c75092d9be36,readonly=on,werror=report,rerror=report \ -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-acb27e03-ae91-4876-8c26-c75092d9be36,id=ua-acb27e03-ae91-4876-8c26-c75092d9be36 \ -drive file=/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__4__nfs__0/7b10e8fe-c6bb-496a-8873-80b1067629db/images/d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c/2653a8c8-6aa6-4466-aa85-0ba5e58af129,format=qcow2,if=none,id=drive-ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,serial=d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,cache=none,werror=stop,rerror=stop,aio=threads \ -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,id=ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,bootindex=1 \ -netdev tap,fd=33,id=hostua-e5ae5ef0-eb1e-41bf-ba4b-277c95d52131,vhost=on,vhostfd=35 \ -device virtio-net-pci,netdev=hostua-e5ae5ef0-eb1e-41bf-ba4b-277c95d52131,id=ua-e5ae5ef0-eb1e-41bf-ba4b-277c95d52131,mac=00:1a:4a:16:88:c9,bus=pci.0,addr=0x3 \ -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/cadb924c-9eff-4f96-a4d9-42d053d62da9.ovirt-guest-agent.0,server,nowait \ -device virtserialport,bus=ua-dfd83dc8-909c-4c5f-9af2-4d546abe82db.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 \ -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/cadb924c-9eff-4f96-a4d9-42d053d62da9.org.qemu.guest_agent.0,server,nowait \ -device virtserialport,bus=ua-dfd83dc8-909c-4c5f-9af2-4d546abe82db.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \ -chardev spicevmc,id=charchannel2,name=vdagent \ -device virtserialport,bus=ua-dfd83dc8-909c-4c5f-9af2-4d546abe82db.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 \ -spice 
port=5902,tls-port=5903,addr=10.46.16.27,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \ -device qxl-vga,id=ua-d6bf25f1-27c1-4977-ba25-29041fea2c1a,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 \ -device intel-hda,id=ua-bd10e3e9-bb2d-4eac-a56b-944122b21d8d,bus=pci.0,addr=0x4 \ -device hda-duplex,id=ua-bd10e3e9-bb2d-4eac-a56b-944122b21d8d-codec0,bus=ua-bd10e3e9-bb2d-4eac-a56b-944122b21d8d.0,cad=0 \ -device virtio-balloon-pci,id=ua-a925c27a-880d-43a1-adf3-b398968fd8f8,bus=pci.0,addr=0x8 \ -object rng-random,id=objua-70d0edb3-4dc7-48d5-be0a-30109d5c50a6,filename=/dev/urandom \ -device virtio-rng-pci,rng=objua-70d0edb3-4dc7-48d5-be0a-30109d5c50a6,id=ua-70d0edb3-4dc7-48d5-be0a-30109d5c50a6,bus=pci.0,addr=0x9 \ -msg timestamp=on 2018-04-02 08:44:56.287+0000: 80940: info : virObjectUnref:350 : OBJECT_UNREF: obj=0x7f2210114a90 2018-04-02T08:44:56.466027Z qemu-kvm: -drive file=/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__4__nfs__0/7b10e8fe-c6bb-496a-8873-80b1067629db/images/d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c/2653a8c8-6aa6-4466-aa85-0ba5e58af129,format=qcow2,if=none,id=drive-ua-d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,serial=d5316ae9-c6a4-4cca-a3ee-ef890f63cd8c,cache=none,werror=stop,rerror=stop,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead 2018-04-02T08:44:56.478337Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: CPU 1 [socket-id: 1, core-id: 0, thread-id: 0], CPU 2 [socket-id: 2, core-id: 0, thread-id: 0], CPU 3 [socket-id: 3, core-id: 0, thread-id: 0], CPU 4 [socket-id: 4, core-id: 0, thread-id: 0], CPU 5 [socket-id: 5, core-id: 0, thread-id: 0], CPU 6 [socket-id: 6, core-id: 0, thread-id: 0], CPU 7 [socket-id: 7, core-id: 0, thread-id: 0], CPU 8 [socket-id: 8, core-id: 0, thread-id: 0], CPU 9 [socket-id: 9, core-id: 0, thread-id: 0], CPU 10 [socket-id: 10, core-id: 0, thread-id: 0], CPU 11 [socket-id: 11, core-id: 0, thread-id: 0], CPU 12 [socket-id: 12, core-id: 0, thread-id: 0], CPU 13 [socket-id: 13, core-id: 0, thread-id: 0], CPU 14 [socket-id: 14, core-id: 0, thread-id: 0], CPU 15 [socket-id: 15, core-id: 0, thread-id: 0] 2018-04-02T08:44:56.478362Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future main_channel_link: add main channel client main_channel_client_handle_pong: net test: latency 6.278000 ms, bitrate 143982002 bps (137.311937 Mbps) inputs_connect: inputs channel client create red_qxl_set_cursor_peer: red_channel_client_disconnect: rcc=0x5613c1b4cdd0 (channel=0x5613c1bdf090 type=4 id=0) red_channel_client_disconnect: rcc=0x5613c267a5b0 (channel=0x5613c092ea30 type=2 id=0) red_channel_client_disconnect: rcc=0x5613c12f89c0 (channel=0x5613c092f100 type=3 id=0) red_channel_client_disconnect: rcc=0x5613c24881b0 (channel=0x5613c092eae0 type=5 id=0) red_channel_client_disconnect: rcc=0x5613c1b901b0 (channel=0x5613c092eba0 type=6 id=0) red_channel_client_disconnect: rcc=0x5613c131c9f0 (channel=0x5613c092e960 type=1 id=0) main_channel_client_on_disconnect: rcc=0x5613c131c9f0 red_client_destroy: destroy client 0x5613c0970170 with #channels=6 red_qxl_disconnect_cursor_peer: red_qxl_disconnect_display_peer: 2018-04-02T08:45:13.486342Z qemu-kvm: terminating on signal 15 
from pid 72746 (<unknown process>) 2018-04-02 08:45:13.686+0000: shutting down, reason=destroyed

(In reply to Jiri Denemark from comment #9)
> I was not able to reproduce this issue and Jing Yan is not able to reproduce
> it anymore either.

:-( CondNAK(Reproducer). In case it reproduces somehow, somewhere, please report the details here.

> However, the limited logs we have for the last domain start
> which was reported as the one that hit this bug show that libvirt started QEMU
> with the specified disk as expected and QEMU did not report any errors
> accessing the disk. In other words, it looks like some magic issue between
> QEMU and SeaBIOS.

OK, we'll keep an eye on it, but we are waiting for a reproducer.

QE tried 20 times and could not reproduce this issue with the attached XML file.

Versions:
Host:
kernel-3.10.0-862.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1
libvirt-3.9.0-14.el7.x86_64
seabios-bin-1.11.0-2.el7.noarch
Guest:
kernel-3.10.0-862.el7.x86_64

(In reply to Xueqiang Wei from comment #11)
> QE tried 20 times, could not reproduce this issue by attached xml file.

Looks like this can't be reproduced anywhere (see also comment #9), so I'm closing this bug. Please reopen if it can be reproduced somehow or if there's new data.

Hi,
It seems this bug has popped up again. Reopening and attaching logs.

ovirt-engine-4.2.5-0.1.el7ev.noarch
kernel-3.10.0-862.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
seabios-bin-1.11.0-2.el7.noarch
vdsm-4.20.33-1.el7ev.x86_64

Additional info: the logs contain the sosreport for host3 (the host with the problematic start-up, also the SPM) and the engine log. The correlation ID on the engine is vms_syncAction_5e62130a-9eb2-41a5.

Created attachment 1459389 [details]
logs
(In reply to Liran Rotenberg from comment #13)
> Hi,
> It seems this bug has popped up again. Reopening and attaching logs.
>
> ovirt-engine-4.2.5-0.1.el7ev.noarch
> kernel-3.10.0-862.el7.x86_64
> qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
> seabios-bin-1.11.0-2.el7.noarch
> vdsm-4.20.33-1.el7ev.x86_64
>
> Additional info: the logs contain the sosreport for host3 (the host with the
> problematic start-up, also the SPM) and the engine log.
> The correlation ID on the engine is vms_syncAction_5e62130a-9eb2-41a5.

Can we access the machine where this reproduces? How frequently is it reproducing? Are the steps/configuration the same as before?

(In reply to Jiri Denemark from comment #9)
> I was not able to reproduce this issue and Jing Yan is not able to reproduce
> it anymore either. However, the limited logs we have for the last domain start
> which was reported as the one that hit this bug show that libvirt started QEMU
> with the specified disk as expected and QEMU did not report any errors
> accessing the disk. In other words, it looks like some magic issue between
> QEMU and SeaBIOS.
> qemu.log:
> 2018-04-02 08:44:56.272+0000: starting up libvirt version: 3.9.0, package: 14.el7_5.2
> (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-20-14:43:16,
> x86-041.build.eng.bos.redhat.com), qemu version: 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.1),
> hostname: lynx12.lab.eng.tlv2.redhat.com
> [...]

QE still could not reproduce this issue as described in comment 11 with the QEMU command line above.

"""
> 2018-04-02T08:45:13.486342Z qemu-kvm: terminating on signal 15 from pid
> 72746 (<unknown process>)
> 2018-04-02 08:45:13.686+0000: shutting down, reason=destroyed
"""
The qemu-kvm process was killed here; I am not sure whether that affected the VM boot.

I've tried to take a look at the logs in comment 13 and comment 14. The things I'm seeing could be "business as usual" for oVirt, but they are certainly new to me.

(1) Is this a nested virt scenario? When I untar "sosreport-host_mixed_3-20180717134200.tar", it produces a directory called "sosreport-virt-nested-vm13.scl.lab.tlv.redhat.com-20180717134200".

(2) Grepping "engine.log" for "virt-nested-vm13", there's a good number of "Heartbeat exceeded for host 'virt-nested-vm13.scl.lab.tlv.redhat.com'" messages.

(3) Looking at "/var/log/messages" in the sosreport, I see three soft lockups. The first such message seems to be the first lockup since the host had been booted (because when it is logged, the kernel isn't tainted yet); the other two already display "soft lockup" as the taint reason:

Jul 16 16:18:08 virt-nested-vm13 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [python:22132]
Jul 16 16:39:36 virt-nested-vm13 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [python:8725]
Jul 16 16:47:40 virt-nested-vm13 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [systemd:1]

(References: <https://access.redhat.com/solutions/40594>, <https://access.redhat.com/articles/371803>.)

Note that these soft lockups all precede the actual VM start in question -- the time for that can be located in "engine.log", using the correlation ID specified in comment 13 --> "2018-07-17 12:28:32,448+03".

(4) Looking at the generated domain XML, it's not clear to me what storage the disk image is on.
<disk snapshot="no" type="file" device="disk"> <target dev="vda" bus="virtio"/> <source file="/rhev/data-center/efffe225-3927-4563-968c-42d3d98f482e/d011d522-efb3-47f2-859e-d3e444eb4d4b/images/d4af4c5d-ade6-46d7-acde-d25e6b5109f2/10de0412-2454-46a1-a453-8750f0d3a3af"/> <driver name="qemu" iothread="1" io="threads" type="qcow2" error_policy="stop" cache="none"/> <alias name="ua-d4af4c5d-ade6-46d7-acde-d25e6b5109f2"/> <address bus="0x00" domain="0x0000" function="0x0" slot="0x07" type="pci"/> <boot order="1"/> <serial>d4af4c5d-ade6-46d7-acde-d25e6b5109f2</serial> </disk> Can it be confirmed please that this recent domain XML fragment is consistent with the originally reported symptom? ("The VM doesn't start from the bootable HD with OS. It goes to the other bootable options(PXE for example)".) Because, if the storage backend had failed, then I believe the above fragment would have caused QEMU to pause the guest (error_policy="stop"). But, according to the original report, the guest is not paused, it advances to the next boot option. Now, in the domain XML generated for correlation ID "vms_syncAction_5e62130a-9eb2-41a5", only this one disk is marked with <boot order=...>, and libvirt always generates "HALT" in the "bootorder" fw_cfg file for the firmware. Meaning that, if all of the explicitly marked boot devices fail (= the sole disk), then the firmware will not boot at all. So... points (1) through (3) suggest to me that the host is over-subscribed (nested virt, heartbeats missed, CPUs intermittently stuck), and the boot fails due to that, for some reason. Whereas point (4) suggests that the firmware fails to read from the storage backend (which is of unknown kind), such that QEMU doesn't actually detect the failure as an I/O error (because in that case QEMU would set the VM state to "paused"); as a consequence, SeaBIOS has nothing else to boot, and stops. This could be validated against the graphical display of the VM (the domain XML does specify QXL + spice). Additionally, the SeaBIOS log should be possible to capture, assuming we can modify the generated domain XML fragment, and write to some filesystem (such as /tmp). I sent Markus some ideas for this earlier. ... Is it possible that the storage backend is over-subscribed (... as well)? Because, from comment 19, "This bug reproduced very easily on <REDACTED>. Just mark all vms and start them." Launching a herd of VMs simultaneously suggests a spike in both host CPU load and storage backend load. Made some progress. The following SeaBIOS debug patch was non-intrusive enough to let the hang / spin reproduce (in 1 out of 6 VMs launched at the same time), and it provides some insight: > diff --git a/src/hw/virtio-blk.c b/src/hw/virtio-blk.c > index 88d7e54a2a92..853738f96eea 100644 > --- a/src/hw/virtio-blk.c > +++ b/src/hw/virtio-blk.c > @@ -23,57 +23,69 @@ > #include "virtio-ring.h" > #include "virtio-blk.h" > > +#define OPS_DONE_LOG_LIMIT 10 > + > struct virtiodrive_s { > struct drive_s drive; > struct vring_virtqueue *vq; > struct vp_device vp; > + unsigned ops_done; > }; > > static int > virtio_blk_op(struct disk_op_s *op, int write) > { > struct virtiodrive_s *vdrive = > container_of(op->drive_fl, struct virtiodrive_s, drive); > struct vring_virtqueue *vq = vdrive->vq; > struct virtio_blk_outhdr hdr = { > .type = write ? 
VIRTIO_BLK_T_OUT : VIRTIO_BLK_T_IN, > .ioprio = 0, > .sector = op->lba, > }; > u8 status = VIRTIO_BLK_S_UNSUPP; > struct vring_list sg[] = { > { > .addr = (void*)(&hdr), > .length = sizeof(hdr), > }, > { > .addr = op->buf_fl, > .length = vdrive->drive.blksize * op->count, > }, > { > .addr = (void*)(&status), > .length = sizeof(status), > }, > }; > + unsigned op_nr = vdrive->ops_done++; > + > + if (op_nr < OPS_DONE_LOG_LIMIT) { > + dprintf(1, "virtio-blk @ %p op#%u submit\n", vdrive, op_nr); > + } > > /* Add to virtqueue and kick host */ > if (write) > vring_add_buf(vq, sg, 2, 1, 0, 0); > else > vring_add_buf(vq, sg, 1, 2, 0, 0); > vring_kick(&vdrive->vp, vq, 1); > > /* Wait for reply */ > while (!vring_more_used(vq)) > usleep(5); > > + if (op_nr < OPS_DONE_LOG_LIMIT) { > + dprintf(1, "virtio-blk @ %p op#%u done\n", vdrive, op_nr); > + } > + > /* Reclaim virtqueue element */ > vring_get_buf(vq, NULL); > > /* Clear interrupt status register. Avoid leaving interrupts stuck if > * VRING_AVAIL_F_NO_INTERRUPT was ignored and interrupts were raised. > */ > vp_get_isr(&vdrive->vp); > > return status == VIRTIO_BLK_S_OK ? DISK_RET_SUCCESS : DISK_RET_EBADTRACK; > } 5 out of 6 VMs logged: > enter handle_19: > NULL > Booting from Hard Disk... > virtio-blk @ 0x000ebb80 op#0 submit > virtio-blk @ 0x000ebb80 op#0 done > Booting from 0000:7c00 > virtio-blk @ 0x000ebb80 op#1 submit > virtio-blk @ 0x000ebb80 op#1 done > virtio-blk @ 0x000ebb80 op#2 submit > virtio-blk @ 0x000ebb80 op#2 done > virtio-blk @ 0x000ebb80 op#3 submit > virtio-blk @ 0x000ebb80 op#3 done > virtio-blk @ 0x000ebb80 op#4 submit > virtio-blk @ 0x000ebb80 op#4 done > virtio-blk @ 0x000ebb80 op#5 submit > virtio-blk @ 0x000ebb80 op#5 done > virtio-blk @ 0x000ebb80 op#6 submit > virtio-blk @ 0x000ebb80 op#6 done > virtio-blk @ 0x000ebb80 op#7 submit > virtio-blk @ 0x000ebb80 op#7 done > virtio-blk @ 0x000ebb80 op#8 submit > virtio-blk @ 0x000ebb80 op#8 done > virtio-blk @ 0x000ebb80 op#9 submit > virtio-blk @ 0x000ebb80 op#9 done > VBE mode info request: 140 > ... but the 1 that got stuck spinning logged: > enter handle_19: > NULL > Booting from Hard Disk... > virtio-blk @ 0x000ebb80 op#0 submit > [HANG] In other words, SeaBIOS submits the virtio-blk read request for the MBR, but it never sees the request complete, so it waits forever. I'll enable some tracing in QEMU to see what happens. More progress. 
After hours of struggle which I won't go into (basically, read "/usr/share/doc/qemu-kvm/README.systemtap", and try to fill in the gaps the best you can...), I managed to create the following files on the host: "/etc/systemtap/conf.d/qemu_kvm.conf": > # Force load uprobes (see BZ#1118352) > stap -e 'probe process("/usr/libexec/qemu-kvm").function("main") { printf("") }' -c true > > qemu_kvm_OPT="-s1024" # per-CPU buffer size, in megabytes "/etc/systemtap/script.d/qemu_kvm.stp" > probe qemu.kvm.simpletrace.virtio_blk_data_plane_start {} > probe qemu.kvm.simpletrace.virtio_blk_data_plane_stop {} > probe qemu.kvm.simpletrace.virtio_blk_handle_read {} > probe qemu.kvm.simpletrace.virtio_blk_handle_write {} > probe qemu.kvm.simpletrace.virtio_blk_req_complete {} > probe qemu.kvm.simpletrace.virtio_blk_rw_complete {} > probe qemu.kvm.simpletrace.virtio_blk_submit_multireq {} > probe qemu.kvm.simpletrace.virtio_notify {} > probe qemu.kvm.simpletrace.virtio_notify_irqfd {} > probe qemu.kvm.simpletrace.virtio_queue_notify {} > probe qemu.kvm.simpletrace.virtio_set_status {} > probe qemu.kvm.simpletrace.virtqueue_alloc_element {} > probe qemu.kvm.simpletrace.virtqueue_fill {} > probe qemu.kvm.simpletrace.virtqueue_flush {} > probe qemu.kvm.simpletrace.virtqueue_pop {} Started the flight recorder: # service systemtap start qemu_kvm Started all 6 VMs on the WebUI. Made sure one was stuck again, and grabbed its PID with "top" (easy to see from the CPU load which one it was -- it was spinning). That was PID 74403. Powered off all VMs; forced off the one that was stuck. (At this point all non-stuck VMs had booted their OSes and quiesced.) Dumped the flight recorder buffer: # staprun -A qemu_kvm >/tmp/trace.dat & After "trace.dat" stopped growing, I interrupted "staprun". (This is *absolutely* non-intuitive: "staprun -A" attaches to the stap module, dumps whatever's accumulated in the buffer, and then continues dumping new entries as they appear. One has no idea when the buffer is empty, except when the file stops growing. Yes, I guess I could read the hundreds of kilobytes of stap documentation. :) ) Formatted the binary records to text: # /usr/share/qemu-kvm/simpletrace.py \ --no-header \ /usr/share/qemu-kvm/trace-events-all \ /tmp/trace.dat \ > /tmp/trace.txt Then collated the trace into files by PID: # egrep -o '\<pid=[0-9]+\>' trace.txt | sort -u pid=74403 pid=74495 pid=74589 pid=74699 pid=74852 pid=74965 # for PID in 74403 74495 74589 74699 74852 74965; do egrep '\<pid='"$PID"'\>' trace.txt > trace-$PID.txt done The file sizes are telling: # ls -l --sort=size trace-*.txt -rw-r--r--. 1 root root 7832694 Aug 2 03:53 trace-74699.txt -rw-r--r--. 1 root root 6679923 Aug 2 03:53 trace-74589.txt -rw-r--r--. 1 root root 5778395 Aug 2 03:53 trace-74852.txt -rw-r--r--. 1 root root 5309522 Aug 2 03:53 trace-74495.txt -rw-r--r--. 1 root root 4904952 Aug 2 03:53 trace-74965.txt -rw-r--r--. 1 root root 109037 Aug 2 03:53 trace-74403.txt <-- stuck guy I counted how many times each trace point was logged for each VM. 
Here's a table (PID 74403 is the stuck guy):

PID                            74403   74495   74589   74699   74852   74965
virtio_blk_data_plane_start        1       2       2       2       2       2
virtio_blk_data_plane_stop         0       1       1       1       1       1
virtio_blk_handle_read             0    5198    6228    7253    5211    5199
virtio_blk_handle_write            0     583      75      76    1095     199
virtio_blk_req_complete            0    5800    6322    7348    6325    5417
virtio_blk_rw_complete             0    5781    6303    7329    6306    5398
virtio_blk_submit_multireq         0      11       9       8      11      20
virtio_notify                      0     123     125     127     124     127
virtio_notify_irqfd                0    3871    4191    5159    4275    3409
virtio_queue_notify              279   15140   27976   33593   16623   13425
virtio_set_status                 38     122     122     122     122     122
virtqueue_alloc_element          256    6449    6974    7999    6974    6069
virtqueue_fill                   256    6448    6973    7998    6973    6068
virtqueue_flush                  256    6448    6973    7998    6973    6068
virtqueue_pop                    256    6449    6974    7999    6974    6069

The relevant observation is that there are *zero* virtio_blk_* tracepoints logged for PID=74403, beyond the sole "virtio_blk_data_plane_start". Compare comment 29: SeaBIOS did try to submit a read request. My take is that the virtio-blk data plane misses the virtio kick from SeaBIOS when the host is under load.

Trying to correlate "trace-74403.txt" with "trace-74495.txt" (i.e. the stuck one with one of the lucky ones), the last *common* entries are:

> virtio_set_status 337.846 pid=74495 vdev=0x55e0c1d1c170 val=0x0
> virtio_set_status 3.248 pid=74495 vdev=0x55e0c1d1c170 val=0x0

and the first entries in the latter trace that seem to be missing from the former (= hung one) are:

> virtio_queue_notify 102338.774 pid=74495 vdev=0x55e0c3724170 n=0x0 vq=0x55e0c3732000
> virtqueue_alloc_element 26.545 pid=74495 elem=0x55e0c42b74a0 sz=0xa8 in_num=0x2 out_num=0x1
> virtqueue_pop 4.815 pid=74495 vq=0x55e0c3732000 elem=0x55e0c42b74a0 in_num=0x2 out_num=0x1
> virtio_blk_handle_read 6.049 pid=74495 vdev=0x55e0c3724170 req=0x55e0c42b74a0 sector=0x0 nsectors=0x1
> virtio_blk_rw_complete 2705.791 pid=74495 vdev=0x55e0c3724170 req=0x55e0c42b74a0 ret=0x0
> virtio_blk_req_complete 4.540 pid=74495 vdev=0x55e0c3724170 req=0x55e0c42b74a0 status=0x0
> virtqueue_fill 3.757 pid=74495 vq=0x55e0c3732000 elem=0x55e0c42b74a0 len=0x201 idx=0x0
> virtqueue_flush 6.440 pid=74495 vq=0x55e0c3732000 count=0x1

This is where the MBR is read by one of the lucky VMs.

One interesting bit about that: note the value "102338.774". If I understand correctly, that's the time (in microseconds) since the last tracepoint was hit. It's *huge* relative to the rest of the intervals, and it's also not small in an absolute sense -- it's ~0.1 second. Why does it take so long? And, it takes so long for three out of the five non-stuck VMs:

> trace-74495.txt-virtio_queue_notify 102338.774 pid=74495 vdev=0x55e0c3724170 n=0x0 vq=0x55e0c3732000
> trace-74589.txt-virtio_queue_notify 102939.733 pid=74589 vdev=0x56082c58e170 n=0x0 vq=0x56082c59a000
> trace-74699.txt-virtio_queue_notify 106781.136 pid=74699 vdev=0x557220736170 n=0x0 vq=0x557220742000
> trace-74852.txt-virtio_queue_notify 21459.920 pid=74852 vdev=0x56231fd62170 n=0x0 vq=0x56231fd70000
> trace-74965.txt-virtio_queue_notify 108.849 pid=74965 vdev=0x55cdf3b20170 n=0x0 vq=0x55cdf3b2c000

To highlight it -- the virtio_queue_notify tracepoint is not triggered in QEMU when SeaBIOS submits the virtio-blk read request for the MBR. This means that QEMU doesn't even notice the request (hence it doesn't start any host-side IO either).

(In reply to Laszlo Ersek from comment #31)
> To highlight it -- the virtio_queue_notify tracepoint is not triggered in
> QEMU when SeaBIOS submits the virtio-blk read request for the MBR. This
> means that QEMU doesn't even notice the request (hence it doesn't start any
> host-side IO either).

In this setup I believe the notification path is MMIO -> KVM ioeventfd -> QEMU IOThread. But there is a chance that the QEMU IOThread is polling the virtqueue directly.

I think it will help if I can gdb attach to the stuck QEMU and check the VQ and event loop status. It might be a race condition specific to IOThread. Can I have access to the machine and the reproducing steps, Laszlo?

(In reply to Fam Zheng from comment #34)
> [...] Can I have access to the machine and the reproducing steps, Laszlo?

Yes, feel free to use it.

I've successfully reproduced the hang following Laszlo's steps. Then I gdb attached to the hanging QEMU. The IOThread is not making progress with a blocking aio_poll():

(gdb) bt
#0  0x00007ff7909e3f5f in ppoll () from /lib64/libc.so.6
#1  0x000055a3fc8e64cb in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322
#3  0x000055a3fc8e8195 in aio_poll (ctx=0x55a3ff44bb80, blocking=blocking@entry=true) at util/aio-posix.c:629
#4  0x000055a3fc6dab56 in iothread_run (opaque=0x55a3ff453960) at iothread.c:59
#5  0x00007ff790cc4dd5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007ff7909eeb3d in clone () from /lib64/libc.so.6

aio_poll() is only polling the dummy event notifier used in aio_notify():

#6 0x000055a3fc8e8195 in aio_poll (ctx=0x55a3ff44bb80, blocking=blocking@entry=true) at util/aio-posix.c:629
629         ret = qemu_poll_ns(pollfds, npfd, timeout);
(gdb) p npfd
$36 = 1
(gdb) p *pollfds
$37 = {fd = 13, events = 25, revents = 0}

But there is an active AioHandler which should be polled:

(gdb) set $n = ctx.aio_handlers.lh_first
(gdb) while $n
 >p *$n
 >set $n = $n.node.le_next
 >end
[hiding deleted nodes...]
$6 = {pfd = {fd = 37, events = 25, revents = 0}, io_read = 0x55a3fc652cf0 <virtio_queue_host_notifier_aio_read>, io_write = 0x0, io_poll = 0x55a3fc6530f0 <virtio_queue_host_notifier_aio_poll>, io_poll_begin = 0x55a3fc652ed0 <virtio_queue_host_notifier_aio_poll_begin>, io_poll_end = 0x55a3fc652eb0 <virtio_queue_host_notifier_aio_poll_end>, deleted = 0, opaque = 0x55a400eca068, is_external = true, node = {le_next = 0x55a3ff4721e0, le_prev = 0x55a3ff473068}}
$7 = {pfd = {fd = 13, events = 25, revents = 0}, io_read = 0x55a3fc8e4dc0 <event_notifier_dummy_cb>, io_write = 0x0, io_poll = 0x55a3fc8e4dd0 <event_notifier_poll>, io_poll_begin = 0x0, io_poll_end = 0x0, deleted = 0, opaque = 0x55a3ff44bc2c, is_external = false, node = {le_next = 0x0, le_prev = 0x55a3ff472e28}}

Checking the handler virtio_queue_host_notifier_aio_read and opaque 0x55a400eca068, this is exactly the virtio-blk dataplane handler.
So the question is why aio_poll() didn't add this node to "pollfds", or alternatively whether an aio_notify() was missed. Will continue to debug tomorrow.

Fam, thanks for the excellent analysis! :)

You mention "alternatively an aio_notify() is missed". I'm eerily reminded of:

- bug 1184405, and the corresponding downstream commits on the "rhev7/master-2.6.0" branch (aka RHV-7.3.z):
  - 3e531f1d6e28 ("aio_notify(): force main loop wakeup with SIGIO (aarch64 host only)", 2016-05-12)
  - d76a59993511 ("ignore SIGIO in tests that use AIO context (aarch64 host only)", 2016-05-12)
- and bug 1188656, and the corresponding downstream commit (on the same branch):
  - 02daa7f3cb1a ("Revert "aio_notify(): force main loop wakeup with SIGIO (aarch64 host only)"", 2016-06-21)

The last downstream commit mentioned above references the following upstream commits as the real fix:

- eabc977 ("AioContext: fix broken ctx->dispatching optimization", 2015-07-22)
- 21a03d1 ("AioContext: fix broken placement of event_notifier_test_and_clear", 2015-07-22)

I don't know if those commits have any relevance now, but the summary back then looked somewhat similar ("aio_notify() wouldn't do its job", kind of); perhaps reviewing the upstream fixes from back then could help your investigation. Just an idea. (If, at the top, you meant that aio_notify() wasn't even called, then I'm sorry about the noise.)

Summary: continuing the investigation in comment 42, I came to this change, which seems to fix the problem (it survives >20 repeats of the reproducer):

Author: Fam Zheng <famz>
Date:   Fri Aug 3 21:47:18 2018 +0800

    async: Fix aio_notify_accept

    Signed-off-by: Fam Zheng <famz>

diff --git a/util/async.c b/util/async.c
index 4dd9d95a9e..6fbadda2b1 100644
--- a/util/async.c
+++ b/util/async.c
@@ -346,7 +346,11 @@ void aio_notify(AioContext *ctx)
 
 void aio_notify_accept(AioContext *ctx)
 {
-    if (atomic_xchg(&ctx->notified, false)) {
+    /* If ctx->notify_me >= 2, another aio_poll() is waiting which may need the
+     * ctx->notifier event to wake up, so don't clear clear it because "we" are
+     * already in a running state. */
+    if (atomic_read(&ctx->notify_me) < 2
+        && atomic_xchg(&ctx->notified, false)) {
         event_notifier_test_and_clear(&ctx->notifier);
     }
 }

---

To avoid forgetting details over the weekend (I'm off on Monday too), I will elaborate on the analysis I did.

From comment 42 we already know that aio_notify() is the suspect. But before I get to it, let's say that other flaws in the event loop may also lead to the same abnormal condition. The approach to get more clues is like how Laszlo narrowed things down previously: using tracepoints. I wanted to trace what happened during the timespan of the hanging aio_poll() in the IOThread - not just that aio_poll() itself, but also other activities in the main loop and the vCPU thread.

I decided to write poor man's tracing code instead of using the existing tracepoints, for a few reasons:
1) I can significantly limit the number of events to log and analyze;
2) I can log a wide range of things, not limited by the defined QEMU tracepoints;
3) I can keep the log in a QEMU global array which can easily be inspected with gdb once I attach.

So my tracer keeps records of (file, func, line, thread id, some scalar), like any other tracer does.
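As an aside, a minimal sketch of such a flight-recorder tracer (plain C11) could look like the following; the names trace_rec, trace_buf, TRACE and fake_aio_poll are hypothetical, and this is not the code used in the debugging session, just an illustration of a global, gdb-inspectable ring buffer of (file, func, line, tid, scalar) records:

#define _GNU_SOURCE
/* Minimal sketch of a "poor man's" flight-recorder tracer (hypothetical names). */
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct trace_rec {
    const char *file;
    const char *func;
    int         line;
    long        thread;    /* kernel tid: cheap to read, easy to match in gdb */
    long        data;      /* one interesting scalar, e.g. npfd or notify_me */
};

#define TRACE_RECS 4096
static struct trace_rec trace_buf[TRACE_RECS];   /* global, so gdb can dump it */
static atomic_uint      trace_idx;

/* Append one record; wraps around, so the buffer always holds the latest events. */
#define TRACE(scalar)                                                   \
    do {                                                                \
        unsigned i_ = atomic_fetch_add(&trace_idx, 1) % TRACE_RECS;     \
        trace_buf[i_] = (struct trace_rec){                             \
            __FILE__, __func__, __LINE__,                               \
            (long)syscall(SYS_gettid), (long)(scalar)                   \
        };                                                              \
    } while (0)

/* Example instrumentation point, mirroring the kind of data shown below. */
static void fake_aio_poll(int npfd)
{
    TRACE(npfd);
}

int main(void)
{
    fake_aio_poll(1);
    fake_aio_poll(25);
    for (unsigned i = 0; i < trace_idx && i < TRACE_RECS; i++) {
        printf("%s:%s:%d tid=%ld data=%ld\n", trace_buf[i].file,
               trace_buf[i].func, trace_buf[i].line,
               trace_buf[i].thread, trace_buf[i].data);
    }
    return 0;
}

Because the buffer is a plain global array, it can be inspected from gdb with something like "p trace_buf" even when the process is wedged.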
For example, here are the most interesting events I've collected right at the point of the hanging (don't try to understand it, I'll explain): file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 623, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 619, thread = 43911, data = 94828190945312 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 620, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 621, thread = 43911, data = 25 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 622, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 623, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 619, thread = 43911, data = 94828190945216 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 620, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 646, thread = 0, data = 5 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 621, thread = 43911, data = 25 file = "util/async.c\000\000\000", func = "otify_accept", line = 356, thread = 0, data = 94828190768000 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 622, thread = 43911, data = 1 file = "util/async.c\000\000\000", func = "otify_accept", line = 357, thread = 0, data = 94828190767680 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 623, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 619, thread = 43911, data = 94828190944640 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 620, thread = 43911, data = 0 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 621, thread = 43911, data = 25 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 622, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 623, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 619, thread = 43911, data = 94828190941568 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 620, thread = 43911, data = 0 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 621, thread = 43911, data = 25 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 622, thread = 43911, data = 0 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 623, thread = 43911, data = 1 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 626, thread = 43911, data = 94828190941568 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 644, thread = 43911, data = 4 file = "util/async.c\000\000\000", func = "aio_notify\000", line = 344, thread = 0, data = 94828190768000 file = "util/async.c\000\000\000", func = "aio_notify\000", line = 345, thread = 0, data = 94828190767680 file = "util/async.c\000\000\000", func = "aio_notify\000", line = 348, thread = 0, data = 2 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 590, thread = 0, data = 94828190768000 file = "util/aio-posix.c", func = "aio_poll\000\000\000", line = 610, thread = 0, data = 1 file = "util/async.c\000\000\000", func = "otify_accept", line = 356, thread = 0, data = 94828190768000 file = "util/async.c\000\000\000", func = "otify_accept", line = 357, thread = 0, data = 94828190767680 file = "util/async.c\000\000\000", func = "otify_accept", line = 359, thread = 0, data = 1 Match the tracepoints back to the modified QEMU code, from 
the func="aio_poll" lines I know that the IOThread is doing an aio_poll() with ctx->external_disable_cnt == 1. That's why pollfds doesn't include the VQ ioeventfd. From the func="aio_notify" and func="otify_accept" (truncated aio_notify_accept) lines it's obvious that the main loop or the vCPU thread is doing aio_disable_external()/aio_enable_external().

What looks wrong here is the aio_notify_accept part: it is STEALING the ctx->notifier event before the IOThread is awakened! The fix is obvious at this point.

The hanging aio_poll() made sure ctx->notify_me is 2 before entering ppoll(). The main loop (or vCPU thread) sees that when it calls aio_enable_external(), and calls aio_notify(). The ctx->notifier is notified. But immediately afterwards, the main loop calls aio_poll(), probably from bdrv_set_aio_context() (which is very questionable by itself, but let's see what Paolo and others think). This questionable aio_poll() clears the event on ctx->notifier, so the IOThread never wakes up.

Upstream discussion started by Fam (let's make the connection bi-directional):

[Qemu-devel] [RFC PATCH] async: Fix aio_notify_accept
https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg00470.html
http://mid.mail-archive.com/20180803154955.25251-1-famz@redhat.com

> This questionable aio_poll() clears the event on ctx->notifier so the IOThread
> never wakes up.
Yeah, that aio_poll is a bit questionable... even though I added it. :) But as long as blocking is false it should be okay.
My concern was that this caused a busy wait in bdrv_set_aio_context, but again blocking=false avoids that concern.
I think the patch is correct and the idea below (skip aio_notify_accept) is certainly the right one. I commented on the upstream mailing list about less tricky ways to get the same effect, instead of changing aio_notify_accept directly.
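To make the lost-wakeup ordering above easier to see, here is a minimal standalone sketch (an illustration only, not QEMU code), with an eventfd standing in for ctx->notifier. In the real code the race window is the gap between the IOThread raising ctx->notify_me and entering ppoll(); the sketch forces that ordering explicitly so the hang is deterministic.

/*
 * Minimal sketch of the lost wakeup (illustration only, not QEMU code).
 * An eventfd plays the role of ctx->notifier.  Build with:
 *   gcc -pthread -o notify_race notify_race.c
 */
#include <pthread.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int notifier;                         /* stands in for ctx->notifier */

/* The IOThread half of a blocking aio_poll(): sleep until notified. */
static void *iothread_fn(void *arg)
{
    struct pollfd pfd = { .fd = notifier, .events = POLLIN };

    (void)arg;
    printf("iothread: entering poll(), waiting for a notification\n");
    poll(&pfd, 1, -1);                       /* hangs: the event was stolen */
    printf("iothread: woke up\n");           /* never reached */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    uint64_t v = 1;

    notifier = eventfd(0, 0);

    /* Main loop / vCPU thread: aio_enable_external() -> aio_notify() */
    if (write(notifier, &v, sizeof(v)) < 0) {
        perror("write");
    }

    /*
     * ...immediately followed by a non-blocking aio_poll() (e.g. from
     * bdrv_set_aio_context()) whose unconditional aio_notify_accept()
     * drains the event that was meant for the IOThread.
     */
    if (read(notifier, &v, sizeof(v)) < 0) {
        perror("read");
    }

    /* The IOThread now blocks with nothing left to wake it up. */
    pthread_create(&tid, NULL, iothread_fn, NULL);
    pthread_join(tid, NULL);                 /* never returns */
    return 0;
}

With the fix below, aio_notify_accept() is only called from a blocking aio_poll(), i.e. by the thread that actually armed ctx->notify_me and is waiting for the wakeup, so a concurrent non-blocking poll can no longer consume the IOThread's notification.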
Two changes are included in the build:

Author: Fam Zheng <famz>
Date:   Fri Aug 3 21:47:18 2018 +0800

    aio: Don't call aio_notify_accept unless blocking

    Signed-off-by: Fam Zheng <famz>

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 429bc0e592..c4ee624047 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -633,6 +633,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
+        aio_notify_accept(ctx);
     }
 
     /* Adjust polling time */
@@ -676,8 +677,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
         }
     }
 
-    aio_notify_accept(ctx);
-
     /* if we have any readable fds, dispatch event */
     if (ret > 0) {
         for (i = 0; i < npfd; i++) {
diff --git a/util/aio-win32.c b/util/aio-win32.c
index d6d5e02f00..0bc41e1bf2 100644
--- a/util/aio-win32.c
+++ b/util/aio-win32.c
@@ -374,10 +374,12 @@ bool aio_poll(AioContext *ctx, bool blocking)
         if (blocking) {
             assert(first);
             atomic_sub(&ctx->notify_me, 2);
+            if (first) {
+                aio_notify_accept(ctx);
+            }
         }
 
         if (first) {
-            aio_notify_accept(ctx);
             progress |= aio_bh_poll(ctx);
             first = false;
         }

commit 4ee3e6d1acea818f7a7c39ab375ed04039dfb18f
Author: Fam Zheng <famz>
Date:   Tue Aug 7 10:16:05 2018 +0800

    aio-posix: ctx->notifier doesn't count as progress

    Signed-off-by: Fam Zheng <famz>

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 5946ac09f0..429bc0e592 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -494,7 +494,8 @@ static bool run_poll_handlers_once(AioContext *ctx)
     QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!node->deleted && node->io_poll &&
             aio_node_check(ctx, node->is_external) &&
-            node->io_poll(node->opaque)) {
+            node->io_poll(node->opaque) &&
+            node->opaque != &ctx->notifier) {
             progress = true;
         }

Hi, Fam,
I can reproduce this bug without any dataplane setting, just by booting multiple VMs on one host. Could you please help confirm whether my reproducer hits this issue?
Reproduce env:
kernel version: 3.10.0-926.el7.x86_64
qemu-kvm-rhev version: qemu-kvm-rhev-2.12.0-8.el7.x86_64

****host's meminfo:*******
# cat /proc/meminfo
MemTotal:       32775484 kB
MemFree:        27778608 kB
MemAvailable:   27589008 kB
Buffers:               0 kB
Cached:           109768 kB
SwapCached:        28216 kB
Active:          4154444 kB
Inactive:         149352 kB
Active(anon):    4139712 kB
Inactive(anon):    98808 kB
Active(file):      14732 kB
Inactive(file):    50544 kB
Unevictable:       10100 kB
Mlocked:   2765958948724 kB
SwapTotal:      16515068 kB
SwapFree:       16375540 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:       4180996 kB
Mapped:            65336 kB
Shmem:             35404 kB
Slab:              85536 kB
SReclaimable:      31456 kB
SUnreclaim:        54080 kB
KernelStack:        5120 kB
PageTables:        22408 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    32902808 kB
Committed_AS:   24856892 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      391892 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:   3977216 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      267584 kB
DirectMap2M:    33282048 kB

***********host's cpuinfo:***********
# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
stepping        : 6
microcode       : 0x612
cpu MHz         : 1861.994
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bogomips        : 3723.98
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
stepping        : 6
microcode       : 0x612
cpu MHz         : 1861.994
cache size      : 6144 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bogomips        : 3723.82
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
stepping        : 6
microcode       : 0x612
cpu MHz         : 1861.994
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bogomips        : 3723.98
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU E5205 @ 1.86GHz
stepping        : 6
microcode       : 0x612
cpu MHz         : 1861.994
cache size      : 6144 KB
physical id     : 3
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority dtherm
bogomips        : 3723.82
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

Reproduce steps:
1. Start multiple VMs (without dataplane settings) on one host; in my env I started 5 VMs. The qemu command line of one VM is below:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -sandbox off \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20180803-032320-wyZb4cEl,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20180803-032320-wyZb4cEl,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idyjQL6c \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20180803-032320-wyZb4cEl,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -chardev socket,id=seabioslog_id_20180803-032320-wyZb4cEl,path=/var/tmp/seabios-20180803-032320-wyZb4cEl,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20180803-032320-wyZb4cEl,iobase=0x402 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=raw,file=//home/kvm_autotest_root/images/rhel610-32-virtio.raw \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \
    -device virtio-net-pci,mac=9a:7f:80:81:82:83,id=id6T4Avb,vectors=4,netdev=id3Wo1qc,bus=pci.0,addr=0x6 \
    -netdev tap,id=id3Wo1qc,vhost=on \
    -m 4096 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu host \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot menu=off,strict=on \
    -enable-kvm \
    -monitor stdio \
2. Wait a few minutes and check whether all VMs start successfully.

Test result:
4 VMs start successfully, but the last one fails to start, with the screen display hanging at "Booting from Hard Disk....".

Expected result:
All VMs start successfully.

Additional info:
I tried booting from scsi-hd/ide-hd/virtio-blk-pci; all of these device types can hit this issue.

Does the scratch build in comment 52 fix the issue for you?

Hi, Fam,
Sorry for the late reply; there is something wrong with my test host. While I was preparing the test env for the new build, the host hung (probably under too heavy a load, I think). The only thing I could do was restart it by force, and after the restart the heavy-load environment was gone and I couldn't rebuild it. I have tried for days with no progress. So please cc others who have the env to help verify the build. I'll continue trying to rebuild the env, and once it's ready I'll verify it.
Thanks, aliang

Adding additional info from the email thread between Fam and Liran:
Liran did 50 iterations; each iteration starts the 6 VMs in the environment. It took around 75 minutes and he didn't hit the bug. The patch (below) fixes the issue.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17644958

Fix included in qemu-kvm-rhev-2.12.0-11.el7

Hi, Liran,
We still can't reproduce this bug, even in our own RHV env with seven VMs started. The bug is "ON_QA" now, so can you help verify it in your env?
Thanks, aliang

From the RHEV side:
qemu-img-rhev-2.10.0-21.el7_5.5.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.5.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7_5.5.x86_64
libvirt-3.9.0-14.el7_5.7.x86_64
vdsm-4.20.37-1.el7ev.x86_64
ovirt-engine-4.2.6.3-0.1.el7ev.noarch

First I made sure I actually hit the bug.
Steps:
1. Create 6 VMs in the environment.
2. Pin the VMs to one host.
3. Enable IO Threads on each VM.
4. Start the 6 VMs.
5. Shut down and repeat step 4 until the bug is hit.
On the 4th iteration I got the bug on one VM.

From this step on I tested Fam's patch:
6. Install Fam's RPMs on the desired host:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17946743
The new rpms:
# rpm -qa | grep qemu
libvirt-daemon-driver-qemu-3.9.0-14.el7_5.7.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7_5.5.fzheng201808201632.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-rhev-debuginfo-2.10.0-21.el7_5.5.fzheng201808201632.x86_64
qemu-kvm-tools-rhev-2.10.0-21.el7_5.5.fzheng201808201632.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.5.fzheng201808201632.x86_64
qemu-img-rhev-2.10.0-21.el7_5.5.fzheng201808201632.x86_64
7. Go back to step 4 and try to reproduce again.

Results:
After 10 hours and 317 iterations, all the VMs came up as expected. I didn't hit the bug again. Fam's patch appears to fix this bug.

Ran regression tests on kernel-3.10.0-935 + qemu-kvm-rhev-2.12.0-11.el7; no regression was introduced.

In Liran's RHV env with qemu-kvm-rhev-2.12.0-11.el7.x86_64, tested 100 times without hitting this issue, so the bug's status is set to "Verified".
Test steps:
1. Create 6 VMs.
2. Pin the VMs to one host.
3. Enable IO Threads on each VM.
4. Start the 6 VMs and check whether they all start successfully.
5. Shut down the VMs, then start them again.
6. Repeat step 5 several times.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443