Bug 1723530 - qemu aio=native on Gluster: Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify
Summary: qemu aio=native on Gluster: Metadata corruption detected at xfs_buf_ioend+0x5...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: qing.wang
URL:
Whiteboard:
Depends On:
Blocks: 1758964
 
Reported: 2019-06-24 18:00 UTC by Avihai
Modified: 2023-03-24 14:58 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-08 19:15:19 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
All logs and print screen of the issue (5.48 MB, application/gzip) - 2019-06-24 18:00 UTC, Avihai
vm xml extracted from vdsm.log (11.68 KB, text/plain) - 2019-06-24 21:23 UTC, Nir Soffer
gluster logs from the host (31.62 KB, application/gzip) - 2019-06-25 05:23 UTC, Avihai
print screen of the issue: 3 VMs out of 12 have the issue (291.00 KB, image/png) - 2019-08-11 14:53 UTC, Avihai


Links
Red Hat Knowledge Base (Solution) 4373791 - Last Updated: 2019-08-27 04:56:10 UTC

Description Avihai 2019-06-24 18:00:17 UTC
Created attachment 1584111 [details]
All logs and print screen of the issue

Description of problem:
I started seeing this in several TCs in 4.3.5 Tier1 runs, and only when those TCs were running on GlusterFS.

The scenario is very basic, which makes the severity more alarming, and it happens ONLY with GlusterFS (the same TCs run on iSCSI/FCP/NFS and the issue does not occur).

1) Create VM from template
2) Start VM

Result:
The VM starts and gets an IP, but right afterward you see this error in the console. From that point the IP is still reachable, ssh fails, the VM is still 'UP', but you cannot do anything with it and the console is stuck.

In console you see:
XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify

Every attempt to ssh to the VM fails, with a core dump shown in the console (see print screen).


What is different from the last run that passed:
1) RHEL7.7 (instead of RHEL7.6) hosts
2) rhv-4.3.5-2 Engine/VDSM/libvirt versions [1]
3) New sanlock-3.7.3-1.el7.x86_64

What did not change:
1) The same ENV machines (happened on several ENVs)
2) Same gluster cluster
3) Same exact test

https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.3-ge-runner-tier1/85/

An environment with the issue is available for you to look around:
hosted-engine-06.lab.eng.tlv2.redhat.com

Details:
The issue reproduces when running TestCase18868 on GlusterFS, simply creating a VM from a RHEL8 image template.

Impact:
Ping to the VM works, but ssh fails with 'Authentication failed.', so you cannot use the VM.

The issue is seen only on a gluster SD (!), in several TCs.

[root@lynx22 vdsm]# ssh root.17.164
Authentication failed.

At the VM console you see (print screen attached):
XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify

It looks like the first 128 bytes of the corrupted metadata buffer are zeroed.
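
If the guest can be booted into a rescue environment, a read-only XFS check should confirm the on-disk damage. A minimal sketch, assuming the root LV shows up as /dev/dm-0 as in the console message (the device name may differ in the rescue shell):

# xfs_repair -n /dev/dm-0

(-n reports problems only and does not modify the filesystem.)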


Engine:
2019-06-24 16:44:56,327+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-28) [vms_syncAction_5bd75391-29cf-4c1d] EVENT_ID: USER_STARTED_VM(153), VM vm_TestCase18868_
2416380014 was started by admin@internal-authz (Host: host_mixed_1)
2019-06-24 16:44:58,448+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] EVENT_ID: VM_CONSOLE_DISCONNECTED(168), User <UNKN
OWN> got disconnected from VM vm_TestCase18868_2416380014.

2019-06-24 16:45:58,573+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: USER_RUN_VM(32), VM vm_TestCase18868_2416
380014 started on Host host_mixed_1

VDSM:
2019-06-24 16:44:27,740+0300 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/dbfc4b9a-74bf-4c21-95c6-0840743fd57a.ovirt-guest-agent.0 already removed (fileutils:54)
2019-06-24 16:44:27,741+0300 WARN  (qgapoller/3) [virt.periodic.VmDispatcher] could not run <function <lambda> at 0x7fe86c1a19b0> on ['dbfc4b9a-74bf-4c21-95c6-0840743fd57a'] (periodic:289)
2019-06-24 16:44:27,746+0300 WARN  (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/dbfc4b9a-74bf-4c21-95c6-0840743fd57a.org.qemu.guest_agent.0 already removed (fileutils:54)
2019-06-24 16:45:12,051+0300 WARN  (periodic/4) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.BlockjobMonitor'> on ['dbfc4b9a-74bf-4c21-95c6-0840743fd57a'] (periodic:289)



Libvirt log:
2019-06-24 13:44:26.372+0000: 31609: error : qemuMonitorIO:718 : internal error: End of file from qemu monitor
2019-06-24 13:44:26.594+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain
2019-06-24 13:44:26.603+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain
2019-06-24 13:44:26.611+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-in -m physdev --physdev-in vnet0 -g FJ-vnet0' failed: iptables v1.4.21: goto 'FJ-vnet0' is not a chain
2019-06-24 13:44:26.619+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-host-in -m physdev --physdev-in vnet0 -g HJ-vnet0' failed: iptables v1.4.21: goto 'HJ-vnet0' is not a chain
2019-06-24 13:44:26.626+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F FP-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.634+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X FP-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.641+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F FJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.648+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X FJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.656+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F HJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.664+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X HJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.671+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ip6tables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: ip6tables v1.4.21: goto 'FP-vnet0' is not a chain

2019-06-24 13:44:59.554+0000: 31611: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F I-vnet0-arp-mac' failed: Chain 'I-vnet0-arp-mac' doesn't exist.
2019-06-24 13:44:59.563+0000: 31611: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X I-vnet0-arp-mac' failed: Chain 'I-vnet0-arp-mac' doesn't exist.
2019-06-24 13:45:07.357+0000: 31613: error : qemuDomainAgentAvailable:9131 : Guest agent is not responding: QEMU guest agent is not connected
2019-06-24 13:46:07.363+0000: 31611: error : virNetClientProgramDispatchError:174 : Cannot open log file: '/var/log/libvirt/qemu/vm_TestCase18868_2416380014.log': Device or resource busy

qemu log (vm_TestCase18868_2416380014.log):
2019-06-24T13:44:26.302416Z qemu-kvm: terminating on signal 15 from pid 31609 (<unknown process>)
red_channel_client_disconnect: rcc=0x55b8aa5961b0 (channel=0x55b8a9db42d0 type=5 id=0)
red_channel_client_disconnect: rcc=0x55b8aade01b0 (channel=0x55b8a9db4390 type=6 id=0)
2019-06-24 13:44:26.574+0000: shutting down, reason=shutdown
2019-06-24 13:44:59.896+0000: starting up libvirt version: 4.5.0, package: 22.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2019-06-12-08:32:19, x86-037.build.eng.bos.redhat.com), qemu version: 2.12.0qemu-kvm-rhev-2.12.0-33.el7, kernel: 3.10.0-1057.el7.x86_64, hostname: lynx22.lab.eng.tlv2.redhat.com
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
QEMU_AUDIO_DRV=spice \
/usr/libexec/qemu-kvm \
-name guest=vm_TestCase18868_2416380014,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-vm_TestCase18868_241/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu Westmere \
-m size=1048576k,slots=16,maxmem=4194304k \
-realtime mlock=off \
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
-numa node,nodeid=0,cpus=0,mem=1024 \
-uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
-smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=31,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=2019-06-24T13:44:58,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global PIIX4_PM.disable_s3=1 \
-global PIIX4_PM.disable_s4=1 \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device virtio-scsi-pci,id=ua-fbe4dc8f-50eb-4533-9c62-37b521706ac7,bus=pci.0,addr=0x6 \
-device virtio-serial-pci,id=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7,max_ports=16,bus=pci.0,addr=0x5 \
-drive if=none,id=drive-ua-5ae09403-2dae-4685-97ed-38838d055fa0,werror=report,rerror=report,readonly=on \
-device ide-cd,bus=ide.1,unit=0,drive=drive-ua-5ae09403-2dae-4685-97ed-38838d055fa0,id=ua-5ae09403-2dae-4685-97ed-38838d055fa0 \
-drive file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__he6__volume01/4eea07cd-bab6-4fc3-a7ce-beed72ac0f1c/images/022b31a1-5a1d-470b-b269-090a03641caa/25e111ec-ebed-491c-b9e8-b32845f87053,format=qcow2,if=none,id=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,serial=022b31a1-5a1d-470b-b269-090a03641caa,werror=stop,rerror=stop,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,id=ua-022b31a1-5a1d-470b-b269-090a03641caa,bootindex=1,write-cache=on \
-netdev tap,fd=33,id=hostua-488d441f-1215-4525-a87c-e29f336ad555,vhost=on,vhostfd=34 \
-device virtio-net-pci,host_mtu=1500,netdev=hostua-488d441f-1215-4525-a87c-e29f336ad555,id=ua-488d441f-1215-4525-a87c-e29f336ad555,mac=00:1a:4a:16:88:a4,bus=pci.0,addr=0x3 \
-chardev socket,id=charchannel0,fd=35,server,nowait \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 \
-chardev socket,id=charchannel1,fd=36,server,nowait \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
-chardev spicevmc,id=charchannel2,name=vdagent \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 \
-spice port=5900,tls-port=5901,addr=10.46.16.37,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \
-device qxl-vga,id=ua-2d6fbf5d-69ab-40f2-ac17-6036ccb4e3b6,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 \
-device intel-hda,id=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5,bus=pci.0,addr=0x4 \
-device hda-duplex,id=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5-codec0,bus=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5.0,cad=0 \
-device virtio-balloon-pci,id=ua-9a2da065-2312-4d42-b59e-b0b591b2c77d,bus=pci.0,addr=0x8 \
-object rng-random,id=objua-dfdb11b2-5569-49b4-a375-729326cd8ab4,filename=/dev/urandom \
-device virtio-rng-pci,rng=objua-dfdb11b2-5569-49b4-a375-729326cd8ab4,id=ua-dfdb11b2-5569-49b4-a375-729326cd8ab4,bus=pci.0,addr=0x9 \
-device vmcoreinfo \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2019-06-24 13:44:59.917+0000: 27700: info : virObjectUnref:344 : OBJECT_UNREF: obj=0x7f037011ede0
2019-06-24T13:45:00.075797Z qemu-kvm: -drive file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__he6__volume01/4eea07cd-bab6-4fc3-a7ce-beed72ac0f1c/images/022b31a1-5a1d-470b-b269-090a03641caa/25e111ec-ebed-491c-b9e8-b32845f87053,format=qcow2,if=none,id=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,serial=022b31a1-5a1d-470b-b269-090a03641caa,werror=stop,rerror=stop,cache=none,aio=native: 'serial' is deprecated, please use the corresponding option of '-device' instead
Spice-Message: 16:45:00.217: setting TLS option 'CipherString' to 'kECDHE+FIPS:kDHE+FIPS:kRSA+FIPS:!eNULL:!aNULL' from /etc/pki/tls/spice.cnf configuration file
2019-06-24T13:45:00.226064Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: CPU 1 [socket-id: 1, core-id: 0, thread-id: 0], CPU 2 [socket-id: 2, core-id: 0, thread-id: 0], CPU 3 [socket-id: 3, core-id: 0, thread-id: 0], CPU 4 [socket-id: 4, core-id: 0, thread-id: 0], CPU 5 [socket-id: 5, core-id: 0, thread-id: 0], CPU 6 [socket-id: 6, core-id: 0, thread-id: 0], CPU 7 [socket-id: 7, core-id: 0, thread-id: 0], CPU 8 [socket-id: 8, core-id: 0, thread-id: 0], CPU 9 [socket-id: 9, core-id: 0, thread-id: 0], CPU 10 [socket-id: 10, core-id: 0, thread-id: 0], CPU 11 [socket-id: 11, core-id: 0, thread-id: 0], CPU 12 [socket-id: 12, core-id: 0, thread-id: 0], CPU 13 [socket-id: 13, core-id: 0, thread-id: 0], CPU 14 [socket-id: 14, core-id: 0, thread-id: 0], CPU 15 [socket-id: 15, core-id: 0, thread-id: 0]
2019-06-24T13:45:00.226096Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
main_channel_link: add main channel client
main_channel_client_handle_pong: net test: latency 4.062000 ms, bitrate 79161996 bps (75.494762 Mbps)
inputs_connect: inputs channel client create
red_qxl_set_cursor_peer: 


Version-Release number of selected component (if applicable):
Engine:
ovirt-engine-4.3.5.1-0.1.el7.noarch

Host:
vdsm	vdsm-4.30.20-1.el7ev.x86_64
libvirt	libvirt-4.5.0-22.el7.x86_64
qemu-img-rhev	qemu-img-rhev-2.12.0-33.el7.x86_64
glusterfs	glusterfs-3.12.2-47.2.el7.x86_64
sanlock	sanlock-3.7.3-1.el7.x86_64
redhat_release	Red Hat Enterprise Linux Server 7.7 Beta (Maipo)

How reproducible:
Almost 100%

Steps to Reproduce:
1) Create VM from template
2) Start VM

Actual results:
The VM starts and is in the 'UP' state; you can log in to the VM and it gets an IP, but after a minute or two you see the following message in the console:

"XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify"

From that point you can only ping the VM; SSH fails with an authentication error and the console is stuck, so you cannot use the VM anymore.

Expected results:


Additional info:
What is different from the last run that passed:
1) RHEL7.7 (instead of RHEL7.6) hosts
2) rhv-4.3.5-2 Engine/VDSM/libvirt versions [1]
3) New sanlock-3.7.3-1.el7.x86_64

What did not change:
1) The same ENV machines (happened on several ENVs)
2) Same gluster cluster
3) Same exact test

Comment 1 Nir Soffer 2019-06-24 21:23:01 UTC
Created attachment 1584165 [details]
vm xml extracted from vdsm.log

Comment 2 Nir Soffer 2019-06-24 21:34:48 UTC
Avihay, we need more data about this.

- can you reproduce this manually?
- Does it happen with rhel 7.6 guest or only with rhel 7.7 guest?
- Does it happen only with rhel 7.7 host or also with rhel 7.6 hosts and 7.7 guest?
- Can you attach the vm journal? (journalctl -b) maybe start the same vm on rhel 7.6
  if the issue happens only with 7.7.
- Can you add gluster logs from /var/log/gluster*/?

When we have more info we can ask the qemu team to look at this.

Comment 3 Avihai 2019-06-25 05:23:29 UTC
Created attachment 1584203 [details]
gluster logs from the host

Comment 4 Avihai 2019-06-25 06:33:55 UTC
(In reply to Nir Soffer from comment #2)
> Avihay, we need more data about this.
> 
> - can you reproduce this manually?
Yes, but only about 70% of the time (and only on a RHEL8 guest with a gluster OS disk).
Reproducibility is ~70% manually: out of 14 created VMs (RHEL8 guest), the issue occurred 10 times.
In automation the issue occurs every time I run TestCase18868 (ran it about ~7 times).

> - Does it happen with rhel 7.6 guest or only with rhel 7.7 guest?

7.6 guest - tried 8 times, the issue did not reproduce
7.7 guest - tried 8 times, the issue did not reproduce
8.1 guest with gluster OS disk - issue occurred 10/14 times (~70% reproduction ratio) manually, and every time when running automation TestCase18868 (which creates a VM from a RHEL8 template + cold snapshot)
8.1 guest with FCP OS disk - created 8 VMs, the issue did not occur once

> - Does it happen only with rhel 7.7 host or also with rhel 7.6 hosts and 7.7
> guest?
Until now we worked only with RHEL7.6 (4.3.4) hosts with the same ENVs/gluster, and the issue did not occur once.
If you are asking whether we tested the latest 4.3.5 with RHEL7.6 hosts, we did not, as 4.3.5 is synced with RHEL7.7.


> - Can you attach the vm journal? (journalctl -b) maybe start the same vm on
> rhel 7.6
>   if the issue happens only with 7.7.
No, because when the VM reaches this state you cannot connect via ssh and the console is not responsive.

> - Can you add gluster logs from /var/log/gluster*/?
Added
> When we have more info we can ask the qemu team to look at this.


Also, more info:
I saw the issue occur on multiple environments.
The issue also occurs with 2 different gluster clusters running 2 different versions:
glusterfs 3.12.6  (TLV2 site) - our infra TierX runs are done here
glusterfs 6.3      (Raanana site) - local RHV storage team

Comment 5 Sahina Bose 2019-06-25 07:28:07 UTC
(In reply to Avihai from comment #4)
> (In reply to Nir Soffer from comment #2)
> > Avihay, we need more data about this.
> > 
> > - can you reproduce this manually?
> Yes indeed but about 70% of the times (only on rhel8 guest with gluster os
> disk)
> Reproducible is ~70% manually , from 14 created VMs(rhel8 guest) issue
> occurred only 10 times
> In automation each time I run TestCase18868 issue occurs(ran about ~7 times).
> 
> > - Does it happen with rhel 7.6 guest or only with rhel 7.7 guest?
> 
> 7.6 guest - Tried 8 times issue did not reproduce
> 7.7 guest -  Tried 8 times issue did not reproduce
> 8.1 guest with gluster os disk- Issue occurred 10/14(~ 70% rep ratio) times
> manually and each time running automation TestCase18868(which does create VM
> from rhel8 template +cold snapshot)
> 8.1 guest with FCP os disk- created 8 VM's issue did not occur once.
> 
> > - Does it happen only with rhel 7.7 host or also with rhel 7.6 hosts and 7.7
> > guest?
> Until now we worked only with rhel7.6(4.3.4) hosts with same ENVs/gluster 
> and the issue did not occur once. 
> If you are asking if we tested latest 4.3.5 with rhel7.6 hosts we did not as
> 4.3.5 is synced with RHEL7.7.
> 
> 
> > - Can you attach the vm journal? (journalctl -b) maybe start the same vm on
> > rhel 7.6
> >   if the issue happens only with 7.7.
> No as when the VM reaches this that you cannot connect via ssh and console
> is not responsive.
> 
> > - Can you add gluster logs from /var/log/gluster*/?
> Added
> > When we have more info we can ask the qemu team to look at this.
> 
> 
> Also more info:
> Also I saw the issue occurs on multiple enviroments.
> Also the issue occur with 2 different gluster clusters with 2 different
> versions:
> glusterfs 3.12.6  (TLV2 site) - out infra TierX runs are done here
> glusterfs 6.3      (Raanana site) - local RHV storage team

Could this be related to Bug 1701736? Can you try changing the UseNativeIOForGluster to false in vdc_options before the run?

Comment 6 Avihai 2019-06-25 08:45:02 UTC
(In reply to Sahina Bose from comment #5)
> (In reply to Avihai from comment #4)
> > (In reply to Nir Soffer from comment #2)
> > > Avihay, we need more data about this.
> > > 
> > > - can you reproduce this manually?
> > Yes indeed but about 70% of the times (only on rhel8 guest with gluster os
> > disk)
> > Reproducible is ~70% manually , from 14 created VMs(rhel8 guest) issue
> > occurred only 10 times
> > In automation each time I run TestCase18868 issue occurs(ran about ~7 times).
> > 
> > > - Does it happen with rhel 7.6 guest or only with rhel 7.7 guest?
> > 
> > 7.6 guest - Tried 8 times issue did not reproduce
> > 7.7 guest -  Tried 8 times issue did not reproduce
> > 8.1 guest with gluster os disk- Issue occurred 10/14(~ 70% rep ratio) times
> > manually and each time running automation TestCase18868(which does create VM
> > from rhel8 template +cold snapshot)
> > 8.1 guest with FCP os disk- created 8 VM's issue did not occur once.
> > 
> > > - Does it happen only with rhel 7.7 host or also with rhel 7.6 hosts and 7.7
> > > guest?
> > Until now we worked only with rhel7.6(4.3.4) hosts with same ENVs/gluster 
> > and the issue did not occur once. 
> > If you are asking if we tested latest 4.3.5 with rhel7.6 hosts we did not as
> > 4.3.5 is synced with RHEL7.7.
> > 
> > 
> > > - Can you attach the vm journal? (journalctl -b) maybe start the same vm on
> > > rhel 7.6
> > >   if the issue happens only with 7.7.
> > No as when the VM reaches this that you cannot connect via ssh and console
> > is not responsive.
> > 
> > > - Can you add gluster logs from /var/log/gluster*/?
> > Added
> > > When we have more info we can ask the qemu team to look at this.
> > 
> > 
> > Also more info:
> > Also I saw the issue occurs on multiple enviroments.
> > Also the issue occur with 2 different gluster clusters with 2 different
> > versions:
> > glusterfs 3.12.6  (TLV2 site) - out infra TierX runs are done here
> > glusterfs 6.3      (Raanana site) - local RHV storage team
> 
> Could this be related to Bug 1701736? Can you try changing the
> UseNativeIOForGluster to false in vdc_options before the run?
I tried but I cannot set it, please help. (host FQDN= hosted-engine-06.lab.eng.tlv2.redhat.com)

This is what I get:
[root@hosted-engine-06 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3
Error setting UseNativeIOForGluster's value. No such entry with version 4.3.

Comment 7 Avihai 2019-06-25 08:49:57 UTC
OK, I found how to add it according to Bug 1701736:

root@hosted-engine-06 ~]# vi /etc/ovirt-engine/engine-config/engine-config.properties

Add the following lines at the end:
+UseNativeIOForGluster.description=Access volumes on glusterfs with aio native insteat of thread
+UseNativeIOForGluster.type=Boolean


[root@hosted-engine-06 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3

[root@hosted-engine-06 ~]# engine-config -g UseNativeIOForGluster
UseNativeIOForGluster: false version: 4.1
UseNativeIOForGluster: true version: 4.2
UseNativeIOForGluster: False version: 4.3

[root@hosted-engine-06 ~]# systemctl restart ovirt-engine
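
Sahina's suggestion referred to vdc_options, so the effective value can also be read straight from the engine database. This is only a sketch; the database name ('engine') and access method may differ on a given setup:

# su - postgres -c "psql engine -c \"select option_name, option_value, version from vdc_options where option_name='UseNativeIOForGluster';\""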

Comment 8 Avihai 2019-06-25 12:48:49 UTC
So to answer Sahina's question :

> Could this be related to Bug 1701736? Can you try changing the
> UseNativeIOForGluster to false in vdc_options before the run?

Indeed it looks related.

What I did:
1) Once I turned UseNativeIOForGluster off, the issue was not seen anymore (tried on 16 VMs and none of them had the issue).
2) Once I returned to the original engine-config.properties (without a UseNativeIOForGluster value) I saw the issue again
   (created a pool of 8 VMs and saw the issue on 2 out of the 8 VMs).

At the moment engine-config.properties does not have a UseNativeIOForGluster entry and it has to be added manually.
Was the UseNativeIOForGluster value changed to be enabled in RHEL7.7, or is this an RHV 4.3.5 addition?


Another odd thing is that I see this issue only on a RHEL8 guest.
This issue is seen only with the following combination:

Host = RHEL7.7/RHV4.3.5
Guest = RHEL8.1
VM OS disk is on gluster (the version of the gluster storage does not matter)
UseNativeIOForGluster value does not exist in engine-config.properties and is turned on somehow.

Sahina, how do you want to proceed here?

Comment 9 Tal Nisan 2019-06-25 13:57:12 UTC
So I guess this should be marked as a duplicate then, shouldn't it?

Comment 10 Sahina Bose 2019-06-26 09:06:32 UTC
(In reply to Avihai from comment #8)
> So to answer Sahina's question :
> 
> > Could this be related to Bug 1701736? Can you try changing the
> > UseNativeIOForGluster to false in vdc_options before the run?
> 
> Indeed it looks related.
> 
> What I did:
> 1) Once I turned UseNativeIOForGluster off the issue was not seen anymore
> (tried on 16 VM's and non of them had the issue)
> 2) Once I returned to original engine-config.properties(without a
> UseNativeIOForGluster value) I saw the issue again 
>    (created pool of 8 VM's and saw the issue on 2 out of 8 VM's)
> 
> As the at moment engine-config.properties does not have a
> UseNativeIOForGluster and it should be added manually.
> Was UseNativeIOForGluster value was changed to be enabled at RHEL7.7 or is
> this a RHV4.3.5 addition?
> 
> 
> Another odd thing is that I see this issue only on RHEL8 guest.
> This isuue is seen only on the following combination:
> 
> Host=RHEL7.7/RHV4.3.5
> Guest = RHEL8.1
> VM OS disk is gluster (does not matter what version of the gluster storage
> is)
> UseNativeIOForGluster value does not exist in engine-config.properties and
> is turned on somehow.
> 
> Sahina, how do you want to proceed here?

Thanks Avihai for checking this.

We introduced aio=native due to the performance gains seen in testing (Bug 1630744).
One option is to revert to using aio=threads (which we will do, since corruption takes precedence over performance).

But there does seem to be an issue with aio=native which is seen based on the guest OS used - this also needs to be investigated. Perhaps change the component to aio to investigate?
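
For reference, this engine option ends up as the io attribute on the disk <driver> element in the domain XML that VDSM generates, so the effective setting can be checked per running VM on the host. A sketch with standard libvirt tooling (the domain name below is just an example taken from this bug):

# virsh dumpxml vm_TestCase18868_2416380014 | grep "driver name"

The driver line should show io='native' or io='threads' depending on the setting.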

Comment 11 Avihai 2019-06-26 09:39:02 UTC
(In reply to Sahina Bose from comment #10)
> (In reply to Avihai from comment #8)
> > So to answer Sahina's question :
> > 
> > > Could this be related to Bug 1701736? Can you try changing the
> > > UseNativeIOForGluster to false in vdc_options before the run?
> > 
> > Indeed it looks related.
> > 
> > What I did:
> > 1) Once I turned UseNativeIOForGluster off the issue was not seen anymore
> > (tried on 16 VM's and non of them had the issue)
> > 2) Once I returned to original engine-config.properties(without a
> > UseNativeIOForGluster value) I saw the issue again 
> >    (created pool of 8 VM's and saw the issue on 2 out of 8 VM's)
> > 
> > As the at moment engine-config.properties does not have a
> > UseNativeIOForGluster and it should be added manually.
> > Was UseNativeIOForGluster value was changed to be enabled at RHEL7.7 or is
> > this a RHV4.3.5 addition?
> > 
> > 
> > Another odd thing is that I see this issue only on RHEL8 guest.
> > This isuue is seen only on the following combination:
> > 
> > Host=RHEL7.7/RHV4.3.5
> > Guest = RHEL8.1
> > VM OS disk is gluster (does not matter what version of the gluster storage
> > is)
> > UseNativeIOForGluster value does not exist in engine-config.properties and
> > is turned on somehow.
> > 
> > Sahina, how do you want to proceed here?
> 
> Thanks Avihai for checking this.
> 
> We introduced aio=native due to the performance gains seen on test (Bug
> 1630744) 

This change was introduced in 4.2 and the issue was not seen until now; what changed in 4.3.5 to make it noticeable?

> One option is to revert to using aio=threads. (which we will do since
> corruption takes precedence over performance)
> 
> But there does seem to be an issue with aio=native which is seen based on
> the guest OS used - this also needs to be investigated. Perhaps change the
> component to aio to investigate?


I am not familiar with this component (it is not in the drop-down); can you please change it?

Comment 12 Sahina Bose 2019-06-27 09:11:50 UTC
(In reply to Avihai from comment #11)
> (In reply to Sahina Bose from comment #10)
> > (In reply to Avihai from comment #8)
> > > So to answer Sahina's question :
> > > 
> > > > Could this be related to Bug 1701736? Can you try changing the
> > > > UseNativeIOForGluster to false in vdc_options before the run?
> > > 
> > > Indeed it looks related.
> > > 
> > > What I did:
> > > 1) Once I turned UseNativeIOForGluster off the issue was not seen anymore
> > > (tried on 16 VM's and non of them had the issue)
> > > 2) Once I returned to original engine-config.properties(without a
> > > UseNativeIOForGluster value) I saw the issue again 
> > >    (created pool of 8 VM's and saw the issue on 2 out of 8 VM's)
> > > 
> > > As the at moment engine-config.properties does not have a
> > > UseNativeIOForGluster and it should be added manually.
> > > Was UseNativeIOForGluster value was changed to be enabled at RHEL7.7 or is
> > > this a RHV4.3.5 addition?
> > > 
> > > 
> > > Another odd thing is that I see this issue only on RHEL8 guest.
> > > This isuue is seen only on the following combination:
> > > 
> > > Host=RHEL7.7/RHV4.3.5
> > > Guest = RHEL8.1
> > > VM OS disk is gluster (does not matter what version of the gluster storage
> > > is)
> > > UseNativeIOForGluster value does not exist in engine-config.properties and
> > > is turned on somehow.
> > > 
> > > Sahina, how do you want to proceed here?
> > 
> > Thanks Avihai for checking this.
> > 
> > We introduced aio=native due to the performance gains seen on test (Bug
> > 1630744) 
> 
> This bug was introduced in 4.2 and the issue was not seen until now, what
> changed in 4.3.5 to make it noticeble?

The change was introduced in 4.2, and the bug is seen now. The only changes I can see are the guest OS, and RHEL 7.7 on the host.

> 
> > One option is to revert to using aio=threads. (which we will do since
> > corruption takes precedence over performance)
> > 
> > But there does seem to be an issue with aio=native which is seen based on
> > the guest OS used - this also needs to be investigated. Perhaps change the
> > component to aio to investigate?
> 
> 
> I am not familiar with this component(not in the drop-down), can you please
> change it?

Comment 19 CongLi 2019-07-04 10:20:54 UTC
Hi Avihai,

KVM QE could not reproduce it in QEMU env.

Is it possible for you to provide the QEMU CML (command line) when
'UseNativeIOForGluster' = False is set?
I would like to confirm that the option 'UseNativeIOForGluster'
corresponds to the option 'aio=native' on the QEMU side.

Thanks.

Comment 20 CongLi 2019-07-04 11:00:30 UTC
(In reply to CongLi from comment #19)
> Hi Avihai,
> 
> KVM QE could not reproduce it in QEMU env.
> 
> Is it possible for you to provide the QEMU CML when set 
> 'UseNativeIOForGluster' = False ?
> I would like to confirm the option 'UseNativeIOForGluster' 
> is corresponding to the option 'aio=native' in QEMU side.
> 

Sorry, I mean that 'UseNativeIOForGluster' = False corresponds
to the option 'aio=threads' on the QEMU side.

Thanks.

> Thanks.

Comment 21 Avihai 2019-07-04 14:36:18 UTC
(In reply to CongLi from comment #20)
> (In reply to CongLi from comment #19)
> > Hi Avihai,
> > 
> > KVM QE could not reproduce it in QEMU env.
> > 
> > Is it possible for you to provide the QEMU CML when set 
> > 'UseNativeIOForGluster' = False ?
> > I would like to confirm the option 'UseNativeIOForGluster' 
> > is corresponding to the option 'aio=native' in QEMU side.
> > 
> 
> Sorry, I mean 'UseNativeIOForGluster' = False is corresponding 
> to the option 'aio=threads' in QEMU side.
> 
> Thanks.
> 
> > Thanks.

I've set 'UseNativeIOForGluster' = False on this engine FQDN: "storage-ge-08.scl.lab.tlv.redhat.com".
But I need your help with providing the QEMU CML, can you please help?

details on how I set it:
[root@storage-ge-08 ~]# vim /etc/ovirt-engine/engine-config/engine-config.properties
Added the following lines at the end:
UseNativeIOForGluster.description=Access volumes on glusterfs with aio native insteat of thread
UseNativeIOForGluster.type=Boolean

[root@storage-ge-08 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3
[root@storage-ge-08 ~]# engine-config -g UseNativeIOForGluster
UseNativeIOForGluster: false version: 4.1
UseNativeIOForGluster: true version: 4.2
UseNativeIOForGluster: False version: 4.3
[root@storage-ge-08 ~]# systemctl restart ovirt-engine

Comment 22 CongLi 2019-07-05 01:25:45 UTC
(In reply to Avihai from comment #21)
> (In reply to CongLi from comment #20)
> > (In reply to CongLi from comment #19)
> > > Hi Avihai,
> > > 
> > > KVM QE could not reproduce it in QEMU env.
> > > 
> > > Is it possible for you to provide the QEMU CML when set 
> > > 'UseNativeIOForGluster' = False ?
> > > I would like to confirm the option 'UseNativeIOForGluster' 
> > > is corresponding to the option 'aio=native' in QEMU side.
> > > 
> > 
> > Sorry, I mean 'UseNativeIOForGluster' = False is corresponding 
> > to the option 'aio=threads' in QEMU side.
> > 
> > Thanks.
> > 
> > > Thanks.
> 
> I'v set 'UseNativeIOForGluster' = False at this engine FQDN
> "storage-ge-08.scl.lab.tlv.redhat.com".
> But I need your help on providing the QEMU CML, can you please help?

Could you please try '# ps aux | grep qemu' on your host / engine?

Thanks.

> 
> details on how I set it:
> [root@storage-ge-08 ~]# vim
> /etc/ovirt-engine/engine-config/engine-config.properties
> Added the following lines at the end:
> UseNativeIOForGluster.description=Access volumes on glusterfs with aio
> native insteat of thread
> UseNativeIOForGluster.type=Boolean
> 
> [root@storage-ge-08 ~]# engine-config -s UseNativeIOForGluster=False
> Please select a version:
> 1. 4.1
> 2. 4.2
> 3. 4.3
> 3
> [root@storage-ge-08 ~]# engine-config -g UseNativeIOForGluster
> UseNativeIOForGluster: false version: 4.1
> UseNativeIOForGluster: true version: 4.2
> UseNativeIOForGluster: False version: 4.3
> [root@storage-ge-08 ~]# systemctl restart ovirt-engine

Comment 23 Avihai 2019-07-07 12:19:31 UTC
(In reply to CongLi from comment #22)
> (In reply to Avihai from comment #21)
> > (In reply to CongLi from comment #20)
> > > (In reply to CongLi from comment #19)
> > > > Hi Avihai,
> > > > 
> > > > KVM QE could not reproduce it in QEMU env.
> > > > 
> > > > Is it possible for you to provide the QEMU CML when set 
> > > > 'UseNativeIOForGluster' = False ?
> > > > I would like to confirm the option 'UseNativeIOForGluster' 
> > > > is corresponding to the option 'aio=native' in QEMU side.
> > > > 
> > > 
> > > Sorry, I mean 'UseNativeIOForGluster' = False is corresponding 
> > > to the option 'aio=threads' in QEMU side.
> > > 
> > > Thanks.
> > > 
> > > > Thanks.
> > 
> > I'v set 'UseNativeIOForGluster' = False at this engine FQDN
> > "storage-ge-08.scl.lab.tlv.redhat.com".
> > But I need your help on providing the QEMU CML, can you please help?
> 
> Could you please help try '# ps aux | grep qemu' in your host / engine ?

Indeed I see aio=threads after setting UseNativeIOForGluster to false.

Engine output:
[root@storage-ge-08 ~]# ps aux | grep qemu
root      1176  0.0  0.0  44216  2564 ?        Ss   Jul03   0:36 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook
root     15741  0.0  0.0 112720   964 pts/0    S+   15:12   0:00 grep --color=auto qemu


Host output:
[root@storage-ge8-vdsm2 ~]# ps aux | grep qemu
root       725  0.0  0.0  44216  2436 ?        Ss   Jul03   1:14 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook
qemu     12943 13.0 19.6 1998592 760756 ?      Rl   Jul04 543:30 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-4-pool_vm_gluster-1/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid 13b09533-008c-4e40-acde-3efa7150e383 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=13b09533-008c-4e40-acde-3efa7150e383 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=32,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:48:37,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-5f9f4544-5e3a-4f0b-9c4b-f2b08bb8e9ba,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=ua-0669b3db-2ede-40e1-af66-42ed04cf035c,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-2f4d2630-0b13-4973-9c9d-309548498205,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-2f4d2630-0b13-4973-9c9d-309548498205,id=ua-2f4d2630-0b13-4973-9c9d-309548498205 -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/db3220fa-7af3-4220-a5e4-52e49086edb2/161ae4ac-f98f-47f1-a59a-63dc97531130,format=qcow2,if=none,id=drive-ua-db3220fa-7af3-4220-a5e4-52e49086edb2,serial=db3220fa-7af3-4220-a5e4-52e49086edb2,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-db3220fa-7af3-4220-a5e4-52e49086edb2,id=ua-db3220fa-7af3-4220-a5e4-52e49086edb2,bootindex=1,write-cache=on -netdev tap,fd=36,id=hostua-ced84a2f-971d-4766-914a-e0a1852ea21c,vhost=on,vhostfd=41 -device virtio-net-pci,host_mtu=1500,netdev=hostua-ced84a2f-971d-4766-914a-e0a1852ea21c,id=ua-ced84a2f-971d-4766-914a-e0a1852ea21c,mac=00:1a:4a:16:25:e9,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=42,server,nowait -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=43,server,nowait -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5900,tls-port=5901,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:2,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device qxl-vga,id=ua-5bca4eec-f416-45b3-a56b-291efea91a1c,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-3d0e9700-ad99-40c0-b15d-665b730373de,bus=pci.0,addr=0x4 -device 
hda-duplex,id=ua-3d0e9700-ad99-40c0-b15d-665b730373de-codec0,bus=ua-3d0e9700-ad99-40c0-b15d-665b730373de.0,cad=0 -device virtio-balloon-pci,id=ua-a0d9d265-ffc4-4a92-b6db-e4ae8fc59db6,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
qemu     12971 12.9 19.1 1936940 740856 ?      Rl   Jul04 540:09 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-pool_vm_gluster-2/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid e1059eb4-0526-43ef-8a27-62487d5b4588 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=e1059eb4-0526-43ef-8a27-62487d5b4588 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:48:38,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-70ab3713-14ce-47d6-a5fc-34312acedf3d,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3,id=ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3 -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/e97c1add-a000-4100-9045-40da96ee378b/4f12ef9d-a1f2-4772-815f-9b2c268acbe0,format=qcow2,if=none,id=drive-ua-e97c1add-a000-4100-9045-40da96ee378b,serial=e97c1add-a000-4100-9045-40da96ee378b,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-e97c1add-a000-4100-9045-40da96ee378b,id=ua-e97c1add-a000-4100-9045-40da96ee378b,bootindex=1,write-cache=on -netdev tap,fd=40,id=hostua-e554b0e4-7c8d-4263-9cea-3a54b34900db,vhost=on,vhostfd=32 -device virtio-net-pci,host_mtu=1500,netdev=hostua-e554b0e4-7c8d-4263-9cea-3a54b34900db,id=ua-e554b0e4-7c8d-4263-9cea-3a54b34900db,mac=00:1a:4a:16:25:ea,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=36,server,nowait -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=41,server,nowait -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5903,tls-port=5904,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:5,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device qxl-vga,id=ua-759d513c-80e2-4eb2-b9e4-7a40029f55b0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540,bus=pci.0,addr=0x4 -device 
hda-duplex,id=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540-codec0,bus=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540.0,cad=0 -device virtio-balloon-pci,id=ua-82435524-5918-4777-aeeb-17cc2fca4fa3,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
qemu     13370 11.8 14.6 1968764 569236 ?      Rl   Jul04 494:22 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-6,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-6-pool_vm_gluster-6/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid b35a0328-6340-46c3-a697-888a1eec74ac -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=b35a0328-6340-46c3-a697-888a1eec74ac -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=33,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:49:31,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-08d5b00d-83f4-4249-8c5d-7dad3ca97c2c,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ua-28053b06-9947-4b3e-a737-03a150e8809f,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-28053b06-9947-4b3e-a737-03a150e8809f,id=ua-28053b06-9947-4b3e-a737-03a150e8809f -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/fea12c3b-307d-4f3d-89fc-db0ac82d5b67/f4169dcc-c8a8-40fa-9eb9-61b65b6566be,format=qcow2,if=none,id=drive-ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,serial=fea12c3b-307d-4f3d-89fc-db0ac82d5b67,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,id=ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,bootindex=1,write-cache=on -netdev tap,fd=35,id=hostua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,vhost=on,vhostfd=36 -device virtio-net-pci,host_mtu=1500,netdev=hostua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,id=ua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,mac=00:1a:4a:16:25:ee,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=37,server,nowait -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=38,server,nowait -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5906,tls-port=5907,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:8,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device qxl-vga,id=ua-1b09afab-447d-45da-aa2c-05dd017663fa,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a,bus=pci.0,addr=0x4 -device 
hda-duplex,id=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a-codec0,bus=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a.0,cad=0 -device virtio-balloon-pci,id=ua-2d0a78cb-7d26-4b96-811e-e61b6a687c2b,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on





> Thanks.
> 
> > 
> > details on how I set it:
> > [root@storage-ge-08 ~]# vim
> > /etc/ovirt-engine/engine-config/engine-config.properties
> > Added the following lines at the end:
> > UseNativeIOForGluster.description=Access volumes on glusterfs with aio
> > native insteat of thread
> > UseNativeIOForGluster.type=Boolean
> > 
> > [root@storage-ge-08 ~]# engine-config -s UseNativeIOForGluster=False
> > Please select a version:
> > 1. 4.1
> > 2. 4.2
> > 3. 4.3
> > 3
> > [root@storage-ge-08 ~]# engine-config -g UseNativeIOForGluster
> > UseNativeIOForGluster: false version: 4.1
> > UseNativeIOForGluster: true version: 4.2
> > UseNativeIOForGluster: False version: 4.3
> > [root@storage-ge-08 ~]# systemctl restart ovirt-engine
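
For readers scanning the long command lines above, a quick way to pull out only the per-VM aio= setting on the host (a sketch using the same ps/grep tools already used in this comment) is:

# ps aux | grep '[q]emu-kvm' | grep -o 'aio=[a-z]*'

which, for the host output above, prints:

aio=threads
aio=threads
aio=threads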

Comment 24 CongLi 2019-07-08 07:58:45 UTC
Thanks Avihai.

Could you please also help provide the steps and script of TestCase18868 ?

Thanks.

Comment 25 Avihai 2019-07-10 08:51:14 UTC
(In reply to CongLi from comment #24)
> Thanks Avihai.
> 
> Could you please also help provide the steps and script of TestCase18868 ?
> 
> Thanks.

All the tests mentioned in the QA whiteboard (TestCase18868 included) do the same simple thing: start a VM whose RHEL8 OS disk resides on a gluster storage domain, then wait for an IP and try to SSH, which fails when the issue occurs.

We use a huge framework called ART, which uses REST API calls and data structures for the tests, so there is no simple script I can supply unless you are already working with this framework.

However, this was reproduced manually many times; not on all VMs, but on 2 out of 8 VMs or more:

1) Create a template based on a RHEL8 OS disk on a gluster storage domain.
2) Create many VMs (as many as possible) from that template.
3) Start the VMs and wait for an IP.
4) Once a VM gets an IP, try to SSH => fails.

The VM starts and gets an IP, but right afterward you see this error in the console. From that point the IP is still reachable, ssh fails, the VM is 'UP', but you cannot do anything with it and the console is also stuck.
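
Since the only externally visible symptom is ssh failing while ping still works, a quick loop over the started VMs helps spot the affected ones. A sketch only; vm_ips.txt is a hypothetical file holding the IPs reported by the engine:

# for ip in $(cat vm_ips.txt); do ssh -o BatchMode=yes -o ConnectTimeout=5 root@$ip true && echo "$ip ssh OK" || echo "$ip ssh FAILED"; done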

Comment 26 Stefano Garzarella 2019-08-05 12:59:41 UTC
(In reply to CongLi from comment #19)
> Hi Avihai,
> 
> KVM QE could not reproduce it in QEMU env.
> 
> Is it possible for you to provide the QEMU CML when set 
> 'UseNativeIOForGluster' = False ?
> I would like to confirm the option 'UseNativeIOForGluster' 
> is corresponding to the option 'aio=native' in QEMU side.
> 
> Thanks.

Hi CongLi,

I'm starting to work on this BZ.
Are you able to reproduce it in the QEMU env?

Thanks,
Stefano

Comment 28 qing.wang 2019-08-06 10:21:58 UTC
I tried the following command:

/usr/libexec/qemu-kvm   \
    -name  "guest-rhel8.0-2"    \
    -machine  pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off   \
    -cpu Westmere \
    -m 6144 \
    -realtime mlock=off \
    -uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
    -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
    -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
    -nodefaults   \
    -vga  qxl \
    -object iothread,id=iothread0 \
    -drive file=/mnt/gluster/rhel810-64-virtio2.qcow2,format=qcow2,if=none,id=drive-ua-1,serial=1,werror=stop,rerror=stop,cache=none,aio=native \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-ua-1,id=ua-1,bootindex=1,write-cache=on \
    -vnc :2 \
    -monitor  stdio \
    -device virtio-net-pci,mac=9a:b5:b6:a1:b2:c2,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pci.0,addr=0x9      \
    -netdev tap,id=idxgXAlm,vhost=on \
    -qmp tcp:localhost:5952,server,nowait  \
    -chardev file,path=/home/serial2.log,id=serial_id_serial0 \
    -device isa-serial,chardev=serial_id_serial0  \


/mnt/gluster/rhel810-64-virtio2.qcow2 is my guest image. I cannot reproduce this issue.

Hi Stefano, could you please help run the above command on your VDSM node?
You need to replace "/mnt/gluster/rhel810-64-virtio2.qcow2" with a guest image that is located on your Gluster FS.

Could you please share your guest image if you can reproduce this issue with this command?

Comment 29 Stefano Garzarella 2019-08-06 10:37:14 UTC
(In reply to qing.wang from comment #28)
> 
> Hi Stefano, could you please help run the above command on your VDSM node?
> You need to replace "/mnt/gluster/rhel810-64-virtio2.qcow2" with a guest image
> that is located on your Gluster FS.

Hi Qing,
I'm not able to either. I started QEMU with a very similar command line, but it works fine.

Maybe Avihai can help us because I don't have access to the VDSM node.

Comment 30 qing.wang 2019-08-07 01:36:22 UTC
It looks like this issue is related to the oVirt environment.
Avihai said the new sanlock-3.7.3-1.el7.x86_64 is involved; it may have introduced a regression.
I suggest checking the sanlock log when the problem happens.

Is it possible to roll back to the previous version of the sanlock package and test it again?

Comment 32 Avihai 2019-08-11 14:52:13 UTC
Guys,

The ENV is there for you to use/debug - Please use it ASAP to extract what you need.

It took me a while to reproduce this issue, but I did; it is hard to reproduce.

Out of 12 VMs, only 3 VMs have the issue (see print screen attached):
VM name= rhel8_nativeaio-3
VM name= rhel8_nativeaio-6
VM name= rhel8_nativeaio-10

Comment 33 Avihai 2019-08-11 14:53:07 UTC
Created attachment 1602624 [details]
print screen of the issue 3 VM's out of 12 has the issue

Comment 34 qing.wang 2019-08-12 11:11:06 UTC
My test steps:

1. Shut down rhel8_nativeaio-1 through rhel8_nativeaio-8 on ovirt-engine.
2. Create guest scripts (vm1.sh - vm8.sh) with a qemu-kvm command like:

file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/6e798b8b-ef87-4b13-9ef1-43726b79f724/6c7c2f45-a756-48e5-bc73-41955cb8d629
mac=00:1a:4a:16:88:45
idx=1

/usr/libexec/qemu-kvm   \
    -name  "guest-rhel8.0-${idx}"    \
    -machine  pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off   \
    -cpu Westmere \
    -m 6144 \
    -realtime mlock=off \
    -uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
    -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
    -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
    -nodefaults   \
    -vga  qxl \
    -object iothread,id=iothread0 \
    -drive file=${file},format=qcow2,if=none,id=drive-ua-1,serial=1,werror=stop,rerror=stop,cache=none,aio=native \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-ua-1,id=ua-1,bootindex=1,write-cache=on \
    -vnc :2${idx} \
    -monitor  stdio \
    -device virtio-net-pci,mac=${mac},id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pci.0,addr=0x9      \
    -netdev tap,id=idxgXAlm,vhost=on \
    -qmp tcp:localhost:595${idx},server,nowait  \
    -chardev file,path=/home/serial${idx}.log,id=serial_id_serial0 \
    -device isa-serial,chardev=serial_id_serial0  \
    -device vmcoreinfo \

In each vmN.sh, the image file and MAC address match the corresponding oVirt guest VM (rhel8_nativeaio-1 through -8).

(those scripts are located in /root/test/ on lynx25.lab.eng.tlv2.redhat.com and lynx26.lab.eng.tlv2.redhat.com)

3. Run vm1.sh, vm2.sh, vm3.sh, vm4.sh on lynx25.lab.eng.tlv2.redhat.com
   and vm5.sh, vm6.sh, vm7.sh, vm8.sh on lynx26.lab.eng.tlv2.redhat.com.

   (set the default NIC to DHCP in the guest VMs)

4. Wait for the guest VMs to start and get an IP, then wait 20-30 minutes.

5. Check the remote consoles to see if there is any issue on the guests.

6. Power off vm1-vm8 and repeat steps 3-5.
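
After step 5, a quick way to scan all the serial logs at once (just a sketch; the grep pattern is a guess at the relevant guest kernel messages, not taken from actual logs):

# List any serial log that contains suspicious guest kernel messages;
# /home/serialN.log matches the -chardev paths used in the scripts above.
grep -il 'corruption\|call trace' /home/serial*.log \
    || echo "no suspicious messages found in the serial logs"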


===================================================================

I ran the above test and did not find the issue. vm1-vm8 are still running on lynx25.lab.eng.tlv2.redhat.com and lynx26.lab.eng.tlv2.redhat.com.
We may wait until tomorrow to see if any problem shows up after running for a long time.


Hostnames of vm1-vm8: vm-17-69 to vm-17-76

Comment 35 Avihai 2019-08-12 12:20:23 UTC
This is a shared stand, and other teams need to run their tests on it by tomorrow, so please debug as needed until then.

The issue was already present on 3 VMs (see below); were you able to see what is special about them?
VM name= rhel8_nativeaio-3
VM name= rhel8_nativeaio-6
VM name= rhel8_nativeaio-10

Comment 36 qing.wang 2019-08-13 07:48:18 UTC
(In reply to Avihai from comment #35)
> This is a shared stand, and other teams need to run their tests on it by
> tomorrow, so please debug as needed until then.
> 
> The issue was already present on 3 VMs (see below); were you able to see what
> is special about them?
> VM name= rhel8_nativeaio-3
> VM name= rhel8_nativeaio-6
> VM name= rhel8_nativeaio-10

I have shut down the above VMs because my VMs shared their images in my qemu command-line testing.

But I cannot reproduce it.

After some investigation, I suspect this is not a valid setup in the oVirt testing: setting aio=native when creating a VM based on a template.

For more detail, please refer to https://www.redhat.com/archives/libvirt-users/2017-January/msg00025.html.

I picked one VM, rhel8_nativeaio-12, to illustrate.

disk info:

<disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
      <source file='/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore type='file' index='1'>
        <format type='qcow2'/>
        <source file='/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/d5d90f0a-c496-4946-95fd-41e17abac359'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <serial>e5508a87-85c0-44cc-94fd-19ebc974c9a8</serial>
      <boot order='1'/>
      <alias name='ua-e5508a87-85c0-44cc-94fd-19ebc974c9a8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
 </disk>


This VM used one image /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc 


[root@lynx26 test]# qemu-img info /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc
image: /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 121M
cluster_size: 65536
backing file: d5d90f0a-c496-4946-95fd-41e17abac359 (actual path: /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/d5d90f0a-c496-4946-95fd-41e17abac359)
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

This indicates that the image is not preallocated, so aio=native is not suitable for it.
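
A quick way to check this programmatically, as a rough sketch (the image path is a placeholder; the field names come from qemu-img's JSON output):

# Compare the allocated size of a qcow2 image with its virtual size to see
# how sparse it is.
img=/path/to/image.qcow2    # placeholder
qemu-img info --output=json "$img" | python -c '
import json, sys
info = json.load(sys.stdin)
print("allocated %d of %d bytes (%.1f%%)" % (
    info["actual-size"], info["virtual-size"],
    100.0 * info["actual-size"] / info["virtual-size"]))
'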

@Stefano Please help to confirm it

Comment 37 Stefano Garzarella 2019-08-26 09:07:42 UTC
(In reply to Avihai from comment #32)
> Guys,
> 
> The ENV is there for you to use/debug - Please use it ASAP to extract what
> you need.
> 
> It took me a while to reproduce this issue, but I did; it is hard to
> reproduce.
> 
> Out of 12 VMs, only 3 VMs have the issue (see print screen attached):
> VM name= rhel8_nativeaio-3
> VM name= rhel8_nativeaio-6
> VM name= rhel8_nativeaio-10

Sorry for the late response, but I was on PTO.

I think that the VMs are not running anymore.
Please, can you try to reproduce the issue again?

Comment 38 Stefano Garzarella 2019-08-26 09:36:38 UTC
(In reply to qing.wang from comment #36)
> (In reply to Avihai from comment #35)
> > This is a shared stand, and other teams need to run their tests on it by
> > tomorrow, so please debug as needed until then.
> > 
> > The issue was already present on 3 VMs (see below); were you able to see what
> > is special about them?
> > VM name= rhel8_nativeaio-3
> > VM name= rhel8_nativeaio-6
> > VM name= rhel8_nativeaio-10
> 
> I have shut down the above VMs because my VMs shared their images in my qemu
> command-line testing.
> 
> But I cannot reproduce it.
> 
> After some investigation, I suspect this is not a valid setup in the oVirt
> testing: setting aio=native when creating a VM based on a template.

Good catch!

> 
> For more detail, please refer to
> https://www.redhat.com/archives/libvirt-users/2017-January/msg00025.html.
> 
[...] 
> 
> @Stefano Please help to confirm it

Right, it is not recommended, and here are some additional details:
- https://access.redhat.com/articles/41313
  "Specifically, if qemu-kvm is used with the aio=native IO mode over a sparse device image hosted on the ext4 or xfs filesystem, guest filesystem corruption will occur if partitions are not aligned with the host filesystem block size"
- https://drive.google.com/file/d/0B44EcgFDZNtXSnhKbEZfNE1ad28 [Slide 61/66]
  "Native AIO can block the VM if the file is not fully allocated and is therefore not recommended for use on sparse files"
  "Writes to sparsely allocated files are more likely to block than fully preallocated files. Therefore it is recommended to only use aio=native on fully preallocated files, local disks, or logical volumes."

So, if we want to use aio=native, maybe we should provide a fully preallocated image.
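
For example, something along these lines could produce such an image (a sketch only; paths and size are placeholders, and choosing falloc vs. full trades creation time against guaranteed allocation):

# Create a new, fully preallocated qcow2 image...
qemu-img create -f qcow2 -o preallocation=full /path/to/new.qcow2 10G

# ...or convert an existing sparse image into a preallocated copy.
qemu-img convert -O qcow2 -o preallocation=falloc /path/to/sparse.qcow2 /path/to/prealloc.qcow2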

Comment 41 Marina Kalinin 2019-08-29 16:01:51 UTC
Suggestion for Insights Rule: if qemu-kvm is used with the aio=native IO mode over a sparse device image hosted on the ext4 or xfs filesystem, give a warning that this may cause guest filesystem corruption if partitions are not aligned with the host filesystem block size. For details, see https://access.redhat.com/articles/41313 and comment 23 above.
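
A rough sketch of what such a check could look like on a host (the 50% allocation threshold is an arbitrary illustration, not part of the proposed rule):

# Flag running domains that combine io='native' with a thinly allocated image.
for dom in $(virsh list --name); do
    virsh dumpxml "$dom" | grep -q "io='native'" || continue
    virsh domblklist "$dom" | awk 'NR>2 && $2 ~ /^\// {print $2}' | while read -r img; do
        read -r actual virtual <<< "$(qemu-img info --output=json "$img" |
            python -c 'import json,sys; i=json.load(sys.stdin); print("%d %d" % (i["actual-size"], i["virtual-size"]))')"
        if [ $((actual * 2)) -lt "$virtual" ]; then
            echo "WARNING: $dom uses aio=native on mostly sparse image $img"
        fi
    done
done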

Comment 42 Avihai 2019-09-02 11:57:56 UTC
This is an old issue that was triggered when the RHHI/Gluster default was set to aio=native in RHV 4.3.5, failing our regression tests (bug 1701736).

I think this issue is no longer urgent, as the default was set to aio=threads for Gluster/RHHI storage in bug 1701736.
Since that fix was made, this issue is no longer seen in our automation regressions.

Lowering severity to high; please raise it if you feel otherwise.

Comment 43 Stefano Garzarella 2019-09-02 12:56:08 UTC
Avihai,
do you have any VM where you can replicate this issue?

Comment 45 qing.wang 2019-09-04 01:33:15 UTC
At which level should we fix this issue? If we follow the Insights rule mentioned in comment 41, I think it is an issue in the upper application (oVirt), not a QEMU issue, right?

Comment 46 Stefano Garzarella 2019-09-05 09:40:14 UTC
(In reply to qing.wang from comment #45)
> At which level should we fix this issue? If we follow the Insights rule
> mentioned in comment 41, I think it is an issue in the upper application
> (oVirt), not a QEMU issue, right?

We should understand better why we get the corruption when we use aio=native.
It could be an alignment problem, or another issue in XFS, QEMU, the FUSE driver, or Gluster; maybe aio=native just changes the timing and brings out a hidden bug.
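
If the alignment theory is worth pursuing, a couple of host-side data points could be collected with something like this (both paths are placeholders for the actual Gluster mount and brick mount point):

# Fundamental block size of the Gluster mount as QEMU sees it, and the XFS
# geometry of the underlying brick on a Gluster node.
mnt=/rhev/data-center/mnt/glusterSD/server:_volume   # placeholder mount point
brick=/gluster/brick1                                # placeholder brick mount point
stat -f -c 'fundamental block size of %n: %S bytes' "$mnt"
xfs_info "$brick"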

Comment 47 Avihai 2019-09-15 17:30:23 UTC
(In reply to Stefano Garzarella from comment #43)
> Avihai,
> do you have any VM where you can replicate this issue?

Hi Stefano,
I reproduced the issue multiple times and left the environment for dev debugging, and I can do it again if that will help.
As this configuration does not seem to be supported for now, and the default was set to aio=threads for Gluster/RHHI storage in bug 1701736, is this still needed?

Comment 48 Stefano Garzarella 2019-09-18 10:46:18 UTC
(In reply to Avihai from comment #47)
> (In reply to Stefano Garzarella from comment #43)
> > Avihai,
> > do you have any VM where you can replicate this issue?
> 
> Hi Stefano,
> I reproduced the issue multiple times and left the environment for dev debugging,
> and I can do it again if that will help.

Thanks! I'll ping you on IRC when I start working on it.

> As this configuration does not seem to be supported for now, and the default was
> set to aio=threads for Gluster/RHHI storage in bug 1701736, is this still needed?

Maybe we can reduce the priority and severity, but I think it would still be useful to solve this issue.

Comment 49 Ademar Reis 2019-09-24 15:19:40 UTC
Deferring it to RHEL8-AV. Worth fixing, but not urgent and not a regression (see previous comments).

Comment 50 Ademar Reis 2020-02-05 22:59:36 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 53 Marina Kalinin 2020-11-12 21:31:30 UTC
Since this issue is not relevant to RHV anymore (due to changing defaults to aio=threads when working with glusterfs) and since it is so hard to reproduce, I propose to close this bug as WONTFIX or DEFERRED.

Comment 54 John Ferlan 2021-09-08 19:15:19 UTC
Closing based on comment 53

