Created attachment 1584111 [details]
All logs and print screen of the issue

Description of problem:
I started seeing this on several TCs in 4.3.5 Tier1 runs, and only when those TCs were running on GlusterFS. The scenario is very basic, which makes this more alarming in severity, and it happens ONLY with GlusterFS (the same TCs run on iSCSI/FCP/NFS and the issue does not occur).

1) Create VM from template
2) Start VM

Result:
The VM starts and gets an IP, but right afterward you see this error in the console. From that point the IP is still reachable, ssh fails, the VM is 'UP', but you cannot do anything with it and the console is stuck.

In the console you see:
XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify

Every attempt to ssh to the VM fails with a core dump shown in the console (see print screen).

What is different from the last run that passed:
1) RHEL7.7 (instead of RHEL7.6) hosts
2) rhv-4.3.5-2 Engine/VDSM/libvirt versions [1]
3) New sanlock-3.7.3-1.el7.x86_64

What did not change:
1) The same ENV machines (happened on several ENVs)
2) Same gluster cluster
3) Same exact test

https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.3-ge-runner-tier1/85/

Stand with the issue you can look around:
hosted-engine-06.lab.eng.tlv2.redhat.com

Details:
The issue reproduces running TestCase18868 on GlusterFS, simply creating a VM from a rhel8 image template.

Impact: ping to the VM works, but ssh fails with 'Authentication failed.', so you cannot use the VM.
The issue is seen only on gluster SDs (!), on several TCs.

[root@lynx22 vdsm]# ssh root.17.164
Authentication failed.

At the VM console you see (print screen attached):
XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify

It looks like the first 128 bytes of the corrupted metadata buffer are zeroed.
Engine:

2019-06-24 16:44:56,327+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-28) [vms_syncAction_5bd75391-29cf-4c1d] EVENT_ID: USER_STARTED_VM(153), VM vm_TestCase18868_2416380014 was started by admin@internal-authz (Host: host_mixed_1)
2019-06-24 16:44:58,448+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [] EVENT_ID: VM_CONSOLE_DISCONNECTED(168), User <UNKNOWN> got disconnected from VM vm_TestCase18868_2416380014.
2019-06-24 16:45:58,573+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [] EVENT_ID: USER_RUN_VM(32), VM vm_TestCase18868_2416380014 started on Host host_mixed_1

VDSM:

2019-06-24 16:44:27,740+0300 WARN (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/dbfc4b9a-74bf-4c21-95c6-0840743fd57a.ovirt-guest-agent.0 already removed (fileutils:54)
2019-06-24 16:44:27,741+0300 WARN (qgapoller/3) [virt.periodic.VmDispatcher] could not run <function <lambda> at 0x7fe86c1a19b0> on ['dbfc4b9a-74bf-4c21-95c6-0840743fd57a'] (periodic:289)
2019-06-24 16:44:27,746+0300 WARN (libvirt/events) [root] File: /var/lib/libvirt/qemu/channels/dbfc4b9a-74bf-4c21-95c6-0840743fd57a.org.qemu.guest_agent.0 already removed (fileutils:54)
2019-06-24 16:45:12,051+0300 WARN (periodic/4) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.BlockjobMonitor'> on ['dbfc4b9a-74bf-4c21-95c6-0840743fd57a'] (periodic:289)

Libvirt log:

2019-06-24 13:44:26.372+0000: 31609: error : qemuMonitorIO:718 : internal error: End of file from qemu monitor
2019-06-24 13:44:26.594+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain
2019-06-24 13:44:26.603+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-out -m physdev --physdev-out vnet0 -g FP-vnet0' failed: iptables v1.4.21: goto 'FP-vnet0' is not a chain
2019-06-24 13:44:26.611+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-in -m physdev --physdev-in vnet0 -g FJ-vnet0' failed: iptables v1.4.21: goto 'FJ-vnet0' is not a chain
2019-06-24 13:44:26.619+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -D libvirt-host-in -m physdev --physdev-in vnet0 -g HJ-vnet0' failed: iptables v1.4.21: goto 'HJ-vnet0' is not a chain
2019-06-24 13:44:26.626+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F FP-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.634+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X FP-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.641+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F FJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.648+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X FJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.656+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -F HJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.664+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/iptables -w10 -w -X HJ-vnet0' failed: iptables: No chain/target/match by that name.
2019-06-24 13:44:26.671+0000: 15326: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ip6tables -w10 -w -D libvirt-out -m physdev --physdev-is-bridged --physdev-out vnet0 -g FP-vnet0' failed: ip6tables v1.4.21: goto 'FP-vnet0' is not a chain
2019-06-24 13:44:59.554+0000: 31611: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F I-vnet0-arp-mac' failed: Chain 'I-vnet0-arp-mac' doesn't exist.
2019-06-24 13:44:59.563+0000: 31611: info : virDBusCall:1558 : DBUS_METHOD_ERROR: 'org.fedoraproject.FirewallD1.direct.passthrough' on '/org/fedoraproject/FirewallD1' at 'org.fedoraproject.FirewallD1' error org.fedoraproject.FirewallD1.Exception: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X I-vnet0-arp-mac' failed: Chain 'I-vnet0-arp-mac' doesn't exist.
2019-06-24 13:45:07.357+0000: 31613: error : qemuDomainAgentAvailable:9131 : Guest agent is not responding: QEMU guest agent is not connected
2019-06-24 13:46:07.363+0000: 31611: error : virNetClientProgramDispatchError:174 : Cannot open log file: '/var/log/libvirt/qemu/vm_TestCase18868_2416380014.log': Device or resource busy

qemu log (vm_TestCase18868_2416380014.log):

2019-06-24T13:44:26.302416Z qemu-kvm: terminating on signal 15 from pid 31609 (<unknown process>)
red_channel_client_disconnect: rcc=0x55b8aa5961b0 (channel=0x55b8a9db42d0 type=5 id=0)
red_channel_client_disconnect: rcc=0x55b8aade01b0 (channel=0x55b8a9db4390 type=6 id=0)
2019-06-24 13:44:26.574+0000: shutting down, reason=shutdown
2019-06-24 13:44:59.896+0000: starting up libvirt version: 4.5.0, package: 22.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2019-06-12-08:32:19, x86-037.build.eng.bos.redhat.com), qemu version: 2.12.0qemu-kvm-rhev-2.12.0-33.el7, kernel: 3.10.0-1057.el7.x86_64, hostname: lynx22.lab.eng.tlv2.redhat.com
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
QEMU_AUDIO_DRV=spice \
/usr/libexec/qemu-kvm \
-name guest=vm_TestCase18868_2416380014,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-vm_TestCase18868_241/master-key.aes \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu Westmere \
-m size=1048576k,slots=16,maxmem=4194304k \
-realtime mlock=off \
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
-numa node,nodeid=0,cpus=0,mem=1024 \
-uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
-smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=31,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=2019-06-24T13:44:58,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global PIIX4_PM.disable_s3=1 \
-global PIIX4_PM.disable_s4=1 \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device virtio-scsi-pci,id=ua-fbe4dc8f-50eb-4533-9c62-37b521706ac7,bus=pci.0,addr=0x6 \
-device virtio-serial-pci,id=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7,max_ports=16,bus=pci.0,addr=0x5 \
-drive if=none,id=drive-ua-5ae09403-2dae-4685-97ed-38838d055fa0,werror=report,rerror=report,readonly=on \
-device ide-cd,bus=ide.1,unit=0,drive=drive-ua-5ae09403-2dae-4685-97ed-38838d055fa0,id=ua-5ae09403-2dae-4685-97ed-38838d055fa0 \
-drive file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__he6__volume01/4eea07cd-bab6-4fc3-a7ce-beed72ac0f1c/images/022b31a1-5a1d-470b-b269-090a03641caa/25e111ec-ebed-491c-b9e8-b32845f87053,format=qcow2,if=none,id=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,serial=022b31a1-5a1d-470b-b269-090a03641caa,werror=stop,rerror=stop,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,id=ua-022b31a1-5a1d-470b-b269-090a03641caa,bootindex=1,write-cache=on \
-netdev tap,fd=33,id=hostua-488d441f-1215-4525-a87c-e29f336ad555,vhost=on,vhostfd=34 \
-device virtio-net-pci,host_mtu=1500,netdev=hostua-488d441f-1215-4525-a87c-e29f336ad555,id=ua-488d441f-1215-4525-a87c-e29f336ad555,mac=00:1a:4a:16:88:a4,bus=pci.0,addr=0x3 \
-chardev socket,id=charchannel0,fd=35,server,nowait \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 \
-chardev socket,id=charchannel1,fd=36,server,nowait \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 \
-chardev spicevmc,id=charchannel2,name=vdagent \
-device virtserialport,bus=ua-33adb81c-95a7-40fc-9a3b-88fd227724f7.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 \
-spice port=5900,tls-port=5901,addr=10.46.16.37,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on \
-device qxl-vga,id=ua-2d6fbf5d-69ab-40f2-ac17-6036ccb4e3b6,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 \
-device intel-hda,id=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5,bus=pci.0,addr=0x4 \
-device hda-duplex,id=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5-codec0,bus=ua-f571026d-9805-4506-92fd-8bdb4ad1e2e5.0,cad=0 \
-device virtio-balloon-pci,id=ua-9a2da065-2312-4d42-b59e-b0b591b2c77d,bus=pci.0,addr=0x8 \
-object rng-random,id=objua-dfdb11b2-5569-49b4-a375-729326cd8ab4,filename=/dev/urandom \
-device virtio-rng-pci,rng=objua-dfdb11b2-5569-49b4-a375-729326cd8ab4,id=ua-dfdb11b2-5569-49b4-a375-729326cd8ab4,bus=pci.0,addr=0x9 \
-device vmcoreinfo \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2019-06-24 13:44:59.917+0000: 27700: info : virObjectUnref:344 : OBJECT_UNREF: obj=0x7f037011ede0
2019-06-24T13:45:00.075797Z qemu-kvm: -drive file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__he6__volume01/4eea07cd-bab6-4fc3-a7ce-beed72ac0f1c/images/022b31a1-5a1d-470b-b269-090a03641caa/25e111ec-ebed-491c-b9e8-b32845f87053,format=qcow2,if=none,id=drive-ua-022b31a1-5a1d-470b-b269-090a03641caa,serial=022b31a1-5a1d-470b-b269-090a03641caa,werror=stop,rerror=stop,cache=none,aio=native: 'serial' is deprecated, please use the corresponding option of '-device' instead
Spice-Message: 16:45:00.217: setting TLS option 'CipherString' to 'kECDHE+FIPS:kDHE+FIPS:kRSA+FIPS:!eNULL:!aNULL' from /etc/pki/tls/spice.cnf configuration file
2019-06-24T13:45:00.226064Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: CPU 1 [socket-id: 1, core-id: 0, thread-id: 0], CPU 2 [socket-id: 2, core-id: 0, thread-id: 0], CPU 3 [socket-id: 3, core-id: 0, thread-id: 0], CPU 4 [socket-id: 4, core-id: 0, thread-id: 0], CPU 5 [socket-id: 5, core-id: 0, thread-id: 0], CPU 6 [socket-id: 6, core-id: 0, thread-id: 0], CPU 7 [socket-id: 7, core-id: 0, thread-id: 0], CPU 8 [socket-id: 8, core-id: 0, thread-id: 0], CPU 9 [socket-id: 9, core-id: 0, thread-id: 0], CPU 10 [socket-id: 10, core-id: 0, thread-id: 0], CPU 11 [socket-id: 11, core-id: 0, thread-id: 0], CPU 12 [socket-id: 12, core-id: 0, thread-id: 0], CPU 13 [socket-id: 13, core-id: 0, thread-id: 0], CPU 14 [socket-id: 14, core-id: 0, thread-id: 0], CPU 15 [socket-id: 15, core-id: 0, thread-id: 0]
2019-06-24T13:45:00.226096Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
main_channel_link: add main channel client
main_channel_client_handle_pong: net test: latency 4.062000 ms, bitrate 79161996 bps (75.494762 Mbps)
inputs_connect: inputs channel client create
red_qxl_set_cursor_peer:

Version-Release number of selected component (if applicable):
Engine: ovirt-engine-4.3.5.1-0.1.el7.noarch
Host:
vdsm            vdsm-4.30.20-1.el7ev.x86_64
libvirt         libvirt-4.5.0-22.el7.x86_64
qemu-img-rhev   qemu-img-rhev-2.12.0-33.el7.x86_64
glusterfs       glusterfs-3.12.2-47.2.el7.x86_64
sanlock         sanlock-3.7.3-1.el7.x86_64
redhat_release  Red Hat Enterprise Linux Server 7.7 Beta (Maipo)

How reproducible:
Almost 100%

Steps to Reproduce:
1) Create VM from template
2) Start VM

Actual results:
The VM starts and is in the 'UP' state; you can log in to the VM and it gets an IP, but a minute or two later you see the following message in the console:
"XFS(dm-0): Metadata corruption detected at xfs_buf_ioend+0x58/0x1e0 [xfs], xfs_inode block 0x240 xfs_inode_bug_verify"
From that point you can only ping the VM; SSH fails with an authentication error and the console is stuck, so you cannot use the VM anymore.

Expected results:

Additional info:
What is different from the last run that passed:
1) RHEL7.7 (instead of RHEL7.6) hosts
2) rhv-4.3.5-2 Engine/VDSM/libvirt versions [1]
3) New sanlock-3.7.3-1.el7.x86_64

What did not change:
1) The same ENV machines (happened on several ENVs)
2) Same gluster cluster
3) Same exact test
Created attachment 1584165 [details]
vm xml extracted from vdsm.log
Avihay, we need more data about this.

- Can you reproduce this manually?
- Does it happen with a rhel 7.6 guest or only with a rhel 7.7 guest?
- Does it happen only with a rhel 7.7 host, or also with rhel 7.6 hosts and a 7.7 guest?
- Can you attach the vm journal? (journalctl -b) Maybe start the same vm on rhel 7.6 if the issue happens only with 7.7.
- Can you add gluster logs from /var/log/gluster*/?

When we have more info we can ask the qemu team to look at this.
Created attachment 1584203 [details]
gluster logs from the host
(In reply to Nir Soffer from comment #2)
> Avihay, we need more data about this.

> - can you reproduce this manually?
Yes, but only about 70% of the time (only on a rhel8 guest with a gluster OS disk).
Reproducibility is ~70% manually: of 14 created VMs (rhel8 guest), the issue occurred 10 times.
In automation, the issue occurs every time I run TestCase18868 (ran it about ~7 times).

> - Does it happen with rhel 7.6 guest or only with rhel 7.7 guest?
7.6 guest - tried 8 times, the issue did not reproduce.
7.7 guest - tried 8 times, the issue did not reproduce.
8.1 guest with gluster OS disk - the issue occurred 10/14 times (~70% repro ratio) manually, and every time when running automation TestCase18868 (which creates a VM from a rhel8 template + cold snapshot).
8.1 guest with FCP OS disk - created 8 VMs, the issue did not occur once.

> - Does it happen only with rhel 7.7 host or also with rhel 7.6 hosts and 7.7 guest?
Until now we worked only with rhel7.6 (4.3.4) hosts on the same ENVs/gluster, and the issue did not occur once.
If you are asking whether we tested the latest 4.3.5 with rhel7.6 hosts, we did not, as 4.3.5 is synced with RHEL7.7.

> - Can you attach the vm journal? (journalctl -b) maybe start the same vm on rhel 7.6
>   if the issue happens only with 7.7.
No, because when the VM reaches this state you cannot connect via ssh and the console is not responsive.

> - Can you add gluster logs from /var/log/gluster*/?
Added.

> When we have more info we can ask the qemu team to look at this.

More info:
I saw the issue occur on multiple environments.
The issue also occurs with 2 different gluster clusters with 2 different versions:
glusterfs 3.12.6 (TLV2 site) - our infra TierX runs are done here
glusterfs 6.3 (Raanana site) - local RHV storage team
(In reply to Avihai from comment #4)
> Yes indeed but about 70% of the times (only on rhel8 guest with gluster os
> disk)
> ...
> Also the issue occur with 2 different gluster clusters with 2 different
> versions:
> glusterfs 3.12.6 (TLV2 site) - out infra TierX runs are done here
> glusterfs 6.3 (Raanana site) - local RHV storage team

Could this be related to Bug 1701736? Can you try changing the
UseNativeIOForGluster to false in vdc_options before the run?
(In reply to Sahina Bose from comment #5)
> Could this be related to Bug 1701736? Can you try changing the
> UseNativeIOForGluster to false in vdc_options before the run?

I tried but I cannot set it, please help.
(host FQDN: hosted-engine-06.lab.eng.tlv2.redhat.com)

This is what I get:

[root@hosted-engine-06 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3
Error setting UseNativeIOForGluster's value. No such entry with version 4.3.
OK, I found how to add it according to Bug 1701736:

[root@hosted-engine-06 ~]# vi /etc/ovirt-engine/engine-config/engine-config.properties

Add the following lines at the end:
+UseNativeIOForGluster.description=Access volumes on glusterfs with aio native insteat of thread
+UseNativeIOForGluster.type=Boolean

[root@hosted-engine-06 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3

[root@hosted-engine-06 ~]# engine-config -g UseNativeIOForGluster
UseNativeIOForGluster: false version: 4.1
UseNativeIOForGluster: true version: 4.2
UseNativeIOForGluster: False version: 4.3

[root@hosted-engine-06 ~]# systemctl restart ovirt-engine
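Independently of engine-config -g, one way to confirm the setting actually reached qemu is to look for the aio= option on the -drive argument of the running qemu-kvm process after the next VM start. This is a hedged sketch, not an official procedure: the /proc one-liner in the comment is hypothetical (it assumes a single qemu-kvm process on the host), and the drive string below is abbreviated from the qemu log in this bug.

```shell
# On a live host, something like this would show the aio mode of the VM disk
# (hypothetical one-liner, assumes a single qemu-kvm process on the host):
#   tr '\0' '\n' < /proc/"$(pgrep -o -f qemu-kvm)"/cmdline | grep -o 'aio=[a-z]*'
# Demonstrated here on an abbreviated -drive string from the qemu log above:
drive_arg='format=qcow2,if=none,id=drive-ua-022b31a1,werror=stop,rerror=stop,cache=none,aio=native'
echo "$drive_arg" | grep -o 'aio=[a-z]*'   # prints aio=native before the change
```

After the config change and an engine restart, a freshly started VM on a gluster SD should show aio=threads instead.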
So to answer Sahina's question:

> Could this be related to Bug 1701736? Can you try changing the
> UseNativeIOForGluster to false in vdc_options before the run?

Indeed, it looks related.

What I did:
1) Once I turned UseNativeIOForGluster off, the issue was not seen anymore (tried on 16 VMs and none of them had the issue).
2) Once I returned to the original engine-config.properties (without a UseNativeIOForGluster value), I saw the issue again (created a pool of 8 VMs and saw the issue on 2 out of 8 VMs).

At the moment engine-config.properties does not have a UseNativeIOForGluster entry and it has to be added manually.
Was the UseNativeIOForGluster value changed to be enabled in RHEL7.7, or is this an RHV 4.3.5 addition?

Another odd thing is that I see this issue only on a RHEL8 guest.
The issue is seen only with the following combination:

Host = RHEL7.7 / RHV4.3.5
Guest = RHEL8.1
VM OS disk is on gluster (it does not matter which version of the gluster storage)
UseNativeIOForGluster value does not exist in engine-config.properties and is turned on somehow.

Sahina, how do you want to proceed here?
So I guess this should be marked as a duplicate then, shouldn't it?
(In reply to Avihai from comment #8)
> Indeed it looks related.
>
> What I did:
> 1) Once I turned UseNativeIOForGluster off the issue was not seen anymore
>    (tried on 16 VMs and none of them had the issue)
> 2) Once I returned to the original engine-config.properties (without a
>    UseNativeIOForGluster value) I saw the issue again
>    (created a pool of 8 VMs and saw the issue on 2 out of 8 VMs)
>
> Sahina, how do you want to proceed here?

Thanks Avihai for checking this.

We introduced aio=native due to the performance gains seen in testing (Bug 1630744).
One option is to revert to using aio=threads (which we will do, since corruption takes precedence over performance).

But there does seem to be an issue with aio=native that depends on the guest OS used - this also needs to be investigated. Perhaps change the component to aio to investigate?
(In reply to Sahina Bose from comment #10)
> Thanks Avihai for checking this.
>
> We introduced aio=native due to the performance gains seen in testing (Bug
> 1630744)

This bug was introduced in 4.2 and the issue was not seen until now; what changed in 4.3.5 to make it noticeable?

> One option is to revert to using aio=threads (which we will do, since
> corruption takes precedence over performance)
>
> But there does seem to be an issue with aio=native that depends on the
> guest OS used - this also needs to be investigated. Perhaps change the
> component to aio to investigate?

I am not familiar with this component (it is not in the drop-down); can you please change it?
(In reply to Avihai from comment #11)
> This bug was introduced in 4.2 and the issue was not seen until now; what
> changed in 4.3.5 to make it noticeable?

The change was introduced in 4.2, and the bug is only being seen now. The only change I can see is the guest OS, and RHEL 7.7 on the host.

> I am not familiar with this component (not in the drop-down); can you please
> change it?
Hi Avihai,

KVM QE could not reproduce it in a QEMU env.

Is it possible for you to provide the QEMU command line when 'UseNativeIOForGluster' = False is set? I would like to confirm that the 'UseNativeIOForGluster' option corresponds to the 'aio=native' option on the QEMU side.

Thanks.
(In reply to CongLi from comment #19)
> I would like to confirm the option 'UseNativeIOForGluster'
> is corresponding to the option 'aio=native' in QEMU side.

Sorry, I meant that 'UseNativeIOForGluster' = False corresponds to the 'aio=threads' option on the QEMU side.

Thanks.
(In reply to CongLi from comment #20)
> Sorry, I mean 'UseNativeIOForGluster' = False is corresponding
> to the option 'aio=threads' in QEMU side.

I've set 'UseNativeIOForGluster' = False on this engine (FQDN "storage-ge-08.scl.lab.tlv.redhat.com"), but I need your help with providing the QEMU command line; can you please help?

Details on how I set it:

[root@storage-ge-08 ~]# vim /etc/ovirt-engine/engine-config/engine-config.properties

Added the following lines at the end:
UseNativeIOForGluster.description=Access volumes on glusterfs with aio native instead of threads
UseNativeIOForGluster.type=Boolean

[root@storage-ge-08 ~]# engine-config -s UseNativeIOForGluster=False
Please select a version:
1. 4.1
2. 4.2
3. 4.3
3
[root@storage-ge-08 ~]# engine-config -g UseNativeIOForGluster
UseNativeIOForGluster: false version: 4.1
UseNativeIOForGluster: true version: 4.2
UseNativeIOForGluster: False version: 4.3
[root@storage-ge-08 ~]# systemctl restart ovirt-engine
(In reply to Avihai from comment #21)
> I've set 'UseNativeIOForGluster' = False at this engine FQDN
> "storage-ge-08.scl.lab.tlv.redhat.com".
> But I need your help on providing the QEMU command line, can you please help?

Could you please try '# ps aux | grep qemu' on your host / engine?

Thanks.
(In reply to CongLi from comment #22)
> Could you please help try '# ps aux | grep qemu' in your host / engine ?

Indeed, I see aio=threads after setting UseNativeIOForGluster to false.

Engine output:
[root@storage-ge-08 ~]# ps aux | grep qemu
root 1176 0.0 0.0 44216 2564 ? Ss Jul03 0:36 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook
root 15741 0.0 0.0 112720 964 pts/0 S+ 15:12 0:00 grep --color=auto qemu

Host output:
[root@storage-ge8-vdsm2 ~]# ps aux | grep qemu
root 725 0.0 0.0 44216 2436 ? Ss Jul03 1:14 /usr/bin/qemu-ga --method=virtio-serial --path=/dev/virtio-ports/org.qemu.guest_agent.0 --blacklist=guest-file-open,guest-file-close,guest-file-read,guest-file-write,guest-file-seek,guest-file-flush,guest-exec,guest-exec-status -F/etc/qemu-ga/fsfreeze-hook
qemu 12943 13.0 19.6 1998592 760756 ?
Rl Jul04 543:30 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-4-pool_vm_gluster-1/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid 13b09533-008c-4e40-acde-3efa7150e383 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=13b09533-008c-4e40-acde-3efa7150e383 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=32,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:48:37,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-5f9f4544-5e3a-4f0b-9c4b-f2b08bb8e9ba,bus=pci.0,addr=0x6 -device virtio-serial-pci,id=ua-0669b3db-2ede-40e1-af66-42ed04cf035c,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ua-2f4d2630-0b13-4973-9c9d-309548498205,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-2f4d2630-0b13-4973-9c9d-309548498205,id=ua-2f4d2630-0b13-4973-9c9d-309548498205 -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/db3220fa-7af3-4220-a5e4-52e49086edb2/161ae4ac-f98f-47f1-a59a-63dc97531130,format=qcow2,if=none,id=drive-ua-db3220fa-7af3-4220-a5e4-52e49086edb2,serial=db3220fa-7af3-4220-a5e4-52e49086edb2,werror=stop,rerror=stop,cache=none,aio=threads -device 
virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-db3220fa-7af3-4220-a5e4-52e49086edb2,id=ua-db3220fa-7af3-4220-a5e4-52e49086edb2,bootindex=1,write-cache=on -netdev tap,fd=36,id=hostua-ced84a2f-971d-4766-914a-e0a1852ea21c,vhost=on,vhostfd=41 -device virtio-net-pci,host_mtu=1500,netdev=hostua-ced84a2f-971d-4766-914a-e0a1852ea21c,id=ua-ced84a2f-971d-4766-914a-e0a1852ea21c,mac=00:1a:4a:16:25:e9,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=42,server,nowait -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=43,server,nowait -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-0669b3db-2ede-40e1-af66-42ed04cf035c.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5900,tls-port=5901,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:2,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device qxl-vga,id=ua-5bca4eec-f416-45b3-a56b-291efea91a1c,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-3d0e9700-ad99-40c0-b15d-665b730373de,bus=pci.0,addr=0x4 -device hda-duplex,id=ua-3d0e9700-ad99-40c0-b15d-665b730373de-codec0,bus=ua-3d0e9700-ad99-40c0-b15d-665b730373de.0,cad=0 -device virtio-balloon-pci,id=ua-a0d9d265-ffc4-4a92-b6db-e4ae8fc59db6,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device 
virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on qemu 12971 12.9 19.1 1936940 740856 ? Rl Jul04 540:09 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-2,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-5-pool_vm_gluster-2/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid e1059eb4-0526-43ef-8a27-62487d5b4588 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=e1059eb4-0526-43ef-8a27-62487d5b4588 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=38,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:48:38,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-70ab3713-14ce-47d6-a5fc-34312acedf3d,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3,werror=report,rerror=report,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3,id=ua-22b0ab7e-3454-4bfb-829d-78822ce8a4e3 -drive 
file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/e97c1add-a000-4100-9045-40da96ee378b/4f12ef9d-a1f2-4772-815f-9b2c268acbe0,format=qcow2,if=none,id=drive-ua-e97c1add-a000-4100-9045-40da96ee378b,serial=e97c1add-a000-4100-9045-40da96ee378b,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-e97c1add-a000-4100-9045-40da96ee378b,id=ua-e97c1add-a000-4100-9045-40da96ee378b,bootindex=1,write-cache=on -netdev tap,fd=40,id=hostua-e554b0e4-7c8d-4263-9cea-3a54b34900db,vhost=on,vhostfd=32 -device virtio-net-pci,host_mtu=1500,netdev=hostua-e554b0e4-7c8d-4263-9cea-3a54b34900db,id=ua-e554b0e4-7c8d-4263-9cea-3a54b34900db,mac=00:1a:4a:16:25:ea,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=36,server,nowait -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=41,server,nowait -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-8ca2e0ce-47d2-44cc-b97b-f681af344975.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5903,tls-port=5904,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:5,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device qxl-vga,id=ua-759d513c-80e2-4eb2-b9e4-7a40029f55b0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540,bus=pci.0,addr=0x4 
-device hda-duplex,id=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540-codec0,bus=ua-ae3ba68a-8dfd-4eb9-b43d-0b585110a540.0,cad=0 -device virtio-balloon-pci,id=ua-82435524-5918-4777-aeeb-17cc2fca4fa3,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on qemu 13370 11.8 14.6 1968764 569236 ? Rl Jul04 494:22 /usr/libexec/qemu-kvm -name guest=pool_vm_gluster-6,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-6-pool_vm_gluster-6/master-key.aes -machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off -cpu Nehalem -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -object iothread,id=iothread1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid b35a0328-6340-46c3-a697-888a1eec74ac -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=53a0fc14-3841-4c1d-a7cc-b5b7b25874a4,uuid=b35a0328-6340-46c3-a697-888a1eec74ac -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=33,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2019-07-04T14:49:31,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,iothread=iothread1,id=ua-08d5b00d-83f4-4249-8c5d-7dad3ca97c2c,bus=pci.0,addr=0x5 -device virtio-serial-pci,id=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa,max_ports=16,bus=pci.0,addr=0x6 -drive if=none,id=drive-ua-28053b06-9947-4b3e-a737-03a150e8809f,werror=report,rerror=report,readonly=on -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ua-28053b06-9947-4b3e-a737-03a150e8809f,id=ua-28053b06-9947-4b3e-a737-03a150e8809f -drive file=/rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage__local__ge8__volume__0/dc3f1c4c-10a8-459b-ada3-901201bd1df2/images/fea12c3b-307d-4f3d-89fc-db0ac82d5b67/f4169dcc-c8a8-40fa-9eb9-61b65b6566be,format=qcow2,if=none,id=drive-ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,serial=fea12c3b-307d-4f3d-89fc-db0ac82d5b67,werror=stop,rerror=stop,cache=none,aio=threads -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,id=ua-fea12c3b-307d-4f3d-89fc-db0ac82d5b67,bootindex=1,write-cache=on -netdev tap,fd=35,id=hostua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,vhost=on,vhostfd=36 -device virtio-net-pci,host_mtu=1500,netdev=hostua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,id=ua-4439dbe7-a366-43bd-82aa-fe3a0836afe3,mac=00:1a:4a:16:25:ee,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,fd=37,server,nowait -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=1,chardev=charchannel0,id=channel0,name=ovirt-guest-agent.0 -chardev socket,id=charchannel1,fd=38,server,nowait -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=ua-43532932-fe0c-4698-bc7f-615e0ebe69aa.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -spice port=5906,tls-port=5907,addr=10.35.82.80,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -vnc 10.35.82.80:8,password,tls,x509=/etc/pki/vdsm/libvirt-vnc -k en-us -device 
qxl-vga,id=ua-1b09afab-447d-45da-aa2c-05dd017663fa,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a,bus=pci.0,addr=0x4 -device hda-duplex,id=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a-codec0,bus=ua-f9ceb167-c491-4630-8d0b-19430b7d9d9a.0,cad=0 -device virtio-balloon-pci,id=ua-2d0a78cb-7d26-4b96-811e-e61b6a687c2b,bus=pci.0,addr=0x8 -object rng-random,id=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,filename=/dev/urandom -device virtio-rng-pci,rng=objua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,id=ua-9806d9a9-c31c-4ca8-b87a-b09efdc9309e,bus=pci.0,addr=0x9 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
Thanks Avihai.

Could you please also provide the steps and script of TestCase18868?

Thanks.
(In reply to CongLi from comment #24)
> Could you please also help provide the steps and script of TestCase18868 ?

All the tests mentioned in the QA whiteboard (TestCase18868 included) do the same simple thing: start a VM from a RHEL8 OS disk that resides on a gluster storage domain, wait for an IP, and try to SSH, which fails once the issue occurs.

We use a large framework called ART which drives the tests through REST API calls, so there is no simple script I can supply unless you are already working with this framework.

However, this was reproduced manually many times, not on all VMs but on 2 out of 8 VMs or more:
1) Create a template based on a RHEL8 OS disk on a gluster storage domain.
2) Create as many VMs as possible from that template.
3) Start the VMs and wait for an IP.
4) Once a VM gets an IP, try to SSH => fails.

The VM starts and gets an IP, but right afterward you see this error in the console. From that point the IP is still reachable but ssh fails; the VM is 'UP' yet you cannot do anything with it, and the console is also stuck.
(In reply to CongLi from comment #19)
> KVM QE could not reproduce it in QEMU env.

Hi CongLi,
I'm starting to work on this BZ. Are you able to reproduce it in the QEMU env?

Thanks,
Stefano
I tried the following command:

/usr/libexec/qemu-kvm \
-name "guest-rhel8.0-2" \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu Westmere \
-m 6144 \
-realtime mlock=off \
-uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
-smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
-nodefaults \
-vga qxl \
-object iothread,id=iothread0 \
-drive file=/mnt/gluster/rhel810-64-virtio2.qcow2,format=qcow2,if=none,id=drive-ua-1,serial=1,werror=stop,rerror=stop,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-ua-1,id=ua-1,bootindex=1,write-cache=on \
-vnc :2 \
-monitor stdio \
-device virtio-net-pci,mac=9a:b5:b6:a1:b2:c2,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pci.0,addr=0x9 \
-netdev tap,id=idxgXAlm,vhost=on \
-qmp tcp:localhost:5952,server,nowait \
-chardev file,path=/home/serial2.log,id=serial_id_serial0 \
-device isa-serial,chardev=serial_id_serial0

/mnt/gluster/rhel810-64-virtio2.qcow2 is my guest image. I cannot reproduce this issue.

Hi Stefano, could you please run the above command on your VDSM node? You need to replace "/mnt/gluster/rhel810-64-virtio2.qcow2" with your guest image located on your gluster fs. Could you please share your guest image if you can reproduce this issue with this command?
(In reply to qing.wang from comment #28)
> Hi Stefano, could you please help to run above command on your vdsm node.
> You need to replace "/mnt/gluster/rhel810-64-virtio2.qcow2" with your guest
> image which is located on your gluster fs.

Hi Qing,
I'm not able to either. I started QEMU with a very similar command line, but it works well. Maybe Avihai can help us, because I don't have access to the VDSM node.
It looks like this issue is related to the oVirt environment. Avihai said the new sanlock-3.7.3-1.el7.x86_64 is involved; it may have introduced a regression. I suggest checking the sanlock log when it happens. Is it possible to roll back to the previous version of the sanlock package and test again?
Guys,

The ENV is there for you to use/debug - please use it ASAP to extract what you need.

It took me a while to reproduce this issue, but I did; it is hard to reproduce.

Of the 12 VMs, only 3 have this issue (see the attached print screen):
VM name = rhel8_nativeaio-3
VM name = rhel8_nativeaio-6
VM name = rhel8_nativeaio-10
Created attachment 1602624 [details] print screen of the issue: 3 VMs out of 12 have the issue
My test steps:

1. Shut down rhel8_nativeaio-1 through rhel8_nativeaio-8 on ovirt-engine.

2. Create guest scripts (vm1.sh - vm8.sh) with a qemu-kvm command like:

file=/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/6e798b8b-ef87-4b13-9ef1-43726b79f724/6c7c2f45-a756-48e5-bc73-41955cb8d629
mac=00:1a:4a:16:88:45
idx=1
/usr/libexec/qemu-kvm \
-name "guest-rhel8.0-${idx}" \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu Westmere \
-m 6144 \
-realtime mlock=off \
-uuid dbfc4b9a-74bf-4c21-95c6-0840743fd57a \
-smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.7-7.el7,serial=4c4c4544-0047-3210-8053-c4c04f473632,uuid=dbfc4b9a-74bf-4c21-95c6-0840743fd57a' \
-smp 1,maxcpus=16,sockets=16,cores=1,threads=1 \
-nodefaults \
-vga qxl \
-object iothread,id=iothread0 \
-drive file=${file},format=qcow2,if=none,id=drive-ua-1,serial=1,werror=stop,rerror=stop,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-ua-1,id=ua-1,bootindex=1,write-cache=on \
-vnc :2${idx} \
-monitor stdio \
-device virtio-net-pci,mac=${mac},id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pci.0,addr=0x9 \
-netdev tap,id=idxgXAlm,vhost=on \
-qmp tcp:localhost:595${idx},server,nowait \
-chardev file,path=/home/serial${idx}.log,id=serial_id_serial0 \
-device isa-serial,chardev=serial_id_serial0 \
-device vmcoreinfo

In each vmN.sh the file and MAC address correspond to the respective oVirt guest VM (rhel8_nativeaio-1 through -8). Those scripts are located in /root/test/ on lynx25.lab.eng.tlv2.redhat.com and lynx26.lab.eng.tlv2.redhat.com.

3. Run vm1.sh vm2.sh vm3.sh vm4.sh on lynx25.lab.eng.tlv2.redhat.com; run vm5.sh vm6.sh vm7.sh vm8.sh on lynx26.lab.eng.tlv2.redhat.com (set the default NIC to DHCP-enabled in the guest VM).

4. Wait until the guest VMs have started and got an IP, then wait 20-30 minutes.

5. Check the remote consoles to see if there is any issue on the guests.

6. Power off vm1-vm8 and repeat steps 3-5.
===================================================================
I ran the above test and found no issue. vm1-vm8 are still running on lynx25.lab.eng.tlv2.redhat.com and lynx26.lab.eng.tlv2.redhat.com. We can wait until tomorrow to see if a problem appears after long-time running.

Hostnames of vm1-vm8: vm-17-69 to vm-17-76
This is a shared stand and other teams need to run tests on it by tomorrow, so please debug as needed until then.

The issue was already there on 3 VMs (see below); were you able to see what is special about them?
VM name = rhel8_nativeaio-3
VM name = rhel8_nativeaio-6
VM name = rhel8_nativeaio-10
(In reply to Avihai from comment #35)
> The issue was already there on 3 VMs (see below); were you able to see what
> is special about them?
> VM name= rhel8_nativeaio-3
> VM name= rhel8_nativeaio-6
> VM name= rhel8_nativeaio-10

I have shut down the above VMs because my VMs shared their images in my qemu command-line testing.

But I cannot reproduce it.

After some investigation, I suspect the oVirt testing is not valid: aio=native is set when creating a VM based on a template. For more detail please refer to https://www.redhat.com/archives/libvirt-users/2017-January/msg00025.html.

I pick one VM, rhel8_nativeaio-12, to explain it.

Disk info:

<disk type='file' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
  <source file='/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore type='file' index='1'>
    <format type='qcow2'/>
    <source file='/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/d5d90f0a-c496-4946-95fd-41e17abac359'/>
    <backingStore/>
  </backingStore>
  <target dev='vda' bus='virtio'/>
  <serial>e5508a87-85c0-44cc-94fd-19ebc974c9a8</serial>
  <boot order='1'/>
  <alias name='ua-e5508a87-85c0-44cc-94fd-19ebc974c9a8'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>

This VM uses one image:
/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc

[root@lynx26 test]# qemu-img info
/rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc
image: /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/77ff077c-3ac6-4614-b75b-eb54294058bc
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 121M
cluster_size: 65536
backing file: d5d90f0a-c496-4946-95fd-41e17abac359 (actual path: /rhev/data-center/mnt/glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__7__volume01/92e8d1b3-bc5b-4759-a997-c5cbdb32f6e7/images/e5508a87-85c0-44cc-94fd-19ebc974c9a8/d5d90f0a-c496-4946-95fd-41e17abac359)
backing file format: qcow2
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

It indicates this image is not preallocated, so aio=native is not suitable for it.

@Stefano, please help to confirm it.
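Following up on the qemu-img output above: if aio=native is to be kept, the image would need to be fully preallocated. A minimal sketch of how such an image could be created (the file name and size are illustrative, not from the environment above; `preallocation=falloc` reserves blocks via fallocate, `preallocation=full` writes zeroes and is slower):

```shell
#!/bin/sh
# Sketch: create a fully preallocated image (hypothetical name/size).
if command -v qemu-img >/dev/null 2>&1; then
    # qcow2 with data blocks reserved up front; qemu-img info should then
    # report a "disk size" close to the virtual size, unlike the sparse
    # image shown above.
    qemu-img create -f qcow2 -o preallocation=falloc preallocated.qcow2 64M
    qemu-img info preallocated.qcow2
else
    # qemu-img not installed on this machine; a raw image can be
    # preallocated directly with fallocate instead.
    fallocate -l 64M preallocated.raw
fi
```

After this, the `disk size` line of `qemu-img info` (or `du` on the raw file) confirms the allocation before the image is handed to a VM with aio=native.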
(In reply to Avihai from comment #32)
> From 12 VMs only 3 VMs have this issue (see print screen attached):
> VM name= rhel8_nativeaio-3
> VM name= rhel8_nativeaio-6
> VM name= rhel8_nativeaio-10

Sorry for the late response, but I was on PTO. I think the VMs are not running anymore. Please, can you try to reproduce the issue again?
(In reply to qing.wang from comment #36)
> After some investigation, i suspect it is not valid testing in ovirt
> testing: set aio=native when create vm based on template.

Good catch!

> More detail please refer to
> https://www.redhat.com/archives/libvirt-users/2017-January/msg00025.html.
> [...]
> @Stefano Please help to confirm it

Right, it is not recommended, and here are other details:

- https://access.redhat.com/articles/41313
  "Specifically, if qemu-kvm is used with the aio=native IO mode over a sparse device image hosted on the ext4 or xfs filesystem, guest filesystem corruption will occur if partitions are not aligned with the host filesystem block size"

- https://drive.google.com/file/d/0B44EcgFDZNtXSnhKbEZfNE1ad28 [Slide 61/66]
  "Native AIO can block the VM if the file is not fully allocated and is therefore not recommended for use on sparse files"
  "Writes to sparsely allocated files are more likely to block than fully preallocated files. Therefore it is recommended to only use aio=native on fully preallocated files, local disks, or logical volumes."

So, if we want to use aio=native, maybe we should provide a fully preallocated image.
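The sparse-versus-preallocated distinction above can be checked mechanically. A minimal sketch (assuming GNU coreutils `stat`, `truncate`, and `fallocate`; the temp files stand in for real images): comparing a file's allocated blocks to its apparent size tells a sparse image from a fully preallocated one.

```shell
#!/bin/sh
# Sketch: detect whether an image file is sparse (allocated < apparent size).
# The two temp files below are illustrative stand-ins for real disk images.
sparse=$(mktemp); prealloc=$(mktemp)
truncate -s 1M "$sparse"      # grows apparent size only; no blocks allocated
fallocate -l 1M "$prealloc"   # actually allocates 1M on disk

is_sparse() {
    apparent=$(stat -c %s "$1")
    allocated=$(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
    [ "$allocated" -lt "$apparent" ]
}

is_sparse "$sparse"   && echo "sparse image: avoid aio=native"
is_sparse "$prealloc" || echo "fully allocated: aio=native should be safe"
rm -f "$sparse" "$prealloc"
```

The same check could be scripted over the images under /rhev/data-center before deciding whether aio=native is appropriate for a given disk.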
Suggestion for an Insights Rule: if qemu-kvm is used with the aio=native IO mode over a sparse device image hosted on an ext4 or xfs filesystem, warn that this may cause guest filesystem corruption if partitions are not aligned with the host filesystem block size. For details, see https://access.redhat.com/articles/41313 and comment 23 above.
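A sketch of how such a rule might detect the risky combination, assuming libvirt's domain XML layout (where the AIO mode appears as the `io` attribute of a disk's `<driver>` element) and POSIX stat() semantics; the function name and structure are illustrative, not taken from any real Insights rule:

```python
import os
import xml.etree.ElementTree as ET

def native_aio_sparse_disks(domain_xml: str):
    """Return paths of file-backed disks that use io='native' and are
    sparsely allocated (allocated blocks < apparent size), i.e. the
    combination the proposed rule would warn about."""
    flagged = []
    for disk in ET.fromstring(domain_xml).findall(".//disk"):
        driver = disk.find("driver")
        source = disk.find("source")
        if driver is None or source is None:
            continue
        if driver.get("io") != "native":
            continue
        path = source.get("file")
        if not path or not os.path.isfile(path):
            continue
        st = os.stat(path)
        if st.st_blocks * 512 < st.st_size:  # sparse on POSIX
            flagged.append(path)
    return flagged
```

A real rule would also need to check the host filesystem type (ext4/xfs) and partition alignment, which are omitted here for brevity.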
This is an old issue, triggered when the RHHI/Gluster default was set to aio=native in RHV 4.3.5, failing our regression tests (bug 1701736). I think this issue is no longer urgent, as the default was set back to aio=threads for Gluster/RHHI storage in bug 1701736. Since that fix was made, this issue is no longer seen when running our automation regressions. Lowering severity to high; please raise it if you feel otherwise.
Avihai, do you have any VM where you can replicate this issue?
At which level should we fix this issue? If we follow the Insights Rule mentioned in comment 41, I think it is an upper-application (oVirt) issue, not a QEMU issue, right?
(In reply to qing.wang from comment #45)
> Which level do we ready to fix this issue? if we follow the Insights Rule
> mentioned in comment 41 , i think it is upper application (ovirt) issue ,
> not the qemu issue ,right?

We should better understand why we get the corruption when using aio=native. It could be an alignment problem, or another issue in XFS, QEMU, the FUSE driver, or Gluster; maybe aio=native just changes the timing and brings out a hidden bug.
(In reply to Stefano Garzarella from comment #43)
> Avihai,
> do you have any VM where you can replicate this issue?

Hi Stefano,
I reproduced the issue multiple times and left the environment for DEV debugging, and I can do it again if that will help.

As this configuration seems unsupported for now and the default was set to aio=threads for Gluster/RHHI storage in bug 1701736, is this still needed?
(In reply to Avihai from comment #47)
> (In reply to Stefano Garzarella from comment #43)
> > Avihai,
> > do you have any VM where you can replicate this issue?
>
> Hi Stefano,
> I reproduce the issue multiple times and left the environment for DEV debug
> and can do it again if this will help.

Thanks! I'll ping you on IRC when I work on it.

> As this issue seems not supported for now and the default was set to
> aio=threads for gluster/RHHI storage at bug 1701736, is this still needed?

Maybe we can reduce the priority and severity, but I think it can still be useful to solve this issue.
Deferring this to RHEL8-AV. Worth fixing, but neither urgent nor a regression (see previous comments).
QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.
Since this issue is no longer relevant to RHV (due to changing the default to aio=threads when working with GlusterFS) and since it is so hard to reproduce, I propose closing this bug as WONTFIX or DEFERRED.
Closing based on comment 53.