Description of problem:
During RHHI deployment, the HE health check failed and the HE VM moved to a paused state.

Engine status: {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Error:
[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.147544", "end": "2019-09-16 14:56:59.790588", "rc": 0, "start": "2019-09-16 14:56:59.643044", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}"]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Check VM status at virt level]
[ INFO ] TASK [ovirt.hosted_engine_setup : Fail if engine VM is not running]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM is not running, please check vdsm logs"}

I checked the vdsm, sanlock and agent logs but could not find anything.
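For a quicker readout while the health-check task is looping, a minimal one-liner (a sketch only, assuming jq is available on the host; the key layout is taken from the JSON above) to pull just the engine-status fields:

  hosted-engine --vm-status --json | jq -r '.["1"]["engine-status"] | "vm=\(.vm) health=\(.health) detail=\(.detail)"'

While the problem is present this should print: vm=up health=bad detail=Paused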
Below are the installed packages:
rhvm-appliance-4.3-20190904.0.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
glusterfs-server-3.12.2-47.4.el7rhgs.x86_64
sanlock-3.7.3-1.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7.x86_64
ovirt-hosted-engine-ha-2.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.28-1.el7ev.noarch

On further investigation I found that qemu paused the VM due to a storage error:

[root@tendrl27 ~]# virsh -r domstate --reason HostedEngine
paused (I/O error)

[root@tendrl27 ~]# virsh -r domblkerror HostedEngine
vda: unspecified error

Checked the gluster storage domain mount log:

[2019-09-16 10:00:09.158798] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112451: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158817] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112445: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158840] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112449: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-2: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.485358] and [2019-09-16 10:00:09.157584]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-0: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484026] and [2019-09-16 10:00:09.158515]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-1: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484882] and [2019-09-16 10:00:09.158827]
[2019-09-16 10:01:10.419516] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419518] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419613] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.150878] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113275: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f64600364a8 (Invalid argument)
[2019-09-16 10:01:14.151399] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113299: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f6460012b48 (Invalid argument)
[2019-09-16 10:01:12.154307] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154366] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154373] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]

It looks like the reads are not 4k-aligned; gluster is merely executing the fops it got from the higher layers. For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump. This is not an integral multiple of 4096, and as a result the read failed (see the quick offset check after the template fields below).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
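As a side note on the offset quoted above: 6121984 is a multiple of 512 but not of 4096, which is consistent with the 4k-alignment explanation. A quick check in bash:

  offset=6121984
  echo $(( offset % 512 ))    # 0    -> aligned for 512-byte sectors
  echo $(( offset % 4096 ))   # 2560 -> not 4k-aligned, so the read fails on 4k storage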
Gobinda, can you check if this is due to Bug 1745443? Have we included the correct version?
(In reply to Gobinda Das from comment #0)
> Description of problem:
> During RHHI deployment, the HE health check failed and the HE VM moved to a paused state.
> Engine status: {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Please share:
- /var/log/vdsm/vdsm.log covering the time frame when the VM was started and paused
- /var/log/libvirt/qemu/vm-name.log

> qemu-kvm-rhev-2.12.0-33.el7.x86_64

You need qemu-kvm-rhev-2.12.0-33.el7_7.3 for 4k support. Can you check the version again?

> It looks like the reads are not 4k-aligned; gluster is merely executing the fops it got from the higher layers.
> For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump.
> This is not an integral multiple of 4096, and as a result the read failed.

This sounds like a qemu issue, but let's make sure we are using the correct qemu version before passing this to the qemu folks.
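To double-check the build actually installed on the host (the required NVR is the one quoted above):

  rpm -q qemu-kvm-rhev

It must report qemu-kvm-rhev-2.12.0-33.el7_7.3 (or a newer build) for 4k support; the package list in comment #0 shows 2.12.0-33.el7, which is older than that.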
Closing this, as the dependent bug is closed.