Description of problem:

During RHHI deployment, at the time of the HE health check, the deployment failed and the HE VM moved to a paused state.

Engine status: {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Error:

[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.147544", "end": "2019-09-16 14:56:59.790588", "rc": 0, "start": "2019-09-16 14:56:59.643044", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}"]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Check VM status at virt level]
[ INFO ] TASK [ovirt.hosted_engine_setup : Fail if engine VM is not running]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM is not running, please check vdsm logs"}

I checked the vdsm, sanlock, and agent logs but could not find anything relevant.
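For reference, the check that the failing ansible task performs can be re-run by hand on the host and the engine-status pulled out of the JSON; a minimal sketch, assuming jq is available on the host:

[root@tendrl27 ~]# hosted-engine --vm-status --json | jq '."1"."engine-status"'

On this host it returns the same {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"} object quoted above.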
Below are the packages installed:

rhvm-appliance-4.3-20190904.0.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
glusterfs-server-3.12.2-47.4.el7rhgs.x86_64
sanlock-3.7.3-1.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7.x86_64
ovirt-hosted-engine-ha-2.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.28-1.el7ev.noarch

On further investigation I found that qemu paused the VM due to a storage error:

[root@tendrl27 ~]# virsh -r domstate --reason HostedEngine
paused (I/O error)

[root@tendrl27 ~]# virsh -r domblkerror HostedEngine
vda: unspecified error

Checked the gluster storage domain mount log:

[2019-09-16 10:00:09.158798] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112451: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158817] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112445: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158840] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112449: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-2: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.485358] and [2019-09-16 10:00:09.157584]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-0: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484026] and [2019-09-16 10:00:09.158515]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-1: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484882] and [2019-09-16 10:00:09.158827]
[2019-09-16 10:01:10.419516] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419518] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419613] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.150878] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113275: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f64600364a8 (Invalid argument)
[2019-09-16 10:01:14.151399] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113299: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f6460012b48 (Invalid argument)
[2019-09-16 10:01:12.154307] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154366] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154373] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]

It looks like the reads are not 4k-aligned; gluster is merely executing the fops it receives from the higher layers. For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump. This is not an integral multiple of 4096, and as a result the read failed (a quick arithmetic check of this offset is sketched under Additional info below).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
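As a quick sanity check of the alignment analysis above (a minimal sketch; the offset is the one taken from the tcpdump), plain shell arithmetic shows the failing read offset is not a multiple of 4096:

[root@tendrl27 ~]# echo $(( 6121984 % 4096 ))
2560

A 4k-aligned offset would leave a remainder of 0; the non-zero remainder matches the 'Invalid argument' (EINVAL) failures seen in the gluster mount log.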
Closing this as the deployment works with the qemu-kvm fix.