Bug 1753112 - During RHHI setup HE deployment failed with 4k in rhel-7.7
Summary: During RHHI setup HE deployment failed with 4k in rhel-7.7
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhiv-1.6
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: ---
Assignee: Gobinda Das
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1752806
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-09-18 06:51 UTC by Gobinda Das
Modified: 2019-09-26 14:05 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1752806
Environment:
Last Closed: 2019-09-26 14:05:33 UTC
Embargoed:


Attachments (Terms of Use)

Description Gobinda Das 2019-09-18 06:51:40 UTC
Description of problem:
 During RHHI deployment at te time of HE health check, deployment failed and he vm moved to paused state. 
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Error:

[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.147544", "end": "2019-09-16 14:56:59.790588", "rc": 0, "start": "2019-09-16 14:56:59.643044", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}"]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Check VM status at virt level]
[ INFO ] TASK [ovirt.hosted_engine_setup : Fail if engine VM is not running]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM is not running, please check vdsm logs"}


I checked vdsm,sanlock and agent log but could not find anything.
Below are the pkgs installed:
rhvm-appliance-4.3-20190904.0.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
glusterfs-server-3.12.2-47.4.el7rhgs.x86_64
sanlock-3.7.3-1.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7.x86_64
ovirt-hosted-engine-ha-2.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.28-1.el7ev.noarch


In further investigation I came to know qemu paused it due to a storage error:

[root@tendrl27 ~]# virsh -r domstate --reason HostedEngine
paused (I/O error)

[root@tendrl27 ~]# virsh -r domblkerror HostedEngine
vda: unspecified error


Checked gluster storage domain mount log:

[2019-09-16 10:00:09.158798] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112451: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158817] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112445: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158840] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112449: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-2: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.485358] and [2019-09-16 10:00:09.157584]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-0: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484026] and [2019-09-16 10:00:09.158515]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-1: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484882] and [2019-09-16 10:00:09.158827]
[2019-09-16 10:01:10.419516] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419518] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419613] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.150878] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113275: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f64600364a8 (Invalid argument)
[2019-09-16 10:01:14.151399] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113299: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f6460012b48 (Invalid argument)
[2019-09-16 10:01:12.154307] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154366] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154373] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]

Looks like  reads are not 4k-aligned gluster is merely executing the fops it got from higher layers.
For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump.
This is not an integral multiple of 4096, and as a result the read failed.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Gobinda Das 2019-09-26 14:05:33 UTC
Closing this as the deployment works with qemu-kvm fix.


Note You need to log in before you can comment on or make changes to this bug.