Description of problem:
During RHHI deployment, the HE health check failed and the HE VM moved to a paused state.

Engine status: {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Error:
[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.147544", "end": "2019-09-16 14:56:59.790588", "rc": 0, "start": "2019-09-16 14:56:59.643044", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}"]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Check VM status at virt level]
[ INFO ] TASK [ovirt.hosted_engine_setup : Fail if engine VM is not running]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM is not running, please check vdsm logs"}

I checked the vdsm, sanlock and agent logs but could not find anything.
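For a quicker readout while the health-check task is looping, a minimal one-liner (a sketch only, assuming jq is available on the host; the key layout is taken from the JSON above) to pull just the engine-status fields:

  hosted-engine --vm-status --json | jq -r '.["1"]["engine-status"] | "vm=\(.vm) health=\(.health) detail=\(.detail)"'

While the problem is present this should print: vm=up health=bad detail=Paused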
Below are the installed packages:
rhvm-appliance-4.3-20190904.0.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
glusterfs-server-3.12.2-47.4.el7rhgs.x86_64
sanlock-3.7.3-1.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7.x86_64
ovirt-hosted-engine-ha-2.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.28-1.el7ev.noarch

On further investigation I found that qemu paused the VM due to a storage error:

[root@tendrl27 ~]# virsh -r domstate --reason HostedEngine
paused (I/O error)

[root@tendrl27 ~]# virsh -r domblkerror HostedEngine
vda: unspecified error

Checked the gluster storage domain mount log:

[2019-09-16 10:00:09.158798] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112451: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158817] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112445: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158840] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112449: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-2: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.485358] and [2019-09-16 10:00:09.157584]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-0: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484026] and [2019-09-16 10:00:09.158515]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-1: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484882] and [2019-09-16 10:00:09.158827]
[2019-09-16 10:01:10.419516] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419518] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419613] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.150878] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113275: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f64600364a8 (Invalid argument)
[2019-09-16 10:01:14.151399] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113299: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f6460012b48 (Invalid argument)
[2019-09-16 10:01:12.154307] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154366] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154373] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]

It looks like the reads are not 4k-aligned; gluster is merely executing the fops it got from the higher layers. For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump. This is not an integral multiple of 4096, and as a result the read failed (see the quick offset check after the template fields below).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
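As a side note on the offset quoted above: 6121984 is a multiple of 512 but not of 4096, which is consistent with the 4k-alignment explanation. A quick check in bash:

  offset=6121984
  echo $(( offset % 512 ))    # 0    -> aligned for 512-byte sectors
  echo $(( offset % 4096 ))   # 2560 -> not 4k-aligned, so the read fails on 4k storage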
Gobinda, can you check if this is due to Bug 1745443? Have we included the correct version?
(In reply to Gobinda Das from comment #0)
> Description of problem:
> During RHHI deployment, the HE health check failed and the HE VM moved to a paused state.
> Engine status: {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Please share:
- /var/log/vdsm/vdsm.log covering the time frame when the VM was started and paused
- /var/log/libvirt/qemu/vm-name.log

> qemu-kvm-rhev-2.12.0-33.el7.x86_64

You need qemu-kvm-rhev-2.12.0-33.el7_7.3 for 4k support. Can you check the version again?

> It looks like the reads are not 4k-aligned; gluster is merely executing the fops it got from the higher layers.
> For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump.
> This is not an integral multiple of 4096, and as a result the read failed.

This sounds like a qemu issue, but let's make sure we are using the correct qemu version before passing this to the qemu folks.
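To double-check the build actually installed on the host (the required NVR is the one quoted above):

  rpm -q qemu-kvm-rhev

It must report qemu-kvm-rhev-2.12.0-33.el7_7.3 (or a newer build) for 4k support; the package list in comment #0 shows 2.12.0-33.el7, which is older than that.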
Closing this, as the dependent bug is closed.