Bug 1752806

Summary: During RHHI setup HE deployment failed with 4k in rhel-7.7
Product: [oVirt] ovirt-engine
Component: BLL.Gluster
Version: 4.3.6.5
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Target Milestone: ovirt-4.3.7
Target Release: ---
Reporter: Gobinda Das <godas>
Assignee: Nir Soffer <nsoffer>
QA Contact: Elad <ebenahar>
Docs Contact:
CC: bugs, kdhananj, nsoffer, sabose
Flags: pm-rhel: ovirt-4.3+
       sabose: blocker?
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1753112 (view as bug list)
Environment:
Last Closed: 2019-10-03 13:47:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1753112

Description Gobinda Das 2019-09-17 09:28:17 UTC
Description of problem:
 During RHHI deployment, at the time of the HE health check, the deployment failed and the HE VM moved to a paused state.
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Paused"}

Error:

[ INFO ] TASK [ovirt.hosted_engine_setup : Check engine VM health]
[ ERROR ] fatal: [localhost]: FAILED! => {"attempts": 120, "changed": true, "cmd": ["hosted-engine", "--vm-status", "--json"], "delta": "0:00:00.147544", "end": "2019-09-16 14:56:59.790588", "rc": 0, "start": "2019-09-16 14:56:59.643044", "stderr": "", "stderr_lines": [], "stdout": "{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}", "stdout_lines": ["{\"1\": {\"conf_on_shared_storage\": true, \"live-data\": true, \"extra\": \"metadata_parse_version=1\\nmetadata_feature_version=1\\ntimestamp=235457 (Mon Sep 16 14:56:59 2019)\\nhost-id=1\\nscore=3400\\nvm_conf_refresh_time=235457 (Mon Sep 16 14:56:59 2019)\\nconf_on_shared_storage=True\\nmaintenance=False\\nstate=EngineStarting\\nstopped=False\\n\", \"hostname\": \"tendrl27.lab.eng.blr.redhat.com\", \"host-id\": 1, \"engine-status\": {\"reason\": \"bad vm status\", \"health\": \"bad\", \"vm\": \"up\", \"detail\": \"Paused\"}, \"score\": 3400, \"stopped\": false, \"maintenance\": false, \"crc32\": \"33211cf8\", \"local_conf_timestamp\": 235457, \"host-ts\": 235457}, \"global_maintenance\": false}"]}
[ INFO ] TASK [ovirt.hosted_engine_setup : Check VM status at virt level]
[ INFO ] TASK [ovirt.hosted_engine_setup : Fail if engine VM is not running]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "Engine VM is not running, please check vdsm logs"}
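
The health check above just polls "hosted-engine --vm-status --json" until the engine
reports good health. A minimal sketch of how that JSON can be inspected, assuming the
command is on PATH and emits the per-host structure shown in the task output above:

    # Minimal sketch, not the deployment code itself. Assumes hosted-engine is
    # on PATH and prints the JSON structure shown above, keyed by host id.
    import json
    import subprocess

    out = subprocess.run(
        ["hosted-engine", "--vm-status", "--json"],
        check=True, capture_output=True, text=True,
    ).stdout

    for key, host in json.loads(out).items():
        if key == "global_maintenance":
            continue  # top-level boolean, not a host entry
        engine = host["engine-status"]
        print(host["hostname"], engine["vm"], engine["detail"], engine["reason"])
        # The task above keeps failing because "detail" stays "Paused"
        # instead of reaching "Up".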


I checked the vdsm, sanlock and agent logs but could not find anything.
Below are the pkgs installed:
rhvm-appliance-4.3-20190904.0.el7.x86_64
vdsm-4.30.30-1.el7ev.x86_64
glusterfs-server-3.12.2-47.4.el7rhgs.x86_64
sanlock-3.7.3-1.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7.x86_64
ovirt-hosted-engine-ha-2.3.5-1.el7ev.noarch
ovirt-hosted-engine-setup-2.3.12-1.el7ev.noarch
ovirt-ansible-hosted-engine-setup-1.0.28-1.el7ev.noarch


On further investigation, I found that qemu had paused the VM due to a storage error:

[root@tendrl27 ~]# virsh -r domstate --reason HostedEngine
paused (I/O error)

[root@tendrl27 ~]# virsh -r domblkerror HostedEngine
vda: unspecified error
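
For reference, the same check can be done programmatically. A rough sketch using the
libvirt Python bindings (assumption: the python libvirt bindings are installed; the
domain name HostedEngine is taken from the virsh output above):

    # Rough sketch, equivalent to the two virsh commands above.
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")
    dom = conn.lookupByName("HostedEngine")

    state, reason = dom.state()
    if state == libvirt.VIR_DOMAIN_PAUSED and reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR:
        # virDomainGetDiskErrors: maps disk alias -> error code
        print("paused (I/O error):", dom.diskErrors())
    else:
        print("state:", state, "reason:", reason)
    conn.close()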


Checked gluster storage domain mount log:

[2019-09-16 10:00:09.158798] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112451: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158817] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112445: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
[2019-09-16 10:00:09.158840] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 112449: READ => -1 gfid=64de5488-2fd9-49bf-9295-249060907fe6 fd=0x7f646001be98 (Invalid argument)
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-2: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.485358] and [2019-09-16 10:00:09.157584]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-0: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484026] and [2019-09-16 10:00:09.158515]
The message "W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-engine-client-1: remote operation failed [Invalid argument]" repeated 22 times between [2019-09-16 10:00:01.484882] and [2019-09-16 10:00:09.158827]
[2019-09-16 10:01:10.419516] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419518] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2019-09-16 10:01:10.419613] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.150878] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113275: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f64600364a8 (Invalid argument)
[2019-09-16 10:01:14.151399] W [fuse-bridge.c:2841:fuse_writev_cbk] 0-glusterfs-fuse: 113299: WRITE => -1 gfid=bfe83557-4629-4894-add2-55746b5cb90c fd=0x7f6460012b48 (Invalid argument)
[2019-09-16 10:01:12.154307] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154366] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2019-09-16 10:01:12.154373] W [MSGID: 114031] [client-rpc-fops.c:833:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]

It looks like the reads are not 4k-aligned; gluster is merely executing the fops it got from higher layers.
For instance, one of the reads that failed with 'Invalid argument' had offset=6121984 in the tcpdump.
This is not an integral multiple of 4096, and as a result the read failed.
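
A quick arithmetic check of that offset (the number is from the tcpdump quoted above):
it is a multiple of 512 but not of 4096, i.e. the I/O is aligned to 512-byte sectors
only, which the 4k storage rejects with EINVAL:

    offset = 6121984
    print(offset % 512)    # 0    -> aligned to 512-byte sectors
    print(offset % 4096)   # 2560 -> not 4k-aligned, hence the EINVAL above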

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sahina Bose 2019-09-18 14:55:38 UTC
Gobinda, can you check if this is due to Bug 1745443? Have we included the correct version?

Comment 2 Nir Soffer 2019-09-18 19:39:28 UTC
(In reply to Gobinda Das from comment #0)
> Description of problem:
>  During RHHI deployment, at the time of the HE health check, the deployment
> failed and the HE VM moved to a paused state.
> Engine status                      : {"reason": "bad vm status", "health":
> "bad", "vm": "up", "detail": "Paused"}

Please share:
- /var/log/vdsm/vdsm.log showing the time frame when the vm was started and paused
- /var/log/libvirt/qemu/vm-name.log 

> qemu-kvm-rhev-2.12.0-33.el7.x86_64

You need qemu-kvm-rhev-2.12.0-33.el7_7.3 for 4k support. Can you check the version
again?
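
If it helps, a sketch for comparing the installed build against that minimum using the
rpm Python bindings (assumption: the bindings are available; the minimum
version-release is the one quoted in this comment):

    # Sketch: compare installed qemu-kvm-rhev against 2.12.0-33.el7_7.3.
    import rpm

    ts = rpm.TransactionSet()
    for hdr in ts.dbMatch("name", "qemu-kvm-rhev"):
        installed = (str(hdr[rpm.RPMTAG_EPOCH] or 0),
                     hdr[rpm.RPMTAG_VERSION],
                     hdr[rpm.RPMTAG_RELEASE])
        # Reuse the installed epoch so only version-release is compared.
        required = (installed[0], "2.12.0", "33.el7_7.3")
        ok = rpm.labelCompare(installed, required) >= 0
        print("installed %s-%s, 4k support: %s" % (installed[1], installed[2], ok))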

> It looks like the reads are not 4k-aligned; gluster is merely executing the
> fops it got from higher layers.
> For instance, one of the reads that failed with 'Invalid argument' had
> offset=6121984 in the tcpdump.
> This is not an integral multiple of 4096, and as a result the read failed.

This sounds like a qemu issue, but let's make sure we are using the correct qemu
version before passing this on to the qemu folks.

Comment 3 Gobinda Das 2019-10-03 13:47:19 UTC
Closing this as the dependent bug is closed.