Bug 1649513 - After updating hypervisor to vdsm-4.20.43-1 and gluster to glusterfs-3.12.2-25, ovirt-hosted-engine-ha cannot mount gluster hosted-engine storage domain
Summary: After updating hypervisor to vdsm-4.20.43-1 and gluster to glusterfs-3.12.2-25, ovirt-hosted-engine-ha cannot mount gluster hosted-engine storage domain
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 4.2.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.3.1
Target Release: 4.3.0
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-13 18:42 UTC by Allie DeVolder
Modified: 2020-08-03 15:32 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-30 11:03:52 UTC
oVirt Team: Gluster
Target Upstream Version:
lsvaty: testing_plan_complete-



Description Allie DeVolder 2018-11-13 18:42:48 UTC
Description of problem:
After updating hypervisor to vdsm-4.20.43-1 and gluster to glusterfs-3.12.2-25, ovirt-hosted-engine-ha cannot mount the hosted-engine storage domain

Version-Release number of selected component (if applicable):
vdsm-4.20.43-1.el7ev.x86_64
glusterfs-3.12.2-25.el7rhgs.x86_64
ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Configure RHHI with gluster hosted-engine storage domain
2. Update hypervisors
3.

Actual results:
hosted engine will not start up
# hosted-engine --vm-status 
The hosted engine configuration has not been retrieved from shared storage

Expected results:
successful hosted-engine startup

Additional info:
broker.log shows:
~~~
BackendFailureException: path to storage domain db091c92-ea98-4155-a963-fa54db750675 not found in /rhev/data-center/mnt/glusterSD
~~~ 
which suggests that this may be a regression of bz 1306901
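
One quick way to confirm whether the domain is actually mounted on the host (reusing the mount prefix and domain UUID from the error above) would be something like:
~~~
# grep glusterSD /proc/mounts
# ls /rhev/data-center/mnt/glusterSD/
# ls -d /rhev/data-center/mnt/glusterSD/*/db091c92-ea98-4155-a963-fa54db750675
~~~
If the first command shows no fuse.glusterfs entry, the mount itself is failing; if the mount is there but the UUID directory is missing, the problem is with the domain contents rather than the mount.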

Comment 1 Sandro Bonazzola 2018-11-20 09:27:20 UTC
Sahina, can you look into this?

Comment 3 Sahina Bose 2018-11-20 11:57:39 UTC
1. From the case comments, I see that if this was a RHHI deployment, sharding should have been enabled, but it is not; the gluster volume does not have the default options. Maybe there's some history behind this?

# gluster volume info HostedEngine 
performance.readdir-ahead: on
cluster.quorum-type: auto
auth.allow: *
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
network.remote-dio: enable
cluster.eager-lock: enable
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
nfs.export-volumes: on
network.ping-timeout: 120
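
For comparison, a RHHI deployment would normally apply the recommended virt option group to the volume. A sketch of reapplying it; note this should only be run after confirming it is safe for this volume, since the virt group enables features.shard, and enabling sharding on a volume that already holds VM images is not safe:
~~~
# gluster volume set HostedEngine group virt
# gluster volume set HostedEngine storage.owner-uid 36
# gluster volume set HostedEngine storage.owner-gid 36
# gluster volume info HostedEngine
~~~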

2. Nov 11 16:51:57 c01h01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 412, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds

Is the vdsm service not started?
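
That 60-second timeout is the ha-agent failing to reach vdsm's JSON-RPC socket, which usually means vdsmd itself is down. Something like the following would confirm (assuming vdsm-client is installed on the host):
~~~
# systemctl status vdsmd mom-vdsm
# journalctl -u vdsmd -b --no-pager | tail -n 50
# vdsm-client Host getCapabilities
~~~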

3. From the gluster mount logs, I see there were issues mounting the gluster volume on 2018-11-09, but it seems to be OK after that, though there are disconnects:

[2018-11-09 20:34:14.853652] E [MSGID: 100009] [glusterfsd.c:652:get_volfp] 0-glusterfsd: loading volume file /rhev/data-center/mnt/glusterSD/c01h01.gluster:_HostedEngine failed [Transport endpoint is not connected]
[2018-11-09 20:34:14.853707] E [MSGID: 100028] [glusterfsd.c:2421:glusterfs_volumes_init] 0-glusterfsd: Cannot reach volume specification file

[2018-11-10 23:13:36.996415] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on xx.xx.2.1:24007 failed (No data available)
[2018-11-10 23:13:36.996460] I [glusterfsd-mgmt.c:2337:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: c01h01.gluster
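
To check whether the disconnects are still ongoing, something like the following could be run on the host (volume name taken from the logs above):
~~~
# gluster peer status
# gluster volume status HostedEngine
# gluster volume heal HostedEngine info
~~~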

Going through vdsm.log, there seems to be an issue there as well; it is flooded with:
2018-11-11 14:59:12,901+0000 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.getAllVmStats' in bridge with {} (__init__:585)

Also, in messages.log:
Nov  9 19:24:50 c01h01 systemd: mom-vdsm.service failed.
Nov  9 19:24:50 c01h01 vdsm: VDSM failed to start: Vdsm user could not manage to run sudo operation: (stderr: ['sudo: account validation failure, is your account locked?']). Verify sudoer rules configuration
Nov  9 19:24:50 c01h01 python2: detected unhandled Python exception in '/usr/share/vdsm/vdsm'
Nov  9 19:24:50 c01h01 python2: can't communicate with ABRT daemon, is it running? [Errno 111] Connection refused
Nov  9 19:24:50 c01h01 systemd: vdsmd.service: main process exited, code=exited, status=1/FAILURE
Nov  9 19:24:50 c01h01 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Nov  9 19:24:50 c01h01 systemd: Unit vdsmd.service entered failed state.

Sandro, any ideas about the above error?

Comment 4 Sahina Bose 2018-11-21 06:32:23 UTC
Do we need to continue investigating the earlier errors for the gluster HE domain? I see that the case has moved in a different direction.

Comment 5 Sahina Bose 2018-11-26 04:45:46 UTC
Restoring needinfo

Comment 6 Sandro Bonazzola 2018-11-28 09:59:34 UTC
(In reply to Sahina Bose from comment #3)

> Also, in messages.log:
> Nov  9 19:24:50 c01h01 systemd: mom-vdsm.service failed.
> Nov  9 19:24:50 c01h01 vdsm: VDSM failed to start: Vdsm user could not
> manage to run sudo operation: (stderr: ['sudo: account validation failure,
> is your account locked?']). Verify sudoer rules configuration
> Nov  9 19:24:50 c01h01 python2: detected unhandled Python exception in
> '/usr/share/vdsm/vdsm'
> Nov  9 19:24:50 c01h01 python2: can't communicate with ABRT daemon, is it
> running? [Errno 111] Connection refused
> Nov  9 19:24:50 c01h01 systemd: vdsmd.service: main process exited,
> code=exited, status=1/FAILURE
> Nov  9 19:24:50 c01h01 vdsmd_init_common.sh: vdsm: Running run_final_hooks
> Nov  9 19:24:50 c01h01 systemd: Unit vdsmd.service entered failed state.
> 
> Sandro, any ideas about above error?

No, routing the question to Dan.

Comment 7 Dan Kenigsberg 2018-11-28 10:34:06 UTC
vdsm user could not manage to run sudo operation: (stderr: ['sudo: account validation failure, is your account locked?']). Verify sudoer rules configuration

Can somebody look at /etc/sudoers**?

It smells as if /etc/sudoers.d/50_vdsm was corrupted during the rhv-h upgrade.
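
Something like this would check both the sudoers fragment and the vdsm account state (fragment name as guessed above; adjust if it differs on this host):
~~~
# visudo -cf /etc/sudoers.d/50_vdsm
# passwd -S vdsm
# chage -l vdsm
~~~
"account validation failure" can also come from a locked or expired vdsm account rather than a corrupted sudoers file, which passwd -S / chage -l would show.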

Comment 8 Sahina Bose 2019-01-07 07:46:29 UTC
Can we close this bug? I don't think there's enough information to proceed, and the case is already closed.

Comment 9 Sandro Bonazzola 2019-01-28 09:39:57 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 11 Sahina Bose 2019-01-30 11:03:52 UTC
Closing, as the requested information was not available.

