Description of problem:
After updating hypervisor to vdsm-4.20.43-1 and gluster to glusterfs-3.12.2-25, ovirt-hosted-engine-ha can not mount hosted-engine storage domain

Version-Release number of selected component (if applicable):
vdsm-4.20.43-1.el7ev.x86_64
glusterfs-3.12.2-25.el7rhgs.x86_64
ovirt-hosted-engine-ha-2.2.18-1.el7ev.noarch

How reproducible:
Unknown

Steps to Reproduce:
1. Configure RHHI with gluster hosted-engine storage domain
2. Update hypervisors
3.

Actual results:
hosted engine will not start up

# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage

Expected results:
successful hosted-engine startup

Additional info:
broker.log shows:
~~~
BackendFailureException: path to storage domain db091c92-ea98-4155-a963-fa54db750675 not found in /rhev/data-center/mnt/glusterSD
~~~
which suggests that this may be a regression of bz 1306901
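The broker error amounts to a failed lookup of the storage domain UUID under the gluster mount root. As a rough diagnostic sketch (this is not the broker's actual code; the helper name `find_domain_path` and the assumed layout `<root>/<mount>/<sd_uuid>` are illustrative assumptions), the same check can be repeated by hand on the host:

```python
import os


def find_domain_path(sd_uuid, root="/rhev/data-center/mnt/glusterSD"):
    """Return the first <root>/<mount>/<sd_uuid> directory, or None.

    Hypothetical helper: assumes every gluster mount is a direct child of
    `root` and that a storage domain is a directory named after its UUID.
    """
    if not os.path.isdir(root):
        return None
    for mount in sorted(os.listdir(root)):
        candidate = os.path.join(root, mount, sd_uuid)
        if os.path.isdir(candidate):
            return candidate
    return None


if __name__ == "__main__":
    sd_uuid = "db091c92-ea98-4155-a963-fa54db750675"  # UUID from broker.log
    path = find_domain_path(sd_uuid)
    print(path if path else "storage domain %s not found" % sd_uuid)
```

If this returns nothing while the volume appears mounted, the mount itself is the next thing to check, since an empty or stale mount point would produce the same result.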
Sahina, can you look into this?
1. From the case comments, I see that if this was a RHHI deployment, shard should have been turned on, but it is not. They do not have the default options for the gluster volume. Maybe there's some history behind this?

# gluster volume info HostedEngine
performance.readdir-ahead: on
cluster.quorum-type: auto
auth.allow: *
performance.quick-read=off
performance.read-ahead=off
performance.io-cache=off
performance.stat-prefetch: off
network.remote-dio=enable
cluster.eager-lock=enable
cluster.server-quorum-type=server
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
nfs.export-volumes: on
network.ping-timeout: 120

2. Nov 11 16:51:57 c01h01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 412, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds

Is the vdsm service not started?

3. From the gluster mount logs, I see there were issues mounting the gluster volume on 2018-11-09, but it seems to be ok after that, though there are disconnects:

[2018-11-09 20:34:14.853652] E [MSGID: 100009] [glusterfsd.c:652:get_volfp] 0-glusterfsd: loading volume file /rhev/data-center/mnt/glusterSD/c01h01.gluster:_HostedEngine failed [Transport endpoint is not connected]
[2018-11-09 20:34:14.853707] E [MSGID: 100028] [glusterfsd.c:2421:glusterfs_volumes_init] 0-glusterfsd: Cannot reach volume specification file
[2018-11-10 23:13:36.996415] W [socket.c:593:__socket_rwv] 0-glusterfs: readv on xx.xx.2.1:24007 failed (No data available)
[2018-11-10 23:13:36.996460] I [glusterfsd-mgmt.c:2337:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: c01h01.gluster

From going through the vdsm.log, there seems to be an issue there. It is flooded with:

2018-11-11 14:59:12,901+0000 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] In recovery, ignoring 'Host.getAllVmStats' in bridge with {} (__init__:585)

Also, in messages.log:

Nov 9 19:24:50 c01h01 systemd: mom-vdsm.service failed.
Nov 9 19:24:50 c01h01 vdsm: VDSM failed to start: Vdsm user could not manage to run sudo operation: (stderr: ['sudo: account validation failure, is your account locked?']). Verify sudoer rules configuration
Nov 9 19:24:50 c01h01 python2: detected unhandled Python exception in '/usr/share/vdsm/vdsm'
Nov 9 19:24:50 c01h01 python2: can't communicate with ABRT daemon, is it running? [Errno 111] Connection refused
Nov 9 19:24:50 c01h01 systemd: vdsmd.service: main process exited, code=exited, status=1/FAILURE
Nov 9 19:24:50 c01h01 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Nov 9 19:24:50 c01h01 systemd: Unit vdsmd.service entered failed state.

Sandro, any ideas about the above error?
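For point 1, a minimal sketch of how the pasted volume options could be checked for sharding. It only parses text like the output quoted in this comment, where both `:` and `=` appear as separators; the function names are hypothetical, and the list of "true" spellings is an assumption about which values gluster treats as enabled:

```python
import re

# Matches "some.option: value" and "some.option=value". Requiring a dot in
# the name skips header lines such as "Status: Started".
_OPTION_RE = re.compile(r"^\s*([A-Za-z][\w-]*\.[\w.-]+)\s*[:=]\s*(\S.*?)\s*$")

# Assumed spellings of an enabled boolean option.
_TRUE_VALUES = ("on", "enable", "yes", "true", "1")


def parse_volume_options(text):
    """Parse option lines into a dict; lines that do not match are skipped."""
    options = {}
    for line in text.splitlines():
        match = _OPTION_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def shard_enabled(options):
    """True only if features.shard is present and set to an enabled value."""
    return options.get("features.shard", "off").lower() in _TRUE_VALUES
```

Fed the options listed under point 1, `shard_enabled(parse_volume_options(text))` returns False, because no `features.shard` line is present in that output.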
Do we need to continue investigating the earlier errors for the gluster HE domain? I see that the case has moved in a different direction.
Restoring needinfo
(In reply to Sahina Bose from comment #3)
> Also, in messages.log:
> Nov 9 19:24:50 c01h01 systemd: mom-vdsm.service failed.
> Nov 9 19:24:50 c01h01 vdsm: VDSM failed to start: Vdsm user could not
> manage to run sudo operation: (stderr: ['sudo: account validation failure,
> is your account locked?']). Verify sudoer rules configuration
> Nov 9 19:24:50 c01h01 python2: detected unhandled Python exception in
> '/usr/share/vdsm/vdsm'
> Nov 9 19:24:50 c01h01 python2: can't communicate with ABRT daemon, is it
> running? [Errno 111] Connection refused
> Nov 9 19:24:50 c01h01 systemd: vdsmd.service: main process exited,
> code=exited, status=1/FAILURE
> Nov 9 19:24:50 c01h01 vdsmd_init_common.sh: vdsm: Running run_final_hooks
> Nov 9 19:24:50 c01h01 systemd: Unit vdsmd.service entered failed state.
>
> Sandro, any ideas about above error?

No, routing the question to Dan
vdsm user could not manage to run sudo operation: (stderr: ['sudo: account validation failure, is your account locked?']). Verify sudoer rules configuration

Can somebody look at /etc/sudoers** ? It smells as if /etc/sudoers.d/50_vdsm was corrupted during the rhv-h upgrade.
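As a first pass before opening the file by hand, a small sanity check along these lines could flag obvious damage. This is only a sketch: `check_sudoers_file` is a hypothetical helper, the checks are assumptions about what "corrupted" might look like (missing, empty, wrong owner, world-writable, binary garbage), and `visudo -c -f <file>` remains the authoritative syntax check:

```python
import os
import stat


def check_sudoers_file(path):
    """Return a list of human-readable problems found with `path`."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return ["cannot stat %s: %s" % (path, exc)]
    if not stat.S_ISREG(st.st_mode):
        return ["not a regular file"]
    problems = []
    if st.st_size == 0:
        problems.append("file is empty")
    if st.st_uid != 0:
        # sudo refuses sudoers files that are not owned by uid 0.
        problems.append("not owned by root (uid %d)" % st.st_uid)
    if st.st_mode & stat.S_IWOTH:
        problems.append("world-writable")
    with open(path, "rb") as handle:
        if b"\x00" in handle.read():
            problems.append("contains NUL bytes")
    return problems


if __name__ == "__main__":
    for problem in check_sudoers_file("/etc/sudoers.d/50_vdsm"):
        print(problem)
```

An empty result only means none of these crude checks fired; it does not show that the rules inside the file are correct.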
Can we close this bug? I don't think there's enough relevant information to proceed, and the case is already closed.
This bug has not been marked as blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.
Closing as the requested information was not available