Bug 1406612

Summary: Hosted-Engine volume is mounted back after moving the node into maintenance
Product: [oVirt] ovirt-engine
Component: BLL.HostedEngine
Version: 4.1.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED NOTABUG
Reporter: SATHEESARAN <sasundar>
Assignee: Doron Fediuck <dfediuck>
QA Contact: meital avital <mavital>
CC: bugs, knarra, msivak, sabose
Flags: sasundar: planning_ack?, devel_ack?, testing_ack?
Target Milestone: ---
Target Release: ---
oVirt Team: Gluster
Type: Bug
Bug Blocks: 1277939
Last Closed: 2017-02-14 15:18:52 UTC

Attachments:
  vdsm.log (flags: none)
  agent.log from the node that was moved to maintenance (flags: none)

Description SATHEESARAN 2016-12-21 03:47:56 UTC
Description of problem:
-----------------------
When the node is moved to maintenance, all the storage domains are unmounted. But after a few minutes, the hosted-engine storage domain is mounted again, even though the node is still in maintenance.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
ovirt-4.1

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Install a self-hosted engine setup with a GlusterFS replica 3 volume as the backend
2. After the installation completes, add 2 more nodes from the RHV UI with the 'hosted-engine deploy' action (so there are 3 nodes in the default cluster with virt + gluster capability enabled)
3. Create 2 GlusterFS (data) storage domains backed by GlusterFS replica 3 volumes
4. Enable 'local' maintenance on one of the nodes (see the sketch below for the CLI equivalent)
5. Move the host to MAINTENANCE from the RHV UI, choosing to stop all Gluster services
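
For step 4, local maintenance can also be enabled from the node's shell; a minimal sketch (output format and exact status strings may vary between versions):

<snip>
# Ask the HA agent on this node to enter local maintenance
hosted-engine --set-maintenance --mode=local

# Check that the agent reports the local maintenance state
hosted-engine --vm-status
</snip>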

Actual results:
---------------
All the storage domains are unmounted from the node. A few seconds later, I could see that the hosted-engine storage domain gets mounted again on that host, which is still in maintenance (see the sketch below for one way to observe this).
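
One way to watch the mount state on the node (the 'engine' pattern in the grep is illustrative; adjust it to the actual Gluster volume name):

<snip>
# The hosted-engine mount disappears on maintenance and reappears shortly after
watch -n 5 'grep engine /proc/mounts'
</snip>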

Expected results:
-----------------
When the node is in maintenance, all the storage domains should be unmounted and stay unmounted

Comment 1 SATHEESARAN 2016-12-21 03:51:57 UTC
I could see that the HA agent is trying to mount the hosted-engine storage:

<snip>
MainThread::INFO::2016-12-21 07:42:34,833::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:42:34,852::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-12-21 07:42:35,071::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-12-21 07:42:35,071::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::ERROR::2016-12-21 07:42:37,755::image::171::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Error preparing image - storagepoolID: 00000000-0000-0000-0000-000000000000 - storagedomainID: fb821e6b-eb02-408e-910b-11e7e3072973 - imageID: cb417d4b-1459-401d-80c9-f32af00c3afb - volumeID: 5b229278-4c77-4c48-a0f1-627f1fc2ba14: Volume does not exist: (u'5b229278-4c77-4c48-a0f1-627f1fc2ba14',)
MainThread::INFO::2016-12-21 07:42:37,778::config::313::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-12-21 07:42:37,779::config::213::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::WARNING::2016-12-21 07:42:40,113::ovf_store::107::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2016-12-21 07:42:40,117::config::252::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2016-12-21 07:42:40,117::config::260::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Reading initial vm.conf
MainThread::WARNING::2016-12-21 07:42:40,118::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Path to volume 14c760b3-89e8-4e5c-a196-5e9c97d6b1ac not found in /rhev/data-center/mnt
MainThread::WARNING::2016-12-21 07:42:40,118::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 660, in _initialize_storage_images
    self._config.refresh_vm_conf()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 318, in refresh_vm_conf
    archive_fname=constants.HEConfFiles.HECONFD_VM_CONF,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 265, in refresh_local_conf_file
    conf_vol_id,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 274, in get_volume_path
    root=envconst.SD_MOUNT_PARENT,
RuntimeError: Path to volume 14c760b3-89e8-4e5c-a196-5e9c97d6b1ac not found in /rhev/data-center/mnt
MainThread::INFO::2016-12-21 07:42:40,145::hosted_engine::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::ERROR::2016-12-21 07:43:40,209::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Request failed: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/fb821e6b-eb02-408e-910b-11e7e3072973/73b7eeb3-1363-4893-9de6-47d5a0002889/effc0798-69cd-4bf5-9d27-53d82ec1ea61'' - trying to restart agent
MainThread::WARNING::2016-12-21 07:43:45,215::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
MainThread::INFO::2016-12-21 07:43:45,243::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: moonshine.lab.eng.blr.redhat.com
MainThread::INFO::2016-12-21 07:43:45,245::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-12-21 07:43:47,640::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-12-21 07:43:47,641::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:43:49,992::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:43:50,510::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-12-21 07:43:50,730::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-12-21 07:43:50,730::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-12-21 07:43:53,695::config::313::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-12-21 07:43:53,696::config::213::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
</snip>

Comment 2 SATHEESARAN 2016-12-21 03:57:33 UTC
Created attachment 1234202 [details]
vdsm.log

Comment 3 SATHEESARAN 2016-12-21 03:59:11 UTC
Created attachment 1234204 [details]
agent.log from the node that was moved to maintenance

Comment 4 SATHEESARAN 2016-12-21 04:08:30 UTC
This is a problem for Grafton in-service updates.

Gluster supports in-service software upgrades/updates: all the Gluster processes running on one of the nodes in the cluster are stopped, and then the upgrade is kicked off.

Once the upgrade completes, all the Gluster processes are started again by starting glusterd on that node. This triggers self-heal from the older (non-upgraded) bricks to the newer (upgraded) bricks.

These steps are repeated on the nodes one after the other until all the nodes in the cluster are upgraded/updated (a rough per-node sketch follows).
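
A rough per-node sketch of that in-service flow, assuming a yum-based install and a volume named 'engine' (package and volume names are illustrative, and the exact stop/kill commands depend on the Gluster version's documented upgrade procedure):

<snip>
# Stop the Gluster management daemon, then any remaining brick/self-heal processes
systemctl stop glusterd
pkill -f glusterfsd
pkill -f glusterfs

# Upgrade the Gluster packages on this node
yum update glusterfs\*

# Start glusterd again; brick processes come back and self-heal kicks in
systemctl start glusterd

# Wait until there are no pending heal entries before moving to the next node
gluster volume heal engine info
</snip>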


When the hosted-engine storage is getting mounted again, the upgrade procedure may not get completed

Comment 5 SATHEESARAN 2016-12-21 04:17:11 UTC
(In reply to SATHEESARAN from comment #4)
> This is a problem for Grafton in-service updates.
> 
> Gluster supports in-service software upgrades/updates: all the Gluster
> processes running on one of the nodes in the cluster are stopped, and then
> the upgrade is kicked off.
> 
> Once the upgrade completes, all the Gluster processes are started again by
> starting glusterd on that node. This triggers self-heal from the older
> (non-upgraded) bricks to the newer (upgraded) bricks.
> 
> These steps are repeated on the nodes one after the other until all the
> nodes in the cluster are upgraded/updated.
> 
> When the hosted-engine storage is getting mounted again, the upgrade
> procedure may not get completed

Sorry, I hadn't completed that statement.

The Gluster upgrade/update proceeds, but it requires remounting the hosted-engine storage again, which means users need to activate the host again, then move the host to maintenance again, and then activate it to have the engine volume remounted, which is an extra step.

Comment 6 Martin Sivák 2016-12-21 09:42:18 UTC
Hosted engine owns the mount and the associated domain monitor for its domain. The agent requires the storage to be available because the synchronization whiteboard and a sanlock lockspace are located there. This is true even in maintenance mode.

The proper way to do this is to set local maintenance mode (and wait for hosted engine to move to the local maintenance state), stop the hosted engine services (ovirt-ha-agent and ovirt-ha-broker), and only then unmount the storage.
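
In shell terms, a minimal sketch of that sequence (the mount point below is illustrative, and the exact strings in the --vm-status output vary between versions):

<snip>
# 1. Ask the HA agent to enter local maintenance
hosted-engine --set-maintenance --mode=local

# 2. Wait until the agent reports the local maintenance state
until hosted-engine --vm-status | grep -q LocalMaintenance; do
    sleep 10
done

# 3. Stop the hosted engine HA services
systemctl stop ovirt-ha-agent ovirt-ha-broker

# 4. Only now is it safe to unmount the hosted-engine storage domain
umount /rhev/data-center/mnt/glusterSD/server:_engine
</snip>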

Comment 7 SATHEESARAN 2016-12-22 04:08:39 UTC
(In reply to Martin Sivák from comment #6)
> Hosted engine owns the mount and the associated domain monitor for its
> domain. The agent requires the storage to be available because the
> synchronization whiteboard and a sanlock lockspace are located there. This
> is true even in maintenance mode.
> 
> The proper way to do this is to set local maintenance mode (and wait for
> hosted engine to move to the local maintenance state), stop the hosted
> engine services (ovirt-ha-agent and ovirt-ha-broker), and only then unmount
> the storage.

Martin,

Thanks for that information.
In that case, this issue looks like expected behavior of the HA services.

Since the HC environment runs Gluster services as well, those services also need to be stopped when the node is moved into maintenance. Taking that into consideration, the steps to move the node into maintenance (along with the Gluster services) look like this (a sketch of step 4 follows the list):

1. Enable 'local maintenance' on the node
2. Wait for HE to enter 'local maintenance'
3. Stop the hosted-engine HA services (agent & broker)
4. Move the node into maintenance (stopping the Gluster services)
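
Steps 1-3 are the sequence sketched under comment 6. For step 4, the UI action also stops the Gluster services on the node; a rough CLI approximation of that part (a sketch only; the UI additionally deactivates the host in the engine):

<snip>
# Stop the management daemon, then any remaining brick/self-heal processes
systemctl stop glusterd
pkill -f glusterfsd
pkill -f glusterfs
</snip>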

Do the above steps look reasonable?

Will any change be required to the workflow you mentioned in comment 6?

Comment 8 Sahina Bose 2016-12-22 06:50:54 UTC
Martin, ping - could you confirm the workflow?

Comment 9 Martin Sivák 2017-02-13 10:42:57 UTC
Yep, this looks reasonable.

Comment 10 Sahina Bose 2017-02-14 15:18:52 UTC
Closing as NOTABUG based on Comment 7.

Comment 11 SATHEESARAN 2017-02-14 16:02:28 UTC
(In reply to Martin Sivák from comment #9)
> Yep, this looks reasonable.

Thanks Martin