Bug 1406612
Summary: Hosted-Engine volume is mounted back, post moving the node in to maintenance

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Product | [oVirt] ovirt-engine | Reporter | SATHEESARAN <sasundar> |
| Component | BLL.HostedEngine | Assignee | Doron Fediuck <dfediuck> |
| Status | CLOSED NOTABUG | QA Contact | meital avital <mavital> |
| Severity | high | Docs Contact | |
| Priority | unspecified | | |
| Version | 4.1.0 | CC | bugs, knarra, msivak, sabose |
| Target Milestone | --- | Flags | sasundar: planning_ack?, sasundar: devel_ack?, sasundar: testing_ack? |
| Target Release | --- | | |
| Hardware | x86_64 | | |
| OS | Linux | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2017-02-14 15:18:52 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | Gluster | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1277939 | | |
| Attachments | 1234202 (vdsm.log), 1234204 (agent.log) | | |
Description (SATHEESARAN, 2016-12-21 03:47:56 UTC)
I could see that the HA agent is trying to mount the hosted-engine storage:

<snip>
MainThread::INFO::2016-12-21 07:42:34,833::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:42:34,852::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-12-21 07:42:35,071::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-12-21 07:42:35,071::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::ERROR::2016-12-21 07:42:37,755::image::171::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Error preparing image - storagepoolID: 00000000-0000-0000-0000-000000000000 - storagedomainID: fb821e6b-eb02-408e-910b-11e7e3072973 - imageID: cb417d4b-1459-401d-80c9-f32af00c3afb - volumeID: 5b229278-4c77-4c48-a0f1-627f1fc2ba14: Volume does not exist: (u'5b229278-4c77-4c48-a0f1-627f1fc2ba14',)
MainThread::INFO::2016-12-21 07:42:37,778::config::313::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-12-21 07:42:37,779::config::213::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::WARNING::2016-12-21 07:42:40,113::ovf_store::107::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2016-12-21 07:42:40,117::config::252::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::ERROR::2016-12-21 07:42:40,117::config::260::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Reading initial vm.conf
MainThread::WARNING::2016-12-21 07:42:40,118::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Path to volume 14c760b3-89e8-4e5c-a196-5e9c97d6b1ac not found in /rhev/data-center/mnt
MainThread::WARNING::2016-12-21 07:42:40,118::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 660, in _initialize_storage_images
    self._config.refresh_vm_conf()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 318, in refresh_vm_conf
    archive_fname=constants.HEConfFiles.HECONFD_VM_CONF,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 265, in refresh_local_conf_file
    conf_vol_id,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 274, in get_volume_path
    root=envconst.SD_MOUNT_PARENT,
RuntimeError: Path to volume 14c760b3-89e8-4e5c-a196-5e9c97d6b1ac not found in /rhev/data-center/mnt
MainThread::INFO::2016-12-21 07:42:40,145::hosted_engine::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::ERROR::2016-12-21 07:43:40,209::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Request failed: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/fb821e6b-eb02-408e-910b-11e7e3072973/73b7eeb3-1363-4893-9de6-47d5a0002889/effc0798-69cd-4bf5-9d27-53d82ec1ea61'' - trying to restart agent
MainThread::WARNING::2016-12-21 07:43:45,215::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
MainThread::INFO::2016-12-21 07:43:45,243::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: moonshine.lab.eng.blr.redhat.com
MainThread::INFO::2016-12-21 07:43:45,245::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-12-21 07:43:47,640::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-12-21 07:43:47,641::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:43:49,992::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-12-21 07:43:50,510::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-12-21 07:43:50,730::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-12-21 07:43:50,730::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-12-21 07:43:53,695::config::313::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2016-12-21 07:43:53,696::config::213::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
</snip>

Created attachment 1234202 [details]
vdsm.log
Created attachment 1234204 [details]
agent.log from the node that was moved to maintenance

This is a problem for Grafton in-service updates.

Gluster supports in-service software upgrades/updates, where all the Gluster processes running on one of the nodes in the cluster are stopped, and then the upgrade is carried out. Once the upgrade is complete, all the Gluster processes are started back by starting glusterd on that node. This triggers self-heal from the older (non-upgraded) bricks to the newer (upgraded) bricks.

The above step is repeated on the nodes one after the other, until all the nodes in the cluster are upgraded/updated.

When the hosted-engine storage gets mounted again, the upgrade procedure may not complete.

(In reply to SATHEESARAN from comment #4)
> This is a problem for Grafton in-service updates.
>
> Gluster supports in-service software upgrades/updates, where all the
> Gluster processes running on one of the nodes in the cluster are stopped,
> and then the upgrade is carried out.
>
> Once the upgrade is complete, all the Gluster processes are started back
> by starting glusterd on that node. This triggers self-heal from the older
> (non-upgraded) bricks to the newer (upgraded) bricks.
>
> The above step is repeated on the nodes one after the other, until all the
> nodes in the cluster are upgraded/updated.
>
> When the hosted-engine storage gets mounted again, the upgrade procedure
> may not complete.

Sorry, I had not completed the statement. The Gluster upgrade/update proceeds, but it requires the hosted-engine storage to be remounted again, which means users need to activate the host again, then move the host to maintenance once more and activate it to have the engine volume remounted, which is an extra step.

Hosted engine owns the mount and the associated domain monitor for its domain. The agent requires the storage to be available because we have the synchronization whiteboard and a sanlock lockspace located there. This is true even for the maintenance mode.

The proper way to do this is setting local maintenance mode (plus waiting for hosted engine to move to the local maintenance state), stopping the hosted engine services (ovirt-ha-agent and ovirt-ha-broker), and only then can you unmount the storage.

(In reply to Martin Sivák from comment #6)
> Hosted engine owns the mount and the associated domain monitor for its
> domain. The agent requires the storage to be available because we have the
> synchronization whiteboard and a sanlock lockspace located there. This is
> true even for the maintenance mode.
>
> The proper way to do this is setting local maintenance mode (plus waiting
> for hosted engine to move to the local maintenance state), stopping the
> hosted engine services (ovirt-ha-agent and ovirt-ha-broker), and only then
> can you unmount the storage.

Martin, thanks for that information. In this case, the issue looks like expected behavior of the HA services.

As the HC (hyperconverged) environment consists of Gluster services as well, the Gluster services also need to be stopped when the node is moved into maintenance. Taking that into consideration, the steps to move the node into maintenance (along with the Gluster services) would look like this:

1. Enable 'local maintenance' on the node
2. Wait for HE to enter 'local maintenance'
3. Stop the hosted-engine HA services (agent & broker)
4. Move the node into maintenance (stopping the Gluster services)

Do the above steps look reasonable? Will there be any change required to the workflow you mentioned in comment 6?

Martin, ping - could you confirm the workflow?

Yep, this looks reasonable.

Based on Comment 7.

(In reply to Martin Sivák from comment #9)
> Yep, this looks reasonable.

Thanks Martin.
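
For illustration only, here is a minimal shell sketch of the workflow agreed above, run on the host being serviced. The hosted-engine CLI and the ovirt-ha-agent, ovirt-ha-broker, and glusterd service names come from the comments; the status string checked with grep, the pkill cleanup of leftover Gluster processes, and moving the host into maintenance from the engine UI or REST API are assumptions that may vary by version and deployment.

```bash
#!/bin/bash
# Hedged sketch of the proposed maintenance workflow (see assumptions above).
set -euo pipefail

# 1. Put the hosted engine into local maintenance on this host.
hosted-engine --set-maintenance --mode=local

# 2. Wait until the HA agent reports the local maintenance state.
#    The exact status string may differ between versions.
until hosted-engine --vm-status | grep -q "LocalMaintenance"; do
    sleep 10
done

# 3. Stop the hosted-engine HA services so nothing remounts the HE storage.
systemctl stop ovirt-ha-agent ovirt-ha-broker

# 4. Move the node into maintenance from the engine (UI or REST API), then
#    stop the Gluster services on this node for the in-service upgrade.
systemctl stop glusterd
pkill glusterfs  || true   # leftover brick/self-heal/client processes, if any
pkill glusterfsd || true
```

If the engine offers a "stop Gluster service" option when moving the host to maintenance, that would replace the manual systemctl/pkill step; the commands are shown here only to make the ordering of step 4 explicit.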