Created attachment 1273859 [details]
logs

Description of problem:
Putting the HE host into maintenance via the engine unmounts the HE storage domain from the host; as a result, ovirt-ha-agent has to go through its disaster recovery procedure to mount it again.

Version-Release number of selected component (if applicable):
rhevm-4.1.1.8-0.1.el7.noarch

How reproducible:
80%

Steps to Reproduce:
1. Configure an HE environment with at least two hosts
2. Put a host without the HE VM into maintenance
3.

Actual results:
MainThread::INFO::2017-04-25 11:42:32,218::hosted_engine::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::ERROR::2017-04-25 11:43:32,255::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 423, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 81, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 974, in collect_stats
    constants.SERVICE_TYPE + constants.MD_EXTENSION)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 233, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 261, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/cb2fac54-0d54-44ad-8fd6-ad721d39187a/3d8a8740-4d1a-4d81-a409-2c84cf69b038/1297d7f1-d285-4d1b-b312-68e0f7c43bd5'
MainThread::ERROR::2017-04-25 11:43:32,255::agent::206::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::WARNING::2017-04-25 11:43:37,260::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'

Expected results:
When I put the HE host into maintenance, the engine must leave the HE SD mounted on the HE host.

Additional info:
================
engine.log - it has an incorrect timezone (-7)

2017-04-25 04:42:26,819-04 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: 3df7817d-a0e7-41fc-a36f-aae6ee4dba36 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2017-04-25 04:42:26,846-04 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] START, SetVdsStatusVDSCommand(HostName = host_mixed_3, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='3df7817d-a0e7-41fc-a36f-aae6ee4dba36', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason='null'}), log id: d24960d
2017-04-25 04:42:26,874-04 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] FINISH, SetVdsStatusVDSCommand, log id: d24960d
2017-04-25 04:42:27,016-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (DefaultQuartzScheduler6) [] Attempting to update VMs/Templates Ovf.
2017-04-25 04:42:27,022-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Before acquiring and wait lock 'EngineLock:{exclusiveLocks='[00000001-0001-0001-0001-000000000311=<OVF_UPDATE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-04-25 04:42:27,022-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Lock-wait acquired to object 'EngineLock:{exclusiveLocks='[00000001-0001-0001-0001-000000000311=<OVF_UPDATE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-04-25 04:42:27,023-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Running command: ProcessOvfUpdateForStoragePoolCommand internal: true. Entities affected : ID: 00000001-0001-0001-0001-000000000311 Type: StoragePool
2017-04-25 04:42:27,034-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Attempting to update VM OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,041-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Successfully updated VM OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,041-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Attempting to update template OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,043-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Successfully updated templates OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,043-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Attempting to remove unneeded template/vm OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,046-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Successfully removed unneeded template/vm OVFs in Data Center 'golden_env_mixed'
2017-04-25 04:42:27,047-04 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStoragePoolCommand] (DefaultQuartzScheduler6) [40ef5f93] Lock freed to object 'EngineLock:{exclusiveLocks='[00000001-0001-0001-0001-000000000311=<OVF_UPDATE, ACTION_TYPE_FAILED_OBJECT_LOCKED>]', sharedLocks='null'}'
2017-04-25 04:42:27,052-04 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: 3df7817d-a0e7-41fc-a36f-aae6ee4dba36 Type: VDS
2017-04-25 04:42:27,123-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_3, SetHaMaintenanceModeVDSCommandParameters:{runAsync='true', hostId='3df7817d-a0e7-41fc-a36f-aae6ee4dba36'}), log id: 1a5efce3
2017-04-25 04:42:27,938-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] FINISH, SetHaMaintenanceModeVDSCommand, log id: 1a5efce3
2017-04-25 04:42:27,952-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-31) [hosts_syncAction_e8d05a78-d3c6-498e] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Correlation ID: hosts_syncAction_e8d05a78-d3c6-498e, Job ID: 81c443d3-e747-49bd-a201-bcff58483150, Call Stack: null, Custom Event ID: -1, Message: Host host_mixed_3 was switched to Maintenance mode by admin@internal-authz.

agent.log
=========
MainThread::INFO::2017-04-25 11:42:29,899::config::412::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::ERROR::2017-04-25 11:42:32,216::config::424::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Failed scanning for OVF_STORE due to Image path does not exist or cannot be accessed/created: (u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/cb2fac54-0d54-44ad-8fd6-ad721d39187a/images/3d8a8740-4d1a-4d81-a409-2c84cf69b038',)
MainThread::ERROR::2017-04-25 11:42:32,216::config::450::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
MainThread::WARNING::2017-04-25 11:42:32,218::hosted_engine::469::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Path to volume ab0d9f79-8578-4cdd-8254-e9406694609f not found in /rhev/data-center/mnt
MainThread::WARNING::2017-04-25 11:42:32,218::hosted_engine::472::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 662, in _initialize_storage_images
    self._config.refresh_vm_conf()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 492, in refresh_vm_conf
    content = self._get_file_content_from_shared_storage(VM)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 461, in _get_file_content_from_shared_storage
    config_volume_path = self._get_config_volume_path()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 179, in _get_config_volume_path
    conf_vol_uuid
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 330, in get_volume_path
    root=envconst.SD_MOUNT_PARENT,
RuntimeError: Path to volume ab0d9f79-8578-4cdd-8254-e9406694609f not found in /rhev/data-center/mnt
MainThread::INFO::2017-04-25 11:42:32,218::hosted_engine::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::ERROR::2017-04-25 11:43:32,255::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 191, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 64, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 423, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 81, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 974, in collect_stats
    constants.SERVICE_TYPE + constants.MD_EXTENSION)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 233, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 261, in _checked_communicate
    .format(message or response))
RequestError: Request failed: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/cb2fac54-0d54-44ad-8fd6-ad721d39187a/3d8a8740-4d1a-4d81-a409-2c84cf69b038/1297d7f1-d285-4d1b-b312-68e0f7c43bd5'
MainThread::ERROR::2017-04-25 11:43:32,255::agent::206::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::WARNING::2017-04-25 11:43:37,260::agent::209::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::INFO::2017-04-25 11:43:37,279::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: rose05.qa.lab.tlv.redhat.com
MainThread::INFO::2017-04-25 11:43:37,279::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2017-04-25 11:43:39,598::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2017-04-25 11:43:39,603::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2017-04-25 11:43:41,899::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2017-04-25 11:43:41,972::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2017-04-25 11:43:42,096::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2017-04-25 11:43:42,096::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
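For anyone triaging a similar failure: the broker error above boils down to the metadata volume path vanishing from under /var/run/vdsm/storage once the engine unmounts the HE storage domain. A minimal Python sketch of that check follows; the UUIDs are copied from this report's traceback and are specific to this environment:

import os

# UUIDs taken from the traceback above; environment-specific.
SD_UUID = "cb2fac54-0d54-44ad-8fd6-ad721d39187a"
IMG_UUID = "3d8a8740-4d1a-4d81-a409-2c84cf69b038"
VOL_UUID = "1297d7f1-d285-4d1b-b312-68e0f7c43bd5"

metadata_path = os.path.join(
    "/var/run/vdsm/storage", SD_UUID, IMG_UUID, VOL_UUID)

if os.path.exists(metadata_path):
    print("HE metadata volume reachable: %s" % metadata_path)
else:
    # This is the [Errno 2] condition that keeps restarting the agent.
    print("HE metadata volume missing: %s" % metadata_path)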
Can you assign someone to look into this?
(In reply to Yaniv Dary from comment #1)
> Can you assign someone to look into this?

This has very little to do with storage - the flow is wrong. Putting the host into maintenance should either be blocked, or the HE VM should be successfully migrated first. If either is done, no special treatment is needed for the domain.
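To make that flow concrete, here is a rough Python sketch (all class and function names are hypothetical, this is not actual engine code): before moving an HE host to maintenance, either block the request or migrate the HE VM away first, and in both cases leave the HE storage domain connected.

class MaintenanceBlocked(Exception):
    pass

class Host(object):
    def __init__(self, name, runs_he_vm, he_target_available):
        self.name = name
        self.runs_he_vm = runs_he_vm
        self.he_target_available = he_target_available
        self.status = "Up"

    def migrate_he_vm_away(self):
        print("migrating HE VM off %s" % self.name)
        self.runs_he_vm = False

def move_to_maintenance(host):
    if host.runs_he_vm:
        if not host.he_target_available:
            # Block: there is nowhere to migrate the HE VM to.
            raise MaintenanceBlocked(
                "no other HE-capable host can take the HE VM")
        host.migrate_he_vm_away()
    # The HE storage domain is deliberately left connected here:
    # ovirt-ha-agent on this host still needs it for its metadata.
    host.status = "Maintenance"

move_to_maintenance(Host("host_mixed_3", runs_he_vm=False,
                         he_target_available=True))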
*** Bug 1455166 has been marked as a duplicate of this bug. ***
*** Bug 1455164 has been marked as a duplicate of this bug. ***
I can still see that ovirt-ha-agent restarts after the maintenance operation:

MainThread::INFO::2017-09-17 10:48:21,587::states::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to local maintenance mode
MainThread::INFO::2017-09-17 10:48:21,587::hosted_engine::471::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state LocalMaintenance (score: 0)
MainThread::INFO::2017-09-17 10:48:21,587::hosted_engine::476::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host alma07.qa.lab.tlv.redhat.com (id: 2, score: 3400)
MainThread::INFO::2017-09-17 10:48:21,592::hosted_engine::487::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Monitoring loop execution time 2 sec
MainThread::INFO::2017-09-17 10:48:29,613::config::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_vm_conf) Reloading vm.conf from the shared storage domain
MainThread::INFO::2017-09-17 10:48:29,614::config::416::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::INFO::2017-09-17 10:48:29,831::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:f0081182-bf8d-40ee-8852-725558bda23d, volUUID:1cd99b41-5e57-4385-85e3-e4663a728a06
MainThread::INFO::2017-09-17 10:48:29,871::ovf_store::103::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Found OVF_STORE: imgUUID:b551eab4-ff6c-4e10-a28d-643a777f64ec, volUUID:3a99c043-310c-479f-923c-cfe9f00efd86
MainThread::INFO::2017-09-17 10:48:29,871::ovf_store::112::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2017-09-17 10:48:29,872::ovf_store::119::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Compute__NFS_alukiano_compute-ge-he-1/d9ba642e-0bfc-493c-98c5-001f150ad37b/images/b551eab4-ff6c-4e10-a28d-643a777f64ec/3a99c043-310c-479f-923c-cfe9f00efd86
MainThread::INFO::2017-09-17 10:48:29,879::config::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Found an OVF for HE VM, trying to convert
MainThread::INFO::2017-09-17 10:48:29,883::config::440::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(_get_vm_conf_content_from_ovf_store) Got vm.conf from OVF_STORE
MainThread::INFO::2017-09-17 10:48:29,883::state_decorators::124::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Local maintenance detected
MainThread::INFO::2017-09-17 10:48:29,886::hosted_engine::520::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2017-09-17 10:48:30,004::storage_server::220::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(validate_storage_server) Validating storage server
MainThread::WARNING::2017-09-17 10:48:30,011::storage_server::232::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(validate_storage_server) Hosted-engine storage domain is in invalid state
MainThread::WARNING::2017-09-17 10:48:30,011::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_validate_storage_images) Hosted-engine storage domain is in invalid state
MainThread::INFO::2017-09-17 10:48:33,028::hosted_engine::765::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_domain_monitor_status) VDSM domain monitor status: NONE
MainThread::INFO::2017-09-17 10:48:33,028::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2017-09-17 10:48:43,489::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.0-0.0.master.20170914130805 started

Is this the desired behavior?
At the moment we expect this restart. Since the broker and the monitor are currently contending for the storage, the agent may detect an incorrect storage domain state and restart to recover. This will be fixed as the first step of [1].

[1] https://www.ovirt.org/develop/release-management/features/sla/hosted-engine-agent-offloading/
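To illustrate the difference: today an invalid storage domain state during local maintenance shuts the agent down (see "Agent shutting down" in the log above), whereas a retry loop like the following Python sketch would ride out the broker/monitor contention instead. This is a hypothetical illustration only, not the actual patch tracked by [1]:

import time

class StorageInvalid(Exception):
    pass

def validate_storage_server():
    # Placeholder for the real check that logs
    # "Hosted-engine storage domain is in invalid state".
    raise StorageInvalid()

def validate_with_retries(attempts=5, delay=10):
    for attempt in range(1, attempts + 1):
        try:
            validate_storage_server()
            return True
        except StorageInvalid:
            # The broker and monitor may still be contending for the
            # storage; wait and retry instead of restarting the agent.
            print("storage invalid, retry %d/%d" % (attempt, attempts))
            time.sleep(delay)
    return False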
Thanks for the clarification. Verified on ovirt-engine-4.2.0-0.0.master.20170913112412.git2eb3c0a.el7.centos.noarch
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.