Bug 1599616
| Summary: | Setting host to local maintenance doesn't pass to the HA agent if ovirt-ha-agent and ovirt-ha-broker are down on the host |
|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha |
| Component: | Agent |
| Status: | CLOSED DEFERRED |
| Severity: | medium |
| Priority: | high |
| Version: | 2.2.14 |
| Reporter: | Liran Rotenberg <lrotenbe> |
| Assignee: | Asaf Rachmani <arachman> |
| QA Contact: | Polina <pagranat> |
| CC: | bugs, lrotenbe, mavital, mburman, michal.skrivanek, msivak, pagranat, yzhao |
| Target Milestone: | --- |
| Target Release: | --- |
| Keywords: | Automation, PrioBumpQA, Reopened |
| Hardware: | Unspecified |
| OS: | Unspecified |
| oVirt Team: | Integration |
| Type: | Bug |
| Last Closed: | 2021-09-29 11:33:06 UTC |
Description
Liran Rotenberg
2018-07-10 08:09:21 UTC
Created attachment 1457718 [details]
logs
Problematic correlation ID of the action (on the engine, at ~10:41:47): 1f6d276f-ed34-42bc-a89a-7cc705c01b29
Martin, should the score be 0? I don't think that is in the design.

That is in the design. The other nodes use nothing but the score to detect better hosts. Please note the extra info block that says the state machine is in EngineDown and not in local maintenance.

Liran: can you please attach the hosted-engine config and runtime files too? Local maintenance is set by writing a flag to one of the /var/run files, IIRC.

Created attachment 1459428 [details]
HE config and runtime
The files after reproducing.
Yep, that is what I thought. Both ha.conf files contain local_maintenance=False. That means vdsm never wrote the maintenance status there and the agent never initiated the maintenance procedure.

Created attachment 1459429 [details]
reproduce logs
engine correlation id: daaa204e-882a-4b67-b512-26d4f2dc595a
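For reference, the check described above (did anything ever set the local_maintenance flag in the host-side ha.conf?) can be sketched in a few lines of Python. The local_maintenance key comes from the attached files; the exact file location under /var/run is an assumption and may differ per deployment:

```python
def parse_ha_conf(text):
    """Parse simple key=value lines (ha.conf style) into a dict,
    ignoring blank lines, comments, and malformed lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

def local_maintenance_requested(conf_text):
    """True only if the flag was actually written as True; a missing key
    means vdsm never wrote the maintenance status (the failure seen here)."""
    return parse_ha_conf(conf_text).get("local_maintenance", "False") == "True"
```

A value of False here, as in the attached files, means the maintenance request never reached the agent side at all.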
It doesn't seem reproducible for me on a correctly working environment, but we definitely have a kind of side effect if the host is not recognized by the engine/vdsm as a hosted-engine host. A host is managed by the engine as a hosted-engine host not according to a fixed attribute in the engine DB but according to its HA score, which is a transient value; so, if ovirt-ha-agent is down for any reason, the host will not be handled as a hosted-engine one.

OK, reproduced:
Stop ovirt-ha-agent and ovirt-ha-broker on the host side.
VDSM will still report hosted-engine as configured:
[root@c75he20180802h2 ~]# vdsm-client Host getCapabilities | grep hostedEngineDeployed
"hostedEngineDeployed": true,
but after a while the HE silver crown will disappear on the engine side.
In that case, and only in that case, HE local maintenance will be skipped when setting the host to maintenance mode from the engine.
(In reply to Simone Tiraboschi from comment #11)
> In that case, and only in that case, HE local maintenance will be ignored
> setting the host in maintenance mode from the engine.

Not even in that case does it seem really reproducible. Also, in the engine.log attached to comment 9 we see:

2018-07-10 10:52:37,808+03 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: b5d45439-4561-4230-86f3-536b21e98ef8 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2018-07-10 10:52:37,813+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='b5d45439-4561-4230-86f3-536b21e98ef8', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason='null'}), log id: 796080e3
2018-07-10 10:52:37,820+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] FINISH, SetVdsStatusVDSCommand, log id: 796080e3
2018-07-10 10:52:37,900+03 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: b5d45439-4561-4230-86f3-536b21e98ef8 Type: VDS
2018-07-10 10:52:37,957+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_2, SetHaMaintenanceModeVDSCommandParameters:{hostId='b5d45439-4561-4230-86f3-536b21e98ef8'}), log id: 6b541a18
2018-07-10 10:52:37,962+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] FINISH, SetHaMaintenanceModeVDSCommand, log id: 6b541a18
2018-07-10 10:52:37,974+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-32) [8c994883-5d78-4cca-8fc5-a354f4360faa] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host host_mixed_2 was switched to Maintenance mode by admin@internal-authz.
2018-07-10 10:52:40,774+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-83) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_2'(b5d45439-4561-4230-86f3-536b21e98ef8)

Here everything worked as expected, and the engine indeed also sent SetHaMaintenanceModeVDSCommand.

But at 2018-07-10 10:41:44,526+03, for instance:

2018-07-10 10:41:44,526+03 INFO [org.ovirt.engine.core.bll.MaintenanceNumberOfVdssCommand] (default task-15) [1f6d276f-ed34-42bc-a89a-7cc705c01b29] Running command: MaintenanceNumberOfVdssCommand internal: false. Entities affected : ID: b5d45439-4561-4230-86f3-536b21e98ef8 Type: VDSAction group MANIPULATE_HOST with role type ADMIN
2018-07-10 10:41:44,568+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-15) [1f6d276f-ed34-42bc-a89a-7cc705c01b29] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='b5d45439-4561-4230-86f3-536b21e98ef8', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason='null'}), log id: 7ccb54ca
2018-07-10 10:41:44,714+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-15) [1f6d276f-ed34-42bc-a89a-7cc705c01b29] FINISH, SetVdsStatusVDSCommand, log id: 7ccb54ca
2018-07-10 10:41:45,746+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_2'(b5d45439-4561-4230-86f3-536b21e98ef8)
2018-07-10 10:41:46,771+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-40203) [] Clearing cache of pool: '5049911a-5139-11e8-8235-001a4a1610a0' for problematic entities of VDS: 'host_mixed_2'.
2018-07-10 10:41:46,784+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-40203) [] Removing vds '[b5d45439-4561-4230-86f3-536b21e98ef8]' from the domain in maintenance cache
2018-07-10 10:41:46,784+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-40203) [] Removing host(s) '[b5d45439-4561-4230-86f3-536b21e98ef8]' from hosts unseen domain report cache
2018-07-10 10:41:46,796+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] START, DisconnectStoragePoolVDSCommand(HostName = host_mixed_2, DisconnectStoragePoolVDSCommandParameters:{hostId='b5d45439-4561-4230-86f3-536b21e98ef8', storagePoolId='5049911a-5139-11e8-8235-001a4a1610a0', vds_spm_id='3'}), log id: 297c9f83
2018-07-10 10:41:46,987+03 INFO [org.ovirt.engine.core.bll.MaintenanceVdsCommand] (default task-15) [1f6d276f-ed34-42bc-a89a-7cc705c01b29] Running command: MaintenanceVdsCommand internal: true. Entities affected : ID: b5d45439-4561-4230-86f3-536b21e98ef8 Type: VDS
2018-07-10 10:41:47,923+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-15) [1f6d276f-ed34-42bc-a89a-7cc705c01b29] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host host_mixed_2 was switched to Maintenance mode by admin@internal-authz.
2018-07-10 10:41:57,292+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] FINISH, DisconnectStoragePoolVDSCommand, log id: 297c9f83
2018-07-10 10:41:58,518+03 INFO [org.ovirt.engine.core.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [391c9e97] Running command: DisconnectHostFromStoragePoolServersCommand internal: true. Entities affected : ID: 5049911a-5139-11e8-8235-001a4a1610a0 Type: StoragePool
2018-07-10 10:41:58,576+03 INFO [org.ovirt.engine.core.bll.scheduling.policyunits.PowerSavingBalancePolicyUnit] (EE-ManagedThreadFactory-engineScheduled-Thread-20) [] Cluster 'golden_env_mixed_1' does not have enough spare hosts, but no additional host is available.
2018-07-10 10:42:00,668+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [391c9e97] START, DisconnectStorageServerVDSCommand(HostName = host_mixed_2, StorageServerConnectionManagementVDSParameters:{hostId='b5d45439-4561-4230-86f3-536b21e98ef8', storagePoolId='5049911a-5139-11e8-8235-001a4a1610a0', storageType='NFS', connectionList='[StorageServerConnections:{id='d2c56264-4908-48a6-a9e7-60f69a94b557', connection='yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/compute-he-2/export_domain', iqn='null', vfsType='null', mountOptions='null', nfsVersion='AUTO', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='39feddef-f7bb-4518-be34-3ba5c3a4fc86', connection='yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/compute-he-2/nfs_0', iqn='null', vfsType='null', mountOptions='null', nfsVersion='AUTO', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='3998fc4b-a9fd-4a94-afb9-4e2b30fb1172', connection='yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/compute-he-2/nfs_1', iqn='null', vfsType='null', mountOptions='null', nfsVersion='AUTO', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}, StorageServerConnections:{id='855e728d-5064-4fbe-9e3e-6e1585e644c2', connection='yellow-vdsb.qa.lab.tlv.redhat.com:/Compute_NFS/GE/compute-he-2/nfs_2', iqn='null', vfsType='null', mountOptions='null', nfsVersion='AUTO', nfsRetrans='null', nfsTimeo='null', iface='null', netIfaceName='null'}]', sendNetworkEventOnFailure='true'}), log id: 7d3b5c65
2018-07-10 10:42:06,037+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [391c9e97] FINISH, DisconnectStorageServerVDSCommand, return: {855e728d-5064-4fbe-9e3e-6e1585e644c2=0, 3998fc4b-a9fd-4a94-afb9-4e2b30fb1172=0, d2c56264-4908-48a6-a9e7-60f69a94b557=0, 39feddef-f7bb-4518-be34-3ba5c3a4fc86=0}, log id: 7d3b5c65

Here we don't see any SetHaMaintenanceModeVDSCommand.

Seen again on compute-ge-he-2:
[root@compute-ge-he-2 ~]# grep "START, SetVdsStatusVDSCommand(HostName = host_mixed_1" /var/log/ovirt-engine/engine.log
...
2018-08-06 15:50:07,769+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-28) [hosts_syncAction_d109a670-4a0a-4526] START, SetVdsStatusVDSCommand(HostName = host_mixed_1, SetVdsStatusVDSCommandParameters:{hostId='da9c8586-53d4-4b5f-b35d-0ab9da211dec', status='Unassigned', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 4f5c3bbb
2018-08-06 18:37:56,454+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (default task-25) [hosts_syncAction_67a1ab0c-1eae-4910] START, SetVdsStatusVDSCommand(HostName = host_mixed_1, SetVdsStatusVDSCommandParameters:{hostId='da9c8586-53d4-4b5f-b35d-0ab9da211dec', status='PreparingForMaintenance', nonOperationalReason='NONE', stopSpmFailureLogged='true', maintenanceReason='null'}), log id: 2a663ad6
grep "START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_1" /var/log/ovirt-engine/engine.log | grep host_mixed_1
...
2018-08-06 15:50:14,309+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [767c91f2] START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_1, SetHaMaintenanceModeVDSCommandParameters:{hostId='da9c8586-53d4-4b5f-b35d-0ab9da211dec'}), log id: 1ab08812
[root@compute-ge-he-2 ~]#
So at 18:37 the engine executed SetVdsStatusVDSCommand but skipped SetHaMaintenanceModeVDSCommand.
Meanwhile, everything seems fine at the DB level:
[root@compute-ge-he-2 ~]# sudo -u postgres scl enable rh-postgresql95 -- psql -d engine -c 'select vds_name, ha_configured, ha_active from vds'
could not change directory to "/root": Permission denied
vds_name | ha_configured | ha_active
--------------+---------------+-----------
host_mixed_2 | t | t
host_mixed_1 | t | t
(2 rows)
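The failure pattern shown above (a SetVdsStatusVDSCommand moving the host to PreparingForMaintenance with no SetHaMaintenanceModeVDSCommand following it) can be spotted in an engine.log with a small ad-hoc script. This is only an illustrative sketch, not part of any oVirt tooling, and the 50-line lookahead window is an arbitrary heuristic:

```python
def find_missing_ha_maintenance(log_text, host, window_lines=50):
    """Return the timestamps of PreparingForMaintenance transitions for
    `host` that are NOT followed by a SetHaMaintenanceModeVDSCommand within
    the next `window_lines` log lines (a heuristic window, not engine logic)."""
    lines = log_text.splitlines()
    needle_status = f"SetVdsStatusVDSCommand(HostName = {host}"
    needle_ha = f"SetHaMaintenanceModeVDSCommand(HostName = {host}"
    missing = []
    for i, line in enumerate(lines):
        if needle_status in line and "status='PreparingForMaintenance'" in line:
            follow = "\n".join(lines[i + 1:i + 1 + window_lines])
            if needle_ha not in follow:
                # keep the timestamp prefix that precedes " INFO"
                missing.append(line.split(" INFO")[0])
    return missing
```

Running something like this over the engine.log would flag the 18:37 transition while leaving the 15:50 one alone.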
I think that the issue comes from this if: https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L64

When that if succeeds, it skips the HE-specific command on line 70.

Re-targeting to 4.2.7 since it is not identified as a blocker for 4.2.6.

I'm closing with the insufficient-data resolution. If someone can provide more info on how to reproduce this, please reopen.

Created attachment 1489483 [details]
agent log
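As a rough illustration of the suspected skip: the engine only issues the HA maintenance command when it currently recognizes the host as a hosted-engine host, a transient property derived from the HA score that disappears while ovirt-ha-agent is down. The sketch below is hypothetical pseudologic distilled from the comments in this bug, not the actual Java code in MaintenanceVdsCommand:

```python
# Hypothetical sketch (illustrative names, not the real engine API) of why
# SetHaMaintenanceModeVDSCommand can be skipped on a maintenance transition.
def maintenance_commands(host_seen_as_hosted_engine, vms_still_running):
    """Return the VDS commands a single maintenance transition would issue."""
    commands = ["SetVdsStatusVDSCommand"]
    if vms_still_running:
        # VMs are migrated away first; the HE command is expected on the
        # final transition -- which is where it can get lost.
        commands.append("MigrateAllVmsCommand")
    elif host_seen_as_hosted_engine:
        commands.append("SetHaMaintenanceModeVDSCommand")
    return commands
```

With ovirt-ha-agent down, host_seen_as_hosted_engine is False, so the second command never fires.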
I am reopening the bug - it happened again during the tier1 automation run.
build_info: rhv-4.2.7-3
ovirt-engine-4.2.7.1-0.1.el7ev.noarch
vdsm-4.20.40-1.el7ev.x86_64
libvirt-4.5.0-9.el7.x86_64
One test failed on an attempt to deactivate a host. A GET request to the API shows that the host is in maintenance state, but this info is not passed to the HA agent; local maintenance is not detected in agent.log.
from test log:
2018-09-21 22:12:10,169 - MainThread - VDS - INFO - [10.46.16.27] Executing command python -c from ovirt_hosted_engine_ha.client.client import HAClient;host_id = HAClient().get_local_host_id();print HAClient().get_all_host_stats()[host_id]
Host(10.46.16.27) :Timeout waiting for hosted-engine maintenance state
2018-09-21 22:12:12,062 - MainThread - RemoteExecutor - DEBUG - [root.16.27/qum5net] OUT: {'conf_on_shared_storage': True, 'live-data': True, 'extra': 'metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=16738 (Fri Sep 21 22:12:06 2018)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=16738 (Fri Sep 21 22:12:06 2018)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineDown\nstopped=False\n', 'hostname': 'lynx12.lab.eng.tlv2.redhat.com', 'host-id': 1, 'engine-status': '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}', 'score': 3400, 'stopped': False, 'maintenance': False, 'crc32': '4b134367', 'local_conf_timestamp': 16738, 'host-ts': 16738}
agent.log attached
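For what it's worth, the 'extra' blob in the stats above is a newline-separated key=value string; pulling the maintenance flag out of it can be sketched as follows (illustrative helpers only, not part of the HAClient API):

```python
def parse_ha_metadata(extra):
    """Parse the newline-separated key=value 'extra' blob reported by
    get_all_host_stats() into a dict (values kept as strings)."""
    meta = {}
    for line in extra.strip().splitlines():
        key, _, value = line.partition("=")
        if key:
            meta[key] = value
    return meta

def in_local_maintenance(stats):
    """True if a host-stats entry reports local maintenance, either via the
    top-level flag or via the embedded metadata blob."""
    meta = parse_ha_metadata(stats.get("extra", ""))
    return bool(stats.get("maintenance")) or meta.get("maintenance") == "True"
```

Applied to the stats dump above, both the top-level flag and the metadata say maintenance=False, confirming the request never reached the agent.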
(In reply to Polina from comment #19)
> agent.log attached

Can you please also attach engine.log? Do you see a SetHaMaintenanceModeVDSCommand log entry there?

Created attachment 1490117 [details]
engine log
Please see the attached engine.log
There are SetHaMaintenanceModeVDSCommand entries at lines 509-510:
2018-09-21 22:12:15,709+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [2995a007] START, SetHaMaintenanceModeVDSCommand(HostName = host_mixed_1, SetHaMaintenanceModeVDSCommandParameters:{hostId='a6ebcb78-bc75-4342-927f-cdf57be26a25'}), log id: 13b72aeb
2018-09-21 22:12:15,714+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SetHaMaintenanceModeVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-91) [2995a007] FINISH, SetHaMaintenanceModeVDSCommand, log id: 13b72aeb
Moving to 4.3.0 since it is not identified as a blocker for 4.2.7 and is not urgent.

Created attachment 1508441 [details]
automation logs

Seeing this again in rhv-4.2.7-9; added automation logs: https://rhv-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.2-ge-runner-tier1/268

Postponing to 4.3.0 since it is not identified as a blocker for 4.2.8.

*** Bug 1595091 has been marked as a duplicate of this bug. ***

Created attachment 1701406 [details]
logs_4.4.1-11
We see this bug in automation sometimes. When it happens, it affects the automation run results a lot.
I just want to point out that this fix is badly needed.
Adding logs from the latest rhv-4.4.1-11 run.
Correlation-Id: hosts_syncAction_ccd443a6-9323-4ab0
(In reply to Simone Tiraboschi from comment #15)
> I think that the issue comes from this if:
> https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/MaintenanceVdsCommand.java#L64
>
> that if successfully skips the HE specific command on line 70.

It would, but that command should have been sent during the original transition to maintenance. However, it can happen if this is the host where the HE VM originally runs and the host then moves to maintenance as part of migrating all VMs away; in that case the HE command is not called again.

This bug/RFE is more than 2 years old, didn't get enough attention so far, and is now flagged as pending close. Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

This bug didn't get any attention in a long time, and it's not planned in the foreseeable future. The oVirt development team has no plans to work on it. Please feel free to reopen if you have a plan for how to contribute this feature/bug fix.