Bug 1580243
| Summary: | A host with Unreachable Data SD is not moving to non-operational when brought up (but remains in a loop between non-responsive and connecting) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-engine | Assignee: | Fred Rolland <frolland> |
| Status: | CLOSED ERRATA | QA Contact: | Yosi Ben Shimon <ybenshim> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.3 | CC: | aefrat, amashah, frolland, lsurette, mtessun, nsoffer, Rhev-m-bugs, rhodain, shipatil, srevivo, tnisan, ylavi |
| Target Milestone: | ovirt-4.3.3 | Keywords: | Reopened |
| Target Release: | 4.3.0 | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-05-08 12:37:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Germano Veit Michel
2018-05-21 05:24:57 UTC
Fred, anything we can do about it?

One solution could be to move the connectStorageServer verb to be async, as the NFS connect can take a long time before failing. This needs some design and input from Nir.

Unlike ISO domains, data domains are critical to our operation, and we don't want a host to be up if it's not seeing all data domains. That being said, it should not be in a loop but in a non-operational state.

Right, the expected outcome is 'Non Operational'. But it doesn't reach that stage since connectStorageServer times out, making the engine enter a loop of connecting -> connectStorageServer -> timeout -> Not Responding -> connecting...

(In reply to Fred Rolland from comment #2)
> One solution could be to move the connectStorageServer verb to be async, as
> the NFS connect can take a long time before failing.
>
> This needs some design and input from Nir.

Nir, what do you think?

I agree with Fred. If something can take a lot of time it should be async (see the sketch below).

*** This bug has been marked as a duplicate of bug 1561522 ***

Inverting the DUP bug as this one has more info and some discussions underway.

*** Bug 1561522 has been marked as a duplicate of this bug. ***

We just saw this on a partial storage network outage, where one or more paths fail to connect. It delays the command and causes the same issue, so this is important for resiliency against storage/network failures.

This bug has not been marked as a blocker for oVirt 4.3.0. Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1684554 by patch https://gerrit.ovirt.org/#/c/98633/. The host will move to non-operational if the connection to the storage server fails when activating the host. It will not loop on the status.
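For context on the async suggestion above, here is a minimal sketch of the pattern, purely illustrative: the engine itself is Java, and `connect_storage_server`/`connect_async` are hypothetical names, not the engine's API. The idea is to submit the slow connect to a worker and bound the caller's wait, so a hanging NFS/iSCSI connect cannot stall the activation flow until the RPC timeout.

```python
# Illustration only, not ovirt-engine code: don't block the caller on a
# storage connect that can hang for minutes.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

def connect_storage_server(server):
    """Placeholder for the slow NFS/iSCSI connect call."""
    ...

def connect_async(server, wait_seconds=10):
    future = _pool.submit(connect_storage_server, server)
    try:
        # Give a healthy storage server a few seconds to answer.
        return future.result(timeout=wait_seconds)
    except FutureTimeout:
        # Don't fail the whole activation flow; report the connect as
        # still pending and let monitoring re-check it on the next cycle.
        return None
```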
Tested on:
ovirt-engine-4.3.3.1-0.1.el7.noarch
I blocked the connection from the host to the iSCSI SD using iptables (a sketch of such a rule is below).
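The exact rule used is not in the report; a hedged sketch of a typical way to do this blocking, wrapped in Python for scripting, follows. `STORAGE_IP` and `set_blocked` are placeholders, and the rule drops outgoing traffic to the storage server's standard iSCSI port (3260/tcp).

```python
# Hypothetical reproduction helper: block/unblock the host's iSCSI traffic.
import subprocess

STORAGE_IP = "192.0.2.10"  # placeholder storage server address

def set_blocked(blocked: bool) -> None:
    action = "-I" if blocked else "-D"  # insert or delete the DROP rule
    subprocess.run(
        ["iptables", action, "OUTPUT", "-d", STORAGE_IP,
         "-p", "tcp", "--dport", "3260", "-j", "DROP"],
        check=True,
    )
```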
Actual result:
The host status switched from up to non-operational, without looping through connecting and non-responsive, as expected.
But since then, every 5 minutes, the host's status changes from non-operational -> unassigned and back to non-operational.
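A hedged sketch of how these transitions can be watched from outside the engine, using the oVirt Python SDK (`ovirtsdk4`); the engine URL and credentials are placeholders, and the host name is the one from the log below.

```python
import time
import ovirtsdk4 as sdk

# Placeholder connection details; adjust for a real engine.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search="name=host_mixed_2")[0]
host_service = hosts_service.host_service(host.id)

last = None
for _ in range(120):  # sample every 5 seconds for ~10 minutes
    status = host_service.get().status
    if status != last:
        print(time.strftime("%H:%M:%S"), status)
        last = status
    time.sleep(5)
connection.close()
```

The engine log for one such cycle: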
2019-04-04 11:55:00,048+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Before acquiring lock in order to prevent monitoring for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,049+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Lock acquired, from now a monitoring of host will be skipped for host 'host_mixed_2' from data-center 'golden_env_mixed'
2019-04-04 11:55:00,062+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='Unassigned', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 3f0d5e8a
2019-04-04 11:55:00,070+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] FINISH, SetVdsStatusVDSCommand, return: , log id: 3f0d5e8a
2019-04-04 11:55:00,080+03 INFO [org.ovirt.engine.core.bll.ActivateVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [659d89e6] Activate host finished. Lock released. Monitoring can run now for host 'host_mixed_2' from data-center 'golden_env_mixed'
.....
2019-04-04 11:55:10,658+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-62068) [324c1f52] EVENT_ID: VDS_STORAGE_VDS_STATS_FAILED(189), Host host_mixed_2 reports about one of the Active Storage Domains as Problematic.
2019-04-04 11:55:10,703+03 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: c54a1613-9828-4e95-a0ec-e4372012bdc9 Type: VDS
2019-04-04 11:55:10,711+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] START, SetVdsStatusVDSCommand(HostName = host_mixed_2, SetVdsStatusVDSCommandParameters:{hostId='c54a1613-9828-4e95-a0ec-e4372012bdc9', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2309f9ab
2019-04-04 11:55:10,719+03 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] FINISH, SetVdsStatusVDSCommand, return: , log id: 2309f9ab
2019-04-04 11:55:10,860+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [12078c3f] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) iscsi_1, iscsi_0, iscsi_2 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.
Is this the expected behavior in terms of status changes?
This is the scenario to test:

Have a DC with 2 hosts and at least one SD.
Put Host 2 in maintenance.
Block access to the SD on Host 2.
Move Host 2 to UP.

Without the fix the host never moves to NonOperational; it keeps moving between connecting and Up.
With the fix the host moves to NonOperational. HostMonitoring on the next cycle tries to connect to the storage domain again.
(A scripted version of these steps is sketched at the end of this report.)

(In reply to Fred Rolland from comment #16)
> This is the scenario to test:
>
> Have a DC with 2 hosts and at least one SD.
> Put Host 2 in maintenance.
> Block access to the SD on Host 2.
> Move Host 2 to UP.
>
> Without the fix the host never moves to NonOperational; it keeps moving
> between connecting and Up.
> With the fix the host moves to NonOperational. HostMonitoring on the next
> cycle tries to connect to the storage domain again.

Thanks for the update, Fred! Yosi will verify using this scenario.

What about the issue seen in the last comment (c#15) from Yosi -> the host state moves from 'Non-operational' -> 'unassigned' and back every 5 minutes.

A few questions we need your help with, to clarify and maybe open a new bug on:
1) It's important to understand whether the host state moving from 'Non-operational' -> 'unassigned' and back means the host is treated as down and not in any way active. Is this the case?
2) What is the reason for this move? Is there any impact on the customer?
3) Is this a regression?

The host is not moving to unassigned in this flow.

Tested using:
ovirt-engine-4.3.3.2-0.1.el7.noarch

Actual result:
The blocked host doesn't loop from connecting -> up and goes straight to "NonOperational" after the timeout, as expected.

Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1085
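For reference, the verification scenario from comment #16 could be scripted roughly as follows with the oVirt Python SDK (`ovirtsdk4`). This is a sketch under assumptions: the engine URL and credentials are placeholders, and the storage-blocking step is whatever mechanism the tester uses (e.g. the hypothetical iptables helper sketched earlier).

```python
import ovirtsdk4 as sdk

# Placeholder connection details; adjust for a real engine.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
hosts_service = connection.system_service().hosts_service()
host = hosts_service.list(search="name=host_mixed_2")[0]
host_service = hosts_service.host_service(host.id)

host_service.deactivate()  # put Host 2 in maintenance
# ... block the host's access to the SD here (e.g. the iptables rule above) ...
host_service.activate()    # move Host 2 back toward Up
# With the fix, the host should settle in NonOperational rather than
# cycling between Connecting and Up.
connection.close()
```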