Bug 1468999 - [downstream clone - 4.1.4] Command via API can cause host in Maintenance mode to be fenced
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.4
Target Release: ---
Assigned To: Maor
QA Contact: Kevin Alon Goldblatt
Keywords: ZStream
Depends On: 1464296
Blocks:
Reported: 2017-07-10 04:27 EDT by rhev-integ
Modified: 2017-07-27 14:02 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, if a host was in maintenance mode and a user called GET /ovirt-engine/api/hosts/64fc087f-821b-429a-b274-5e8597a88f3d/storage through the REST API, the Manager would try to call getDeviceList on the host, and as a result the host would be fenced. In this release, a validation has been added which first checks whether the host is in maintenance mode. The Manager will only call getDeviceList if the host is up.
Story Points: ---
Clone Of: 1464296
Environment:
Last Closed: 2017-07-27 14:02:44 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3091701 | None | None | None | 2017-07-10 04:29 EDT
oVirt gerrit | 78901 | master | MERGED | core: Avoid fencing on host API to call getDeviceList. | 2017-07-10 04:29 EDT
oVirt gerrit | 78942 | ovirt-engine-4.1 | MERGED | core: Avoid fencing on host API to call getDeviceList. | 2017-07-10 04:29 EDT

Description rhev-integ 2017-07-10 04:27:55 EDT
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1464296 +++
======================================================================

Description of problem:

1. Host is switched to maintenance:

2017-06-12 15:22:33,480 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [39697061] Correlation ID: 39697061, Job ID: 4c04cf41-d8f8-4ff5-ab1d-2ed2bad7fd57, Call Stack: null, Custom Event ID: -1, Message: Host <HOSTNAME> was switched to Maintenance mode by admin@internal-authz.

2. Three minutes later, an API request for storage on this host arrives:

[12/Jun/2017:15:25:01 +0900] "GET /ovirt-engine/api/hosts/64fc087f-821b-429a-b274-5e8597a88f3d/storage HTTP/1.1" 400 113

3. To return that info, engine needs to run a command on the host, so it tries to connect to a host which is in maintenance mode:

2017-06-12 15:25:01,912 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] START, GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'}), log id: 6f188af0

2017-06-12 15:25:01,913 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /<HOST IP>

4. As the host is in maintenance mode, vdsm might not be running (rebooting, upgrading...). In this case, things take an undesired path:

2017-06-12 15:25:04,920 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] Command 'GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed

2017-06-12 15:25:04,920 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-26) [] Host '<HOSTNAME>' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
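
To check whether an engine.log contains the same sequence (a failed GetDeviceListVDSCommand followed by the VdsManager "not responding" warning), something like the following could be used. This is only a sketch: the line layout is inferred from the excerpts above, and the function name find_fencing_risk and the shortened sample lines are made up for illustration.

```python
import re

# Assumed layout, inferred from the excerpts in this report:
# "<date> <time>,<ms> <LEVEL>  [<logger>] <rest of line>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\] (?P<rest>.*)$"
)


def find_fencing_risk(lines):
    """Return timestamps of 'not responding' warnings that follow a failed GetDeviceListVDSCommand."""
    hits = []
    failed = False  # True once a GetDeviceList failure has been seen
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        logger, rest = m.group("logger"), m.group("rest")
        if (logger.endswith("GetDeviceListVDSCommand")
                and m.group("level") == "ERROR"
                and "execution failed" in rest):
            failed = True
        elif failed and logger.endswith("VdsManager") and "is not responding" in rest:
            hits.append(m.group("ts"))
            failed = False
    return hits


# Shortened versions of the two log lines quoted above.
sample = [
    "2017-06-12 15:25:04,920 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] "
    "(default task-186) [] Command 'GetDeviceListVDSCommand(...)' execution failed: Connection failed",
    "2017-06-12 15:25:04,920 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] "
    "(org.ovirt.thread.pool-6-thread-26) [] Host '<HOSTNAME>' is not responding.",
]
print(find_fencing_risk(sample))
```

The check does not look at host names, so on a busy engine it may pair unrelated lines; it is meant as a first filter only.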

Version-Release number of selected component (if applicable):
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
100% on 4.1.1

Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage
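
Step 3 can also be scripted instead of using a browser. The sketch below only builds the request; the engine FQDN, user and password are placeholders, and HTTP basic authentication plus the XML Accept header are assumptions about how the API is being accessed in a given setup.

```python
import base64
import urllib.request


def build_storage_request(engine_host, host_id, user, password):
    """Build (but do not send) the GET request for a host's storage collection."""
    url = f"https://{engine_host}/ovirt-engine/api/hosts/{host_id}/storage"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/xml")
    return req


# Placeholder values; replace them with the real engine FQDN and credentials.
req = build_storage_request(
    "rhvm.example.com",
    "64fc087f-821b-429a-b274-5e8597a88f3d",
    "admin@internal",
    "secret",
)
print(req.get_method(), req.full_url)

# Sending it would look like this (needs network access and a trusted engine CA;
# urlopen raises urllib.error.HTTPError for the 400 reply seen in this bug):
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read().decode())
```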

Actual results:
Host switches to Non-Responsive status in Admin Portal, can be fenced.

Expected results:
Host continues in Maintenance mode.

Additional info:

The actual HTTP GET returns 400, which is probably expected as the engine can't talk to the host to run the required commands to return this info.

<fault><detail>Network error during communication with the Host.</detail><reason>Operation Failed</reason></fault>

But the host should not switch to Non-Responsive status because it was in Maintenance mode.
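
For scripts that hit this endpoint, the fault body quoted above can be parsed to tell this failure apart from others. A minimal sketch using only the standard library; parse_fault is an illustrative name, and the element names are taken from the single response shown here, so other fault responses may carry different fields.

```python
import xml.etree.ElementTree as ET


def parse_fault(body):
    """Extract reason and detail from a <fault> response body."""
    root = ET.fromstring(body)
    if root.tag != "fault":
        raise ValueError(f"expected <fault>, got <{root.tag}>")
    return {"reason": root.findtext("reason"), "detail": root.findtext("detail")}


# The response body quoted in this report.
fault = parse_fault(
    "<fault><detail>Network error during communication with the Host.</detail>"
    "<reason>Operation Failed</reason></fault>"
)
print(fault["reason"], "-", fault["detail"])
```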

(Originally by Germano Veit Michel)
Comment 4 rhev-integ 2017-07-10 04:28:16 EDT
Just FYI, I did some random GETs via the API and the only one I found with this behavior is "storage". The others seem fine.

(Originally by Germano Veit Michel)
Comment 8 rhev-integ 2017-07-10 04:28:38 EDT
Wouldn't it be better to add the validation only in the REST command?

GetDeviceList is called in numerous flows, and it will be hard to anticipate how the change will affect every flow that uses it.
Propagating the error upwards would also have to be done in every flow that uses GetDeviceList.
Is this type of solution suitable for 4.1.4?

(Originally by Maor Lipchuk)
Comment 9 rhev-integ 2017-07-10 04:28:43 EDT
(In reply to Maor from comment #7)
> Wouldn't it be better to add the validation only in the REST command?
> 
> GetDeviceList is called in numerous flows, and it will be hard to
> anticipate how the change will affect every flow that uses it.
> Propagating the error upwards would also have to be done in every flow
> that uses GetDeviceList.
> Is this type of solution suitable for 4.1.4?

Sounds legit.
Note that the REST code shouldn't have this validation itself, but should probably pass an indicator on the parameters object down to the query so it can decide whether or not it should perform this validation.

(Originally by Allon Mureinik)
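
For illustration only, the "indicator on the parameters object" idea from the comment above could look roughly like this. It is a language-neutral model written in Python, not the engine's Java code; every name here (GetDeviceListParams, validate_host_status, get_device_list_query, call_host) is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    UP = "up"
    MAINTENANCE = "maintenance"


class QueryValidationError(Exception):
    """Raised when the query refuses to contact the host."""


@dataclass
class GetDeviceListParams:
    host_id: str
    # Hypothetical flag: only the REST layer sets it; other flows keep the default.
    validate_host_status: bool = False


def get_device_list_query(params, host_status, call_host):
    """Run the device-list call unless validation was requested and the host is not up."""
    if params.validate_host_status and host_status is not HostStatus.UP:
        raise QueryValidationError(
            f"host {params.host_id} is {host_status.value}, not up"
        )
    return call_host(params.host_id)
```

With the flag defaulting to off, existing callers of the query behave as before, which is the concern raised about touching a command used by many flows.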
Comment 10 rhev-integ 2017-07-10 04:28:50 EDT
Or else create a new query, used only by the API, that performs the validation and calls the existing GetDeviceList query. In any case the API is not the place to perform this kind of validation.

(Originally by juan.hernandez)
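
The alternative from the comment above, a separate query used only by the API that validates and then delegates, might be modelled like this. Again a rough Python sketch with hypothetical names (get_device_list, get_device_list_for_api, call_host), not the engine's actual code.

```python
def get_device_list(host_id, call_host):
    """Stand-in for the existing query: always contacts the host."""
    return call_host(host_id)


def get_device_list_for_api(host_id, host_status, call_host):
    """Hypothetical API-only query: validate the host status first, then delegate."""
    if host_status != "up":
        # The host is never contacted, so a stopped vdsm cannot trigger fencing.
        return {"ok": False, "error": f"host {host_id} is {host_status}", "devices": []}
    return {"ok": True, "error": None, "devices": get_device_list(host_id, call_host)}
```

The existing query stays untouched in this variant; the cost is a second query to maintain.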
Comment 11 rhev-integ 2017-07-10 04:28:55 EDT
(In reply to Juan Hernández from comment #9)
> Or else create a new query, used only by the API, that performs the
> validation and calls the existing GetDeviceList query. In any case the API
> is not the place to perform this kind of validation.

Sure, I implemented what Allon suggested and added the validation only in the vdc query.

(Originally by Maor Lipchuk)
Comment 12 rhev-integ 2017-07-10 04:29:00 EDT
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com

(Originally by rhev-integ)
Comment 13 Kevin Alon Goldblatt 2017-07-10 16:15:50 EDT
Verified with the following code:
---------------------------------
ovirt-engine-4.1.4-0.2.el7.noarch
rhevm-4.1.4-0.2.el7.noarch
vdsm-4.19.21-1.el7ev.x86_64

Verified with the following scenario:
-------------------------------------
Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage

Actual results:
Host continues in Maintenance mode.


Moving to VERIFIED!
Comment 14 Allon Mureinik 2017-07-13 07:34:43 EDT
Maor, this fix changes the behavior of the public API in a visible way.
Can you please add some doctext here?
Comment 16 errata-xmlrpc 2017-07-27 14:02:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1814
