Bug 1464296 - Command via API can cause host in Maintenance mode to be fenced
Status: VERIFIED
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Maor
QA Contact: Kevin Alon Goldblatt
Keywords: ZStream
Depends On:
Blocks: 1468999
Reported: 2017-06-22 22:11 EDT by Germano Veit Michel
Modified: 2017-11-27 10:42 EST

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 1468999
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3091701 | None | None | None | 2017-06-23 00:35 EDT
oVirt gerrit | 78901 | master | MERGED | core: Avoid fencing on host API to call getDeviceList. | 2017-07-03 09:02 EDT
oVirt gerrit | 78942 | ovirt-engine-4.1 | MERGED | core: Avoid fencing on host API to call getDeviceList. | 2017-07-04 05:56 EDT

Description Germano Veit Michel 2017-06-22 22:11:10 EDT
Description of problem:

1. Host is switched to maintenance:

2017-06-12 15:22:33,480 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [39697061] Correlation ID: 39697061, Job ID: 4c04cf41-d8f8-4ff5-ab1d-2ed2bad7fd57, Call Stack: null, Custom Event ID: -1, Message: Host <HOSTNAME> was switched to Maintenance mode by admin@internal-authz.

2. About three minutes later, an API request for storage on this host arrives:

[12/Jun/2017:15:25:01 +0900] "GET /ovirt-engine/api/hosts/64fc087f-821b-429a-b274-5e8597a88f3d/storage HTTP/1.1" 400 113

3. To return that info, engine needs to run a command on the host, so it tries to connect to a host which is in maintenance mode:

2017-06-12 15:25:01,912 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] START, GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'}), log id: 6f188af0

2017-06-12 15:25:01,913 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /<HOST IP>

4. Because the host is in maintenance mode, vdsm might not be running (the host may be rebooting, upgrading, and so on). In that case the connection fails and the engine starts treating the host as not responding:

2017-06-12 15:25:04,920 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] Command 'GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed

2017-06-12 15:25:04,920 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-26) [] Host '<HOSTNAME>' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.
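
When triaging a sequence like the one above, it can help to split engine.log lines into fields and compare timestamps. The following is a minimal Python sketch, assuming every line follows the layout of the excerpts in this report (timestamp, level, logger in brackets, thread in parentheses, correlation ID in brackets, message). The regular expression and the parse_line helper are illustrative and are not part of any oVirt tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts in this report:
# <timestamp> <LEVEL> [<logger>] (<thread>) [<correlation id>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"\[(?P<correlation>[^\]]*)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one engine.log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert "2017-06-12 15:25:04,920" into a datetime (milliseconds kept).
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields
```

Parsing the START line of GetDeviceListVDSCommand and the VdsManager warning and subtracting the two "ts" values gives the time the engine waited before declaring the host not responding.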

Version-Release number of selected component (if applicable):
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
100% on 4.1.1

Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage
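
Step 3 can also be issued from a script instead of a browser. The following is a minimal Python sketch, assuming HTTP basic authentication; the engine host name and the credentials are placeholders that do not come from this report, and the helper names are made up for illustration. Only the URL path and the host ID are taken from the report.

```python
import base64
import urllib.request


def storage_url(engine_fqdn, host_id):
    """Build the URL from step 3 for a given engine and host ID."""
    return f"https://{engine_fqdn}/ovirt-engine/api/hosts/{host_id}/storage"


def build_request(engine_fqdn, host_id, user, password):
    """Prepare (but do not send) a GET request with basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(storage_url(engine_fqdn, host_id), method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/xml")
    return request


if __name__ == "__main__":
    # Placeholder engine name and credentials; the host ID is the one from the log above.
    req = build_request(
        "rhv-m.example.com",
        "64fc087f-821b-429a-b274-5e8597a88f3d",
        "admin@internal",
        "secret",
    )
    print(req.get_method(), req.full_url)
```

Sending the prepared request with urllib.request.urlopen would additionally require the engine CA certificate to be trusted by the client.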

Actual results:
Host switches to Non-Responsive status in the Admin Portal and can then be fenced.

Expected results:
Host continues in Maintenance mode.

Additional info:

The actual HTTP GET returns 400, which is probably expected as the engine can't talk to the host to run the required commands to return this info.

<fault><detail>Network error during communication with the Host.</detail><reason>Operation Failed</reason></fault>
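
A client script can tell this failure apart from other errors by reading the fault body. The following is a minimal Python sketch, assuming the body always has the shape shown above (a fault root element with detail and reason children); the parse_fault name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_fault(body):
    """Extract reason and detail from a <fault> response body."""
    root = ET.fromstring(body)
    if root.tag != "fault":
        raise ValueError(f"expected a <fault> element, got <{root.tag}>")
    return {
        "reason": root.findtext("reason"),
        "detail": root.findtext("detail"),
    }


if __name__ == "__main__":
    body = (
        "<fault><detail>Network error during communication with the Host."
        "</detail><reason>Operation Failed</reason></fault>"
    )
    print(parse_fault(body))
```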

But the host should not switch to Non-Responsive status, because it was in Maintenance mode.
Comment 3 Germano Veit Michel 2017-06-25 19:26:52 EDT
Just FYI, I did some random GETs via the API and the only one I found with this behavior is "storage". The others seem fine.
Comment 7 Maor 2017-07-02 06:32:15 EDT
Won't it be better to add the validation only in the REST command?

GetDeviceList is called in numerous flows, and it will be hard to anticipate how the change will affect every flow that uses it.
Propagating the error upwards would also have to be done in every flow that uses GetDeviceList.
Is this type of solution suitable for 4.1.4?
Comment 8 Allon Mureinik 2017-07-02 06:49:01 EDT
(In reply to Maor from comment #7)
> Won't it be better to add the validation only in the REST command?
> 
> GetDeviceList is called in numerous flows, and it will be hard to
> anticipate how the change will affect every flow that uses it.
> Propagating the error upwards would also have to be done in every flow
> that uses GetDeviceList.
> Is this type of solution suitable for 4.1.4?

Sounds legit.
Note that the REST code shouldn't have this validation itself, but should probably pass an indicator on the parameters object down to the query so it can decide whether or not it should perform this validation.
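
The idea of passing an indicator down to the query can be sketched as follows. This is only an illustration in Python, not the engine's code (the engine is written in Java): the validate_host_status flag, the class and function names, and the "Up" status string are all hypothetical.

```python
from dataclasses import dataclass


class HostNotUpError(Exception):
    """Raised instead of contacting a host that is not Up."""


@dataclass
class GetDeviceListParameters:
    host_id: str
    # Hypothetical indicator: only the REST layer would set it to True.
    validate_host_status: bool = False


def get_device_list_query(params, host_status_of, run_on_host):
    """Run the device-list command, optionally validating host status first.

    host_status_of and run_on_host are stand-ins for the engine's status
    lookup and for the call that actually contacts the host.
    """
    if params.validate_host_status:
        status = host_status_of(params.host_id)
        if status != "Up":
            # Fail before any connection attempt, so host monitoring is never triggered.
            raise HostNotUpError(f"host {params.host_id} is {status}, not Up")
    return run_on_host(params.host_id)
```

Because the flag defaults to False, the other flows that use the query keep their current behaviour, and only callers that opt in get the early failure.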
Comment 9 Juan Hernández 2017-07-03 04:05:24 EDT
Or else create a new query, used only by the API, that performs the validation and calls the existing GetDeviceList query. In any case the API is not the place to perform this kind of validation.
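
The alternative of a separate, API-only query that validates and then delegates could look roughly like this. Again this is a hypothetical Python sketch and not engine code: both function names, the callables passed in, and the "Up" status string are assumptions made for illustration.

```python
def get_device_list(host_id, run_on_host):
    """Stand-in for the existing query: always contacts the host."""
    return run_on_host(host_id)


def get_device_list_for_api(host_id, host_status_of, run_on_host):
    """Hypothetical API-only query: validate first, then delegate."""
    status = host_status_of(host_id)
    if status != "Up":
        # Reject the request without ever connecting to the host.
        raise RuntimeError(f"host {host_id} is {status}, not Up")
    return get_device_list(host_id, run_on_host)
```

In this variant the existing query is left untouched, at the cost of a second entry point that has to be maintained.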
Comment 10 Maor 2017-07-03 05:21:26 EDT
(In reply to Juan Hernández from comment #9)
> Or else create a new query, used only by the API, that performs the
> validation and calls the existing GetDeviceList query. In any case the API
> is not the place to perform this kind of validation.

Sure, I implemented what Allon suggested and added the validation only in the vdc query.
Comment 11 rhev-integ 2017-07-07 08:23:39 EDT
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops@redhat.com
Comment 13 Allon Mureinik 2017-07-13 07:35:14 EDT
Doctext will be provided in the zstream clone 1468999.
Comment 17 Kevin Alon Goldblatt 2017-11-27 10:42:30 EST
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.7-55.git11440d6.el7.centos.x86_64

Verified with the following scenario:
--------------------------------------
Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage
>>>>> The host is reported as not Up


<fault><detail>Cannot ${action} ${type}. The server ${VdsName} is not UP.</detail><reason>Operation Failed</reason></fault>



Moving to VERIFIED
