+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1464296 +++
======================================================================

Description of problem:

1. Host is switched to maintenance:

2017-06-12 15:22:33,480 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [39697061] Correlation ID: 39697061, Job ID: 4c04cf41-d8f8-4ff5-ab1d-2ed2bad7fd57, Call Stack: null, Custom Event ID: -1, Message: Host <HOSTNAME> was switched to Maintenance mode by admin@internal-authz.

2. Three minutes later, an API request is made for storage on this host:

[12/Jun/2017:15:25:01 +0900] "GET /ovirt-engine/api/hosts/64fc087f-821b-429a-b274-5e8597a88f3d/storage HTTP/1.1" 400 113

3. To return that info, the engine needs to run a command on the host, so it tries to connect to a host which is in maintenance mode:

2017-06-12 15:25:01,912 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] START, GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'}), log id: 6f188af0
2017-06-12 15:25:01,913 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /<HOST IP>

4. As the host is in maintenance mode, vdsm might not be running (rebooting, upgrading...). In this case, things take an undesired route:

2017-06-12 15:25:04,920 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] Command 'GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed
2017-06-12 15:25:04,920 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-26) [] Host '<HOSTNAME>' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

Version-Release number of selected component (if applicable):
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
100% on 4.1.1

Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage

Actual results:
Host switches to Non-Responsive status in Admin Portal, can be fenced.

Expected results:
Host continues in Maintenance mode.

Additional info:
The actual HTTP GET returns 400, which is probably expected as the engine can't talk to the host to run the commands required to return this info.

<fault><detail>Network error during communication with the Host.</detail><reason>Operation Failed</reason></fault>

But the host should not switch to Non-Responding status because it was in Maintenance mode.

(Originally by Germano Veit Michel)
Just FYI, I did some random GETs via the API and the only one I found with this behavior is "storage". The others seem fine. (Originally by Germano Veit Michel)
Won't it be better to add the validation only in the REST command? GetDeviceList is called in numerous flows, and it will be hard to anticipate how the change will be reflected in every flow that uses it. Propagating the error upwards would also have to be done in every flow that uses GetDeviceList. Is this type of solution suitable for 4.1.4? (Originally by Maor Lipchuk)
(In reply to Maor from comment #7)
> Won't it be better to add the validation only in the REST command?
>
> GetDeviceList is called in numerous flows, and it will be hard to
> anticipate how the change will be reflected in every flow that uses it.
> Propagating the error upwards would also have to be done in every flow
> that uses GetDeviceList.
> Is this type of solution suitable for 4.1.4?

Sounds legit. Note that the REST code shouldn't have this validation itself, but should probably pass an indicator on the parameters object down to the query so it can decide whether or not it should perform this validation.

(Originally by Allon Mureinik)
Or else create a new query, used only by the API, that performs the validation and calls the existing GetDeviceList query. In any case the API is not the place to perform this kind of validation. (Originally by juan.hernandez)
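For illustration, the "new query used only by the API" alternative could look roughly like the sketch below. It is not the engine's code: the class name, the `HostNotUpException` type, and the rule that only hosts in Up status may be queried are all assumptions made for the example, and the existing GetDeviceList query is represented by a plain function.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: these are NOT real ovirt-engine classes.
class ApiDeviceListQuerySketch {

    enum HostStatus { UP, MAINTENANCE, NON_RESPONSIVE }

    // Hypothetical error type raised when the validation fails.
    static final class HostNotUpException extends RuntimeException {
        HostNotUpException(String message) {
            super(message);
        }
    }

    // API-only wrapper: validate the host status first, then delegate to the
    // existing query, which stays unchanged for every other flow.
    // Assumption: only hosts in UP status are contacted.
    static List<String> apiGetDeviceList(
            String hostId,
            Function<String, HostStatus> statusLookup,
            Function<String, List<String>> existingGetDeviceList) {
        HostStatus status = statusLookup.apply(hostId);
        if (status != HostStatus.UP) {
            // Fail before any connection attempt is made to the host.
            throw new HostNotUpException(
                    "Host " + hostId + " is not Up (status: " + status + ")");
        }
        return existingGetDeviceList.apply(hostId);
    }
}
```

The point of the wrapper is that the host is never contacted when the validation fails, so a stopped vdsm on a host in Maintenance cannot trigger the Connecting / Non-Responsive transition from this path.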
(In reply to Juan Hernández from comment #9)
> Or else create a new query, used only by the API, that performs the
> validation and calls the existing GetDeviceList query. In any case the API
> is not the place to perform this kind of validation.

Sure. I implemented what Allon suggested and added the validation only in the vdc query.

(Originally by Maor Lipchuk)
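A minimal sketch of the approach described in this comment, with an indicator on the parameters object that tells the query whether to validate the host status. This is not the actual patch: the flag name `validateHostStatus`, the `QueryResult` and `HostBroker` types, and the assumption that "valid" means the host is Up are all hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: these are NOT real ovirt-engine classes.
class DeviceListFlagSketch {

    enum HostStatus { UP, MAINTENANCE, NON_RESPONSIVE }

    // Parameters object carrying the indicator set by the caller (e.g. REST).
    static final class DeviceListParameters {
        final String hostId;
        final boolean validateHostStatus; // hypothetical flag name

        DeviceListParameters(String hostId, boolean validateHostStatus) {
            this.hostId = hostId;
            this.validateHostStatus = validateHostStatus;
        }
    }

    static final class QueryResult {
        final boolean succeeded;
        final List<String> devices;
        final String failureReason;

        private QueryResult(boolean succeeded, List<String> devices, String failureReason) {
            this.succeeded = succeeded;
            this.devices = devices;
            this.failureReason = failureReason;
        }

        static QueryResult success(List<String> devices) {
            return new QueryResult(true, devices, null);
        }

        static QueryResult failure(String reason) {
            return new QueryResult(false, Collections.<String>emptyList(), reason);
        }
    }

    // Stand-in for the call that actually connects to the host.
    interface HostBroker {
        List<String> getDeviceList(String hostId);
    }

    // The query decides whether to validate, based on the caller's flag.
    // Flows that do not set the flag behave exactly as before.
    static QueryResult runQuery(DeviceListParameters params, HostStatus status, HostBroker broker) {
        if (params.validateHostStatus && status != HostStatus.UP) {
            // Assumption: only UP hosts are contacted when validation is on.
            return QueryResult.failure("Host is not Up (status: " + status + ")");
        }
        return QueryResult.success(broker.getDeviceList(params.hostId));
    }
}
```

With the flag set, a request against a host in Maintenance fails inside the query without any connection attempt; callers that leave the flag unset keep the previous behavior.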
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.1.z': '?'}', ] For more info please contact: rhv-devops (Originally by rhev-integ)
Verified with the following code:
---------------------------------
ovirt-engine-4.1.4-0.2.el7.noarch
rhevm-4.1.4-0.2.el7.noarch
vdsm-4.19.21-1.el7ev.x86_64

Verified with the following scenario:
-------------------------------------
Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage

Actual results:
Host continues in Maintenance mode.

Moving to VERIFIED!
Maor, this fix changes the behavior of the public API in a visible way. Can you please add some doctext here?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1814