Description of problem:

1. Host is switched to maintenance:

2017-06-12 15:22:33,480 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [39697061] Correlation ID: 39697061, Job ID: 4c04cf41-d8f8-4ff5-ab1d-2ed2bad7fd57, Call Stack: null, Custom Event ID: -1, Message: Host <HOSTNAME> was switched to Maintenance mode by admin@internal-authz.

2. Three minutes later, an API request for storage on this host:

[12/Jun/2017:15:25:01 +0900] "GET /ovirt-engine/api/hosts/64fc087f-821b-429a-b274-5e8597a88f3d/storage HTTP/1.1" 400 113

3. To return that info, the engine needs to run a command on the host, so it tries to connect to a host which is in maintenance mode:

2017-06-12 15:25:01,912 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] START, GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'}), log id: 6f188af0
2017-06-12 15:25:01,913 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /<HOST IP>

4. As the host is in maintenance mode, vdsm might not be running (rebooting, upgrading...). In this case, things take an undesired route:

2017-06-12 15:25:04,920 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetDeviceListVDSCommand] (default task-186) [] Command 'GetDeviceListVDSCommand(HostName = <HOSTNAME>, GetDeviceListVDSCommandParameters:{runAsync='true', hostId='64fc087f-821b-429a-b274-5e8597a88f3d', storageType='UNKNOWN', checkStatus='true', lunIds='null'})' execution failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: Connection failed
2017-06-12 15:25:04,920 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-26) [] Host '<HOSTNAME>' is not responding. It will stay in Connecting state for a grace period of 60 seconds and after that an attempt to fence the host will be issued.

Version-Release number of selected component (if applicable):
rhevm-4.0.6.3-0.1.el7ev.noarch

How reproducible:
100% on 4.1.1

Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage

Actual results:
Host switches to Non-Responsive status in Admin Portal, can be fenced.

Expected results:
Host continues in Maintenance mode.

Additional info:
The actual HTTP GET returns 400, which is probably expected as the engine can't talk to the host to run the required commands to return this info.

<fault><detail>Network error during communication with the Host.</detail><reason>Operation Failed</reason></fault>

But the host should not switch to Non-Responsive status because it was in Maintenance mode.
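For anyone scripting the reproduction, the fault body that comes back with the 400 can be read programmatically. The following is a minimal Python sketch, assuming the response body is exactly the `<fault>` XML quoted in this report; `parse_fault` is a hypothetical helper written for illustration and is not part of any oVirt SDK.

```python
import xml.etree.ElementTree as ET


def parse_fault(body: str) -> dict:
    """Return the reason and detail text of an API <fault> body.

    Hypothetical helper: assumes a root <fault> element with
    <reason> and <detail> children, as in the response quoted above.
    """
    root = ET.fromstring(body)
    if root.tag != "fault":
        raise ValueError(f"not a fault document: <{root.tag}>")
    return {
        "reason": root.findtext("reason", default=""),
        "detail": root.findtext("detail", default=""),
    }


# Body copied from the report.
body = (
    "<fault><detail>Network error during communication with the Host."
    "</detail><reason>Operation Failed</reason></fault>"
)
fault = parse_fault(body)
print(fault["reason"], "-", fault["detail"])
```

A script could check `detail` after the GET and then query the host status separately to confirm whether the host left Maintenance.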
Just FYI, I did some random GETs via the API and the only one I found with this behavior is "storage". The others seem fine.
Won't it be better to add the validation only in the REST command? GetDeviceList is called in numerous flows, and it will be hard to anticipate how the change will be reflected in every flow that uses it. Propagating the error upwards would also have to be done in every flow which uses GetDeviceList. Is this type of solution suitable for 4.1.4?
(In reply to Maor from comment #7)
> Won't it be better to add the validation only in the REST command?
>
> GetDeviceList is called in numerous flows, and it will be hard to
> anticipate how the change will be reflected in every flow that uses it.
> Propagating the error upwards would also have to be done in every flow
> which uses GetDeviceList.
> Is this type of solution suitable for 4.1.4?

Sounds legit.
Note that the REST code shouldn't have this validation itself, but should probably pass an indicator on the parameters object down to the query so it can decide whether or not it should perform this validation.
Or else create a new query, used only by the API, that performs the validation and calls the existing GetDeviceList query. In any case the API is not the place to perform this kind of validation.
(In reply to Juan Hernández from comment #9)
> Or else create a new query, used only by the API, that performs the
> validation and calls the existing GetDeviceList query. In any case the API
> is not the place to perform this kind of validation.

Sure, I implemented what Allon suggested and added the validation only in the vdc query.
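The approach discussed above (an indicator on the parameters object that tells the query whether to validate the host status before contacting the host) can be modelled roughly as follows. This is a Python sketch of the idea only, not the engine's Java code; `DeviceListParams`, `validate_host_status`, `get_device_list`, `QueryError` and the fake host callback are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class HostStatus(Enum):
    UP = "Up"
    MAINTENANCE = "Maintenance"


class QueryError(Exception):
    """Raised when the query refuses to contact the host."""


@dataclass
class DeviceListParams:
    host_id: str
    # Hypothetical indicator: only the REST layer would set this to True,
    # so other flows that use the query keep their current behaviour.
    validate_host_status: bool = False


def get_device_list(
    params: DeviceListParams,
    host_status: HostStatus,
    run_on_host: Callable[[str], List[str]],
) -> List[str]:
    """Run the device-list command, optionally validating the host first."""
    if params.validate_host_status and host_status is not HostStatus.UP:
        # Fail fast: no connection attempt, so the host status is untouched.
        raise QueryError("The server is not UP.")
    return run_on_host(params.host_id)


# Fake transport for demonstration; records which hosts were contacted.
contacted: List[str] = []


def fake_run_on_host(host_id: str) -> List[str]:
    contacted.append(host_id)
    return ["fake-lun"]


api_params = DeviceListParams("64fc087f-821b-429a-b274-5e8597a88f3d",
                              validate_host_status=True)
try:
    get_device_list(api_params, HostStatus.MAINTENANCE, fake_run_on_host)
except QueryError as err:
    print("rejected:", err)
```

Because the check runs before any connection attempt, a host in Maintenance is never contacted by the API flow, while callers that leave the flag unset are unaffected.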
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.1.z': '?'}', ]

For more info please contact: rhv-devops
Doctext will be provided in the zstream clone 1468999.
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.7-55.git11440d6.el7.centos.x86_64

Verified with the following scenario:
--------------------------------------
Steps to Reproduce:
1. Switch host to maintenance
2. systemctl stop vdsmd
3. Point your browser to https://<rhv-m>/ovirt-engine/api/hosts/<host id>/storage

>>>>> The host is reported as not Up

<fault><detail>Cannot ${action} ${type}. The server ${VdsName} is not UP.</detail><reason>Operation Failed</reason></fault>

Moving to VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488
BZ<2>Jira Resync
qe_test_coverage is '-' as this includes UI verification (step 3 in verification flow)