Bug 1319324

Summary: VM refreshes are failing but the message status from each of the EmsRefresh.refresh commands shows 'ok' in error
Product: Red Hat CloudForms Management Engine Reporter: Thomas Hennessy <thenness>
Component: ProvidersAssignee: Marcel Hild <mhild>
Status: CLOSED CURRENTRELEASE QA Contact: Jiri Stefanisin <jistefan>
Severity: high Docs Contact:
Priority: high    
Version: 5.5.0CC: benglish, cpelland, dajohnso, jdeubel, jfrey, jhardy, jocarter, mfeifer, obarenbo, oourfali, simaishi, thenness
Target Milestone: GAKeywords: TestOnly, ZStream
Target Release: 5.7.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: ems_refresh:retest
Fixed In Version: 5.7.0.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1330139 1346990 (view as bug list) Environment:
Last Closed: 2017-01-11 20:07:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1330139, 1346990    
Attachments:
Description Flags
one page pdf detailing multiple ems refreshes from same 3002 PID with their exeuction times. Only the 3400+ second ems refresh actually processed all expected data.
none
distilled evm.log lines from process 3002 containing ems refreshes identified in pdf already provided none

Description Thomas Hennessy 2016-03-18 20:55:14 UTC
Description of problem:Message status returned from emsrefresh of RHEVM always shows a status of 'ok' even when refresh encounters errors.


Version-Release number of selected component (if applicable):5.5.2.4


How reproducible:


Steps to Reproduce:
1. Register a RHEVM provider into CFME 5.5.2.4 appliance
2. Manually initiate a full refresh and kill the session from the evmrefresh worker to the RHEVM appliance using targeted kill or alter the network latency/delay to be so long that the tcp session used to collect data timesout.
3. examine the evm.log file to see what the message status value is.

Actual results:
 the results will show 'ok'

Expected results: message status to be 'error' or 'timedout'


Additional info:
customer logs from multiple RHEVM process showing multiple refreshes are included as well as spreadsheet showing refresh start times and pid, with refresh durations ranging from 2 minutes to > one hour for the same RHEVM instance

Comment 2 Thomas Hennessy 2016-03-18 21:04:49 UTC
Created attachment 1137882 [details]
one page pdf detailing multiple ems refreshes from same 3002 PID with their exeuction times.  Only the 3400+ second ems refresh actually processed all expected data.

Comment 3 Thomas Hennessy 2016-03-18 21:10:04 UTC
Created attachment 1137883 [details]
distilled evm.log lines from process 3002 containing ems refreshes identified in pdf already provided

Comment 4 Greg Blomquist 2016-04-25 03:48:23 UTC
Dave,

can QE test this scenario with a couple different providers?

I would like to determine if this is a RHEVM specific problem.  Or, if it's across the board for all providers.

Marcel, assigning to you for now until we get more information from QE on whether this is an across the board problem with reporting an error on EMS Refresh, or if it's a RHEVM specific problem.

Comment 10 Marcel Hild 2016-06-22 17:08:42 UTC
@Thomas Hennessy could you explain, where you would expect an error to be shown?
In the UI there should be an error displayed for that specific EMS. Also the log shows and ERROR for that specific EMS.

From what I understand, you a referring to the log file, where the "State: [ok]" appears. But this refers to the Queue Message being delivered and executed.

Comment 11 Thomas Hennessy 2016-06-22 17:40:55 UTC
Marcel,
IMHO the existence of the log line message containing the completed for any message ID indicates that the message number was delivered and completed.

the status line should indicate if the requested command was completed successfully or not.

If the log lines between the initiation of the message (ie message dequeued) and the message completion log line contains any errors that indicates to me that the initiated command encountered error and therefore the status of the execution of the requested command cannot possibly be 'ok'.

Tom Hennessy

Comment 12 Marcel Hild 2016-06-22 17:47:51 UTC
ok, thanks for the clarification.
Just to be 100% sure: you want this

[----] I, [2016-03-16T10:18:06.067938 #3002:90798c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [2000001124178], State: [ok], Delivered in [157.827267344] seconds

to be changed into

[----] I, [2016-03-16T10:18:06.067938 #3002:90798c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [2000001124178], State: [error], Delivered in [157.827267344] seconds

Comment 13 Thomas Hennessy 2016-06-22 17:59:57 UTC
Marcel,
yes.

more generally if the command being initiated by the message id (Command: [EmsRefresh.refresh] in this case) encounters errors in the execution of the command (as is reported by the error messages in the evm.log) it seems to me any any reasonable customer would expect that the status returned would reflect that highest level of error encountered.  In this case the 'error' as reported by multipe log lines.

Tom Hennessy

Comment 14 Marcel Hild 2016-07-19 09:27:21 UTC
on dev: https://github.com/ManageIQ/manageiq/pull/9907