Bug 1319324 - VM refreshes are failing but the message status from each of the EmsRefresh.refresh commands shows 'ok' in error
Summary: VM refreshes are failing but the message status from each of the EmsRefresh.r...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.5.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: GA
: 5.7.0
Assignee: Marcel Hild
QA Contact: Jiri Stefanisin
URL:
Whiteboard: ems_refresh:retest
Depends On:
Blocks: 1330139 1346990
TreeView+ depends on / blocked
 
Reported: 2016-03-18 20:55 UTC by Thomas Hennessy
Modified: 2019-10-10 11:36 UTC (History)
12 users (show)

Fixed In Version: 5.7.0.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1330139 1346990 (view as bug list)
Environment:
Last Closed: 2017-01-11 20:07:06 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
one page pdf detailing multiple ems refreshes from same 3002 PID with their exeuction times. Only the 3400+ second ems refresh actually processed all expected data. (95.25 KB, application/pdf)
2016-03-18 21:04 UTC, Thomas Hennessy
no flags Details
distilled evm.log lines from process 3002 containing ems refreshes identified in pdf already provided (133.28 KB, application/zip)
2016-03-18 21:10 UTC, Thomas Hennessy
no flags Details

Description Thomas Hennessy 2016-03-18 20:55:14 UTC
Description of problem:Message status returned from emsrefresh of RHEVM always shows a status of 'ok' even when refresh encounters errors.


Version-Release number of selected component (if applicable):5.5.2.4


How reproducible:


Steps to Reproduce:
1. Register a RHEVM provider into CFME 5.5.2.4 appliance
2. Manually initiate a full refresh and kill the session from the evmrefresh worker to the RHEVM appliance using targeted kill or alter the network latency/delay to be so long that the tcp session used to collect data timesout.
3. examine the evm.log file to see what the message status value is.

Actual results:
 the results will show 'ok'

Expected results: message status to be 'error' or 'timedout'


Additional info:
customer logs from multiple RHEVM process showing multiple refreshes are included as well as spreadsheet showing refresh start times and pid, with refresh durations ranging from 2 minutes to > one hour for the same RHEVM instance

Comment 2 Thomas Hennessy 2016-03-18 21:04:49 UTC
Created attachment 1137882 [details]
one page pdf detailing multiple ems refreshes from same 3002 PID with their exeuction times.  Only the 3400+ second ems refresh actually processed all expected data.

Comment 3 Thomas Hennessy 2016-03-18 21:10:04 UTC
Created attachment 1137883 [details]
distilled evm.log lines from process 3002 containing ems refreshes identified in pdf already provided

Comment 4 Greg Blomquist 2016-04-25 03:48:23 UTC
Dave,

can QE test this scenario with a couple different providers?

I would like to determine if this is a RHEVM specific problem.  Or, if it's across the board for all providers.

Marcel, assigning to you for now until we get more information from QE on whether this is an across the board problem with reporting an error on EMS Refresh, or if it's a RHEVM specific problem.

Comment 10 Marcel Hild 2016-06-22 17:08:42 UTC
@Thomas Hennessy could you explain, where you would expect an error to be shown?
In the UI there should be an error displayed for that specific EMS. Also the log shows and ERROR for that specific EMS.

From what I understand, you a referring to the log file, where the "State: [ok]" appears. But this refers to the Queue Message being delivered and executed.

Comment 11 Thomas Hennessy 2016-06-22 17:40:55 UTC
Marcel,
IMHO the existence of the log line message containing the completed for any message ID indicates that the message number was delivered and completed.

the status line should indicate if the requested command was completed successfully or not.

If the log lines between the initiation of the message (ie message dequeued) and the message completion log line contains any errors that indicates to me that the initiated command encountered error and therefore the status of the execution of the requested command cannot possibly be 'ok'.

Tom Hennessy

Comment 12 Marcel Hild 2016-06-22 17:47:51 UTC
ok, thanks for the clarification.
Just to be 100% sure: you want this

[----] I, [2016-03-16T10:18:06.067938 #3002:90798c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [2000001124178], State: [ok], Delivered in [157.827267344] seconds

to be changed into

[----] I, [2016-03-16T10:18:06.067938 #3002:90798c]  INFO -- : MIQ(MiqQueue#delivered) Message id: [2000001124178], State: [error], Delivered in [157.827267344] seconds

Comment 13 Thomas Hennessy 2016-06-22 17:59:57 UTC
Marcel,
yes.

more generally if the command being initiated by the message id (Command: [EmsRefresh.refresh] in this case) encounters errors in the execution of the command (as is reported by the error messages in the evm.log) it seems to me any any reasonable customer would expect that the status returned would reflect that highest level of error encountered.  In this case the 'error' as reported by multipe log lines.

Tom Hennessy

Comment 14 Marcel Hild 2016-07-19 09:27:21 UTC
on dev: https://github.com/ManageIQ/manageiq/pull/9907


Note You need to log in before you can comment on or make changes to this bug.