1513038 – No Event Error Generated When Reloading EAP That Is Not Running

Summary: No Event Error Generated When Reloading EAP That Is Not Running

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	cfme-future
Assignee:	Tomas Coufal
QA Contact:	Matt Mahoney
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-14 16:05 UTC by Matt Mahoney
Modified:	2018-01-05 23:48 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-01-04 15:35:49 UTC
Category:	---
Cloudforms Team:	Middleware
Target Upstream Version:
Embargoed:

Attachments
middleware Manager Log (14.22 KB, application/zip) 2017-11-14 16:05 UTC, Matt Mahoney
evm.log (668.45 KB, application/zip) 2017-11-14 16:05 UTC, Matt Mahoney
Screen Shot (59.41 KB, image/png) 2017-11-14 16:06 UTC, Matt Mahoney

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1452986	0	medium	CLOSED	Failed power operation on middleware server appears as green OK	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1497922	0	unspecified	CLOSED	[RFE] Middleware - Stopped Domain - Power operations do not fail with notification	2021-02-22 00:41:40 UTC

Description Matt Mahoney 2017-11-14 16:05:09 UTC

Created attachment 1352057 [details]
middleware Manager Log

Description of problem:
Expected two events when reloading an EAP that has gone offline: one "start" event and one "Completed/Failed" event. Only the start event was generated; the second (error) event never occurred.

Version-Release number of selected component (if applicable):
Middleware Manager DR2
CFME: Version 5.9.0.7.20171107212356_ed87902

How reproducible:


Steps to Reproduce:
1. Start Middleware Manager instance, and Add as CFME Provider
2. Start EAP Standalone Server
3. CFME Middleware Provider Refresh Relationships
4. When EAP is displayed in Servers list, manually stop the EAP Server.
 (i.e. Kill the EAP Process on the EAP server)
5. Navigate to the EAP Detailed View, and then Power->Reload Server
6. Refresh Middleware Provider
7. Navigate to Middleware Provider Timelines view, and Display Power Activity with Show Detailed Events checked.
8. Note that there should be two events generated, but only the start event is generated (even after many hours, there was no 2nd Power Event)

Actual results:
Only one Power Event was generated when an EAP Standalone Server Reload was issued after the server was manually stopped.

Expected results:
Two Power Events should be generated, one for the start of the event and a second for the completion/error.

Additional info:

Comment 2 Matt Mahoney 2017-11-14 16:05:49 UTC

Created attachment 1352058 [details]
evm.log

Comment 3 Matt Mahoney 2017-11-14 16:06:23 UTC

Created attachment 1352059 [details]
Screen Shot

Comment 4 Dave Johnson 2017-11-14 16:44:29 UTC

Please assess the impact of this issue and update the severity accordingly.  Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition.

If it's something like a tracker bug where it doesn't matter, please set the severity to Low.

Comment 5 Tomas Coufal 2017-11-21 14:12:50 UTC

Result of my investigation on this bug:

The scenario of a forcefully removed MW server seems to fail entirely for me. Not only are the events not generated properly (it seems Hawkular lets the server respond on its own and does not care if it's no longer present), but the server is also still present in the MIQ inventory after a refresh (even though it should have been removed). I don't think we should be fixing that on the MIQ side; it sounds to me more like a Hawkular-side issue.

Matt, are you facing the same? (EAP not being removed from inventory when killed?) If so we're facing a broader issue than just events...

Cheers
Tom

Comment 6 Matt Mahoney 2017-11-21 14:51:39 UTC

The EAP Server State shows "Stopped", but is not removed from the inventory.

Comment 9 Edgar Hernández 2017-11-27 17:49:51 UTC

This is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1497922

And the conversation about how to deal with that bugzilla happened in github: https://github.com/ManageIQ/manageiq-providers-hawkular/issues/46

Comment 10 Tomas Coufal 2017-11-27 18:03:37 UTC

Edgar, I'm probably missing something. The discussion on the GitHub issue ended as an RFE: MIQ should allow only those operations that make sense for the current power state stored in MIQ. But the point here is how MIQ should behave when such an operation is fired anyway...

Can you please elaborate in detail on the desired workflow for now (without the RFE implemented)? What is the expected MIQ behavior on a killed EAP? And what should happen when such an "illegal" operation is fired? How many events do we expect to have, and should notifications appear, and if so, what kind?

Comment 11 Edgar Hernández 2017-11-27 19:01:44 UTC

I was writing a longer reply before you posted ;)

I agree with Jay in that the server should remain in inventory, but its status should be shown as "stopped".

Regarding the events/notifications... I'm almost sure the "availability change" events are happening and being processed. Even if you kill the EAP, those events should be raised and MiQ should update the status of the server. But the avail change events are different from the ones described in this BZ.

The "events" in this BZ are "timeline events" generated on the MiQ side and are related to the command gateway flow. The first event (the "start" event) is logged/generated as soon as the user clicks the "Reload" operation button. The second event of failure/completion is logged when the cmdgw replies with the status of the operation. If the cmdgw doesn't reply, no event will be logged in the timeline. Note that the cmdgw can provide more information about why an operation could have failed (the avail change events can't provide this). I think this is the reason this BZ exists: the cmdgw does not reply.

In the related bugzilla, Josejulio found that if the server is unavailable, the command gateway will delay that operation until the server becomes available and it won't push any reply to the websocket/MiQ until that happens. MiQ will wait for the response of cmdgw, but that may last forever if the server never becomes available again and that also means that nothing will be logged in the timeline.
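
To make that flow concrete, here is a toy model of it. This is only a sketch: `Timeline`, `DeferringGateway`, `reload_server` and the event names are hypothetical stand-ins, not MiQ or Hawkular code. It assumes the gateway simply queues the operation while the server is unavailable and replies only once it comes back.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Hypothetical stand-in for the MiQ timeline."""
    events: list = field(default_factory=list)

    def log(self, name):
        self.events.append(name)


class DeferringGateway:
    """Toy gateway that holds operations while the server is unavailable."""

    def __init__(self):
        self.server_available = True
        self.pending = []  # reply callbacks waiting for the server to return

    def submit(self, on_reply):
        if self.server_available:
            on_reply("completed")
        else:
            # Nothing is pushed back to the caller until the server returns.
            self.pending.append(on_reply)

    def server_came_back(self):
        self.server_available = True
        while self.pending:
            self.pending.pop(0)("completed")


def reload_server(gateway, timeline):
    # First event: logged as soon as the user triggers the operation.
    timeline.log("reload.start")
    # Second event: logged only if and when the gateway replies.
    gateway.submit(lambda status: timeline.log(f"reload.{status}"))


gateway = DeferringGateway()
gateway.server_available = False  # the EAP process was killed
timeline = Timeline()
reload_server(gateway, timeline)
print(timeline.events)
```

In this model the second event shows up only after `gateway.server_came_back()` is called; if that never happens, the timeline stays at the single start event, which is what the report describes.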

In the GitHub conversation, the suggested way to solve this was to enable/disable the operation buttons based on the status stored in the MiQ database, regardless of whether it's outdated. So, well... you have no option but to implement the RFE for the power events. If, for some reason, the MiQ database has an outdated server status, the absence of the event will be by design.
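
As a rough illustration of that suggestion (not the actual MiQ implementation), the check could look like the sketch below. The `ALLOWED_OPERATIONS` table, the state names and the operation names other than "reload" are assumptions made up for the example.

```python
# Hypothetical table: which operations are offered for each stored state.
ALLOWED_OPERATIONS = {
    "running": {"reload", "restart", "stop"},
    "stopped": set(),  # nothing to offer for a server stored as stopped
}


def enabled_operations(stored_state):
    """Return the operations to enable, based only on the stored state."""
    return ALLOWED_OPERATIONS.get(stored_state.lower(), set())


print(enabled_operations("Running"))
print(enabled_operations("Stopped"))
```

The decision uses only what is stored in the database. If the stored state still says "Running" after the EAP was killed, "reload" remains enabled and the operation can still be fired against a dead server.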

-- Additional comments:

To be honest, I don't like the proposed solution. It sounds more like a workaround to avoid changing the cmdgw. IMO, the cmdgw should be modified to never delay the operation and to reply immediately if the agent is unavailable. Or, maybe, a parameter or agent configs could be used to specify whether the operation can be delayed (I prefer the agent configs approach), or timeouts could be used... Well, the idea is to have something where MiQ won't wait forever, so that error/success events are logged in the timeline.

I guess the delay of the operation is there in case there are connectivity issues between the Agent and H-services. But in the case described in this bugzilla, the absence of the agent is legitimate. At least for power operations, I think it doesn't make sense to delay the operation. Maybe for adding datasources or deployments it may make some sense, but I think it's good to have timeouts.
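
The timeout idea can be sketched as follows. This is an assumption-laden illustration, not cmdgw or MiQ code: the reply channel is modelled as a standard-library `queue.Queue`, and `wait_for_reply`, the event strings and the timeout value are all hypothetical.

```python
import queue


def wait_for_reply(replies, timeline, timeout_seconds):
    """Log a completion or failure event without waiting past the timeout."""
    try:
        status = replies.get(timeout=timeout_seconds)
    except queue.Empty:
        # No reply arrived in time: record a failure instead of waiting forever.
        status = "failed: no reply from command gateway"
    timeline.append(f"reload {status}")
    return status


timeline = ["reload started"]
replies = queue.Queue()  # nothing is ever put here: the agent is gone
wait_for_reply(replies, timeline, timeout_seconds=0.1)
print(timeline)
```

With a bound on the wait, the timeline always receives a second event. The trade-off is that a slow but healthy operation could be reported as failed, which is why the timeout would probably need to be configurable per operation type.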

Comment 12 Tomas Coufal 2017-11-28 09:50:03 UTC

Thanks Edgar, this is very helpful for understanding the situation properly!

So, let me summarize:

- There's an RFE which would work around this issue in a later version (I agree it's more a workaround than a solution)
- The current state of things is that we're missing the ok/fail operation result event because there's no such event at all (it is not present in Hawkular because the gateway waits for an avail change)
- It would be nice to have the behavior changed in Hawkular, but that's subject to a more detailed discussion with stakeholders and is not an issue solvable directly in MIQ

So, how should we approach this bug now? Considering the points above, do we still think it is a bug?

Comment 13 Alissa 2017-11-28 12:49:43 UTC

I agree with the above that:
1. A server in inventory can be up or down. A down server should not be removed from the inventory; it stays in inventory, but its state should be reflected.
2. Even if the RFE to adjust power operations is implemented, it still does not fully address this, because the server's status in MiQ might not be up to date.
3. Feedback on power operations should not be delayed.

Side note - in the above scenario, restart of a stopped server won't work because restart is typically done on an entity that is currently up. The appropriate operation would be "start", which is not supported atm.

Comment 14 Tomas Coufal 2017-12-08 11:02:36 UTC

related to
https://bugzilla.redhat.com/show_bug.cgi?id=1497922
https://bugzilla.redhat.com/show_bug.cgi?id=1452986
https://bugzilla.redhat.com/show_bug.cgi?id=1445233
