Created attachment 1352057 [details]
middleware Manager Log

Description of problem:
Expected Two Events, one "start" event and one "Completed/Failed" event when reloading an EAP that has gone off line. But only the start event was generated, and no 2nd Error event occurred.

Version-Release number of selected component (if applicable):
Middleware Manager DR2
CFME: Version 5.9.0.7.20171107212356_ed87902

How reproducible:

Steps to Reproduce:
1. Start Middleware Manager instance, and Add as CFME Provider
2. Start EAP Standalone Server
3. CFME Middleware Provider Refresh Relationships
4. When EAP is displayed in Servers list, manually stop the EAP Server. (i.e. Kill the EAP Process on the EAP server)
5. Navigate to the EAP Detailed View, and then Power->Reload Server
6. Refresh Middleware Provider
7. Navigate to Middleware Provider Timelines view, and Display Power Activity with Show Detailed Events checked.
8. Note that there should be two events generated, but only the start event is generated (even after many hours, there was no 2nd Power Event)

Actual results:
One Power Event was generated when EAP Standalone Server Reload was issued, after Server was manually stopped

Expected results:
Two Power Events should be generated, one for the start of the event and a second for the completion/error.

Additional info:
Created attachment 1352058 [details] evm.log
Created attachment 1352059 [details] Screen Shot
Please assess the impact of this issue and update the severity accordingly. Please refer to https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity for a reminder on each severity's definition. If it's something like a tracker bug where it doesn't matter, please set the severity to Low.
Result of my investigation on this bug:

The scenario of a forcefully removed MW server seems to fail entirely for me. Not only are the events not generated properly (it seems Hawkular lets the server respond on its own and does not care whether it is still present), but the server is also still present in MIQ inventory after a refresh (although it should have been removed).

I don't think we should fix that on the MIQ side; it sounds to me more like a Hawkular-side issue.

Matt, are you seeing the same (EAP not being removed from inventory when killed)? If so, we're facing a broader issue than just events...

Cheers
Tom
The EAP Server State shows "Stopped", but is not removed from the inventory.
This is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1497922 And the conversation about how to deal with that bugzilla happened in github: https://github.com/ManageIQ/manageiq-providers-hawkular/issues/46
Edgar, I'm probably missing something. The discussion on the GitHub issue ended as an RFE: MIQ should allow only those operations that make sense for the current power state stored in MIQ. But the point here is how MIQ should behave once the operation has been fired...

Can you please elaborate on the desired workflow for now (without the RFE implemented)? What is the expected MIQ behavior on a killed EAP? And what should happen when such an "illegal" operation is fired? How many events do we expect, and which notifications, if any, should appear?
I was writing a longer reply before you posted ;)

I agree with Jay that the server should remain in inventory, but its status should be shown as "stopped".

Regarding the events/notifications... I'm almost sure the "availability change" events are happening and being processed. Even if you kill the EAP, the events should be raised and MiQ should update the status of the server. But the avail change events are different from the ones described in this BZ.

The "events" in this BZ are "timeline events" generated on the MiQ side and are related to the command gateway flow. The first event (the "start" event) is logged as soon as the user clicks the "Reload" operation button. The second event, failure or completion, is logged when the cmdgw replies with the status of the operation. If the cmdgw doesn't reply, no event will be logged in the timeline. Note that the cmdgw can provide more information about why an operation failed (the avail change events can't provide this).

I think this is the reason this BZ exists: the cmdgw does not reply. In the related bugzilla, Josejulio found that if the server is unavailable, the command gateway delays the operation until the server becomes available and won't push any reply to the websocket/MiQ until that happens. MiQ will wait for the response from the cmdgw, but that wait may last forever if the server never becomes available again, which also means that nothing will be logged in the timeline.

In the GitHub conversation, the suggested way to solve this was to enable/disable the operation buttons based on the status stored in the MiQ database, regardless of whether it's outdated. So, well... you have no option but to implement the RFE for the power events. If for some reason the MiQ database has an outdated server status, the absence of the event will be by design.

--
Additional comments:
To be honest, I don't like the proposed solution. It sounds more like a workaround to avoid changing the cmdgw.
IMO, the cmdgw should be modified to never delay the operation and to reply immediately if the agent is unavailable. Or, maybe, a parameter or agent config could be used to specify whether the operation can be delayed (I prefer the agent config approach), or timeouts could be used... Well, the idea is to have something where MiQ won't wait forever, so that error/success events are logged in the timeline.

I guess the delay of the operation is there in case there are connectivity issues between the Agent and H-services. But in the case described in this bugzilla, the absence of the agent is legitimate. At least for power operations, I think it doesn't make sense to delay the operation. Maybe for adding datasources or deployments it makes some sense, but I think it's good to have timeouts.
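To illustrate the timeout idea, here is a minimal sketch in Python. It is not MiQ or Hawkular code: `run_operation`, the `timeline` list, and the `replies` queue standing in for the cmdgw websocket are all hypothetical names. It only shows that bounding the wait guarantees a second timeline event even when the gateway never answers.

```python
import queue


def run_operation(timeline, replies, operation, timeout_s):
    """Log a start event, then exactly one result event.

    `replies` is a hypothetical stand-in for the channel on which the
    command gateway would push the operation status.
    """
    timeline.append((operation, "start"))
    try:
        # Block until a reply arrives or the timeout elapses.
        status = replies.get(timeout=timeout_s)
    except queue.Empty:
        # No reply in time: record a failure instead of waiting forever.
        status = "failed: no reply within timeout"
    timeline.append((operation, status))
    return timeline


# A gateway that never replies (e.g. the EAP was killed) still yields two events.
timeline = run_operation([], queue.Queue(), "reload", timeout_s=0.1)
print(timeline)
```

Whether the timeout would live in the cmdgw, in the agent config, or on the MiQ side is exactly the open question above; the sketch does not take a position on that.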
Thanks Edgar, this is very helpful for understanding the situation properly!

So, let me summarize:
- There's an RFE which would work around this issue in a later version (I agree it's more a workaround than a solution)
- The current state of things is that we're missing the ok/fail operation result event because there's no such event at all (it is not present in Hawkular because of the wait for an avail change)
- It would be nice to have the behavior changed in Hawkular, but that needs a more detailed discussion with stakeholders and is not an issue solvable directly in MIQ

So, how should we approach this bug now? Considering the points above, do we still think it is a bug?
I agree with the above that:
1. A server in inventory can be up or down. A down server should not be removed from the inventory; it stays in inventory, but its state should be reflected.
2. Even if the RFE to adjust power operations is implemented, it still does not fully address this, because the server's status in miq might not be up to date.
3. Feedback on power operations should not be delayed.

Side note - in the above scenario, restart of a stopped server won't work because restart is typically done on an entity that is currently up; the appropriate operation would be "start", which is not supported atm.
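For reference, the gating discussed in the RFE could look roughly like the Python sketch below. The state names, the `ALLOWED_OPERATIONS` table, and `operation_allowed` are hypothetical and not taken from MiQ; the check also relies on the stored state, so it shares the limitation from point 2 when that state is stale.

```python
# Hypothetical mapping from the server state stored in the database
# to the power operations that make sense in that state.
ALLOWED_OPERATIONS = {
    "running": {"reload", "restart"},
    # "start" would belong here, but it is not supported at the moment.
    "stopped": set(),
}


def operation_allowed(state, operation):
    """Return True if `operation` makes sense for the stored `state`."""
    # Unknown states allow nothing, which is the conservative default.
    return operation in ALLOWED_OPERATIONS.get(state, set())


print(operation_allowed("running", "reload"))
print(operation_allowed("stopped", "restart"))
```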
related to
https://bugzilla.redhat.com/show_bug.cgi?id=1497922
https://bugzilla.redhat.com/show_bug.cgi?id=1452986
https://bugzilla.redhat.com/show_bug.cgi?id=1445233