| Summary: | EAP7 image with Hawkular Agent: Sometimes server is not operable after restart | ||
|---|---|---|---|
| Product: | [JBoss] Middleware Manager | Reporter: | Hayk Hovsepyan <hhovsepy> |
| Component: | Agent, download agent | Assignee: | Heiko W. Rupp <hrupp> |
| Status: | CLOSED NOTABUG | QA Contact: | Hayk Hovsepyan <hhovsepy> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | unspecified | CC: | mfoley, mmahoney |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-12-15 09:48:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Hayk Hovsepyan
2016-12-08 14:58:06 UTC
Mazz can you please have a look? Looking at all the timestamps on all those stack traces coming from the agent, it is clear they are happening during the shutdown of the agent service (the reload is shutting down all the services). The server internals are finally all stopped at this message: 12:33:07,126 INFO [org.jboss.as] (MSC service thread 1-2) WFLYSRV0050: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) stopped in 338ms Then at time 12:33:07,127 the server begins starting up again. I see no messages at all from the agent here. It is as if the EAP instance did not start up the agent service again. I do not know why this is. But at this point, if EAP does not start the agent service then, yes, all agent functionality will be unavailable. The fact this is not repeatable is odd. You would think if something is wrong with the agent it would happen all the time. I'm not sure why EAP would not be able to restart services during a reload. Oh, and the fact that "Server logs does not contain HA logs, seems like HA is not working." tells me this problem with EAP reload is more than just affecting the agent. Seems like it is affecting other non-agent related EAP subsystems (like the HA subsystems). That tells me that this is a problem internal to EAP and not directly attributable to the agent itself. Ok, I just ran some tests over here. I can tell you what goes on under the covers.
I ran EAP 7 in standalone mode, and had an agent manage it as a <remote-dmr> endpoint. So the agent is running elsewhere, outside of the EAP 7 instance. Which I assume is what the Hayk is doing.
When you do a "Reload" operation (that is, you send a "Reload" ExecuteOperationRequest into the agent), the agent will simply send a CLI command "/:reload" to the EAP7 server. Once that request is send, the agent performs a full discovery scan.
When the EAP7 performs a reload, it shuts all of its internals down, and then starts them back up again.
But there is a race condition here. The agent is doing a full discovery scan, so it is possible the agent is going to remove the server from inventory if it looks like it has been removed (and once its internals is shutdown, the server will look like it has been removed). So the agent removes it from its internal inventory in memory. At this point, you can't do anything to the EAP server via the agent (like send in other websocket commands) because the server is gone from the agent's inventory in memory. If the agent is slow in its automatic discovery scan, the EAP7 could have finished its reload, and came back up in time for the agent's scan to still see it - thus, nothing will look wrong.
Once the agent runs its discovery scan again, and assuming the EAP7 has come back up after the reload finished, the agent re-discovers it, puts it back in memory, and you can then do things via websocket commands to the EAP7 again.
You can manually kick off a discovery scan by sending the agent an ExecuteOperationRequest (and target the agent resource itself) and ask to run the operation "Inventory Discovery Scan" - this is defined in the default agent config as this:
<operation-dmr name="Inventory Discovery Scan" internal-name="fullDiscoveryScan" />
if the EAP7 successfully completed its reload, the agent should detect it again and pull it back into inventory.
NOTE!! I still am not convinced this is the full answer, because this BZ's description says that EAP7's HA features don't work after the reload. So something about the reload is failing - and this could be why the agent can no longer do anything - because perhaps the EAP7 isn't in a good state.
Was able to constantly reproduce the problem, and it happens not only in docker container, but on standalone running EAP7 server.
Steps to reproduce:
1. Start EAP7 server in standalone mode with configured Hawkular Agent to Hawkular Services.
2. Add Middleware provider in CFME.
3. List servers on CFME Middleware provider. You will find just started EAP7 server.
4. Deploy application archive on EAP7 server via CFME UI Menu Item "Add Deployment". Keep "Enable Deployment" "Yes".
5. Make sure that Deployments List for that EAP7 server contains newly deployed archive.
5. Deploy the newest version of that archive, but this time: "Enable Deployment" is "No" and "Overwrite (if exists) " is "Yes". So it overwrites and disables the archive.
6. Wait for some time, until Hawkular Agent on Server side will log this error in a loop:
"2016-12-13 09:05:51,001 ERROR [org.jboss.as.controller.management-operation] (Hawkular-WildFly-Agent-Scheduler-Metrics-2) WFLYCTL0013: Operation ("read-attribute") failed - address: ([
("deployment" => "xmwpvphp_cfme_test_war_middleware.war"),
("subsystem" => "undertow")
]) - failure description: "WFLYCTL0216: Management resource '[
(\"deployment\" => \"xmwpvphp_cfme_test_war_middleware.war\"),
(\"subsystem\" => \"undertow\")
]' not found"
2016-12-13 09:05:51,003 ERROR [org.hawkular.agent.monitor.protocol.EndpointService] (Hawkular-WildFly-Agent-Scheduler-Metrics-2) HAWKMONITOR010049: Could not access resources of endpoint [DMREndpointService[Endpoint[eap7standalone2]:[null]]]: org.hawkular.agent.monitor.protocol.ProtocolException: Unsuccessful fetching DMR attribute [max-active-sessions]
at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchAttribute(DMRDriver.java:158)
at org.hawkular.agent.monitor.protocol.EndpointService.measureMetrics(EndpointService.java:447)
at org.hawkular.agent.monitor.scheduler.MetricsCollector.run(MetricsCollector.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at org.jboss.threads.JBossThread.run(JBossThread.java:320)
Caused by: org.hawkular.dmr.api.OperationFailureException: Could not perform operation [read-attribute]: WFLYCTL0216: Management resource '[
("deployment" => "xmwpvphp_cfme_test_war_middleware.war"),
("subsystem" => "undertow")
]' not found
at org.hawkular.dmr.api.OperationBuilder$OperationResult.assertSuccess(OperationBuilder.java:422)
at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchAttribute(DMRDriver.java:155)
... 8 more
"
7. Reload the EAP7 server via CFME UI Menu Item Power Operation "Reload Server".
Here on EAP7 server side, you can see that during server reload, services are not started, and agent is not started as well.
Any further operation via CFME UI on that server will not start, and no server logs.
No any error in evm.log of CFME during operations:
"[----] I, [2016-12-13T09:10:50.440375 #2669:6bf134] INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 2 miq_queue rows for queue_name=generic, wcount=4, priority=200
[----] I, [2016-12-13T09:10:50.979092 #2905:6bf134] INFO -- : MIQ(MiqPriorityWorker::Runner#get_message_via_drb) Message id: [92422], MiqWorker id: [48], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [job_dispatcher], Command: [JobProxyDispatcher.dispatch], Timeout: [600], Priority: [20], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [4.993216103] seconds
[----] I, [2016-12-13T09:10:50.979546 #2905:6bf134] INFO -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [92422], Delivering...
[----] I, [2016-12-13T09:10:50.987704 #2905:6bf134] INFO -- : Q-task_id([job_dispatcher]) MIQ(JobProxyDispatcher#dispatch) Complete - Timings: {:pending_container_jobs=>0.003825664520263672, :container_jobs_to_dispatch_count=>0, :container_dispatching=>0.0038614273071289062, :pending_vm_jobs=>0.0007607936859130859, :vm_jobs_to_dispatch_count=>0, :total_time=>0.0067653656005859375}
[----] I, [2016-12-13T09:10:50.988264 #2905:6bf134] INFO -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#delivered) Message id: [92422], State: [ok], Delivered in [0.008759507] seconds
[----] I, [2016-12-13T09:10:51.883257 #2886:6bf134] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [92421], MiqWorker id: [46], Zone: [default], Role: [], Server: [ee9fe06e-bc2d-11e6-b241-001c4c160162], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Session.check_session_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [5.903997207] seconds
[----] I, [2016-12-13T09:10:51.883611 #2886:6bf134] INFO -- : MIQ(MiqQueue#deliver) Message id: [92421], Delivering...
[----] I, [2016-12-13T09:10:51.896032 #2886:6bf134] INFO -- : MIQ(MiqQueue#delivered) Message id: [92421], State: [ok], Delivered in [0.012402141] seconds
"
Just quickly off the top of my head, the error in your step 6 is to be expected. You deployed a new deployment war and told the agent not to enable it. So I suspect a disabled war doesn't have an "undertow" sub-resource under it and so any metrics or resource config props we try to read from it will fail since it doesn't exist. So that is probably just a "to be expected" though the logs are ugly in this case.
It is step 7 which is the important one where you ask to reload the server.
> Here on EAP7 server side, you can see that during server reload,
> services are not started, and agent is not started as well.
So it sounds like EAP7's "reload" is not working.
Try this without any hawkular installed (deploy apps, disable them, then reload an EAP without any hawkular agent). I did a quick search in JIRA and I see several problems that were recently resolved that involved internal WildFly components not working after a reload. I suspect this is another one of those times when reload fails to put the EAP in a good state, such as: https://issues.jboss.org/browse/WFLY-2220 https://issues.jboss.org/browse/WFLY-3355 https://issues.jboss.org/browse/WFLY-5261 I don't know what versions of WildFly they were fixed in and if the EAP version being tested has those fixed. But even still, the point is that it looks like reload has in the past caused problems - this is probably one of them. Was able to reproduce the issue on EAP 7.0.3 GA manually without hawkular agent, but on EAP 7.1.0.DR9 issue is fixed. Have tried 7.1.0.DR9 with Hawkualar agent and it works fine. According to #c9 Hawkular-services is fine and I am closing this one. |