Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1402900

Summary:	EAP7 image with Hawkular Agent: Sometimes server is not operable after restart
Product:	[JBoss] Middleware Manager	Reporter:	Hayk Hovsepyan <hhovsepy>
Component:	Agent, download agent	Assignee:	Heiko W. Rupp <hrupp>
Status:	CLOSED NOTABUG	QA Contact:	Hayk Hovsepyan <hhovsepy>
Severity:	low	Docs Contact:
Priority:	unspecified
Version:	unspecified	CC:	mfoley, mmahoney
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-12-15 09:48:12 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Hayk Hovsepyan 2016-12-08 14:58:06 UTC

Description of problem:
After power operations (reload/restart) on EAP7 server from docker container with HA, it is impossible to do any power or other operations on server.
Server logs does not contain HA logs, seems like HA is not working.
And Power operations or deployment operations does not cause anything on server, no logs on server side.

It happened both on Standalone mode or Domain mode servers few times.

Version-Release number of selected component (if applicable):
0.23.0.Final-redhat-1 (commit 6e04108013d71ea5ec501fe14efbb7603466b294)

How reproducible:
not always,
can not reproduce it manually,
saw this intermittent problem few times while running whole automation set,
rerunning automation does not reproduce the problem

Full server logs are attached.
Here is the fragment of last logs on Standalone server side.
From log, server is not reloaded correctly, logs are strange.
"
12:33:06,782 INFO  [org.hawkular.agent.monitor.cmd.AbstractResourcePathCommand] (OkHttp http://docker-haw-services.bc.jonqe.lab.eng.bos.redhat.com:8080/hawkular/command-gateway/feed/a935f1ed-04bc-4909-903b-675715852060) HAWKMONITOR010065: Received request to perform [Reload] on a [DMR Node] given by inventory path [/t;hawkular/f;a935f1ed-04bc-4909-903b-675715852060/r;Local~~]
12:33:06,794 INFO  [org.hawkular.agent.monitor.service.MonitorService] (MSC service thread 1-3) HAWKMONITOR010001: Stopping Hawkular WildFly Agent service
12:33:06,816 WARN  [org.hawkular.agent.monitor.cmd.FeedCommProcessor] (OkHttp http://docker-haw-services.bc.jonqe.lab.eng.bos.redhat.com:8080/hawkular/command-gateway/feed/a935f1ed-04bc-4909-903b-675715852060) HAWKMONITOR010036: Feed communications channel encountered a failure. Response=[<null>]: java.net.SocketException: Socket closed
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at okio.Okio$2.read(Okio.java:139)
	at okio.AsyncTimeout$2.read(AsyncTimeout.java:211)
	at okio.RealBufferedSource.request(RealBufferedSource.java:71)
	at okio.RealBufferedSource.require(RealBufferedSource.java:64)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:77)
	at com.squareup.okhttp.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:110)
	at com.squareup.okhttp.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:99)
	at com.squareup.okhttp.internal.ws.RealWebSocket.readMessage(RealWebSocket.java:97)
	at com.squareup.okhttp.ws.WebSocketCall.createWebSocket(WebSocketCall.java:153)
	at com.squareup.okhttp.ws.WebSocketCall.access$000(WebSocketCall.java:40)
	at com.squareup.okhttp.ws.WebSocketCall$1.onResponse(WebSocketCall.java:98)
	at com.squareup.okhttp.Call$AsyncCall.execute(Call.java:177)
	at com.squareup.okhttp.internal.NamedRunnable.run(NamedRunnable.java:33)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

12:33:06,826 INFO  [org.hawkular.agent.monitor.scheduler.SchedulerService] (MSC service thread 1-3) HAWKMONITOR010013: Stopping scheduler
12:33:06,880 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 150) WFLYUT0022: Unregistered web context: /ovhiwigx_cfme_test_war_middleware
12:33:06,881 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 151) WFLYUT0022: Unregistered web context: /lwdtctzn_cfme_test_war_middleware
12:33:06,889 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 152) WFLYUT0022: Unregistered web context: /dztopbfd_cfme_test_war_middleware
12:33:06,894 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 153) WFLYUT0022: Unregistered web context: /ocfqrhya_cfme_test_war_middleware
12:33:06,897 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-3) WFLYJCA0010: Unbound data source [java:jboss/mysql]
12:33:06,899 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 155) WFLYUT0022: Unregistered web context: /zxfcqlit_cfme_test_war_middleware
12:33:06,921 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-4) WFLYJCA0010: Unbound data source [java:jboss/oracle]
12:33:06,924 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-4) WFLYJCA0010: Unbound data source [java:/MySqlDS]
12:33:06,925 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 162) WFLYUT0022: Unregistered web context: /fqpizwfh_cfme_test_war_middleware
12:33:06,933 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-2) WFLYJCA0019: Stopped Driver service with driver-name = mysql-connector-java-5.1.38-bin.jar_com.mysql.fabric.jdbc.FabricMySQLDriver_5_1
12:33:06,933 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-2) WFLYJCA0019: Stopped Driver service with driver-name = mysql-connector-java-5.1.38-bin.jar_com.mysql.jdbc.Driver_5_1
12:33:06,933 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-2) WFLYJCA0019: Stopped Driver service with driver-name = ojdbc7.jar
12:33:06,933 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-3) WFLYJCA0019: Stopped Driver service with driver-name = postgresql-9.4.1209.jre6.jar
12:33:06,934 INFO  [org.wildfly.extension.undertow] (ServerService Thread Pool -- 157) WFLYUT0022: Unregistered web context: /tvmxzgvx_cfme_test_war_middleware
12:33:06,958 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-3) WFLYJCA0019: Stopped Driver service with driver-name = oracle
12:33:06,972 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-3) WFLYJCA0019: Stopped Driver service with driver-name = mysql
12:33:06,988 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-3) WFLYUT0019: Host default-host stopping
12:33:07,036 INFO  [org.jboss.as.connector.subsystems.datasources] (MSC service thread 1-2) WFLYJCA0010: Unbound data source [java:jboss/datasources/ExampleDS]
12:33:07,056 INFO  [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-3) WFLYJCA0019: Stopped Driver service with driver-name = h2
12:33:07,067 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment postgresql-9.4.1209.jre6.jar (runtime-name: postgresql-9.4.1209.jre6.jar) in 279ms
12:33:07,069 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-3) WFLYSRV0028: Stopped deployment ojdbc7.jar (runtime-name: ojdbc7.jar) in 280ms
12:33:07,070 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment mysql-connector-java-5.1.38-bin.jar (runtime-name: mysql-connector-java-5.1.38-bin.jar) in 281ms
12:33:07,074 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment clpezlfp_cfme_test_jar_middleware.jar (runtime-name: clpezlfp_cfme_test_jar_middleware.jar) in 285ms
12:33:07,077 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-3) WFLYSRV0028: Stopped deployment tvmxzgvx_cfme_test_war_middleware.war (runtime-name: tvmxzgvx_cfme_test_war_middleware.war) in 288ms
12:33:07,078 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0028: Stopped deployment ocfqrhya_cfme_test_war_middleware.war (runtime-name: ocfqrhya_cfme_test_war_middleware.war) in 290ms
12:33:07,078 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0028: Stopped deployment zxfcqlit_cfme_test_war_middleware.war (runtime-name: zxfcqlit_cfme_test_war_middleware.war) in 290ms
12:33:07,078 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-4) WFLYSRV0028: Stopped deployment zsqdqblj_cfme_test_jar_middleware.jar (runtime-name: zsqdqblj_cfme_test_jar_middleware.jar) in 290ms
12:33:07,079 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-4) WFLYSRV0028: Stopped deployment dztopbfd_cfme_test_war_middleware.war (runtime-name: dztopbfd_cfme_test_war_middleware.war) in 290ms
12:33:07,080 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0028: Stopped deployment fqpizwfh_cfme_test_war_middleware.war (runtime-name: fqpizwfh_cfme_test_war_middleware.war) in 291ms
12:33:07,082 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0028: Stopped deployment wkwoqltm_cfme_test_jar_middleware.jar (runtime-name: wkwoqltm_cfme_test_jar_middleware.jar) in 294ms
12:33:07,083 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-4) WFLYSRV0028: Stopped deployment ovhiwigx_cfme_test_war_middleware.war (runtime-name: ovhiwigx_cfme_test_war_middleware.war) in 295ms
12:33:07,085 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-4) WFLYSRV0028: Stopped deployment lwdtctzn_cfme_test_war_middleware.war (runtime-name: lwdtctzn_cfme_test_war_middleware.war) in 297ms
12:33:07,087 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-1) WFLYUT0008: Undertow HTTP listener default suspending
12:33:07,088 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-1) WFLYUT0007: Undertow HTTP listener default stopped, was bound to 0.0.0.0:8080
12:33:07,088 ERROR [org.jboss.as.controller.management-operation] (Hawkular WildFly Agent Full Discovery Scan-Local-1) WFLYCTL0013: Operation ("read-attribute") failed - address: ([("deployment" => "wkwoqltm_cfme_test_jar_middleware.jar")]): java.lang.NullPointerException
	at org.jboss.as.server.deployment.AbstractDeploymentUnitService.getStatus(AbstractDeploymentUnitService.java:111)
	at org.jboss.as.server.deployment.DeploymentStatusHandler$1.execute(DeploymentStatusHandler.java:61)
	at org.jboss.as.controller.AbstractOperationContext.executeStep(AbstractOperationContext.java:890)
	at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:659)
	at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370)
	at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1344)
	at org.jboss.as.controller.ModelControllerImpl.internalExecute(ModelControllerImpl.java:392)
	at org.jboss.as.controller.ModelControllerImpl.execute(ModelControllerImpl.java:204)
	at org.jboss.as.controller.ModelControllerImpl$3.execute(ModelControllerImpl.java:659)
	at org.jboss.as.controller.ModelControllerImpl$3.execute(ModelControllerImpl.java:649)
	at org.hawkular.dmr.api.OperationBuilder$AbstractOperationBuilder.execute(OperationBuilder.java:79)
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchNodes(DMRDriver.java:220)
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchNodes(DMRDriver.java:55)
	at org.hawkular.agent.monitor.protocol.Discovery.discoverChildren(Discovery.java:75)
	at org.hawkular.agent.monitor.protocol.Discovery.discoverChildren(Discovery.java:120)
	at org.hawkular.agent.monitor.protocol.EndpointService.discoverChildren(EndpointService.java:328)
	at org.hawkular.agent.monitor.protocol.EndpointService.access$400(EndpointService.java:78)
	at org.hawkular.agent.monitor.protocol.EndpointService$1.run(EndpointService.java:275)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
	at org.jboss.threads.JBossThread.run(JBossThread.java:320)

12:33:07,095 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-1) WFLYUT0004: Undertow 1.3.25.Final-redhat-1 stopping
12:33:07,111 INFO  [org.hawkular.agent.monitor.service.MonitorService] (MSC service thread 1-3) HAWKMONITOR010004: JNDI binding [java:global/hawkular/agent/api]: unbound
12:33:07,111 INFO  [org.jboss.as.mail.extension] (MSC service thread 1-3) WFLYMAIL0002: Unbound mail session [java:jboss/mail/Default]
12:33:07,113 INFO  [org.hawkular.agent.monitor.service.MonitorService] (MSC service thread 1-3) HAWKMONITOR010004: JNDI binding [java:global/hawkular/agent/api]: unbound
12:33:07,120 ERROR [org.jboss.as.controller.management-operation] (Hawkular WildFly Agent Full Discovery Scan-Local-1) WFLYCTL0013: Operation ("read-attribute") failed - address: ([
    ("socket-binding-group" => "standard-sockets"),
    ("socket-binding" => "txn-status-manager")
]): org.jboss.msc.service.ServiceNotFoundException: Service service org.wildfly.network.socket-binding.txn-status-manager not found
	at org.jboss.msc.service.ServiceContainerImpl.getRequiredService(ServiceContainerImpl.java:669)
	at org.jboss.as.controller.OperationContextImpl$OperationContextServiceRegistry.getRequiredService(OperationContextImpl.java:2211)
	at org.jboss.as.server.services.net.BindingMetricHandlers$AbstractBindingMetricsHandler$1.execute(BindingMetricHandlers.java:63)
	at org.jboss.as.controller.AbstractOperationContext.executeStep(AbstractOperationContext.java:890)
	at org.jboss.as.controller.AbstractOperationContext.processStages(AbstractOperationContext.java:659)
	at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:370)
	at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1344)
	at org.jboss.as.controller.ModelControllerImpl.internalExecute(ModelControllerImpl.java:392)
	at org.jboss.as.controller.ModelControllerImpl.execute(ModelControllerImpl.java:204)
	at org.jboss.as.controller.ModelControllerImpl$3.execute(ModelControllerImpl.java:659)
	at org.jboss.as.controller.ModelControllerImpl$3.execute(ModelControllerImpl.java:649)
	at org.hawkular.dmr.api.OperationBuilder$AbstractOperationBuilder.execute(OperationBuilder.java:79)
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchNodes(DMRDriver.java:220)
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchNodes(DMRDriver.java:55)
	at org.hawkular.agent.monitor.protocol.Discovery.discoverChildren(Discovery.java:75)
	at org.hawkular.agent.monitor.protocol.Discovery.discoverChildren(Discovery.java:120)
	at org.hawkular.agent.monitor.protocol.Discovery.discoverChildren(Discovery.java:120)
	at org.hawkular.agent.monitor.protocol.EndpointService.discoverChildren(EndpointService.java:328)
	at org.hawkular.agent.monitor.protocol.EndpointService.access$400(EndpointService.java:78)
	at org.hawkular.agent.monitor.protocol.EndpointService$1.run(EndpointService.java:275)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
	at org.jboss.threads.JBossThread.run(JBossThread.java:320)

12:33:07,126 INFO  [org.jboss.as] (MSC service thread 1-2) WFLYSRV0050: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) stopped in 338ms
12:33:07,127 INFO  [org.jboss.as] (MSC service thread 1-2) WFLYSRV0049: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) starting
12:33:07,305 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0039: Creating http management service using socket-binding (management-http)
12:33:07,329 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "dztopbfd_cfme_test_war_middleware.war" (runtime-name: "dztopbfd_cfme_test_war_middleware.war")
12:33:07,329 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "wkwoqltm_cfme_test_jar_middleware.jar" (runtime-name: "wkwoqltm_cfme_test_jar_middleware.jar")
12:33:07,330 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "zsqdqblj_cfme_test_jar_middleware.jar" (runtime-name: "zsqdqblj_cfme_test_jar_middleware.jar")
12:33:07,330 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "ocfqrhya_cfme_test_war_middleware.war" (runtime-name: "ocfqrhya_cfme_test_war_middleware.war")
12:33:07,330 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "fqpizwfh_cfme_test_war_middleware.war" (runtime-name: "fqpizwfh_cfme_test_war_middleware.war")
12:33:07,331 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "zxfcqlit_cfme_test_war_middleware.war" (runtime-name: "zxfcqlit_cfme_test_war_middleware.war")
12:33:07,331 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "clpezlfp_cfme_test_jar_middleware.jar" (runtime-name: "clpezlfp_cfme_test_jar_middleware.jar")
12:33:07,331 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "lwdtctzn_cfme_test_war_middleware.war" (runtime-name: "lwdtctzn_cfme_test_war_middleware.war")
12:33:07,331 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "ovhiwigx_cfme_test_war_middleware.war" (runtime-name: "ovhiwigx_cfme_test_war_middleware.war")
12:33:07,332 INFO  [org.jboss.as.server.deployment] (MSC service thread 1-1) WFLYSRV0027: Starting deployment of "tvmxzgvx_cfme_test_war_middleware.war" (runtime-name: "tvmxzgvx_cfme_test_war_middleware.war")
12:33:07,378 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "lwdtctzn_cfme_test_war_middleware.war" (runtime-name : "lwdtctzn_cfme_test_war_middleware.war")
12:33:07,378 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "wkwoqltm_cfme_test_jar_middleware.jar" (runtime-name : "wkwoqltm_cfme_test_jar_middleware.jar")
12:33:07,378 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "fqpizwfh_cfme_test_war_middleware.war" (runtime-name : "fqpizwfh_cfme_test_war_middleware.war")
12:33:07,378 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "tvmxzgvx_cfme_test_war_middleware.war" (runtime-name : "tvmxzgvx_cfme_test_war_middleware.war")
12:33:07,378 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "zxfcqlit_cfme_test_war_middleware.war" (runtime-name : "zxfcqlit_cfme_test_war_middleware.war")
12:33:07,379 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "ovhiwigx_cfme_test_war_middleware.war" (runtime-name : "ovhiwigx_cfme_test_war_middleware.war")
12:33:07,379 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "zsqdqblj_cfme_test_jar_middleware.jar" (runtime-name : "zsqdqblj_cfme_test_jar_middleware.jar")
12:33:07,379 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "ocfqrhya_cfme_test_war_middleware.war" (runtime-name : "ocfqrhya_cfme_test_war_middleware.war")
12:33:07,379 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "clpezlfp_cfme_test_jar_middleware.jar" (runtime-name : "clpezlfp_cfme_test_jar_middleware.jar")
12:33:07,379 INFO  [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0010: Deployed "dztopbfd_cfme_test_war_middleware.war" (runtime-name : "dztopbfd_cfme_test_war_middleware.war")
12:33:07,403 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://0.0.0.0:9990/management
12:33:07,403 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://0.0.0.0:9990
12:33:07,403 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) started in 275ms - Started 213 of 223 services (44 services are lazy, passive or on-demand)
12:43:07,176 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0009: Content /opt/eap/standalone/data/content/7c/9b5984b2c1e32e7c8cf3331df77f31e89e24c2 is obsolete and will be removed
12:43:07,181 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0002: Content removed from location /opt/eap/standalone/data/content/7c/9b5984b2c1e32e7c8cf3331df77f31e89e24c2/content
12:43:07,181 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0009: Content /opt/eap/standalone/data/content/47/1bcc236d5d85690046eb69914c271411029ffe is obsolete and will be removed
12:43:07,181 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0002: Content removed from location /opt/eap/standalone/data/content/47/1bcc236d5d85690046eb69914c271411029ffe/content
12:43:07,182 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0009: Content /opt/eap/standalone/data/content/2e/d7a5a8c952d9ea18af92efea7e56ef854abfea is obsolete and will be removed
12:43:07,182 INFO  [org.jboss.as.repository] (ServerService Thread Pool -- 31) WFLYDR0002: Content removed from location /opt/eap/standalone/data/content/2e/d7a5a8c952d9ea18af92efea7e56ef854abfea/content
"

Comment 1 Heiko W. Rupp 2016-12-08 18:55:38 UTC

Mazz can you please have a look?

Comment 2 John Mazzitelli 2016-12-08 19:07:46 UTC

Looking at all the timestamps on all those stack traces coming from the agent, it is clear they are happening during the shutdown of the agent service (the reload is shutting down all the services). The server internals are finally all stopped at this message:

12:33:07,126 INFO  [org.jboss.as] (MSC service thread 1-2) WFLYSRV0050: JBoss EAP 7.0.3.GA (WildFly Core 2.1.9.Final-redhat-1) stopped in 338ms

Then at time 12:33:07,127 the server begins starting up again. I see no messages at all from the agent here. It is as if the EAP instance did not start up the agent service again.

I do not know why this is. But at this point, if EAP does not start the agent service then, yes, all agent functionality will be unavailable.

The fact this is not repeatable is odd. You would think if something is wrong with the agent it would happen all the time. I'm not sure why EAP would not be able to restart services during a reload.

Comment 3 John Mazzitelli 2016-12-08 19:09:02 UTC

Oh, and the fact that "Server logs does not contain HA logs, seems like HA is not working." tells me this problem with EAP reload is more than just affecting the agent. Seems like it is affecting other non-agent related EAP subsystems (like the HA subsystems). That tells me that this is a problem internal to EAP and not directly attributable to the agent itself.

Comment 4 John Mazzitelli 2016-12-09 21:34:19 UTC

Ok, I just ran some tests over here. I can tell you what goes on under the covers.

I ran EAP 7 in standalone mode, and had an agent manage it as a <remote-dmr> endpoint. So the agent is running elsewhere, outside of the EAP 7 instance. Which I assume is what the Hayk is doing.

When you do a "Reload" operation (that is, you send a "Reload" ExecuteOperationRequest into the agent), the agent will simply send a CLI command "/:reload" to the EAP7 server. Once that request is send, the agent performs a full discovery scan.

When the EAP7 performs a reload, it shuts all of its internals down, and then starts them back up again.

But there is a race condition here. The agent is doing a full discovery scan, so it is possible the agent is going to remove the server from inventory if it looks like it has been removed (and once its internals is shutdown, the server will look like it has been removed). So the agent removes it from its internal inventory in memory. At this point, you can't do anything to the EAP server via the agent (like send in other websocket commands) because the server is gone from the agent's inventory in memory. If the agent is slow in its automatic discovery scan, the EAP7 could have finished its reload, and came back up in time for the agent's scan to still see it - thus, nothing will look wrong.

Once the agent runs its discovery scan again, and assuming the EAP7 has come back up after the reload finished, the agent re-discovers it, puts it back in memory, and you can then do things via websocket commands to the EAP7 again.

You can manually kick off a discovery scan by sending the agent an ExecuteOperationRequest (and target the agent resource itself) and ask to run the operation "Inventory Discovery Scan" - this is defined in the default agent config as this:

<operation-dmr name="Inventory Discovery Scan" internal-name="fullDiscoveryScan" />

if the EAP7 successfully completed its reload, the agent should detect it again and pull it back into inventory.

NOTE!! I still am not convinced this is the full answer, because this BZ's description says that EAP7's HA features don't work after the reload. So something about the reload is failing - and this could be why the agent can no longer do anything - because perhaps the EAP7 isn't in a good state.

Comment 6 Hayk Hovsepyan 2016-12-13 14:25:25 UTC

Was able to constantly reproduce the problem, and it happens not only in docker container, but on standalone running EAP7 server.

Steps to reproduce:
1. Start EAP7 server in standalone mode with configured Hawkular Agent to Hawkular Services.
2. Add Middleware provider in CFME.
3. List servers on CFME Middleware provider. You will find just started EAP7 server.
4.  Deploy application archive on EAP7 server via CFME UI Menu Item "Add Deployment". Keep "Enable Deployment" "Yes".
5. Make sure that Deployments List for that EAP7 server contains newly deployed archive.
5. Deploy the newest version of that archive, but this time: "Enable Deployment" is "No" and "Overwrite (if exists) " is "Yes". So it overwrites and disables the archive.
6. Wait for some time, until Hawkular Agent on Server side will log this error in a loop:
"2016-12-13 09:05:51,001 ERROR [org.jboss.as.controller.management-operation] (Hawkular-WildFly-Agent-Scheduler-Metrics-2) WFLYCTL0013: Operation ("read-attribute") failed - address: ([
    ("deployment" => "xmwpvphp_cfme_test_war_middleware.war"),
    ("subsystem" => "undertow")
]) - failure description: "WFLYCTL0216: Management resource '[
    (\"deployment\" => \"xmwpvphp_cfme_test_war_middleware.war\"),
    (\"subsystem\" => \"undertow\")
]' not found"
2016-12-13 09:05:51,003 ERROR [org.hawkular.agent.monitor.protocol.EndpointService] (Hawkular-WildFly-Agent-Scheduler-Metrics-2) HAWKMONITOR010049: Could not access resources of endpoint [DMREndpointService[Endpoint[eap7standalone2]:[null]]]: org.hawkular.agent.monitor.protocol.ProtocolException: Unsuccessful fetching DMR attribute [max-active-sessions]
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchAttribute(DMRDriver.java:158)
	at org.hawkular.agent.monitor.protocol.EndpointService.measureMetrics(EndpointService.java:447)
	at org.hawkular.agent.monitor.scheduler.MetricsCollector.run(MetricsCollector.java:60)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
	at org.jboss.threads.JBossThread.run(JBossThread.java:320)
Caused by: org.hawkular.dmr.api.OperationFailureException: Could not perform operation [read-attribute]: WFLYCTL0216: Management resource '[
    ("deployment" => "xmwpvphp_cfme_test_war_middleware.war"),
    ("subsystem" => "undertow")
]' not found
	at org.hawkular.dmr.api.OperationBuilder$OperationResult.assertSuccess(OperationBuilder.java:422)
	at org.hawkular.agent.monitor.protocol.dmr.DMRDriver.fetchAttribute(DMRDriver.java:155)
	... 8 more
"
7. Reload the EAP7 server via CFME UI Menu Item Power Operation "Reload Server".

Here on EAP7 server side, you can see that during server reload, services are not started, and agent is not started as well.
Any further operation via CFME UI on that server will not start, and no server logs.

No any error in evm.log of CFME during operations:
"[----] I, [2016-12-13T09:10:50.440375 #2669:6bf134]  INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 2 miq_queue rows for queue_name=generic, wcount=4, priority=200
[----] I, [2016-12-13T09:10:50.979092 #2905:6bf134]  INFO -- : MIQ(MiqPriorityWorker::Runner#get_message_via_drb) Message id: [92422], MiqWorker id: [48], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [job_dispatcher], Command: [JobProxyDispatcher.dispatch], Timeout: [600], Priority: [20], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [4.993216103] seconds
[----] I, [2016-12-13T09:10:50.979546 #2905:6bf134]  INFO -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [92422], Delivering...
[----] I, [2016-12-13T09:10:50.987704 #2905:6bf134]  INFO -- : Q-task_id([job_dispatcher]) MIQ(JobProxyDispatcher#dispatch) Complete - Timings: {:pending_container_jobs=>0.003825664520263672, :container_jobs_to_dispatch_count=>0, :container_dispatching=>0.0038614273071289062, :pending_vm_jobs=>0.0007607936859130859, :vm_jobs_to_dispatch_count=>0, :total_time=>0.0067653656005859375}
[----] I, [2016-12-13T09:10:50.988264 #2905:6bf134]  INFO -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#delivered) Message id: [92422], State: [ok], Delivered in [0.008759507] seconds
[----] I, [2016-12-13T09:10:51.883257 #2886:6bf134]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [92421], MiqWorker id: [46], Zone: [default], Role: [], Server: [ee9fe06e-bc2d-11e6-b241-001c4c160162], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [Session.check_session_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [5.903997207] seconds
[----] I, [2016-12-13T09:10:51.883611 #2886:6bf134]  INFO -- : MIQ(MiqQueue#deliver) Message id: [92421], Delivering...
[----] I, [2016-12-13T09:10:51.896032 #2886:6bf134]  INFO -- : MIQ(MiqQueue#delivered) Message id: [92421], State: [ok], Delivered in [0.012402141] seconds
"

Comment 7 John Mazzitelli 2016-12-13 14:36:50 UTC

Just quickly off the top of my head, the error in your step 6 is to be expected. You deployed a new deployment war and told the agent not to enable it. So I suspect a disabled war doesn't have an "undertow" sub-resource under it and so any metrics or resource config props we try to read from it will fail since it doesn't exist. So that is probably just a "to be expected" though the logs are ugly in this case.

It is step 7 which is the important one where you ask to reload the server.

> Here on EAP7 server side, you can see that during server reload,
> services are not started, and agent is not started as well.

So it sounds like EAP7's "reload" is not working.

Comment 8 John Mazzitelli 2016-12-13 14:45:01 UTC

Try this without any hawkular installed (deploy apps, disable them, then reload an EAP without any hawkular agent).

I did a quick search in JIRA and I see several problems that were recently resolved that involved internal WildFly components not working after a reload. I suspect this is another one of those times when reload fails to put the EAP in a good state, such as:

https://issues.jboss.org/browse/WFLY-2220
https://issues.jboss.org/browse/WFLY-3355
https://issues.jboss.org/browse/WFLY-5261

I don't know what versions of WildFly they were fixed in and if the EAP version being tested has those fixed. But even still, the point is that it looks like reload has in the past caused problems - this is probably one of them.

Comment 9 Hayk Hovsepyan 2016-12-13 15:52:06 UTC

Was able to reproduce the issue on EAP 7.0.3 GA manually without hawkular agent, but on EAP 7.1.0.DR9 issue is fixed.
Have tried 7.1.0.DR9 with Hawkualar agent and it works fine.

Comment 10 Heiko W. Rupp 2016-12-15 09:48:12 UTC

According to #c9 Hawkular-services is fine and I am closing this one.