Bug 1330180

Summary: EAP 7 with secured management become unavailable after ~8 days
Product: [JBoss] JBoss Operations Network Reporter: Filip Brychta <fbrychta>
Component: Plugin -- JBoss EAP 7Assignee: Simeon Pinder <spinder>
Status: CLOSED NOTABUG QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: high    
Version: JON 3.3.5CC: hrupp, jmartine, loleary, rhatlapa
Target Milestone: CR02Keywords: Triaged
Target Release: One-off release   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-09 14:47:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
thread dumps
none
tcpdump none

Description Filip Brychta 2016-04-25 14:30:46 UTC
Created attachment 1150524 [details]
thread dumps

Description of problem:
I found EAP7 domain controller marked as down after running ~8 days without problems. The resource has been marked as down since following msg in agent.log:

2016-04-23 11:15:12,001 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=13648, uuid=d8969534-6e2a-4cff-b0bd-bb395f48bfcc, type={EAP7}EAP7 Host Controller, key=hostConfig: /home/hudson/jboss-eap7-domain/domain/configuration/host-secured-changed.xml, name=EAP 7 Domain Controller (master 0.0.0.0:8443), parent=fbr-eap7-sec.bc.jonqe.lab.eng.bos.redhat.com, version=EAP 7.0.0.GA] - cause: java.lang.Exception:Failed to read attribute [name] of address [] - response: Result{outcome='failure', failureDescription=The server closed the connection before sending the response, rolled-back=false, rolledBack=false}

The EAP domain controller process was still running on the host but the agent was throwing following errors periodically and the resource was marked as down:

    2016-04-25 04:21:04,729 DEBUG [ResourceContainer.invoker.availCheck.daemon-69] (rhq.modules.plugins.wildfly10.BaseServerComponent)- ResourceType[id=0, name=EAP7 Host Controller, plugin=EAP7, category=Server] [hostConfig: /home/hudson/jboss-eap7-domain/domain/configuration/host-secured-changed.xml]: exception while checking availability
    java.lang.Exception: Failed to read attribute [name] of address [] - response: Result{outcome='failure', failureDescription=The server closed the connection before sending the response, rolled-back=false, rolledBack=false}
            at org.rhq.modules.plugins.wildfly10.BaseComponent.readAttribute(BaseComponent.java:835)
            at org.rhq.modules.plugins.wildfly10.BaseComponent.readAttribute(BaseComponent.java:805)
            at org.rhq.modules.plugins.wildfly10.BaseServerComponent.getAvailability(BaseServerComponent.java:145)
            at org.rhq.core.pc.inventory.AvailabilityProxy.call(AvailabilityProxy.java:143)
            at org.rhq.core.pc.inventory.AvailabilityProxy.call(AvailabilityProxy.java:52)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)



Restart of agent didn't resolve the issue. The issues disappeared when the EAP was restarted.

Version-Release number of selected component (if applicable):
JON3.3.5 with EAP plugin pack 04-ER01

How reproducible:
Not sure

Steps to Reproduce:
1. install/start EAP7 in domain mode with secured management (2-way ssl)
2. import it to JON and set correct Connection properties to have it UP
3. wait ~8days

Actual results:
The resource is marked as down and agent is throwing previous err on debug level.

Expected results:
It should continue to work.

Additional info:
Discussed the issue with EAP QE and they are not aware of any similar issue on EAP 7 side. It was suggested that EAP got into this "broken" state because of interaction with JON agent which is causing exhaustion of connection pool.

Thread dumps of agent, domain controller and host controller processes are attached.

Suggested steps to get more info next time the issue is hit:
- try to connect with jboss-cli directly to see if it's refused as well
- use javax.net.debug=all to log handshake details
- capture traffic via tcpdump -w (-w option is important to be able to import results to wireshark which is able to interpret SSL handshake)

Comment 1 Filip Brychta 2016-04-25 14:35:37 UTC
Forgot to mention that the issue is not visible on EAP6

Comment 2 Filip Brychta 2016-04-25 14:40:45 UTC
Created attachment 1150526 [details]
tcpdump

Adding tcpdump, unfortunately without -w option

Comment 3 Heiko W. Rupp 2016-04-26 12:53:24 UTC
Do we know if there were many (parallel) tcp connections open between agent and the DC?

Comment 4 Heiko W. Rupp 2016-04-26 12:55:43 UTC
(In reply to Heiko W. Rupp from comment #3)
> Do we know if there were many (parallel) tcp connections open between agent
> and the DC?

Something like regular

netstat -an | grep 999 to see how many connections end up at the EAP. See if that number grows

Comment 5 Filip Brychta 2016-04-26 13:21:23 UTC
I didn't save output of netstat but I didn't notice suspiciously large amount of connections. But I don't remember if It was before or after restart of agent. I have thread dump of both agent and DC. Is it possible to guess number of connections from there?

Comment 6 Josejulio Martínez 2016-04-26 17:22:48 UTC
Could you attach the logs?

Comment 7 Filip Brychta 2016-04-27 08:26:29 UTC
All I have is attached. Complete agent log was unfortunately rolled out because it's keeping only two 5000KB backup files by default and debug mode is so noisy.

Only relevant msgs which I found before it was rolled out are pasted in description. Nothing in eap logs (host-controller.log, process-controller.log)

Comment 8 Filip Brychta 2016-04-27 08:33:55 UTC
I tried to move system date 10 days forward and it was not reproduced this way.

Comment 10 Filip Brychta 2016-04-29 09:58:44 UTC
I will run following curl periodically to simulate the operation which is failing (exception above):
curl --insecure --cert-type pem --key key.pem --cert ./cert.pem  "https://127.0.0.1:844/management" --header "Content-Type: application/json" -d '{"operation":"read-attribute","name":"name","json.pretty":1}'

Comment 11 Radim Hatlapatka 2016-05-02 13:07:52 UTC
We are running EAP with two way ssl setup for management interface and sending request once in a second. Request is similar to the one Filip describes in Comment #10. It is now running for more than 3 days not seeing any issue so far.

Comment 12 Filip Brychta 2016-05-05 15:43:19 UTC
I now have 3 instances in "broken" state. It happened again after 8.1 days.
- it's not necessary to use 2-way SSL, the issue is visible when using 1-way SSL as well
- it's irrelevant if there is another EAP 7 running on the host
- EAP 7 standalone mode is not yet affected by this issue, but it might be later??
- jboss-cli is working correctly

Following err in host-controller.log:

2016-05-05 11:42:40,271 ERROR [io.undertow.request] (Host Controller Service Threads - 54116) UT005071: Undertow request failed HttpServerExchange{ POST /management request {Accept=[application/json], Connection=[Keep-Alive], Authorization=[Digest username="rhqadmin", realm="ManagementRealm", nonce="5LMqqFybGWoNMTQ2MjQ2Mjk2MDA0Mna3aWvkOayVgASy9bsRABI=", uri="/management", response="e194f9abf448bf1f7a1fb411cff6abc8", qop=auth, nc=00000001, cnonce="25e2dd1725f8752c", algorithm="MD5", opaque="00000000000000000000000000000000"], Content-Length=[74], Content-Type=[application/json; charset=UTF-8], User-Agent=[Apache-HttpClient/4.2.5 (java 1.5)], Host=[127.0.0.1:8443]} response {Connection=[keep-alive], X-Frame-Options=[SAMEORIGIN], Authentication-Info=[nextnonce="5LMqqFybGWoNMTQ2MjQ2Mjk2MDA0Mna3aWvkOayVgASy9bsRABI=",qop="auth",rspauth="be69e0a8fe741fe0df203fde6e0b2389",cnonce="25e2dd1725f8752c",nc=00000001], Content-Type=[application/json; charset=utf-8], Content-Length=[44], Date=[Thu, 05 May 2016 15:42:40 GMT]}}: java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at io.undertow.server.DefaultByteBufferPool.allocate(DefaultByteBufferPool.java:122)
        at io.undertow.protocols.ssl.SslConduit.doWrap(SslConduit.java:849)
        at io.undertow.protocols.ssl.SslConduit.write(SslConduit.java:377)
        at io.undertow.server.protocol.http.HttpResponseConduit.processWrite(HttpResponseConduit.java:247)
        at io.undertow.server.protocol.http.HttpResponseConduit.write(HttpResponseConduit.java:588)
        at io.undertow.conduits.AbstractFixedLengthStreamSinkConduit.write(AbstractFixedLengthStreamSinkConduit.java:106)
        at io.undertow.conduits.AbstractFixedLengthStreamSinkConduit.write(AbstractFixedLengthStreamSinkConduit.java:120)
        at org.xnio.conduits.ConduitStreamSinkChannel.write(ConduitStreamSinkChannel.java:158)
        at io.undertow.channels.DetachableStreamSinkChannel.write(DetachableStreamSinkChannel.java:179)
        at io.undertow.server.HttpServerExchange$WriteDispatchChannel.write(HttpServerExchange.java:1969)
        at org.xnio.channels.Channels.writeBlocking(Channels.java:152)
        at io.undertow.io.UndertowOutputStream.write(UndertowOutputStream.java:228)
        at io.undertow.io.BlockingSenderImpl.writeBuffer(BlockingSenderImpl.java:194)
        at io.undertow.io.BlockingSenderImpl.writeBuffer(BlockingSenderImpl.java:187)
        at io.undertow.io.BlockingSenderImpl.send(BlockingSenderImpl.java:63)
        at io.undertow.io.BlockingSenderImpl.send(BlockingSenderImpl.java:83)
        at org.jboss.as.domain.http.server.DomainUtil.writeResponse(DomainUtil.java:70)
        at org.jboss.as.domain.http.server.DomainApiHandler$1.doSendResponse(DomainApiHandler.java:176)
        at org.jboss.as.domain.http.server.DomainApiHandler$ResponseCallback.sendResponse(DomainApiHandler.java:391)
        at org.jboss.as.domain.http.server.DomainApiHandler.handleRequest(DomainApiHandler.java:232)
        at io.undertow.server.handlers.encoding.EncodingHandler.handleRequest(EncodingHandler.java:66)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler$1.run(SubjectDoAsHandler.java:72)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler$1.run(SubjectDoAsHandler.java:68)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:92)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler.handleRequest(SubjectDoAsHandler.java:68)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler.handleRequest(SubjectDoAsHandler.java:63)
        at io.undertow.server.handlers.BlockingHandler.handleRequest(BlockingHandler.java:56)
        at org.jboss.as.domain.http.server.DomainApiCheckHandler.handleRequest(DomainApiCheckHandler.java:95)
        at io.undertow.server.Connectors.executeRootHandler(Connectors.java:202)
        at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:792)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at org.jboss.threads.JBossThread.run(JBossThread.java:320)

Comment 14 Filip Brychta 2016-05-09 14:47:38 UTC
This is not a bug in the plugin. The issue is caused by memory leak in EAP7. There is a following jira to track that - https://issues.jboss.org/browse/JBEAP-4410

Closing as not a bug.

Comment 15 JBoss JIRA Server 2016-05-13 04:28:52 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Coding In Progress

Comment 16 JBoss JIRA Server 2016-05-13 04:29:05 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Open

Comment 17 JBoss JIRA Server 2016-05-13 04:29:22 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Coding In Progress

Comment 18 JBoss JIRA Server 2016-05-25 13:03:17 UTC
Errata Tool <etool> updated the status of jira JBEAP-4410 to Closed