Bug 1330180 - EAP 7 with secured management become unavailable after ~8 days
Summary: EAP 7 with secured management become unavailable after ~8 days
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Plugin -- JBoss EAP 7
Version: JON 3.3.5
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: CR02
: One-off release
Assignee: Simeon Pinder
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-04-25 14:30 UTC by Filip Brychta
Modified: 2016-05-25 13:03 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-09 14:47:38 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
thread dumps (9.33 KB, application/x-gzip)
2016-04-25 14:30 UTC, Filip Brychta
no flags Details
tcpdump (11.50 KB, text/plain)
2016-04-25 14:40 UTC, Filip Brychta
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker JBEAP-4410 0 Critical Closed EAP 7 with management leaks small portions of memory when delay between requests is large resulting in management becomi... 2018-11-09 01:56:10 UTC

Description Filip Brychta 2016-04-25 14:30:46 UTC
Created attachment 1150524 [details]
thread dumps

Description of problem:
I found EAP7 domain controller marked as down after running ~8 days without problems. The resource has been marked as down since following msg in agent.log:

2016-04-23 11:15:12,001 WARN  [MeasurementManager.collector-1] (rhq.core.pc.measurement.MeasurementCollectorRunner)- Failure to collect measurement data for Resource[id=13648, uuid=d8969534-6e2a-4cff-b0bd-bb395f48bfcc, type={EAP7}EAP7 Host Controller, key=hostConfig: /home/hudson/jboss-eap7-domain/domain/configuration/host-secured-changed.xml, name=EAP 7 Domain Controller (master 0.0.0.0:8443), parent=fbr-eap7-sec.bc.jonqe.lab.eng.bos.redhat.com, version=EAP 7.0.0.GA] - cause: java.lang.Exception:Failed to read attribute [name] of address [] - response: Result{outcome='failure', failureDescription=The server closed the connection before sending the response, rolled-back=false, rolledBack=false}

The EAP domain controller process was still running on the host but the agent was throwing following errors periodically and the resource was marked as down:

    2016-04-25 04:21:04,729 DEBUG [ResourceContainer.invoker.availCheck.daemon-69] (rhq.modules.plugins.wildfly10.BaseServerComponent)- ResourceType[id=0, name=EAP7 Host Controller, plugin=EAP7, category=Server] [hostConfig: /home/hudson/jboss-eap7-domain/domain/configuration/host-secured-changed.xml]: exception while checking availability
    java.lang.Exception: Failed to read attribute [name] of address [] - response: Result{outcome='failure', failureDescription=The server closed the connection before sending the response, rolled-back=false, rolledBack=false}
            at org.rhq.modules.plugins.wildfly10.BaseComponent.readAttribute(BaseComponent.java:835)
            at org.rhq.modules.plugins.wildfly10.BaseComponent.readAttribute(BaseComponent.java:805)
            at org.rhq.modules.plugins.wildfly10.BaseServerComponent.getAvailability(BaseServerComponent.java:145)
            at org.rhq.core.pc.inventory.AvailabilityProxy.call(AvailabilityProxy.java:143)
            at org.rhq.core.pc.inventory.AvailabilityProxy.call(AvailabilityProxy.java:52)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)



Restart of agent didn't resolve the issue. The issues disappeared when the EAP was restarted.

Version-Release number of selected component (if applicable):
JON3.3.5 with EAP plugin pack 04-ER01

How reproducible:
Not sure

Steps to Reproduce:
1. install/start EAP7 in domain mode with secured management (2-way ssl)
2. import it to JON and set correct Connection properties to have it UP
3. wait ~8days

Actual results:
The resource is marked as down and agent is throwing previous err on debug level.

Expected results:
It should continue to work.

Additional info:
Discussed the issue with EAP QE and they are not aware of any similar issue on EAP 7 side. It was suggested that EAP got into this "broken" state because of interaction with JON agent which is causing exhaustion of connection pool.

Thread dumps of agent, domain controller and host controller processes are attached.

Suggested steps to get more info next time the issue is hit:
- try to connect with jboss-cli directly to see if it's refused as well
- use javax.net.debug=all to log handshake details
- capture traffic via tcpdump -w (-w option is important to be able to import results to wireshark which is able to interpret SSL handshake)

Comment 1 Filip Brychta 2016-04-25 14:35:37 UTC
Forgot to mention that the issue is not visible on EAP6

Comment 2 Filip Brychta 2016-04-25 14:40:45 UTC
Created attachment 1150526 [details]
tcpdump

Adding tcpdump, unfortunately without -w option

Comment 3 Heiko W. Rupp 2016-04-26 12:53:24 UTC
Do we know if there were many (parallel) tcp connections open between agent and the DC?

Comment 4 Heiko W. Rupp 2016-04-26 12:55:43 UTC
(In reply to Heiko W. Rupp from comment #3)
> Do we know if there were many (parallel) tcp connections open between agent
> and the DC?

Something like regular

netstat -an | grep 999 to see how many connections end up at the EAP. See if that number grows

Comment 5 Filip Brychta 2016-04-26 13:21:23 UTC
I didn't save output of netstat but I didn't notice suspiciously large amount of connections. But I don't remember if It was before or after restart of agent. I have thread dump of both agent and DC. Is it possible to guess number of connections from there?

Comment 6 Josejulio Martínez 2016-04-26 17:22:48 UTC
Could you attach the logs?

Comment 7 Filip Brychta 2016-04-27 08:26:29 UTC
All I have is attached. Complete agent log was unfortunately rolled out because it's keeping only two 5000KB backup files by default and debug mode is so noisy.

Only relevant msgs which I found before it was rolled out are pasted in description. Nothing in eap logs (host-controller.log, process-controller.log)

Comment 8 Filip Brychta 2016-04-27 08:33:55 UTC
I tried to move system date 10 days forward and it was not reproduced this way.

Comment 10 Filip Brychta 2016-04-29 09:58:44 UTC
I will run following curl periodically to simulate the operation which is failing (exception above):
curl --insecure --cert-type pem --key key.pem --cert ./cert.pem  "https://127.0.0.1:844/management" --header "Content-Type: application/json" -d '{"operation":"read-attribute","name":"name","json.pretty":1}'

Comment 11 Radim Hatlapatka 2016-05-02 13:07:52 UTC
We are running EAP with two way ssl setup for management interface and sending request once in a second. Request is similar to the one Filip describes in Comment #10. It is now running for more than 3 days not seeing any issue so far.

Comment 12 Filip Brychta 2016-05-05 15:43:19 UTC
I now have 3 instances in "broken" state. It happened again after 8.1 days.
- it's not necessary to use 2-way SSL, the issue is visible when using 1-way SSL as well
- it's irrelevant if there is another EAP 7 running on the host
- EAP 7 standalone mode is not yet affected by this issue, but it might be later??
- jboss-cli is working correctly

Following err in host-controller.log:

2016-05-05 11:42:40,271 ERROR [io.undertow.request] (Host Controller Service Threads - 54116) UT005071: Undertow request failed HttpServerExchange{ POST /management request {Accept=[application/json], Connection=[Keep-Alive], Authorization=[Digest username="rhqadmin", realm="ManagementRealm", nonce="5LMqqFybGWoNMTQ2MjQ2Mjk2MDA0Mna3aWvkOayVgASy9bsRABI=", uri="/management", response="e194f9abf448bf1f7a1fb411cff6abc8", qop=auth, nc=00000001, cnonce="25e2dd1725f8752c", algorithm="MD5", opaque="00000000000000000000000000000000"], Content-Length=[74], Content-Type=[application/json; charset=UTF-8], User-Agent=[Apache-HttpClient/4.2.5 (java 1.5)], Host=[127.0.0.1:8443]} response {Connection=[keep-alive], X-Frame-Options=[SAMEORIGIN], Authentication-Info=[nextnonce="5LMqqFybGWoNMTQ2MjQ2Mjk2MDA0Mna3aWvkOayVgASy9bsRABI=",qop="auth",rspauth="be69e0a8fe741fe0df203fde6e0b2389",cnonce="25e2dd1725f8752c",nc=00000001], Content-Type=[application/json; charset=utf-8], Content-Length=[44], Date=[Thu, 05 May 2016 15:42:40 GMT]}}: java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at io.undertow.server.DefaultByteBufferPool.allocate(DefaultByteBufferPool.java:122)
        at io.undertow.protocols.ssl.SslConduit.doWrap(SslConduit.java:849)
        at io.undertow.protocols.ssl.SslConduit.write(SslConduit.java:377)
        at io.undertow.server.protocol.http.HttpResponseConduit.processWrite(HttpResponseConduit.java:247)
        at io.undertow.server.protocol.http.HttpResponseConduit.write(HttpResponseConduit.java:588)
        at io.undertow.conduits.AbstractFixedLengthStreamSinkConduit.write(AbstractFixedLengthStreamSinkConduit.java:106)
        at io.undertow.conduits.AbstractFixedLengthStreamSinkConduit.write(AbstractFixedLengthStreamSinkConduit.java:120)
        at org.xnio.conduits.ConduitStreamSinkChannel.write(ConduitStreamSinkChannel.java:158)
        at io.undertow.channels.DetachableStreamSinkChannel.write(DetachableStreamSinkChannel.java:179)
        at io.undertow.server.HttpServerExchange$WriteDispatchChannel.write(HttpServerExchange.java:1969)
        at org.xnio.channels.Channels.writeBlocking(Channels.java:152)
        at io.undertow.io.UndertowOutputStream.write(UndertowOutputStream.java:228)
        at io.undertow.io.BlockingSenderImpl.writeBuffer(BlockingSenderImpl.java:194)
        at io.undertow.io.BlockingSenderImpl.writeBuffer(BlockingSenderImpl.java:187)
        at io.undertow.io.BlockingSenderImpl.send(BlockingSenderImpl.java:63)
        at io.undertow.io.BlockingSenderImpl.send(BlockingSenderImpl.java:83)
        at org.jboss.as.domain.http.server.DomainUtil.writeResponse(DomainUtil.java:70)
        at org.jboss.as.domain.http.server.DomainApiHandler$1.doSendResponse(DomainApiHandler.java:176)
        at org.jboss.as.domain.http.server.DomainApiHandler$ResponseCallback.sendResponse(DomainApiHandler.java:391)
        at org.jboss.as.domain.http.server.DomainApiHandler.handleRequest(DomainApiHandler.java:232)
        at io.undertow.server.handlers.encoding.EncodingHandler.handleRequest(EncodingHandler.java:66)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler$1.run(SubjectDoAsHandler.java:72)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler$1.run(SubjectDoAsHandler.java:68)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.jboss.as.controller.AccessAuditContext.doAs(AccessAuditContext.java:92)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler.handleRequest(SubjectDoAsHandler.java:68)
        at org.jboss.as.domain.http.server.security.SubjectDoAsHandler.handleRequest(SubjectDoAsHandler.java:63)
        at io.undertow.server.handlers.BlockingHandler.handleRequest(BlockingHandler.java:56)
        at org.jboss.as.domain.http.server.DomainApiCheckHandler.handleRequest(DomainApiCheckHandler.java:95)
        at io.undertow.server.Connectors.executeRootHandler(Connectors.java:202)
        at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:792)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        at org.jboss.threads.JBossThread.run(JBossThread.java:320)

Comment 14 Filip Brychta 2016-05-09 14:47:38 UTC
This is not a bug in the plugin. The issue is caused by memory leak in EAP7. There is a following jira to track that - https://issues.jboss.org/browse/JBEAP-4410

Closing as not a bug.

Comment 15 JBoss JIRA Server 2016-05-13 04:28:52 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Coding In Progress

Comment 16 JBoss JIRA Server 2016-05-13 04:29:05 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Open

Comment 17 JBoss JIRA Server 2016-05-13 04:29:22 UTC
Jimmy Wilson <jawilson> updated the status of jira JBEAP-4410 to Coding In Progress

Comment 18 JBoss JIRA Server 2016-05-25 13:03:17 UTC
Errata Tool <etool> updated the status of jira JBEAP-4410 to Closed


Note You need to log in before you can comment on or make changes to this bug.