Bug 1193085
| Field | Value |
|---|---|
| Summary | [GSS](6.4.z) HQ222010: Critical IO Error when client is disconnected via JMX or model-node |
| Product | JBoss Enterprise Application Platform 6 |
| Component | HornetQ |
| Version | 6.4.0 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | unspecified |
| Reporter | Ondřej Kalman <okalman> |
| Assignee | Justin Bertram <jbertram> |
| QA Contact | Miroslav Novak <mnovak> |
| CC | bbaranow, bmaxwell, istudens, jbertram, msvehla, rsvoboda, tom.ross, toross |
| Target Milestone | CR1 |
| Target Release | EAP 6.4.3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2017-01-17 10:36:47 UTC |
| Bug Blocks | 1231259, 1231916 |
| Attachments | 992227 (stack trace), 992565 (stack trace with interrupt caller) |
Can you outline what I need to do to reproduce this?

You can try to pause HQ with a debugger somewhere around line 106 (before `fileSize = channel.size();`) in the `NIOSequentialFile` class and disconnect the client via JMX. But I'm really not sure that will be enough. We are hitting this quite randomly, and I was not able to build a reproducer with Byteman.

I can't see in the code where `interrupt()` is being invoked. Can you use a Byteman rule like this to try to identify who is calling `interrupt()`:

```
RULE check who is calling Thread.interrupt()
CLASS java.lang.Thread
METHOD interrupt()
AT ENTRY
IF TRUE
DO traceStack("*** called interrupt on thread " + $0 + " from thread " + Thread.currentThread(), 50)
ENDRULE
```
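For context on why an interrupt during journal I/O surfaces as an `IOException`: `FileChannel` is an interruptible channel, so interrupting a thread that is performing (or about to perform) a channel operation closes the channel and raises `ClosedByInterruptException`. A minimal, self-contained sketch of that mechanism (the temp file and timings are hypothetical illustration, not HornetQ code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("journal", ".page");
        AtomicReference<IOException> caught = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                while (true) {
                    buf.clear();
                    ch.write(buf);
                    ch.size(); // the kind of call NIOSequentialFile makes around line 106
                }
            } catch (IOException e) {
                // ClosedByInterruptException is an IOException, which is what
                // HornetQ's journal layer sees and escalates to a critical error.
                caught.set(e);
            }
        });
        worker.start();
        Thread.sleep(100);
        worker.interrupt(); // what Netty's OIO worker does when the connection closes
        worker.join();
        System.out.println(caught.get().getClass().getSimpleName());
    }
}
```

Running this prints `ClosedByInterruptException`: the interrupt status set by `interrupt()` makes the very next channel operation fail and close the file.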
I worked up a Byteman test which I think should simulate the issue. It sends messages to the server until paging starts, and then on the next send Byteman kills the user's connection right when it's checking the channel size in `org.hornetq.core.journal.impl.NIOSequentialFile` (i.e. line 107). The client throws the proper exception at this point. However, it doesn't trigger a `ClosedByInterruptException` (i.e. an `IOException`). It's critical we determine where the `IOException` in your test is coming from. Please let me know if you can use the Byteman rule from my previous comment to get any additional information on this. BTW, what kind of filesystem are you using? Any chance it is NAS or NFS?

I'm working on it. I have a problem with triggering your rule, but I'll figure something out. As for the filesystem, we can hit it on ext4 and also on NFS.

Created attachment 992565 [details]: stack trace with interrupt caller

So finally I was able to hit it on the IBM JDK. I hope it will help you.
I see that Netty is interrupting the working thread (see https://github.com/netty/netty/blob/netty-3.6.10.Final/src/main/java/org/jboss/netty/channel/socket/oio/AbstractOioWorker.java#L224) when it's performing an IO operation, which HornetQ then interprets as a critical error. I'm investigating to see what can be done to mitigate this.

Here's my understanding of this issue... When a Netty OIO thread servicing a client performs work on the journal (e.g. creating a new page file, etc.) there is a small window where, if the connection is closed (either via administrative intervention or because of a ping failure, etc.), the Netty thread can be interrupted and HornetQ will interpret this as a critical journal failure and shut itself down.

To address this problem we now "gracefully" stop the Netty OIO thread. In other words, we wait for the current packet to finish and then we shut down the connection. If the packet doesn't finish within the timeout (i.e. 30 seconds) then we proceed to terminate the connection anyway. This is fixed now on the 2.3.x branch via 54165e6c36fc96fef57422eb0245ce2c90f06a1b.

To be clear, this problem doesn't impact HornetQ 2.4.x and beyond, because we upgraded Netty and the new Netty architecture works differently (i.e. it doesn't interrupt the worker thread in this case).

VERIFIED with 6.4.3.CP.CR1

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
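The graceful-stop idea described in the fix can be sketched roughly as follows. This is a simplified illustration of the pattern (wait for in-flight work, interrupt only after a timeout), not the actual commit; the names `stopGracefully` and `packetDone` are hypothetical:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulStop {
    /**
     * Wait up to timeoutMs for the in-flight packet to finish;
     * interrupt the worker only as a last resort.
     * Returns true if the stop was graceful (no interrupt needed).
     */
    static boolean stopGracefully(Thread worker, CountDownLatch packetDone,
                                  long timeoutMs) throws InterruptedException {
        if (packetDone.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            return true;        // packet completed; safe to close the connection
        }
        worker.interrupt();     // last resort: risks the journal-interrupt race
        return false;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch packetDone = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated in-flight packet
            } catch (InterruptedException ignored) {
            }
            packetDone.countDown();
        });
        worker.start();
        boolean graceful = stopGracefully(worker, packetDone, 30_000);
        worker.join();
        System.out.println("graceful=" + graceful);
    }
}
```

Here the simulated packet finishes well inside the 30-second window, so the worker is never interrupted and the program prints `graceful=true`.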
Created attachment 992227 [details]: stack trace

Scenario:
1) EAP server is started
2) Producer sends messages to a queue and the server starts paging
3) Receiver consumes messages from the queue
4) Clients are disconnected via the JMX method "closeConnectionsForUser"

Description of problem: Sometimes HornetQ writes this warning into the log:

WARN [org.hornetq.core.server] (Old I/O server worker (parentId: 2104647255, [id: 0x7d725e57, /127.0.0.1:5445])) HQ222010: Critical IO Error, shutting down the server. file=NIOSequentialFile /qa/hudson_workspace/workspace/eap-6-hornetq-qe-internal-ts-functional-tests-matrix/jdk/openjdk1.8_local/label/messaging-lab/server1/jboss-eap/standalone/data/messagingpaging/b711d820-b213-11e4-8e1e-3bab47195540/000001408.page, message=null: HornetQException[errorType=IO_ERROR message=null]

We also check via a notification listener whether both clients were successfully disconnected. One notification is missing. It looks like some race condition, because hitting this issue is quite difficult and, as far as I can tell, random. The attachment contains the whole stack trace of exceptions from the log.
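The JMX disconnect in step 4 is an MBean operation invocation. A self-contained sketch of the invocation mechanics against an in-process MBean server follows; the `FakeControl` MBean here is a hypothetical stand-in for the real HornetQ server control (a real reproducer would connect to the EAP server's remote MBean server instead):

```java
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

public class JmxCloseDemo {
    // Hypothetical stand-in for the HornetQ server control MBean.
    public interface FakeControlMBean {
        boolean closeConnectionsForUser(String user);
    }

    public static class FakeControl implements FakeControlMBean {
        public boolean closeConnectionsForUser(String user) {
            System.out.println("closing connections for " + user);
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = MBeanServerFactory.createMBeanServer();
        ObjectName name = new ObjectName("demo:type=FakeControl");
        mbs.registerMBean(new FakeControl(), name);
        // Same invoke(...) call shape a remote MBeanServerConnection would use.
        Object result = mbs.invoke(name, "closeConnectionsForUser",
                new Object[]{"guest"}, new String[]{String.class.getName()});
        System.out.println("result=" + result);
    }
}
```

The object name `demo:type=FakeControl` and the user `guest` are made up for the demo; against EAP you would look up the messaging subsystem's actual `ObjectName` instead.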