Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1193085

Summary: [GSS](6.4.z) HQ222010: Critical IO Error when client is disconnected via JMX or model-node
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Ondřej Kalman <okalman>
Component: HornetQ Assignee: Justin Bertram <jbertram>
Status: CLOSED CURRENTRELEASE QA Contact: Miroslav Novak <mnovak>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.4.0 CC: bbaranow, bmaxwell, istudens, jbertram, msvehla, rsvoboda, tom.ross, toross
Target Milestone: CR1   
Target Release: EAP 6.4.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-17 10:36:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1231259, 1231916    
Attachments (description / flags):
stack trace (flags: none)
stack trace with interrupt caller (flags: none)

Description Ondřej Kalman 2015-02-16 14:37:58 UTC
Created attachment 992227 [details]
stack trace

Scenario:
1) EAP server is started
2) Producer sends messages to queue and server starts paging
3) Receiver consumes messages from queue
4) Clients are disconnected via the JMX method "closeConnectionsForUser"
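For context, step 4 means invoking the closeConnectionsForUser operation on the server's management MBean over JMX. Below is a minimal, self-contained sketch of that invocation pattern. The DummyControl MBean and the "example:type=DummyControl" ObjectName are illustrative stand-ins; HornetQ's real server-control MBean has a different ObjectName and interface.

```java
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

// Sketch of invoking a "closeConnectionsForUser" JMX operation.
// DummyControl stands in for the real server-control MBean.
public class JmxDisconnectSketch {

    public interface DummyControlMBean {
        boolean closeConnectionsForUser(String user);
    }

    public static class DummyControl implements DummyControlMBean {
        @Override
        public boolean closeConnectionsForUser(String user) {
            // The real implementation would close every connection owned by `user`.
            return true;
        }
    }

    public static boolean invokeClose(String user) {
        try {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // Illustrative name; not HornetQ's actual ObjectName.
            ObjectName name = new ObjectName("example:type=DummyControl");
            if (!mbs.isRegistered(name)) {
                mbs.registerMBean(new DummyControl(), name);
            }
            // Same invoke pattern a remote JMX client would use.
            Object result = mbs.invoke(name, "closeConnectionsForUser",
                    new Object[]{user}, new String[]{"java.lang.String"});
            return Boolean.TRUE.equals(result);
        } catch (JMException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("closed connections for guest: " + invokeClose("guest"));
    }
}
```

A remote client would do the same invoke through a JMXConnector; the operation name and signature are what matter here.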

Description of problem: 

Sometimes HornetQ throws this warning into the log:
 WARN  [org.hornetq.core.server] (Old I/O server worker (parentId: 2104647255, [id: 0x7d725e57, /127.0.0.1:5445])) HQ222010: Critical IO Error, shutting down the server. file=NIOSequentialFile /qa/hudson_workspace/workspace/eap-6-hornetq-qe-internal-ts-functional-tests-matrix/jdk/openjdk1.8_local/label/messaging-lab/server1/jboss-eap/standalone/data/messagingpaging/b711d820-b213-11e4-8e1e-3bab47195540/000001408.page, message=null: HornetQException[errorType=IO_ERROR message=null]

We also check via a notification listener whether both clients were successfully disconnected. One notification is missing.

It looks like a race condition, because hitting this issue is quite difficult and, as far as I can tell, random.

The attachment contains the whole stack trace of exceptions from the log.

Comment 1 Justin Bertram 2015-02-16 14:57:44 UTC
Can you outline what I need to do to reproduce this?

Comment 2 Ondřej Kalman 2015-02-16 15:04:18 UTC
You can try to pause HQ with a debugger somewhere around line 106 (before fileSize = channel.size();) in the NIOSequentialFile class and disconnect the client via JMX. But I'm really not sure whether that will be enough. We are hitting this quite randomly, and I was not able to build a reproducer with Byteman.

Comment 3 Justin Bertram 2015-02-16 16:16:45 UTC
I can't see in the code where interrupt() is being invoked.  Can you use a Byteman rule like this to try to identify who is calling interrupt():

  RULE check who is calling Thread.interrupt()
  CLASS java.lang.Thread
  METHOD interrupt()
  AT ENTRY
  IF TRUE
  DO traceStack("*** called interrupt on thread " + $0 + " from thread " + Thread.currentThread(), 50)
  ENDRULE

Comment 4 Justin Bertram 2015-02-16 22:30:26 UTC
I worked up a Byteman test which I think should simulate the issue.  It's sending messages to the server until paging starts and then on the next send Byteman will kill the user's connection right when it's checking the channel size in org.hornetq.core.journal.impl.NIOSequentialFile (i.e. line 107).  The client throws the proper exception at this point.  However, it doesn't trigger a ClosedByInterruptException (i.e. an IOException).

It's critical we determine where the IOException in your test is coming from.  Please let me know if you can use the Byteman rule from my previous comment to get any additional information on this.

BTW, what kind of filesystem are you using?  Any chance it is NAS or NFS?

Comment 5 Ondřej Kalman 2015-02-17 07:28:39 UTC
I'm working on it. I'm having trouble triggering your rule, but I'll figure something out.
About that FS, we can hit it on ext4 and also on NFS.

Comment 6 Ondřej Kalman 2015-02-17 08:46:25 UTC
Created attachment 992565 [details]
stack trace with interrupt caller

So finally I was able to hit it on the IBM JDK. I hope it will help you.

Comment 7 Justin Bertram 2015-02-19 14:52:47 UTC
I see that Netty is interrupting the working thread (see https://github.com/netty/netty/blob/netty-3.6.10.Final/src/main/java/org/jboss/netty/channel/socket/oio/AbstractOioWorker.java#L224) when it's performing an IO operation which HornetQ then interprets as a critical error.

I'm investigating to see what can be done to mitigate this.
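What HornetQ sees here can be reproduced in isolation: when a thread is interrupted while using (or entering) an interruptible FileChannel operation, NIO closes the channel and throws ClosedByInterruptException, which is an IOException. A minimal demo of that JDK behavior, assuming a standard NIO FileChannel (file name and structure are illustrative, not HornetQ code):

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

// Demonstrates the failure mode in this bug: an interrupt delivered to a
// thread touching an interruptible FileChannel surfaces as
// ClosedByInterruptException (an IOException), which HornetQ's journal
// layer then treats as a critical IO error.
public class InterruptedChannelDemo {

    public static String failureName() {
        try {
            File tmp = File.createTempFile("page", ".tmp");
            tmp.deleteOnExit();
            try (FileChannel channel =
                         FileChannel.open(tmp.toPath(), StandardOpenOption.READ)) {
                Thread.currentThread().interrupt(); // simulate Netty interrupting the worker
                try {
                    channel.size(); // analogous to NIOSequentialFile's channel.size()
                } finally {
                    Thread.interrupted(); // clear the flag so the JVM continues normally
                }
                return "no exception";
            }
        } catch (IOException e) {
            return e.getClass().getSimpleName(); // expected: ClosedByInterruptException
        }
    }

    public static void main(String[] args) {
        System.out.println("journal call failed with: " + failureName());
    }
}
```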

Comment 8 Justin Bertram 2015-02-19 19:43:22 UTC
Here's my understanding of this issue...

When a Netty OIO thread servicing a client performs work on the journal (e.g. creating a new page file, etc.) there is a small window where if the connection is closed (either via administrative intervention or because of ping failure, etc.) the Netty thread can be interrupted and HornetQ will interpret this as a critical journal failure and shut itself down.

To address this problem we now "gracefully" stop the Netty OIO thread. In other words, we wait for the current packet to finish and then we shut down the connection. If the packet doesn't finish within the timeout (i.e. 30 seconds) then we proceed to terminate the connection anyway.
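The graceful stop described above can be sketched roughly as follows. This is a simplified model, not the actual HornetQ/Netty code: class and method names are illustrative, and the demo uses much shorter timeouts than the real 30-second window.

```java
// Simplified sketch of the "graceful stop": instead of interrupting a
// worker mid-write, wait up to a timeout for the in-flight packet to
// finish, and interrupt only as a last resort.
public class GracefulStopSketch {

    /** Returns true if the worker finished on its own, false if it had to be interrupted. */
    public static boolean stopGracefully(Thread worker, long timeoutMs) {
        try {
            worker.join(timeoutMs); // wait for the current packet to finish
            if (worker.isAlive()) {
                worker.interrupt(); // timeout elapsed: terminate the connection anyway
                worker.join();
                return false;
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A worker that finishes quickly: no interrupt needed.
        Thread quick = new Thread(() -> { /* pretend to flush the last packet */ });
        quick.start();
        System.out.println("quick worker stopped gracefully: " + stopGracefully(quick, 5000));

        // A worker stuck in a long operation: interrupted after the timeout.
        Thread stuck = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException expected) { }
        });
        stuck.start();
        System.out.println("stuck worker stopped gracefully: " + stopGracefully(stuck, 200));
    }
}
```

The design point is that the interrupt still happens eventually; it is just no longer delivered while a journal operation may be in flight.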

Comment 9 Justin Bertram 2015-02-20 21:28:16 UTC
This is fixed now on the 2.3.x branch via 54165e6c36fc96fef57422eb0245ce2c90f06a1b.

Comment 10 Justin Bertram 2015-02-20 21:38:14 UTC
To be clear, this problem doesn't impact HornetQ 2.4.x and beyond because we upgraded Netty and the new Netty architecture works differently (i.e. it doesn't interrupt the worker thread in this case).

Comment 14 Ondřej Kalman 2015-07-31 08:26:50 UTC
VERIFIED with 6.4.3.CP.CR1

Comment 15 Petr Penicka 2017-01-17 10:36:47 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 16 Petr Penicka 2017-01-17 10:37:29 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.