Bug 1193085
| Field | Value |
|---|---|
| Summary | [GSS](6.4.z) HQ222010: Critical IO Error when client is disconnected via JMX or model-node |
| Product | JBoss Enterprise Application Platform 6 |
| Component | HornetQ |
| Version | 6.4.0 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | unspecified |
| Reporter | Ondřej Kalman <okalman> |
| Assignee | Justin Bertram <jbertram> |
| QA Contact | Miroslav Novak <mnovak> |
| CC | bbaranow, bmaxwell, istudens, jbertram, msvehla, rsvoboda, tom.ross, toross |
| Target Milestone | CR1 |
| Target Release | EAP 6.4.3 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2017-01-17 10:36:47 UTC |
| Bug Blocks | 1231259, 1231916 |
| Attachments | 992227 (stack trace), 992565 (stack trace with interrupt caller) |
Can you outline what I need to do to reproduce this?

You can try to pause HQ with a debugger somewhere around line 106 (before `fileSize = channel.size();`) in the `NIOSequentialFile` class and disconnect the client via JMX. But I'm really not sure that will be enough. We are hitting this quite randomly, and I was not able to build a reproducer with Byteman.

I can't see in the code where `interrupt()` is being invoked. Can you use a Byteman rule like this to try to identify who is calling `interrupt()`:

```
RULE check who is calling Thread.interrupt()
CLASS java.lang.Thread
METHOD interrupt()
AT ENTRY
IF TRUE
DO traceStack("*** called interrupt on thread " + $0 + " from thread " + Thread.currentThread(), 50)
ENDRULE
```
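For context on why an interrupt during journal I/O surfaces as an `IOException`: `FileChannel` is an interruptible channel, so interrupting a thread that is performing (or about to perform) a channel operation closes the channel and raises `ClosedByInterruptException`. A minimal, self-contained sketch of that mechanism (the temp file and timings are hypothetical illustration, not HornetQ code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("journal", ".page");
        AtomicReference<IOException> caught = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                while (true) {
                    buf.clear();
                    ch.write(buf);
                    ch.size(); // the kind of call NIOSequentialFile makes around line 106
                }
            } catch (IOException e) {
                // ClosedByInterruptException is an IOException, which is what
                // HornetQ's journal layer sees and escalates to a critical error.
                caught.set(e);
            }
        });
        worker.start();
        Thread.sleep(100);
        worker.interrupt(); // what Netty's OIO worker does when the connection closes
        worker.join();
        System.out.println(caught.get().getClass().getSimpleName());
    }
}
```

Running this prints `ClosedByInterruptException`: the interrupt status set by `interrupt()` makes the very next channel operation fail and close the file.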
I worked up a Byteman test which I think should simulate the issue. It sends messages to the server until paging starts, and then on the next send Byteman kills the user's connection right when it's checking the channel size in `org.hornetq.core.journal.impl.NIOSequentialFile` (i.e. line 107). The client throws the proper exception at this point. However, it doesn't trigger a `ClosedByInterruptException` (i.e. an `IOException`). It's critical we determine where the `IOException` in your test is coming from. Please let me know if you can use the Byteman rule from my previous comment to get any additional information on this. BTW, what kind of filesystem are you using? Any chance it is NAS or NFS?

I'm working on it. I have a problem with triggering your rule, but I'll figure something out. As for the filesystem, we can hit it on ext4 and also on NFS.

Created attachment 992565 [details]: stack trace with interrupt caller

So finally I was able to hit it on the IBM JDK. I hope it will help you.
I see that Netty is interrupting the working thread (see https://github.com/netty/netty/blob/netty-3.6.10.Final/src/main/java/org/jboss/netty/channel/socket/oio/AbstractOioWorker.java#L224) when it's performing an IO operation, which HornetQ then interprets as a critical error. I'm investigating to see what can be done to mitigate this.

Here's my understanding of this issue... When a Netty OIO thread servicing a client performs work on the journal (e.g. creating a new page file, etc.) there is a small window where, if the connection is closed (either via administrative intervention or because of a ping failure, etc.), the Netty thread can be interrupted and HornetQ will interpret this as a critical journal failure and shut itself down.

To address this problem we now "gracefully" stop the Netty OIO thread. In other words, we wait for the current packet to finish and then we shut down the connection. If the packet doesn't finish within the timeout (i.e. 30 seconds) then we proceed to terminate the connection anyway. This is fixed now on the 2.3.x branch via 54165e6c36fc96fef57422eb0245ce2c90f06a1b.

To be clear, this problem doesn't impact HornetQ 2.4.x and beyond, because we upgraded Netty and the new Netty architecture works differently (i.e. it doesn't interrupt the worker thread in this case).

VERIFIED with 6.4.3.CP.CR1

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
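The graceful-stop idea described in the fix can be sketched roughly as follows. This is a simplified illustration of the pattern (wait for in-flight work, interrupt only after a timeout), not the actual commit; the names `stopGracefully` and `packetDone` are hypothetical:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulStop {
    /**
     * Wait up to timeoutMs for the in-flight packet to finish;
     * interrupt the worker only as a last resort.
     * Returns true if the stop was graceful (no interrupt needed).
     */
    static boolean stopGracefully(Thread worker, CountDownLatch packetDone,
                                  long timeoutMs) throws InterruptedException {
        if (packetDone.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            return true;        // packet completed; safe to close the connection
        }
        worker.interrupt();     // last resort: risks the journal-interrupt race
        return false;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch packetDone = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated in-flight packet
            } catch (InterruptedException ignored) {
            }
            packetDone.countDown();
        });
        worker.start();
        boolean graceful = stopGracefully(worker, packetDone, 30_000);
        worker.join();
        System.out.println("graceful=" + graceful);
    }
}
```

Here the simulated packet finishes well inside the 30-second window, so the worker is never interrupted and the program prints `graceful=true`.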
Created attachment 992227 [details]: stack trace

Scenario:
1) EAP server is started
2) Producer sends messages to a queue and the server starts paging
3) Receiver consumes messages from the queue
4) Clients are disconnected via the JMX method "closeConnectionsForUser"

Description of problem: Sometimes HornetQ writes this warning into the log:

WARN [org.hornetq.core.server] (Old I/O server worker (parentId: 2104647255, [id: 0x7d725e57, /127.0.0.1:5445])) HQ222010: Critical IO Error, shutting down the server. file=NIOSequentialFile /qa/hudson_workspace/workspace/eap-6-hornetq-qe-internal-ts-functional-tests-matrix/jdk/openjdk1.8_local/label/messaging-lab/server1/jboss-eap/standalone/data/messagingpaging/b711d820-b213-11e4-8e1e-3bab47195540/000001408.page, message=null: HornetQException[errorType=IO_ERROR message=null]

We also check via a notification listener whether both clients were successfully disconnected. One notification is missing. It looks like some race condition, because hitting this issue is quite difficult and, as far as I can tell, random. The attachment contains the whole stack trace of exceptions from the log.
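The JMX disconnect in step 4 is an MBean operation invocation. A self-contained sketch of the invocation mechanics against an in-process MBean server follows; the `FakeControl` MBean here is a hypothetical stand-in for the real HornetQ server control (a real reproducer would connect to the EAP server's remote MBean server instead):

```java
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

public class JmxCloseDemo {
    // Hypothetical stand-in for the HornetQ server control MBean.
    public interface FakeControlMBean {
        boolean closeConnectionsForUser(String user);
    }

    public static class FakeControl implements FakeControlMBean {
        public boolean closeConnectionsForUser(String user) {
            System.out.println("closing connections for " + user);
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = MBeanServerFactory.createMBeanServer();
        ObjectName name = new ObjectName("demo:type=FakeControl");
        mbs.registerMBean(new FakeControl(), name);
        // Same invoke(...) call shape a remote MBeanServerConnection would use.
        Object result = mbs.invoke(name, "closeConnectionsForUser",
                new Object[]{"guest"}, new String[]{String.class.getName()});
        System.out.println("result=" + result);
    }
}
```

The object name `demo:type=FakeControl` and the user `guest` are made up for the demo; against EAP you would look up the messaging subsystem's actual `ObjectName` instead.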