911610 – HornetQ test suite hangs on Windows

Summary: HornetQ test suite hangs on Windows

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	HornetQ
Sub Component:
Version:	6.1.0
Hardware:	Unspecified
OS:	Windows
Priority:	unspecified
Severity:	high
Target Milestone:	ER6
Target Release:	---
Assignee:	Yong Hao Gao
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	JBPAPP6-1254
Depends On:
Blocks:

Reported:	2013-02-15 13:05 UTC by Miroslav Novak
Modified:	2013-07-23 18:41 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:

Description Miroslav Novak 2013-02-15 13:05:23 UTC

Description of problem:
Occasionally, one of the following tests hangs the whole test suite on Windows Server 2008:
- org.hornetq.tests.integration.cluster.failover.NettyFailoverTest
- org.hornetq.tests.integration.client.PagingTest
- org.hornetq.tests.integration.jms.ManualReconnectionToSingleServerTest
- org.hornetq.tests.integration.cluster.failover.NettyReplicatedFailoverTest

This list of unstable tests is incomplete and will be updated.

Link to Jenkins job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-61-hornetq-project-testsuite-windows/

Comment 1 Nikoleta Hlavickova 2013-02-22 13:22:52 UTC

On Windows Server machines we get the following message in a large number of tests, for various temporary files:

07:51:18,003 ERROR [org.hornetq.journal] HQ144001: Failed to delete file NIOSequentialFile c:\tmp\hornetq-unit-test\page\788997bf-7cee-11e2-ae1b-cf4b605d40bc\000000001.page

These files exist, the user 'hudson' has 'full control' privileges on the directory, and the files can be deleted manually. It seems that the file is still locked at the moment it is supposed to be deleted.
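
To illustrate the symptom: on Windows, a delete typically fails while another handle to the file is still open, and succeeds once that handle is released. The sketch below is only an assumption about how a test cleanup helper could tolerate such a short-lived lock by retrying; it is not the fix applied in HornetQ. The class name, method names, and retry parameters are hypothetical.

```java
import java.io.File;
import java.io.IOException;

class DeleteWithRetrySketch {
    /**
     * Tries to delete a file, pausing between attempts so that a
     * short-lived lock has a chance to be released.
     * Returns true if the file no longer exists afterwards.
     */
    static boolean deleteWithRetry(File file, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!file.exists() || file.delete()) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return !file.exists();
    }

    /** Creates a temporary page-like file and deletes it again. */
    static boolean demo() {
        try {
            File page = File.createTempFile("hornetq-sketch", ".page");
            return deleteWithRetry(page, 5, 100L);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + demo());
    }
}
```

A retry only hides the problem if the lock is never released; in that case the handle that keeps the file open has to be closed before the delete is attempted.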

These are some of the tests where the problem occurs:
org.hornetq.tests.integration.client.LargeMessageCompressTest
org.hornetq.tests.integration.cluster.bridge.BridgeStartTest
org.hornetq.tests.integration.cluster.bridge.BridgeTest
org.hornetq.tests.integration.cluster.bridge.NettyBridgeTest
org.hornetq.tests.integration.cluster.failover.BackupSyncLargeMessageTest
org.hornetq.tests.integration.cluster.failover.BackupSyncPagingTest

Because of this issue our jobs do not finish and we have no test suite results, so we cannot certify HornetQ for Windows Server.

Comment 4 Clebert Suconic 2013-04-01 21:40:48 UTC

*** Bug 900899 has been marked as a duplicate of this bug. ***

Comment 5 Clebert Suconic 2013-04-01 21:42:00 UTC

Howard: Please look at the failure in Bug 900899 as well. We should track all Windows failures through this issue.

Comment 6 Nikoleta Hlavickova 2013-04-02 08:02:31 UTC

Hi Howard,
are you able to connect to the Jenkins Windows machines (e.g. dev98) and run the HornetQ test suite there? If not and you need access, tell me and I can write you some instructions.

Comment 7 Yong Hao Gao 2013-04-02 14:49:46 UTC

Hi Nikoleta,

I don't know how to get access to the Jenkins Windows machines, please help me.

Currently I have set up my laptop (Windows 7) and run some tests locally.

Thanks
Howard

Comment 8 Yong Hao Gao 2013-04-02 14:52:38 UTC

I have run some of the mentioned tests on my local laptop and I did see some random failures, especially in PagingTest (which is the one I'm currently focusing on). But I didn't see any hang.

Howard

Comment 9 Yong Hao Gao 2013-04-02 15:16:05 UTC

I have found one issue with ClientConsumer.receiveImmediate().

In its javadoc it says:

'... This call will force a network trip to HornetQ server to
ensure that there are no messages in the queue which can be 
delivered to this consumer.'

This does not always hold, as some of the paging tests (PagingTest) show.

The assumption is that if one calls

Message m = clientConsumer.receiveImmediate();

and gets a null return value, there are no messages in the target
queue. This is not always so. Let's look at some of the implementation details:

When the above method is called, it causes the server to arrange a
delivery and then send a special message back to the client consumer
(see ServerConsumerImpl.forceDelivery(Long)). There, the delivery task
is guaranteed to be executed before the task that sends back the special
message. However, the delivery task itself may kick off another
delivery task, which is not guaranteed to be executed before
the special message is sent back.

For example, when the queue has no messages in memory but has
some messages in the paging store, the queue schedules a depaging
task and simply returns from the current delivery routine.

So if the special message reaches the client consumer before any
depaged messages arrive, the buffer holds only the special message
at that moment; the client reads it, decides that there are no
messages in the queue, and returns null.
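
The ordering problem can be reproduced in a small model. This is only a sketch under simplifying assumptions: a plain FIFO queue of tasks stands in for the server's ordered executor, and a list stands in for the client buffer. None of the names below are HornetQ classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class ForceDeliveryOrderingSketch {
    /** Returns the order in which items reach the simulated client buffer. */
    static List<String> simulate() {
        List<String> clientBuffer = new ArrayList<>();
        // Stand-in for an ordered executor: tasks run strictly in FIFO order.
        Queue<Runnable> tasks = new ArrayDeque<>();

        // Task 1: the delivery attempt. Nothing is in memory, so it only
        // schedules a depaging task (queued behind task 2) and returns.
        tasks.add(() -> tasks.add(() -> clientBuffer.add("depaged message")));

        // Task 2: send the special forced-delivery message to the client.
        tasks.add(() -> clientBuffer.add("forced-delivery marker"));

        // Run every task, including those queued while draining.
        Runnable next;
        while ((next = tasks.poll()) != null) {
            next.run();
        }
        return clientBuffer;
    }

    public static void main(String[] args) {
        // prints [forced-delivery marker, depaged message]
        System.out.println(simulate());
    }
}
```

In this model the marker always arrives first. On a real server the outcome depends on timing, which would be consistent with the failure showing up only occasionally.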

Some tests rely on this call to check that all messages are
received, as in the following:

      for (int msgCount = 0; msgCount < numberOfMessages; msgCount++)
      {
         ClientMessage msg = consumer.receiveImmediate();
         if (msg == null)
         {
            sessionConsumer.commit();
            fail("Didn't receive a message");
         }
         ...
      }

For the reason described above, this is not a reliable test.
On Linux I haven't seen it fail, but on Windows it fails occasionally.

Maybe we can use receive(timeout) instead of receiveImmediate().
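
The difference between the two calls can be shown with a plain java.util.concurrent queue. This is an analogy, not HornetQ code: BlockingQueue.poll() stands in for receiveImmediate(), poll(timeout, unit) stands in for receive(timeout), and a background thread plays the part of the depaging task. The latch exists only to make the demo deterministic.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class ReceiveWithTimeoutSketch {
    /** Returns {result of the immediate poll, result of the timed poll}. */
    static String[] demo() {
        BlockingQueue<String> clientBuffer = new LinkedBlockingQueue<>();
        CountDownLatch release = new CountDownLatch(1);

        // Plays the depaging task: the message shows up slightly late.
        Thread depager = new Thread(() -> {
            try {
                release.await();
                Thread.sleep(100);
                clientBuffer.put("depaged message");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        depager.setDaemon(true);
        depager.start();

        // Like receiveImmediate(): the buffer is still empty, so this is null.
        String immediate = clientBuffer.poll();
        release.countDown();

        // Like receive(timeout): waits until the late message arrives.
        String timed = null;
        try {
            timed = clientBuffer.poll(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String[] { immediate, timed };
    }

    public static void main(String[] args) {
        String[] results = demo();
        System.out.println("immediate: " + results[0]);
        System.out.println("timed:     " + results[1]);
    }
}
```

In the test loop quoted above, the corresponding change would be to call consumer.receive(timeout) with a generous timeout in place of consumer.receiveImmediate().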

Comment 10 Clebert Suconic 2013-04-02 15:25:55 UTC

But in this case there should be a message there... You are free to change it to receive(big timeout) if you want.

Comment 12 Yong Hao Gao 2013-04-03 03:06:52 UTC

OK, I think I'll change to receive(big timeout) wherever suitable. Thanks.

Comment 13 Clebert Suconic 2013-04-05 17:13:27 UTC

I don't think this issue is a blocker. These are test issues that won't affect a running system.

It may be a blocker for Final (GA)... but definitely not for a Beta.

We are working on it anyway.

Comment 14 Yong Hao Gao 2013-04-09 02:51:52 UTC

Hi Mirek and Nikoleta,

I have committed several fixes for Windows. Can you give me some instructions on how to kick off a Jenkins test run using HornetQ's master branch?

Thanks
Howard

Comment 15 Miroslav Novak 2013-04-09 08:26:09 UTC

Just adding link to created Jenkins job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-project-testsuite-windows-2008-r2-x86_64-OracleJDK1.6-NIO/

Comment 16 Dimitris Andreadis 2013-04-26 16:45:29 UTC

Are you guys working on it?

Comment 17 Yong Hao Gao 2013-04-27 12:45:06 UTC

I believe most of the tests mentioned are passing now. There are some new test issues in the recent test report, but those tests pass on my local machine. I'll see how to fix this.

Comment 19 Miroslav Novak 2013-04-30 15:08:39 UTC

Pavel will be on PTO until the end of the week. If my information is correct, Howard is also on PTO.

Anyway, based on the last run with the HornetQ master branch:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-project-testsuite-windows-2008-r2-x86_64-OracleJDK1.6-NIO/15/

it seems that the HornetQ test suite is no longer hanging on Windows. I suggest moving this to ON_QA, and we'll verify it with EAP 6.1.0.ER6.

Comment 21 Miroslav Novak 2013-05-05 08:20:54 UTC

The HornetQ test suite no longer hangs for EAP 6.1.0.ER6. New Bugzilla entries will be created for the failed tests. Setting as verified. Great work, Howard!
