Description of problem:
Occasionally one of the following tests hangs the whole test suite on Windows Server 2008:
- org.hornetq.tests.integration.cluster.failover.NettyFailoverTest
- org.hornetq.tests.integration.client.PagingTest
- org.hornetq.tests.integration.jms.ManualReconnectionToSingleServerTest
- org.hornetq.tests.integration.cluster.failover.NettyReplicatedFailoverTest
This list of unstable tests is incomplete and will be updated.
Link to Jenkins job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-61-hornetq-project-testsuite-windows/
On Windows Server machines we get the following message in a huge number of tests, for various temporary files:
07:51:18,003 ERROR [org.hornetq.journal] HQ144001: Failed to delete file NIOSequentialFile c:\tmp\hornetq-unit-test\page\788997bf-7cee-11e2-ae1b-cf4b605d40bc\000000001.page
These files exist, the user 'hudson' has 'full control' privileges for the directory, and the files can be deleted manually. It seems that the file is locked at the time it should be deleted. These are some of the tests where the problem occurs:
- org.hornetq.tests.integration.client.LargeMessageCompressTest
- org.hornetq.tests.integration.cluster.bridge.BridgeStartTest
- org.hornetq.tests.integration.cluster.bridge.BridgeTest
- org.hornetq.tests.integration.cluster.bridge.NettyBridgeTest
- org.hornetq.tests.integration.cluster.failover.BackupSyncLargeMessageTest
- org.hornetq.tests.integration.cluster.failover.BackupSyncPagingTest
Because of this issue our jobs do not finish and we get no test suite results, so we cannot certify HornetQ for Windows Server.
*** Bug 900899 has been marked as a duplicate of this bug. ***
Howard: Please look at the failure reported in Bug 900899 as well. We should track all the Windows failures through this issue.
Hi Howard, are you able to connect to the Jenkins Windows machines (e.g. dev98) and run the HornetQ test suite there? If not and you need access, tell me and I can write up some instructions.
Hi Nikoleta, I don't know how to get access to the Jenkins Windows machines, please help me. Currently I have set up my laptop (Windows 7) and run some tests locally. Thanks Howard
I have run some of the mentioned tests on my local laptop and I did see some random failures, especially in PagingTest (which is the one I'm currently focusing on). But I didn't see any hang. Howard
I have an issue with ClientConsumer.receiveImmediate(). Its javadoc says: '... This call will force a network trip to HornetQ server to ensure that there are no messages in the queue which can be delivered to this consumer.' That does not seem to hold, as some of the paging tests (PagingTest) show. The tests assume that if one calls
    Message m = clientConsumer.receiveImmediate();
and gets a null return value, there are no messages in the target queue. This is not always so.
Let's look at some of the implementation details. When the above method is called, it causes the server to arrange a delivery and then send a special message back to the client consumer. See ServerConsumerImpl.forceDelivery(Long). There, the delivery task is sure to be executed before the task that sends back the special message. However, the delivery task itself may kick off another delivery task, which is not guaranteed to be executed before the special message is sent back. For example, when the queue has no messages in memory but has some in the paging store, the queue schedules a depaging task and simply returns from the current delivery routine. So if the special message reaches the client consumer before any depaged messages arrive, the special message is the only thing in the buffer at that moment; the client gets it, decides that there are no messages in the queue, and returns null.
In some tests we rely on this call to check that all messages are received, like the following:
    for (int msgCount = 0; msgCount < numberOfMessages; msgCount++) {
        ClientMessage msg = consumer.receiveImmediate();
        if (msg == null) {
            sessionConsumer.commit();
            fail("Didn't receive a message");
        }
        ...
    }
For the reason above, this is not a reliable test. On Linux I haven't seen it fail, but on Windows it fails occasionally. Maybe we can use receive(timeout) instead of receiveImmediate().
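To make the ordering problem described above concrete, here is a minimal single-threaded sketch. It does not use any HornetQ classes: ReceiveImmediateRace, FORCED_DELIVERY_MARKER, serverTasks, pagedMessages and clientBuffer are all hypothetical names, and the task queue is run in the one unlucky order (delivery, special message, depaging) that the real executors only produce some of the time. receiveWithTimeout() merely stands in for receive(timeout) by letting the depaging task finish; it does not model a real timeout.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical simulation, not HornetQ code: models only the task ordering.
class ReceiveImmediateRace {
    // Stands in for the special message the server sends after a forced delivery.
    static final String FORCED_DELIVERY_MARKER = "<forced-delivery>";

    final Queue<Runnable> serverTasks = new ArrayDeque<>();   // server-side tasks, run in order
    final Queue<String> pagedMessages = new ArrayDeque<>();   // messages only in the paging store
    final Queue<String> clientBuffer = new ArrayDeque<>();    // what has reached the client

    // Delivery routine: nothing is in memory, so it only schedules depaging and returns.
    void deliver() {
        if (!pagedMessages.isEmpty()) {
            serverTasks.add(() -> clientBuffer.add(pagedMessages.remove()));
        }
    }

    // Forced delivery: the delivery task is queued before the marker task,
    // but the depaging task that deliver() schedules lands after the marker.
    void forceDelivery() {
        serverTasks.add(this::deliver);
        serverTasks.add(() -> clientBuffer.add(FORCED_DELIVERY_MARKER));
    }

    void runServerTasks() {
        Runnable task;
        while ((task = serverTasks.poll()) != null) {
            task.run();
        }
    }

    // Returns null as soon as the marker is at the head of the client buffer.
    String receiveImmediate() {
        forceDelivery();
        runServerTasks();
        String head = clientBuffer.poll();
        return FORCED_DELIVERY_MARKER.equals(head) ? null : head;
    }

    // Stand-in for receive(timeout): no marker is involved, the client
    // simply gets the message once depaging has completed.
    String receiveWithTimeout() {
        serverTasks.add(this::deliver);
        runServerTasks();
        return clientBuffer.poll();
    }

    public static void main(String[] args) {
        ReceiveImmediateRace a = new ReceiveImmediateRace();
        a.pagedMessages.add("m1");
        System.out.println("receiveImmediate   -> " + a.receiveImmediate());

        ReceiveImmediateRace b = new ReceiveImmediateRace();
        b.pagedMessages.add("m1");
        System.out.println("receiveWithTimeout -> " + b.receiveWithTimeout());
    }
}
```

In this sketch the first call returns null even though "m1" is still pending and arrives in the client buffer right behind the marker, which is the situation the failing loop runs into. The second call returns "m1" because it does not treat the marker as proof that the queue is empty.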
But in this case there should be a message there... You are free to change it to receive(big timeout) if you want.
OK, I think I'll change it to use receive(big timeout) wherever suitable. Thanks.
I don't think this issue is a blocker. These are test issues that won't affect a running system. It may be a blocker for Final (GA), but definitely not for a Beta. We are working on it anyway.
Hi Mirek and Nikoleta, I have committed several fixes for Windows. Can you give me some instructions on how to kick off a Jenkins test run using HornetQ's master branch? Thanks Howard
Just adding link to created Jenkins job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-project-testsuite-windows-2008-r2-x86_64-OracleJDK1.6-NIO/
Are you guys working on it?
I believe most of the tests mentioned are passing now. There are some new test failures in the most recent test report, but those tests pass on my local machine. I'll see how to fix this.
Pavel will be on PTO until the end of the week. Also, if my information is correct, Howard is on PTO as well. Anyway, based on the last run with the HornetQ master branch: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-HornetQ/job/eap-60-hornetq-project-testsuite-windows-2008-r2-x86_64-OracleJDK1.6-NIO/15/ it seems that the HornetQ test suite is not hanging anymore on Windows. I'd suggest moving this to ON_QA and we'll verify it with EAP 6.1.0.ER6.
The HornetQ test suite does not hang anymore with EAP 6.1.0.ER6. New Bugzilla entries will be created for the failed tests. Setting as verified. Great work, Howard!