Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 916625

Summary: Clients won't failover when journal on NFSv4 is disconnected
Product: [JBoss] JBoss Enterprise Application Platform 6
Component: HornetQ
Version: 6.1.0
Reporter: Miroslav Novak <mnovak>
Assignee: Miroslav Novak <mnovak>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
CC: cdewolf, csuconic, dandread, francisco.borges, jdoyle, myarboro
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Attachments:
- threaddump.txt
- server logs
- reproducer.zip
- threaddump.txt (2012-04-09)
- threaddump_mnovak branch (2013-04-19)
- test-fcntl.c
- test-flock.c
- Working test-flock
- threadump_before_shutdown_server1.txt
- threaddump_hanging_shutdown.txt
- server_logs_2012_04_25.zip

Description Miroslav Novak 2013-02-28 14:08:40 UTC
Test scenario:
1. Start 2 EAP 6.1.0.ER1 servers in colocated topology. Servers contain backup for each other (shared journals for live/backup pairs are on NFSv4 server)
2. Start producer/consumer connected to server1 to queue jms/queue/testQueue0
3. Disconnect server1 from NFSv4 server (by adding rule to iptables which throws away all incoming/outgoing NFSv4 packets)

Result:
After step 3 it takes a long time (ca. 4 min) for the first server to detect that the connection to the NFSv4 server was lost, and the NFSv4 server releases the file lock on the journal (it took ca. 90 s) before the first server shuts itself down. This leads to a situation where the live and backup servers are up at the same time.
When the first server finally recognizes that the NFSv4 server was disconnected and tries to shut down the HQ server, it ends up in a deadlock (threaddump.txt attached). Another problem is that clients never fail over to the second server.
There is also a problem with shutting down the first server: ctrl-c or kill -9 have no effect and server1 becomes a zombie process. I noticed that once the NFSv4 server is reconnected, server1 shuts itself down.

A reproducer was prepared for the messaging lab - two machines are needed to run it:
1. Download and unzip reproducer.zip - the following commands are executed in the unzipped directory
2. Prepare 2 servers in colocated topology - "sh prepare.sh"
	- downloads EAP 6.1.0.ER1 + natives and unzips them
	- copies jboss-eap-6.1 into the server1 and server2 directories
	- copies the configurations standalone-full-ha-1.xml and standalone-full-ha-2.xml to server1 and server2 (to $JBOSS_HOME/standalone/configuration)
3. Start server1 - "sh start-server1.sh <public_ip_of_first_server>"
4. Start server2 - "sh start-server2.sh <public_ip_of_second_server>"
5. Start the producer - "sh start-producer.sh <public_ip_of_first_server> 100000"
6. Start the consumer - "sh start-consumer.sh <public_ip_of_first_server>"
7. Disconnect the first machine from the NFSv4 server:
	- switch to the hudson user with "sudo su - hudson"
	- disconnect the NFSv4 server with "sudo test-nfsv4-messaginglab.sh disable"


The journals for the first/second live/backup pairs are located in /mnt/hornetq/client/nfs-shared-journal/colocated-power-off-1 and /mnt/hornetq/client/nfs-shared-journal/colocated-power-off-2

IMPORTANT: Do NOT leave the NFSv4 server disconnected after the test. Reconnect it with "sudo test-nfsv4-messaginglab.sh enable" as the hudson user.

Comment 1 Miroslav Novak 2013-02-28 14:11:55 UTC
Created attachment 703904 [details]
threaddump.txt

Comment 2 Miroslav Novak 2013-02-28 14:12:23 UTC
Created attachment 703905 [details]
server logs

Comment 3 Miroslav Novak 2013-02-28 14:14:07 UTC
Created attachment 703906 [details]
reproducer.zip

Comment 4 Miroslav Novak 2013-02-28 14:20:12 UTC
The issues seem to be:
1. It takes too long to shut down the HQ live in server1 when the journal is disconnected.
2. Clients do not fail over.
3. Server1 can't be cleanly shut down or killed; it becomes a zombie.

Comment 5 Clebert Suconic 2013-02-28 15:20:47 UTC
There isn't much we can do about it... if NFS is not reporting the error to us, that makes it an NFS bug.

It's probably a configuration issue... if we get any exception back from the file IO operations, we shut the server down based on the critical error.
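The shutdown-on-critical-error path described above can be sketched roughly as follows. This is a generic illustration, not HornetQ's actual API: the listener interface and method names here are hypothetical stand-ins for HornetQ's IO critical error listener.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CriticalIoSketch {
    /** Hypothetical callback, loosely modeled on an IO-critical-error listener. */
    interface CriticalErrorListener {
        void onIOError(Throwable t, String message);
    }

    static void syncJournal(FileChannel channel, ByteBuffer data, CriticalErrorListener listener) {
        try {
            channel.write(data);
            channel.force(false); // the fsync is where an NFS error (or hang) surfaces
        } catch (IOException e) {
            // This only helps if the kernel actually returns an error
            // instead of blocking the call forever, which is the crux of this bug.
            listener.onIOError(e, "critical IO error, shutting server down");
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("journal", ".dat");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            syncJournal(ch, ByteBuffer.wrap(new byte[]{1, 2, 3}), (t, m) -> System.out.println(m));
            System.out.println("sync completed without critical error");
        }
        Files.deleteIfExists(tmp);
    }
}
```

On a healthy local file this completes normally; the whole dispute in this bug is whether the catch block is ever reached when the storage is NFS and the connection drops.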

Comment 6 Miroslav Novak 2013-02-28 16:15:04 UTC
The configuration should be OK. When I kill or shut down server1, the clients fail over to server2.

From the threaddump.txt taken after the IO critical error, it seems that the HQ server did not fully shut down:
"pool-7-thread-1" prio=10 tid=0x00007fa02c010000 nid=0x70bf in Object.wait() [0x00007f9ff3ca8000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000fde44e58> (a org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread)
	at java.lang.Thread.join(Thread.java:1203)
	- locked <0x00000000fde44e58> (a org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread)
	at java.lang.Thread.join(Thread.java:1256)
	at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread.close(RemotingServiceImpl.java:598)
	at org.hornetq.core.remoting.server.impl.RemotingServiceImpl.stop(RemotingServiceImpl.java:326)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:583)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:532)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:499)

Could that be the reason the clients did not fail over?

Comment 7 Miroslav Novak 2013-03-05 13:22:03 UTC
The problem that clients do not fail over happens only with the colocated HA topology.

Comment 9 Miroslav Novak 2013-03-22 11:43:02 UTC
Any update on this?

Comment 10 Clebert Suconic 2013-03-28 15:54:49 UTC
We will take a look at this, but I don't think there's much we can do about it...

This can be left as a bug for now, but it doesn't need to be in 6.1.Beta.

Comment 12 Clebert Suconic 2013-04-02 14:48:48 UTC
Can you give me some instructions on how to mount the NFS share? I need it for my dev environment.

Comment 13 Miroslav Novak 2013-04-02 15:13:25 UTC
To configure the NFSv4 server:
1. Export a directory on the machine running the NFSv4 server by editing /etc/exports:
/home/jbossqa/shared_folder_nfs 10.34.3.189(rw,sync,fsid=0) 10.34.3.190(rw,sync,fsid=0) 192.168.40.1(rw,sync,fsid=0)

This line specifies the IPs of the servers that may mount the directory /home/jbossqa/shared_folder_nfs, which is located on this machine.

2. On the machine that mounts the exported file system, edit /etc/fstab:
<ip_address_of_nfsv4_server>:/    /home/jbossqa/shared_folder_nfs  nfs4  rw,nosuid,nodev,intr,tcp,noac,soft,lookupcache=none,sloppy

3. Create the directory "/home/jbossqa/shared_folder_nfs" on the machine that mounts the file system.

4. Mount the file system on the client machine: "mount /home/jbossqa/shared_folder_nfs"

If something is not working, there is a doc:
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/s1-nfs-server-export.html

Comment 14 Clebert Suconic 2013-04-02 15:45:44 UTC
mnovak: clebert, still the mount options are:
[09:47am] mnovak: clebert, messaging-01.jbm.lab.bos.redhat.com:/hornetq on /mnt/hornetq/client type nfs4 (rw,nosuid,nodev,intr,tcp,noac,soft,lookupcache=none,sloppy,addr=10.16.100.6,clientaddr=10.16.100.20)
[09:48am]


But if you could still give me some information on how to set up the mounts...

Comment 15 Clebert Suconic 2013-04-02 21:43:02 UTC
Ah.. just saw comment 13. It wasn't visible before?.. Anyway, I see now. Thanks.

Comment 16 Miroslav Novak 2013-04-02 21:46:35 UTC
I'm not sure; I may have set the comment as private when I created it. I think you need to be in the "private_comment" group to see it.

Comment 17 Clebert Suconic 2013-04-04 03:56:08 UTC
Miroslav: after spending a great deal of time on this, I can assure you that there is absolutely nothing I can do about it...

The code will hang on channel.write(), and NFS will keep reconnecting forever... everything just hangs (try, for instance, doing an ls on the NFS folder: nothing happens, it just hangs).

You could maybe ask engineering to see if they can find a way to set up a timeout in fstab. I found some options on Google, but I couldn't make any timeout work.

Or you could take this to the kernel guys and ask them to provide an option to fail the operation in case of disconnections.

If you can't find an option, either by raising an issue with the RHEL/IO guys or by configuring this better, you can always document expectations...

There is nothing I can do... I will assign it back to you so you can try finding a better config with the eng-ops guys, or assign it to the doc guys later.

Comment 18 Miroslav Novak 2013-04-04 08:54:07 UTC
Please consult this with Andy; engineering should provide information about the supported NFS configuration for customers. We need to verify that it works correctly. Right now we don't have the spare time for the experiments we did before, because the ERx releases come without any space in between. The current documentation [1] is quite abstract.

Documenting this behavior is our last option; we can fall back to it with PM ack.
In that case the documentation should address the following issues:
 1. It takes several minutes to shut down the HQ live in server1 when the journal is disconnected.
 2. Clients do not fail over.
 3. Server1 can't be cleanly shut down or killed when the NFS server is disconnected; it becomes a zombie.

[1] http://documentation-devel.engineering.redhat.com/docs/en-US/JBoss_Enterprise_Application_Platform/6-Alpha/html/Migration_Guide/sect-HornetQ_Changes.html

Comment 19 Clebert Suconic 2013-04-04 14:40:15 UTC
As I said, the issue is not in HornetQ; the issue is that NFS is either not properly configured or not configured at all.

You first have to check whether there is a better config in which the kernel would return errors to the IO operation.

Comment 23 Clebert Suconic 2013-04-04 16:01:31 UTC
I had a typo in my last comment... I don't think there's a way to edit it after the fact in BZ, so this is just to correct what I said:

-------

As I said, the issue is not in HornetQ; the issue is that NFS is either not properly configured, or NFS *does not support such a scenario*.

You first have to check whether there is a better config in which the kernel would return errors to the IO operation.

Comment 24 Carlo de Wolf 2013-04-05 10:49:48 UTC
With the following setting I get a proper input/output error after 12 seconds on RHEL 6.4.

192.168.30.128:/tmp     /mnt    nfs     rw,soft,udp,timeo=2,retrans=5,rsize=1024,wsize=1024     2 2

Note that the real differentiator is 'udp'. Normally it will use tcp with all of its associated timeouts; only after tcp times out are the NFS timeout options considered.

The other options show what we are getting into. Using udp as the transport protocol is not efficient by default and needs to be tuned. I've opted for a low timeout threshold with a couple more retransmissions to counter occasional packet loss. rsize and wsize need to be set to a multiple of 1024 matching the MTU of the network interface used. In a real production scenario the MTU needs to be tuned first.

Comment 25 Carlo de Wolf 2013-04-05 10:51:03 UTC
Oops, that should be:

192.168.30.128:/tmp     /mnt    nfs     rw,soft,udp,timeo=2,retrans=5,rsize=1024,wsize=1024     0 0

Comment 26 Miroslav Novak 2013-04-09 13:15:50 UTC
Created attachment 733212 [details]
threaddump.txt (2012-04-09)

We did some testing with the last mount options. The execution path seems to be different when clients are connected to the server, and the "IO critical error" cannot be and is not thrown. The issue appears to be in PagingStoreImpl.page (PagingStoreImpl.java:846), where the server hangs on:
// We need to ensure a read lock, as depage could change the paging state
lock.readLock().lock(); // line 846

when the NFS server is disconnected. There are several threads waiting forever for the lock. From threaddump.txt (2012-04-09) it is not possible to see who is holding the lock. Could dev take a look at it, please?
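The hang pattern described above (many readers queued behind a read-write lock whose holder is itself stuck in IO) can be reproduced in isolation. This is a generic sketch of the pattern, not HornetQ code; the sleeping writer stands in for a thread blocked in a hung NFS write:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockHang {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writerHoldsLock = new CountDownLatch(1);

        // Stand-in for a thread stuck in a hung IO call while holding the write lock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            writerHoldsLock.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE); // never returns, like a blocked NFS write
            } catch (InterruptedException ignored) {
            } finally {
                lock.writeLock().unlock();
            }
        });
        writer.setDaemon(true);
        writer.start();

        writerHoldsLock.await();
        // Analogous to lock.readLock().lock() at PagingStoreImpl.java:846:
        // the read lock never becomes available, so every caller piles up here.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("read lock acquired: " + acquired); // prints false
    }
}
```

In a thread dump this shows up exactly as reported: readers WAITING on the lock, while the actual culprit is the RUNNABLE thread stuck inside a native IO call.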

Comment 27 Clebert Suconic 2013-04-11 16:03:04 UTC
I see the same issue there:

This thread here is forcing the FileChannel and not getting an exception back from NFS. I can't do anything if the kernel just hangs on IO calls:


"Thread-2 (HornetQ-scheduled-threads-779325750)" prio=10 tid=0x00007f086c179000 nid=0xd05 runnable [0x00007f08357cc000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.FileChannelImpl.force0(Native Method)
	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
	at org.hornetq.core.journal.impl.NIOSequentialFile.sync(NIOSequentialFile.java:226)
	at org.hornetq.core.paging.impl.Page.sync(Page.java:220)
	at org.hornetq.core.paging.impl.PagingStoreImpl.ioSync(PagingStoreImpl.java:337)
	at org.hornetq.core.paging.impl.PageSyncTimer.tick(PageSyncTimer.java:94)
	at org.hornetq.core.paging.impl.PageSyncTimer.access$000(PageSyncTimer.java:29)
	at org.hornetq.core.paging.impl.PageSyncTimer$1.run(PageSyncTimer.java:48)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)

Comment 28 Clebert Suconic 2013-04-11 16:05:54 UTC
I'm assigning this back to mnovak as a way to decide what to do next.

I think QA should do these two tasks:

- raise an issue with eng-ops
- raise an issue with the storage guys, and possibly with the kernel guys on the RHEL side.

Depending on the outcome, we document our expectations.

It should be easy to write ten lines of code to show the behaviour with an NFS timeout: write, sync, and watch it hang.

Comment 29 Carlo de Wolf 2013-04-11 17:40:51 UTC
[root@node02 ~]# more Main.java
import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class Main {
        public static void main(String[] args) throws Exception {
                RandomAccessFile file = new RandomAccessFile("/mnt/dummy", "rw");
                FileChannel channel = file.getChannel();
                ByteBuffer buf = ByteBuffer.allocate(1024);
                int i = 0;
                while(true) {
                        System.out.println(i++);
                        for (int p = 0; p < 1024; p++) {
                                buf.put((byte) p);
                        }
                        buf.flip();
                        channel.position(0);
                        channel.write(buf);
                        channel.force(false);
                        buf.flip();
                        Thread.sleep(1000);
                }
        }
}
[root@node02 ~]# java Main
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Exception in thread "main" java.io.IOException: Input/output error
        at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:75)
        at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:363)
        at Main.main(Main.java:19)

I pulled the plug after 16 seconds and waited ~15 seconds.

Comment 30 Clebert Suconic 2013-04-11 20:21:38 UTC
Hmmm.. you're right. I'm missing a call to the IO Critical Listener through the paging sync.

Comment 31 Clebert Suconic 2013-04-12 04:07:11 UTC
@mnovak: Can you try with a patched version, please? I've sent you instructions on how to get the build.

Comment 32 Miroslav Novak 2013-04-17 14:56:00 UTC
We did testing with the same results.

Comment 33 Clebert Suconic 2013-04-17 16:12:25 UTC
I have given you a branch (mnovak on my GitHub) that will give you some debug output on when/if an exception is being captured. Could you try that and analyze the logs?

Anyway, we should defer this issue.

Comment 34 Miroslav Novak 2013-04-19 10:22:26 UTC
Created attachment 737562 [details]
threaddump_mnovak branch (2013-04-19)

Comment 35 Francisco Borges 2013-04-19 10:32:14 UTC
Some comments on this case:

1. From Carlo de Wolf's comment (https://bugzilla.redhat.com/show_bug.cgi?id=916625#c29) I take it that we are missing error handling for IOErrors. Clebert added a missing one in https://github.com/hornetq/hornetq/commit/d98cce79c922f566d5bf3e1d163acf1d9b5ed53a but there are other points where we are propagating IOExceptions without calling "factory.onIOError(...)".

2. Miroslav just tried to reproduce this and got a server hanging at server.stop(). It turns out that we are not setting criticalIOError:=true when shutting down the server on an IOError. While this is a different bug, it affects this scenario (an NFS disconnect should trigger a shutdown on IOError).

Comment 36 Francisco Borges 2013-04-19 15:21:34 UTC
I opened a PR to address the lack of criticalIOError:=true when an IOError is detected: https://github.com/hornetq/hornetq/pull/1008

I also added many more calls to `factory.onIOError` on IOExceptions. These can be seen on this branch of mine: https://github.com/FranciscoBorges/hornetq/tree/nfsHang

Right now we have servers running with the NFS taken out. Both:

1. The eventPoller thread 
"Thread-2 (HornetQ-AIO-poller-pool1084824280-1960129093)" daemon prio=10 tid=0x00007f46e402d800 nid=0xcfa runnable [0x00007f4714e5a000]
   java.lang.Thread.State: RUNNABLE
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.internalPollEvents(Native Method)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.pollEvents(AsynchronousFileImpl.java:607)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.access$400(AsynchronousFileImpl.java:45)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl$PollerRunnable.run(AsynchronousFileImpl.java:830)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)

as well as the backup file-lock thread seem to be running without noticing it:

"HQ119000: Activation for server HornetQServerImpl::serverUUID=c3ee89bd-a8f5-11e2-a581-d96af0806ad1" prio=10 tid=0x00007f46e8075000 nid=0xc87 runnable [0x00007f47335ec000]
   java.lang.Thread.State: RUNNABLE
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.flock(Native Method)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.lock(AsynchronousFileImpl.java:659)
	at org.hornetq.core.server.impl.AIOFileLockNodeManager.lock(AIOFileLockNodeManager.java:82)
	at org.hornetq.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:130)
	at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreBackupActivation.run(HornetQServerImpl.java:2111)
	at java.lang.Thread.run(Thread.java:722)

Comment 37 Francisco Borges 2013-04-19 15:47:54 UTC
The last two times we tried to verify this, those loads of ReentrantReadWriteLock waits were all gone. The weird thing is that the logs at /home/jbossqa/test-nfs/server1/jboss-eap-6.1/standalone/log and /home/jbossqa/test-nfs/server2/jboss-eap-6.1/standalone/log (on the respective servers) don't show anything interesting.

Comment 38 Clebert Suconic 2013-04-24 04:32:25 UTC
This is a puzzle which I honestly don't know how to solve:

- GFS2 was broken at the time we wrote the libaio code, hence we used flock, which was the only method that worked with GFS2. We even implemented it natively, as there was no way to do it in Java without JNI; for that reason (and only that reason) we require AIO on GFS2, as it requires JNI to access it.

Now, look at this from the flock(2) man page:

NOTES
       flock() does not lock files over NFS.  Use fcntl(2) instead: that does work over NFS, given a sufficiently recent version of Linux and a server which supports locking.

So we will need fcntl with NFS!

I have gotten lots of hangs with NFS when using flock. I'm trying to debug directly in a C program now, but I'm having a bit of a challenge setting up gcc on a second laptop, so it will take me another day to set up that environment.

In my tests with the NIO journal this was supposed to work all right, but I would need QE to perform tests with the latest changes.
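As a side note on the flock-vs-fcntl distinction above: plain Java file locks (FileChannel.lock/tryLock) are, to my knowledge, implemented with POSIX fcntl-style record locks on Linux, i.e. the variant the man page recommends for NFS. So a pure-Java probe of the fcntl path could look like the sketch below; the file path argument is just an example, and this is an illustration rather than a HornetQ test:

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FcntlLockProbe {
    public static void main(String[] args) throws Exception {
        // Pass a path on the NFS mount to probe locking behaviour there;
        // without arguments it falls back to a local temp file.
        Path path = args.length > 0 ? Paths.get(args[0])
                                    : Files.createTempFile("lock-probe", ".lock");
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // On Linux the JDK implements this with POSIX record locks,
            // the mechanism flock(2)'s NOTES section recommends for NFS.
            FileLock lock = channel.tryLock();
            System.out.println(lock != null ? "lock acquired" : "lock held elsewhere");
            if (lock != null) {
                lock.release();
                System.out.println("lock released");
            }
        }
    }
}
```

Running two copies of this against the same NFS file from different clients would show whether the server honours the lock; the open question in this bug is what happens to the lock when the NFS connection drops mid-hold.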

Comment 39 Clebert Suconic 2013-04-24 05:29:01 UTC
I have written two simple C programs that use either flock or fcntl.

Neither of them releases the lock until we reconnect the system, as I have been saying since the beginning...

If you guys find a way to fence NFS like you did on GFS2, then I'm fine; otherwise this will be an unsupported scenario, where the NFS connection has to be reliable.

flock is documented not to work with NFS, but we would need to gather more information with the kernel guys.

Comment 40 Clebert Suconic 2013-04-24 05:32:19 UTC
Created attachment 739266 [details]
test-fcntl.c

Comment 41 Clebert Suconic 2013-04-24 05:36:44 UTC
Created attachment 739267 [details]
test-flock.c

Comment 42 Clebert Suconic 2013-04-24 05:39:12 UTC
In the attachments I added, test-flock.c and test-fcntl.c, you can edit the file name and try a shared-storage disconnection to see how they handle releasing the lock upon disconnection.

I'm not having much success on this front.

I'm improving on what Francisco saw in his thread dumps, as I can see some room for improvement... but I'm not sure how to find a more reliable fix without involving someone from storage or kernel on the RHEL side (at least someone with better options for the NFS export and fstab, like you guys did with fencing on GFS2).

Comment 43 Carlo de Wolf 2013-04-24 12:27:18 UTC
Created attachment 739443 [details]
Working test-flock

I'm not sure what functionality you would expect in such a while loop. You won't get an interrupt or anything when the NFS server connection dies.

I've modified the test-flock code to do what I think is the right thing.

Note that flock does seem to work correctly on NFS, contrary to the manual page.
(At least from a single client. I really need to hook up multiple clients.)

Comment 44 Carlo de Wolf 2013-04-24 12:45:55 UTC
Hooking up multiple clients works out.

Comment 45 Clebert Suconic 2013-04-24 14:42:11 UTC
I know it works, as opposed to what's documented. I'm just concerned about supporting that without clearing it with someone from the storage and IO team at RHEL.

Comment 46 Clebert Suconic 2013-04-24 15:40:47 UTC
It doesn't make any sense... my flock example is working today after a full reboot.

@mnovak: can you try with the latest changes, please?

Comment 47 Clebert Suconic 2013-04-24 15:47:44 UTC
This thing seems unstable over disconnections: sometimes it works, sometimes it doesn't... that's the same observation made by mnovak.

Comment 48 Clebert Suconic 2013-04-24 16:05:42 UTC
@mnovak: I just tried with tcp instead of udp here, and it seemed more stable. udp is a bit unstable: sometimes working, sometimes not.

You guys should change it to tcp and test it again:

192.168.1.120:/    /work/share  nfs  rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024  0 0

I will set this to ON_QA as a way of saying I am waiting for QA's input on this. We may come up with a 2.3.1 if there's anything still left here.

Comment 49 Clebert Suconic 2013-04-24 18:35:46 UTC
It all worked as expected with tcp here. udp was a bit unstable, sometimes working and sometimes not.

We need to clear with the storage guys at Red Hat that flock on NFS is a reliable option. Can anyone take the lead?

Comment 50 Miroslav Novak 2013-04-25 09:27:12 UTC
I'm confused by all the comments and not sure which options to use for NFS and HQ. There are a few things to clarify:
- officially NFS is supported with the AIO journal-type, but now it seems it's NIO, right?
- tcp or udp? (Yes, I saw that udp is unreliable in some tries too; at one moment the producer could not send new messages to the server with the PAGE policy.)
- which bits should be tested?

I need to finish some high-priority tasks this week (promised test development + one-offs), so my support for this case is limited.

Anyway, I'll give it a try with the NFS options:
192.168.1.120:/    /work/share  nfs  rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024  0 0

and journal-type=NIO, address-full-policy=PAGE, and test with the HQ master branch. I'll test again with the EAP 6.1.0.ER6 bits.

Comment 51 Miroslav Novak 2013-04-25 10:24:22 UTC
Here is what I did:
- replaced the HQ jars in EAP 6.1.0.ER5 with those from HQ master (git://github.com/hornetq/hornetq.git)
- used the NFS options: 192.168.1.120:/    /work/share  nfs  rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024  0 0
- set journal-type=NIO, address-full-policy=PAGE

The test scenario was the same: two EAP 6.1.0 servers in a colocated HA topology with clients connected to server1, which was then disconnected from the NFS server (by turning off the switch).

Result:
After the NFS server was disconnected, the JMS clients failed over to the backup on server2. So the file lock on the journal was released, the HQ live in server1 closed its port 5445, and the backup on server2 started.
Still, from the thread dump I'm not sure whether the HQ live on server1 fully shut down (check threadump_before_shutdown_server1.txt).
There is still a problem with a clean shutdown (ctrl-c) of server1. It hangs; see the thread dump threaddump_hanging_shutdown.txt. Server logs attached (server_logs_2012_04_25.zip).

I could just be lucky; I'll test again with the EAP 6.1.0.ER6 bits.

Comment 52 Miroslav Novak 2013-04-25 10:24:59 UTC
Created attachment 739808 [details]
threadump_before_shutdown_server1.txt

Comment 53 Miroslav Novak 2013-04-25 10:25:33 UTC
Created attachment 739809 [details]
threaddump_hanging_shutdown.txt

Comment 54 Miroslav Novak 2013-04-25 10:26:17 UTC
Created attachment 739810 [details]
server_logs_2012_04_25.zip

Comment 55 Clebert Suconic 2013-04-26 16:24:14 UTC
We can improve the shutdown and make sure things work smoothly for the next version.

BTW: I have had confirmation from storage that flock will work fine:

"That information is outdated. The NFS client module will simply convert the flock request to a fcntl request over the entire file and make the locking call. It should work fine." - Sachin Prabhu from our filesystem dept.

Comment 56 Dimitris Andreadis 2013-04-29 08:52:38 UTC
Can someone explain the ON_DEV status, please? I don't think we use this.

It should be either ASSIGNED (engineering is working on it) or ON_QA (stuff has been committed and we are testing the ER).

Comment 57 John Doyle 2013-04-29 17:05:07 UTC
Changing to ON_QA based on comment 51. We will have to revisit the blocker status based on test results with ER6, or with ER7 if limited time on ER6 does not allow testing.

Comment 58 Miroslav Novak 2013-05-07 12:33:11 UTC
Tried EAP 6.1.0.ER6 (HQ 2.3.0.Final) again and everything worked like a charm. The clients failed over, the HQ live on the first server shut down, and the first server could be shut down after it was disconnected from the NFSv4 server.

I'll give it a try with ASYNCIO for the journal-type.

Comment 59 Miroslav Novak 2013-05-07 13:06:45 UTC
Everything is also OK with ASYNCIO for the journal-type. Setting as verified.

Just for info, the NFS mount was:
192.168.10.4:/ on /home/jbossqa/shared_folder_nfs type nfs (rw,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024,vers=4,addr=192.168.10.4,clientaddr=192.168.10.1)

Comment 63 mark yarborough 2013-09-05 19:19:16 UTC
The info in the private email simply states that this bug was closed because it was VERIFIED in a previous release (6.1.0, based on the dates in the comments above).