Bug 916625
| Summary: | Clients won't failover when journal on NFSv4 is disconnected |
|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 |
| Component: | HornetQ |
| Version: | 6.1.0 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | urgent |
| Priority: | unspecified |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Miroslav Novak <mnovak> |
| Assignee: | Miroslav Novak <mnovak> |
| CC: | cdewolf, csuconic, dandread, francisco.borges, jdoyle, myarboro |
| Type: | Bug |
| Doc Type: | Bug Fix |
Description
Miroslav Novak
2013-02-28 14:08:40 UTC
Created attachment 703904 [details]
threaddump.txt
Created attachment 703905 [details]
server logs
Created attachment 703906 [details]
reproducer.zip
The issues seem to be:

1. It takes too long to shut down the HornetQ live server in server1 when the journal is disconnected.
2. Clients do not fail over.
3. Server1 cannot be cleanly shut down or killed; it becomes a zombie.

There isn't much we can do about it. If NFS is not reporting the error to us, that makes it an NFS bug. It's probably a configuration issue; if we get any exception back from the file IO operations, we will shut down the server based on the critical error.

The configuration should be OK. When I kill or shut down server1, the clients do fail over to server2. From threaddump.txt, which was taken after the IO critical error, it seems that the HQ server did not fully shut down:

```
"pool-7-thread-1" prio=10 tid=0x00007fa02c010000 nid=0x70bf in Object.wait() [0x00007f9ff3ca8000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000fde44e58> (a org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread)
	at java.lang.Thread.join(Thread.java:1203)
	- locked <0x00000000fde44e58> (a org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread)
	at java.lang.Thread.join(Thread.java:1256)
	at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread.close(RemotingServiceImpl.java:598)
	at org.hornetq.core.remoting.server.impl.RemotingServiceImpl.stop(RemotingServiceImpl.java:326)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:583)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:532)
	at org.hornetq.core.server.impl.HornetQServerImpl.stop(HornetQServerImpl.java:499)
```

Could that be why the clients did not fail over? The problem that clients do not fail over happens only with the colocated HA topology. Any update on this?

We will take a look at this, but I don't think there's much we can do about it. This can be left as a bug for now, but it doesn't need to be in 6.1.Beta.
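The stack trace above shows the shutdown blocked in an unbounded Thread.join() on the failure-check thread. As a standalone illustration of the difference (this is not HornetQ code; the worker thread here just simulates a thread stuck in hung IO), a bounded join returns control to the caller even when the joined thread never exits:

```java
public class JoinWithTimeoutDemo {
    /**
     * Starts a worker that never finishes and joins it with a time bound;
     * returns true if the worker is still alive after the bounded join.
     */
    static boolean boundedJoinLeavesWorkerAlive(long timeoutMs) throws InterruptedException {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // stand-in for a thread stuck on hung NFS IO
            } catch (InterruptedException ignored) {
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive because of this thread
        worker.start();
        worker.join(timeoutMs); // bounded wait; a plain join() would block forever here
        return worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker still alive after bounded join: "
                + boundedJoinLeavesWorkerAlive(200));
    }
}
```

With a plain `worker.join()` the caller would hang exactly like the `FailureCheckAndFlushThread.close` frame in the dump.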
Can you give me some instructions on how to mount NFS? I will need it for my dev env.

To configure an NFSv4 server:

1. Export a directory on the machine running the NFSv4 server by editing /etc/exports:

   ```
   /home/jbossqa/shared_folder_nfs 10.34.3.189(rw,sync,fsid=0) 10.34.3.190(rw,sync,fsid=0) 192.168.40.1(rw,sync,fsid=0)
   ```

   This line lists the IP addresses of the machines that are allowed to mount the directory /home/jbossqa/shared_folder_nfs, which is located on this machine.

2. On each machine that mounts the exported file system, add to /etc/fstab:

   ```
   <ip_address_of_nfsv4_server>:/ /home/jbossqa/shared_folder_nfs nfs4 rw,nosuid,nodev,intr,tcp,noac,soft,lookupcache=none,sloppy
   ```

3. Create the directory /home/jbossqa/shared_folder_nfs on the machine that mounts the file system.

4. Mount the file system on the client machine: `mount /home/jbossqa/shared_folder_nfs`

If something is not working, there is a doc: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/s1-nfs-server-export.html

[09:47am] mnovak: clebert, still the mount options are: messaging-01.jbm.lab.bos.redhat.com:/hornetq on /mnt/hornetq/client type nfs4 (rw,nosuid,nodev,intr,tcp,noac,soft,lookupcache=none,sloppy,addr=10.16.100.6,clientaddr=10.16.100.20)
[09:48am] But if you could still give me some information on how to make mounts.

Ah, I just saw comment 13. It wasn't visible before? Anyway, I see now, thanks.

Not sure, I may have set the comment as private when I created it. I think you need to be in the "private_comment" group to see it.

Miroslav: after spending a great deal of time on this, I can assure you that there is absolutely nothing I can do about it. The code will hang on channel.write(), and NFS will be reconnecting forever; everything just hangs (try, for instance, an ls on the NFS folder: nothing happens, it just hangs). You could maybe ask engineering to see if they could find a way to set a timeout in fstab.
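Before starting the servers it can help to sanity-check that the shared directory is actually mounted and writable. A minimal sketch (the path below is the one from the setup steps; note that on a hard-hung mount even this probe will block rather than fail, which is exactly the failure mode this bug is about):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SharedStoreCheck {
    /** Returns true if we can create and delete a probe file in the directory. */
    static boolean writable(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException e) {
            return false; // directory missing, read-only, or the mount returned an error
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "/home/jbossqa/shared_folder_nfs");
        System.out.println(dir + " writable: " + writable(dir));
    }
}
```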
I found some options on Google but I couldn't make any timeout work. Or you could take this to the kernel guys and ask them to provide an option to fail the operation in case of disconnection. If you can't find an option, either by raising an issue with the RHEL/IO guys or by configuring this better, you can always document the expectations. Nothing I can do. I will assign it back to you so you can try finding a better config with the eng-ops guys, or assign it to the doc guys later.

Please consult this with Andy; engineering should provide information about the supported NFS configuration for customers. We need to verify that it's working correctly. Right now we don't have spare time for the experiments we did before, because of the ERx releases without any space in between. The current documentation [1] is quite abstract. Documenting this behavior is our last option; we can fall back to it with PM ack. In that case the documentation should address the following issues:

1. It takes several minutes to shut down the HQ live server in server1 when the journal is disconnected.
2. Clients do not fail over.
3. Server1 cannot be cleanly shut down or killed when the NFS server is disconnected; it becomes a zombie.

[1] http://documentation-devel.engineering.redhat.com/docs/en-US/JBoss_Enterprise_Application_Platform/6-Alpha/html/Migration_Guide/sect-HornetQ_Changes.html

As I said, the issue is not in HornetQ; the issue is either that NFS is not properly configured, or that NFS is not configured at all. You first have to check whether there isn't a better config in which the kernel would return errors to the IO operation.

I had a typo in my last comment, and I don't think there's a way to post-edit it in BZ, so this is just to adjust what I said:

As I said, the issue is not in HornetQ; the issue is either that NFS is not properly configured, or that NFS *does not support such a scenario*. You first have to check whether there isn't a better config in which the kernel would return errors to the IO operation.
With the following setting I get a proper input/output error after 12 seconds on RHEL 6.4:

```
192.168.30.128:/tmp /mnt nfs rw,soft,udp,timeo=2,retrans=5,rsize=1024,wsize=1024 2 2
```

Note that the real differentiator is 'udp'. Normally NFS uses tcp with all of its associated timeouts, and only after tcp times out are the NFS timeout options considered. The other options are there to show what we are getting into. Using udp as the transport protocol is not efficient by default and needs to be tuned: I've opted for a low timeout threshold with a couple more retransmissions to counter occasional packet loss. The rsize and wsize need to be set to a multiple of the MTU of the network interface used; in a real production scenario the MTU needs to be tuned first.

Oops, that should be:

```
192.168.30.128:/tmp /mnt nfs rw,soft,udp,timeo=2,retrans=5,rsize=1024,wsize=1024 0 0
```

Created attachment 733212 [details]
threaddump.txt (2012-04-09)
We did some testing with the last mount options. The execution path seems to be different when clients are connected to the server, and the "IO critical error" cannot be and is not thrown. The issue appears to be in PagingStoreImpl.page (PagingStoreImpl.java:846), where the server hangs on:

```java
// We need to ensure a read lock, as depage could change the paging state
lock.readLock().lock(); // line 846
```

when the NFS server is disconnected. There are several threads waiting forever for the lock. From threaddump.txt (2012-04-09) it is not possible to see who is holding the lock. Could dev take a look at it, please?
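The hang pattern is easy to reproduce in isolation: if the thread holding the write lock never releases it (simulated below with a sleeping thread; in the bug it is a sync stuck on a dead NFS mount), every plain readLock().lock() call blocks forever, while a timed tryLock at least fails fast. This is only an illustration of the locking behaviour, not HornetQ code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockHangDemo {
    /** Returns whether a read lock could be acquired while another thread holds the write lock. */
    static boolean tryReadWhileWriterStuck(long timeoutMs) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writerHoldsLock = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            lock.writeLock().lock(); // simulates the paging writer stuck in a hung NFS sync
            writerHoldsLock.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE); // never releases the write lock
            } catch (InterruptedException ignored) {
            }
        });
        writer.setDaemon(true);
        writer.start();
        writerHoldsLock.await();
        // lock.readLock().lock() here would block forever, like PagingStoreImpl.page();
        // a timed tryLock gives up instead of hanging.
        return lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("read lock acquired: " + tryReadWhileWriterStuck(200));
    }
}
```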
I see the same issue there. This thread here is forcing the FileChannel and not getting an exception back from NFS. I can't do anything if the kernel just hangs on IO calls:

```
"Thread-2 (HornetQ-scheduled-threads-779325750)" prio=10 tid=0x00007f086c179000 nid=0xd05 runnable [0x00007f08357cc000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.FileChannelImpl.force0(Native Method)
	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)
	at org.hornetq.core.journal.impl.NIOSequentialFile.sync(NIOSequentialFile.java:226)
	at org.hornetq.core.paging.impl.Page.sync(Page.java:220)
	at org.hornetq.core.paging.impl.PagingStoreImpl.ioSync(PagingStoreImpl.java:337)
	at org.hornetq.core.paging.impl.PageSyncTimer.tick(PageSyncTimer.java:94)
	at org.hornetq.core.paging.impl.PageSyncTimer.access$000(PageSyncTimer.java:29)
	at org.hornetq.core.paging.impl.PageSyncTimer$1.run(PageSyncTimer.java:48)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
```

I'm assigning this back to mnovak as a way to decide what to do next. I think QA should do these two tasks:

- raise an issue with eng-ops
- raise an issue with the storage guys, and possibly with the kernel guys on the RHEL side

Depending on the outcome, we document our expectations.

It should be easy to write a ten-line program to show the behaviour with an NFS timeout: write, sync, and see it hang.

```java
// [root@node02 ~]# more Main.java
import java.io.*;
import java.nio.*;
import java.nio.channels.*;

public class Main {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("/mnt/dummy", "rw");
        FileChannel channel = file.getChannel();
        ByteBuffer buf = ByteBuffer.allocate(1024);
        int i = 0;
        while (true) {
            System.out.println(i++);
            for (int p = 0; p < 1024; p++) {
                buf.put((byte) p);
            }
            buf.flip();
            channel.position(0);
            channel.write(buf);
            channel.force(false); // fsync; this is where the NFS error surfaces
            buf.flip();
            Thread.sleep(1000);
        }
    }
}
```
```
[root@node02 ~]# java Main
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Exception in thread "main" java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:75)
	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:363)
	at Main.main(Main.java:19)
```
I pulled the plug after 16 seconds and waited ~15 seconds.
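The timings line up with the mount options: nfs(5) documents timeo in tenths of a second, and for soft UDP mounts the retransmission wait is commonly doubled on each retry (the simple-doubling model used below is an assumption for the estimate, not something stated in this bug). With timeo=2 and retrans=5 that gives roughly 0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 = 12.6 seconds before the I/O error, consistent with the ~12-15 seconds observed above:

```java
public class NfsTimeoutEstimate {
    /**
     * Rough total wait before a soft UDP NFS mount gives up with EIO,
     * assuming the timeout doubles per retransmission.
     * timeoTenths is in tenths of a second, as in nfs(5).
     */
    static double totalTimeoutSeconds(int timeoTenths, int retrans) {
        double wait = timeoTenths / 10.0;
        double total = 0.0;
        for (int attempt = 0; attempt <= retrans; attempt++) {
            total += wait;
            wait *= 2; // minor-timeout backoff
        }
        return total;
    }

    public static void main(String[] args) {
        // timeo=2, retrans=5 from the fstab line used in the test above
        System.out.printf("estimated time to EIO: %.1f s%n", totalTimeoutSeconds(2, 5));
    }
}
```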
Hmmm, you're right. I'm missing a call to the IO Critical Listener through the paging sync.

@mnovak: Can you try with a patched version, please? I've sent you instructions on how to get the build.

We did the testing, with the same results.

I have given you a branch (mnovak on my github) that will give you some debug output on when/if an exception is being captured, if you could try that and analyze the logs. Anyway, we should defer this issue.

Created attachment 737562 [details]
threaddump_mnovak branch (2013-04-19)
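The fix direction discussed above, catching the IOException at the sync point and routing it to the critical-error listener instead of letting it vanish, can be sketched generically. The IOErrorListener interface below is hypothetical, a stand-in for HornetQ's internal callback, and a closed channel stands in for a dead mount that actually returns an error:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicBoolean;

public class CriticalIOErrorDemo {
    /** Hypothetical stand-in for HornetQ's internal critical IO error callback. */
    interface IOErrorListener {
        void onIOError(IOException e, String context);
    }

    /** Syncs the channel; routes any IO failure to the listener instead of losing it. */
    static boolean pageSync(FileChannel channel, IOErrorListener listener) {
        try {
            channel.force(false);
            return true;
        } catch (IOException e) {
            listener.onIOError(e, "page sync");
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("page", ".dat");
        FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE);
        channel.close(); // a closed channel makes force() throw, like an erroring mount
        AtomicBoolean signalled = new AtomicBoolean(false);
        pageSync(channel, (e, context) -> signalled.set(true));
        System.out.println("IO error routed to listener: " + signalled.get());
        Files.deleteIfExists(tmp);
    }
}
```

Note that this pattern only helps when the kernel returns an error at all; a hard-hung mount never reaches the catch block, which is the core of this bug.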
Some comments on this case:

1. From Carlo de Wolf's comment (https://bugzilla.redhat.com/show_bug.cgi?id=916625#c29) I take it that we are missing error handling for IO errors. Clebert added a missing one in https://github.com/hornetq/hornetq/commit/d98cce79c922f566d5bf3e1d163acf1d9b5ed53a, but there are other points where we propagate IOExceptions without calling "factory.onIOError(...)".

2. Miroslav just tried to reproduce this and got a server hanging at server.stop(). It turns out that we are not setting criticalIOError:=true when shutting down the server on an IO error. While this is a different bug, it affects this scenario (an NFS disconnect should trigger a shutdown on IO error).

I set up a PR to address the lack of criticalIOError:=true when an IO error is detected: https://github.com/hornetq/hornetq/pull/1008. I also added a lot more calls to `factory.onIOError` on IOExceptions. These can be seen on this branch of mine: https://github.com/FranciscoBorges/hornetq/tree/nfsHang

Right now we have servers running with the NFS taken out. Both the eventPoller thread:

```
"Thread-2 (HornetQ-AIO-poller-pool1084824280-1960129093)" daemon prio=10 tid=0x00007f46e402d800 nid=0xcfa runnable [0x00007f4714e5a000]
   java.lang.Thread.State: RUNNABLE
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.internalPollEvents(Native Method)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.pollEvents(AsynchronousFileImpl.java:607)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.access$400(AsynchronousFileImpl.java:45)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl$PollerRunnable.run(AsynchronousFileImpl.java:830)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
```

as well as the backup file lock thread seem to be running without noticing it:

```
"HQ119000: Activation for server HornetQServerImpl::serverUUID=c3ee89bd-a8f5-11e2-a581-d96af0806ad1" prio=10 tid=0x00007f46e8075000 nid=0xc87 runnable [0x00007f47335ec000]
   java.lang.Thread.State: RUNNABLE
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.flock(Native Method)
	at org.hornetq.core.asyncio.impl.AsynchronousFileImpl.lock(AsynchronousFileImpl.java:659)
	at org.hornetq.core.server.impl.AIOFileLockNodeManager.lock(AIOFileLockNodeManager.java:82)
	at org.hornetq.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:130)
	at org.hornetq.core.server.impl.HornetQServerImpl$SharedStoreBackupActivation.run(HornetQServerImpl.java:2111)
	at java.lang.Thread.run(Thread.java:722)
```

The last two times we tried to verify this, those loads of ReentrantReadWriteLock were all gone. The weird thing is that the logs at /home/jbossqa/test-nfs/server1/jboss-eap-6.1/standalone/log and /home/jbossqa/test-nfs/server2/jboss-eap-6.1/standalone/log (on the respective servers) don't show anything interesting. This is a puzzle which I honestly don't know how to solve:
- GFS2 was broken at the time we wrote libaio, hence we used flock, which was the only method that worked with GFS2. We even implemented it natively, as there was no way to do it in Java without JNI; for that reason (and only that reason) we require AIO on GFS2, since accessing it requires JNI.
Now, look at this from the flock(2) man page:

```
NOTES
       flock() does not lock files over NFS.  Use fcntl(2) instead: that does
       work over NFS, given a sufficiently recent version of Linux and a
       server which supports locking.
```
So, we will need fcntl with NFS!
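From the Java side there may already be an fcntl route: on Linux, OpenJDK implements FileChannel.lock()/tryLock() with POSIX record locks (fcntl), which is the mechanism the man page recommends for NFS. That implementation detail is a reasonable but JVM-specific assumption, not something stated in this bug. A minimal sketch of taking a server lock file this way:

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FcntlLockDemo {
    /** Tries to take an exclusive lock on the file once; returns whether it was obtained. */
    static boolean tryExclusiveLock(Path lockFile) throws Exception {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock(); // POSIX record lock on Linux OpenJDK
            if (lock == null) {
                return false; // another process holds the lock
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("server", ".lock");
        System.out.println("lock obtained: " + tryExclusiveLock(lockFile));
        Files.deleteIfExists(lockFile);
    }
}
```

Of course this has the same limitation as everything else here: whether the lock is released when the NFS connection dies is up to the kernel, not the JVM.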
I have gotten lots of hangs with NFS when using flock. I'm trying to debug directly in a C program now, but I'm having a bit of a challenge setting up gcc on a second laptop, so it will take me another day to set up that environment.
In my tests with journal NIO this was supposed to work all right. But I would need QE to perform tests with the latest changes.
I have written two simple C programs that use either flock or fcntl. Neither of them releases the locks until we reconnect the system, as I have been saying since the beginning. If you guys find a way to fence NFS like you did with GFS2, then I'm fine; otherwise this will be an unsupported scenario, where the NFS connection has to be reliable. flock is documented not to work with NFS, but we would need to gather more information from the kernel guys. Created attachment 739266 [details]
test-fcntl.c
Created attachment 739267 [details]
test-flock.c
In the attachments I added, test-flock.c and test-fcntl.c, you can edit the file name and try a shared-storage disconnection to see how they handle releasing the lock upon disconnection. I'm not having much success on this front. I'm improving on what Francisco saw in his thread dumps, as I can see some room for improvement, but I'm not sure how to find a more reliable fix without involving someone from storage or kernel on the RHEL side (at least someone with better options for the NFS export and fstab, like you had with fencing on GFS2). Created attachment 739443 [details]
Working test-flock
I'm not sure what functionality you would expect in such a while loop. You won't get an interrupt or something when the NFS server connection dies.
I've modified the test-flock code to do, what I think is, the right thing.
Note that flock does seem to work correctly on NFS, contrary to the manual page.
(At least from a single client. I really need to hook up multiple clients.)
Hooking up multiple clients works out. I know it works, opposed to what's documented; I'm just concerned about supporting that without clearing it with someone from the storage and IO team at RHEL.

It doesn't make any sense... my flock example is working today, after a full reboot. @mnovak: can you try with the latest changes, please?

This thing seems unstable over disconnection: sometimes it works, sometimes it doesn't. That's the same observation made by mnovak.

@mnovak: I just tried with tcp instead of udp here, and it seemed more stable. udp is a bit unstable, sometimes working, sometimes not. You guys should change it to tcp and test again:

```
192.168.1.120:/ /work/share nfs rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024 0 0
```

I will be waiting for QA's input on this. I will set this to ON_QA as a way of saying I am waiting for QA's input. We may come up with 2.3.1 if there's anything still around here.

It all worked as expected with tcp here; udp was a bit unstable, working or not working. We need to clear with the storage guys at Red Hat that flock on NFS is a reliable option. Can anyone take the lead?

I'm confused by all the comments and not sure which options for NFS and HQ to use. There are a few things to clarify:

- Officially NFS is supported with the AIO journal-type, but now it seems it's NIO, right?
- tcp or udp? (Yes, I saw that udp is unreliable in some tries too; at one moment the producer could not send new messages to the server with the PAGE policy.)
- Which bits should be tested?

I need to finish some high-priority tasks this week (promised test development plus one-offs), so my support for this case is limited. Anyway, I'll give it a try with NFS options:

```
192.168.1.120:/ /work/share nfs rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024 0 0
```

and journal-type=NIO, address-full-policy=PAGE, and test with the HQ master branch. I'll test again with EAP 6.1.0.ER6 bits.
Here is what I did:

- replaced the HQ jars in EAP 6.1.0.ER5 with jars built from HQ master (git://github.com/hornetq/hornetq.git)
- used NFS options: 192.168.1.120:/ /work/share nfs rw,nosuid,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024 0 0
- set journal-type=NIO, address-full-policy=PAGE

The test scenario was the same: two EAP 6.1.0 servers in a colocated HA topology with clients connected to server1, which was then disconnected from the NFS server (by turning off the switch).

Result: after disconnecting the NFS server, the JMS clients failed over to the backup on server2. So the file lock on the journal was released, the HQ live server in server1 closed its port 5445, and the backup on server2 started. Still, from the thread dump I'm not sure whether the HQ live server on server1 fully shut down (check threadump_before_shutdown_server1.txt).

There is still a problem with a clean shutdown (Ctrl-C) of server1: it hangs; see the thread dump threaddump_hanging_shutdown.txt. Server logs attached (server_logs_2012_04_25.zip). I could be just lucky; I'll test again with EAP 6.1.0.ER6 bits.

Created attachment 739808 [details]
threadump_before_shutdown_server1.txt
Created attachment 739809 [details]
threaddump_hanging_shutdown.txt
Created attachment 739810 [details]
server_logs_2012_04_25.zip
We can improve the shutdown and make sure things work smoothly in the next version.

BTW: I have had confirmation from storage that flock will work fine: "That information is outdated. The NFS client module will simply convert the flock request to a fcntl request over the entire file and make the locking call. It should work fine." - Sachin Prabhu from our filesystem dept.

Can someone explain the ON_DEV status, please? I don't think we use this. It should either be ASSIGNED (engineering working on it) or ON_QA (stuff has been committed and we are testing the ER).

Changing to ON_QA based on comment 51. We will have to revisit the blocker status based upon test results with ER6 or ER7, if the limited time on ER6 does not allow testing.

Tried again with EAP 6.1.0.ER6 (HQ 2.3.0.Final) and everything worked like a charm. Clients failed over, the HQ live server on the first server shut down, and the first server could be shut down after it was disconnected from the NFSv4 server. I'll give it a try with ASYNCIO for the journal-type.

With ASYNCIO for the journal-type everything is OK as well. Setting as verified. Just for info, the NFS mount was:

```
192.168.10.4:/ on /home/jbossqa/shared_folder_nfs type nfs (rw,soft,tcp,timeo=2,retrans=5,rsize=1024,wsize=1024,vers=4,addr=192.168.10.4,clientaddr=192.168.10.1)
```

The info in the private email simply states that this bug was closed based on the fact that it was VERIFIED in a previous release (6.1.0, based on the dates in the comments above).