Please find attached the core dump of node02:
ftp://heimkinder.dyndns.org/core_node02.tar.bz2
Hi all, first of all thanks to everybody pushing GlusterFS forward. There have been many new features and improvements over the last years.

To my problem: I run a small scientific cluster (3 nodes) where every node provides 1.5 TB of local disk space, interlinked with InfiniBand and/or TCP Ethernet. I've tried to aggregate this disk space into one large and fast striped Gluster filesystem over IB and TCP. The use case of this system is heavy I/O load, mostly mpi_write and mpi_read, due to memory-consuming particle-in-cell simulations.

I had already observed strange behavior with the stripe translator in the earlier GlusterFS version 3.0.5. I'm using a special OpenMPI-based PIC code, which writes its output data in parallel using native mpi_write calls. With version 3.0.5 I saw data corruption: the parallel output files were missing a lot of data when the stripe translator was used. For example, every node writes some data to disk with mpi_write; node01 does this correctly, but node02 and node03 "forget" to write their data. Most of the time 2/3 of the output data was missing from the file, although the file had the size I expected. It looks as if there were holes in the file where the output of the two other nodes should be. The strange thing is that sometimes the output file was correct and usable, and in about one out of two cases the data was corrupted.

So I thought I would give the new GlusterFS version 3.2.3 a try. I tested the same scenario mentioned above, on the one hand with the distribute translator and on the other hand with the stripe translator. With distribute everything works nicely: all output files are correct, no data corruption. As soon as I switch to the stripe translator, the glusterfs client crashes after the second output file written with mpi_write (approx. after 300 MB). But it crashes only on node02 and node03, and after that the socket on node02 and node03 isn't reachable anymore.
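To make the write pattern clearer, it boils down to something like the minimal sketch below. This is not the actual PIC code, and the mount path /mnt/data-global as well as the per-rank chunk size are only placeholders: every rank opens the shared output file collectively and writes its own contiguous region with MPI-IO, so if the writes from the ranks on node02/node03 are lost, exactly the kind of holes described above appear.

/* Minimal sketch of the parallel output pattern (not the real PIC code;
 * the mount path and the chunk size are placeholders). Each rank writes
 * its own contiguous region of one shared file via MPI-IO. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ~13 MB per rank; with 24 ranks this gives roughly the ~310 MB file
     * mentioned above. */
    const int chunk = 13 * 1024 * 1024;
    char *buf = malloc(chunk);
    memset(buf, 'A' + (rank % 26), chunk);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/mnt/data-global/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write, each rank at its own offset. If the data of some
     * ranks never reaches the bricks, the file keeps its expected size
     * but contains holes where those regions should be. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * chunk, buf, chunk,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and started with something like mpirun -np 24 across the three nodes, this mimics a single output step of the simulation.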
The log from node02 looks like this:

Given volfile:
+------------------------------------------------------------------------------+
  1: volume data-global-client-0
  2:     type protocol/client
  3:     option remote-host ppclus01
  4:     option remote-subvolume /export/sda2/striped
  5:     option transport-type tcp
  6: end-volume
  7: 
  8: volume data-global-client-1
  9:     type protocol/client
 10:     option remote-host ppclus02
 11:     option remote-subvolume /export/sda2/striped
 12:     option transport-type tcp
 13: end-volume
 14: 
 15: volume data-global-client-2
 16:     type protocol/client
 17:     option remote-host ppclus03
 18:     option remote-subvolume /export/sda2/striped
 19:     option transport-type tcp
 20: end-volume
 21: 
 22: volume data-global-stripe-0
 23:     type cluster/stripe
 24:     subvolumes data-global-client-0 data-global-client-1 data-global-client-2
 25: end-volume
 26: 
 27: volume data-global-write-behind
 28:     type performance/write-behind
 29:     subvolumes data-global-stripe-0
 30: end-volume
 31: 
 32: volume data-global-read-ahead
 33:     type performance/read-ahead
 34:     subvolumes data-global-write-behind
 35: end-volume
 36: 
 37: volume data-global-io-cache
 38:     type performance/io-cache
 39:     subvolumes data-global-read-ahead
 40: end-volume
 41: 
 42: volume data-global-quick-read
 43:     type performance/quick-read
 44:     subvolumes data-global-io-cache
 45: end-volume
 46: 
 47: volume data-global-stat-prefetch
 48:     type performance/stat-prefetch
 49:     subvolumes data-global-quick-read
 50: end-volume
 51: 
 52: volume data-global
 53:     type debug/io-stats
 54:     option latency-measurement off
 55:     option count-fop-hits off
 56:     subvolumes data-global-stat-prefetch
 57: end-volume
+------------------------------------------------------------------------------+

pending frames:
frame : type(1) op(STATFS)
frame : type(1) op(STATFS)
frame : type(1) op(STATFS)
frame : type(1) op(STATFS)
frame : type(1) op(STATFS)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2011-09-02 23:39:14
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.2.3
/lib/libc.so.6[0x2b9dbf4a4f60]
/usr/lib/glusterfs/3.2.3/xlator/protocol/client.so(client3_1_mknod_cbk+0x92)[0x2aaaab1c6232]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0xa4)[0x2b9dbec17894]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0xcd)[0x2b9dbec17acd]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x27)[0x2b9dbec12937]
/usr/lib/glusterfs/3.2.3/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2aaaaad70c4f]
/usr/lib/glusterfs/3.2.3/rpc-transport/socket.so(socket_event_handler+0x148)[0x2aaaaad70db8]
/usr/lib/libglusterfs.so.0[0x2b9dbe9e5faf]
/usr/sbin/glusterfs(main+0x25a)[0x40621a]
/lib/libc.so.6(__libc_start_main+0xe6)[0x2b9dbf4911a6]
/usr/sbin/glusterfs[0x403a99]
---------

Any idea how to solve this problem? It's really important for me to use the stripe translator combined with InfiniBand to reach maximum disk performance. I've tested both TCP sockets and InfiniBand, but the gluster client crashes in both cases.

Thanks and kind regards,
Oliver Deppert
Hi Oliver, Can you please provide the server/client logs? (If they are large in size, please provide logs in proximity of the crash).
Created attachment 657
Created attachment 658
tar file containing files to reproduce compiler bug
(In reply to comment #4)
> Hi Oliver,
>
> Can you please provide the server/client logs? (If they are large in size,
> please provide logs in proximity of the crash).

Hi,

I've mounted GlusterFS with log-level=ERROR; attached you'll find the logs of client2 and server2. As I mentioned, there isn't anything unusual on the server side; just one of the clients crashes, sometimes client2 and sometimes client3, but never the client where I started the MPI job. In my configuration every node is both server and client. As soon as I remount the client2 side, everything is reachable again.
Hi Oliver,

We are unable to reproduce the issue in-house. I have a few questions, though:

1. Is the problem reproduced every time you use the stripe xlator? (If yes, can you provide a test case, if available?)
2. Are the backends clean (no data) when the stripe volume is created and mounted?

If possible, can we have access to the machine where the crash happened to investigate it further?
(In reply to comment #8)
> Hi Oliver,
>
> We are unable to reproduce the issue in-house. I have a few questions, though:
>
> 1. Is the problem reproduced every time you use the stripe xlator? (If yes,
>    can you provide a test case, if available?)
> 2. Are the backends clean (no data) when the stripe volume is created and
>    mounted?
>
> If possible, can we have access to the machine where the crash happened to
> investigate it further?

Hi,

To 1: Yes, it is reproducible every time I use the stripe xlator. The PIC code writes its output for the first time step (one file written with mpi_write by 24 cores, approx. 310 MB, without any problems); several minutes later the second time step (77 MB) is written successfully... but every time, at the fourth time step, client2 or client3 crashes. I tried to isolate the problem in a kind of test case, but so far without success; my usual test case for mpi_write on distributed/striped filesystems seems to work. But I'll see what I can do (a rough sketch of what I have in mind follows below). I also tested with the distribute translator; there everything is fine, no problems at all.

To 2: Yes, the backends are clean. It's a fresh and clean setup with the stripe xlator and standard settings, created by the "new" glusterd daemon, like "gluster volgen translator stripe transport tcp ...".

The cluster is behind the firewall of a scientific research centre, so direct access via SSH isn't possible; you would need an account for our Linux farm. What I can manage is an outgoing VNC connection from our cluster. For that you would need a TurboVNC client in listen mode, and I'll need the IP of your machine that listens. Then I can give you VNC access as an observer. I'll have to control the VNC connection; I can show you both the working distributed system and the failing striped system. You can observe and direct further actions by telling me what to do...

regards,
Oliver
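The reproducer sketch mentioned above boils down to something like the following. Again, this is only a sketch: the mount path /mnt/data-global, the chunk sizes and the number of dumps are placeholders, not the real job parameters. It is the same per-rank offset write as in the sketch further up, just repeated for several output dumps, since the crash only shows up around the fourth one.

/* Hypothetical reproducer sketch: several "time steps", each writing one
 * shared dump file with MPI-IO, similar to the PIC output. Paths and
 * sizes are placeholders; with 24 ranks the first dump is ~312 MB and
 * the later ones are smaller. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void write_dump(int step, int rank, int chunk_bytes)
{
    char path[256];
    snprintf(path, sizeof(path), "/mnt/data-global/dump_%03d.dat", step);

    char *buf = malloc(chunk_bytes);
    memset(buf, '0' + (rank % 10), chunk_bytes);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * chunk_bytes, buf,
                          chunk_bytes, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    free(buf);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* First dump large, later ones smaller; on the striped volume the
     * client reportedly crashes around the fourth dump. */
    for (int step = 0; step < 6; step++) {
        int chunk_bytes = (step == 0) ? 13 * 1024 * 1024 : 3 * 1024 * 1024;
        write_dump(step, rank, chunk_bytes);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}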
CHANGE: http://review.gluster.com/526 (This will be destroyed in cbk. Porting commit) merged in release-3.1 by Vijay Bellur (vijay)
*** Bug 3613 has been marked as a duplicate of this bug. ***
*** Bug 3632 has been marked as a duplicate of this bug. ***