Created attachment 339
Attached is the server log.
Is a core file available? It should normally be in /. If yes, we would like to see a backtrace from the core.

Thanks,
Vijay
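For reference, a typical way to locate the core and pull an initial backtrace (this assumes the kernel's default core_pattern, so the core lands in / with a PID suffix; the PID below is a placeholder):

sh$ ls /core*
sh$ gdb /usr/sbin/glusterfs /core.12345
(gdb) bt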
Created attachment 340
(In reply to comment #2)
> Is a core file available? It should normally be in /. If yes, we would like
> to see a backtrace from the core.

Vijay,

I found the core file. It is 500 MB, and when I tried to upload it, the upload failed with an error after close to an hour. I have installed gdb and run a bt on the file:

(gdb) bt
#0  0x0000003f5da2e21d in raise () from /lib64/tls/libc.so.6
#1  0x0000003f5da2fa1e in abort () from /lib64/tls/libc.so.6
#2  0x0000003f5da27ae1 in __assert_fail () from /lib64/tls/libc.so.6
#3  0x00002b2f1d4ee537 in mem_put (pool=Variable "pool" is not available.) at mem-pool.c:463
#4  0x00002aaaabb1379d in rdma_handshake_pollerr (this=0x72dd948) at rdma.c:314
#5  0x00002aaaabb13b3e in rdma_event_handler (fd=Variable "fd" is not available.) at rdma.c:4358
#6  0x00002b2f1d4ed7cb in event_dispatch_epoll (event_pool=0x7248048) at event.c:812
#7  0x0000000000405416 in main ()

Let me know if there is any more information I can pull from this file or if there is a better place I can put the file.

Mike
Thanks, Mike. This should do for now. We will get back to you if we need more information. If you could preserve the core until we root-cause this, that would be great.

Thanks,
Vijay
The servers are Dell PE1950 with 16 GB of RAM running RHEL 4 Update 5. The backend storage is a DDN S2A9550 controller pair attached via Fibre Channel, 1 brick per server. The InfiniBand card is:

0c:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)

[root@iosrv-7-1 ~]# gluster volume info

Volume Name: gluster-test
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: rdma
Bricks:
Brick1: iosrv-7-3:/mnt/brick3
Brick2: iosrv-7-4:/mnt/brick4
Brick3: 172.16.2.4:/mnt/brick1
Brick4: iosrv-7-2:/mnt/brick2

The clients are all Dell, mostly PE1950 with either 16 or 32 GB of RAM, running RHEL 5.4. I have attached the current log file from /var/log/glusterfs/ on the head node.

My most recent action was to stop the volume, unmount everywhere, kill any remaining glusterfs processes on the servers, then restart the server and mount the client on the head node of the cluster. ls of the mount is hung. The server process just crashed with the following log:

[2010-10-07 09:37:00.63148] E [rdma.c:4350:rdma_event_handler] rpc-transport/rdma: gluster-test-client -0: pollin received on tcp socket (peer: 172.16.2.5:6971) after handshake is complete

pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LOOKUP)

patchset: v3.1.0qa7-389-gc5a5fea
signal received: 6
time of crash: 2010-10-07 09:37:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0beta
/lib64/tls/libc.so.6[0x3f5da2e2b0]
/lib64/tls/libc.so.6(gsignal+0x3d)[0x3f5da2e21d]
/lib64/tls/libc.so.6(abort+0xfe)[0x3f5da2fa1e]
/lib64/tls/libc.so.6(__assert_fail+0xf1)[0x3f5da27ae1]
/usr/lib64/libglusterfs.so.0[0x2b2f1d4ee537]
/usr/lib64/glusterfs/3.1.0beta/rpc-transport/rdma.so[0x2aaaabb1379d]
/usr/lib64/glusterfs/3.1.0beta/rpc-transport/rdma.so[0x2aaaabb13b3e]
/usr/lib64/libglusterfs.so.0[0x2b2f1d4ed7cb]
/usr/sbin/glusterfs(main+0x326)[0x405416]
/lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x3f5da1c3fb]
/usr/sbin/glusterfs[0x40313a]
---------
(In reply to comment #4)
> I found the core file. It is 500 MB, and when I tried to upload it, the
> upload failed with an error after close to an hour. I have installed gdb
> and run a bt on the file.

Hi Mike,

Can you attach gdb to the core and execute the command 'thr apply all bt full'? This command will give us more information, which will help us in debugging.

regards,
Raghavendra
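For reference, that output can also be captured non-interactively into a file; the binary and core paths below are placeholders:

sh$ gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterfs /core.12345 > bt-full.txt 2>&1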
Hi Mike,

Can you also give the command lines you used to start glusterfs (glusterfs, glusterfsd and glusterd) on each of the clients and servers?

regards,
Raghavendra
(In reply to comment #7)
> Can you also give the command lines you used to start glusterfs (glusterfs,
> glusterfsd and glusterd) on each of the clients and servers?

Here is the full backtrace:

(gdb) thr apply all bt full

Thread 12 (process 17931):
#0  0x0000003f5e10c28b in do_sigwait () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x0000003f5e10c33d in sigwait () from /lib64/tls/libpthread.so.0
No symbol table info available.
#2  0x0000000000404f9d in ?? ()
No symbol table info available.
#3  0x0000003f5e1060a0 in __free_tcb () from /lib64/tls/libpthread.so.0
No symbol table info available.
#4  0x000000005e10d280 in ?? ()
No symbol table info available.
#5  0x0000000000004a03 in ?? ()
No symbol table info available.
#6  0x0000003f5e10d280 in _thread_db_sizeof_pthread () from /lib64/tls/libpthread.so.0
No symbol table info available.
#7  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 11 (process 17933):
#0  0x0000003f5da8fc45 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
No symbol table info available.
#1  0x0000003f5dac08f8 in usleep () from /lib64/tls/libc.so.6
No symbol table info available.
#2  0x00002b2f1d4dca8d in gf_timer_proc (ctx=0x7246010) at timer.c:182
        now = 1286465819254181
        now_tv = {tv_sec = 1286465819, tv_usec = 254181}
        event = (gf_timer_t *) 0x2aaab9fc4c58
        reg = (gf_timer_registry_t *) 0x724c8b8
        __FUNCTION__ = "gf_timer_proc"
#3  0x0000003f5e106137 in start_thread () from /lib64/tls/libpthread.so.0
No symbol table info available.
#4  0x0000003f5dac7113 in clone () from /lib64/tls/libc.so.6
No symbol table info available.

Thread 10 (process 17934):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb13375 in rdma_send_completion_proc (data=0x724e8a0) at rdma.c:3430
        chan = Variable "chan" is not available.

Thread 9 (process 17935):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb12ee2 in rdma_recv_completion_proc (data=0x7266e10) at rdma.c:3261
        chan = Variable "chan" is not available.

Thread 8 (process 17936):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb13375 in rdma_send_completion_proc (data=0x725e100) at rdma.c:3430
        chan = Variable "chan" is not available.

Thread 7 (process 17937):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb12ee2 in rdma_recv_completion_proc (data=0x726dfd0) at rdma.c:3261
        chan = Variable "chan" is not available.

Thread 6 (process 17938):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb13375 in rdma_send_completion_proc (data=0x72ba050) at rdma.c:3430
        chan = Variable "chan" is not available.

Thread 5 (process 17939):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb12ee2 in rdma_recv_completion_proc (data=0x72ba070) at rdma.c:3261
        chan = Variable "chan" is not available.

Thread 4 (process 17940):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb13375 in rdma_send_completion_proc (data=0x72befd0) at rdma.c:3430
        chan = Variable "chan" is not available.

Thread 3 (process 17941):
#0  0x0000003f5e10b19f in __read_nocancel () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000030a930934c in ibv_get_cq_event () from /usr/lib64/libibverbs.so.1
No symbol table info available.
#2  0x00002aaaabb12ee2 in rdma_recv_completion_proc (data=0x72e1350) at rdma.c:3261
        chan = Variable "chan" is not available.

Thread 2 (process 17944):
#0  0x0000003f5e10aecb in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
No symbol table info available.
#1  0x00000000072dddd0 in ?? ()
No symbol table info available.
#2  0x00002aaabfb010b4 in ?? ()
No symbol table info available.
#3  0x0000003f5e107c94 in pthread_mutex_lock () from /lib64/tls/libpthread.so.0
No symbol table info available.
#4  0x0000000000000004 in ?? ()
No symbol table info available.
#5  0x00002aaaba3b2528 in ?? ()
No symbol table info available.
#6  0x00000000072ddd10 in ?? ()
No symbol table info available.
#7  0x00002aaabfb01034 in ?? ()
No symbol table info available.
#8  0x00002b2f1d4ef36b in iobref_unref (iobref=0x72dd948) at iobuf.c:531
No locals.
#9  0x00000000ffffffff in ?? ()
No symbol table info available.
#10 0x00000000072e05a8 in ?? ()
No symbol table info available.
#11 0x00002aaaabb1033b in __rdma_ioq_churn_entry (peer=0x2b2f1d4ee483, entry=0x2aaab307b280) at rdma.c:1683
        ret = 120446224
        priv = (rdma_private_t *) 0xfffffffffffffffc
        device = (rdma_device_t *) 0x72dddd0
        options = (rdma_options_t *) 0x2aaaba367db8
        post = (rdma_post_t *) 0xfffffffffffffffc
#12 0x00002aaaabb136f4 in rdma_disconnect (this=0x72ddcf8) at rdma.c:3901
        ret = Variable "ret" is not available.

Thread 1 (process 17930):
#0  0x0000003f5da2e21d in raise () from /lib64/tls/libc.so.6
No symbol table info available.
#1  0x0000003f5da2fa1e in abort () from /lib64/tls/libc.so.6
No symbol table info available.
#2  0x0000003f5da27ae1 in __assert_fail () from /lib64/tls/libc.so.6
No symbol table info available.
#3  0x00002b2f1d4ee537 in mem_put (pool=Variable "pool" is not available.) at mem-pool.c:463
        head = Variable "head" is not available.
(gdb)

gdb claims this corefile was generated by this command line:

Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster-test --volume-name gluste'.
All I ever did on the servers was run '/etc/init.d/glusterd start'.

On the client I used mount with the following fstab entry:

iosrv-7-1:gluster-test /mnt/gluster-test glusterfs defaults,_netdev 0 0

Also, here is some IP information that might be useful. The gluster servers are:

iosrv-7-1  eth0 172.16.2.4  ib0 172.16.34.4
iosrv-7-2  eth0 172.16.2.5  ib0 172.16.34.5
iosrv-7-3  eth0 172.16.2.6  ib0 172.16.34.6
iosrv-7-4  eth0 172.16.2.7  ib0 172.16.34.7

The frontend of the cluster is:

ra5  eth0 172.16.0.1  ib0 172.16.32.2

Other clients will be in the range:

eth0 172.16.8.1 - 160 and 172.16.12.1 - 84
ib0  172.16.40.1 - 160 and 172.16.44.1 - 84
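For reference, that fstab entry is equivalent to mounting the volume by hand with the native client:

sh$ mount -t glusterfs iosrv-7-1:gluster-test /mnt/gluster-test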
PATCH: http://patches.gluster.com/patch/5384 in master (rpc-transport/rdma: fix the bug which was causing an ioq-entry being mem_put twice.)
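For readers curious about the failure mode: the patch title indicates the same ioq entry was handed to mem_put() twice, which trips the assertion seen in Thread 1's backtrace. Below is a minimal C sketch of that general double-release bug class and the usual guard; the names are hypothetical and free() stands in for mem_put(), so this illustrates the pattern only and is not the actual GlusterFS patch:

#include <stdlib.h>

struct ioq_entry {
        void *buf;
};

/* Release an entry exactly once. NULL-ing the caller's pointer makes a
 * second call from another cleanup path (e.g. an error handler racing a
 * disconnect path) a harmless no-op instead of a double release. */
static void entry_put(struct ioq_entry **entryp)
{
        if (entryp == NULL || *entryp == NULL)
                return;                 /* already released by the other path */
        free((*entryp)->buf);           /* free() stands in for mem_put() */
        free(*entryp);
        *entryp = NULL;                 /* guard: second entry_put() does nothing */
}

int main(void)
{
        struct ioq_entry *e = calloc(1, sizeof(*e));
        entry_put(&e);                  /* first cleanup path */
        entry_put(&e);                  /* second path -- safe with the guard */
        return 0;
}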
What is the proper procedure for me to utilize this fix? I do not see the patch in git, and in fact I can't even get the git version to start up properly, because the startup scripts are pointing to binaries and configuration files that don't exist.

Thanks,
Mike
(In reply to comment #10)
> What is the proper procedure for me to utilize this fix? I do not see the
> patch in git

The patch went in as commit d1f938c16e309431382ab.

> and in fact I can't even get the git version to start up properly, because
> the startup scripts are pointing to binaries and configuration files that
> don't exist.

Can you please give the names of the scripts you are facing problems with?

regards,
Raghavendra
> > and in fact I can't even get the git version to start up properly, because
> > the startup scripts are pointing to binaries and configuration files that
> > don't exist.
>
> Can you please give the names of the scripts you are facing problems with?

Mike, can you also try with our 3.1.0 release and see if it's fixed? If it is, we will mark this bug as RESOLVED.
(In reply to comment #11)
> The patch went in as commit d1f938c16e309431382ab.

The documentation at http://www.gluster.org/download/gluster-source-code/, which I had to dig to find from the new 3.1 documentation, first pointed me to:

sh$ git clone git://git.sv.gnu.org/gluster.git glusterfs

That repo does not contain the commit you mention; it doesn't even contain the rdma code. The last commit there is from Aug. 12.

> Can you please give the names of the scripts you are facing problems with?

Again, using the code listed above, the only init.d script that I could find, and the one that installed, is glusterfsd. That init script doesn't run by default.

Anyway, I later logged into the IRC channel and was pointed to
http://www.gluster.com/community/documentation/index.php/Development_Work_Flow
which contains a link to:

sh$ git clone git://git.gluster.com/glusterfs.git ./glusterfs.git

That code contains the patch and the correct init scripts, so I was able to get it compiled and started. Unfortunately I can't test further, because I am now blocked by bug 763587, which I will work on today.
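For anyone retracing those steps, the working sequence was roughly the following; the build commands are the usual autotools flow for the GlusterFS source tree and are assumed rather than quoted from the report:

sh$ git clone git://git.gluster.com/glusterfs.git ./glusterfs.git
sh$ cd glusterfs.git
sh$ git log --oneline | grep d1f938c    # confirm the fix commit is present
sh$ ./autogen.sh && ./configure && make && make install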
(In reply to comment #13)
> That code contains the patch and the correct init scripts, so I was able to
> get it compiled and started. Unfortunately I can't test further, because I
> am now blocked by bug 763587, which I will work on today.

Hi Michael,

Did you get a chance to test whether this bug is fixed? If so, we can close the bug.

regards,
Raghavendra
Raghavendra,

Please see comment #13, which says that Michael did not find the fix. Was this fix included in the released version of GlusterFS 3.1.0?
Jacob,

Yes, this fix is included in release-3.1.0.

regards,
Raghavendra
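For reference, one way to confirm from a clone (with release tags fetched) that a commit made a given release; this assumes the release tag is named v3.1.0:

sh$ git tag --contains d1f938c16e309431382ab    # should list v3.1.0 if the fix shipped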
I asked Michael if it works with the 3.1.0 release. According to him:

"Nope. I'm at a complete standstill because I cannot create a volume in my current state."

Is there additional information that Michael can gather regarding this issue?
Jacob,

Even with release-3.1.0, is the customer running into the same crash mentioned at the beginning of this bug report? If so, is it possible to run gdb on the core and get the output of the command 'thr apply all bt full'? Also, is it possible to attach the file rpc/rpc-transport/rdma/src/rdma.c from the glusterfs sources?

If the customer is not running into the crash, is it possible to explain in detail what the issue is?

regards,
Raghavendra
(In reply to comment #18)
> Even with release-3.1.0, is the customer running into the same crash
> mentioned at the beginning of this bug report?
>
> If the customer is not running into the crash, is it possible to explain in
> detail what the issue is?

This issue is resolved in 3.1; the mount works with rdma in native glusterfs. The issue was that a few drivers were not loaded automatically (namely ib_ipoib and ib_uverbs, plus fuse), which left the mount point hung. We can close this ticket; it is not valid anymore.
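For readers hitting the same symptom, a quick way to check for and load those modules; the module names come from the comment above, and whether your distribution autoloads them will vary:

sh$ lsmod | egrep 'ib_ipoib|ib_uverbs|fuse'
sh$ modprobe ib_ipoib && modprobe ib_uverbs && modprobe fuse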