Bug 852426 - [glusterfs-3.1.2geosync4]: crash in locks
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 2.0
Hardware: All
OS: Linux
Priority: low
Severity: high
Assigned To: krishnan parthasarathi
QA Contact: Sachidananda Urs
Depends On: 764096
Blocks: 858480
Reported: 2012-08-28 09:10 EDT by Vidya Sakar
Modified: 2015-11-03 18:04 EST
CC: 8 users

Doc Type: Bug Fix
Clone Of: 764096
Clones: 858480
Last Closed: 2013-09-23 18:38:58 EDT
Regression: RTP
Documentation: DP


Attachments: None
Description Vidya Sakar 2012-08-28 09:10:40 EDT
+++ This bug was initially created as a clone of Bug #764096 +++

The glusterfs server process crashed in pl_getlk. The crash was observed at the end of the sanity-script run (gsyncd was also running, with marker enabled).

This is the backtrace from the generated core:

Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/opt/glusterfs/3.1.2gsyncqa4/sbin/glusterfsd --xlator-option mirror-server.list'.
Program terminated with signal 11, Segmentation fault.
[New process 19528]
[New process 20019]
[New process 20018]
[New process 20017]
[New process 20016]
[New process 20015]
[New process 20014]
[New process 20013]
[New process 17644]
[New process 17639]
[New process 17637]
[New process 17636]
#0  0x00002aaaab5be9c6 in first_overlap (pl_inode=0xcc9ff40, lock=0x2aaab8000f30) at common.c:719
719                     if (l->blocked)
(gdb) bt
#0  0x00002aaaab5be9c6 in first_overlap (pl_inode=0xcc9ff40, lock=0x2aaab8000f30) at common.c:719
#1  0x00002aaaab5bfa17 in pl_getlk (pl_inode=0xcc9ff40, lock=0x2aaab8000f30) at common.c:1050
#2  0x00002aaaab5c45ed in pl_lk (frame=0x2b3df6db0f10, this=0xcc97ff0, fd=0x2aaaac41f6b4, cmd=5, flock=0x2b3df71f3810) at posix.c:1143
#3  0x00002aaaab7db7f6 in iot_lk_wrapper (frame=0x2b3df6dc1818, this=0xcc98e30, fd=0x2aaaac41f6b4, cmd=5, flock=0x2b3df71f3810)
    at io-threads.c:1014
#4  0x00002b3df5ecd565 in call_resume_wind (stub=0x2b3df71f37c8) at call-stub.c:2363
#5  0x00002b3df5ed360a in call_resume (stub=0x2b3df71f37c8) at call-stub.c:3870
#6  0x00002aaaab7d5eae in iot_worker (data=0xcc9f590) at io-threads.c:118
#7  0x0000003ecec06617 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003ece4d3c2d in clone () from /lib64/libc.so.6
(gdb) p l
$1 = (posix_lock_t *) 0x0
(gdb) p *pl_inode
$2 = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, 
        __next = 0x0}}, __size = '\0' <repeats 39 times>, __align = 0}, dom_list = {next = 0xccb4f60, prev = 0xccb4f60}, ext_list = {
    next = 0xcc9ff78, prev = 0xcc9ff78}, rw_list = {next = 0xcc9ff88, prev = 0xcc9ff88}, reservelk_list = {next = 0xcc9ff98, 
    prev = 0xcc9ff98}, blocked_reservelks = {next = 0xcc9ffa8, prev = 0xcc9ffa8}, blocked_calls = {next = 0xcc9ffb8, prev = 0xcc9ffb8}, 
  mandatory = 0, refkeeper = 0x0}
(gdb) p *pl_inode->ext_list 
Structure has no component named operator*.
(gdb) p pl_inode->ext_list 
$3 = {next = 0xcc9ff78, prev = 0xcc9ff78}
(gdb)
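
The dump above already hints at the problem: pl_inode->ext_list has next == prev == 0xcc9ff78, which is apparently the address of the ext_list member itself (pl_inode is at 0xcc9ff40, and the neighbouring list heads point at addresses spaced 16 bytes apart in the same way), i.e. the list is empty, yet the crash happened while dereferencing an element supposedly taken from that list. For a kernel-style circular list head (assumed here purely for illustration; this is not the GlusterFS source), "empty" means exactly that the head points back at itself:

    /* Sketch of a circular doubly linked list head and its emptiness
     * check; layout assumed for illustration only. */
    struct list_head {
            struct list_head *next;
            struct list_head *prev;
    };

    /* next == head (and prev == head) means there are no elements --
     * precisely what the gdb output shows for ext_list. */
    static inline int
    list_is_empty (const struct list_head *head)
    {
            return head->next == head;
    }

A consistent snapshot of an empty list should never yield an element pointer inside a traversal, which is why the NULL 'l' points to the list changing underneath the iterating thread.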

--- Additional comment from amarts@redhat.com on 2011-03-02 23:20:13 EST ---

Could not reproduce; taking it out of the 3.1.3 target. Will keep it open for a while longer to see whether it shows up during QA of 3.1.3.

--- Additional comment from pkarampu@redhat.com on 2011-03-10 01:58:59 EST ---

Seems like a race. The core suggests that the list is empty, yet the process crashed inside the list_for_each. I see that all the list-related operations are done without taking any mutex locks.

Pranith.
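
To make the suspected race concrete, here is a standalone C sketch (hypothetical names; not the actual locks-translator code): one thread walks a shared list with no lock held, mirroring the pattern in first_overlap(), while another thread may be unlinking entries, so the walker can land on a stale or NULL pointer. Holding the mutex that guards the list for the whole traversal (analogous to pl_inode->mutex) serializes the two:

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical lock record; stands in for an entry on ext_list. */
    struct lock_node {
            struct lock_node *next;
            int               blocked;
    };

    static struct lock_node *lock_list;                 /* shared list head */
    static pthread_mutex_t   lock_list_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Racy traversal: no lock held, so a concurrent unlink can leave 'l'
     * pointing at freed memory or NULL mid-walk -- the crash pattern above. */
    struct lock_node *
    first_blocked_racy (void)
    {
            struct lock_node *l;

            for (l = lock_list; l != NULL; l = l->next)
                    if (l->blocked)
                            return l;
            return NULL;
    }

    /* Serialized traversal: the entire walk happens under the list mutex. */
    struct lock_node *
    first_blocked_locked (void)
    {
            struct lock_node *l, *found = NULL;

            pthread_mutex_lock (&lock_list_mutex);
            for (l = lock_list; l != NULL; l = l->next)
                    if (l->blocked) {
                            found = l;
                            break;
                    }
            pthread_mutex_unlock (&lock_list_mutex);
            return found;
    }

The fix only holds if every writer that adds or removes entries takes the same mutex, which is what doing the list operations under mutex locks amounts to.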
Comment 2 Amar Tumballi 2012-10-05 13:22:53 EDT
Moving it to ON_QA... never saw it happen again; not working on this right now.
Comment 3 Sachidananda Urs 2013-01-10 00:55:32 EST
Unable to reproduce this issue:

Tested with a geo-sync setup using different volume types and ran load tests on them for a couple of days. Will reopen if I hit the issue again.
Comment 6 Scott Haines 2013-09-23 18:38:58 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
