Bug 1387204 - [md-cache]: All bricks crashed while performing symlink and rename from client at the same time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: marker
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Poornima G
QA Contact: Anil Shah
URL:
Whiteboard:
Depends On:
Blocks: 1351528 1394131 1396414 1396418 1396419
Reported: 2016-10-20 10:53 UTC by Rahul Hinduja
Modified: 2017-03-23 06:13 UTC (History)

Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1394131
Environment:
Last Closed: 2017-03-23 06:13:07 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Rahul Hinduja 2016-10-20 10:53:38 UTC
Description of problem:
======================

All the 6 bricks of a volume (3x2) crashed with the upcall bt: 

[root@dhcp37-58 ~]# file core.5895.1476956627.dump.1
core.5895.1476956627.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.37.58 --volfile-id master.10.70.37.58.rhs-brick1-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'
[root@dhcp37-58 ~]# 

(gdb) bt
#0  0x00007f9530adc210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f951de3b129 in upcall_inode_ctx_get () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#2  0x00007f951de3055f in upcall_local_init () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951de3431a in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#4  0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#5  0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#6  0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#7  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f953041c73d in clone () from /lib64/libc.so.6
(gdb)



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-server-3.8.4-2.26.git0a405a4.el7rhgs.x86_64 
(private build with downstream 3.2.0 and md-cache patches)


Steps Carried:
==============

This happened on a geo-rep setup, but all the master bricks crashed, so it looks like a more generic issue. For completeness, all the steps are listed below.

1. Create Master and Slave volume (3x2) each from 3 node clusters
2. Enable md-cache on master and slave
3. Create geo-rep between master and slave
4. Mount the Master volume (FUSE) three times on the same client at different locations
5. Create data on the Master volume from one client path and keep running stat from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=create /mnt/master/
find . | xargs stat
6. Let the data be synced to slave. Confirm via arequal checksum
7. Chmod on the master volume from one client path and keep running stat from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chmod /mnt/master/
find . | xargs stat
8. Let the data be synced to slave. Confirm via arequal checksum
9. Chown on the master volume from one client path and keep running stat from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chown /mnt/master/
find . | xargs stat
10. Let the data be synced to slave. Confirm via arequal checksum

11. Chgrp on the master volume from one client path and keep running stat from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chgrp /mnt/master/
find . | xargs stat
12. Let the data be synced to slave. Confirm via arequal checksum

13. Symlink on the master volume from one client path and rename from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=symlink /mnt/master/
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=rename /mnt/new_1

Actual results:
===============

All brick processes crashed.

Comment 3 Michael Adam 2016-10-20 12:14:44 UTC
The bt does not look like a crash to me. Is that the wrong thread?

Comment 4 Rahul Hinduja 2016-10-20 13:04:52 UTC
(In reply to Michael Adam from comment #3)
> The bt does not look like a crash to me. Is that the wrong thread?

Core is available in the sosreports. Following is the bt from all threads. 

(gdb) thread apply all bt

Thread 35 (Thread 0x7f9524a7d700 (LWP 5900)):
#0  0x00007f953041cd13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f9531cd01c0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 34 (Thread 0x7f9504310700 (LWP 7688)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 33 (Thread 0x7f950410e700 (LWP 7720)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 32 (Thread 0x7f951cd31700 (LWP 5901)):
#0  0x00007f953041cd13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f9531cd01c0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 31 (Thread 0x7f94c3afa700 (LWP 7972)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 30 (Thread 0x7f9504411700 (LWP 7687)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 29 (Thread 0x7f9504714700 (LWP 7682)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 28 (Thread 0x7f9505016700 (LWP 6530)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x7f950420f700 (LWP 7689)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7f9515794700 (LWP 5908)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f9515f95700 (LWP 5907)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f9504613700 (LWP 7683)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7f9504512700 (LWP 7684)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7f94c3efe700 (LWP 7725)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7f94c3bfb700 (LWP 7962)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7f94c3cfc700 (LWP 7727)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7f9504f15700 (LWP 6883)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee3282 in changelog_fsync_thread () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7f951759d700 (LWP 5903)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f9505817700 (LWP 5989)):
#0  0x00007f95303e366d in nanosleep () from /lib64/libc.so.6
#1  0x00007f95303e3504 in sleep () from /lib64/libc.so.6
#2  0x00007f951de3b45c in upcall_reaper_thread () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f9506ffd700 (LWP 5912)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f952406cb1b in posix_fsyncer_pick () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#2  0x00007f952406cda5 in posix_fsyncer () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f95077fe700 (LWP 5911)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f95240694f5 in posix_janitor_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f9516796700 (LWP 5906)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eee6e13 in changelog_ev_connector () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f9516c9b700 (LWP 5905)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eaa4c4b in br_stub_worker () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f951749c700 (LWP 5904)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eaa61b3 in br_stub_signth () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f9527f42700 (LWP 5897)):
#0  0x00007f9530adf101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007f953216bbfb in glusterfs_sigwaiter ()
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f9528743700 (LWP 5896)):
#0  0x00007f9530adebdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f9531c83bb6 in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f9514f93700 (LWP 5909)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f951c12a700 (LWP 5902)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951d5e2e5b in index_worker () from /usr/lib64/glusterfs/3.8.4/xlator/features/index.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f9506018700 (LWP 6873)):
#0  0x00007f95303e366d in nanosleep () from /lib64/libc.so.6
#1  0x00007f95303e3504 in sleep () from /lib64/libc.so.6
#2  0x00007f952406c7ac in posix_health_check_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f9527741700 (LWP 5898)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9531caed98 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f9531cafbe0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f9507fff700 (LWP 6882)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eee2e6e in changelog_rollover () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f9526f40700 (LWP 5899)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9531caed98 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f9531cafbe0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f953214d780 (LWP 5895)):
#0  0x00007f9530ad8ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f9531cd0768 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00007f9532168ae2 in main ()

Thread 2 (Thread 0x7f94c3dfd700 (LWP 7726)):
#0  0x00007f9531c779f7 in _gf_msg () from /lib64/libglusterfs.so.0
#1  0x00007f9531cf22d0 in default_setxattr_cbk () from /lib64/libglusterfs.so.0
#2  0x00007f951de2c205 in up_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951e89c22d in posix_acl_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so
#4  0x00007f951eed3fb9 in changelog_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#5  0x00007f951f5c3d32 in ctr_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#6  0x00007f952405589e in posix_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#7  0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#8  0x00007f951f5bbbcd in ctr_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#9  0x00007f951eed8e35 in changelog_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#10 0x00007f951eaa752a in br_stub_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#11 0x00007f951e89bf1d in posix_acl_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so
#12 0x00007f951e680359 in pl_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#13 0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#14 0x00007f951e25b2d6 in ro_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so
#15 0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#16 0x00007f951de344e3 in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#17 0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#18 0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#19 0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#20 0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f94c3fff700 (LWP 7721)):
#0  0x00007f9530adc210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f951de3b129 in upcall_inode_ctx_get () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#2  0x00007f951de3055f in upcall_local_init () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951de3431a in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#4  0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#5  0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#6  0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#7  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f953041c73d in clone () from /lib64/libc.so.6
(gdb)

Comment 5 surabhi 2016-10-20 15:08:35 UTC
I executed similar tests on a 2x2 volume mounted on a Linux CIFS client (non-geo-rep setup) and did not see this issue in any of the following combinations:

1. With stat-prefetch on and client-io-threads on
2. With stat-prefetch off and client-io-threads off
3. With stat-prefetch on and client-io-threads off
4. With stat-prefetch off and client-io-threads on

Not sure whether this is specific to geo-rep.

Comment 6 Rahul Hinduja 2016-10-21 09:55:57 UTC
Tried on a non-geo-rep setup with a FUSE mount and couldn't reproduce the issue. Also tried the case with changelog enabled on the volume; that did not result in a crash either.

Not sure whether it is a race or an issue specific to the geo-rep setup. However, it did not initially look like a geo-rep issue, as the master bricks crashed. Putting needinfo on Aravinda to provide his thoughts on it.

Comment 7 Kotresh HR 2016-10-21 10:26:30 UTC
(In reply to Michael Adam from comment #3)
> The bt does not look like a crash to me. Is that the wrong thread?

The bt is a crash with a segmentation fault. It's a NULL dereference: the inode passed to upcall_inode_ctx_get is itself NULL. Maybe the upcall team should have a look at this. It's not related to geo-replication.

Comment 9 Atin Mukherjee 2016-10-27 09:20:56 UTC
Niels - could you provide your inputs on the crash observed which is related to upcall?

Comment 10 Poornima G 2016-11-07 05:43:13 UTC
The crash occurs because loc->inode is NULL:

$8 = {path = 0x7f94b0b3e9f0 "/thread4/level00/5808443b%%4FZV09CJ04", 
  name = 0x7f94b0b3ea01 "5808443b%%4FZV09CJ04", inode = 0x0, 
  parent = 0x7f950626c094, 
  gfid = "\220&ĵiuF\363\217\003\373ռX\273", <incomplete sequence \357>, pargfid = "+ؚ\242\224DK\n\211'\222 VÐ\004"}

The crash can be prevented by simply checking for a NULL inode, but why loc->inode can be NULL is still unknown; we need to root-cause when loc->inode can be NULL.
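
A minimal sketch of such a NULL check, assuming simplified stand-in types. sketch_inode_t, sketch_loc_t and sketch_ctx_get() are hypothetical names for illustration and are not the upcall translator's actual code:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real inode and loc structures. */
typedef struct sketch_inode {
    int   lock; /* placeholder for the spinlock taken in the backtrace */
    void *ctx;  /* placeholder for the per-translator inode context */
} sketch_inode_t;

typedef struct sketch_loc {
    const char     *path;
    sketch_inode_t *inode; /* NULL in the crashing request */
} sketch_loc_t;

/*
 * Fetch the inode context, refusing to touch the inode when it is missing.
 * Returns 0 and stores the context in *ctx_out on success,
 * or -EINVAL when there is no inode to read from.
 */
int sketch_ctx_get(const sketch_loc_t *loc, void **ctx_out)
{
    if (loc == NULL || loc->inode == NULL || ctx_out == NULL)
        return -EINVAL;

    *ctx_out = loc->inode->ctx;
    return 0;
}
```

A guard like this only turns the segmentation fault into an error return; it does not explain why the inode is NULL in the first place.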

Is the issue consistently reproducible?

Comment 11 Poornima G 2016-11-09 10:54:44 UTC
RCA:

The simple reproducer for this issue:
Create a plain distribute volume, enable cache-invalidation and marker feature on the server side:
gluster vol set <VOLNAME> features.cache-invalidation on
gluster vol set <VOLNAME> indexing on

And from the fuse mount point, create a file and rename the file. After this all the bricks will crash.
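
The client-side part of this reproducer is just a create followed by a rename. A rough sketch in C, assuming the volume is already FUSE-mounted at the directory passed in; create_then_rename() and the ".renamed" suffix are made up for illustration:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Create <dir>/<name>, then rename it to <dir>/<name>.renamed.
 * Returns 0 on success and -1 on any failure.
 */
int create_then_rename(const char *dir, const char *name)
{
    char oldpath[4096];
    char newpath[4096];
    int n;

    n = snprintf(oldpath, sizeof oldpath, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof oldpath)
        return -1;

    n = snprintf(newpath, sizeof newpath, "%s/%s.renamed", dir, name);
    if (n < 0 || (size_t)n >= sizeof newpath)
        return -1;

    /* Step 1: create the file. */
    int fd = open(oldpath, O_CREAT | O_WRONLY | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* Step 2: rename it; on the affected builds this is the fop that
     * leads to the setxattr described in this comment. */
    return rename(oldpath, newpath) == 0 ? 0 : -1;
}
```

Calling it with the mount point, for example create_then_rename("/mnt/master", "file1"), performs the two client operations; the path here is only an example.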

The reason for the crash: on receiving a rename fop, marker_rename() stores oldloc and newloc in its 'local' struct. Once the rename is done, the xtime marker (last updated time) is set on the file by sending a setxattr fop. When upcall receives the setxattr fop, loc->inode is NULL and it crashes. loc->inode can be NULL in only one valid case: rename, where the inode of the new loc is NULL. Hence marker should have obtained the inode for newloc and filled it in before issuing the setxattr.
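
The idea can be sketched as follows, assuming that after a successful rename the new location refers to the same inode as the old one. The demo_* types and demo_fill_newloc_inode() are hypothetical and simplified; the actual upstream patch may differ:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real inode and loc structures. */
typedef struct demo_inode {
    unsigned long ino;
    int           refcount; /* stands in for the real inode reference count */
} demo_inode_t;

typedef struct demo_loc {
    const char   *path;
    demo_inode_t *inode;
} demo_loc_t;

/*
 * Before issuing a follow-up fop (such as setxattr) on newloc, make sure it
 * carries an inode. If newloc has none, borrow the inode from oldloc and take
 * a reference on it. A newloc that already has an inode is left untouched.
 */
void demo_fill_newloc_inode(const demo_loc_t *oldloc, demo_loc_t *newloc)
{
    if (oldloc == NULL || newloc == NULL)
        return;

    if (newloc->inode == NULL && oldloc->inode != NULL) {
        newloc->inode = oldloc->inode;
        newloc->inode->refcount++;
    }
}
```

With the inode filled in, translators further down the stack that read loc->inode during the setxattr no longer see a NULL pointer.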

Hence moving the component to marker.

Comment 12 Poornima G 2016-11-11 09:09:16 UTC
Fix posted upstream : http://review.gluster.org/#/c/15826/1

Comment 17 Anil Shah 2016-12-05 06:15:20 UTC
Created a distribute volume, enabled cache-invalidation and indexing, created 1000 files and renamed the files on the mount.
No brick crashes were seen.
Hence marking this as verified on build glusterfs-3.8.4-6.el7rhgs.x86_64

Comment 19 errata-xmlrpc 2017-03-23 06:13:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

