Bug 1396414 - [md-cache]: All bricks crashed while performing symlink and rename from client at the same time
Summary: [md-cache]: All bricks crashed while performing symlink and rename from client at the same time
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: marker
Version: 3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1387204 1394131
Blocks: 1396418 1396419
 
Reported: 2016-11-18 09:30 UTC by Poornima G
Modified: 2017-03-08 10:19 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.9.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1394131
Clones: 1396418
Environment:
Last Closed: 2017-03-08 10:19:16 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Poornima G 2016-11-18 09:30:37 UTC
+++ This bug was initially created as a clone of Bug #1394131 +++

+++ This bug was initially created as a clone of Bug #1387204 +++

Description of problem:
======================

All 6 bricks of a volume (3x2) crashed with the following upcall backtrace:

[root@dhcp37-58 ~]# file core.5895.1476956627.dump.1
core.5895.1476956627.dump.1: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.37.58 --volfile-id master.10.70.37.58.rhs-brick1-', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'
[root@dhcp37-58 ~]# 

(gdb) bt
#0  0x00007f9530adc210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f951de3b129 in upcall_inode_ctx_get () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#2  0x00007f951de3055f in upcall_local_init () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951de3431a in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#4  0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#5  0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#6  0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#7  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f953041c73d in clone () from /lib64/libc.so.6
(gdb)



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-server-3.8.4-2.26.git0a405a4.el7rhgs.x86_64 
(private build with downstream 3.2.0 and md-cache patches)


Steps Carried:
==============

This happened in a geo-rep setup, but all the master bricks crashed, so it looks like a more generic issue. However, I will list all the steps:

1. Create Master and Slave volume (3x2) each from 3 node clusters
2. Enable md-cache on master and slave
3. Create geo-rep between master and slave
4. Mount the Master volume (Fuse) thrice on same client at different location
5. Create Data on Master volume from one client and keep stat from other client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=create /mnt/master/
find . | xargs stat
6. Let the data be synced to slave. Confirm via arequal checksum
7. Chmod on master volume from one client and keep stat from other client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chmod /mnt/master/
find . | xargs stat
8. Let the data be synced to slave. Confirm via arequal checksum
9. Chown on master volume from one client and keep stat from other client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chown /mnt/master/
find . | xargs stat
10. Let the data be synced to slave. Confirm via arequal checksum

11. Chgrp on master volume from one client and keep stat from other client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=chgrp /mnt/master/
find . | xargs stat
12. Let the data be synced to slave. Confirm via arequal checksum

13. Symlink on master volume from one client and rename from another client path:
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=symlink /mnt/master/
crefi -T 10 -n 10 --multi -d 10 -b 10 --random --max=5K --min=1K --fop=rename /mnt/new_1

Actual results:
===============

All brick processes crashed

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-10-20 06:53:43 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.2.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Rahul Hinduja on 2016-10-20 07:04:46 EDT ---

sosreports @: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1387204/

[root@dhcp37-58 ~]# gluster volume info master
 
Volume Name: master
Type: Distributed-Replicate
Volume ID: a60df9d2-8ebc-40db-b3fc-44775ee00173
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/brick1/b1
Brick2: 10.70.37.177:/rhs/brick1/b2
Brick3: 10.70.37.62:/rhs/brick1/b3
Brick4: 10.70.37.58:/rhs/brick2/b4
Brick5: 10.70.37.177:/rhs/brick2/b5
Brick6: 10.70.37.62:/rhs/brick2/b6
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
cluster.enable-shared-storage: enable
[root@dhcp37-58 ~]# gluster v status master
Status of volume: master
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.58:/rhs/brick1/b1            N/A       N/A        N       N/A  
Brick 10.70.37.177:/rhs/brick1/b2           N/A       N/A        N       N/A  
Brick 10.70.37.62:/rhs/brick1/b3            N/A       N/A        N       N/A  
Brick 10.70.37.58:/rhs/brick2/b4            N/A       N/A        N       N/A  
Brick 10.70.37.177:/rhs/brick2/b5           N/A       N/A        N       N/A  
Brick 10.70.37.62:/rhs/brick2/b6            N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       6872 
Self-heal Daemon on 10.70.37.177            N/A       N/A        Y       31470
Self-heal Daemon on 10.70.37.62             N/A       N/A        Y       23602
 
Task Status of Volume master
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp37-58 ~]#

--- Additional comment from Michael Adam on 2016-10-20 08:14:44 EDT ---

The bt does not look like a crash to me. Is that the wrong thread?

--- Additional comment from Rahul Hinduja on 2016-10-20 09:04:52 EDT ---

(In reply to Michael Adam from comment #3)
> The bt does not look like a crash to me. Is that the wrong thread?

Core is available in the sosreports. Following is the bt from all threads. 

(gdb) thread apply all bt

Thread 35 (Thread 0x7f9524a7d700 (LWP 5900)):
#0  0x00007f953041cd13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f9531cd01c0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 34 (Thread 0x7f9504310700 (LWP 7688)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 33 (Thread 0x7f950410e700 (LWP 7720)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 32 (Thread 0x7f951cd31700 (LWP 5901)):
#0  0x00007f953041cd13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f9531cd01c0 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 31 (Thread 0x7f94c3afa700 (LWP 7972)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 30 (Thread 0x7f9504411700 (LWP 7687)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 29 (Thread 0x7f9504714700 (LWP 7682)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 28 (Thread 0x7f9505016700 (LWP 6530)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 27 (Thread 0x7f950420f700 (LWP 7689)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 26 (Thread 0x7f9515794700 (LWP 5908)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f9515f95700 (LWP 5907)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f9504613700 (LWP 7683)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 23 (Thread 0x7f9504512700 (LWP 7684)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 22 (Thread 0x7f94c3efe700 (LWP 7725)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 21 (Thread 0x7f94c3bfb700 (LWP 7962)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 20 (Thread 0x7f94c3cfc700 (LWP 7727)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 19 (Thread 0x7f9504f15700 (LWP 6883)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee3282 in changelog_fsync_thread () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 18 (Thread 0x7f951759d700 (LWP 5903)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951dc206f3 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7f9505817700 (LWP 5989)):
#0  0x00007f95303e366d in nanosleep () from /lib64/libc.so.6
#1  0x00007f95303e3504 in sleep () from /lib64/libc.so.6
#2  0x00007f951de3b45c in upcall_reaper_thread () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 16 (Thread 0x7f9506ffd700 (LWP 5912)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f952406cb1b in posix_fsyncer_pick () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#2  0x00007f952406cda5 in posix_fsyncer () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 15 (Thread 0x7f95077fe700 (LWP 5911)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f95240694f5 in posix_janitor_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 14 (Thread 0x7f9516796700 (LWP 5906)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eee6e13 in changelog_ev_connector () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 13 (Thread 0x7f9516c9b700 (LWP 5905)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eaa4c4b in br_stub_worker () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f951749c700 (LWP 5904)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eaa61b3 in br_stub_signth () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f9527f42700 (LWP 5897)):
#0  0x00007f9530adf101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007f953216bbfb in glusterfs_sigwaiter ()
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f9528743700 (LWP 5896)):
#0  0x00007f9530adebdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f9531c83bb6 in gf_timer_proc () from /lib64/libglusterfs.so.0
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f9514f93700 (LWP 5909)):
#0  0x00007f9530413ba3 in select () from /lib64/libc.so.6
#1  0x00007f951eee705a in changelog_ev_dispatch () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f951c12a700 (LWP 5902)):
#0  0x00007f9530adb6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951d5e2e5b in index_worker () from /usr/lib64/glusterfs/3.8.4/xlator/features/index.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f9506018700 (LWP 6873)):
#0  0x00007f95303e366d in nanosleep () from /lib64/libc.so.6
#1  0x00007f95303e3504 in sleep () from /lib64/libc.so.6
#2  0x00007f952406c7ac in posix_health_check_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f9527741700 (LWP 5898)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9531caed98 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f9531cafbe0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f9507fff700 (LWP 6882)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f951eee2e6e in changelog_rollover () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#2  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f9526f40700 (LWP 5899)):
#0  0x00007f9530adba82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f9531caed98 in syncenv_task () from /lib64/libglusterfs.so.0
#2  0x00007f9531cafbe0 in syncenv_processor () from /lib64/libglusterfs.so.0
#3  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f953214d780 (LWP 5895)):
#0  0x00007f9530ad8ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f9531cd0768 in event_dispatch_epoll () from /lib64/libglusterfs.so.0
#2  0x00007f9532168ae2 in main ()

Thread 2 (Thread 0x7f94c3dfd700 (LWP 7726)):
#0  0x00007f9531c779f7 in _gf_msg () from /lib64/libglusterfs.so.0
#1  0x00007f9531cf22d0 in default_setxattr_cbk () from /lib64/libglusterfs.so.0
#2  0x00007f951de2c205 in up_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951e89c22d in posix_acl_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so
#4  0x00007f951eed3fb9 in changelog_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#5  0x00007f951f5c3d32 in ctr_setxattr_cbk () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#6  0x00007f952405589e in posix_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/storage/posix.so
#7  0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#8  0x00007f951f5bbbcd in ctr_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/changetimerecorder.so
#9  0x00007f951eed8e35 in changelog_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/changelog.so
#10 0x00007f951eaa752a in br_stub_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so
#11 0x00007f951e89bf1d in posix_acl_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so
#12 0x00007f951e680359 in pl_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/locks.so
#13 0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#14 0x00007f951e25b2d6 in ro_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so
#15 0x00007f9531ceeb41 in default_setxattr () from /lib64/libglusterfs.so.0
#16 0x00007f951de344e3 in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#17 0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#18 0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#19 0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#20 0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#21 0x00007f953041c73d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f94c3fff700 (LWP 7721)):
#0  0x00007f9530adc210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f951de3b129 in upcall_inode_ctx_get () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#2  0x00007f951de3055f in upcall_local_init () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#3  0x00007f951de3431a in up_setxattr () from /usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so
#4  0x00007f9531d072a4 in default_setxattr_resume () from /lib64/libglusterfs.so.0
#5  0x00007f9531c9947d in call_resume () from /lib64/libglusterfs.so.0
#6  0x00007f951dc20743 in iot_worker () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so
#7  0x00007f9530ad7dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f953041c73d in clone () from /lib64/libc.so.6
(gdb)

--- Additional comment from surabhi on 2016-10-20 11:08:35 EDT ---

I executed similar tests on a 2x2 volume mounted on a Linux CIFS client (non geo-rep setup) and have not seen this issue in any of the following combinations:

1. With stat-prefetch on and client-io-threads on
2. With stat-prefetch off and client-io-threads off
3. With stat-prefetch on and client-io-threads off
4. With stat-prefetch off and client-io-threads on

Not sure if this is specific to geo-rep.

--- Additional comment from Rahul Hinduja on 2016-10-21 05:55:57 EDT ---

Tried on a non geo-rep setup with a fuse mount; couldn't reproduce the issue. Also tried enabling changelog on the volume; that did not result in any crash either.

Not sure if it is a race or an issue specific to the geo-rep setup. However, it did not initially look like a geo-rep issue, since it is the master bricks that crashed. Putting needinfo on Aravinda to provide his thoughts on it.

--- Additional comment from Kotresh HR on 2016-10-21 06:26:30 EDT ---

(In reply to Michael Adam from comment #3)
> The bt does not look like a crash to me. Is that the wrong thread?

The bt is a crash with a segmentation fault. It's a NULL dereference: the inode passed to upcall_inode_ctx_get is itself NULL. Maybe the upcall team should have a look at this. It's not related to geo-replication.
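
For illustration, a minimal C sketch of this failure mode. The inode_t layout and the function sketch_inode_ctx_get() are simplified assumptions, not the actual GlusterFS source; the point is that pthread_spin_lock() on a member of a NULL inode dereferences a near-NULL address, matching frame #0 of the bt.

#include <pthread.h>
#include <stddef.h>

/* Simplified stand-in for the GlusterFS inode (illustrative only). */
typedef struct inode {
        pthread_spinlock_t lock;   /* guards the per-inode context */
        void              *ctx;    /* xlator-private context area */
} inode_t;

/* Sketch of the crash path: with inode == NULL, &inode->lock is a
 * near-NULL address and pthread_spin_lock() segfaults, as in frame #0
 * of the backtrace above. */
static void *
sketch_inode_ctx_get(inode_t *inode)
{
        void *ctx = NULL;

        pthread_spin_lock(&inode->lock);   /* crash site when inode == NULL */
        ctx = inode->ctx;
        pthread_spin_unlock(&inode->lock);

        return ctx;
}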

--- Additional comment from Rahul Hinduja on 2016-10-26 09:35:53 EDT ---

Removing needinfo based on comment 7

--- Additional comment from Atin Mukherjee on 2016-10-27 05:20:56 EDT ---

Niels, could you provide your input on the observed crash, which is related to upcall?

--- Additional comment from Poornima G on 2016-11-07 00:43:13 EST ---

The crash is because the loc->inode is NULL:

$8 = {path = 0x7f94b0b3e9f0 "/thread4/level00/5808443b%%4FZV09CJ04", 
  name = 0x7f94b0b3ea01 "5808443b%%4FZV09CJ04", inode = 0x0, 
  parent = 0x7f950626c094, 
  gfid = "\220&ĵiuF\363\217\003\373ռX\273", <incomplete sequence \357>, pargfid = "+ؚ\242\224DK\n\211'\222 VÐ\004"}

The crash can be prevented by just checking for the inode NULL case, but the reason why loc->inode can be NULL is still unknown; we need to root-cause when loc->inode can be NULL.
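
For illustration, a minimal sketch of such a NULL check. The loc_t layout and the helper name sketch_up_setxattr_precheck() are simplified assumptions, not the committed patch; it only shows the shape of the guard: fail the fop cleanly instead of dereferencing a NULL inode.

#include <errno.h>
#include <stddef.h>

typedef struct inode inode_t;          /* opaque for this sketch */

/* Simplified stand-in for the GlusterFS loc_t (illustrative only). */
typedef struct {
        const char *path;
        const char *name;
        inode_t    *inode;             /* NULL here is what triggers the crash */
        inode_t    *parent;
} loc_t;

/* Guard sketch: refuse to proceed when loc->inode is NULL. This stops
 * the segfault but hides the root cause, which is why the question of
 * when loc->inode can legitimately be NULL still needs an answer. */
static int
sketch_up_setxattr_precheck(const loc_t *loc)
{
        if (loc == NULL || loc->inode == NULL)
                return -EINVAL;
        return 0;
}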

Is the issue consistently reproducible?

--- Additional comment from Poornima G on 2016-11-09 05:54:44 EST ---

RCA:

The simple reproducer for this issue:
Create a plain distribute volume, enable cache-invalidation and marker feature on the server side:
gluster vol set <VOLNAME> features.cache-invalidation on
gluster vol set <VOLNAME> indexing on

Then, from the fuse mount point, create a file and rename it. After this, all the bricks will crash.

The reason for the crash: on receiving a rename fop, marker_rename() stores the oldloc and newloc in its 'local' struct. Once the rename is done, the xtime marker (last-updated time) is set on the file by sending a setxattr fop. When upcall receives that setxattr fop, the loc->inode is NULL and it crashes. The loc->inode can be NULL in only one valid case, i.e. the rename case, where the inode of the new loc will be NULL. Hence, marker should have fetched the inode of the new loc and filled it in before issuing the setxattr (a sketch of this fix direction follows below).

Hence moving the component to marker.
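
For illustration, a minimal sketch of that fix direction (not the literal diff from review.gluster.org/15826). The pared-down loc_t and the helper sketch_marker_fill_newloc() are hypothetical; inode_ref() is the libglusterfs reference-count helper this sketch assumes.

#include <stddef.h>

typedef struct inode inode_t;          /* opaque for this sketch */

typedef struct {
        inode_t *inode;                /* only the field relevant here */
} loc_t;

/* Assumed libglusterfs helper: takes a reference on the inode. */
extern inode_t *inode_ref(inode_t *inode);

/* Sketch: after the rename succeeds, the new loc names the same inode
 * the old loc did, so fill newloc->inode from oldloc before marker
 * issues the xtime setxattr. */
static void
sketch_marker_fill_newloc(loc_t *newloc, const loc_t *oldloc)
{
        if (newloc->inode == NULL && oldloc->inode != NULL)
                newloc->inode = inode_ref(oldloc->inode);
}

Taking a reference, rather than a bare pointer copy, keeps the inode alive for the duration of the setxattr fop.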

--- Additional comment from Worker Ant on 2016-11-11 04:06:58 EST ---

REVIEW: http://review.gluster.org/15826 (marker: Fix inode value in loc, in setxattr fop) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Poornima G on 2016-11-14 05:13:52 EST ---


(This comment repeated the original description verbatim: the same upcall backtrace and the same steps carried as in the description above.)

--- Additional comment from Worker Ant on 2016-11-17 05:53:20 EST ---

COMMIT: http://review.gluster.org/15826 committed in master by Rajesh Joseph (rjoseph) 
------
commit 46e5466850311ee69e6ae9a11c2bba2aabadd5de
Author: Poornima G <pgurusid>
Date:   Fri Nov 11 12:08:57 2016 +0530

    marker: Fix inode value in loc, in setxattr fop
    
    On recieving a rename fop, marker_rename() stores the,
    oldloc and newloc in its 'local' struct, once the rename
    is done, the xtime marker(last updated time) is set on
    the file, but sending a setxattr fop. When upcall
    receives the setxattr fop, the loc->inode is NULL and
    it crashes. The loc->inode can be NULL only in one valid
    case, i.e. in rename case where the inode of new loc
    can be NULL. Hence, marker should have filled the inode
    of the new_loc before issuing a setxattr.
    
    Change-Id: Id638f678c3daaf4a5c29b970b58929d377ae8977
    BUG: 1394131
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/15826
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kotresh HR <khiremat>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Rajesh Joseph <rjoseph>

Comment 1 Worker Ant 2016-11-18 09:31:37 UTC
REVIEW: http://review.gluster.org/15877 (marker: Fix inode value in loc, in setxattr fop) posted (#2) for review on release-3.9 by Poornima G (pgurusid)

Comment 2 Worker Ant 2016-11-18 09:35:30 UTC
REVIEW: http://review.gluster.org/15877 (marker: Fix inode value in loc, in setxattr fop) posted (#3) for review on release-3.9 by Poornima G (pgurusid)

Comment 3 Worker Ant 2016-11-18 10:18:17 UTC
REVIEW: http://review.gluster.org/15877 (marker: Fix inode value in loc, in setxattr fop) posted (#4) for review on release-3.9 by Poornima G (pgurusid)

Comment 4 Worker Ant 2016-11-22 05:43:30 UTC
COMMIT: http://review.gluster.org/15877 committed in release-3.9 by Rajesh Joseph (rjoseph) 
------
commit f5192cffe716d4db6de39a3e132d16c918d3846e
Author: Poornima G <pgurusid>
Date:   Fri Nov 11 12:08:57 2016 +0530

    marker: Fix inode value in loc, in setxattr fop
    
    Backport of http://review.gluster.org/15826
    
    On recieving a rename fop, marker_rename() stores the,
    oldloc and newloc in its 'local' struct, once the rename
    is done, the xtime marker(last updated time) is set on
    the file, but sending a setxattr fop. When upcall
    receives the setxattr fop, the loc->inode is NULL and
    it crashes. The loc->inode can be NULL only in one valid
    case, i.e. in rename case where the inode of new loc
    can be NULL. Hence, marker should have filled the inode
    of the new_loc before issuing a setxattr.
    
    > Reviewed-on: http://review.gluster.org/15826
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Kotresh HR <khiremat>
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Rajesh Joseph <rjoseph>
    (cherry picked from commit 46e5466850311ee69e6ae9a11c2bba2aabadd5de)
    
    Change-Id: Id638f678c3daaf4a5c29b970b58929d377ae8977
    BUG: 1396414
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/15877
    Reviewed-by: Rajesh Joseph <rjoseph>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 5 Kaushal 2017-03-08 10:19:16 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/

