I created an EC volume and ran I/O on it as below. After about 1 day of I/O, I attached a tier. The volume was mounted on 3 clients, and each client ran different operations as follows:

On the clients:

Client1 (rhsauto026):
In screen:
for i in {1..10000};do dd if=/dev/urandom of=a1a.$i bs=1024 count=40000;done
On a terminal (started before attach-tier and still continuing):
ls -laRt

Client2 (rhs-client1):
Ran an open-file-handle/lock script (on file amb.mp). Script as below:

#!/usr/bin/python
import fcntl
import time

def main():
    file_path = "/mnt/file.dd"
    # "r+" opens the existing file for read/write
    # (the original run used the non-standard mode string "rw+")
    FH = open(file_path, "r+")
    print FH
    try:
        # take a shared lock on 10 bytes starting at offset 10
        fcntl.lockf(FH, fcntl.LOCK_SH, 10, 10, 0)
    except IOError:
        print "lock could not be held."
        return False
    print "lock held"
    time.sleep(30)
    return True

if __name__ == "__main__":
    main()

Client3 (rhsauto070):
In screen (started before attach-tier; the tier was attached after about 15-20 dirs had been created):
for i in {1..100};do mkdir dir.$i;cp linux-4.3.3.tar.xz dir.$i/;tar -xvf dir.$i/linux-4.3.3.tar.xz -C dir.$i/;done

Then I went ahead, attached the tier, and changed some tier options. Vol info as below:

[root@dhcp35-190 ~]# gluster v info testvol

Volume Name: testvol
Type: Tier
Volume ID: 6d8f38ae-1cc8-4c2f-b0e7-0d854e79f7c2
Status: Started
Number of Bricks: 18
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 3 x 2 = 6
Brick1: 10.70.35.68:/rhs/brick9/testvol_hot
Brick2: 10.70.35.190:/rhs/brick9/testvol_hot
Brick3: 10.70.35.157:/rhs/brick9/testvol_hot
Brick4: 10.70.35.68:/rhs/brick8/testvol_hot
Brick5: 10.70.35.190:/rhs/brick8/testvol_hot
Brick6: 10.70.35.157:/rhs/brick8/testvol_hot
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick7: 10.70.35.190:/rhs/brick1/testvol
Brick8: 10.70.35.21:/rhs/brick1/testvol
Brick9: 10.70.35.68:/rhs/brick1/testvol
Brick10: 10.70.35.190:/rhs/brick2/testvol
Brick11: 10.70.35.21:/rhs/brick2/testvol
Brick12: 10.70.35.68:/rhs/brick2/testvol
Brick13: 10.70.35.190:/rhs/brick3/testvol
Brick14: 10.70.35.21:/rhs/brick3/testvol
Brick15: 10.70.35.68:/rhs/brick3/testvol
Brick16: 10.70.35.190:/rhs/brick4/testvol
Brick17: 10.70.35.21:/rhs/brick4/testvol
Brick18: 10.70.35.68:/rhs/brick4/testvol
Options Reconfigured:
cluster.tier-demote-frequency: 1800
cluster.tier-promote-frequency: 3600
cluster.write-freq-threshold: 5
cluster.read-freq-threshold: 4
features.record-counters: on
cluster.watermark-hi: 40
cluster.watermark-low: 15
cluster.tier-mode: cache
features.ctr-enabled: on
features.uss: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on

[root@dhcp35-190 ~]# gluster v tier testvol status
Node          Promoted files   Demoted files   Status
---------     ---------        ---------       ---------
localhost     1515             150000          in progress
10.70.35.21   0                0               in progress
10.70.35.68   0                247809          in progress
10.70.35.157  0                250000          in progress
Tiering Migration Functionality: testvol: success
[root@dhcp35-190 ~]#

rhsauto004: Running the following script in screen: rename of files in a loop (Perl-style, with some lines left commented out; a readable Python sketch of this loop follows):

#for i in {1..5};for j in {2..5}; for k in {1..5};do mv a$ia.$k a$ja.$k;done;

$i = 1;
$k = 1;
while ($i < 40000) {
    `mkdir /mnt/vm_testvol/hot_dir/dir.$i`;
    while ($k < 40000) {
        $j = $i + 1;
        #`mkdir /mnt/vm_testvol/hot_dir/dir.$i`;
        `cp -f /mnt/vm_testvol/hot_dir/a"$i"a* /mnt/vm_testvol/hot_dir/dir.$i`;
        `mv /mnt/vm_testvol/hot_dir/a"$i"a.$k /mnt/vm_testvol/hot_dir/a"$j"a.$k`;
        #`cp -f /mnt/vm_testvol/hot_dir/a"$i"*
        $k++;
    }
    $j = 1;
    $k = 1;
    $i++;
}
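Since the Perl-style pseudocode above is hard to follow, here is a minimal Python sketch of what it appears to do. The base path, the a<i>a.<k> naming pattern, and the 40000 bounds are taken from the pseudocode; everything else (hoisting the copy out of the inner loop, the existence checks) is an assumption, so treat this as a sketch of the workload's intent rather than the exact script that ran.

#!/usr/bin/python
# Hypothetical, readable re-implementation of the rename workload above,
# for clarity only; not the script that was actually executed.
import glob
import os
import shutil

BASE = "/mnt/vm_testvol/hot_dir"  # assumed mount point, from the pseudocode

def rename_workload(limit=40000):
    for i in range(1, limit):
        dst_dir = os.path.join(BASE, "dir.%d" % i)
        if not os.path.isdir(dst_dir):
            os.mkdir(dst_dir)
        # cp -f a<i>a* dir.<i>/  (hoisted out of the inner loop here)
        for path in glob.glob(os.path.join(BASE, "a%da*" % i)):
            shutil.copy(path, dst_dir)
        # mv a<i>a.<k> -> a<i+1>a.<k>, handing the files to the next pass
        for k in range(1, limit):
            src = os.path.join(BASE, "a%da.%d" % (i, k))
            if os.path.exists(src):
                os.rename(src, os.path.join(BASE, "a%da.%d" % (i + 1, k)))

if __name__ == "__main__":
    rename_workload()

The net effect is a steady stream of creates, copies, and renames against the hot tier, which keeps the tier migration and EC code paths busy concurrently.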
=================> Problem Seen <=========================================

After about a week, when I checked the script running on rhsauto070, I found the mount no longer accessible. The screen session (which was running: date;for i in {1..100};do mkdir dir.$i;cp linux-4.3.3.tar.xz dir.$i/;tar -xvf dir.$i/linux-4.3.3.tar.xz -C dir.$i/;done;date) showed the following:

mkdir: cannot create directory ‘dir.92’: Transport endpoint is not connected
cp: failed to access ‘dir.92/’: Transport endpoint is not connected
tar: dir.92/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.93’: Transport endpoint is not connected
cp: failed to access ‘dir.93/’: Transport endpoint is not connected
tar: dir.93/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.94’: Transport endpoint is not connected
cp: failed to access ‘dir.94/’: Transport endpoint is not connected
tar: dir.94/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.95’: Transport endpoint is not connected
cp: failed to access ‘dir.95/’: Transport endpoint is not connected
tar: dir.95/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.96’: Transport endpoint is not connected
cp: failed to access ‘dir.96/’: Transport endpoint is not connected
tar: dir.96/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.97’: Transport endpoint is not connected
cp: failed to access ‘dir.97/’: Transport endpoint is not connected
tar: dir.97/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.98’: Transport endpoint is not connected
cp: failed to access ‘dir.98/’: Transport endpoint is not connected
tar: dir.98/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.99’: Transport endpoint is not connected
cp: failed to access ‘dir.99/’: Transport endpoint is not connected
tar: dir.99/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now
mkdir: cannot create directory ‘dir.100’: Transport endpoint is not connected
cp: failed to access ‘dir.100/’: Transport endpoint is not connected
tar: dir.100/linux-4.3.3.tar.xz: Cannot open: Transport endpoint is not connected
tar: Error is not recoverable: exiting now

=====> Found the core and below bt:

<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 28937]
[New LWP 28940]
[New LWP 28952]
[New LWP 28936]
[New LWP 28951]
[New LWP 28946]
[New LWP 28938]
[New LWP 28941]
[New LWP 28947]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --volfile-server=10.70.35.190 --volfile-id=testvol /mnt/tes'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f6a25b20210 in pthread_spin_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.13.2-10.el7.x86_64 libcom_err-1.42.9-7.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 libselinux-2.2.2-6.el7.x86_64 libuuid-2.23.2-26.el7.x86_64 openssl-libs-1.0.1e-42.el7_1.9.x86_64 pcre-8.32-15.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x00007f6a25b20210 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00007f6a18faf60d in ec_resume (fop=0x0, error=0) at ec-common.c:265
#2  0x00007f6a26cd27ab in gf_timer_proc (ctx=0x7f6a28465010) at timer.c:194
#3  0x00007f6a25b1bdc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f6a254621cd in clone () from /lib64/libc.so.6
(gdb) fr 1
#1  0x00007f6a18faf60d in ec_resume (fop=0x0, error=0) at ec-common.c:265
265         LOCK(&fop->lock);
(gdb) l
260
261     void ec_resume(ec_fop_data_t * fop, int32_t error)
262     {
263         ec_resume_f resume = NULL;
264
265         LOCK(&fop->lock);
266
267         __ec_fop_set_error(fop, error);
268
269         if (--fop->jobs == 0)
(gdb) bt

Note that frame #1 shows the timer thread calling ec_resume() with fop=0x0, so LOCK(&fop->lock) hands pthread_spin_lock() a pointer computed from NULL, which is exactly the SIGSEGV seen in frame #0.
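Purely as an illustration of this defect class (not the actual glusterfs code or its fix), the pattern in the backtrace is a deferred/timer callback firing with a payload that is already NULL and dereferencing it unconditionally. A minimal Python analogy, with all names hypothetical:

import threading

class Fop(object):
    """Stand-in for ec_fop_data_t: a lock plus a pending-jobs counter."""
    def __init__(self):
        self.lock = threading.Lock()
        self.jobs = 1

def ec_resume_analogy(fop, error):
    # Mirrors the shape of ec_resume(): it takes fop's lock without first
    # checking fop.  If the timer fires with fop already gone (None here,
    # fop=0x0 in frame #1 above), the dereference blows up.
    with fop.lock:                 # AttributeError when fop is None, just as
        fop.jobs -= 1              # pthread_spin_lock(&NULL->lock) segfaults
        if fop.jobs == 0:
            print("all jobs done, resuming with error=%d" % error)

# Normal case: the callback receives a live fop and completes.
threading.Timer(0.1, ec_resume_analogy, args=(Fop(), 0)).start()
# Failure case: the callback receives None -- the analogue of the crash.
threading.Timer(0.2, ec_resume_analogy, args=(None, 0)).start()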
sosreports:

[nchilaka@rhsqe-repo bug.1298162]$ pwd
/home/repo/sosreports/nchilaka/bug.1298162
[nchilaka@rhsqe-repo bug.1298162]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
Found another crash on another client, but the core was not collected. Saw the following in the fuse mount log:

[root@rhsauto070 glusterfs]# tail mnt-testvol.log
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.5
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f6a26cb3002]
/lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f6a26ccf47d]
/lib64/libc.so.6(+0x35670)[0x7f6a253a1670]
/lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7f6a25b20210]
---------
[root@rhsauto070 glusterfs]# tail -n 50 mnt-testvol.log
[2016-01-13 02:01:57.008411] I [dict.c:473:dict_get] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(+0x11c41) [0x7f6a24e19c41] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_replies_interpret+0x15e) [0x7f6a24e47e9e] -->/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f6a26cab0cc] ) 2-dict: !this || key=glusterfs.bad-inode [Invalid argument]
[2016-01-13 02:01:57.008451] I [dict.c:473:dict_get] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_replies_interpret+0x1ad) [0x7f6a24e47eed] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_accuse_smallfiles+0x66) [0x7f6a24e47cb6] -->/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f6a26cab0cc] ) 2-dict: !this || key=glusterfs.bad-inode [Invalid argument]
[2016-01-13 02:01:57.008512] I [dict.c:473:dict_get] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_replies_interpret+0x1ad) [0x7f6a24e47eed] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_accuse_smallfiles+0x66) [0x7f6a24e47cb6] -->/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f6a26cab0cc] ) 2-dict: !this || key=glusterfs.bad-inode [Invalid argument]
[2016-01-13 02:01:57.008663] I [dict.c:473:dict_get] (-->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_replies_interpret+0x1ad) [0x7f6a24e47eed] -->/usr/lib64/glusterfs/3.7.5/xlator/cluster/replicate.so(afr_accuse_smallfiles+0x66) [0x7f6a24e47cb6] -->/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f6a26cab0cc] ) 2-dict: !this || key=glusterfs.bad-inode [Invalid argument]
[2016-01-13 02:01:57.020032] I [MSGID: 109036] [dht-common.c:7950:dht_log_new_layout_for_dir_selfheal] 2-testvol-hot-dht: Setting layout of /kernel.leg/dir.45/linux-4.3.3/drivers/phy with [Subvol_name: testvol-replicate-0, Err: -1 , Start: 0 , Stop: 1431655764 , Hash: 1 ], [Subvol_name: testvol-replicate-1, Err: -1 , Start: 1431655765 , Stop: 2863311529 , Hash: 1 ], [Subvol_name: testvol-replicate-2, Err: -1 , Start: 2863311530 , Stop: 4294967295 , Hash: 1 ],
[2016-01-13 02:01:57.034867] I [MSGID: 109036] [dht-common.c:7950:dht_log_new_layout_for_dir_selfheal] 2-testvol-tier-dht: Setting layout of /kernel.leg/dir.45/linux-4.3.3/drivers/phy with [Subvol_name: testvol-cold-dht, Err: 28 , Start: 0 , Stop: 0 , Hash: 0 ], [Subvol_name: testvol-hot-dht, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2016-01-13 02:02:05.037947] E [MSGID: 122027] [ec-data.c:29:ec_cbk_data_allocate] 2-testvol-client-6: Mismatching xlators between request and answer (req=testvol-disperse-1, ans=testvol-client-6). [Invalid argument]
The message "E [MSGID: 122027] [ec-data.c:29:ec_cbk_data_allocate] 2-testvol-client-6: Mismatching xlators between request and answer (req=testvol-disperse-1, ans=testvol-client-6).
[Invalid argument]" repeated 4 times between [2016-01-13 02:02:05.037947] and [2016-01-13 02:02:05.039945] [2016-01-13 02:02:05.040251] E [MSGID: 122027] [ec-data.c:29:ec_cbk_data_allocate] 2-testvol-client-6: Mismatching xlators between request and answer (req=testvol-disperse-1, ans=testvol-client-6). [Invalid argument] pending frames: frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(CREATE) frame : type(1) op(CREATE) frame : type(1) op(CREATE) frame : type(1) op(CREATE) pending frames: frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(STAT) frame : type(1) op(CREATE) frame : type(1) op(CREATE) frame : type(1) op(CREATE) frame : type(1) op(CREATE) frame : type(1) op(OPEN) frame : type(0) op(0) patchset: git://git.gluster.com/glusterfs.git signal received: 11 time of crash: 2016-01-13 02:02:05 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 3.7.5 /lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f6a26cb3002] /lib64/libglusterfs.so.0(gf_print_trace+0x31d)[0x7f6a26ccf47d] /lib64/libc.so.6(+0x35670)[0x7f6a253a1670] /lib64/libpthread.so.0(pthread_spin_lock+0x0)[0x7f6a25b20210] --------- 10.70.35.190:testvol on /mnt/testvol type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072) fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime) 10.70.35.190:testdb on /mnt/tesdb type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072) binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime) [root@rhsauto070 glusterfs]# cd /mnt/testvol -bash: cd: /mnt/testvol: Transport endpoint is not connected [root@rhsauto070 glusterfs]#
https://review.gluster.org/11796
This bug was accidentally moved from POST to MODIFIED due to an error in automation; please contact mmccune with any questions.
I ran the steps mentioned above for about 4 days and didn't see any issue; hence, closing the bug as fixed.

Test version: 3.7.9-3
Looks good to me
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240