Description of problem:
=======================
Had a 6-node cluster running the 3.8.4-23 build. Created a 1 x (4+2) EC (disperse) volume and mounted it via FUSE. Created two files, 'test1' and 'test2', and corrupted both; the scrubber detected both files as corrupted. Updated the build to 3.8.4-25, restarted glusterd, and followed the file recovery steps documented in the admin guide. 'test2' recovered successfully, but 'test1' failed with 'Input/output error' on the mount point, and volume status showed 2 brick processes down (the brick processes had crashed; see the backtrace below).

Core file and sosreports will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-25

How reproducible:
=================
1:1

Additional info:
================
[root@dhcp47-121 ~]# rpm -qa | grep gluster
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-events-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64

[root@dhcp47-121 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)

Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)

Hostname: dhcp47-115.lab.eng.blr.redhat.com
Uuid: f828fdfa-e08f-4d12-85d8-2121cafcf9d0
State: Peer in Cluster (Connected)

Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Connected)

Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
[root@dhcp47-121 ~]# gluster v status disp2
Status of volume: disp2
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.121:/bricks/brick8/disp2_0               49154     0          Y       5552
Brick 10.70.47.113:/bricks/brick8/disp2_1               N/A       N/A        N       N/A
Brick 10.70.47.114:/bricks/brick8/disp2_2               49154     0          Y       30916
Brick 10.70.47.115:/bricks/brick8/disp2_3               49154     0          Y       23469
Brick 10.70.47.116:/bricks/brick8/disp2_4               49153     0          Y       27754
Brick 10.70.47.117:/bricks/brick8/disp2_5               N/A       N/A        N       N/A
Self-heal Daemon on localhost                           N/A       N/A        Y       5497
Bitrot Daemon on localhost                              N/A       N/A        Y       5515
Scrubber Daemon on localhost                            N/A       N/A        Y       5525
Self-heal Daemon on dhcp47-113.lab.eng.blr.redhat.com   N/A       N/A        Y       5893
Bitrot Daemon on dhcp47-113.lab.eng.blr.redhat.com      N/A       N/A        Y       5911
Scrubber Daemon on dhcp47-113.lab.eng.blr.redhat.com    N/A       N/A        Y       5921
Self-heal Daemon on dhcp47-114.lab.eng.blr.redhat.com   N/A       N/A        Y       30858
Bitrot Daemon on dhcp47-114.lab.eng.blr.redhat.com      N/A       N/A        Y       30876
Scrubber Daemon on dhcp47-114.lab.eng.blr.redhat.com    N/A       N/A        Y       30886
Self-heal Daemon on dhcp47-116.lab.eng.blr.redhat.com   N/A       N/A        Y       27708
Bitrot Daemon on dhcp47-116.lab.eng.blr.redhat.com      N/A       N/A        Y       27726
Scrubber Daemon on dhcp47-116.lab.eng.blr.redhat.com    N/A       N/A        Y       27736
Self-heal Daemon on dhcp47-117.lab.eng.blr.redhat.com   N/A       N/A        Y       9684
Bitrot Daemon on dhcp47-117.lab.eng.blr.redhat.com      N/A       N/A        Y       9702
Scrubber Daemon on dhcp47-117.lab.eng.blr.redhat.com    N/A       N/A        Y       9712
Self-heal Daemon on dhcp47-115.lab.eng.blr.redhat.com   N/A       N/A        Y       23411
Bitrot Daemon on dhcp47-115.lab.eng.blr.redhat.com      N/A       N/A        Y       23429
Scrubber Daemon on dhcp47-115.lab.eng.blr.redhat.com    N/A       N/A        Y       23439

Task Status of Volume disp2
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-121 ~]# gluster v info disp2

Volume Name: disp2
Type: Disperse
Volume ID: d7b0d170-f0e0-4e26-9369-f0a52dc92d38
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.121:/bricks/brick8/disp2_0
Brick2: 10.70.47.113:/bricks/brick8/disp2_1
Brick3: 10.70.47.114:/bricks/brick8/disp2_2
Brick4: 10.70.47.115:/bricks/brick8/disp2_3
Brick5: 10.70.47.116:/bricks/brick8/disp2_4
Brick6: 10.70.47.117:/bricks/brick8/disp2_5
Options Reconfigured:
performance.stat-prefetch: off
nfs.disable: on
transport.address-family: inet
features.bitrot: on
features.scrub: Active
features.scrub-freq: hourly
cluster.brick-multiplex: disable

[root@dhcp47-121 ~]# gluster v bitrot disp2 scrub status

Volume name : disp2
State of scrub: Active (In Progress)
Scrub impact: lazy
Scrub frequency: hourly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: localhost
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:12
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-114.lab.eng.blr.redhat.com
Number of Scrubbed files: 1
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:12
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-116.lab.eng.blr.redhat.com
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:14
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-113.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 08:35:24
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 2
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 09:35:11
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================
Node: dhcp47-117.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2017-05-16 08:35:23
Duration of last scrub (D:M:H:M:S): 0:0:0:7
Error count: 0
=========================================================

[root@dhcp47-121 ~]# gluster v heal disp2 info
Brick 10.70.47.121:/bricks/brick8/disp2_0
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1

Brick 10.70.47.113:/bricks/brick8/disp2_1
Status: Transport endpoint is not connected
Number of entries: -

Brick 10.70.47.114:/bricks/brick8/disp2_2
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1

Brick 10.70.47.115:/bricks/brick8/disp2_3
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1
Brick 10.70.47.116:/bricks/brick8/disp2_4
/d1/d2/d3/d4/test2
Status: Connected
Number of entries: 1

Brick 10.70.47.117:/bricks/brick8/disp2_5
Status: Transport endpoint is not connected
Number of entries: -

The brick log shows the OPEN on 'test2' failing with Input/output error shortly before the crash:

[2017-05-16 08:54:10.160132] E [MSGID: 115070] [server-rpc-fops.c:1474:server_open_cbk] 0-disp2-server: 4619: OPEN /d1/d2/d3/d4/test2 (3673eecb-e5b5-4014-9bc6-a2fc007f08cb) ==> (Input/output error) [Input/output error]

pending frames:
frame : type(0) op(29)
frame : type(0) op(11)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-05-16 08:55:01
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7f0e805201b2]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7f0e80529bd4]
/lib64/libc.so.6(+0x35250)[0x7f0e7ec02250]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xadf4)[0x7f0e7174cdf4]
/usr/lib64/glusterfs/3.8.4/xlator/features/bitrot-stub.so(+0xde56)[0x7f0e7174fe56]
/usr/lib64/glusterfs/3.8.4/xlator/features/access-control.so(+0x5815)[0x7f0e71535815]
/usr/lib64/glusterfs/3.8.4/xlator/features/locks.so(+0x6dc8)[0x7f0e71312dc8]
/usr/lib64/glusterfs/3.8.4/xlator/features/worm.so(+0x7e59)[0x7f0e71106e59]
/usr/lib64/glusterfs/3.8.4/xlator/features/read-only.so(+0x4478)[0x7f0e70efb478]
/usr/lib64/glusterfs/3.8.4/xlator/features/leases.so(+0x50b4)[0x7f0e70ce70b4]
/usr/lib64/glusterfs/3.8.4/xlator/features/upcall.so(+0xf143)[0x7f0e70ad7143]
/lib64/libglusterfs.so.0(default_open_resume+0x1c9)[0x7f0e805b1269]
/lib64/libglusterfs.so.0(call_resume+0x75)[0x7f0e80542b25]
/usr/lib64/glusterfs/3.8.4/xlator/performance/io-threads.so(+0x4957)[0x7f0e708c1957]
/lib64/libpthread.so.0(+0x7dc5)[0x7f0e7f37fdc5]
/lib64/libc.so.6(clone+0x6d)[0x7f0e7ecc473d]
[qe@rhsqe-repo 1451280]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1451280]$ pwd
/home/repo/sosreports/1451280
[qe@rhsqe-repo 1451280]$ ll
total 708976
-rwxr-xr-x. 1 qe qe 157433856 May 16 15:13 core.5950
-rwxr-xr-x. 1 qe qe 157433856 May 16 15:13 core.9730
-rwxr-xr-x. 1 qe qe  73012628 May 16 15:12 sosreport-sysreg-prod-20170516050748.tar.xz_dhcp47_121
-rwxr-xr-x. 1 qe qe  69134612 May 16 15:12 sosreport-sysreg-prod-20170516050917.tar.xz_dhcp47_113
-rwxr-xr-x. 1 qe qe  69795020 May 16 15:12 sosreport-sysreg-prod-20170516051025.tar.xz_dhcp47_114
-rwxr-xr-x. 1 qe qe  69256712 May 16 15:12 sosreport-sysreg-prod-20170516051259.tar.xz_dhcp47_115
-rwxr-xr-x. 1 qe qe  65140528 May 16 15:12 sosreport-sysreg-prod-20170516051545.tar.xz_dhcp47_116
-rwxr-xr-x. 1 qe qe  64772920 May 16 15:12 sosreport-sysreg-prod-20170516051639.tar.xz_dhcp47_117
Following is the bt:

Program terminated with signal 11, Segmentation fault.
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
40              new->next = head;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_add_tail (head=0x7f0e28001908, new=0x18) at ../../../../../libglusterfs/src/list.h:40
#1  br_stub_add_fd_to_inode (this=this@entry=0x7f0e6c012440, fd=fd@entry=0x7f0e6c0a5050, ctx=ctx@entry=0x0) at bit-rot-stub.c:2398
#2  0x00007f0e7174fe56 in br_stub_open (frame=0x7f0e28000ca0, this=0x7f0e6c012440, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at bit-rot-stub.c:2352
#3  0x00007f0e71535815 in posix_acl_open (frame=0x7f0e280014b0, this=0x7f0e6c013d70, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at posix-acl.c:1129
#4  0x00007f0e71312dc8 in pl_open (frame=frame@entry=0x7f0e28000ac0, this=this@entry=0x7f0e6c015320, loc=loc@entry=0x7f0e6c0ccf90, flags=flags@entry=2, fd=fd@entry=0x7f0e6c0a5050, xdata=xdata@entry=0x0) at posix.c:1698
#5  0x00007f0e71106e59 in worm_open (frame=0x7f0e28000ac0, this=<optimized out>, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at worm.c:43
#6  0x00007f0e70efb478 in ro_open (frame=0x7f0e28001740, this=0x7f0e6c018130, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at read-only-common.c:341
#7  0x00007f0e70ce70b4 in leases_open (frame=0x7f0e28001b50, this=0x7f0e6c019880, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at leases.c:75
#8  0x00007f0e70ad7143 in up_open (frame=0x7f0e28002250, this=0x7f0e6c01af20, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at upcall.c:75
#9  0x00007f0e805b1269 in default_open_resume (frame=0x7f0e6c002020, this=0x7f0e6c01c690, loc=0x7f0e6c0ccf90, flags=2, fd=0x7f0e6c0a5050, xdata=0x0) at defaults.c:1726
#10 0x00007f0e80542b25 in call_resume (stub=0x7f0e6c0ccf40) at call-stub.c:2508
#11 0x00007f0e708c1957 in iot_worker (data=0x7f0e6c0550e0) at io-threads.c:220
#12 0x00007f0e7f37fdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0e7ecc473d in clone () from /lib64/libc.so.6
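Reading the bt: frame #1 shows br_stub_add_fd_to_inode() entered with ctx=ctx@entry=0x0, and frame #0 shows list_add_tail() handed 0x18 as its 'new' list pointer. Taking the address of a list member inside a NULL context pointer is not itself a dereference; it just yields the member's offset as a raw address, and the process only faults on the first store through it (list.h:40). In other words, br_stub_open() passed a NULL inode context straight into the fd-list insertion. Below is a minimal standalone C sketch of that failure mode; the struct name, its fields, and the 0x18 offset are illustrative assumptions, not the real bitrot-stub definitions:

/* Illustration of frames #0/#1; struct layout and names are assumed. */
#include <stdio.h>

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

/* Hypothetical stand-in for the bitrot-stub inode context, laid out so
 * that fd_list lands at offset 0x18 on LP64, matching "new=0x18". */
struct inode_ctx {
        unsigned long a;                /* offset 0x00 */
        unsigned long b;                /* offset 0x08 */
        unsigned long c;                /* offset 0x10 */
        struct list_head fd_list;       /* offset 0x18 */
};

/* Kernel-style list insertion, as in libglusterfs/src/list.h. */
static inline void
list_add_tail (struct list_head *new, struct list_head *head)
{
        new->next = head;               /* list.h:40 -- first store through
                                         * the bad pointer; SIGSEGV here   */
        new->prev = head->prev;
        head->prev->next = new;
        head->prev = new;
}

int
main (void)
{
        struct inode_ctx *ctx = NULL;   /* ctx=ctx@entry=0x0, as in frame #1 */
        struct list_head anchor = { &anchor, &anchor };  /* a valid list head */

        /* &ctx->fd_list does only pointer arithmetic (undefined behaviour,
         * but no memory access), so it quietly evaluates to (void *)0x18: */
        printf ("&ctx->fd_list = %p\n", (void *) &ctx->fd_list);

        /* The fault happens only inside the insertion; uncommenting this
         * reproduces the segfault seen in frame #0:
         * list_add_tail (&ctx->fd_list, &anchor);
         */
        return 0;
}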
Upstream patches:
  https://review.gluster.org/17357 (master)
  https://review.gluster.org/#/c/17406/ (release-3.11)

Downstream patch:
  https://code.engineering.redhat.com/gerrit/#/c/107534/
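For quick reference, the shape of fix this class of crash calls for (a hypothetical sketch, not the actual patch; the gerrit links above are authoritative) is to stop handing a NULL inode context to the fd-list insertion, either by failing the fop cleanly or by creating the context on demand:

#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical stand-in for the bitrot-stub per-inode context. */
struct inode_ctx {
        struct list_head fd_list;
        /* ... other per-inode bitrot state ... */
};

/* Hypothetical guard: make sure a context exists before anything takes
 * &ctx->fd_list, so the list insertion never operates on a pointer
 * derived from NULL. */
static struct inode_ctx *
get_or_create_inode_ctx (struct inode_ctx *ctx)
{
        if (ctx != NULL)
                return ctx;

        ctx = calloc (1, sizeof (*ctx));
        if (ctx == NULL)
                return NULL;                 /* caller unwinds with an error */

        ctx->fd_list.next = &ctx->fd_list;   /* empty circular list, i.e.  */
        ctx->fd_list.prev = &ctx->fd_list;   /* what INIT_LIST_HEAD does   */
        return ctx;
}

Either way, the hard SIGSEGV becomes an orderly error (or a successful open) on the fop path.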
Tested and verified this on the glusterfs-3.8.4-35 build. A round of bitrot testing has been completed on that build, and this crash has not reappeared anywhere in my logs. Moving this to Verified for 3.3.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774