Description of problem:
======================
Had a 4*2 volume on a 4-node cluster with parallel-readdir and negative-lookup-cache enabled. Had it mounted over 3 clients via cifs and created a mix of large files, small files, hardlinks, softlinks and directories. Went about a bitrot life-cycle of corrupting files/links at different nodes and levels, and the entire process of recovering/restoring a bad file. Over a period of time, volume status showed the brick process down. 4 cores are generated in 3 nodes, all of which seem to be of similar nature.

Sosreports and core files will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
===========================================================
3.8.4-25

How reproducible:
==================
1:1

Additional info:
=================

[root@dhcp47-127 ~]# rpm -qa | grep -i glusterfs
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
glusterfs-events-3.8.4-25.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64

[root@dhcp47-127 ~]# gluster v status
Status of volume: ctdb
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.127:/bricks/brick8/ctdb      49152     0          Y       22460
Brick 10.70.46.181:/bricks/brick8/ctdb      49152     0          Y       3734
Brick 10.70.46.47:/bricks/brick8/ctdb       49152     0          Y       12996
Brick 10.70.47.140:/bricks/brick8/ctdb      49152     0          Y       18260
Self-heal Daemon on localhost               N/A       N/A        Y       22449
Self-heal Daemon on dhcp46-181.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       3723
Self-heal Daemon on dhcp47-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       18249
Self-heal Daemon on dhcp46-47.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       12985

Task Status of Volume ctdb
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: saturday-saturday
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick0        49153     0          Y       22466
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick1        49153     0          Y       3740
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick0/saturday-saturday_brick2         49153     0          Y       13002
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick3        49153     0          Y       18266
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick4        N/A       N/A        N       N/A
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick5        49154     0          Y       3750
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick1/saturday-saturday_brick6         N/A       N/A        N       N/A
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick7        49154     0          Y       18276
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick8        N/A       N/A        N       N/A
Brick dhcp46-181.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick9        N/A       N/A        N       N/A
Brick dhcp46-47.lab.eng.blr.redhat.com:/bri
cks/brick2/saturday-saturday_brick10        49155     0          Y       13020
Brick dhcp47-140.lab.eng.blr.redhat.com:/br
icks/brick2/saturday-saturday_brick11       49155     0          Y       18284
Snapshot Daemon on localhost                49162     0          Y       22546
Self-heal Daemon on localhost               N/A       N/A        Y       22449
Bitrot Daemon on localhost                  N/A       N/A        Y       9001
Scrubber Daemon on localhost                N/A       N/A        Y       9018
Snapshot Daemon on dhcp46-181.lab.eng.blr.r
edhat.com                                   49162     0          Y       3821
Self-heal Daemon on dhcp46-181.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       3723
Bitrot Daemon on dhcp46-181.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       3023
Scrubber Daemon on dhcp46-181.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       3034
Snapshot Daemon on dhcp47-140.lab.eng.blr.r
edhat.com                                   49162     0          Y       18348
Self-heal Daemon on dhcp47-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       18249
Bitrot Daemon on dhcp47-140.lab.eng.blr.red
hat.com                                     N/A       N/A        Y       14952
Scrubber Daemon on dhcp47-140.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       14963
Snapshot Daemon on dhcp46-47.lab.eng.blr.re
dhat.com                                    49162     0          Y       13083
Self-heal Daemon on dhcp46-47.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       12985
Bitrot Daemon on dhcp46-47.lab.eng.blr.redh
at.com                                      N/A       N/A        Y       10810
Scrubber Daemon on dhcp46-47.lab.eng.blr.re
dhat.com                                    N/A       N/A        Y       10822

Task Status of Volume saturday-saturday
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp47-127 ~]# gluster v info

Volume Name: ctdb
Type: Replicate
Volume ID: 64980fe9-85ea-487e-8d0d-39b70c8626b0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.127:/bricks/brick8/ctdb
Brick2: 10.70.46.181:/bricks/brick8/ctdb
Brick3: 10.70.46.47:/bricks/brick8/ctdb
Brick4: 10.70.47.140:/bricks/brick8/ctdb
Options Reconfigured:
nfs.disable: on
transport.address-family: inet

Volume Name: saturday-saturday
Type: Distributed-Replicate
Volume ID: 4a24c34c-1144-4f07-9763-6e232c037a67
Status: Started
Snapshot Count: 2
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick0
Brick2: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick1
Brick3: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick2
Brick4: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick3
Brick5: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick4
Brick6: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick5
Brick7: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick6
Brick8: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick7
Brick9: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick8
Brick10: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick9
Brick11: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick10
Brick12: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick11
Options Reconfigured:
features.scrub: Active
features.bitrot: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nl-cache: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
transport.address-family: inet
nfs.disable: on
server.allow-insecure: on
performance.stat-prefetch: on
storage.batch-fsync-delay-usec: 0
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.cache-samba-metadata: on
performance.parallel-readdir: on

[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]#
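For anyone triaging a similar report, the bricks marked offline in the `gluster v status` output above can be picked out mechanically. The following is only a minimal sketch, not part of the original report: the function name `offline_bricks` and the `SAMPLE` text are hypothetical, and the parser assumes the column order shown above (TCP Port, RDMA Port, Online, Pid) and that no fragment of a brick name is purely numeric or equal to `N/A`. It works on whitespace-separated tokens, so brick names that the CLI wraps across two lines are rejoined.

```python
import re

# A value in the "TCP Port" column is either a number or the literal N/A.
PORT_RE = re.compile(r"^(?:\d+|N/A)$")


def offline_bricks(status_text):
    """Return the brick names whose Online column reads 'N'.

    Assumes each brick row is: Brick <name, possibly wrapped>
    <tcp port> <rdma port> <online> <pid>.
    """
    tokens = status_text.split()
    offline = []
    i = 0
    while i < len(tokens):
        if tokens[i] != "Brick":
            i += 1
            continue
        # Collect name fragments until the TCP Port column starts.
        j = i + 1
        parts = []
        while j < len(tokens) and not PORT_RE.match(tokens[j]):
            parts.append(tokens[j])
            j += 1
        # tokens[j:j+4] are: TCP port, RDMA port, Online, Pid.
        if j + 3 < len(tokens) and tokens[j + 2] == "N":
            offline.append("".join(parts))
        i = j + 4
    return offline


# Hypothetical excerpt in the same layout as the status output in this report.
SAMPLE = """
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick0/saturday-saturday_brick0        49153     0          Y       22466
Brick dhcp47-127.lab.eng.blr.redhat.com:/br
icks/brick1/saturday-saturday_brick4        N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       22449
"""

if __name__ == "__main__":
    for name in offline_bricks(SAMPLE):
        print(name)
```

Run against the full status output in this report, a parser like this would be expected to list the five `saturday-saturday` bricks whose Online column is `N`; daemon rows are ignored because they do not start with the `Brick` token.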
[qe@rhsqe-repo 1453160]$ 
[qe@rhsqe-repo 1453160]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1453160]$ 
[qe@rhsqe-repo 1453160]$ pwd
/home/repo/sosreports/1453160
[qe@rhsqe-repo 1453160]$ 
[qe@rhsqe-repo 1453160]$ ll
total 980952
-rwxr-xr-x. 1 qe qe      3473 May 22 15:03 client_logs
-rwxr-xr-x. 1 qe qe 239718400 May 22 15:00 core.13011
-rwxr-xr-x. 1 qe qe 233590784 May 22 15:00 core.22469
-rwxr-xr-x. 1 qe qe 222420992 May 22 15:00 core.22484
-rwxr-xr-x. 1 qe qe 199450624 May 22 15:00 core.3758
-rwxr-xr-x. 1 qe qe  30375496 May 22 14:54 sosreport-sysreg-prod-20170522090552_dhcp_47_127.tar.xz
-rwxr-xr-x. 1 qe qe  26819980 May 22 14:54 sosreport-sysreg-prod-20170522090554_dhcp_46_181.tar.xz
-rwxr-xr-x. 1 qe qe  26213468 May 22 14:54 sosreport-sysreg-prod-20170522090556_dhcp_46_47.tar.xz
-rwxr-xr-x. 1 qe qe  25884564 May 22 14:54 sosreport-sysreg-prod-20170522090558_dhcp_47_140.tar.xz
[qe@rhsqe-repo 1453160]$
upstream patch : https://review.gluster.org/17357
Backtrace:

#0  br_stub_need_versioning (this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, versioning=versioning@entry=0x7fcda19a4058, modified=modified@entry=0x7fcda19a405c, ctx=ctx@entry=0x7fcda19a4060)
    at bit-rot-stub.c:466
466	        br_stub_inode_ctx_t *c = NULL;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7_3.2.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 sqlite-3.7.17-8.el7.x86_64 sssd-client-1.14.0-43.el7_3.14.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  br_stub_need_versioning (this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, versioning=versioning@entry=0x7fcda19a4058, modified=modified@entry=0x7fcda19a405c, ctx=ctx@entry=0x7fcda19a4060)
    at bit-rot-stub.c:466
#1  0x00007fcdae8465d8 in br_stub_writev (frame=frame@entry=0x7fcd6400b5f0, this=this@entry=0x7fcda8012aa0, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, count=count@entry=1, offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710)
    at bit-rot-stub.c:1959
#2  0x00007fcdbd8955f9 in default_writev (frame=0x7fcd6400b5f0, this=<optimized out>, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, xdata=0x7fcd9402e710)
    at defaults.c:2543
#3  0x00007fcdae411d24 in pl_writev (frame=frame@entry=0x7fcd6400b920, this=this@entry=0x7fcda8015980, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, count=count@entry=1, offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710)
    at posix.c:2030
#4  0x00007fcdae1fc4c4 in worm_writev (frame=frame@entry=0x7fcd6400b920, this=this@entry=0x7fcda8016f60, fd=fd@entry=0x7fcd94008480, vector=vector@entry=0x7fcd94030a30, count=count@entry=1, offset=offset@entry=1577648128, flags=flags@entry=0, iobref=iobref@entry=0x7fcd940086b0, xdata=xdata@entry=0x7fcd9402e710)
    at worm.c:386
#5  0x00007fcdadff1ea7 in ro_writev (frame=0x7fcd6400b920, this=<optimized out>, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, xdata=0x7fcd9402e710)
    at read-only-common.c:383
#6  0x00007fcdadddd83d in leases_writev (frame=0x7fcd64006ce0, this=0x7fcda8019ef0, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, xdata=0x7fcd9402e710)
    at leases.c:127
#7  0x00007fcdadbcc8ad in up_writev (frame=0x7fcd64006f00, this=0x7fcda801b630, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, xdata=0x7fcd9402e710)
    at upcall.c:131
#8  0x00007fcdbd8ae736 in default_writev_resume (frame=0x7fcd94036c30, this=0x7fcda801cf60, fd=0x7fcd94008480, vector=0x7fcd94030a30, count=1, off=1577648128, flags=0, iobref=0x7fcd940086b0, xdata=0x7fcd9402e710)
    at defaults.c:1849
#9  0x00007fcdbd83e698 in call_resume_wind (stub=0x7fcd9404de20) at call-stub.c:2045
#10 0x00007fcdbd83eb25 in call_resume (stub=0x7fcd9404de20) at call-stub.c:2508
#11 0x00007fcdad9b7957 in iot_worker (data=0x7fcda8056490) at io-threads.c:220
#12 0x00007fcdbc67bdc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcdbbfc073d in clone () from /lib64/libc.so.6
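Since four cores were collected and described as being of similar nature, one way to check that claim is to reduce each gdb backtrace to its ordered list of function names and compare the lists. The snippet below is only an illustrative sketch, not something used in this bug: `frame_functions` and `BT_EXCERPT` are hypothetical names, and the regular expression assumes the frame layout seen in the backtrace above (`#N [0xADDR in] function (...)`).

```python
import re

# Matches "#3  0x00007f... in pl_writev (" as well as "#0  br_stub_need_versioning (".
FRAME_RE = re.compile(
    r"#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\("
)


def frame_functions(bt_text):
    """Return function names in frame order from gdb 'bt' text.

    gdb often prints frame #0 once on attach and again after 'bt';
    the first occurrence of each frame number wins.
    """
    frames = {}
    for match in FRAME_RE.finditer(bt_text):
        frames.setdefault(int(match.group(1)), match.group(2))
    return [frames[number] for number in sorted(frames)]


# Hypothetical shortened excerpt in the same shape as the backtrace above.
BT_EXCERPT = (
    "#0 br_stub_need_versioning (this=this@entry=0x7fcda8012aa0) at bit-rot-stub.c:466 "
    "(gdb) bt "
    "#0 br_stub_need_versioning (this=this@entry=0x7fcda8012aa0) at bit-rot-stub.c:466 "
    "#1 0x00007fcdae8465d8 in br_stub_writev (frame=frame@entry=0x7fcd6400b5f0) at bit-rot-stub.c:1959 "
    "#2 0x00007fcdbd8955f9 in default_writev (frame=0x7fcd6400b5f0, this=<optimized out>) at defaults.c:2543"
)

if __name__ == "__main__":
    print(frame_functions(BT_EXCERPT))
```

Two cores could then be treated as the same crash signature when, say, their top few frames match: `frame_functions(bt_a)[:3] == frame_functions(bt_b)[:3]`. How many frames to compare is a judgment call.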
The fix sent for BZ 1451280 solves this crash as well. Keeping this bug open because the backtrace is different.

Upstream Patch: https://review.gluster.org/17357
Upstream Patches:
https://review.gluster.org/17357 (master)
https://review.gluster.org/#/c/17406/ (release-3.11)

Downstream Patches:
https://code.engineering.redhat.com/gerrit/#/c/107534/
Tested and verified this on the build glusterfs-3.8.4-35. Ran a round of bitrot testing and have not seen this crash again in my logs. Moving this bug to verified in 3.3.0.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774