Description of problem:
The self-heal daemon (shd) is crashing repeatedly on one node, so heals cannot be triggered. Cores are generated on all nodes.

Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
glusterfs-6.0-6.el7rhgs.x86_64
python2-gluster-6.0-6.el7rhgs.x86_64
glusterfs-rdma-6.0-6.el7rhgs.x86_64
glusterfs-server-6.0-6.el7rhgs.x86_64
glusterfs-events-6.0-6.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
IO patterns:
1. Small-file workload with extensive softlink and hardlink creation
2. tar/untar of huge directories
3. 10 clients were used for the test

Actual results:
1. While doing exploratory testing with node reboots and brick-down/brick-up scenarios, hit an issue where shd crashed on two nodes.
2. Stopped all volumes and started them back; shd crashed again on one node.
3. Also did a volume restart with force; the shd daemon did not come back.
4. The system has been in the same state for the past 24 hours; the shd daemon is crashing repeatedly.

Expected results:
The shd daemon should not crash.

Additional info:

# gluster v status repl3
Status of volume: repl3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.50:/bricks/brick1/vol1       49152     0          Y       18603
Brick 10.70.46.132:/bricks/brick1/vol1      49152     0          Y       6285
Brick 10.70.46.216:/bricks/brick1/vol1      49152     0          Y       6675
Brick 10.70.46.216:/bricks/brick2/vol1      49153     0          Y       6682
Brick 10.70.46.132:/bricks/brick2/vol1      49153     0          Y       6294
Brick 10.70.35.50:/bricks/brick2/vol1       49153     0          Y       18610
Brick 10.70.46.132:/bricks/brick3/vol1      49154     0          Y       6303
Brick 10.70.35.50:/bricks/brick3/vol1       49154     0          Y       18621
Brick 10.70.46.216:/bricks/brick3/vol1      49154     0          Y       6692
Brick 10.70.35.50:/bricks/brick4/vol1       49155     0          Y       18632
Brick 10.70.46.132:/bricks/brick4/vol1      49155     0          Y       6310
Brick 10.70.46.216:/bricks/brick4/vol1      49155     0          Y       6699
Self-heal Daemon on localhost               N/A       N/A        Y       18716
Self-heal Daemon on 10.70.46.132            N/A       N/A        N       N/A
Self-heal Daemon on 10.70.46.216            N/A       N/A        N       N/A

Task Status of Volume repl3
------------------------------------------------------------------------------
There are no active volume tasks

=====================================================================

# gluster v info repl3

Volume Name: repl3
Type: Distributed-Replicate
Volume ID: 118cc5b8-87ce-4936-a8ea-280baf8716c9
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/vol1
Brick2: 10.70.46.132:/bricks/brick1/vol1
Brick3: 10.70.46.216:/bricks/brick1/vol1
Brick4: 10.70.46.216:/bricks/brick2/vol1
Brick5: 10.70.46.132:/bricks/brick2/vol1
Brick6: 10.70.35.50:/bricks/brick2/vol1
Brick7: 10.70.46.132:/bricks/brick3/vol1
Brick8: 10.70.35.50:/bricks/brick3/vol1
Brick9: 10.70.46.216:/bricks/brick3/vol1
Brick10: 10.70.35.50:/bricks/brick4/vol1
Brick11: 10.70.46.132:/bricks/brick4/vol1
Brick12: 10.70.46.216:/bricks/brick4/vol1
Options Reconfigured:
diagnostics.client-log-level: TRACE
cluster.shd-max-threads: 40
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable

=======================================================================

# gluster v list
gluster_shared_storage
non-root
repl3

=======================================================================

Preliminary investigation suggests that a softlink is being picked up for data heal, which should not happen, as a softlink carries no data.
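Since cores are being generated on all nodes, a backtrace from one of them should show where shd is dying. A minimal sketch for collecting one, assuming the cores land under /var/log/core (the actual location depends on the kernel.core_pattern / abrt configuration on these nodes) and using a placeholder core filename:

# Install debug symbols matching glusterfs-6.0-6.el7rhgs (debuginfo-install comes from yum-utils):
debuginfo-install -y glusterfs

# shd runs as a glusterfs client process, so point gdb at that binary:
gdb /usr/sbin/glusterfs /var/log/core/<core-file>

# Inside gdb, dump full backtraces for all threads:
(gdb) thread apply all bt full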
The shd daemon is crashing when it attempts this data heal on the softlink. Rafi is looking into the system for RCA; system details will be provided in the next comment.
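In the meantime, the softlink theory can be sanity-checked from the CLI. A minimal sketch, using the repl3 brick paths from the volume info above; <path-from-heal-info> is a placeholder for an entry reported as pending heal:

# List entries still pending heal on each brick of repl3:
gluster volume heal repl3 info

# For any listed path, confirm the entry type directly on a backend brick;
# stat (without -L) reports "symbolic link" for a softlink, which would
# confirm that a softlink is sitting in the heal queue:
stat -c '%F' /bricks/brick1/vol1/<path-from-heal-info>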
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1716626#c8, this bug blocks the verification of BZ#1716626.
I'm not sure why this is in POST state. Where's the patch?
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.