Bug 1721513 - shd daemon is crashing multiple times on one node
Summary: shd daemon is crashing multiple times on one node
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.5
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Mohammed Rafi KC
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: shd-multiplexing
Depends On:
Blocks:
 
Reported: 2019-06-18 13:03 UTC by Anees Patel
Modified: 2023-09-14 05:30 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-20 09:32:28 UTC
Embargoed:



Description Anees Patel 2019-06-18 13:03:22 UTC
Description of problem:

Self-heal daemon is crashing multiple times on one node; hence, heal cannot be triggered. Cores are generated on all nodes.


Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
glusterfs-6.0-6.el7rhgs.x86_64
python2-gluster-6.0-6.el7rhgs.x86_64
glusterfs-rdma-6.0-6.el7rhgs.x86_64
glusterfs-server-6.0-6.el7rhgs.x86_64
glusterfs-events-6.0-6.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
IO patterns (a minimal sketch follows this list):
1. Small-file workload with extensive softlink and hardlink creation
2. tar and untar of huge directory trees
3. 10 clients were used for the test
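
A minimal sketch of the IO pattern above, assuming a FUSE mount of the volume at /mnt/repl3 (the mount point, file count, and directory names are illustrative, not the exact test harness):

#!/bin/bash
# Illustrative small-file + link-heavy workload against an assumed
# FUSE mount of the repl3 volume.
MOUNT=/mnt/repl3
mkdir -p "$MOUNT/smallfiles" "$MOUNT/links"

# Small-file creation with extensive softlink and hardlink creation
for i in $(seq 1 10000); do
    echo "data-$i" > "$MOUNT/smallfiles/f$i"
    ln -s "../smallfiles/f$i" "$MOUNT/links/soft$i"   # softlink
    ln "$MOUNT/smallfiles/f$i" "$MOUNT/links/hard$i"  # hardlink
done

# tar and untar of the resulting (large) directory tree
tar -cf "$MOUNT/tree.tar" -C "$MOUNT" smallfiles links
mkdir -p "$MOUNT/untarred"
tar -xf "$MOUNT/tree.tar" -C "$MOUNT/untarred"

In the test this pattern was driven from 10 clients in parallel; the sketch shows a single client's workload.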
Actual results:
1. While doing exploratory testing with node reboots and brick-down/brick-up scenarios, hit an issue where shd crashed on two nodes.
2. Stopped all volumes and started them back; shd crashed again on one node.
3. Also did a volume restart with force; the shd daemon did not come back.
4. The system has been in the same state for the past 24 hours.

shd daemon is crashing repeatedly
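
Since cores are generated on all nodes, the crash backtrace can be pulled straight from a core for RCA. A sketch, assuming shd runs as the /usr/sbin/glusterfs binary (the glustershd process in a standard install) and that the kernel dumped the core to / (the core filename below is a hypothetical example):

# Confirm the shd binary/PID while the daemon is up:
ps aux | grep glustershd
# Dump full backtraces of all threads from a generated core:
gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterfs /core.12345 > shd-bt.txt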

Expected results:
The shd daemon should not crash.

Additional info:

# gluster v status repl3
Status of volume: repl3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.50:/bricks/brick1/vol1       49152     0          Y       18603
Brick 10.70.46.132:/bricks/brick1/vol1      49152     0          Y       6285 
Brick 10.70.46.216:/bricks/brick1/vol1      49152     0          Y       6675 
Brick 10.70.46.216:/bricks/brick2/vol1      49153     0          Y       6682 
Brick 10.70.46.132:/bricks/brick2/vol1      49153     0          Y       6294 
Brick 10.70.35.50:/bricks/brick2/vol1       49153     0          Y       18610
Brick 10.70.46.132:/bricks/brick3/vol1      49154     0          Y       6303 
Brick 10.70.35.50:/bricks/brick3/vol1       49154     0          Y       18621
Brick 10.70.46.216:/bricks/brick3/vol1      49154     0          Y       6692 
Brick 10.70.35.50:/bricks/brick4/vol1       49155     0          Y       18632
Brick 10.70.46.132:/bricks/brick4/vol1      49155     0          Y       6310 
Brick 10.70.46.216:/bricks/brick4/vol1      49155     0          Y       6699 
Self-heal Daemon on localhost               N/A       N/A        Y       18716
Self-heal Daemon on 10.70.46.132            N/A       N/A        N       N/A  
Self-heal Daemon on 10.70.46.216            N/A       N/A        N       N/A  
 
Task Status of Volume repl3
------------------------------------------------------------------------------
There are no active volume tasks
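
With shd down on two of the three nodes, heal cannot be launched, which matches "heal cannot be triggered" above. For reference, the standard CLI used to trigger and inspect heals (nothing here is specific to this setup):

# Trigger an index heal on the volume:
gluster volume heal repl3
# List entries still pending heal, per brick:
gluster volume heal repl3 info
# Per-brick pending counts only:
gluster volume heal repl3 info summary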

=====================================================================
# gluster v info repl3
 
Volume Name: repl3
Type: Distributed-Replicate
Volume ID: 118cc5b8-87ce-4936-a8ea-280baf8716c9
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick1/vol1
Brick2: 10.70.46.132:/bricks/brick1/vol1
Brick3: 10.70.46.216:/bricks/brick1/vol1
Brick4: 10.70.46.216:/bricks/brick2/vol1
Brick5: 10.70.46.132:/bricks/brick2/vol1
Brick6: 10.70.35.50:/bricks/brick2/vol1
Brick7: 10.70.46.132:/bricks/brick3/vol1
Brick8: 10.70.35.50:/bricks/brick3/vol1
Brick9: 10.70.46.216:/bricks/brick3/vol1
Brick10: 10.70.35.50:/bricks/brick4/vol1
Brick11: 10.70.46.132:/bricks/brick4/vol1
Brick12: 10.70.46.216:/bricks/brick4/vol1
Options Reconfigured:
diagnostics.client-log-level: TRACE
cluster.shd-max-threads: 40
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
cluster.enable-shared-storage: enable
=======================================================================
# gluster v list
gluster_shared_storage
non-root
repl3
=======================================================================

Preliminary investigation suggests that a softlink is being subjected to data heal, which should not happen since a softlink contains no data.
The shd daemon is crashing when it attempts this data heal on the softlink.
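
If needed, the pending-heal state of such a softlink can be confirmed directly on a brick backend via the AFR changelog xattrs; a sketch, with a hypothetical file path (a real path would come from "gluster volume heal repl3 info"):

# Run against a brick backend path, never through a client mount:
getfattr -d -m . -e hex /bricks/brick1/vol1/links/soft1
# In each trusted.afr.repl3-client-N value, the first 4 bytes are the
# pending data-heal counter, which should stay zero for a symlink
# since a symlink carries no data.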

Rafi is looking into the system for RCA.
System details will be provided in the next comment.

Comment 11 Anees Patel 2019-06-26 07:14:55 UTC
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1716626#c8, this bug blocks the verification of BZ#1716626.

Comment 18 Yaniv Kaul 2019-12-12 18:13:21 UTC
I'm not sure why this is in POST state.
Where's the patch?

Comment 21 Red Hat Bugzilla 2023-09-14 05:30:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.

