Bug 1811373
Summary: | glusterfsd crashes healing disperse volumes on arm | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Fox <foxxz.net>
Component: | core | Assignee: | bugs <bugs>
Status: | CLOSED UPSTREAM | QA Contact: |
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 7 | CC: | bugs, jahernan, pasik, srakonde
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | armv7l | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-03-12 12:22:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | Excerpts from several gluster logs (attachment 1668387) | |
Can you check if this patch [1] fixes the issue?

[1] https://review.gluster.org/c/glusterfs/+/23912

This bug is moved to https://github.com/gluster/glusterfs/issues/886, and will be tracked there from now on. Visit the GitHub issue URL for further details.

The patch provided did correct the issue. Thank you.

Closed. Info added.
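For anyone wanting to verify the referenced change locally, below is a hedged sketch of one way to fetch and build Gerrit change 23912 from source. The patchset suffix (/1) is a placeholder, and the autotools build steps are a general assumption rather than something stated in this report.

```sh
# Fetch Gerrit change 23912 (the patchset suffix /1 is a placeholder; use the
# latest patchset shown on https://review.gluster.org/c/glusterfs/+/23912).
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git fetch https://review.gluster.org/glusterfs refs/changes/12/23912/1
git checkout FETCH_HEAD

# Generic source build (assumed; adapt to your distribution's packaging workflow).
./autogen.sh
./configure
make -j"$(nproc)"
sudo make install
```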
Created attachment 1668387 [details]
Excerpts from several gluster logs

Description of problem:
The gluster brick process on an ARM node that needs healing crashes (almost always) seconds after it starts and connects to the other cluster members.

Tested under Ubuntu 18 with gluster v7 and v4 running on ODROID HC2 units, and under Raspbian with gluster v5 running on a Raspberry Pi 3.

Version-Release number of selected component (if applicable):
gluster 7.2, but the problem has also been reproduced on versions 4 and 5.

How reproducible:
Reliably reproducible.

Steps to Reproduce:
1. Create a disperse volume on a cluster with 3 or more members/bricks and enable healing.
2. Have a client mount the volume and begin writing files to it.
3. Reboot a cluster member during client operations.
4. The cluster member rejoins the cluster and attempts to heal.
5. glusterfsd (the brick process) on that member typically crashes seconds to minutes after startup; in rare cases it takes longer.

(A reproduction sketch in shell commands appears at the end of this report.)

Actual results:
gluster volume status shows the affected brick online briefly and then offline after it crashes. The self-heal daemon shows as online. The brick is never able to heal and rejoin the cluster.

Expected results:
The brick should come online and sync up.

Additional info:
The same test on x86 hardware does not exhibit the crash. I am willing to make this testbed available to developers to help debug this issue. It is a 12-node system comprised of ODROID HC2 units with a 4 TB drive attached to each unit.

Volume Name: bigdisp
Type: Disperse
Volume ID: 56fa5de3-36d5-45ec-9789-88d8aae02275
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (8 + 4) = 12
Transport-type: tcp
Bricks:
Brick1: gluster1:/exports/sda/brick1/bigdisp
Brick2: gluster2:/exports/sda/brick1/bigdisp
Brick3: gluster3:/exports/sda/brick1/bigdisp
Brick4: gluster4:/exports/sda/brick1/bigdisp
Brick5: gluster5:/exports/sda/brick1/bigdisp
Brick6: gluster6:/exports/sda/brick1/bigdisp
Brick7: gluster7:/exports/sda/brick1/bigdisp
Brick8: gluster8:/exports/sda/brick1/bigdisp
Brick9: gluster9:/exports/sda/brick1/bigdisp
Brick10: gluster10:/exports/sda/brick1/bigdisp
Brick11: gluster11:/exports/sda/brick1/bigdisp
Brick12: gluster12:/exports/sda/brick1/bigdisp
Options Reconfigured:
disperse.shd-max-threads: 4
client.event-threads: 8
cluster.disperse-self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on

Status of volume: bigdisp
Gluster process                              TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster1:/exports/sda/brick1/bigdisp   49152     0          Y       4632
Brick gluster2:/exports/sda/brick1/bigdisp   49152     0          Y       3115
Brick gluster3:/exports/sda/brick1/bigdisp   N/A       N/A        N       N/A
Brick gluster4:/exports/sda/brick1/bigdisp   49152     0          Y       2728
Brick gluster5:/exports/sda/brick1/bigdisp   49152     0          Y       3072
Brick gluster6:/exports/sda/brick1/bigdisp   49152     0          Y       2549
Brick gluster7:/exports/sda/brick1/bigdisp   49152     0          Y       16848
Brick gluster8:/exports/sda/brick1/bigdisp   49152     0          Y       16740
Brick gluster9:/exports/sda/brick1/bigdisp   49152     0          Y       2619
Brick gluster10:/exports/sda/brick1/bigdisp  49152     0          Y       2677
Brick gluster11:/exports/sda/brick1/bigdisp  49152     0          Y       3023
Brick gluster12:/exports/sda/brick1/bigdisp  49153     0          Y       2440
Self-heal Daemon on localhost                N/A       N/A        Y       4653
Self-heal Daemon on gluster3                 N/A       N/A        Y       7620
Self-heal Daemon on gluster10                N/A       N/A        Y       2698
Self-heal Daemon on gluster7                 N/A       N/A        Y       16869
Self-heal Daemon on gluster8                 N/A       N/A        Y       16761
Self-heal Daemon on gluster12                N/A       N/A        Y       2461
Self-heal Daemon on gluster9                 N/A       N/A        Y       2640
Self-heal Daemon on gluster2                 N/A       N/A        Y       3136
Self-heal Daemon on gluster5                 N/A       N/A        Y       3093
Self-heal Daemon on gluster4                 N/A       N/A        Y       2749
Self-heal Daemon on gluster6                 N/A       N/A        Y       2570
Self-heal Daemon on gluster11                N/A       N/A        Y       3044

Task Status of Volume bigdisp
------------------------------------------------------------------------------
There are no active volume tasks
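For convenience, here is a minimal reproduction sketch in shell commands pieced together from the steps and volume configuration above. The brick layout and volume options are taken from the report; the client mount point (/mnt/bigdisp), the dd write loop, and the choice of gluster3 as the rebooted node are illustrative assumptions, not part of the original report.

```sh
# Reproduction sketch. Assumptions: hostnames and brick paths as in the volume
# info above; /mnt/bigdisp, the dd loop, and rebooting gluster3 are illustrative.

# 1. Create and start the disperse volume (8 data + 4 redundancy bricks = 12)
gluster volume create bigdisp disperse 12 redundancy 4 \
    gluster{1..12}:/exports/sda/brick1/bigdisp
gluster volume set bigdisp cluster.disperse-self-heal-daemon enable
gluster volume set bigdisp disperse.shd-max-threads 4
gluster volume start bigdisp

# 2. On a client, mount the volume and keep writing files
mount -t glusterfs gluster1:/bigdisp /mnt/bigdisp
for i in $(seq 1 1000); do
    dd if=/dev/urandom of=/mnt/bigdisp/file"$i" bs=1M count=100
done

# 3. While the client is still writing, reboot one cluster member
ssh gluster3 reboot

# 4./5. After the node rejoins and healing starts, watch the brick state;
#       the affected brick goes offline again when its glusterfsd crashes
gluster volume status bigdisp
gluster volume heal bigdisp info
```

The disperse 12 redundancy 4 layout matches the 1 x (8 + 4) = 12 brick count shown in the volume info; the remaining options from the report can be applied the same way with gluster volume set.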