Bug 1811373 - glusterfsd crashes healing disperse volumes on arm
Summary: glusterfsd crashes healing disperse volumes on arm
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 7
Hardware: armv7l
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-08 04:09 UTC by Fox
Modified: 2020-03-17 03:09 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-12 12:22:37 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
Excerpts from several gluster logs (26.43 KB, text/plain)
2020-03-08 04:09 UTC, Fox

Description Fox 2020-03-08 04:09:47 UTC
Created attachment 1668387 [details]
Excerpts from several gluster logs

Description of problem:
The gluster brick process (glusterfsd) on an arm node that needs healing almost always crashes seconds after it starts and connects to the other cluster members. Tested under Ubuntu 18 with gluster v7 and v4 running on Odroid HC2 units, and under Raspbian with gluster v5 running on a Raspberry Pi 3.

Version-Release number of selected component (if applicable):
gluster 7.2, but the problem has also been reproduced on v4 and v5

How reproducible:
Reliably reproducible

Steps to Reproduce:
1. Create disperse volume on a cluster with 3 or more members/bricks and enable healing
2. Have a client mount volume and begin writing files to volume
3. Reboot a cluster member during client operations
4. Cluster member rejoins cluster and attempts to heal
5. glusterfsd (the brick process) on that member typically crashes seconds to minutes after startup; in rare cases it survives longer.
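The steps above can be sketched as shell commands. This is a hedged reconstruction using the volume name, brick paths, and host names from the volume info below; the mount point, file sizes, and which node gets rebooted are illustrative assumptions, so adjust them to your environment.

```shell
# On one cluster node: create and start the 8+4 disperse volume.
gluster volume create bigdisp disperse-data 8 redundancy 4 \
    gluster{1..12}:/exports/sda/brick1/bigdisp
gluster volume start bigdisp

# On a client: mount the volume and start a sustained write workload.
mount -t glusterfs gluster1:/bigdisp /mnt/bigdisp
dd if=/dev/urandom of=/mnt/bigdisp/testfile bs=1M count=1024 &

# While the write is running, reboot one cluster member (e.g. gluster3).
# After it rejoins and healing begins, watch the brick process: it
# typically crashes within seconds to minutes.
gluster volume status bigdisp
```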

Actual results:
gluster volume status
shows the affected brick online briefly, then offline after it crashes. The self-heal daemon shows as online. The brick is never able to heal and rejoin the cluster.

Expected results:
The brick should come online and sync up.

Additional info:
The same test run on x86 hardware does not exhibit the crash.

I am willing to make this testbed available to developers to help debug this issue. It is a 12-node system of Odroid HC2 units, each with a 4 TB drive attached.


Volume Name: bigdisp
Type: Disperse   
Volume ID: 56fa5de3-36d5-45ec-9789-88d8aae02275
Status: Started  
Snapshot Count: 0
Number of Bricks: 1 x (8 + 4) = 12
Transport-type: tcp
Bricks:
Brick1: gluster1:/exports/sda/brick1/bigdisp
Brick2: gluster2:/exports/sda/brick1/bigdisp
Brick3: gluster3:/exports/sda/brick1/bigdisp
Brick4: gluster4:/exports/sda/brick1/bigdisp
Brick5: gluster5:/exports/sda/brick1/bigdisp
Brick6: gluster6:/exports/sda/brick1/bigdisp
Brick7: gluster7:/exports/sda/brick1/bigdisp
Brick8: gluster8:/exports/sda/brick1/bigdisp
Brick9: gluster9:/exports/sda/brick1/bigdisp
Brick10: gluster10:/exports/sda/brick1/bigdisp
Brick11: gluster11:/exports/sda/brick1/bigdisp
Brick12: gluster12:/exports/sda/brick1/bigdisp
Options Reconfigured:
disperse.shd-max-threads: 4
client.event-threads: 8
cluster.disperse-self-heal-daemon: enable
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on



Status of volume: bigdisp
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster1:/exports/sda/brick1/bigdisp  49152     0          Y       4632
Brick gluster2:/exports/sda/brick1/bigdisp  49152     0          Y       3115
Brick gluster3:/exports/sda/brick1/bigdisp  N/A       N/A        N       N/A
Brick gluster4:/exports/sda/brick1/bigdisp  49152     0          Y       2728
Brick gluster5:/exports/sda/brick1/bigdisp  49152     0          Y       3072
Brick gluster6:/exports/sda/brick1/bigdisp  49152     0          Y       2549
Brick gluster7:/exports/sda/brick1/bigdisp  49152     0          Y       16848
Brick gluster8:/exports/sda/brick1/bigdisp  49152     0          Y       16740
Brick gluster9:/exports/sda/brick1/bigdisp  49152     0          Y       2619
Brick gluster10:/exports/sda/brick1/bigdisp 49152     0          Y       2677
Brick gluster11:/exports/sda/brick1/bigdisp 49152     0          Y       3023
Brick gluster12:/exports/sda/brick1/bigdisp 49153     0          Y       2440
Self-heal Daemon on localhost               N/A       N/A        Y       4653
Self-heal Daemon on gluster3                N/A       N/A        Y       7620
Self-heal Daemon on gluster10               N/A       N/A        Y       2698
Self-heal Daemon on gluster7                N/A       N/A        Y       16869
Self-heal Daemon on gluster8                N/A       N/A        Y       16761
Self-heal Daemon on gluster12               N/A       N/A        Y       2461
Self-heal Daemon on gluster9                N/A       N/A        Y       2640
Self-heal Daemon on gluster2                N/A       N/A        Y       3136
Self-heal Daemon on gluster5                N/A       N/A        Y       3093
Self-heal Daemon on gluster4                N/A       N/A        Y       2749
Self-heal Daemon on gluster6                N/A       N/A        Y       2570
Self-heal Daemon on gluster11               N/A       N/A        Y       3044

Task Status of Volume bigdisp
------------------------------------------------------------------------------
There are no active volume tasks

Comment 1 Xavi Hernandez 2020-03-09 07:08:19 UTC
Can you check whether this patch [1] fixes the issue?

[1] https://review.gluster.org/c/glusterfs/+/23912
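For anyone reproducing this, a minimal sketch of pulling the proposed change from Gerrit onto a source tree follows. The change number 23912 comes from the URL above; the patchset number <N> is not stated here, so check the review page and substitute it before running.

```shell
# Clone the source and apply the Gerrit change under test.
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
# Gerrit change refs use the last two digits of the change number:
# refs/changes/12/23912/<N>, where <N> is the patchset number.
git fetch https://review.gluster.org/glusterfs refs/changes/12/23912/<N>
git cherry-pick FETCH_HEAD

# Rebuild and reinstall on the affected arm nodes, then re-run the repro.
./autogen.sh && ./configure && make -j"$(nproc)" && sudo make install
```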

Comment 2 Worker Ant 2020-03-12 12:22:37 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/886 and will be tracked there from now on. Visit the GitHub issue URL for further details.

Comment 3 Fox 2020-03-15 20:38:23 UTC
The patch provided did correct the issue. Thank you.

Comment 4 Fox 2020-03-17 03:09:07 UTC
Closed. Info added.

