Bug 1575864 - glusterfsd crashing because of RHGS WA?
Summary: glusterfsd crashing because of RHGS WA?
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: hari gowtham
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-08 06:28 UTC by hari gowtham
Modified: 2018-10-23 15:07 UTC
CC: 10 users

Fixed In Version: glusterfs-5.0
Clone Of: 1572075
Environment:
Last Closed: 2018-10-23 15:07:59 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Comment 1 hari gowtham 2018-05-08 06:30:51 UTC
Description of problem:
  When I import a Gluster storage cluster with one or more volumes into
  Tendrl, some of the bricks are marked offline after a while, and
  glusterfsd appears to have crashed on the affected node.

  I wasn't able to reproduce this without importing the cluster into
  Tendrl, so the import appears to be related.

How reproducible:
  100%; it reproduces more quickly with two or three volumes.

Steps to Reproduce:
1. Install and configure a Gluster cluster.
  In my case: 6 storage nodes with at least 7 spare disks for bricks per node.
2. Create one or more volumes.
  In my case: volume_alpha_distrep_6x2[1], and possibly also
  volume_beta_arbiter_2_plus_1x2[2] and volume_gama_disperse_4_plus_2x2[3].
3. Install and configure RHGS WA (aka Tendrl).
4. Import Gluster cluster into RHGS WA.
5. Watch the status of volumes and bricks for a while.
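
  A possible way to reproduce without Tendrl, assuming the trigger is the
  "gluster get-state" command that Tendrl polls for monitoring (the fix in
  comment 3 implicates the get-state path); this is an untested sketch:

# while true; do gluster get-state glusterd odir /var/tmp file gstate.txt; sleep 5; done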

Actual results:
  After a while (at most a few hours; with two or three volumes it seems
  to be quicker), the volume switches to a degraded/partial state and one
  or more bricks go offline.

Expected results:
  glusterfsd shouldn't crash, and all bricks should remain online.

Additional info:
  
# systemctl status glusterd -l
  ● glusterd.service - GlusterFS, a clustered file-system server
     Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
     Active: active (running) since Wed 2018-04-25 06:10:39 EDT; 20h ago
   Main PID: 3967 (glusterd)
      Tasks: 16
     CGroup: /system.slice/glusterd.service
             ├─3967 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
             └─7138 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/3491b382ef0974aa824814231126d638.socket --xlator-option *replicate*.node-uuid=5c1daca2-bea9-45d8-909f-4970992e6cf8

  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glusterfs 3.12.2
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------

# journalctl -u 'gluster*' -l
  -- Logs begin at Wed 2018-04-25 06:04:26 EDT, end at Thu 2018-04-26 02:51:44 EDT. --
  Apr 25 06:10:37 gl1.example.com systemd[1]: Starting GlusterFS, a clustered file-system
  Apr 25 06:10:39 gl1.example.com systemd[1]: Started GlusterFS, a clustered file-system 
  Apr 25 06:26:08 gl1.example.com systemd[1]: Started Gluster Events Notifier.
  Apr 25 06:26:08 gl1.example.com systemd[1]: Starting Gluster Events Notifier...
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: pending frames:
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: frame : type(0) op(0
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: patchset: git://git.
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: signal received: 11
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: time of crash:
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: 2018-04-25 15:08:50
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: configuration detail
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: argp 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: backtrace 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glus
  Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------

# gluster volume status 
  Status of volume: volume_alpha_distrep_6x2
  Gluster process                             TCP Port  RDMA Port  Online  Pid
  ------------------------------------------------------------------------------
  Brick gl1.example.com:/mnt/brick_alpha_distrep_1/1       N/A       N/A        N       N/A  
  Brick gl2.example.com:/mnt/brick_alpha_distrep_1/1       N/A       N/A        N       N/A  
  Brick gl3.example.com:/mnt/brick_alpha_distrep_1/1       49152     0          Y       6813 
  Brick gl4.example.com:/mnt/brick_alpha_distrep_1/1       49152     0          Y       6810 
  Brick gl5.example.com:/mnt/brick_alpha_distrep_1/1       N/A       N/A        N       N/A  
  Brick gl6.example.com:/mnt/brick_alpha_distrep_1/1       49152     0          Y       6803 
  Brick gl1.example.com:/mnt/brick_alpha_distrep_2/2       N/A       N/A        N       N/A  
  Brick gl2.example.com:/mnt/brick_alpha_distrep_2/2       N/A       N/A        N       N/A  
  Brick gl3.example.com:/mnt/brick_alpha_distrep_2/2       49152     0          Y       6813 
  Brick gl4.example.com:/mnt/brick_alpha_distrep_2/2       49152     0          Y       6810 
  Brick gl5.example.com:/mnt/brick_alpha_distrep_2/2       N/A       N/A        N       N/A  
  Brick gl6.example.com:/mnt/brick_alpha_distrep_2/2       49152     0          Y       6803 
  Self-heal Daemon on localhost               N/A       N/A        Y       7138 
  Self-heal Daemon on gl2.example.com         N/A       N/A        Y       15133
  Self-heal Daemon on gl5.example.com         N/A       N/A        Y       15126
  Self-heal Daemon on gl3.example.com         N/A       N/A        Y       6845 
  Self-heal Daemon on gl4.example.com         N/A       N/A        Y       6840 
  Self-heal Daemon on gl6.example.com         N/A       N/A        Y       6839 
   
  Task Status of Volume volume_alpha_distrep_6x2
  ------------------------------------------------------------------------------
  There are no active volume tasks
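
  The offline bricks can usually be restarted without disturbing the
  healthy ones by a forced volume start; this is a standard recovery
  step, not a fix for the underlying crash:

# gluster volume start volume_alpha_distrep_6x2 force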
   
# gluster pool list
  UUID                                  Hostname        State
  eda0bb49-4e25-4a61-bfcb-35b6d06fdbce	gl2.example.com	Connected 
  4e3b5b0b-78e6-464f-8cb6-f035ed25ab99	gl3.example.com	Connected 
  137f7aae-54ac-47e0-a804-63df82428ca5	gl4.example.com	Connected 
  11398aa2-8e93-4a5a-b3fe-fcf5afffa9e1	gl5.example.com	Connected 
  240c053e-4344-4e7e-af45-05a16309daea	gl6.example.com	Connected 
  5c1daca2-bea9-45d8-909f-4970992e6cf8	localhost      	Connected 

# ls -l /core.*
  -rw-------. 1 root root 328843264 Apr 25 11:08 /core.7102
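
  A backtrace can be pulled from the core file with gdb to confirm the
  crash site (the binary path below is the usual default and may differ
  per distribution):

# gdb -batch -ex 'bt' /usr/sbin/glusterfsd /core.7102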

Comment 2 Worker Ant 2018-05-09 11:36:33 UTC
REVIEW: https://review.gluster.org/19977 (Glusterfsd: brick crash during get-state) posted (#1) for review on master by hari gowtham

Comment 3 Worker Ant 2018-05-11 10:55:24 UTC
COMMIT: https://review.gluster.org/19977 committed in master by "Raghavendra G" <rgowdapp> with a commit message- Glusterfsd: brick crash during get-state

The xprt pointer was dereferenced for the strcmp without being checked
first, which caused the segfault and crashed the brick process.

Fix: check every dereferenced variable before using it.

Change-Id: I7f705d1c88a124e8219bb877156fadb17ecf11c3
fixes: bz#1575864
Signed-off-by: hari gowtham <hgowtham>
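
For illustration, a minimal self-contained C sketch of the defensive
pattern the commit describes; the struct and field names are simplified
stand-ins for glusterd's transport types, not the actual patch:

  #include <stdio.h>
  #include <string.h>

  struct peerinfo { const char *identifier; };
  struct xprt     { struct peerinfo *peerinfo; };

  /* Check every pointer on the way to strcmp(); the crash happened
   * because a field reached through xprt was used unchecked. */
  static int xprt_matches_brick(const struct xprt *x, const char *brick)
  {
      if (!x || !x->peerinfo || !x->peerinfo->identifier || !brick)
          return 0;
      return strcmp(x->peerinfo->identifier, brick) == 0;
  }

  int main(void)
  {
      struct xprt half_initialized = { NULL };  /* peerinfo not set yet */
      /* Prints 0 instead of dying with SIGSEGV (signal 11). */
      printf("%d\n", xprt_matches_brick(&half_initialized, "/mnt/brick"));
      return 0;
  }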

Comment 4 Shyamsundar 2018-10-23 15:07:59 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report.

glusterfs-5.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/

