Description of problem:
When I import a Gluster Storage Cluster with one or more volumes into Tendrl, some of the bricks are marked as offline after a while and glusterfsd seems to have crashed on that particular node. I wasn't able to reproduce it without importing the cluster into Tendrl, so there seems to be some connection.

How reproducible:
100%; it seems to reproduce more quickly with two or three volumes.

Steps to Reproduce:
1. Install and configure a Gluster cluster. In my case 6 storage nodes with at least 7 spare disks for bricks per node.
2. Create one or more volumes. In my case: volume_alpha_distrep_6x2[1] and optionally volume_beta_arbiter_2_plus_1x2[2] and volume_gama_disperse_4_plus_2x2[3]. (An example create command for the first volume is sketched at the end of this report.)
3. Install and configure RHGS WA (aka Tendrl).
4. Import the Gluster cluster into RHGS WA.
5. Watch the status of volumes and bricks for a while.

Actual results:
After a while (at most a few hours; with two or three volumes it seems to happen sooner), the volume switches to a degraded/partial state and one or more bricks are offline.

Expected results:
glusterfsd shouldn't crash and all the bricks should be online.

Additional info:

# systemctl status glusterd -l
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-04-25 06:10:39 EDT; 20h ago
 Main PID: 3967 (glusterd)
    Tasks: 16
   CGroup: /system.slice/glusterd.service
           ├─3967 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
           └─7138 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/3491b382ef0974aa824814231126d638.socket --xlator-option *replicate*.node-uuid=5c1daca2-bea9-45d8-909f-4970992e6cf8

Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glusterfs 3.12.2
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------

# journalctl -u 'gluster*' -l
-- Logs begin at Wed 2018-04-25 06:04:26 EDT, end at Thu 2018-04-26 02:51:44 EDT. --
Apr 25 06:10:37 gl1.example.com systemd[1]: Starting GlusterFS, a clustered file-system
Apr 25 06:10:39 gl1.example.com systemd[1]: Started GlusterFS, a clustered file-system
Apr 25 06:26:08 gl1.example.com systemd[1]: Started Gluster Events Notifier.
Apr 25 06:26:08 gl1.example.com systemd[1]: Starting Gluster Events Notifier...
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: pending frames:
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: frame : type(0) op(0
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: patchset: git://git.
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: signal received: 11
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: time of crash:
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: 2018-04-25 15:08:50
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: configuration detail
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: argp 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: backtrace 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glus
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------

# gluster volume status
Status of volume: volume_alpha_distrep_6x2
Gluster process                                        TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gl1.example.com:/mnt/brick_alpha_distrep_1/1     N/A       N/A        N       N/A
Brick gl2.example.com:/mnt/brick_alpha_distrep_1/1     N/A       N/A        N       N/A
Brick gl3.example.com:/mnt/brick_alpha_distrep_1/1     49152     0          Y       6813
Brick gl4.example.com:/mnt/brick_alpha_distrep_1/1     49152     0          Y       6810
Brick gl5.example.com:/mnt/brick_alpha_distrep_1/1     N/A       N/A        N       N/A
Brick gl6.example.com:/mnt/brick_alpha_distrep_1/1     49152     0          Y       6803
Brick gl1.example.com:/mnt/brick_alpha_distrep_2/2     N/A       N/A        N       N/A
Brick gl2.example.com:/mnt/brick_alpha_distrep_2/2     N/A       N/A        N       N/A
Brick gl3.example.com:/mnt/brick_alpha_distrep_2/2     49152     0          Y       6813
Brick gl4.example.com:/mnt/brick_alpha_distrep_2/2     49152     0          Y       6810
Brick gl5.example.com:/mnt/brick_alpha_distrep_2/2     N/A       N/A        N       N/A
Brick gl6.example.com:/mnt/brick_alpha_distrep_2/2     49152     0          Y       6803
Self-heal Daemon on localhost                          N/A       N/A        Y       7138
Self-heal Daemon on gl2.example.com                    N/A       N/A        Y       15133
Self-heal Daemon on gl5.example.com                    N/A       N/A        Y       15126
Self-heal Daemon on gl3.example.com                    N/A       N/A        Y       6845
Self-heal Daemon on gl4.example.com                    N/A       N/A        Y       6840
Self-heal Daemon on gl6.example.com                    N/A       N/A        Y       6839

Task Status of Volume volume_alpha_distrep_6x2
------------------------------------------------------------------------------
There are no active volume tasks

# gluster pool list
UUID                                    Hostname        State
eda0bb49-4e25-4a61-bfcb-35b6d06fdbce    gl2.example.com Connected
4e3b5b0b-78e6-464f-8cb6-f035ed25ab99    gl3.example.com Connected
137f7aae-54ac-47e0-a804-63df82428ca5    gl4.example.com Connected
11398aa2-8e93-4a5a-b3fe-fcf5afffa9e1    gl5.example.com Connected
240c053e-4344-4e7e-af45-05a16309daea    gl6.example.com Connected
5c1daca2-bea9-45d8-909f-4970992e6cf8    localhost       Connected

# ls -l /core.*
-rw-------. 1 root root 328843264 Apr 25 11:08 /core.7102
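For reference, the 6x2 distributed-replicate volume from step 2 can be created with a command along these lines. Hostnames and brick paths are taken from the status output above; the exact brick ordering (and therefore the replica pairing) used in my setup may differ, so treat this as an illustrative sketch rather than the exact command I ran:

# gluster volume create volume_alpha_distrep_6x2 replica 2 \
    gl1.example.com:/mnt/brick_alpha_distrep_1/1 gl2.example.com:/mnt/brick_alpha_distrep_1/1 \
    gl3.example.com:/mnt/brick_alpha_distrep_1/1 gl4.example.com:/mnt/brick_alpha_distrep_1/1 \
    gl5.example.com:/mnt/brick_alpha_distrep_1/1 gl6.example.com:/mnt/brick_alpha_distrep_1/1 \
    gl1.example.com:/mnt/brick_alpha_distrep_2/2 gl2.example.com:/mnt/brick_alpha_distrep_2/2 \
    gl3.example.com:/mnt/brick_alpha_distrep_2/2 gl4.example.com:/mnt/brick_alpha_distrep_2/2 \
    gl5.example.com:/mnt/brick_alpha_distrep_2/2 gl6.example.com:/mnt/brick_alpha_distrep_2/2
# gluster volume start volume_alpha_distrep_6x2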
REVIEW: https://review.gluster.org/19977 (Glusterfsd: brick crash during get-state) posted (#1) for review on master by hari gowtham
COMMIT: https://review.gluster.org/19977 committed in master by "Raghavendra G" <rgowdapp> with a commit message- Glusterfsd: brick crash during get-state

The xprt's dereferencing wasn't checked before using it for the strcmp, which caused the segfault and crashed the brick process.

fix: Check every dereferenced variable before using it.

Change-Id: I7f705d1c88a124e8219bb877156fadb17ecf11c3
fixes: bz#1575864
Signed-off-by: hari gowtham <hgowtham>
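In other words, the get-state path walked the per-connection transport (xprt) list and fed a field reached through each xprt straight into strcmp() without checking that the intermediate pointers were non-NULL, so a half-torn-down or not-yet-initialised connection produced SIGSEGV. The following is a minimal standalone sketch of that pattern and the kind of guard the commit message describes; the struct and field names (client, xprt, xl_private, bound_xl_name) are illustrative stand-ins, not the actual glusterfsd data structures or the actual patch.

#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the per-connection objects a brick keeps. */
struct client {
    const char *bound_xl_name;   /* name of the brick (xlator) the client is bound to */
};

struct xprt {
    struct client *xl_private;   /* may legitimately be NULL for a connection in setup/teardown */
    struct xprt *next;
};

/* Count connections bound to a given brick, the way a get-state style
 * handler might.  The NULL checks before strcmp() are the point of the fix:
 * dereferencing xl_private unconditionally is what would crash. */
static int count_clients_for_brick(struct xprt *head, const char *brick_name)
{
    int count = 0;

    for (struct xprt *x = head; x != NULL; x = x->next) {
        if (x->xl_private == NULL || x->xl_private->bound_xl_name == NULL)
            continue;            /* guard every dereference before using it */
        if (strcmp(x->xl_private->bound_xl_name, brick_name) == 0)
            count++;
    }
    return count;
}

int main(void)
{
    struct client c1 = { "brick_alpha_distrep_1" };
    struct xprt x2 = { NULL, NULL };    /* connection without private data set yet */
    struct xprt x1 = { &c1, &x2 };

    /* Without the guards above, iterating over x2 would dereference NULL here. */
    printf("%d\n", count_clients_for_brick(&x1, "brick_alpha_distrep_1"));
    return 0;
}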
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.0, please open a new bug report. glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution. [1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html [2] https://www.gluster.org/pipermail/gluster-users/