Description of problem:
When I import a Gluster Storage cluster with one or more volumes into
Tendrl, some of the bricks are marked as offline after a while and
glusterfsd appears to have crashed on the affected node.
I wasn't able to reproduce this without importing the cluster into
Tendrl, so there seems to be a connection.
How reproducible:
100%; it seems to reproduce more quickly with two or three volumes.
Steps to Reproduce:
1. Install and configure a Gluster cluster.
In my case: 6 storage nodes and at least 7 spare disks for bricks per node.
2. Create one or more volumes.
In my case: volume_alpha_distrep_6x2[1], and optionally also
volume_beta_arbiter_2_plus_1x2[2] and volume_gama_disperse_4_plus_2x2[3].
3. Install and configure RHGS WA (aka Tendrl).
4. Import Gluster cluster into RHGS WA.
5. Watch the status of volumes and bricks for a while.
Actual results:
After a while (at most a few hours; with two or three volumes it seems
to happen sooner), the volume switches to a degraded/partial state and
one or more bricks go offline.
Expected results:
glusterfsd shouldn't crash, and all bricks should remain online.
Additional info:
# systemctl status glusterd -l
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2018-04-25 06:10:39 EDT; 20h ago
Main PID: 3967 (glusterd)
Tasks: 16
CGroup: /system.slice/glusterd.service
├─3967 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
└─7138 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/3491b382ef0974aa824814231126d638.socket --xlator-option *replicate*.node-uuid=5c1daca2-bea9-45d8-909f-4970992e6cf8
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glusterfs 3.12.2
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------
# journalctl -u 'gluster*' -l
-- Logs begin at Wed 2018-04-25 06:04:26 EDT, end at Thu 2018-04-26 02:51:44 EDT. --
Apr 25 06:10:37 gl1.example.com systemd[1]: Starting GlusterFS, a clustered file-system
Apr 25 06:10:39 gl1.example.com systemd[1]: Started GlusterFS, a clustered file-system
Apr 25 06:26:08 gl1.example.com systemd[1]: Started Gluster Events Notifier.
Apr 25 06:26:08 gl1.example.com systemd[1]: Starting Gluster Events Notifier...
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: pending frames:
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: frame : type(0) op(0
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: patchset: git://git.
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: signal received: 11
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: time of crash:
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: 2018-04-25 15:08:50
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: configuration detail
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: argp 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: backtrace 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: dlfcn 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: libpthread 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: llistxattr 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: setfsid 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: spinlock 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: epoll.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: xattr.h 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: st_atim.tv_nsec 1
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: package-string: glus
Apr 25 11:08:50 gl1.example.com mnt-brick_alpha_distrep_1-1[7102]: ---------
# gluster volume status
Status of volume: volume_alpha_distrep_6x2
Gluster process                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gl1.example.com:/mnt/brick_alpha_distrep_1/1  N/A       N/A        N       N/A
Brick gl2.example.com:/mnt/brick_alpha_distrep_1/1  N/A       N/A        N       N/A
Brick gl3.example.com:/mnt/brick_alpha_distrep_1/1  49152     0          Y       6813
Brick gl4.example.com:/mnt/brick_alpha_distrep_1/1  49152     0          Y       6810
Brick gl5.example.com:/mnt/brick_alpha_distrep_1/1  N/A       N/A        N       N/A
Brick gl6.example.com:/mnt/brick_alpha_distrep_1/1  49152     0          Y       6803
Brick gl1.example.com:/mnt/brick_alpha_distrep_2/2  N/A       N/A        N       N/A
Brick gl2.example.com:/mnt/brick_alpha_distrep_2/2  N/A       N/A        N       N/A
Brick gl3.example.com:/mnt/brick_alpha_distrep_2/2  49152     0          Y       6813
Brick gl4.example.com:/mnt/brick_alpha_distrep_2/2  49152     0          Y       6810
Brick gl5.example.com:/mnt/brick_alpha_distrep_2/2  N/A       N/A        N       N/A
Brick gl6.example.com:/mnt/brick_alpha_distrep_2/2  49152     0          Y       6803
Self-heal Daemon on localhost                       N/A       N/A        Y       7138
Self-heal Daemon on gl2.example.com                 N/A       N/A        Y       15133
Self-heal Daemon on gl5.example.com                 N/A       N/A        Y       15126
Self-heal Daemon on gl3.example.com                 N/A       N/A        Y       6845
Self-heal Daemon on gl4.example.com                 N/A       N/A        Y       6840
Self-heal Daemon on gl6.example.com                 N/A       N/A        Y       6839
Task Status of Volume volume_alpha_distrep_6x2
------------------------------------------------------------------------------
There are no active volume tasks
# gluster pool list
UUID                                    Hostname         State
eda0bb49-4e25-4a61-bfcb-35b6d06fdbce    gl2.example.com  Connected
4e3b5b0b-78e6-464f-8cb6-f035ed25ab99    gl3.example.com  Connected
137f7aae-54ac-47e0-a804-63df82428ca5    gl4.example.com  Connected
11398aa2-8e93-4a5a-b3fe-fcf5afffa9e1    gl5.example.com  Connected
240c053e-4344-4e7e-af45-05a16309daea    gl6.example.com  Connected
5c1daca2-bea9-45d8-909f-4970992e6cf8    localhost        Connected
# ls -l /core.*
-rw-------. 1 root root 328843264 Apr 25 11:08 /core.7102
COMMIT: https://review.gluster.org/19977 committed in master by "Raghavendra G" <rgowdapp> with a commit message- Glusterfsd: brick crash during get-state
The xprt was dereferenced and passed to strcmp without first being
checked, which caused the segfault and crashed the brick process.
fix: Check every dereferenced variable before using it.
Change-Id: I7f705d1c88a124e8219bb877156fadb17ecf11c3
fixes: bz#1575864
Signed-off-by: hari gowtham <hgowtham>
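To make the crash and the fix concrete, below is a minimal, self-contained C sketch of the pattern the commit message describes: a NULL member of a connection (xprt) object is fed to strcmp, and the fix checks the pointer chain before using it. All names here (client_xprt, peer_info, count_clients_safe) are hypothetical stand-ins for illustration, not the actual glusterfsd code.

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-connection (xprt) data a brick
 * walks while answering a status/get-state request. */
struct peer_info {
    char volname[64];
};

struct client_xprt {
    struct peer_info *peer;  /* may be NULL for a half-initialised connection */
};

/* Safe variant, following the commit's advice: check every dereferenced
 * pointer before using it.  The buggy code effectively did
 * strcmp(xprt->peer->volname, ...) directly, which is a NULL dereference
 * (SIGSEGV, signal 11) whenever peer is NULL. */
static int count_clients_safe(struct client_xprt **xprts, int n, const char *volname)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        struct client_xprt *xprt = xprts[i];
        if (!xprt || !xprt->peer)
            continue;  /* skip connections that are not fully set up */
        if (strcmp(xprt->peer->volname, volname) == 0)
            count++;
    }
    return count;
}

int main(void)
{
    struct peer_info p = { .volname = "volume_alpha_distrep_6x2" };
    struct client_xprt a = { .peer = &p };
    struct client_xprt b = { .peer = NULL };  /* the case that used to crash */
    struct client_xprt *xprts[] = { &a, &b };

    printf("%d\n", count_clients_safe(xprts, 2, "volume_alpha_distrep_6x2"));
    return 0;
}

With the check in place, a half-initialised connection is simply skipped instead of taking the whole brick process down.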
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-5.0, please open a new bug report.
glusterfs-5.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the gluster-users mailing list [2] and the update infrastructure for your distribution.
[1] https://lists.gluster.org/pipermail/announce/2018-October/000115.html
[2] https://www.gluster.org/pipermail/gluster-users/