Bug 1809326 - Volumes keeps going offline with messages -- posix-helpers.c:2150:posix_health_check_thread_proc [NEEDINFO]
Summary: Volumes keeps going offline with messages -- posix-helpers.c:2150:posix_healt...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 6
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
Assignee: Sanju
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-02 20:53 UTC by Amgad
Modified: 2020-03-16 04:31 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-13 17:14:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
srakonde: needinfo? (amgad.saleh)
srakonde: needinfo? (amgad.saleh)
ksubrahm: needinfo? (amgad.saleh)


Attachments (Terms of Use)
glusterd on node experiencing volume offline (8.41 MB, text/plain)
2020-03-05 16:05 UTC, Amgad
no flags Details
volume log that experiencing volume offline (743.65 KB, text/plain)
2020-03-05 16:07 UTC, Amgad
no flags Details
gluserd.log when restarted daeemon (12.50 MB, text/plain)
2020-03-05 19:31 UTC, Amgad
no flags Details
glusterd.vol on the troubled node (643 bytes, text/plain)
2020-03-05 19:32 UTC, Amgad
no flags Details

Description Amgad 2020-03-02 20:53:22 UTC
Description of problem:

We have a 3-node system - replica 3 and also a kubernetes storage class.
The issue with the 3-node volumes. Two of them are getting offline all the time. Once you stop/start, they come online and few seconds later they go offline.

The following Broadcast messages came out:

Broadcast message from systemd-journald@telco-control-02 (Mon 2020-03-02 19:02:02 UTC):

data0-glusterfs-helm-home[30981]: [2020-03-02 19:02:02.357179] M [MSGID: 113075] [posix-helpers.c:2168:posix_health_check_thread_proc] 0-bcmt-helm-home-posix: still alive! -> SIGTERM

Message from syslogd@telco-control-02 at Mar  2 19:02:02 ...
 data0-glusterfs-helm-home[30981]:[2020-03-02 19:02:02.356492] M [MSGID: 113075] [posix-helpers.c:2150:posix_health_check_thread_proc] 0-bcmt-helm-home-posix: health-check failed, going down

Message from syslogd@telco-control-02 at Mar  2 19:02:02 ...
 data0-glusterfs-helm-home[30981]:[2020-03-02 19:02:02.357179] M [MSGID: 113075] [posix-helpers.c:2168:posix_health_check_thread_proc] 0-bcmt-helm-home-posix: still alive! -> SIGTERM

Broadcast message from systemd-journald@telco-control-02 (Mon 2020-03-02 19:02:02 UTC):

data0-glusterfs-cbur-repo[31014]: [2020-03-02 19:02:02.891885] M [MSGID: 113075] [posix-helpers.c:2150:posix_health_check_thread_proc] 0-cbur-glusterfs-repo-posix: health-check failed, going down


Broadcast message from systemd-journald@telco-control-02 (Mon 2020-03-02 19:02:02 UTC):

data0-glusterfs-cbur-repo[31014]: [2020-03-02 19:02:02.892088] M [MSGID: 113075] [posix-helpers.c:2168:posix_health_check_thread_proc] 0-cbur-glusterfs-repo-posix: still alive! -> SIGTERM


Message from syslogd@telco-control-02 at Mar  2 19:02:02 ...
 data0-glusterfs-cbur-repo[31014]:[2020-03-02 19:02:02.891885] M [MSGID: 113075] [posix-helpers.c:2150:posix_health_check_thread_proc] 0-cbur-glusterfs-repo-posix: health-check failed, going down

Message from syslogd@telco-control-02 at Mar  2 19:02:02 ...
 data0-glusterfs-cbur-repo[31014]:[2020-03-02 19:02:02.892088] M [MSGID: 113075] [posix-helpers.c:2168:posix_health_check_thread_proc] 0-cbur-glusterfs-repo-posix: still alive! -> SIGTERM




Version-Release number of selected component (if applicable):

GlusterFS is 6.5 running on CentOS EL7 and Kernel 4.18.0-80.11.2.el8_0.x86_64
Is there any issue with GlusterFS EL7 RPMs running on CentOS EL7 and Kernel 4.18.0-80.11.2.el8_0.x86_64?

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sanju 2020-03-04 10:32:16 UTC
Not sure about the issue with GlusterFS EL7 RPMs running on CentOS EL7 and Kernel 4.18.0-80.11.2.el8_0.x86_64. What are the logs saying?

Thanks,
Sanju

Comment 2 Amgad 2020-03-05 16:05:52 UTC
Created attachment 1667817 [details]
glusterd on node experiencing volume offline

Comment 3 Amgad 2020-03-05 16:07:21 UTC
Created attachment 1667818 [details]
volume log that experiencing volume offline

Comment 4 Amgad 2020-03-05 16:09:11 UTC
Attaching volume log and glusterd logs on the node experiencing volume offline -- there're few of manual stop/start to bring it on line, but it wen down few second right after

Comment 5 Amgad 2020-03-05 17:13:26 UTC
Also on the same node -- glusterd fails to come up with the following messages
[2020-03-05 17:11:28.540630] E [socket.c:923:__socket_server_bind] 0-socket.management: binding to  failed: Address already in use
[2020-03-05 17:11:28.540656] E [socket.c:925:__socket_server_bind] 0-socket.management: Port is already in use

Any suggestion on that?

Comment 6 Amgad 2020-03-05 19:29:36 UTC
I tried to restart glusterd now on that node, but it doesn't start. with messages -- I'm also attaching glusterd.log and glusterd.vol files as well

[root@telco-control-02 glusterfs]# systemctl start glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@telco-control-02 glusterfs]# systemctl status glusterd.service
â glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 19:23:58 UTC; 6s ago
  Control: 31068 (glusterd)
    Tasks: 190
   Memory: 493.9M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ââ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ââ19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glus...
           ââ19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm...
           ââ19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09...
           ââ24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/gl...
           ââ31068 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ââ31071 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...

Mar 05 19:23:58 telco-control-02 systemd[1]: Stopped GlusterFS, a clustered ....
Mar 05 19:23:58 telco-control-02 systemd[1]: Starting GlusterFS, a clustered....
Hint: Some lines were ellipsized, use -l to show in full.
[root@telco-control-02 glusterfs]# systemctl status -l glusterd.service
â glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 19:24:24 UTC; 4ms ago
  Control: 31746 (glusterd)
    Tasks: 183
   Memory: 491.9M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ââ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ERROR
           ââ19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glusterfs.172.16.1.17.data0-glusterfs -p /var/run/gluster/vols/bcmt-glusterfs/172.16.1.17-data0-glusterfs.pid -S /var/run/gluster/1ee29a143ced3772.socket --brick-name /data0/glusterfs -l /var/log/glusterfs/bricks/data0-glusterfs.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49152 --xlator-option bcmt-glusterfs-server.listen-port=49152 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm-home.172.16.1.17.data0-glusterfs-helm-home -p /var/run/gluster/vols/bcmt-helm-home/172.16.1.17-data0-glusterfs-helm-home.pid -S /var/run/gluster/32e46f40f6326f49.socket --brick-name /data0/glusterfs-helm-home -l /var/log/glusterfs/bricks/data0-glusterfs-helm-home.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49153 --xlator-option bcmt-helm-home-server.listen-port=49153 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-backup.172.16.1.17.data0-glusterfs-cbur-backup -p /var/run/gluster/vols/cbur-glusterfs-backup/172.16.1.17-data0-glusterfs-cbur-backup.pid -S /var/run/gluster/9faf700427a7b186.socket --brick-name /data0/glusterfs-cbur-backup -l /var/log/glusterfs/bricks/data0-glusterfs-cbur-backup.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49154 --xlator-option cbur-glusterfs-backup-server.listen-port=49154 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-backup-cluster.172.16.1.17.data0-glusterfs-backup-cluster -p /var/run/gluster/vols/cbur-glusterfs-backup-cluster/172.16.1.17-data0-glusterfs-backup-cluster.pid -S /var/run/gluster/f9bad66bcadcc1cf.socket --brick-name /data0/glusterfs-backup-cluster -l /var/log/glusterfs/bricks/data0-glusterfs-backup-cluster.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49155 --xlator-option cbur-glusterfs-backup-cluster-server.listen-port=49155 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-repo.172.16.1.17.data0-glusterfs-cbur-repo -p /var/run/gluster/vols/cbur-glusterfs-repo/172.16.1.17-data0-glusterfs-cbur-repo.pid -S /var/run/gluster/0fd1d96c1f9e0319.socket --brick-name /data0/glusterfs-cbur-repo -l /var/log/glusterfs/bricks/data0-glusterfs-cbur-repo.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49156 --xlator-option cbur-glusterfs-repo-server.listen-port=49156 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09b2af8fdf1c596fe8cef4d03f966.172.16.1.17.var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick -p /var/run/gluster/vols/vol_04c09b2af8fdf1c596fe8cef4d03f966/172.16.1.17-var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick.pid -S /var/run/gluster/117adbbe3cbaa25c.socket --brick-name /var/lib/heketi/mounts/vg_260e40bf44e9568fe20ed05545208104/brick_2a3320145759ee25cd0a814959ec98e0/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49157 --xlator-option vol_04c09b2af8fdf1c596fe8cef4d03f966-server.listen-port=49157 --xlator-option transport.socket.bind-address=172.16.1.17
           ââ24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/67898f7af0cbf69c.socket --xlator-option *replicate*.node-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name glustershd --client-pid=-6
           ââ31746 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ERROR

Mar 05 19:24:24 telco-control-02 systemd[1]: Starting GlusterFS, a clustered file-system server...

Comment 7 Amgad 2020-03-05 19:31:09 UTC
Created attachment 1667890 [details]
gluserd.log when restarted daeemon

Comment 8 Amgad 2020-03-05 19:32:06 UTC
Created attachment 1667891 [details]
glusterd.vol on the troubled node

Comment 9 Amgad 2020-03-05 20:08:21 UTC
while glusterd is not up and running

[root@telco-control-02 glusterfs]# systemctl status glusterd
â glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 20:05:04 UTC; 4s ago
  Control: 25949 (glusterd)
    Tasks: 190
   Memory: 494.7M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ââ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ââ19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glus...
           ââ19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm...
           ââ19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ââ20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09...
           ââ24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/gl...
           ââ25949 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ââ25950 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...

Mar 05 20:05:04 telco-control-02 systemd[1]: Starting GlusterFS, a clustered....
Hint: Some lines were ellipsized, use -l to show in full.
[root@telco-control-02 glusterfs]# 

the port 24007 is showing owned by glusterd - (old process)
[root@telco-control-02 glusterfs]# netstat -nltp | grep 24007
tcp        0      0 172.16.1.17:24007       0.0.0.0:*               LISTEN      3873/glusterd

Comment 10 Amgad 2020-03-05 20:12:55 UTC
when I killed the process -- kill -9 3873 == I was able to start glusterd. 
The question is why that process was left there holding to the port?  I believe this is the main issue!!!!!!!!!!!!!!!!!!

Comment 11 Sanju 2020-03-06 05:57:52 UTC
glusterd is coded in such a way that it always runs in port 24007. So, whenever some other process is using this port it can't start. This behaviour is expected.

What is the current status of your cluster? Are the volumes still going offline?

Thanks,
Sanju

Comment 12 Amgad 2020-03-06 14:39:46 UTC
Thanks Sanju

I'm aware of port 24007. The issue is the process that was holding the port is a hung glusterd process. and so systemctl restart glusterd would fail all the time with the error.
That was there for days (See my before last comment). That's the issue

[root@telco-control-02 glusterfs]# netstat -nltp | grep 24007
tcp        0      0 172.16.1.17:24007       0.0.0.0:*               LISTEN      3873/glusterd

** A separate question - how many glustershd processes per host, is it configurable. Can you force it to be 1. And what's the criteria to have multiple?

Comment 13 Amgad 2020-03-09 04:32:06 UTC
Sanju?

Comment 14 Sanju 2020-03-09 06:12:26 UTC
(In reply to Amgad from comment #12)
> ** A separate question - how many glustershd processes per host, is it
> configurable. Can you force it to be 1. And what's the criteria to have
> multiple?

glustershd is a per node process till release-6.

With shd-multiplexing feature, which is introduced in release-7, it is a per volume process. Still, you will have only one shd process running in one node as the shd instances will be attached to the same process.

As of release-6, number of shd processes running in a node is not configurable (not sure about release-7).

Thanks,
Sanju

Comment 15 Amgad 2020-03-09 13:44:52 UTC
Thanks Sanju

Now if you see multiple glustershd processes per node in release 6.x, what does that mean? running away process?

Comment 16 Amgad 2020-03-09 13:45:10 UTC
Thanks Sanju

Now if you see multiple glustershd processes per node in release 6.x, what does that mean? running away process?

Comment 17 Sanju 2020-03-11 07:58:44 UTC
Not sure how we can have multiple shd's running on the same node. Please share the output of the following for the further analysis:
1. ps -ax | grep glustershd
2. gluster v status
3. glustershd.log

@karthik, please add your thoughts on this.

Thanks,
Sanju

Comment 18 Karthik U S 2020-03-11 08:34:42 UTC
Hi Amgad,

Unless you have spawned a new shd process, there will be only one shd running per node. Please give the information requested in comment #17.

Regards,
Karthik

Comment 19 Amgad 2020-03-12 16:48:43 UTC
Thanks Karthik

This was reported by one of our application, that there was transient ones and they were monitoring the number: when it became > 1, they'd kill the daemon and restart glusterd
So it doesn't happen every time. So I'll reach out to application to provide the data when it happens again and not taking any action

Comment 20 Amgad 2020-03-12 16:59:07 UTC
Thanks Karthik

This was reported by one of our application, that there was transient ones and they were monitoring the number: when it became > 1, they'd kill the daemon and restart glusterd
So it doesn't happen every time. So I'll reach out to application to provide the data when it happens again and not taking any action

Comment 21 Karthik U S 2020-03-13 05:33:42 UTC
Hi Amgad,

Thanks for letting us know that it is not consistent. In that case can we close this for the time being and you can reopen or file a new github issue (we have moved to github from bugzilla, and all the open bugs are getting migrated to there) under https://github.com/gluster/glusterfs/issues whenever it is happens again with the data requested in comment #17?

Regards,
Karthik

Comment 22 Amgad 2020-03-13 14:04:34 UTC
Thanks Karthik

Sure you can go ahead and close the ticket. I'll open a new ticket whenever it's re-produced.

Are the following info what you need when it happens?

1. ps -ax | grep glustershd
2. gluster v status
3. glustershd.log

Regards,
Amgad

Comment 23 Sanju 2020-03-13 17:14:17 UTC
(In reply to Amgad from comment #22)
> Thanks Karthik
> 
> Sure you can go ahead and close the ticket. I'll open a new ticket whenever
> it's re-produced.
> 
> Are the following info what you need when it happens?
Yes Amgad.
> 
> 1. ps -ax | grep glustershd
> 2. gluster v status
> 3. glustershd.log
> 
> Regards,
> Amgad

Comment 24 Karthik U S 2020-03-16 04:31:29 UTC
@Sanju, INSUFFICIENT_DATA seems to be more appropriate than NOTABUG. Changing it accordingly.


Note You need to log in before you can comment on or make changes to this bug.