Bug 1809326
| Field | Value |
|---|---|
| Summary | Volumes keeps going offline with messages -- posix-helpers.c:2150:posix_health_check_thread_proc |
| Product | [Community] GlusterFS |
| Component | glusterd |
| Version | 6 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED INSUFFICIENT_DATA |
| Severity | urgent |
| Priority | unspecified |
| Reporter | Amgad <amgad.saleh> |
| Assignee | Sanju <srakonde> |
| CC | bugs, ksubrahm, srakonde |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Last Closed | 2020-03-13 17:14:17 UTC |
Description
Amgad
2020-03-02 20:53:22 UTC
Not sure about the issue with GlusterFS EL7 RPMs running on CentOS EL7 and Kernel 4.18.0-80.11.2.el8_0.x86_64.

What are the logs saying?

Thanks,
Sanju

Created attachment 1667817 [details]
glusterd on node experiencing volume offline
Created attachment 1667818 [details]
volume log from the node experiencing volume offline
Attaching the volume log and glusterd logs from the node experiencing the volume going offline -- there were a few manual stop/starts to bring it back online, but it went down again a few seconds after.

Also on the same node, glusterd fails to come up with the following messages:

[2020-03-05 17:11:28.540630] E [socket.c:923:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2020-03-05 17:11:28.540656] E [socket.c:925:__socket_server_bind] 0-socket.management: Port is already in use

Any suggestion on that?

I tried to restart glusterd now on that node, but it doesn't start, with the messages below. I'm also attaching the glusterd.log and glusterd.vol files.

[root@telco-control-02 glusterfs]# systemctl start glusterd
Job for glusterd.service failed because the control process exited with error code. See "systemctl status glusterd.service" and "journalctl -xe" for details.
[root@telco-control-02 glusterfs]# systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 19:23:58 UTC; 6s ago
  Control: 31068 (glusterd)
    Tasks: 190
   Memory: 493.9M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ├─ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ├─19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glus...
           ├─19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm...
           ├─19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09...
           ├─24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/gl...
           ├─31068 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           └─31071 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
Mar 05 19:23:58 telco-control-02 systemd[1]: Stopped GlusterFS, a clustered ....
Mar 05 19:23:58 telco-control-02 systemd[1]: Starting GlusterFS, a clustered....
Hint: Some lines were ellipsized, use -l to show in full.

[root@telco-control-02 glusterfs]# systemctl status -l glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 19:24:24 UTC; 4ms ago
  Control: 31746 (glusterd)
    Tasks: 183
   Memory: 491.9M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ├─ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ERROR
           ├─19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glusterfs.172.16.1.17.data0-glusterfs -p /var/run/gluster/vols/bcmt-glusterfs/172.16.1.17-data0-glusterfs.pid -S /var/run/gluster/1ee29a143ced3772.socket --brick-name /data0/glusterfs -l /var/log/glusterfs/bricks/data0-glusterfs.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49152 --xlator-option bcmt-glusterfs-server.listen-port=49152 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm-home.172.16.1.17.data0-glusterfs-helm-home -p /var/run/gluster/vols/bcmt-helm-home/172.16.1.17-data0-glusterfs-helm-home.pid -S /var/run/gluster/32e46f40f6326f49.socket --brick-name /data0/glusterfs-helm-home -l /var/log/glusterfs/bricks/data0-glusterfs-helm-home.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49153 --xlator-option bcmt-helm-home-server.listen-port=49153 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-backup.172.16.1.17.data0-glusterfs-cbur-backup -p /var/run/gluster/vols/cbur-glusterfs-backup/172.16.1.17-data0-glusterfs-cbur-backup.pid -S /var/run/gluster/9faf700427a7b186.socket --brick-name /data0/glusterfs-cbur-backup -l /var/log/glusterfs/bricks/data0-glusterfs-cbur-backup.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49154 --xlator-option cbur-glusterfs-backup-server.listen-port=49154 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-backup-cluster.172.16.1.17.data0-glusterfs-backup-cluster -p /var/run/gluster/vols/cbur-glusterfs-backup-cluster/172.16.1.17-data0-glusterfs-backup-cluster.pid -S /var/run/gluster/f9bad66bcadcc1cf.socket --brick-name /data0/glusterfs-backup-cluster -l /var/log/glusterfs/bricks/data0-glusterfs-backup-cluster.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49155 --xlator-option cbur-glusterfs-backup-cluster-server.listen-port=49155 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glusterfs-repo.172.16.1.17.data0-glusterfs-cbur-repo -p /var/run/gluster/vols/cbur-glusterfs-repo/172.16.1.17-data0-glusterfs-cbur-repo.pid -S /var/run/gluster/0fd1d96c1f9e0319.socket --brick-name /data0/glusterfs-cbur-repo -l /var/log/glusterfs/bricks/data0-glusterfs-cbur-repo.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49156 --xlator-option cbur-glusterfs-repo-server.listen-port=49156 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09b2af8fdf1c596fe8cef4d03f966.172.16.1.17.var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick -p /var/run/gluster/vols/vol_04c09b2af8fdf1c596fe8cef4d03f966/172.16.1.17-var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick.pid -S /var/run/gluster/117adbbe3cbaa25c.socket --brick-name /var/lib/heketi/mounts/vg_260e40bf44e9568fe20ed05545208104/brick_2a3320145759ee25cd0a814959ec98e0/brick -l /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_260e40bf44e9568fe20ed05545208104-brick_2a3320145759ee25cd0a814959ec98e0-brick.log --xlator-option *-posix.glusterd-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name brick --brick-port 49157 --xlator-option vol_04c09b2af8fdf1c596fe8cef4d03f966-server.listen-port=49157 --xlator-option transport.socket.bind-address=172.16.1.17
           ├─24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/67898f7af0cbf69c.socket --xlator-option *replicate*.node-uuid=6e013921-73b5-46cd-b8d7-5e7d93cd6bf0 --process-name glustershd --client-pid=-6
           └─31746 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ERROR

Mar 05 19:24:24 telco-control-02 systemd[1]: Starting GlusterFS, a clustered file-system server...

Created attachment 1667890 [details]
glusterd.log when the daemon was restarted
Created attachment 1667891 [details]
glusterd.vol on the troubled node
while glusterd is not up and running:

[root@telco-control-02 glusterfs]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/etc/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: activating (start) since Thu 2020-03-05 20:05:04 UTC; 4s ago
  Control: 25949 (glusterd)
    Tasks: 190
   Memory: 494.7M (limit: 23.5G)
   CGroup: /system.slice/glusterd.service
           ├─ 3873 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           ├─19874 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-glus...
           ├─19905 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id bcmt-helm...
           ├─19938 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─19977 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─20078 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id cbur-glus...
           ├─20090 /usr/sbin/glusterfsd -s 172.16.1.17 --volfile-id vol_04c09...
           ├─24041 /usr/sbin/glusterfs -s 172.16.1.17 --volfile-id gluster/gl...
           ├─25949 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...
           └─25950 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level ER...

Mar 05 20:05:04 telco-control-02 systemd[1]: Starting GlusterFS, a clustered....
Hint: Some lines were ellipsized, use -l to show in full.
[root@telco-control-02 glusterfs]#

Port 24007 is showing as owned by glusterd (the old process):

[root@telco-control-02 glusterfs]# netstat -nltp | grep 24007
tcp        0      0 172.16.1.17:24007       0.0.0.0:*               LISTEN      3873/glusterd

When I killed that process -- kill -9 3873 -- I was able to start glusterd. The question is why that process was left there holding on to the port? I believe this is the main issue!

glusterd is coded in such a way that it always runs on port 24007. So, whenever some other process is using this port, it can't start. This behaviour is expected. What is the current status of your cluster? Are the volumes still going offline?

Thanks,
Sanju

Thanks Sanju

I'm aware of port 24007.
The issue is that the process holding the port was a hung glusterd process, so systemctl restart glusterd would fail every time with that error. It had been there for days (see my comment before last). That's the issue.

[root@telco-control-02 glusterfs]# netstat -nltp | grep 24007
tcp        0      0 172.16.1.17:24007       0.0.0.0:*               LISTEN      3873/glusterd

** A separate question -- how many glustershd processes run per host? Is it configurable? Can you force it to be 1? And what is the criteria for having multiple? Sanju?

(In reply to Amgad from comment #12)
> ** A separate question - how many glustershd processes per host, is it
> configurable. Can you force it to be 1. And what's the criteria to have
> multiple?

glustershd is a per-node process till release-6. With the shd-multiplexing feature, which is introduced in release-7, it is a per-volume process. Still, you will have only one shd process running on a node, as the shd instances are attached to the same process. As of release-6, the number of shd processes running on a node is not configurable (not sure about release-7).

Thanks,
Sanju

Thanks Sanju

Now if you see multiple glustershd processes per node in release 6.x, what does that mean? A runaway process?

Not sure how we can have multiple shd's running on the same node. Please share the output of the following for further analysis:

1. ps -ax | grep glustershd
2. gluster v status
3. glustershd.log

@karthik, please add your thoughts on this.

Thanks,
Sanju

Hi Amgad,

Unless you have spawned a new shd process, there will be only one shd running per node. Please give the information requested in comment #17.

Regards,
Karthik

Thanks Karthik

This was reported by one of our applications: there were transient ones, and they were monitoring the number; when it became > 1, they'd kill the daemon and restart glusterd. So it doesn't happen every time.
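The workaround described above -- find the stale glusterd still listening on 24007, kill it, then start the service -- can be sketched as a small shell helper. This is only an illustrative sketch, not part of GlusterFS: pid_on_port is a made-up name, and the parsing assumes netstat -nltp output shaped like the sample in this thread.

```shell
# Hypothetical helper (not part of GlusterFS): extract the PID of the
# process listening on a given TCP port from `netstat -nltp` output.
# A matching line looks like:
#   tcp 0 0 172.16.1.17:24007 0.0.0.0:* LISTEN 3873/glusterd
pid_on_port() {
  # $1 = port number; netstat output is read from stdin.
  # Field 4 is the local address, field 6 the state, field 7 "PID/name".
  awk -v port=":$1" '$4 ~ port "$" && $6 == "LISTEN" { split($7, a, "/"); print a[1] }'
}

# Sample line taken from the thread:
sample='tcp        0      0 172.16.1.17:24007       0.0.0.0:*               LISTEN      3873/glusterd'
echo "$sample" | pid_on_port 24007    # prints: 3873

# Real use would look like this (assumes root, and that killing the
# stale daemon is acceptable -- exactly the manual step taken above):
#   pid=$(netstat -nltp 2>/dev/null | pid_on_port 24007)
#   [ -n "$pid" ] && kill "$pid" && systemctl start glusterd
```

Since glusterd always binds port 24007 (as Sanju notes), a check like this before systemctl start glusterd turns the opaque "Port is already in use" failure into an actionable result.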
So I'll reach out to the application team to provide the data when it happens again, and not take any action.

Hi Amgad,

Thanks for letting us know that it is not consistent. In that case, can we close this for the time being, and you can reopen or file a new GitHub issue (we have moved to GitHub from Bugzilla, and all the open bugs are getting migrated there) under https://github.com/gluster/glusterfs/issues whenever it happens again, with the data requested in comment #17?

Regards,
Karthik

Thanks Karthik

Sure, you can go ahead and close the ticket. I'll open a new ticket whenever it's reproduced.

Are the following the info you need when it happens?

1. ps -ax | grep glustershd
2. gluster v status
3. glustershd.log

Regards,
Amgad

(In reply to Amgad from comment #22)
> Sure you can go ahead and close the ticket. I'll open a new ticket whenever
> it's re-produced.
>
> Are the following info what you need when it happens?

Yes Amgad.

@Sanju, INSUFFICIENT_DATA seems to be more appropriate than NOTABUG. Changing it accordingly.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
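The data-collection step agreed on in the thread can be sketched as a small script. All names here (count_shd, collect_shd_debug, the output directory) are made up for illustration; the script simply captures the three items requested in comment #17 before anyone kills a stale daemon, and the >1 threshold reflects the release-6 behaviour Sanju describes (exactly one glustershd per node).

```shell
#!/bin/sh
# Hypothetical monitoring/collection sketch -- not a GlusterFS tool.

# Count glustershd processes; the [g] trick keeps grep from matching itself.
count_shd() {
  ps -ax | grep -c '[g]lustershd'
}

# Capture the data requested in comment #17 into a directory and print its path.
collect_shd_debug() {
  outdir="${1:-/tmp/gluster-shd-debug.$$}"
  mkdir -p "$outdir"
  ps -ax | grep '[g]lustershd'         > "$outdir/glustershd-processes.txt"
  gluster v status                      > "$outdir/volume-status.txt" 2>&1
  cp /var/log/glusterfs/glustershd.log "$outdir/" 2>/dev/null
  echo "$outdir"
}

# Example policy: on release-6 there should be exactly one shd per node,
# so more than one suggests a stale daemon -- collect evidence first, and
# only then let an operator decide about killing it and restarting glusterd.
if [ "$(count_shd)" -gt 1 ]; then
  collect_shd_debug /tmp/gluster-shd-debug
fi
```

Collecting before restarting matters here: the thread was closed INSUFFICIENT_DATA precisely because the monitoring application restarted glusterd before the evidence was saved.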