Bug 1769216 - glusterfsd fails to come online after rebooting two storage nodes at the same time
Summary: glusterfsd fails to come online after rebooting two storage nodes at the same time
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-06 07:59 UTC by zhou lin
Modified: 2019-11-13 08:49 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-13 08:49:46 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments (Terms of Use)
glusterfsd process log (2.25 MB, application/zip)
2019-11-06 07:59 UTC, zhou lin
glusterd process log (1.59 MB, application/zip)
2019-11-06 08:00 UTC, zhou lin

Description zhou lin 2019-11-06 07:59:31 UTC
Created attachment 1633224 [details]
glusterfsd process log

Description of problem:

During my recent testing on glusterfs 7, I found that after rebooting storage nodes, the volume status is often wrong once glusterd and glusterfsd come back up.
The glusterd and glusterfsd processes are both alive, but the "gluster v status" command shows the glusterfsd process as N/A.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Reboot all storage nodes at the same time
2. Wait for all nodes to come back up
3. Execute "gluster v status all"
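For repeated runs, the check in step 3 can be scripted. A minimal sketch (a hypothetical helper, not part of gluster; it assumes the six-field brick lines that "gluster v status" prints, as in the status output quoted later in this report):

```python
def offline_bricks(status_output):
    """Return bricks whose Online column shows 'N' in the output of
    'gluster v status'. An offline brick line looks like:
    Brick mn-0.local:/mnt/bricks/ccs/brick  N/A  N/A  N  N/A
    """
    offline = []
    for line in status_output.splitlines():
        fields = line.split()
        # Expected fields: 'Brick', name, TCP port, RDMA port, Online flag, Pid
        if len(fields) == 6 and fields[0] == "Brick" and fields[4] == "N":
            offline.append(fields[1])
    return offline
```

Feeding it the status table below would return the mn-0.local and dbm-0.local bricks.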

Actual results:

For some volumes, the glusterfsd brick processes fail to come online.
Expected results:
All glusterfsd processes come online.

Additional info:
# gluster v status ccs
Status of volume: ccs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mn-0.local:/mnt/bricks/ccs/brick      N/A       N/A        N       N/A  
Brick mn-1.local:/mnt/bricks/ccs/brick      53952     0          Y       2065 
Brick dbm-0.local:/mnt/bricks/ccs/brick     N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       4940 
Self-heal Daemon on dbm-0.local             N/A       N/A        N       N/A  
Self-heal Daemon on mn-1.local              N/A       N/A        Y       2537 
 
Task Status of Volume ccs
------------------------------------------------------------------------------
There are no active volume tasks
# ps -ef | grep glusterfsd| grep ccs
root      1764     1  0 09:10 ?        00:00:07 /usr/sbin/glusterfsd -s mn-0.local --volfile-id ccs.mn-0.local.mnt-bricks-ccs-brick -p /var/run/gluster/vols/ccs/mn-0.local-mnt-bricks-ccs-brick.pid -S /var/run/gluster/7ea87ceb0a781684.socket --brick-name /mnt/bricks/ccs/brick -l /var/log/glusterfs/bricks/mnt-bricks-ccs-brick.log --log-level TRACE --xlator-option *-posix.glusterd-uuid=ebaded6d-91d5-4873-a60a-59bbcc813714 --process-name brick --brick-port 53952 --xlator-option ccs-server.listen-port=53952 --xlator-option transport.socket.bind-address=mn-0.local
[root@mn-0:/var/log/storageinfo/symptom_log]
# netstat -anlp| grep 1764 
tcp        0      0 192.168.1.6:53952       0.0.0.0:*               LISTEN      1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49058      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49069       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.33:49139      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.12:49136      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.16:49139      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.23:49145      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49052       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.8:49113       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.7:49104       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49056       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49082       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.29:49144      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49045       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49100      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:49149       192.168.1.6:24007       ESTABLISHED 1764/glusterfsd     
unix  2      [ ACC ]     STREAM     LISTENING     25405    1764/glusterfsd      /var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ACC ]     STREAM     LISTENING     40159    1764/glusterfsd      /var/run/gluster/changelog-25ddbf533d927939.sock
unix  3      [ ]         STREAM     CONNECTED     41282    1764/glusterfsd      /var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ]         DGRAM                    26910    1764/glusterfsd      
[root@mn-0:/var/log/storageinfo/symptom_log]
# gluster v info ccs

Volume Name: ccs
Type: Replicate
Volume ID: 521261bc-2cba-4e7b-a21a-8486712d7a31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/ccs/brick
Brick2: mn-1.local:/mnt/bricks/ccs/brick
Brick3: dbm-0.local:/mnt/bricks/ccs/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
cluster.self-heal-daemon: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
cluster.heal-timeout: 60
performance.client-io-threads: off
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.entry-self-heal: on
cluster.server-quorum-ratio: 51%


[Some analysis based on the attached logs]
From glusterd.log:
[2019-11-06 07:10:42.708849] D [MSGID: 0] [glusterd-utils.c:6625:glusterd_restart_bricks] 0-management: starting the volume ccs  --------- glusterd starts the glusterfsd process here
…
[2019-11-06 07:10:43.710937] T [socket.c:226:socket_dump_info] 0-management: $$$ client: connecting to (af:1,sock:12) /var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  -- does this mean the connection to glusterfsd is successful?


From glusterfsd.log:
[2019-11-06 07:10:42.779208] T [socket.c:226:socket_dump_info] 0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7) /var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  ------ I think this means the glusterfsd unix domain socket is ready to accept connections
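On the question raised for the glusterd.log line: a successful client connect on a unix socket only shows the kernel completed the handshake into the listen backlog; by itself it does not prove the peer process has handled anything at the application (RPC) level, such as the brick signing in its port with glusterd. A generic illustration (not gluster-specific; the socket path is a temporary file):

```python
import os
import socket
import tempfile

def connect_without_accept():
    """connect() to a listening unix socket succeeds even if the server
    never calls accept(): the kernel queues the connection in the listen
    backlog. So a successful connect alone does not mean the server has
    processed the request."""
    path = os.path.join(tempfile.mkdtemp(), "demo.socket")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)          # listening, but accept() is never called

    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(path)      # succeeds: handshake completed by the kernel
    cli.sendall(b"hello")  # buffered by the kernel; the server saw nothing
    cli.close()
    srv.close()
    os.unlink(path)
    return True
```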

Comment 1 zhou lin 2019-11-06 08:00:08 UTC
Created attachment 1633225 [details]
glusterd process log

Comment 2 zhou lin 2019-11-13 08:49:46 UTC
This finally turned out to be a configuration issue. In my environment, in
glusterd.conf
the ping-timeout value was set to 0, which seems to be related to this issue.
After I changed the ping-timeout value to 30, the problem disappeared!
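For reference, the change described above corresponds to the ping-timeout option in the glusterd volfile. A sketch of the working setting (the path and surrounding lines are assumed from a stock install and may differ per distribution; the reporter's file name was given as glusterd.conf):

```
# /etc/glusterfs/glusterd.vol (excerpt, assumed layout)
volume management
    type mgmt/glusterd
    ...
    option ping-timeout 30    # was 0 in the failing setup
end-volume
```

glusterd must be restarted for a volfile change to take effect.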

