Bug 1769216

Summary: glusterfsd fails to come online after rebooting two storage nodes at the same time
Product: [Community] GlusterFS
Reporter: zhou lin <zz.sh.cynthia>
Component: glusterd
Assignee: bugs <bugs>
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Version: 7
CC: bugs, pasik
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2019-11-13 08:49:46 UTC
Type: Bug

Attachments:
glusterfsd process log (attachment 1633224)
glusterd process log (attachment 1633225)

Description zhou lin 2019-11-06 07:59:31 UTC
Created attachment 1633224 [details]
glusterfsd process log

Description of problem:

During my recent tests on glusterfs 7, I still found that after rebooting the storage nodes, the volume status is often wrong even once glusterd and glusterfsd have come back up: both the glusterd and glusterfsd processes are alive, yet the "gluster v status" command shows the glusterfsd (brick) process as N/A.

Version-Release number of selected component (if applicable):
glusterfs 7

How reproducible:
Often, but not on every reboot.

Steps to Reproduce:
1. Reboot all storage nodes at the same time.
2. Wait for all nodes to come back up.
3. Execute "gluster v status all" and check the Online column (a scripted sketch of these steps follows below).
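
For reference, a rough shell sketch of the reproduction loop. The node names mn-0/mn-1/dbm-0 are from this setup; passwordless ssh and the reboot/readiness checks are assumptions, not part of the report:

NODES="mn-0.local mn-1.local dbm-0.local"

# 1. reboot all storage nodes at (roughly) the same time
for n in $NODES; do ssh "$n" reboot & done; wait

# 2. wait until glusterd is running again on every node
for n in $NODES; do
    until ssh "$n" pidof glusterd >/dev/null 2>&1; do sleep 5; done
done

# 3. check whether every brick is reported Online=Y
gluster v status all | grep -E "^Brick"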

Actual results:

Some volumes' glusterfsd brick processes fail to come online (shown as Online=N / N/A in "gluster v status").

Expected results:

All glusterfsd brick processes come online.

Additional info:
# gluster v status ccs
Status of volume: ccs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick mn-0.local:/mnt/bricks/ccs/brick      N/A       N/A        N       N/A  
Brick mn-1.local:/mnt/bricks/ccs/brick      53952     0          Y       2065 
Brick dbm-0.local:/mnt/bricks/ccs/brick     N/A       N/A        N       N/A  
Self-heal Daemon on localhost               N/A       N/A        Y       4940 
Self-heal Daemon on dbm-0.local             N/A       N/A        N       N/A  
Self-heal Daemon on mn-1.local              N/A       N/A        Y       2537 
 
Task Status of Volume ccs
------------------------------------------------------------------------------
There are no active volume tasks
# ps -ef | grep glusterfsd| grep ccs
root      1764     1  0 09:10 ?        00:00:07 /usr/sbin/glusterfsd -s mn-0.local --volfile-id ccs.mn-0.local.mnt-bricks-ccs-brick -p /var/run/gluster/vols/ccs/mn-0.local-mnt-bricks-ccs-brick.pid -S /var/run/gluster/7ea87ceb0a781684.socket --brick-name /mnt/bricks/ccs/brick -l /var/log/glusterfs/bricks/mnt-bricks-ccs-brick.log --log-level TRACE --xlator-option *-posix.glusterd-uuid=ebaded6d-91d5-4873-a60a-59bbcc813714 --process-name brick --brick-port 53952 --xlator-option ccs-server.listen-port=53952 --xlator-option transport.socket.bind-address=mn-0.local
[root@mn-0:/var/log/storageinfo/symptom_log]
# netstat -anlp| grep 1764 
tcp        0      0 192.168.1.6:53952       0.0.0.0:*               LISTEN      1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49058      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49069       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.33:49139      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.12:49136      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.16:49139      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.23:49145      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49052       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.8:49113       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.7:49104       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49056       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.6:49082       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.29:49144      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.5:49045       ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:53952       192.168.1.11:49100      ESTABLISHED 1764/glusterfsd     
tcp        0      0 192.168.1.6:49149       192.168.1.6:24007       ESTABLISHED 1764/glusterfsd     
unix  2      [ ACC ]     STREAM     LISTENING     25405    1764/glusterfsd      /var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ACC ]     STREAM     LISTENING     40159    1764/glusterfsd      /var/run/gluster/changelog-25ddbf533d927939.sock
unix  3      [ ]         STREAM     CONNECTED     41282    1764/glusterfsd      /var/run/gluster/7ea87ceb0a781684.socket
unix  2      [ ]         DGRAM                    26910    1764/glusterfsd      
[root@mn-0:/var/log/storageinfo/symptom_log]
# gluster v info ccs

Volume Name: ccs
Type: Replicate
Volume ID: 521261bc-2cba-4e7b-a21a-8486712d7a31
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/ccs/brick
Brick2: mn-1.local:/mnt/bricks/ccs/brick
Brick3: dbm-0.local:/mnt/bricks/ccs/brick
Options Reconfigured:
diagnostics.brick-log-level: TRACE
cluster.self-heal-daemon: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.quorum-type: auto
cluster.quorum-reads: true
cluster.consistent-metadata: on
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
cluster.heal-timeout: 60
performance.client-io-threads: off
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.entry-self-heal: on
cluster.server-quorum-ratio: 51%


[Some analysis based on the enclosed logs]
From glusterd.log:
[2019-11-06 07:10:42.708849] D [MSGID: 0] [glusterd-utils.c:6625:glusterd_restart_bricks] 0-management: starting the volume ccs  --------- glusterd starts the glusterfsd brick process here
…
[2019-11-06 07:10:43.710937] T [socket.c:226:socket_dump_info] 0-management: $$$ client: connecting to (af:1,sock:12) /var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  -- does this mean the connection to glusterfsd succeeded?


From glusterfsd.log:
[2019-11-06 07:10:42.779208] T [socket.c:226:socket_dump_info] 0-socket.glusterfsd: $$$ client: listening on (af:1,sock:7) /var/run/gluster/7ea87ceb0a781684.socket non-SSL (errno:0:Success)  ------ I think this means the glusterfsd unix domain socket is ready to accept connections
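
Two hedged follow-up checks that might help narrow this down (the commands are standard gluster/shell tools, but the exact log message wording is an assumption and may differ between releases):

# did glusterd log a disconnect from the brick, or a ping-related event,
# shortly after the connect shown above?
grep -iE "disconnect|ping" /var/log/glusterfs/glusterd.log | grep "2019-11-06 07:1"

# dump glusterd's view of the cluster (written under /var/run/gluster/) to see
# what status and port it currently reports for the mn-0 ccs brick
gluster get-state glusterd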

Comment 1 zhou lin 2019-11-06 08:00:08 UTC
Created attachment 1633225 [details]
glusterd process log

Comment 2 zhou lin 2019-11-13 08:49:46 UTC
It finally seems to be a configuration issue in my environment:
in glusterd.conf the ping-timeout value was set to 0, which appears to be related to this issue.
After I changed the ping-timeout value to 30, the problem disappeared!
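
For anyone hitting the same symptom, roughly the change that made the problem disappear here. On most installations glusterd's own ping-timeout lives in the glusterd volfile (typically /etc/glusterfs/glusterd.vol; the exact file name and path may differ per distribution, which is why it is referred to as glusterd.conf above). A minimal sketch, not the exact stock file:

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    # ... keep the other stock options unchanged ...
    # option ping-timeout 0     <- value that reproduced the problem here
    option ping-timeout 30
end-volume

glusterd then has to be restarted (e.g. "systemctl restart glusterd") so the new volfile is read.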