Description of problem:
-----------------------
There were 3 RHGS 3.3.1 nodes in the trusted storage pool, with both server quorum and client quorum enabled. There are 3 replica 3 volumes across the three nodes in this cluster. When one of the nodes is rebooted, the bricks are marked down by the 'gluster volume status' command, even though the glusterfsd processes are running.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.3.1 ( glusterfs-3.8.4-54.15.el7rhgs )

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. Create a 3 node trusted storage pool ( gluster cluster )
2. Create replica 3 volumes
3. Enable server & client quorums
4. Reboot one of the nodes
5. Check 'gluster volume status'

Actual results:
---------------
'gluster volume status' reports that the bricks are not running, though the corresponding glusterfsd processes are still running.

Expected results:
-----------------
After reboot, 'gluster volume status' should report the correct status of the bricks, which in this case is 'up'.
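The steps to reproduce can be sketched as a small shell script. This is a sketch only: the volume name, node hostnames, and brick paths are placeholders, not taken from the original setup, and it assumes a 3-node pool has already been formed with 'gluster peer probe'.

```shell
#!/bin/sh
# Sketch of the reproduction steps. All names are illustrative.
setup_repro_volume() {
    vol="$1"
    # Step 2: create and start a replica 3 volume across the three nodes
    gluster volume create "$vol" replica 3 \
        node1:/bricks/"$vol" node2:/bricks/"$vol" node3:/bricks/"$vol"
    gluster volume start "$vol"
    # Step 3: server quorum - glusterd stops bricks when the pool loses quorum
    gluster volume set "$vol" cluster.server-quorum-type server
    # Step 3: client quorum - clients require a majority of bricks for writes
    gluster volume set "$vol" cluster.quorum-type auto
}
```

After rebooting one node (step 4), running 'gluster volume status' on that node shows the mismatch described above.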
I have a suspicion around the following messages in glusterd logs:

<snip>
[2018-07-27 15:49:12.961132] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 6715c775-6021-4f21-a669-83bee56e55c5
[2018-07-27 15:49:12.967504] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.972686] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data has disconnected from glusterd.
[2018-07-27 15:49:12.980700] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.986954] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine has disconnected from glusterd.
[2018-07-27 15:49:12.993857] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:13.000230] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/vmstore/vmstore has disconnected from glusterd.
</snip>
A workaround exists for this issue: after the node reboots, restart the gluster service on that node, and 'gluster volume status' reports the correct status.
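The workaround above, expressed as a minimal script. This assumes a systemd-based RHGS node where the management daemon runs as the 'glusterd' service; restarting it lets glusterd reconnect to the already-running glusterfsd processes.

```shell
#!/bin/sh
# Workaround from the report: restart glusterd on the rebooted node,
# then re-check the volume status.
apply_workaround() {
    systemctl restart glusterd
    # Status should now report the bricks as online again
    gluster volume status
}
```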
Created attachment 1471171 [details] glusterd log file from the rebooted node
Sanju - did we try to reproduce this with latest RHGS bits?
I did the following to reproduce the issue:
1. Formed a 3 node cluster running with RHGS-3.4.0 bits
2. Created 3 replica 3 volumes and started them
3. Enabled server quorum for all volumes:
   gluster volume set <volname> cluster.server-quorum-type server
4. Enabled client quorum for all volumes:
   gluster v set <volname> cluster.quorum-type auto
5. Rebooted one of the nodes
6. Started glusterd on the rebooted node
7. gluster v status shows all bricks online

@Sas, I couldn't reproduce this issue with RHGS-3.4.0 bits. I'm in favour of closing this bug. Need your inputs here.
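For step 7, one way to verify mechanically is to count offline bricks in the status output. A small sketch, assuming the usual 'gluster volume status' layout where brick lines start with "Brick" and the Online (Y/N) column is the second-to-last field:

```shell
#!/bin/sh
# Count bricks whose Online column reads "N". Reads `gluster volume
# status` output from stdin; prints 0 when all bricks are online.
count_offline_bricks() {
    awk '/^Brick/ && $(NF-1) == "N" { n++ } END { print n + 0 }'
}
```

Usage would be 'gluster volume status | count_offline_bricks', expecting 0 after step 6 on a healthy cluster.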
I'm closing this. Please feel free to reopen if the issue persists.
I have hit the same issue while upgrading from RHHI-V 1.1 to RHHI-V 1.5.

RHHI-V 1.1 - glusterfs-3.8.4-15.8.el7rhgs
RHHI-V 1.5 - glusterfs-3.12.2-25.el7rhgs

Post upgrade, the RHVH node was rebooted. When the node came up, 'gluster volume status' showed the bricks as down, but on investigating the brick processes, they were up and running. So I am re-opening the bug.
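To confirm the mismatch, the running brick processes can be cross-checked against the brick pid files glusterd keeps. A sketch, assuming the default pid-file layout under /var/run/gluster/vols (the directory is a parameter, since the exact path can vary between releases):

```shell
#!/bin/sh
# For every brick pid file under the given directory, report whether
# the recorded pid is actually alive. On the affected node this shows
# RUNNING for bricks that `gluster volume status` claims are down.
check_brick_pids() {
    piddir="${1:-/var/run/gluster/vols}"
    find "$piddir" -name '*.pid' 2>/dev/null | while read -r f; do
        pid=$(cat "$f")
        if kill -0 "$pid" 2>/dev/null; then
            echo "RUNNING pid=$pid $f"
        else
            echo "DEAD    pid=$pid $f"
        fi
    done
}
```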
Created attachment 1497834 [details] glusterd.log Attaching the glusterd.log as the issue is re-surfaced
Sas - to close the loop, can you please provide us a setup where this can be replicated, so that we can take over and start debugging? We need to prioritize this bug considering the nature of the problem reported.
Is it a temporary situation, that resolves itself after a short period of time?
(In reply to Yaniv Kaul from comment #29)
> Is it a temporary situation, that resolves itself after a short period of
> time?

No, it doesn't. It recovers only after a glusterd restart on that particular HC node.