Bug 1609451 - [Tracker for gluster bug 1609450] Bricks are marked as down, after node reboot
Summary: [Tracker for gluster bug 1609450] Bricks are marked as down, after node reboot
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Gobinda Das
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1609450 1758438
Blocks: 1548985
 
Reported: 2018-07-28 01:27 UTC by SATHEESARAN
Modified: 2020-08-17 15:16 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When a node rebooted, including as part of upgrades or updates, subsequent runs of `gluster volume status` sometimes incorrectly reported that bricks were not running, even when the relevant `glusterfsd` processes were running as expected. State is now reported correctly in these circumstances.
Clone Of: 1609450
Environment:
RHHI
Last Closed: 2020-08-17 15:16:31 UTC
Embargoed:


Attachments
glusterd.log (3.39 MB, text/plain)
2018-10-26 14:56 UTC, SATHEESARAN

Description SATHEESARAN 2018-07-28 01:27:20 UTC
+++ This bug was initially created as a clone of Bug #1609450 +++

Description of problem:
-----------------------
There were 3 RHGS 3.3.1 nodes in the trusted storage pool, with server quorum and client quorum enabled. There are 3 replica 3 volumes spanning the three nodes in this cluster. When one of the nodes is rebooted, the bricks are marked as down by the 'gluster volume status' command, even though the corresponding glusterfsd processes are running.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.3.1 ( glusterfs-3.8.4-54.15.el7rhgs )

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. Create a 3-node trusted storage pool (gluster cluster)
2. Create replica 3 volumes
3. Enable server and client quorums
4. Reboot one of the nodes
5. Check 'gluster volume status' (an illustrative command sequence follows below)
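
A minimal sketch of these steps using the standard gluster CLI; the hostnames, brick paths, and volume name below are placeholders, not the actual ones from this setup:

# On the first node, probe the other two peers (hostnames are placeholders)
gluster peer probe server2.example.com
gluster peer probe server3.example.com

# Create and start a replica 3 volume (volume name and brick paths are placeholders)
gluster volume create testvol replica 3 \
    server1.example.com:/gluster_bricks/testvol/brick \
    server2.example.com:/gluster_bricks/testvol/brick \
    server3.example.com:/gluster_bricks/testvol/brick
gluster volume start testvol

# Enable server-side and client-side quorum
gluster volume set testvol cluster.server-quorum-type server
gluster volume set testvol cluster.quorum-type auto

# Reboot one of the nodes, then from any node check the brick status
gluster volume status testvol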

Actual results:
---------------
The 'gluster volume status' command reports that the bricks are not running, even though the corresponding glusterfsd processes are still running.

Expected results:
-----------------
After the reboot, the 'gluster volume status' command should report the correct status of the bricks, which in this case is 'up'.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2018-07-27 21:24:53 EDT ---

This bug is automatically being proposed for the release of Red Hat Gluster Storage 3 under active development and open for bug fixes, by setting the release flag 'rhgs-3.4.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

Comment 2 SATHEESARAN 2018-07-28 01:32:27 UTC
I have a suspicion about the following messages in the glusterd logs:

<snip>
[2018-07-27 15:49:12.961132] I [MSGID: 106493] [glusterd-rpc-ops.c:693:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 6715c775-6021-4f21-a669-83bee56e55c5
[2018-07-27 15:49:12.967504] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.972686] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data has disconnected from glusterd.
[2018-07-27 15:49:12.980700] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:12.986954] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine has disconnected from glusterd.
[2018-07-27 15:49:12.993857] I [socket.c:2465:socket_event_handler] 0-transport: EPOLLERR - disconnecting now
[2018-07-27 15:49:13.000230] I [MSGID: 106005] [glusterd-handler.c:6122:__glusterd_brick_rpc_notify] 0-management: Brick rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/vmstore/vmstore has disconnected from glusterd.
</snip>
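
For reference, the mismatch can be cross-checked on the affected node by comparing what glusterd reports with the brick processes that are actually running; a rough sketch, using the 'data' volume from the brick paths above:

# What glusterd reports (the 'Online' column should read 'Y' for each brick)
gluster volume status data

# The glusterfsd brick processes actually running on this node
pgrep -af glusterfsd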

Comment 3 Sahina Bose 2018-10-09 05:38:27 UTC
Closed as per the status of the tracking bz

Comment 4 SATHEESARAN 2018-10-26 14:53:24 UTC
I have hit the same issue while upgrading from RHHI-V 1.1 to RHHI-V 1.5

RHHI-V 1.1 - glusterfs-3.8.4-15.8.el7rhgs
RHHI-V 1.5 - glusterfs-3.12.2-25.el7rhgs

Post upgrade, the RHVH node was rebooted. When the node came up, I issued 'gluster volume status' and noticed that the bricks were reported as down, but on investigating the brick processes, they were up and running.

So re-opening the bug

Comment 6 SATHEESARAN 2018-10-26 14:56:59 UTC
Created attachment 1497833 [details]
glusterd.log

Attaching the glusterd.log from the host (RHVH node) which was upgraded and rebooted

Comment 7 SATHEESARAN 2018-10-26 15:00:27 UTC
This issue is to be documented as a known issue.
Workaround: restart glusterd on the RHVH node after the upgrade/update and reboot (a command sketch follows below).
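
A minimal sketch of that workaround on the affected RHVH node, assuming glusterd is managed by systemd (as it is on RHEL 7 based RHGS/RHVH hosts):

# Restart the management daemon on the node that was upgraded and rebooted
systemctl restart glusterd

# Confirm that the bricks are now reported as online
gluster volume status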

Comment 8 SATHEESARAN 2019-03-19 20:11:39 UTC
Adding this as a known_issue for RHHI-V 1.6

Comment 9 Yaniv Kaul 2019-03-27 11:28:25 UTC
(In reply to SATHEESARAN from comment #8)
> Adding this as a known_issue for RHHI-V 1.6

Why? You could not reproduce it (https://bugzilla.redhat.com/show_bug.cgi?id=1609450#c28) - please close both.

Comment 11 SATHEESARAN 2019-04-01 07:03:33 UTC
(In reply to Yaniv Kaul from comment #9)
> (In reply to SATHEESARAN from comment #8)
> > Adding this as a known_issue for RHHI-V 1.6
> 
> Why? You could not reproduce it
> (https://bugzilla.redhat.com/show_bug.cgi?id=1609450#c28) - please close
> both.

Yaniv,

I think you have misunderstood the issue. I could not hit the issue only when DEBUG logs were enabled.
The issue is still seen, particularly during upgrade/update: the brick process is not
shown as running in 'gluster volume status', even though it is still running on the node.

Comment 14 SATHEESARAN 2019-04-12 12:49:57 UTC
@Laura, the doc_text looks good. I have edited the text to make clear that this issue
is hit only sometimes, not every time.

Comment 16 SATHEESARAN 2020-08-17 15:16:31 UTC
This issue is no longer seen with RHHI-V 1.8 on RHV 4.4.

