1699512 – ['gluster peer status' command showing an incorrect status when a host is rebooting]


Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Gluster
Sub Component:
Version:	4.40.0
Hardware:	Unspecified
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	ovirt-4.4.0
Target Release:	---
Assignee:	Kaustav Majumder
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1698519

Reported:	2019-04-13 06:29 UTC by Sahina Bose
Modified:	2023-10-06 18:14 UTC
CC List:	7 users
Fixed In Version:
Clone Of:	1698519
Environment:
Last Closed:	2019-11-28 14:21:07 UTC
oVirt Team:	Gluster
Embargoed:
Dependent Products:
Flags:	sasundar: ovirt-4.4?

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	100693	0	master	ABANDONED	gluster: Added timeout to gluster vol status command pre fencing	2020-07-16 03:56:46 UTC

Description Sahina Bose 2019-04-13 06:29:11 UTC

Description of problem:

This is ultimately caused by glusterd's single-threaded design. From Atin Mukherjee:

"Sahina - This is the root cause. Since glusterd works in a single-threaded epoll model, it can only process packets one at a time, under a big lock. In this case, if the shutdown and volume status race with each other, glusterd, say, picks up volume status first, takes the big lock, sends an RPC request assuming the peer is still connected, and waits 10 minutes for the RPC frame to bail out. Meanwhile, the queued rpc_clnt_disconnect packet isn't processed, so peer status still shows the peer as connected even though the node is down. After 10 minutes, once the big lock is released by the RPC frame bail-out, this packet is processed and the peer is marked disconnected."
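
The ordering problem described in the quote can be illustrated with a toy model. This is only a sketch in Python, not glusterd code: the event names, the `simulate` function, and the simulated clock are all hypothetical, and the only figure taken from the report is the 10-minute bail-out.

```python
from collections import deque

# 10-minute RPC frame bail-out mentioned in the report, in seconds.
BAIL_OUT_SECONDS = 600


def simulate(events):
    """Process events strictly one at a time, like a single-threaded loop.

    Assumes the peer node is already down when processing starts.
    Returns a list of (clock, event, peer_state) snapshots, one per event,
    where clock is simulated seconds and peer_state is what a peer status
    query would report at that point.
    """
    clock = 0
    peer_state = "Connected"  # stale view: the disconnect is still queued
    timeline = []
    queue = deque(events)
    while queue:
        event = queue.popleft()
        if event == "volume_status" and peer_state == "Connected":
            # The request goes to a peer believed to be connected, so the
            # handler holds the lock until the RPC frame bails out.
            clock += BAIL_OUT_SECONDS
        elif event == "peer_disconnect":
            peer_state = "Disconnected"
        timeline.append((clock, event, peer_state))
    return timeline


if __name__ == "__main__":
    for row in simulate(["volume_status", "peer_disconnect"]):
        print(row)
```

When `volume_status` is dequeued first, the peer stays "Connected" for the full 600 simulated seconds. When `peer_disconnect` is dequeued first, no wait occurs in this model.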

The volume status call in the fencing pre-check should first ping the host to see whether it is online, and then wait for a short delay before triggering volume status.
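
A minimal sketch of such a pre-check, assuming a Linux host with iputils `ping` and the `gluster` CLI on the PATH. The function names, the default delay, and the timeouts are hypothetical and are not taken from vdsm. The timeout on the volume status call mirrors the idea named in the linked gerrit change title, not its actual code.

```python
import subprocess
import time


def ping_host(host, timeout_s=2):
    """Return True if a single ICMP echo to host succeeds (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_volume_status(volume, timeout_s=30):
    """Run 'gluster volume status <volume>'; return stdout, or None on
    failure or if the command does not finish within timeout_s seconds."""
    try:
        result = subprocess.run(
            ["gluster", "volume", "status", volume],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None


def fencing_precheck(host, volume, delay_s=5, ping=ping_host,
                     volume_status=run_volume_status, sleep=time.sleep):
    """Ping the host, wait a short delay, then query volume status.

    Returns (host_online, status_output). How the caller uses the two
    values is left open here; the report does not specify it.
    """
    host_online = ping(host)
    sleep(delay_s)  # hypothetical delay before asking glusterd
    return host_online, volume_status(volume)
```

The `ping`, `volume_status`, and `sleep` parameters are injectable so the flow can be exercised without a network or a Gluster cluster, for example `fencing_precheck("host1", "vol1", ping=lambda h: False, volume_status=lambda v: None, sleep=lambda s: None)`.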

Comment 3 Sahina Bose 2019-11-28 14:21:07 UTC

This patch does not address the race in Gluster in which a node reboot and a volume status request cause the node to be shown as connected, so closing the bug as can't fix. Bug 1698519 tracks the original issue, which cannot be fixed in vdsm-gluster.
