
Bug 1521041

Summary:	rpc: fix the timedout tests
Product:	[Community] GlusterFS	Reporter:	Amar Tumballi <atumball>
Component:	rpc	Assignee:	bugs <bugs>
Status:	CLOSED WONTFIX	QA Contact:
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	bugs
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-12 17:42:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Amar Tumballi 2017-12-05 16:55:30 UTC

Description of problem:

rpc: Prevent frame-timeouts from hanging syncops

Summary:
While testing the SHD threading code, it was observed that under high load
SHD/AFR-related SyncOps and SyncTasks can hang or deadlock because the
transport disconnected event (for frame timeouts) never gets bubbled up
correctly. Various tests indicated that ping timeouts worked fine, while
"frame timeouts" did not. The only difference? Ping timeouts actually
disconnect the transport, while frame timeouts do not. At a high level, we
know that disconnecting on a frame timeout prevents the deadlock, as
subsequent tests showed the deadlocks no longer occurred after this change.
That said, there may be a more elegant solution. For now, though, forcing a
reconnect is preferable to hanging clients or deadlocking the SHD.
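
The difference between the two timeout types can be illustrated with a toy model. The sketch below is not GlusterFS code: `ToyTransport`, `frame_timed_out` and `waiter_is_released` are hypothetical names, and a `threading.Event` stands in for the disconnect notification that a blocked syncop would be waiting on. It only shows why a waiter stays stuck when a timeout does not tear down the transport.

```python
import threading


class ToyTransport:
    """Toy stand-in for an RPC transport (hypothetical, not GlusterFS code)."""

    def __init__(self, disconnect_on_frame_timeout):
        self.disconnect_on_frame_timeout = disconnect_on_frame_timeout
        self.connected = True
        # Set only when the transport is actually torn down.
        self.disconnected = threading.Event()

    def frame_timed_out(self):
        """Handle one timed-out frame.

        Assumption for this model: waiters are only woken by a disconnect,
        so a timeout that leaves the transport up wakes nobody.
        """
        if self.disconnect_on_frame_timeout:
            self.connected = False
            self.disconnected.set()


def waiter_is_released(transport, patience_sec=0.05):
    """Return True if a waiter blocked on the disconnect event gets woken."""
    return transport.disconnected.wait(timeout=patience_sec)


if __name__ == "__main__":
    for patched in (False, True):
        transport = ToyTransport(disconnect_on_frame_timeout=patched)
        transport.frame_timed_out()
        print(f"patched={patched} released={waiter_is_released(transport)}")
```

In this model the "unpatched" transport never sets the event, so the waiter only returns because the sketch gives it a finite patience; a waiter with no timeout of its own would block indefinitely.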

Test Plan:
It's fairly difficult to write a good prove test for this since it requires human eyes to observe if the SHD is deadlocked (I'm open to ideas). Here's the repro though:
1. Create a 3x replicated cluster on a host.
2. Set the frame-timeout low (say 2 sec)
3. Down a brick, and write a pile of files (maybe 2000)
4. Bring up the downed brick and let the SHD begin healing files
5. During the heal process, hang one of the bricks with kill -STOP <pid of brick>

Without this patch, the SHD will be deadlocked even though the frame timed out after 2 seconds. With the patch, the plug is pulled on the transport, a disconnect is bubbled up to the syncop, and the SHD resumes.
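
The repro steps above can be laid out as a command list. The following is a dry-run sketch that only builds and prints the commands; it does not execute anything. The volume name, host name, brick directories and brick PID are hypothetical placeholders, and `network.frame-timeout` is assumed to be the volume option behind "frame-timeout". Taking a brick down and writing the files (step 3) is left as a manual step.

```python
import shlex


def repro_commands(volume, host, brick_dirs, stalled_brick_pid, frame_timeout_sec=2):
    """Build the CLI commands for the repro as argv lists (nothing is run).

    All arguments are placeholders supplied by the caller; the brick PID in
    particular is only known once the bricks are running.
    """
    bricks = [f"{host}:{d}" for d in brick_dirs]
    return [
        # Step 1: replicated volume with all bricks on one host.
        ["gluster", "volume", "create", volume,
         "replica", str(len(bricks)), *bricks, "force"],
        ["gluster", "volume", "start", volume],
        # Step 2: low frame timeout (option name is an assumption).
        ["gluster", "volume", "set", volume,
         "network.frame-timeout", str(frame_timeout_sec)],
        # Step 3 (down a brick, write ~2000 files) is done by hand.
        # Step 4: restart the downed brick so the SHD starts healing.
        ["gluster", "volume", "start", volume, "force"],
        # Step 5: hang one brick while the heal is in progress.
        ["kill", "-STOP", str(stalled_brick_pid)],
        # Observe whether heal activity continues or stalls.
        ["gluster", "volume", "heal", volume, "info"],
    ]


if __name__ == "__main__":
    cmds = repro_commands(
        volume="testvol",              # hypothetical volume name
        host="host1",                  # hypothetical host name
        brick_dirs=["/bricks/b1", "/bricks/b2", "/bricks/b3"],
        stalled_brick_pid=4242,        # hypothetical brick PID
    )
    for argv in cmds:
        print(shlex.join(argv))
```

Keeping the commands as data rather than running them makes it easy to review the sequence before trying it on a scratch machine.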

Comment 1 Worker Ant 2017-12-05 16:56:39 UTC

REVIEW: https://review.gluster.org/18929 (rpc: Prevent frame-timeouts from hanging syncops) posted (#1) for review on master by Amar Tumballi

Comment 2 Amar Tumballi 2018-10-12 17:42:51 UTC

Patch abandoned as not required!