Bug 1643231

Summary:	[RFE] enable ALUA support at the gluster handler
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Prasanna Kumar Kalever <prasanna.kalever>
Component:	tcmu-runner	Assignee:	Xiubo Li <xiubli>
Status:	CLOSED WONTFIX	QA Contact:	Prasanth <pprakash>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	ocs-3.11	CC:	kramdoss, ndevos, pasik, prasanna.kalever, rhs-bugs, xiubli
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	---	Flags:	xiubli: needinfo-
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-06 08:25:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1643195, 1761365
Bug Blocks:	1641915

Description Prasanna Kumar Kalever 2018-10-25 17:52:31 UTC

Description of problem:

Add ALUA support to the glfs handler of tcmu-runner.

ALUA is a must for gluster-block. Due to the design of LIO and tcmu, if one path is blocked for a long time (for example, because of a network problem), the IO requests on the client side will time out and the client will try to resend them through the other path. If the blocked path then recovers, it will continue to submit the old IO requests to the backend, which may overwrite newer data and corrupt it.

### When does ALUA help?

Let me explain this with two simple examples:
Example 1:
Say the initiator has sent a write request (let's call it cmd[0]) to tcmu-runner on node 1, which sends it down to glusterfs. For some reason cmd[0] is delayed at the glusterfs layer. In the meantime a path switch happens, the same cmd[0] is resent by the application through node 2, and gluster returns success. Now another write request (cmd[1]) arrives at the same offset as the previous command; assume cmd[1] also succeeds. What if the lingering cmd[0] that was sent through node 1 now goes into action? Corruption at that offset?

Example 2:
Say a write arrives at tcmu-runner (call it cmd[0]) and is delayed in the tcmu-runner layer for some reason (it has not been sent to gluster yet). If it is delayed too long, for example because of a resource crunch or a network disconnect between tcmu-runner and the initiator, there will be a path switch. Now consider the same case as before: cmd[0] is resent via path 2, and a new write cmd[1] arrives at the same offset via path 2. What happens if, after this, cmd[0] is issued from tcmu-runner on node 1 to glusterfs? Corruption?

Yes, this is what we want to solve with ALUA.
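
The ordering problem in both examples can be replayed with a small toy model. This is only a sketch to illustrate the race, not how tcmu-runner, LIO, or ALUA are implemented: the `Volume` class, the `owner` field, and the `replay` function are hypothetical names, and the single-owner check merely stands in for whatever path ownership ALUA would enforce.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Volume:
    """Toy stand-in for the gluster-backed block device (hypothetical)."""
    data: bytearray = field(default_factory=lambda: bytearray(8))
    # Hypothetical fencing state: the only node allowed to write.
    # None means no fencing at all (every path may write).
    owner: Optional[str] = None

    def write(self, node: str, offset: int, payload: bytes) -> bool:
        # With fencing enabled, reject writes from any node that is not the owner.
        if self.owner is not None and node != self.owner:
            return False
        self.data[offset:offset + len(payload)] = payload
        return True


def replay(volume: Volume) -> bytes:
    """Replay the sequence from the examples; return the final contents at offset 0."""
    # cmd[0] is stuck on node 1 (delayed in tcmu-runner or in glusterfs).
    stale_cmd0 = ("node1", 0, b"AAAA")

    # Path switch: if fencing is in use, ownership moves to node 2.
    if volume.owner is not None:
        volume.owner = "node2"

    # The initiator resends cmd[0] via node 2, then issues cmd[1] at the same offset.
    volume.write("node2", 0, b"AAAA")
    volume.write("node2", 0, b"BBBB")

    # Node 1 recovers and finally submits the stale cmd[0].
    volume.write(*stale_cmd0)
    return bytes(volume.data[:4])


unfenced = replay(Volume())               # no ownership check
fenced = replay(Volume(owner="node1"))    # node 1 owned the path before the switch

print(unfenced)  # b'AAAA': stale cmd[0] overwrote cmd[1]
print(fenced)    # b'BBBB': stale cmd[0] was rejected, cmd[1] survives
```

Without any ownership check the stale cmd[0] lands last and silently replaces the data written by cmd[1]; once only the current owner may write, the late command from node 1 is refused.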

Read more details at:
- https://github.com/gluster/glusterfs/issues/466#issuecomment-425428654
- https://github.com/gluster/gluster-block/issues/53#issuecomment-432924044

Comment 2 Niels de Vos 2019-02-07 11:18:53 UTC

What is the current status of this?

Bugs 1669500 and 1669984 have been reported for the multipath configuration and seem related. Is it expected that ALUA is configured already, without this BZ being addressed?

Comment 5 Prasanna Kumar Kalever 2020-02-28 06:59:52 UTC

Xiubo,

Please add the patch link and move this to POST.

thanks!