Bug 1473762 - [GSS] [RFE] Add Graceful brick shutdown for clients to reduce service loss during reboot.
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Duplicates: 1473759
Depends On:
Blocks:
 
Reported: 2017-07-21 14:46 UTC by Paul Armstrong
Modified: 2020-09-10 11:00 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-08 10:34:45 UTC
Embargoed:



Description Paul Armstrong 2017-07-21 14:46:58 UTC
Description of problem: The customer is working through patching and other maintenance scenarios. When a system is rebooted after patching, clients lose access to the volume for network.ping-timeout seconds. The customer wants to minimize this loss of availability during maintenance: leave the volume online while patches are applied, then reboot the system, have glusterd gracefully disconnect clients from the brick on shutdown, and initiate self-heal when the system comes back online.
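As an aside, one partial mitigation (not the graceful shutdown requested here) is to lower network.ping-timeout, which shortens the hang window after an unclean brick shutdown at the cost of more spurious reconnects under load. A minimal sketch, where "vol_name" and the 10-second value are placeholders:

```shell
# Inspect and lower network.ping-timeout for one volume.
# "vol_name" and TIMEOUT=10 are illustrative values, not recommendations.
VOL="vol_name"
TIMEOUT=10   # seconds; the glusterfs default is 42

if command -v gluster >/dev/null 2>&1; then
    gluster volume get "$VOL" network.ping-timeout
    gluster volume set "$VOL" network.ping-timeout "$TIMEOUT"
else
    echo "gluster CLI not found; commands shown for illustration"
fi
```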


Version-Release number of selected component (if applicable):
3.2 (mine)
3.3 (customer)

How reproducible:
Always.

Steps to Reproduce:
1. Run a client in test mode, writing to a replicated volume.
2. Patch one server and reboot it.


Actual results:
3. The client hangs for 42 seconds (the network.ping-timeout default).

Expected results:
3. The client continues writing to the remaining available replica.

Additional info:

Workaround:
- before reboot: pkill -f volume_path (kills the brick processes so clients see a clean disconnect)
- after reboot: gluster volume start vol_name force
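The workaround above can be sketched as a small pre/post-reboot helper. This is an illustration, not shipped tooling: VOLNAME, the /bricks/ path layout, and the DRY_RUN switch are assumptions that would need adjusting to the node's actual brick paths.

```shell
#!/bin/sh
# Sketch of the workaround: kill bricks cleanly before reboot,
# force-start and heal after. DRY_RUN=1 (default) only prints commands.
VOLNAME="${VOLNAME:-vol_name}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of executing unless DRY_RUN=0.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

pre_reboot() {
    # Kill this volume's brick processes so clients get a clean TCP
    # disconnect instead of waiting out network.ping-timeout.
    run pkill -f "/bricks/$VOLNAME"
}

post_reboot() {
    # Restart any bricks glusterd did not bring back, then heal.
    run gluster volume start "$VOLNAME" force
    run gluster volume heal "$VOLNAME"
}

case "${1:-}" in
    pre)  pre_reboot ;;
    post) post_reboot ;;
    *)    echo "usage: $0 pre|post" ;;
esac
```

Usage would be "DRY_RUN=0 VOLNAME=myvol sh helper.sh pre" before the reboot and "... post" afterwards.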

A better interface would be:
gluster volume maintenance vol_name on
gluster peer maintenance peer_name on

This would tell an individual volume, or all volumes on a node, to enter maintenance mode: when signalled, bricks gracefully terminate their connections and notify clients that they are unavailable.

gluster volume maintenance vol_name off
gluster peer maintenance peer_name off

This would instruct an individual volume, or all volumes on a node, to exit maintenance mode, force-start their bricks, and trigger self-heal.

The customer is opening a case.

Comment 2 Rejy M Cyriac 2017-07-26 07:50:39 UTC
*** Bug 1473759 has been marked as a duplicate of this bug. ***

Comment 6 Amar Tumballi 2018-10-30 06:18:08 UTC
Notice that the request came for a 'replica 2' volume, where service disruption is possible when one of the servers is taken down for maintenance.

As we have started to recommend 'replica 3' (or arbiter) volumes, this scenario should no longer arise with RHGS.

With this data point, I would like to CLOSE the bug as CANTFIX (for replica 2); the takeaway for GSS is that 'replica 3' or 'arbiter' (or, in future, thin-arbiter) volumes are the solution for this.

I will close the issue with this data point if there are no responses or disagreements with this update in the next 2 weeks.

