Bug 1396166 - self-heal info command hangs after triggering self-heal
Summary: self-heal info command hangs after triggering self-heal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-2 1351528 1398566 1398888
 
Reported: 2016-11-17 16:11 UTC by SATHEESARAN
Modified: 2017-03-23 06:19 UTC (History)
6 users

Fixed In Version: glusterfs-3.8.4-6
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1398566
Environment:
RHV-RHGS HCI
Last Closed: 2017-03-23 06:19:44 UTC
Embargoed:


Attachments
Client statedump taken from qemu process of VM1 using gdb (572.64 KB, application/x-gzip), 2016-11-18 06:14 UTC, SATHEESARAN
Client statedump taken from qemu process of VM2 using gdb (112.45 KB, application/x-gzip), 2016-11-18 06:14 UTC, SATHEESARAN
clients logs from VM1 (11.39 KB, text/plain), 2016-11-18 06:16 UTC, SATHEESARAN
client logs from VM2 (8.38 KB, text/plain), 2016-11-18 06:16 UTC, SATHEESARAN
brick1-statedump (15.42 KB, application/x-gzip), 2016-11-23 07:07 UTC, SATHEESARAN
brick2-statedump (11.67 KB, application/x-gzip), 2016-11-23 07:07 UTC, SATHEESARAN


Links
Red Hat Product Errata RHSA-2017:0486 (normal, SHIPPED_LIVE): Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update. Last updated: 2017-03-23 09:18:45 UTC

Description SATHEESARAN 2016-11-17 16:11:45 UTC
Description of problem:
------------------------
After issuing 'gluster volume heal', 'gluster volume heal info' hangs when compound-fops is enabled on the replica 3 volume

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHEL 7.3
RHGS 3.2.0 interim build ( glusterfs-3.8.4-5.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a replica 3 volume
2. Optimize the volume for VM store usecase
3. Enable compound-fops on the volume
4. Create a VM, and install OS
5. While OS installation is in progress, kill brick1 on server1
6. After VM installation is completed, bring the brick back up
7. Trigger self-heal on the volume
8. Get the self-heal info (see the command sketch below)
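
A minimal command sketch of the reproduction flow, assuming placeholder hostnames server1-3, a volume named 'vmstore', and the option name cluster.use-compound-fops; brick paths and the way the VM is provisioned will differ per setup:

    # Create and optimize the replica 3 volume (brick paths are placeholders)
    gluster volume create vmstore replica 3 server1:/rhgs/brick1/vmstore \
        server2:/rhgs/brick1/vmstore server3:/rhgs/brick1/vmstore
    gluster volume set vmstore group virt          # VM store tuning via the 'virt' option group
    gluster volume set vmstore cluster.use-compound-fops on
    gluster volume start vmstore

    # ... create a VM backed by this volume and start the OS installation ...

    # While the installation runs, kill brick1 on server1
    gluster volume status vmstore                  # note the brick PID
    kill <brick1-pid>                              # run on server1

    # After the installation completes, restart the brick and trigger heal
    gluster volume start vmstore force
    gluster volume heal vmstore
    gluster volume heal vmstore info               # this is the command that hangs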

Actual results:
---------------
self-heal info command is hung

Expected results:
-----------------
'self-heal info' should provide the correct information about un-synced entries

Additional info:
----------------
When compound-fops is disabled on the volume, this issue is not seen
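
For reference, compound FOPs are toggled with a regular volume-set command; a sketch assuming the option name cluster.use-compound-fops and a volume named 'vmstore':

    gluster volume set vmstore cluster.use-compound-fops off   # heal info works
    gluster volume set vmstore cluster.use-compound-fops on    # heal info hangs after 'gluster volume heal'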

Comment 2 SATHEESARAN 2016-11-17 16:25:50 UTC
I have tested this with qemu's native driver for glusterfs (which uses gfapi).
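
For context, a hedged example of how an image is accessed through qemu's native gluster driver (libgfapi) instead of a FUSE mount; the server, volume, and image names are placeholders:

    # qemu-img and qemu can address images directly via gluster:// URIs (libgfapi)
    qemu-img create -f qcow2 gluster://server1/vmstore/vm1.qcow2 20G
    qemu-img info gluster://server1/vmstore/vm1.qcow2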

Comment 3 SATHEESARAN 2016-11-18 06:14:08 UTC
Created attachment 1221739 [details]
Client statedump taken from qemu process of VM1 using gdb

Comment 4 SATHEESARAN 2016-11-18 06:14:29 UTC
Created attachment 1221740 [details]
Client statedump taken from qemu process of VM2 using gdb

Comment 5 SATHEESARAN 2016-11-18 06:16:21 UTC
Created attachment 1221741 [details]
clients logs from VM1

Comment 6 SATHEESARAN 2016-11-18 06:16:44 UTC
Created attachment 1221742 [details]
client logs from VM2

Comment 7 Krutika Dhananjay 2016-11-22 15:22:39 UTC
You do have the brick statedump too, don't you? Could you please attach those as well?

-Krutika

Comment 8 SATHEESARAN 2016-11-23 07:04:43 UTC
(In reply to Krutika Dhananjay from comment #7)
> You do have the brick statedump too, don't you? Could you please attach
> those as well?
> 
> -Krutika

Hi Krutika,

I have mistakenly re-provisioned my third server in the cluster to simulate a failed-node scenario.

But I have brick statedumps from server1 and server2. I will attach them.
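
For reference, brick statedumps are normally generated with the statedump CLI on the nodes that still host bricks; a minimal sketch, assuming a volume named 'vmstore' and the default dump location:

    gluster volume statedump vmstore      # run on server1 and server2
    ls /var/run/gluster/*.dump.*          # dump files land here unless server.statedump-path is changed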

Comment 9 SATHEESARAN 2016-11-23 07:07:11 UTC
Created attachment 1223015 [details]
brick1-statedump

Comment 10 SATHEESARAN 2016-11-23 07:07:48 UTC
Created attachment 1223016 [details]
brick2-statedump

Comment 11 Atin Mukherjee 2016-11-25 07:55:54 UTC
As per the triage, we all agree that this BZ has to be fixed in rhgs-3.2.0. Providing devel_ack.

Comment 12 Krutika Dhananjay 2016-11-25 10:39:30 UTC
patch on master posted for review at http://review.gluster.org/15929

Moving this bug to POST state.

Comment 13 Krutika Dhananjay 2016-11-28 05:19:29 UTC
https://code.engineering.redhat.com/gerrit/#/c/91332/1 <-- that's the downstream patch. Waiting on QE and PM ack before asking for it to be merged.

Comment 17 SATHEESARAN 2016-12-27 09:14:08 UTC
Tested with glusterfs-3.8.4-10.el7rhgs with the following steps:

1. Created replica 3 sharded volume with compound-fops enabled
2. Optimized the volume for VM store usecase and fuse mounted the volume on the hypervisor
3. Created a sparse image file for the VM and started the OS installation.
4. While VM installation is in progress, killed the first brick
5. After VM installation is completed, brought back the brick and initiated heal on that volume 'gluster volume heal <vol>'
6. Checked for heal status using 'gluster volume heal <vol> info'

'gluster volume heal info' listed the entries that were pending heal (see the sketch below for the equivalent commands)
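
A hedged sketch of the commands behind this verification run, assuming a volume named 'vmstore', the option names features.shard and cluster.use-compound-fops, and a FUSE mount point on the hypervisor:

    gluster volume set vmstore group virt
    gluster volume set vmstore features.shard on
    gluster volume set vmstore cluster.use-compound-fops on
    mount -t glusterfs server1:/vmstore /mnt/vmstore
    qemu-img create -f raw /mnt/vmstore/vm1.img 20G   # sparse image for the VM
    # ... kill brick1 during the OS install, then after it completes ...
    gluster volume start vmstore force
    gluster volume heal vmstore
    gluster volume heal vmstore info                  # now lists pending entries instead of hanging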

Comment 19 errata-xmlrpc 2017-03-23 06:19:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

