Bug 1399476
| Summary: | I/O hung while doing in-service update from 3.1.3 to 3.2 | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Byreddy <bsrirama> |
| Component: | write-behind | Assignee: | Raghavendra G <rgowdapp> |
| Status: | CLOSED ERRATA | QA Contact: | Byreddy <bsrirama> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.2 | CC: | amukherj, asrivast, ksubrahm, rcyriac, rhs-bugs, storage-qa-internal |
| Target Milestone: | --- | ||
| Target Release: | RHGS 3.2.0 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | glusterfs-3.8.4-7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-03-23 05:52:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1351528 | ||
|
Description
Byreddy
2016-11-29 06:41:50 UTC
Some more details:

```
[root@ ~]# lsof /mnt/
COMMAND   PID USER   FD   TYPE DEVICE  SIZE/OFF                 NODE NAME
bash    32183 root  cwd    DIR   0,20      4096                    1 /mnt
bash    32369 root  cwd    DIR   0,20      4096 10728560248349618169 /mnt/tmp
tar     32403 root  cwd    DIR   0,20      4096 10728560248349618169 /mnt/tmp
xz      32404 root  cwd    DIR   0,20      4096 10728560248349618169 /mnt/tmp
xz      32404 root   0r    REG   0,20  91976832 10482997327777766662 /mnt/tmp/linux-4.8.11.tar.xz

[root@ ~]# cat /proc/32404/stack
[<ffffffff811a490b>] pipe_wait+0x5b/0x80
[<ffffffff811a4caa>] pipe_write+0x37a/0x6b0
[<ffffffff8119996a>] do_sync_write+0xfa/0x140
[<ffffffff81199c68>] vfs_write+0xb8/0x1a0
[<ffffffff8119a7a1>] sys_write+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@ ~]# cat /proc/32403/stack
[<ffffffffa0259181>] wait_answer_interruptible+0x81/0xc0 [fuse]
[<ffffffffa025939b>] __fuse_request_send+0x1db/0x2b0 [fuse]
[<ffffffffa0259482>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0260176>] fuse_flush+0x106/0x140 [fuse]
[<ffffffff8119683c>] filp_close+0x3c/0x90
[<ffffffff81196935>] sys_close+0xa5/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
```

Given this has been hit one more time, it needs to be fixed: the severity is high because it impacts the upgrade path. Providing dev_ack.

Based on my observations and testing, this is a bug in 3.1.3. A fix that landed in 3.2.0 is missing from 3.1.3: [1] is the link to the upstream fix and [2] to the downstream one; the patch explains the scenario very well. I applied [1] on 3.1.3, tried to reproduce the issue with single and multiple clients, and upgraded the servers to the glusterfs-3.8.4-7.el6rhs.x86_64 build. I did not hit the issue in either case. [3] is the link to the custom build I used to try to reproduce the issue, which includes [1]. Untarring the Linux kernel took ~30 minutes in both cases.
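The second stack trace shows tar blocked in `fuse_flush`: on a FUSE mount, `close()` sends a FLUSH request to the userspace filesystem daemon and waits for its reply, so if the daemon (here the glusterfs client process, e.g. mid graph switch during the in-service update) never answers, the caller sits in `wait_answer_interruptible` forever. A minimal toy model of that blocking reply-wait (hypothetical class and method names, not GlusterFS or libfuse code):

```python
import threading

class FakeFuseChannel:
    """Toy model: close() sends FLUSH and blocks until the daemon replies."""

    def __init__(self):
        self._reply = threading.Event()

    def flush(self, timeout=None):
        # Kernel side: models __fuse_request_send() parking the caller
        # until the userspace daemon answers the FLUSH request.
        return self._reply.wait(timeout)

    def daemon_reply(self):
        # Userspace side: models the filesystem daemon answering FLUSH.
        self._reply.set()

ch = FakeFuseChannel()
# No reply from the daemon: the waiter times out (a real close() has no
# timeout and simply hangs, as in the stack trace above).
assert ch.flush(timeout=0.1) is False
# Once the daemon replies, the blocked close() returns promptly.
ch.daemon_reply()
assert ch.flush(timeout=0.1) is True
```

This only illustrates the waiting mechanism seen in the kernel stack; the actual root cause addressed by the patch lives in the write-behind translator on the Gluster side.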
Could you please try to reproduce the issue with [3] and confirm whether we hit this again or not.

[1] http://review.gluster.org/#/c/15579/
[2] https://code.engineering.redhat.com/gerrit/#/c/91956/
[3] https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12279228

Verified this issue multiple times, updating from 3.1.3 bits to 3.2.0 (glusterfs-3.8.4-10) and from glusterfs-3.8.4-7 to glusterfs-3.8.4-10. In both cases the update worked well and I did not see the reported issue. Moving to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days