Bug 1399476 - IO hung while doing in-service update from 3.1.3 to 3.2
Summary: IO hung while doing in-service update from 3.1.3 to 3.2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: write-behind
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Raghavendra G
QA Contact: Byreddy
URL:
Whiteboard:
Depends On:
Blocks: 1351528
 
Reported: 2016-11-29 06:41 UTC by Byreddy
Modified: 2023-09-14 03:35 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.8.4-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 05:52:00 UTC
Embargoed:


Attachments


Links:
System: Red Hat Product Errata
ID: RHSA-2017:0486
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description Byreddy 2016-11-29 06:41:50 UTC
Description of problem:
=======================
IO hung while doing an in-service update from RHGS 3.1.3 to 3.2.
When the IO hung, the server was on RHGS 3.2 bits and the client was still on 3.1.3 bits.


Some dev debug details from the live setup:
===========================================

[root@dhcp gluster]# cat /proc/6370/stack
[<ffffffff811a490b>] pipe_wait+0x5b/0x80
[<ffffffff811a4caa>] pipe_write+0x37a/0x6b0
[<ffffffff8119996a>] do_sync_write+0xfa/0x140
[<ffffffff81199c68>] vfs_write+0xb8/0x1a0
[<ffffffff8119a7a1>] sys_write+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@dhcp gluster]# cat /proc/6369/stack
[<ffffffffa02592ad>] __fuse_request_send+0xed/0x2b0 [fuse]
[<ffffffffa0259482>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0260176>] fuse_flush+0x106/0x140 [fuse]
[<ffffffff8119683c>] filp_close+0x3c/0x90
[<ffffffff81196935>] sys_close+0xa5/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@dhcp gluster]# lsof /mnt
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF                 NODE NAME
bash    6240 root  cwd    DIR   0,20     4096                    1 /mnt
tar     6369 root  cwd    DIR   0,20     4096                    1 /mnt
xz      6370 root  cwd    DIR   0,20     4096                    1 /mnt
xz      6370 root    0r   REG   0,20 91976832 11426561208144685685 /mnt/linux-4.8.11.tar.xz (deleted)
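
For reference, a minimal sketch of how these client-side diagnostics can be gathered on a hung FUSE mount. The mount path /mnt and the PIDs are taken from the output above; the statedump step is an additional assumption about a useful next step for a developer, not something recorded in this report.

# list the processes holding the mount busy
lsof /mnt

# dump the kernel stack of each stuck process to see where it is blocked
for pid in 6369 6370; do
    echo "=== $pid ==="
    cat /proc/$pid/stack
done

# optionally, ask the gluster FUSE client for a statedump of in-flight
# requests (glusterfs writes it under /var/run/gluster by default)
pkill -USR1 -f '/usr/sbin/glusterfs'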



Version-Release number of selected component (if applicable):
=============================================================
Server: glusterfs-3.8.4-5.el6rhs.x86_64
Client: glusterfs-3.7.9-12.el6.x86_64


How reproducible:
=================
One time


Steps to Reproduce:
====================
1. Do an in-service update of the servers from 3.1.3 to 3.2 while IO is running from the client (a rough sketch of the flow is given below).
2.
3.
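
For completeness, a rough sketch of the in-service update flow described above, assuming a replicate volume named "distrep" served from "server1" and an EL6 client; the volume name, hostname and the exact stop/kill sequence are assumptions based on the usual in-service update procedure, not taken from this report. The tarball name comes from the lsof output in the description.

# On the client (3.1.3 bits): keep IO running on the FUSE mount
mount -t glusterfs server1:/distrep /mnt
cd /mnt && tar -xJf linux-4.8.11.tar.xz &

# On each server, one node at a time (EL6, so service rather than systemctl):
service glusterd stop
pkill glusterfsd            # stop this node's brick processes
pkill glusterfs             # stop self-heal and other helper daemons
yum -y update 'glusterfs*'  # move this node from 3.1.3 to 3.2 bits
service glusterd start

# Wait for self-heal to finish before updating the next node
gluster volume heal distrep info

# The reported hang shows up as tar/xz stuck on the client while the
# servers are already on 3.2 and the client is still on 3.1.3.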

Actual results:
===============
IO hung.

Expected results:
=================
IOs should not hang.


Additional info:

Comment 5 Byreddy 2016-12-01 05:45:13 UTC
Some more details:

[root@ ~]# lsof /mnt/
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF                 NODE NAME
bash    32183 root  cwd    DIR   0,20     4096                    1 /mnt
bash    32369 root  cwd    DIR   0,20     4096 10728560248349618169 /mnt/tmp
tar     32403 root  cwd    DIR   0,20     4096 10728560248349618169 /mnt/tmp
xz      32404 root  cwd    DIR   0,20     4096 10728560248349618169 /mnt/tmp
xz      32404 root    0r   REG   0,20 91976832 10482997327777766662 /mnt/tmp/linux-4.8.11.tar.xz

[root@ ~]# cat /proc/32404/stack 
[<ffffffff811a490b>] pipe_wait+0x5b/0x80
[<ffffffff811a4caa>] pipe_write+0x37a/0x6b0
[<ffffffff8119996a>] do_sync_write+0xfa/0x140
[<ffffffff81199c68>] vfs_write+0xb8/0x1a0
[<ffffffff8119a7a1>] sys_write+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ ~]# 
[root@ ~]# cat /proc/32403/stack 
[<ffffffffa0259181>] wait_answer_interruptible+0x81/0xc0 [fuse]
[<ffffffffa025939b>] __fuse_request_send+0x1db/0x2b0 [fuse]
[<ffffffffa0259482>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0260176>] fuse_flush+0x106/0x140 [fuse]
[<ffffffff8119683c>] filp_close+0x3c/0x90
[<ffffffff81196935>] sys_close+0xa5/0x100
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@ ~]#

Comment 6 Atin Mukherjee 2016-12-01 10:05:52 UTC
Given this has been hit one more time, it needs to be fixed considering the severity, as it impacts the upgrade path. Providing dev_ack.

Comment 11 Karthik U S 2016-12-29 14:23:40 UTC
As per my observations and testing, this is a bug in 3.1.3: a fix that is present in 3.2.0 is missing in 3.1.3. [1] is the link to the upstream fix and [2] to the downstream one; the patch explains the scenario very well. I applied [1] on 3.1.3, upgraded the servers to the glusterfs-3.8.4-7.el6rhs.x86_64 build, and tried to reproduce the issue with both single and multiple clients. I did not hit the issue in either case.
[3] is the link to the custom build I used to try to reproduce the issue, which includes [1]. Untarring the Linux kernel took ~30 minutes in both cases. Could you please try to reproduce the issue with [3] and confirm whether we hit it again or not?

[1] http://review.gluster.org/#/c/15579/
[2] https://code.engineering.redhat.com/gerrit/#/c/91956/
[3] https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12279228
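
A minimal sketch of how the check with the custom build [3] could be run on the client; the hostname, volume name and mount point are assumptions carried over from the earlier sketch, and the ~30 minute untar time quoted above is the rough pass criterion.

# On the 3.1.3 client: install the custom glusterfs client packages built from [3]
rpm -Uvh glusterfs-*.rpm

# Remount the volume and rerun the workload that hung before
mount -t glusterfs server1:/distrep /mnt
cd /mnt
time tar -xJf linux-4.8.11.tar.xz   # expected to finish (~30 min), not hang

# Meanwhile, upgrade the servers to glusterfs-3.8.4-7.el6rhs as described above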

Comment 15 Byreddy 2017-01-04 05:49:08 UTC
Verified this issue multiple times, updating from 3.1.3 bits to 3.2.0 (glusterfs-3.8.4-10) and
from glusterfs-3.8.4-7 to glusterfs-3.8.4-10.

In both cases the update worked well and I did not see the reported issue.

Moving to verified state.

Comment 17 errata-xmlrpc 2017-03-23 05:52:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Comment 18 Red Hat Bugzilla 2023-09-14 03:35:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.

