Bug 1168873 - [USS]: If USS is disabled and then enabled, ls from the virtual snap world hangs
Summary: [USS]: If USS is disabled and then enabled, ls from the virtual snap world hangs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: snapshot
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Vijaikumar Mallikarjuna
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: USS
Depends On:
Blocks: 1162694
 
Reported: 2014-11-28 09:45 UTC by Rahul Hinduja
Modified: 2016-09-17 13:04 UTC
CC: 6 users

Fixed In Version: glusterfs-3.6.0.35
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-15 13:43:01 UTC
Embargoed:




Links
System ID: Red Hat Product Errata RHBA-2015:0038
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage 3.0 enhancement and bug fix update #3
Last Updated: 2015-01-15 18:35:28 UTC

Description Rahul Hinduja 2014-11-28 09:45:50 UTC
Description of problem:
=======================

If you are inside the virtual snap world (.snaps) and USS is disabled, ls from that directory errors out as expected. But if you then re-enable USS, which starts a new snapd process, and try to do ls from the same directory, it hangs. It should either re-establish the connection or gracefully error out.
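
For reference, a quick way to confirm whether the snapd daemon actually came back after re-enabling USS. This is only a hedged sketch: the volume name vol2 comes from this report, and the exact output of these commands may differ between builds.

# Run on any node of the cluster (e.g. inception):
gluster volume info vol2 | grep -i uss    # expect "features.uss: on" after re-enabling
gluster volume status vol2                # on some builds this also lists a Snapshot Daemon entry
ps -ef | grep '[s]napd'                   # the per-volume snapd process (a glusterfsd with "snapd" in its volfile-id) should be back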

USS is enabled and we are inside the virtual snap world from client:
====================================================================

[root@wingo vol2]# cd .snaps
[root@wingo .snaps]# ls
rs1  rs2
[root@wingo .snaps]# cd rs1
[root@wingo rs1]# ls
etc.1  etc.2
[root@wingo rs1]# cd etc.1
[root@wingo etc.1]#
[root@wingo etc.1]# pwd
/mnt/vol2/.snaps/rs1/etc.1
[root@wingo etc.1]#

Disable the USS and try to ls from inside, it errors as expected:
=================================================================

[root@inception ~]# gluster v set vol2 uss off
volume set: success

[root@wingo etc.1]# pwd
/mnt/vol2/.snaps/rs1/etc.1
[root@wingo etc.1]# ls
ls: cannot open directory .: No such file or directory
[root@wingo etc.1]# ls
ls: cannot open directory .: No such file or directory
[root@wingo etc.1]#

Re-enable the USS and try to do ls:
===================================

[root@inception ~]# gluster v set vol2 uss on
volume set: success
[root@inception ~]# 
[root@wingo etc.1]# ls
^C
^C^C
^C
^C


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.6.0.34-1.el6rhs.x86_64


How reproducible:
=================

always


Steps to Reproduce:
===================
1. Create a 4-node cluster
2. Create a 2x2 volume
3. Mount it on a client (FUSE and NFS)
4. From the FUSE mount, do cp -rf /etc etc.1
5. From the NFS mount, do cp -rf /etc etc.2
6. Create 2 snapshots, rs1 and rs2
7. Activate snapshots rs1 and rs2
8. Enable USS
9. cd into the virtual .snaps directory from the FUSE and NFS mounts:
From FUSE: cd .snaps/rs1/etc.1
From NFS: cd .snaps/rs2/etc.2
10. ls from both of the above directories; it should list the entries
11. Disable USS
12. ls again; it should error out
13. Enable USS
14. ls again (a scripted sketch of these steps follows the list)
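
For convenience, a scripted sketch of the steps above. The node names (server1..server4), brick path, and client mount points are placeholders; only the volume name, snapshot names, and the uss option come from this report.

# On the storage cluster (placeholder hostnames server1..server4):
gluster volume create vol2 replica 2 \
    server1:/bricks/vol2/b1 server2:/bricks/vol2/b1 \
    server3:/bricks/vol2/b1 server4:/bricks/vol2/b1   # 2x2 distributed-replicate
gluster volume start vol2

# On the client (placeholder mount points):
mount -t glusterfs server1:/vol2 /mnt/vol2            # FUSE mount
mount -t nfs -o vers=3 server1:/vol2 /mnt/vol2-nfs    # gluster NFS mount
cp -rf /etc /mnt/vol2/etc.1                           # data written via FUSE
cp -rf /etc /mnt/vol2-nfs/etc.2                       # data written via NFS

# Back on the cluster: snapshots and USS
gluster snapshot create rs1 vol2
gluster snapshot create rs2 vol2
gluster snapshot activate rs1
gluster snapshot activate rs2
gluster volume set vol2 uss on

# On the client: cd into .snaps/rs1/etc.1 (FUSE) and .snaps/rs2/etc.2 (NFS); ls works
gluster volume set vol2 uss off                       # ls now fails with ENOENT, as expected
gluster volume set vol2 uss on                        # ls from the same cwd hangs on the affected build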

Actual results:
===============

It hangs from both the FUSE and NFS mounts

Expected results:
=================

It should either re-establish the connection and list the entries, or it should gracefully error out

Comment 1 Rahul Hinduja 2014-11-28 09:48:10 UTC
nfs: server inception.lab.eng.blr.redhat.com not responding, still trying
INFO: task ls:6757 blocked for more than 120 seconds.
      Not tainted 2.6.32-504.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls            D 0000000000000000     0  6757   6339 0x00000084
 ffff880119e85a68 0000000000000082 0000000000000001 0000000000000246
 ffff880119e859f8 0000000000000206 ffff880119e859f8 ffffffff8109ece7
 ffff880119e85a08 ffff880119594d98 ffff8801196f65f8 ffff880119e85fd8
Call Trace:
 [<ffffffff8109ece7>] ? finish_wait+0x67/0x80
 [<ffffffff8109ee2e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa00e433d>] __fuse_request_send+0xed/0x2b0 [fuse]
 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00e4512>] fuse_request_send+0x12/0x20 [fuse]
 [<ffffffffa00e7ccc>] fuse_do_getattr+0x10c/0x2c0 [fuse]
 [<ffffffffa00e7efd>] fuse_update_attributes+0x7d/0x90 [fuse]
 [<ffffffffa00e9038>] fuse_permission+0xb8/0x1f0 [fuse]
 [<ffffffff8119e1e3>] __link_path_walk+0xb3/0x1000
 [<ffffffff8119c5b5>] ? path_init+0x185/0x250
 [<ffffffff8119f3ea>] path_walk+0x6a/0xe0
 [<ffffffff8119f5fb>] filename_lookup+0x6b/0xc0
 [<ffffffff8122d466>] ? security_file_alloc+0x16/0x20
 [<ffffffff811a0ad4>] do_filp_open+0x104/0xd20
 [<ffffffff811a3752>] ? vfs_ioctl+0x22/0xa0
 [<ffffffff81298eea>] ? strncpy_from_user+0x4a/0x90
 [<ffffffff811adf82>] ? alloc_fd+0x92/0x160
 [<ffffffff8118ae07>] do_sys_open+0x67/0x130
 [<ffffffff8118af10>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[root@wingo ~]#

Comment 4 Vijaikumar Mallikarjuna 2014-11-28 11:08:35 UTC
I am not able to re-create this problem in the latest downstream code

Comment 6 Vijaikumar Mallikarjuna 2014-12-02 13:49:56 UTC
We were not able to re-create this problem with the below setup:

Installed glusterfs-3.6.0.35
Created 4 node cluster
Created 2x2 volume
Followed the instruction mentioned in the description

Comment 7 Rahul Hinduja 2014-12-03 07:34:55 UTC
This issue is easily reproducible with build: glusterfs-3.6.0.34

With build glusterfs-3.6.0.35, after re-enabling USS, it errors out with "No such file or directory" and no hang is observed.

Comment 8 Vijaikumar Mallikarjuna 2014-12-03 09:28:42 UTC
Patch https://code.engineering.redhat.com/gerrit/#/c/37398/ has fixed this issue

Comment 11 errata-xmlrpc 2015-01-15 13:43:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html

