REVIEW: http://review.gluster.org/4868 (cluster/afr: Don't queue transactions during open-fd fix) posted (#1) for review on release-3.4 by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/4868 committed in release-3.4 by Vijay Bellur (vbellur)
------
commit 2c80052dbe5aca895a13597e36add51f796000e0
Author: Pranith Kumar K <pkarampu>
Date:   Wed Feb 20 09:53:41 2013 +0530

    cluster/afr: Don't queue transactions during open-fd fix

    Before anonymous fds were available, afr had to queue up transactions
    if the file was not opened on one of its subvolumes. Queuing continued
    until the attempt to open the file either succeeded or failed, and open
    attempts were repeated until the file was successfully opened on that
    subvolume. Now the client xlator uses anonymous fds to perform the fops
    when the fd used for the fop is not 'opened'. Fops succeed even when
    the file is not opened, so there is no need to queue up transactions in
    afr anymore. Open is still attempted, independent of the fop, on the
    subvolume where the file is not opened.

    Change-Id: I6d59293023e2de41c606395028c8980b83faca3f
    BUG: 953887
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/4868
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
Are there some other patches needed for this? If I'm correct, this patch should be included in GlusterFS 3.4 beta4. I'm hitting this issue with oVirt 3.3 (nightly) and GlusterFS 3.4.0beta4 when doing a rebalance. Self-heal alone seems to work fine.

My setup consists of two Gluster servers and two oVirt nodes. The Gluster volume has the following configuration:

Volume Name: ovirtsas
Type: Distributed-Replicate
Volume ID: 238fabac-911f-4283-b98a-a44e18beb02f
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: boar2:/gluster/sas/brick1/ovirtsas
Brick2: boar1:/gluster/sas/brick1/ovirtsas
Brick3: boar1:/gluster/sas/brick3/ovirtsas
Brick4: boar2:/gluster/sas/brick3/ovirtsas
Options Reconfigured:
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 10
performance.quick-read: off
performance.io-cache: off
performance.stat-prefetch: off
network.remote-dio: enable
performance.client-io-threads: enable

The brick filesystems are mounted with the following options:

/dev/mapper/sas--brick1-export1 on /gluster/sas/brick1 type xfs (rw,noatime,inode64,nobarrier)
/dev/mapper/sas--brick1-export3 on /gluster/sas/brick3 type xfs (rw,noatime,inode64,nobarrier)

The Gluster volume has been configured as a POSIX storage domain in oVirt with the background-qlen=32768 mount option.

Steps to reproduce (a rough command sketch follows this comment):
1. Create a distributed-replicated Gluster volume
2. Create and start some virtual machines with oVirt
3. Add new bricks to the Gluster volume
4. Start rebalance

Actual results:
- Virtual machines whose image file is being rebalanced are paused because of an unknown storage error. The VMs have to be shut down ungracefully before they can be started again.
- In some cases the mountpoint shows a "Transport endpoint is not connected" error and the oVirt node has to be rebooted before oVirt is able to connect to it again.

Expected results:
Rebalance doesn't affect running virtual machines.
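For reference, a minimal sketch of the commands behind steps 3 and 4, assuming the volume name from the output above; the report does not say which bricks were added in this run, so the add-brick paths here are illustrative only.

  # Add one more replica pair to the existing 2 x 2 volume
  # (brick paths are illustrative; the actual new bricks were not listed in this report)
  gluster volume add-brick ovirtsas \
      boar1:/gluster/sas/brick4/ovirtsas boar2:/gluster/sas/brick4/ovirtsas

  # Start the rebalance that coincides with the VM pauses, then watch its progress
  gluster volume rebalance ovirtsas start
  gluster volume rebalance ovirtsas status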
I'm still seeing this issue with GlusterFS 3.4 GA and oVirt 3.2.2 using fuse. All systems are running CentOS 6.4 with GlusterFS installed from http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/. Is anyone else able to reproduce this?
Samuli,
Could you provide the mount, rebalance, and brick logs of this test run? Just zip them and attach them to this bug. Let me take a look at the logs and we will take it from there.
Pranith.
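In case it helps, a rough sketch of how those logs are usually gathered, assuming the default /var/log/glusterfs location; the exact file names depend on the mount point and volume name.

  # On every server and client, grab the whole glusterfs log directory:
  # brick logs live under bricks/, the rebalance log is <volname>-rebalance.log,
  # and the fuse mount log is named after the mount point.
  tar czf gluster-logs-$(hostname).tar.gz /var/log/glusterfs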
Created attachment 777850 [details]
Gluster log files from servers and clients

Log files attached. Please let me know if you need anything else.
I retested this without the background-qlen mount option and the issue exists without it as well.
Samuli,
I shall get back to you on this by Tuesday.
Pranith.
Samuli,
We found log entries like the ones below, which indicate the VMs moving to a paused state. We have yet to root-cause what led to the file descriptors going bad. The root cause of this issue seems to be different from the one this bug was opened for, so I will open a new bug, add the logs you posted to it, and put you on its CC list. I will post my results in 2-3 days. I appreciate your help in providing the logs.
Pranith.

[2013-07-24 09:51:36.882031] W [client-rpc-fops.c:873:client3_3_writev_cbk] 1-hiomo1-dev1-sas1-client-4: remote operation failed: Bad file descriptor
[2013-07-24 09:51:36.882135] W [client-rpc-fops.c:873:client3_3_writev_cbk] 1-hiomo1-dev1-sas1-client-5: remote operation failed: Bad file descriptor
[2013-07-24 09:51:52.031444] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 571386: WRITE => -1 (Bad file descriptor)
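For anyone trying to spot the same failure in their own setup, a small sketch of the grep used to pull entries like these out of the logs, assuming the default /var/log/glusterfs location; log file names on a given node may differ.

  # Writes failing with EBADF in the fuse mount and client-side logs
  grep -n "Bad file descriptor" /var/log/glusterfs/*.log
  # Narrow down to the writev callbacks quoted above
  grep -n "client3_3_writev_cbk\|fuse_writev_cbk" /var/log/glusterfs/*.log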