Bug 1061580 - mount hangs, when all the bricks in replica group go down while writes are in progress
Summary: mount hangs, when all the bricks in replica group go down while writes are in...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Pranith Kumar K
QA Contact: SATHEESARAN
URL:
Whiteboard:
Duplicates: 1062256
Depends On:
Blocks: 1035040 1062256 1063190
 
Reported: 2014-02-05 06:45 UTC by SATHEESARAN
Modified: 2015-05-15 17:42 UTC
CC List: 14 users

Fixed In Version: glusterfs-3.6.0.4-1
Doc Type: Bug Fix
Doc Text:
When all the bricks in a replica group went down while writes were in progress on that replica group, the mount would sometimes hang because of stale structures that were not removed from a list. The stale structures are now removed from the list, fixing the issue.
Clone Of:
Clones: 1062256 1063190
Environment:
Last Closed: 2014-09-22 19:32:24 UTC
Embargoed:


Attachments
screenshot of appvm1 (39.94 KB, image/png) - 2014-02-05 07:06 UTC, SATHEESARAN
screenshot of appvm2 (40.86 KB, image/png) - 2014-02-05 07:07 UTC, SATHEESARAN
sosreport from RHSS Node1 (11.13 MB, application/x-xz) - 2014-02-05 07:10 UTC, SATHEESARAN
sosreport from RHSS Node2 (11.01 MB, application/x-xz) - 2014-02-05 07:13 UTC, SATHEESARAN
Fuse mount log files from Hypervisor (56.09 KB, text/x-log) - 2014-02-05 07:17 UTC, SATHEESARAN
tcpdump from hypervisor side (2.95 MB, application/octet-stream) - 2014-02-06 14:48 UTC, SATHEESARAN


Links
Red Hat Product Errata RHEA-2014:1278 (SHIPPED_LIVE) - Red Hat Storage Server 3.0 bug fix and enhancement update - last updated 2014-09-22 23:26:55 UTC

Description SATHEESARAN 2014-02-05 06:45:42 UTC
Description of problem:
-----------------------
When server-side quorum is not met, the App VMs are paused.
But when server-side quorum is met again, the volume continues to be "inaccessible".

"Inaccessible" here can be observed as follows:
1. From RHEVM, the data domain remains DOWN
2. From the hypervisor, the "mount" and "df" commands hang on the mount point

The volume is nevertheless reported as up from the RHSS nodes (gluster CLI)

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHEVM - IS33 ( 3.3.0-0.46.el6ev )
RHSS 2.1Update2 RC2 - glusterfs-3.4.0.58rhs.el6rhs

How reproducible:
-----------------
Happened once, on the first attempt.

Steps to Reproduce:
-------------------
0. Create a trusted storage pool with 2 RHSS nodes
1. Create a 3x2 distributed-replicate volume
2. Optimize the volume for 'virt-store' and start it (a consolidated sketch of steps 0-6 follows this list)
3. Use this volume as the data domain for a RHEV data center
4. Create App VMs with their disk images (of raw type) on this gluster-backed data domain
5. Install RHEL on all App VMs
6. Simulate running IO in the App VMs using the 'dd' command
7. Force-shutdown RHSS Node2
    NOTE: server quorum is no longer met once Node2 goes down;
    all the App VMs got 'paused'
8. After some time, bring back Node2, which was earlier powered off
9. Resume the App VMs
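
For reference, a minimal shell sketch of steps 0-6. The exact commands are an assumption; the IPs, brick paths and volume options are taken from the volume info captured later in this report, steps 3-5 are performed through the RHEVM UI, and the actual setup started as 2x2 and was later expanded to 3x2 via add-brick (see comment 2).

# Step 0: trusted storage pool of 2 RHSS nodes (run on Node1)
gluster peer probe 10.70.37.187

# Step 1: 3x2 distributed-replicate volume (created directly here for brevity)
gluster volume create dr-store replica 2 \
    10.70.37.86:/rhs/brick1/drs1 10.70.37.187:/rhs/brick1/dsr1 \
    10.70.37.86:/rhs/brick2/dsr2 10.70.37.187:/rhs/brick2/dsr2 \
    10.70.37.86:/rhs/brick3/dsr3 10.70.37.187:/rhs/brick3/dsr3

# Step 2: options equivalent to the 'virt-store' optimization
# (matching the "Options Reconfigured" output below), then start the volume
gluster volume set dr-store storage.owner-uid 36
gluster volume set dr-store storage.owner-gid 36
gluster volume set dr-store cluster.server-quorum-type server
gluster volume set dr-store cluster.quorum-type auto
gluster volume set dr-store network.remote-dio enable
gluster volume set dr-store cluster.eager-lock enable
gluster volume set dr-store performance.stat-prefetch off
gluster volume set dr-store performance.io-cache off
gluster volume set dr-store performance.read-ahead off
gluster volume set dr-store performance.quick-read off
gluster volume start dr-store

# Step 6: simulate IO inside each App VM (runs in the guest; the exact dd
# invocation used during testing is an assumption)
while true; do dd if=/dev/zero of=/home/testfile bs=1M count=1024 oflag=direct; done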

Actual results:
---------------
1. App VMs are resumed, but continuous error messages are seen in the guests:
INFO: task jbd2/dm-0-8:317 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

2. App VMs could not be reached through SSH

3. The storage domain (to be precise, the data domain) went DOWN;
no new images could be created on it.

4. "df -Th & mount" command on the mount point also hangs. 

Expected results:
------------------
Once quorum is met again, App VMs should resume their operations and there should not be any problem with their normal operation.

Additional info:
-----------------

1. Installation Info
======================
RHSS ISO  - RHSS 2.1Update2 RC2 - RHSS-2.1-20140130.0-RHS-x86_64-DVD1.iso
http://download.devel.redhat.com/composes/candidates/RHSS-2.1-20140130.0/2.1/RHS/x86_64/iso/RHSS-2.1-20140130.0-RHS-x86_64-DVD1.iso

2. RHSS Node Info
==================
There are 2 nodes in the 'Trusted Storage Pool' (gluster cluster)
These machines are VMs provisioned with 8GB of RAM and 4 VCPUs

RHSS Node1 - rhss1.lab.eng.blr.redhat.com (10.70.37.86)
RHSS Node2 - rhss2.lab.eng.blr.redhat.com (10.70.37.187)

All the commands below are executed from Node1, 10.70.37.86

3. RHSS Volume Info
====================
There are 2 volumes available - dr-store & repvol.
dr-store is the volume containing the virtual machine images.

[Tue Feb  4 21:36:20 UTC 2014 root.37.86:~ ] # gluster v i dr-store
 
Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Brick5: 10.70.37.86:/rhs/brick3/dsr3
Brick6: 10.70.37.187:/rhs/brick3/dsr3
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off

[Tue Feb  4 21:52:25 UTC 2014 root.37.86:~ ] # gluster volume status dr-store
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

4. Hypervisor(client) Info
==========================
Hypervisor used : rhs-client15.lab.eng.blr.redhat.com ( 10.70.36.39 )

mount point : /rhev/data-center/mnt/glusterSD/10.70.37.198:dr-store

5. Other relevant Info
=======================
This volume was created as a 2x2 volume; a pair of bricks was later added to make it 3x2. No rebalance has been done yet.

Comment 1 SATHEESARAN 2014-02-05 06:50:38 UTC
Consequence on App VM
=====================
1. The VM is not functional; could not log in to the VM using ssh
2. The storage domain was DOWN with the error:
"Failed to Reconstruct Master domain for Data Center virt-dc."
3. No new disk images can be created, as the domain is not available

Comment 2 SATHEESARAN 2014-02-05 06:55:14 UTC
Console logs as captured from RHSS Node1 - 10.70.37.86
=======================================================

[Tue Feb  4 20:58:28 UTC 2014 root.37.86:~ ] # gluster v i
 
Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
 
Volume Name: repvol
Type: Replicate
Volume ID: 419bdc4c-2802-4eb4-96d5-b71c1da2fc2c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick4/repdir1
Brick2: 10.70.37.187:/rhs/brick4/repdir1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Tue Feb  4 20:58:33 UTC 2014 root.37.86:~ ] # gluster volume add-brick dr-store 10.70.37.86:/rhs/brick3/dsr3 10.70.37.187:/rhs/brick3/dsr3
volume add-brick: success

<<<<<< Forced power-off the second RHSS Node here >>>>>>>>

[Tue Feb  4 20:59:40 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[Tue Feb  4 21:36:20 UTC 2014 root.37.86:~ ] # gluster v i dr-store
 
Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Brick5: 10.70.37.86:/rhs/brick3/dsr3
Brick6: 10.70.37.187:/rhs/brick3/dsr3
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Tue Feb  4 21:37:02 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[Tue Feb  4 21:48:03 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
 
Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[Tue Feb  4 21:48:09 UTC 2014 root.37.86:~ ] # gluster pe s
Number of Peers: 1

Hostname: 10.70.37.187
Uuid: 28eef69b-10d6-4b9d-9103-844eb46b357c
State: Peer in Cluster (Disconnected)

[Tue Feb  4 21:48:18 UTC 2014 root.37.86:~ ] # gluster peer probe 10.70.37.198
peer probe: failed: Cluster quorum is not met. Changing peers is not allowed in this state

<<<< Powered on the second RHSS Node here>>>>

[Tue Feb  4 21:52:25 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   49154   Y       21831
Brick 10.70.37.187:/rhs/brick4/repdir1                  49154   Y       2210
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238
 
Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[Tue Feb  4 22:15:37 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238
 
Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   49154   Y       21831
Brick 10.70.37.187:/rhs/brick4/repdir1                  49154   Y       2210
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238
 
Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

Comment 3 SATHEESARAN 2014-02-05 07:05:31 UTC
Error messages from fuse mount log:

<snip1>
[2014-02-04 21:56:39.522835] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.522869] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070203: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.536886] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-1: failing writev due to lack of quorum
[2014-02-04 21:56:39.536913] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070207: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.726681] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.726711] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070250: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.838559] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.838589] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070252: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:46.903560] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 1-dr-store-client-2: changing port to 49153 (from 0)
[2014-02-04 21:56:46.908532] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 1-dr-store-client-4: changing port to 49155 (from 0)
</snip1>

<snip2>
2014-02-04 22:17:24.095650] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-1: server 10.70.37.187:49152 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:17:47.099842] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-2: server 10.70.37.86:49153 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:17:48.099986] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-0: server 10.70.37.86:49152 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:46:42.227994] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-1: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x234866x sent = 2014-02-04 22:16:38.497154. timeout = 1800 for 10.70.37.187:49152
[2014-02-04 22:46:42.228049] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-1: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:42.228069] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-1. Transaction was WRITE
[2014-02-04 22:46:48.228531] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-0: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x235871x sent = 2014-02-04 22:16:38.497142. timeout = 1800 for 10.70.37.86:49152
[2014-02-04 22:46:48.228592] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-0: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:48.228612] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-0. Transaction was WRITE
</snip2>

Comment 4 SATHEESARAN 2014-02-05 07:06:44 UTC
Created attachment 859496 [details]
screenshot of appvm1

This screenshot shows the state of appvm1

Comment 5 SATHEESARAN 2014-02-05 07:07:33 UTC
Created attachment 859498 [details]
screenshot of appvm2

This screenshot shows the state of appvm2

Comment 6 SATHEESARAN 2014-02-05 07:10:14 UTC
Created attachment 859499 [details]
sosreport from RHSS Node1

Attached the sosreport from RHSS Node1 - 10.70.37.86

Comment 7 SATHEESARAN 2014-02-05 07:13:25 UTC
Created attachment 859501 [details]
sosreport from RHSS Node2

Attached the sosreport from RHSS Node2 - 10.70.37.187

Comment 8 SATHEESARAN 2014-02-05 07:16:45 UTC
Running sosreport on the hypervisor hung, just like the "mount" and "df" commands.

So I am attaching the fuse mount logs separately.
Hopefully that is what is most needed from the hypervisor.

Comment 9 SATHEESARAN 2014-02-05 07:17:30 UTC
Created attachment 859502 [details]
Fuse mount log files from Hypervisor

Attaching the fuse mount log as obtained from the hypervisor

Comment 10 SATHEESARAN 2014-02-05 07:28:19 UTC
It is expected that an RHSS node can go down in a cluster.
If the cluster has a limited number of nodes, say 2 or 4, there is a high possibility that server quorum is not met.

In those cases, App VMs go to the paused state and are unable to resume normal operation even after quorum is met again.
No new VMs can be created with their disk images on this volume either.

App VMs should not be affected across the transition from "quorum not met" back to "quorum met".

Considering all this, moving the severity to 'Urgent'

Comment 11 SATHEESARAN 2014-02-05 08:31:01 UTC
All operations on the mount point are hung.

The problem is suspected to be with the brick processes (as Pranith observed initially).
Restarting the volume could be a reasonable workaround,
but I haven't done that, as Pranith wanted to retain the setup.

Comment 12 SATHEESARAN 2014-02-05 11:41:36 UTC
I repeated the same test with the same version of gluster and could still hit this problem.

The only difference I see now is that the App VMs come back UP automatically after quorum is met.
They resumed the previously paused IO for a minute or so.
After a few minutes (probably 1 minute or more), the storage domain went down again.

I see the same error messages in the App VMs, and the "df" command hung.

It is consistently reproducible.

Comment 13 SATHEESARAN 2014-02-05 14:49:28 UTC
On Pranith's request, I tried the same test on glusterfs-3.4.0.56rhs-1 and this issue was not seen.

As of now, something appears to be going wrong with the FSYNC fop, and this issue was not seen in glusterfs-3.4.0.56rhs-1, as shown below:

<<<< Filtering for FSYNC related errors with the glusterfs-3.4.0.58rhs-1.el6rhs
<<<<< from fuse mount log

[root@rhs-client15 glusterfs]# grep -i fsync rhev-data-center-mnt-10.70.37.86:dr-store.log

[2014-02-04 21:36:08.678370] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x7fa34cfca524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa34cfca063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa34cfc9f7e]))) 1-dr-store-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2014-02-04 21:35:00.542502 (xid=0x118726x)
[2014-02-04 21:36:08.678385] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-3: remote operation failed: Transport endpoint is not connected
[2014-02-04 21:36:08.678409] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-1: fsync(b31d8277-fae7-48c7-a377-1b8fc817a6eb) failed on subvolume dr-store-client-3. Transaction was WRITE
[2014-02-04 22:46:42.227994] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-1: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x234866x sent = 2014-02-04 22:16:38.497154. timeout = 1800 for 10.70.37.187:49152
[2014-02-04 22:46:42.228049] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-1: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:42.228069] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-1. Transaction was WRITE
[2014-02-04 22:46:48.228531] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-0: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x235871x sent = 2014-02-04 22:16:38.497142. timeout = 1800 for 10.70.37.86:49152
[2014-02-04 22:46:48.228592] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-0: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:48.228612] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-0. Transaction was WRITE

<<<< Filtering for FSYNC related errors with the glusterfs-3.4.0.56rhs-1.el6rhs
<<<<< from fuse mount log

[root@rhs-client15 glusterfs]# grep -i fsync rhev-data-center-mnt-10.70.40.116:drvol.log
[root@rhs-client15 glusterfs]#

Comment 16 SATHEESARAN 2014-02-06 05:39:39 UTC
As far as I can see, the only workaround is the following (a consolidated sketch follows the steps):

1. Identify the mount process
(i.e.) ps aux | grep gluster

2. Kill the appropriate mount process
(i.e.) kill -9 <pid>

3. Lazy-unmount the fuse mount
(i.e.) umount -l <fuse-mount-by-rhev>
The existing App VMs (though hung, they are shown as UP in the RHEVM UI) will go to the paused state

4. Activate the storage domain from the RHEVM UI
The fuse mount happens again at this point

5. Force shutdown all existing paused App VMs

6. Restart the existing App VMs
New App VMs with disk images on this storage domain can also be created now
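
Consolidated, the workaround amounts to something like the following (a hedged sketch; the volume name and mount path are taken from the console log below, and the storage domain activation itself is done from the RHEVM UI):

# Steps 1-2: find and kill the hung FUSE mount process for this volume
pid=$(pgrep -f 'glusterfs.*test-drvol')
kill -9 "$pid"

# Step 3: lazily unmount the stale FUSE mount used by RHEV
umount -l /rhev/data-center/mnt/10.70.37.46:test-drvol

# Step 4: activate the storage domain from the RHEVM UI; the volume is
# re-mounted, and df should now list the gluster mount instead of hanging
df -Th

# Steps 5-6: force shutdown the paused App VMs and restart them (RHEVM UI)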

I performed the above steps and resumed the storage domain operations.
The logs captured from the hypervisor are below:

Console logs on the hypervisor
==============================
[root@rhs-client10 ~]# ps aux | grep gluster
root     11252  183  0.3 310444 50088 ?        R<sl Feb05 2141:50 /usr/sbin/glusterfs --volfile-server=10.70.37.46 --volfile-id=test-drvol /rhev/data-center/mnt/10.70.37.46:test-drvol
root     23481  0.0  0.0 103256   856 pts/6    S+   05:47   0:00 grep gluster

[root@rhs-client10 ~]# kill -9 11252

<<<< df command was not hanging now >>
[root@rhs-client10 ~]# df -Th
df: `/rhev/data-center/mnt/10.70.37.46:test-drvol': Transport endpoint is not connected
Filesystem     Type   Size  Used Avail Use% Mounted on
/dev/sda2      ext4    99G  2.4G   92G   3% /
tmpfs          tmpfs  7.8G  8.0K  7.8G   1% /dev/shm
/dev/sda1      ext4  1008M   89M  868M  10% /boot

[root@rhs-client10 ~]# umount /rhev/data-center/mnt/10.70.37.46:test-drvol
umount: /rhev/data-center/mnt/10.70.37.46:test-drvol: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[root@rhs-client10 ~]# ps aux | grep gluster
root     24461  0.0  0.0 103252   852 pts/6    S+   05:48   0:00 grep gluster

[root@rhs-client10 ~]# umount -l /rhev/data-center/mnt/10.70.37.46:test-drvol
[root@rhs-client10 ~]#

<<<< after activation of storage domain from RHEVM UI>>>>>
[root@rhs-client10 ~]# df -Th
Filesystem             Type            Size  Used Avail Use% Mounted on
/dev/sda2              ext4             99G  2.4G   92G   3% /
tmpfs                  tmpfs           7.8G  8.0K  7.8G   1% /dev/shm
/dev/sda1              ext4           1008M   89M  868M  10% /boot
10.70.37.46:test-drvol fuse.glusterfs  200G  8.0G  192G   4% /rhev/data-center/mnt/10.70.37.46:test-drvol

Comment 19 SATHEESARAN 2014-02-06 14:36:02 UTC
Continued the test with RHEL 6.5 and found the same issue again.

In one of my attempts, I restored quorum quickly (within 5 minutes of it being lost) by bringing up the RHSS node, and all VM instances were restored normally. Probably no timer expired? (not sure)

But when there is a long enough interval (say 30 minutes or more) before restoring quorum, I see this error consistently.

In all the tests I did, the App VMs return to normal immediately after quorum is restored, and after some time the errors appear.

In all the test scenarios, I consistently see this error message in the logs:

<snip>
[2014-02-06 19:37:22.151422] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-1: server 10.70.37.187:49156 has not responded in the last 42 seconds, disconnecting.
[2014-02-06 19:37:27.155575] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-0: server 10.70.37.86:49156 has not responded in the last 42 seconds, disconnecting.
</snip>

Why is the disconnect happening after quorum is restored?

I have captured the relevant part of the fuse mount log, from the point where server-side quorum is restored. Hopefully this helps in understanding the operations.

Bringing up the node now
=========================
[2014-02-06 18:23:42.776185] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207291: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:32:22.296254] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207333: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:41:04.518361] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207375: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:50:00.035737] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207417: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:58:37.462267] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207459: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:07:19.194087] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207501: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:15:50.710956] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207543: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:24:32.568873] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207585: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:30:17.539664] E [client-handshake.c:1778:client_query_portmap_cbk] 0-cvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2014-02-06 19:30:17.539745] I [client.c:2103:client_rpc_notify] 0-cvol-client-1: disconnected from 10.70.37.187:24007. Client process will keep trying to connect to glusterd until brick's port is available. 
[2014-02-06 19:30:17.540290] E [client-handshake.c:1778:client_query_portmap_cbk] 0-cvol-client-3: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2014-02-06 19:30:17.540341] I [client.c:2103:client_rpc_notify] 0-cvol-client-3: disconnected from 10.70.37.187:24007. Client process will keep trying to connect to glusterd until brick's port is available. 
[2014-02-06 19:30:18.091145] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-3: changing port to 49157 (from 0)
[2014-02-06 19:30:18.094941] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-3: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:18.097395] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-3: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:18.097417] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-3: Connected to 10.70.37.187:49157, attached to remote volume '/rhs/brick2/dir2'.
[2014-02-06 19:30:18.097426] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:18.097434] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-3: 2 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:18.101298] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:18.101337] I [afr-common.c:3988:afr_notify] 0-cvol-replicate-1: Subvolume 'cvol-client-3' came back up; going online.
[2014-02-06 19:30:18.101591] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-3: Server lk version = 1
[2014-02-06 19:30:19.095146] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-1: changing port to 49156 (from 0)
[2014-02-06 19:30:19.098860] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:19.101509] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-1: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:19.101535] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-1: Connected to 10.70.37.187:49156, attached to remote volume '/rhs/brick1/dir1'.
[2014-02-06 19:30:19.101544] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:19.101552] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-1: 1 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:19.102325] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:19.102363] I [afr-common.c:3988:afr_notify] 0-cvol-replicate-0: Subvolume 'cvol-client-1' came back up; going online.
[2014-02-06 19:30:19.102554] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-1: Server lk version = 1
[2014-02-06 19:30:20.101868] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-2: changing port to 49157 (from 0)
[2014-02-06 19:30:20.107270] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:20.107339] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-0: changing port to 49156 (from 0)
[2014-02-06 19:30:20.110192] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-2: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:20.110215] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-2: Connected to 10.70.37.86:49157, attached to remote volume '/rhs/brick2/dir2'.
[2014-02-06 19:30:20.110225] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:20.110233] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-2: 2 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:20.110657] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:20.110881] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:20.111031] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-0: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:20.111071] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-0: Connected to 10.70.37.86:49156, attached to remote volume '/rhs/brick1/dir1'.
[2014-02-06 19:30:20.111083] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:20.111110] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-0: 1 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:20.111194] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-2: Server lk version = 1
[2014-02-06 19:30:20.111463] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:20.111656] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-0: Server lk version = 1
[2014-02-06 19:37:22.151422] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-1: server 10.70.37.187:49156 has not responded in the last 42 seconds, disconnecting.
[2014-02-06 19:37:27.155575] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-0: server 10.70.37.86:49156 has not responded in the last 42 seconds, disconnecting.

Comment 20 SATHEESARAN 2014-02-06 14:48:19 UTC
Created attachment 860207 [details]
tcpdump from hypervisor side

Took a tcpdump from the hypervisor side; analysing this dump with a network packet analyser like Wireshark could help identify which transaction actually failed. A hedged example of how such a capture can be taken follows.
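
The capture could have been taken with something like the following sketch; the interface name is an assumption, the node IPs come from comment 0, and the Wireshark filter is only an example:

# Capture gluster traffic between the hypervisor and the two RHSS nodes.
# Port 24007 is glusterd; 49152 and above are the brick ports seen in
# 'gluster volume status' earlier in this report.
tcpdump -i eth0 -s 0 -w /tmp/hypervisor-gluster.pcap \
    host 10.70.37.86 or host 10.70.37.187

# Open the dump in Wireshark and filter, e.g.:
#   tcp.port == 24007 || (tcp.port >= 49152 && tcp.port <= 49160)
# to follow the RPC conversations with glusterd and the bricks.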

Comment 21 krishnan parthasarathi 2014-02-07 12:13:05 UTC
When the problem was recreated, we observed that the event_dispatch_epoll() thread is blocked on a lock, which is held by the FUSE thread which is looping infinitely over the list of eager locked transactions on an fd opened on behalf of one of the VMs on the volume. This appears to be an issue with eager lock management in afr xlator. Further analysis is required to identify what might have caused this to happen.

We also know that this issue is not observed on 3.4.0.56rhs rpms. We are looking into the patches that went in after 56 and trying to identify what might have caused this.

Comment 22 Pranith Kumar K 2014-02-09 01:51:30 UTC
I don't think any patch between 3.4.0.56rhs and 3.4.0.58rhs could have caused this.

Satheesaran,
      If I remember correctly, we brought up the VMs quickly (definitely within 5 to 10 minutes) on 3.4.0.56rhs the last time we tried. Did you get a chance to re-create the bug again and wait for at least 30 minutes with 3.4.0.56rhs as well?

Krishnan,
      Did you get a chance to observe whether write-behind was pumping writes to the afr xlator when you saw this issue?

Pranith.

Comment 23 SATHEESARAN 2014-02-10 05:44:50 UTC
(In reply to Pranith Kumar K from comment #22)
> I don't think any patch between 3.4.0.56rhs and 3.4.0.58rhs could have
> caused this.
> 
> Satheesaran,
>       If I remember correctly, we brought up the VMs quickly (definitely
> within 5 to 10 minutes) on 3.4.0.56rhs the last time we tried. Did you get a
> chance to re-create the bug again and wait for at least 30 minutes with
> 3.4.0.56rhs as well?

Pranith,

I retried the same scenario with glusterfs-3.4.0.56rhs and allowed a sufficient time interval.

I am not seeing any issues with the 56rhs build.
I have kept the 56rhs setup for you to investigate.

Comment 24 Pranith Kumar K 2014-02-10 09:59:16 UTC
I was able to figure out why there can be an infinite loop in the afr eager-lock functionality. When both bricks go down while a 'finodelk' for the eager-lock is in progress and that finodelk fails, a stale stub is left in the list of stubs that share the eager-lock. The stale stub's memory is re-used by new frames drawn from the same memory pool, so list traversal through the stale stub leads to undefined behavior, which can include infinite loops. I attached the process to gdb and confirmed that the case described above leads to stale stubs. I sent the fix for this bug upstream at http://review.gluster.org/6944.

We confirmed that the following log line, "[2014-02-04 21:36:12.702263] I [afr-transaction.c:1119:afr_post_blocking_inodelk_cbk] 1-dr-store-replicate-0: Blocking inodelks failed", is present on the setup where the issue is seen and not present on the setup (the one with 3.4.0.56rhs) where the issue is not seen.

This is not a regression; this bug must have been there since bigbend.

Thanks to vijay, kp and satheesaran for helping me find the root cause.

I am not sure whether there is more than one issue in this bug; for example, the reboot we saw on one of the attempts to re-create the issue. But for the infinite loop we saw, we have the fix.

Easier steps to re-create the bug (a hedged command sketch follows below):
1) Create a replicate volume and disable all performance xlators.
2) Open a bash fd on one of the files on the mount using "exec fd-num>file-name".
3) Start performing writes on that fd in a while loop in one window.
4) Attach the mount process to gdb and put a breakpoint just before afr sends the finodelk.
5) Once the breakpoint is hit, stop the volume. This hits the error condition that leads to the infinite loop.
6) Restart the volume and wait for some time; df will hang.
I re-created this issue in front of satheesaran, and with the fix it is not seen any more.
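
A hedged sketch of the reproduction steps above; the volume name, server names, file name and mount point are placeholders, and the breakpoint symbol is an assumption (pick whichever function issues the eager-lock finodelk in your build):

# Step 1: replicate volume with performance xlators disabled
gluster volume create repvol replica 2 server1:/rhs/brick1/dir server2:/rhs/brick1/dir
gluster volume set repvol performance.quick-read off
gluster volume set repvol performance.io-cache off
gluster volume set repvol performance.read-ahead off
gluster volume set repvol performance.stat-prefetch off
gluster volume set repvol performance.write-behind off
gluster volume start repvol

# On the client: mount, open a long-lived bash fd and keep writing to it
mount -t glusterfs server1:/repvol /mnt/repvol
exec 5>/mnt/repvol/testfile                # step 2: bash fd on a file
while true; do echo data >&5; done &       # step 3: continuous writes

# Step 4: attach the mount process to gdb and break just before afr sends the
# finodelk; 'client3_3_finodelk' is an assumed symbol, adjust for your build.
gdb -p "$(pgrep -f 'glusterfs.*repvol')" \
    -ex 'break client3_3_finodelk' -ex 'continue'

# Step 5: while the breakpoint is hit, on the server: gluster volume stop repvol
# Step 6: gluster volume start repvol, wait a while, then 'df' on the client hangs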

Comment 25 Pranith Kumar K 2014-02-10 10:01:54 UTC
Forgot to add one more point about the bug:
For this issue to be hit, both bricks need to go down just before the finodelk is taken. Server-quorum loss was just one way of simulating that. We can hit this bug even without any quorum involvement, simply by stopping the volume just before the finodelk.

Comment 26 SATHEESARAN 2014-02-11 09:29:58 UTC
As per comment 25, this bug is not specific to server quorum, so the summary is being modified accordingly.

Comment 27 Vivek Agarwal 2014-02-20 08:36:26 UTC
adding 3.0 flag and removing 2.1.z

Comment 29 Nagaprasad Sathyanarayana 2014-05-19 10:56:23 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 30 SATHEESARAN 2014-05-30 06:26:25 UTC
*** Bug 1062256 has been marked as a duplicate of this bug. ***

Comment 31 SATHEESARAN 2014-07-14 09:37:23 UTC
Tested with glusterfs-3.6.0.24-1.el6rhs, following the steps in comment 0; the mount point no longer hangs once the client-side quorum is restored.

Comment 33 errata-xmlrpc 2014-09-22 19:32:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

