Description of problem:
-----------------------
When server-side quorum is not met, App VMs are paused. But when server-side quorum is met again, the volume continues to be "inaccessible". "Inaccessible" here can be noticed by the following:
1. From RHEV-M, the data domain remains DOWN.
2. From the hypervisor, the "mount" and "df" commands hang on the mount point.
The volume is nevertheless seen as up from the RHSS nodes (gluster CLI).

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHEVM - IS33 ( 3.3.0-0.46.el6ev )
RHSS 2.1 Update2 RC2 - glusterfs-3.4.0.58rhs.el6rhs

How reproducible:
-----------------
Happened once in the first attempt.

Steps to Reproduce:
-------------------
(A hedged command sketch for steps 1-3 appears after the node info below.)
0. Create a trusted storage pool with 2 RHSS nodes.
1. Create a distributed-replicate volume of 3X2.
2. Optimize the volume for 'virt-store' and start the volume.
3. Use this volume as the data domain for a RHEV data center.
4. Create App VMs with their disk images (of raw type) on this data domain backed by the gluster volume.
5. Install all App VMs with a RHEL OS.
6. Simulate running IO in the App VMs using the 'dd' command.
7. Force shutdown RHSS NODE2.
   NOTE: server quorum is not met when NODE2 goes down; all the App VMs got 'paused'.
8. After some time, bring back NODE2 that was earlier powered off.
9. Resume the App VMs.

Actual results:
---------------
1. App VMs are resumed but continuous error messages are seen:
   INFO: task jbd2/dm-0-8:317 blocked for more than 120 seconds.
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2. App VMs could not be reached through SSH.
3. The storage domain (to be precise, the data domain) went DOWN; no new images could be created on it.
4. The "df -Th" and "mount" commands on the mount point also hang.

Expected results:
------------------
Once quorum is met again, App VMs should resume their operations and there should not be any problem with their normal operations.

Additional info:
-----------------
1. Installation Info
======================
RHSS ISO - RHSS 2.1 Update2 RC2 - RHSS-2.1-20140130.0-RHS-x86_64-DVD1.iso
http://download.devel.redhat.com/composes/candidates/RHSS-2.1-20140130.0/2.1/RHS/x86_64/iso/RHSS-2.1-20140130.0-RHS-x86_64-DVD1.iso

2. RHSS Node Info
==================
There are 2 nodes in the 'Trusted Storage Pool' (gluster cluster).
These machines are VMs provisioned with 8GB of RAM and 4 VCPUs.
RHSS Node1 - rhss1.lab.eng.blr.redhat.com (10.70.37.86)
RHSS Node2 - rhss2.lab.eng.blr.redhat.com (10.70.37.187)
All commands were executed from Node1, 10.70.37.86.
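A minimal sketch of steps 1-3 above (volume creation and virt-store tuning), assuming the brick paths listed in the volume info below. Note this shows a single 3X2 create for brevity, whereas on this setup the volume was created as 2X2 and expanded later; the 'group virt' option file shipped with RHSS applies the quorum/eager-lock/remote-dio settings seen in the volume info.

# Run on RHSS Node1 after peer-probing Node2 (step 0)
gluster volume create dr-store replica 2 \
    10.70.37.86:/rhs/brick1/drs1 10.70.37.187:/rhs/brick1/dsr1 \
    10.70.37.86:/rhs/brick2/dsr2 10.70.37.187:/rhs/brick2/dsr2 \
    10.70.37.86:/rhs/brick3/dsr3 10.70.37.187:/rhs/brick3/dsr3
# 'group virt' applies the virt-store option set in one go
gluster volume set dr-store group virt
# vdsm/kvm ownership expected by RHEV
gluster volume set dr-store storage.owner-uid 36
gluster volume set dr-store storage.owner-gid 36
gluster volume start dr-store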
3. RHSS Volume Info
====================
There are 2 volumes available - dr-store & repvol.
dr-store is the volume containing the virtual machine images.

[Tue Feb 4 21:36:20 UTC 2014 root.37.86:~ ] # gluster v i dr-store

Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Brick5: 10.70.37.86:/rhs/brick3/dsr3
Brick6: 10.70.37.187:/rhs/brick3/dsr3
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off

[Tue Feb 4 21:52:25 UTC 2014 root.37.86:~ ] # gluster volume status dr-store
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

4. Hypervisor (client) Info
===========================
Hypervisor used : rhs-client15.lab.eng.blr.redhat.com ( 10.70.36.39 )
Mount point : /rhev/data-center/mnt/glusterSD/10.70.37.198:dr-store

5. Other relevant Info
=======================
This volume was created as a 2X2 volume; a pair of bricks was later added to make it 3X2. No rebalance has been done yet.
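For reference, the add-brick and rebalance operations referred to above look roughly like this (a sketch only; the rebalance step was deliberately not run on this setup):

gluster volume add-brick dr-store \
    10.70.37.86:/rhs/brick3/dsr3 10.70.37.187:/rhs/brick3/dsr3
# Rebalance would spread existing data onto the new replica pair -- not done here:
gluster volume rebalance dr-store start
gluster volume rebalance dr-store status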
Consequence on App VM
=====================
1. VM is not functional; could not log in to the VM using ssh.
2. Storage domain was DOWN with the error, "Failed to Reconstruct Master domain for Data Center virt-dc."
3. No new disk images can be created, as the domain is not available.
Console logs as captured from RHSS Node1 - 10.70.37.86
=======================================================

[Tue Feb 4 20:58:28 UTC 2014 root.37.86:~ ] # gluster v i

Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

Volume Name: repvol
Type: Replicate
Volume ID: 419bdc4c-2802-4eb4-96d5-b71c1da2fc2c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick4/repdir1
Brick2: 10.70.37.187:/rhs/brick4/repdir1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Tue Feb 4 20:58:33 UTC 2014 root.37.86:~ ] # gluster volume add-brick dr-store 10.70.37.86:/rhs/brick3/dsr3 10.70.37.187:/rhs/brick3/dsr3
volume add-brick: success

<<<<<< Forced power-off the second RHSS Node here >>>>>>>>

[Tue Feb 4 20:59:40 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[Tue Feb 4 21:36:20 UTC 2014 root.37.86:~ ] # gluster v i dr-store

Volume Name: dr-store
Type: Distributed-Replicate
Volume ID: c654edf4-3336-47aa-af57-4d096d17da14
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.86:/rhs/brick1/drs1
Brick2: 10.70.37.187:/rhs/brick1/dsr1
Brick3: 10.70.37.86:/rhs/brick2/dsr2
Brick4: 10.70.37.187:/rhs/brick2/dsr2
Brick5: 10.70.37.86:/rhs/brick3/dsr3
Brick6: 10.70.37.187:/rhs/brick3/dsr3
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Tue Feb 4 21:37:02 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[Tue Feb 4 21:48:03 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick2/dsr2                      N/A     N       N/A
Brick 10.70.37.86:/rhs/brick3/dsr3                      N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   N/A     N       N/A
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[Tue Feb 4 21:48:09 UTC 2014 root.37.86:~ ] # gluster pe s
Number of Peers: 1

Hostname: 10.70.37.187
Uuid: 28eef69b-10d6-4b9d-9103-844eb46b357c
State: Peer in Cluster (Disconnected)

[Tue Feb 4 21:48:18 UTC 2014 root.37.86:~ ] # gluster peer probe 10.70.37.198
peer probe: failed: Cluster quorum is not met. Changing peers is not allowed in this state

<<<< Powered on the second RHSS Node here >>>>

[Tue Feb 4 21:52:25 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   49154   Y       21831
Brick 10.70.37.187:/rhs/brick4/repdir1                  49154   Y       2210
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks

[Tue Feb 4 22:15:37 UTC 2014 root.37.86:~ ] # gluster volume status
Status of volume: dr-store
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick1/drs1                      49152   Y       21810
Brick 10.70.37.187:/rhs/brick1/dsr1                     49152   Y       2214
Brick 10.70.37.86:/rhs/brick2/dsr2                      49153   Y       21819
Brick 10.70.37.187:/rhs/brick2/dsr2                     49153   Y       2215
Brick 10.70.37.86:/rhs/brick3/dsr3                      49155   Y       21824
Brick 10.70.37.187:/rhs/brick3/dsr3                     49155   Y       2225
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238

Task Status of Volume dr-store
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: repvol
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.86:/rhs/brick4/repdir1                   49154   Y       21831
Brick 10.70.37.187:/rhs/brick4/repdir1                  49154   Y       2210
NFS Server on localhost                                 2049    Y       31631
Self-heal Daemon on localhost                           N/A     Y       31638
NFS Server on 10.70.37.187                              2049    Y       2232
Self-heal Daemon on 10.70.37.187                        N/A     Y       2238

Task Status of Volume repvol
------------------------------------------------------------------------------
There are no active volume tasks
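For reference, a hedged sketch of how the server-quorum behaviour seen above ("Cluster quorum is not met") can be inspected and tuned from the gluster CLI; the 51% value is purely illustrative and was not changed on this setup:

# Quorum-related options show up under "Options Reconfigured" when set:
gluster volume info dr-store | grep quorum
# server-quorum-ratio is a cluster-wide option, set against the special volume name "all":
gluster volume set all cluster.server-quorum-ratio 51%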
Error messages from fuse mount log:

<snip1>
[2014-02-04 21:56:39.522835] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.522869] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070203: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.536886] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-1: failing writev due to lack of quorum
[2014-02-04 21:56:39.536913] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070207: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.726681] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.726711] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070250: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:39.838559] W [afr-inode-write.c:510:afr_writev] 1-dr-store-replicate-0: failing writev due to lack of quorum
[2014-02-04 21:56:39.838589] W [fuse-bridge.c:2696:fuse_writev_cbk] 0-glusterfs-fuse: 17070252: WRITE => -1 (Read-only file system)
[2014-02-04 21:56:46.903560] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 1-dr-store-client-2: changing port to 49153 (from 0)
[2014-02-04 21:56:46.908532] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 1-dr-store-client-4: changing port to 49155 (from 0)
</snip1>

<snip2>
[2014-02-04 22:17:24.095650] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-1: server 10.70.37.187:49152 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:17:47.099842] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-2: server 10.70.37.86:49153 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:17:48.099986] C [client-handshake.c:127:rpc_client_ping_timer_expired] 1-dr-store-client-0: server 10.70.37.86:49152 has not responded in the last 42 seconds, disconnecting.
[2014-02-04 22:46:42.227994] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-1: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x234866x sent = 2014-02-04 22:16:38.497154. timeout = 1800 for 10.70.37.187:49152
[2014-02-04 22:46:42.228049] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-1: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:42.228069] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-1. Transaction was WRITE
[2014-02-04 22:46:48.228531] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-0: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x235871x sent = 2014-02-04 22:16:38.497142. timeout = 1800 for 10.70.37.86:49152
[2014-02-04 22:46:48.228592] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-0: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:48.228612] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-0. Transaction was WRITE
</snip2>
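A small sketch of how these messages can be pulled out of the client log on the hypervisor; the log file name mirrors the mount path and may differ per setup:

grep -E 'lack of quorum|ping_timer_expired|call_bail' \
    /var/log/glusterfs/rhev-data-center-mnt-*dr-store.log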
Created attachment 859496 [details]
screenshot of appvm1

This screenshot shows the state of appvm1.
Created attachment 859498 [details]
screenshot of appvm2

This screenshot shows the state of appvm2.
Created attachment 859499 [details]
sosreport from RHSS Node1

Attached the sosreport from RHSS Node1 - 10.70.37.86.
Created attachment 859501 [details]
sosreport from RHSS Node2

Attached the sosreport from RHSS Node2 - 10.70.37.187.
Running sosreport on the hypervisor hung, the same way the "mount" and "df" commands did, so I am attaching the fuse mount logs separately. Hopefully these are the most relevant pieces from the hypervisor.
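For anyone collecting these manually: the FUSE client logs live under /var/log/glusterfs on the hypervisor, named after the mount path with slashes replaced by dashes. A hedged sketch (the exact file names are assumptions based on this setup):

ls -l /var/log/glusterfs/rhev-data-center-mnt-*.log*
tar czf /tmp/hypervisor-fuse-mount-logs.tgz /var/log/glusterfs/rhev-data-center-mnt-*.log*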
Created attachment 859502 [details]
Fuse mount log files from Hypervisor

Attaching the fuse mount log obtained from the hypervisor.
It is expected that an RHSS node can go down in a cluster. If the cluster has a limited number of nodes, say 2 or 4, there is a high possibility that server quorum is not met. In those cases, App VMs go to the paused state and are unable to resume their normal operations once quorum is met again. No new VMs can be created with their disk images on this volume either.

App VMs should not be affected across the "quorum is not met" and "quorum is met again" transitions.

Considering all this, moving the Severity to 'Urgent'.
All operations on the mount point are hung. The problem is suspected to be with the brick processes (as Pranith observed initially).

Restarting the volume could be a better workaround, but I haven't done this, as Pranith wanted to retain the setup.
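For completeness, the restart workaround mentioned above would look roughly like this (not applied here, since the setup was kept for debugging; 'volume stop' prompts for confirmation and takes the volume offline for all clients):

gluster volume stop dr-store
gluster volume start dr-store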
I repeated the same test with the same version of gluster and could still hit this problem. The only difference I see now is that the App VMs come UP automatically after quorum is met, and they resumed the paused IO for a minute or so. After a few minutes (probably 1 minute or more), the storage domain went down again. I see the same error messages in the App VMs, and the "df" command hangs.

It is consistently reproducible.
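A quick way to check for the hang without tying up a shell is to wrap the probe in a timeout; a sketch, assuming the mount path from comment 0:

MNT=/rhev/data-center/mnt/glusterSD/10.70.37.198:dr-store
timeout 30 df -Th "$MNT" || echo "df on $MNT timed out - mount looks hung"
timeout 30 stat "$MNT"   || echo "stat on $MNT timed out - mount looks hung"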
As per Pranith's request, I tried the same test on glusterfs-3.4.0.56rhs-1 and this issue was not seen. As of now, something appears to be going wrong with the FSYNC fop, and this issue was not seen in glusterfs-3.4.0.56rhs-1, as shown below:

<<<< Filtering for FSYNC-related errors with glusterfs-3.4.0.58rhs-1.el6rhs, from the fuse mount log >>>>

[root@rhs-client15 glusterfs]# grep -i fsync rhev-data-center-mnt-10.70.37.86:dr-store.log
[2014-02-04 21:36:08.678370] E [rpc-clnt.c:369:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x7fa34cfca524] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7fa34cfca063] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7fa34cfc9f7e]))) 1-dr-store-client-3: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2014-02-04 21:35:00.542502 (xid=0x118726x)
[2014-02-04 21:36:08.678385] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-3: remote operation failed: Transport endpoint is not connected
[2014-02-04 21:36:08.678409] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-1: fsync(b31d8277-fae7-48c7-a377-1b8fc817a6eb) failed on subvolume dr-store-client-3. Transaction was WRITE
[2014-02-04 22:46:42.227994] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-1: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x234866x sent = 2014-02-04 22:16:38.497154. timeout = 1800 for 10.70.37.187:49152
[2014-02-04 22:46:42.228049] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-1: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:42.228069] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-1. Transaction was WRITE
[2014-02-04 22:46:48.228531] E [rpc-clnt.c:208:call_bail] 1-dr-store-client-0: bailing out frame type(GlusterFS 3.3) op(FSYNC(16)) xid = 0x235871x sent = 2014-02-04 22:16:38.497142. timeout = 1800 for 10.70.37.86:49152
[2014-02-04 22:46:48.228592] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 1-dr-store-client-0: remote operation failed: Transport endpoint is not connected
[2014-02-04 22:46:48.228612] W [afr-transaction.c:1553:afr_changelog_fsync_cbk] 1-dr-store-replicate-0: fsync(2287b21f-dc5e-47f5-8730-679ee9b44273) failed on subvolume dr-store-client-0. Transaction was WRITE

<<<< Filtering for FSYNC-related errors with glusterfs-3.4.0.56rhs-1.el6rhs, from the fuse mount log >>>>

[root@rhs-client15 glusterfs]# grep -i fsync rhev-data-center-mnt-10.70.40.116:drvol.log
[root@rhs-client15 glusterfs]#
As far as I can see, the only workaround is:

1. Identify the mount process, i.e. ps aux | grep gluster
2. Kill the appropriate mount process, i.e. kill -9 <pid>
3. Lazily unmount the fuse mount, i.e. umount -l <fuse-mount-by-rhev>
   Existing App VMs (though in a hung state, they are shown as UP in the RHEV-M UI) will go to the paused state.
4. Activate the storage domain from the RHEV-M UI. The fuse mount will be created again.
5. Force shutdown all existing paused App VMs.
6. Restart the existing App VMs.

You can also create new App VMs with disk images on this storage domain.

I performed the above steps and resumed the storage domain operations. Captured the logs from the hypervisor as below (a consolidated sketch of these steps follows the console log):

Console logs on the hypervisor
==============================
[root@rhs-client10 ~]# ps aux | grep gluster
root     11252  183  0.3 310444 50088 ?   R<sl Feb05 2141:50 /usr/sbin/glusterfs --volfile-server=10.70.37.46 --volfile-id=test-drvol /rhev/data-center/mnt/10.70.37.46:test-drvol
root     23481  0.0  0.0 103256   856 pts/6  S+  05:47  0:00 grep gluster
[root@rhs-client10 ~]# kill -9 11252

<<<< df command was not hanging now >>>>

[root@rhs-client10 ~]# df -Th
df: `/rhev/data-center/mnt/10.70.37.46:test-drvol': Transport endpoint is not connected
Filesystem     Type    Size  Used Avail Use% Mounted on
/dev/sda2      ext4     99G  2.4G   92G   3% /
tmpfs          tmpfs   7.8G  8.0K  7.8G   1% /dev/shm
/dev/sda1      ext4   1008M   89M  868M  10% /boot
[root@rhs-client10 ~]# umount /rhev/data-center/mnt/10.70.37.46:test-drvol
umount: /rhev/data-center/mnt/10.70.37.46:test-drvol: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
[root@rhs-client10 ~]# ps aux | grep gluster
root     24461  0.0  0.0 103252   852 pts/6  S+  05:48  0:00 grep gluster
[root@rhs-client10 ~]# umount -l /rhev/data-center/mnt/10.70.37.46:test-drvol
[root@rhs-client10 ~]#

<<<< after activation of storage domain from RHEVM UI >>>>

[root@rhs-client10 ~]# df -Th
Filesystem              Type            Size  Used Avail Use% Mounted on
/dev/sda2               ext4             99G  2.4G   92G   3% /
tmpfs                   tmpfs           7.8G  8.0K  7.8G   1% /dev/shm
/dev/sda1               ext4           1008M   89M  868M  10% /boot
10.70.37.46:test-drvol  fuse.glusterfs  200G  8.0G  192G   4% /rhev/data-center/mnt/10.70.37.46:test-drvol
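A consolidated sketch of steps 1-3 of this workaround; the mount path is the one from the console log above and is an assumption for other setups, while steps 4-6 remain manual actions in the RHEV-M UI:

MNT=/rhev/data-center/mnt/10.70.37.46:test-drvol
PID=$(pgrep -f "glusterfs.*${MNT}")   # step 1: find the FUSE mount process
kill -9 "$PID"                        # step 2: kill it
umount -l "$MNT"                      # step 3: lazy-unmount the stale mount
# step 4 onwards: re-activate the storage domain from RHEV-M, then power-cycle the paused App VMs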
Continued the test with RHEL 6.5 and found the same issue again.

In one of my attempts I restored quorum quickly (within 5 minutes of losing it) by bringing the RHSS node back up, and all VM instances were restored normally. Probably no timer expired? (not sure). But when there is a longer interval (say 30 minutes or more) before quorum is restored, I see this error consistently.

In all the tests I ran, the App VMs return to normal immediately after quorum is restored, and after some time the errors appear.

In all the test scenarios, I see this error message consistently in the log:

<snip>
[2014-02-06 19:37:22.151422] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-1: server 10.70.37.187:49156 has not responded in the last 42 seconds, disconnecting.
[2014-02-06 19:37:27.155575] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-0: server 10.70.37.86:49156 has not responded in the last 42 seconds, disconnecting.
</snip>

Why is the disconnect happening after quorum is restored?

I have captured the relevant part of the fuse mount log, from the point where server-side quorum was restored. Hope this helps in understanding the operations.

Bringing up the node now
=========================
[2014-02-06 18:23:42.776185] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207291: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:32:22.296254] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207333: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:41:04.518361] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207375: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:50:00.035737] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207417: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 18:58:37.462267] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207459: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:07:19.194087] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207501: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:15:50.710956] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207543: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:24:32.568873] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 207585: LOOKUP() / => -1 (Transport endpoint is not connected)
[2014-02-06 19:30:17.539664] E [client-handshake.c:1778:client_query_portmap_cbk] 0-cvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2014-02-06 19:30:17.539745] I [client.c:2103:client_rpc_notify] 0-cvol-client-1: disconnected from 10.70.37.187:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2014-02-06 19:30:17.540290] E [client-handshake.c:1778:client_query_portmap_cbk] 0-cvol-client-3: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2014-02-06 19:30:17.540341] I [client.c:2103:client_rpc_notify] 0-cvol-client-3: disconnected from 10.70.37.187:24007. Client process will keep trying to connect to glusterd until brick's port is available.
[2014-02-06 19:30:18.091145] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-3: changing port to 49157 (from 0)
[2014-02-06 19:30:18.094941] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-3: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:18.097395] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-3: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:18.097417] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-3: Connected to 10.70.37.187:49157, attached to remote volume '/rhs/brick2/dir2'.
[2014-02-06 19:30:18.097426] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-3: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:18.097434] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-3: 2 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:18.101298] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-3: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:18.101337] I [afr-common.c:3988:afr_notify] 0-cvol-replicate-1: Subvolume 'cvol-client-3' came back up; going online.
[2014-02-06 19:30:18.101591] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-3: Server lk version = 1
[2014-02-06 19:30:19.095146] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-1: changing port to 49156 (from 0)
[2014-02-06 19:30:19.098860] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:19.101509] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-1: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:19.101535] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-1: Connected to 10.70.37.187:49156, attached to remote volume '/rhs/brick1/dir1'.
[2014-02-06 19:30:19.101544] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:19.101552] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-1: 1 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:19.102325] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-1: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:19.102363] I [afr-common.c:3988:afr_notify] 0-cvol-replicate-0: Subvolume 'cvol-client-1' came back up; going online.
[2014-02-06 19:30:19.102554] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-1: Server lk version = 1
[2014-02-06 19:30:20.101868] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-2: changing port to 49157 (from 0)
[2014-02-06 19:30:20.107270] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:20.107339] I [rpc-clnt.c:1690:rpc_clnt_reconfig] 0-cvol-client-0: changing port to 49156 (from 0)
[2014-02-06 19:30:20.110192] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-2: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:20.110215] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-2: Connected to 10.70.37.86:49157, attached to remote volume '/rhs/brick2/dir2'.
[2014-02-06 19:30:20.110225] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:20.110233] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-2: 2 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:20.110657] I [client-handshake.c:1695:select_server_supported_programs] 0-cvol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-02-06 19:30:20.110881] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-2: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:20.111031] I [client-handshake.c:1466:client_setvolume_cbk] 0-cvol-client-0: 'server-pkg-version' key not found, handshaked with older client
[2014-02-06 19:30:20.111071] I [client-handshake.c:1480:client_setvolume_cbk] 0-cvol-client-0: Connected to 10.70.37.86:49156, attached to remote volume '/rhs/brick1/dir1'.
[2014-02-06 19:30:20.111083] I [client-handshake.c:1492:client_setvolume_cbk] 0-cvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-02-06 19:30:20.111110] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-0: 1 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:20.111110] I [client-handshake.c:1314:client_post_handshake] 0-cvol-client-0: 1 fds open - Delaying child_up until they are re-opened
[2014-02-06 19:30:20.111194] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-2: Server lk version = 1
[2014-02-06 19:30:20.111463] I [client-handshake.c:936:client_child_up_reopen_done] 0-cvol-client-0: last fd open'd/lock-self-heal'd - notifying CHILD-UP
[2014-02-06 19:30:20.111656] I [client-handshake.c:450:client_set_lk_version_cbk] 0-cvol-client-0: Server lk version = 1
[2014-02-06 19:37:22.151422] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-1: server 10.70.37.187:49156 has not responded in the last 42 seconds, disconnecting.
[2014-02-06 19:37:27.155575] C [client-handshake.c:127:rpc_client_ping_timer_expired] 0-cvol-client-0: server 10.70.37.86:49156 has not responded in the last 42 seconds, disconnecting.
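The 42-second figure in these disconnect messages matches the client-side network.ping-timeout (42 seconds by default). A hedged sketch of how it can be inspected or tuned; changing it would only delay the disconnects, not address the underlying hang:

# Shows up under "Options Reconfigured" only if explicitly set; otherwise the 42s default applies
gluster volume info cvol | grep ping-timeout
gluster volume set cvol network.ping-timeout 42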
Created attachment 860207 [details]
tcpdump from hypervisor side

Took a tcpdump from the hypervisor side. Analysing this dump with a network packet analyser such as Wireshark could help identify which transaction actually failed.
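A sketch of a capture similar to the attached one; the interface name and the brick port range are assumptions (glusterd listens on 24007, bricks on 49152 and up on this setup):

tcpdump -i eth0 -s 0 -w /tmp/gluster-hypervisor.pcap \
    'port 24007 or portrange 49152-49160'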
When the problem was recreated, we observed that the event_dispatch_epoll() thread is blocked on a lock held by the FUSE thread, which is looping infinitely over the list of eager-locked transactions on an fd opened on behalf of one of the VMs on the volume. This appears to be an issue with eager-lock management in the AFR xlator. Further analysis is required to identify what caused this to happen.

We also know that this issue is not observed with the 3.4.0.56rhs rpms. We are looking into the patches that went in after .56 and trying to identify what might have caused this.
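A sketch of how the stuck threads can be inspected on the hypervisor; the pgrep pattern and the need for glusterfs debuginfo packages (for readable frames) are assumptions:

PID=$(pgrep -f 'glusterfs.*volfile-id=dr-store' | head -1)
gdb -batch -p "$PID" -ex 'thread apply all bt' > /tmp/glusterfs-mount-backtrace.txt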
I don't think any patch between 3.4.0.56rhs and 3.4.0.58rhs could have caused this.

Satheesaran,
    If I remember correctly, we brought up the VMs quickly (definitely within 5 to 10 minutes) on 3.4.0.56rhs the last time we tried. Did you get a chance to re-create the bug again and wait for at least 30 minutes even with 3.4.0.56rhs?

Krishnan,
    Did you get a chance to observe whether write-behind is pumping writes to the afr xlator when you saw this issue?

Pranith.
(In reply to Pranith Kumar K from comment #22)
> I don't think any patch between 3.4.0.56rhs 3.4.0.58rhs could have caused
> this.
>
> Satheesaran,
>     If I remember correctly we brought up the VMs quickly (definitely
> within 5 to 10 minutes) on 3.4.0.56rhs the last time we tried. Did you get a
> chance to re-create the bug again and waited for at least 30 minutes even
> with 3.4.0.56rhs?

Pranith,
I retried the same test with glusterfs-3.4.0.56rhs and gave it a sufficient time interval. I am not seeing any issues with the 56rhs build. I have kept the 56rhs setup available for you to investigate.
I was able to figure out why there could be an infinite loop in the afr eager-lock functionality.

When both bricks go down while an 'finodelk' for eager-lock is in progress, and that finodelk fails, it leaves a stale stub in the list of stubs which share the eager-lock. This stale stub's memory is re-used by new frames which get the same pool of memory, so list traversal involving the stale stub leads to undefined behaviour, which can include infinite loops. I attached the process to gdb to confirm that the case mentioned above leads to stale stubs.

I sent the fix for this bug upstream at http://review.gluster.org/6944.

We confirmed that the following log line:
"[2014-02-04 21:36:12.702263] I [afr-transaction.c:1119:afr_post_blocking_inodelk_cbk] 1-dr-store-replicate-0: Blocking inodelks failed"
is present on the setup where the issue is seen and not present on the setup (the one with 3.4.0.56rhs) where the issue is not seen. This is not a regression; this bug must have been there since Big Bend.

Thanks Vijay, KP and Satheesaran for helping me find the root cause. I am not sure if there is more than one issue in this bug, for example the reboot we saw on one of the attempts to re-create the issue. But for the infinite loop we saw, we have the fix.

Easier steps to re-create the bug (a hedged sketch of steps 2-3 follows this list):
1) Create a replicate volume and disable all performance xlators.
2) Open a bash fd on one of the files on the mount using bash fds: "exec fd-num>file-name".
3) Start performing writes on the fd in a while loop in one window.
4) Attach the mount process to gdb and put a break point just before afr sends finodelks.
5) Once the break point is hit, stop the volume. This hits the error condition which leads to the infinite loop.
6) Restart the volume and wait for some time; df will hang.

I re-created this issue in front of Satheesaran, and with the fix it is not seen any more.
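A minimal sketch of steps 2 and 3 of the reproducer above, run from a directory on the FUSE mount; the fd number and file name are arbitrary, and the mount point is an assumption:

cd /mnt/testrepvol            # assumed mount point of the test replicate volume
exec 5>vm-image-test          # step 2: hold an open fd on a file
while true; do                # step 3: keep writing through the held fd
    echo data >&5
    sleep 0.1
done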
Forgot to add one more point about the bug: for this issue to be hit, both bricks need to go down just before taking the finodelk. Losing server quorum was just one way of simulating that; we can hit this bug even without any quorum involvement, simply by stopping the volume just before the finodelk.
As per comment 25, this bug is not specific to server quorum, so I am modifying the summary accordingly.
adding 3.0 flag and removing 2.1.z
Setting flags required to add BZs to RHS 3.0 Errata
*** Bug 1062256 has been marked as a duplicate of this bug. ***
Tested with glusterfs-3.6.0.24-1.el6rhs, following the steps in comment 0, and the mount point no longer hangs once the client-side quorum is restored.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html