Bug 960046 - [RHEV-RHS] VMs go into paused state after starting rebalance
Summary: [RHEV-RHS] VMs go into paused state after starting rebalance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra Bhat
QA Contact: shylesh
URL:
Whiteboard:
Depends On:
Blocks: 922183 922361 923774 998352
 
Reported: 2013-05-06 13:36 UTC by shylesh
Modified: 2013-09-23 22:29 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.4.0.19rhs-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 998352
Environment:
virt rhev integration
Last Closed: 2013-09-23 22:29:50 UTC
Embargoed:


Attachments (Terms of Use)
ext4 corruption snapshot (192.58 KB, image/png)
2013-08-14 14:05 UTC, shylesh
program which can be run to test the bug (2.76 KB, text/x-csrc)
2013-08-14 16:10 UTC, Raghavendra Bhat

Description shylesh 2013-05-06 13:36:10 UTC
Description of problem:
Starting a rebalance operation causes VMs to go into a paused state.

Version-Release number of selected component (if applicable):
[root@rhs1-bb ~]# rpm -qa | grep gluster
glusterfs-server-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-devel-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-3.4.0.3rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.3rhs-1.el6rhs.x86_64


How reproducible:


Steps to Reproduce:
1. Created a 6x2 distributed-replicate volume.
2. Created 5 VMs on this volume.
3. Added one more pair of bricks and started fix-layout.
4. Once fix-layout completed, issued the command:
gluster volume rebalance vstore start
  
Actual results:
Rebalance ran for some time; while it was still in progress, the VMs got paused one by one.



Additional info:

[root@rhs1-bb ~]# gluster v info
 
Volume Name: vstore
Type: Distributed-Replicate
Volume ID: e8fe6a61-6345-41f0-9329-a802b051a026
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/vs1
Brick2: 10.70.37.133:/brick1/vs1
Brick3: 10.70.37.76:/brick2/vs2
Brick4: 10.70.37.133:/brick2/vs2
Brick5: 10.70.37.76:/brick3/vs3
Brick6: 10.70.37.133:/brick3/vs3
Brick7: 10.70.37.76:/brick4/vs4
Brick8: 10.70.37.133:/brick4/vs4
Brick9: 10.70.37.76:/brick5/vs5
Brick10: 10.70.37.133:/brick5/vs5
Brick11: 10.70.37.76:/brick6/vs6
Brick12: 10.70.37.133:/brick6/vs6
Brick13: 10.70.37.134:/brick1/vs1
Brick14: 10.70.37.59:/brick1/vs1
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on





errors from the hypervisor mount
================================

[2013-05-06 13:11:16.803849] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.803907] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.803939] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546926: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.805217] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.805422] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.805451] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546928: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.807145] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.807230] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.807259] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546930: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.809052] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.809995] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.810026] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546932: WRITE => -1 (Bad file descriptor)
[2013-05-06 13:11:16.811380] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-4: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.811564] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vstore-client-5: remote operation failed: Bad file descriptor
[2013-05-06 13:11:16.811589] W [fuse-bridge.c:2127:fuse_writev_cbk] 0-glusterfs-fuse: 546934: WRITE => -1 (Bad file descriptor)



The VMs could be recovered by stopping and restarting them.


Attached the sosreports.

Comment 3 shishir gowda 2013-05-07 12:12:07 UTC
Looks like a graph switch at the client is leading to EBADF errors, due to disconnections on the server side for the old graph.

[2013-05-06 12:34:57.980819] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:57.987618] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:58.003871] I [server-handshake.c:567:server_setvolume] 0-vstore-server: accepted client from rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0 (version: 3.4.0.3rhs)
[2013-05-06 12:34:59.015321] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.015377] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.015434] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs2-bb.lab.eng.blr.redhat.com-14187-2013/05/06-12:34:52:851734-vstore-client-0-0
[2013-05-06 12:34:59.016573] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.016635] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.016729] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs4-bb.lab.eng.blr.redhat.com-14007-2013/05/06-12:34:52:846644-vstore-client-0-0
[2013-05-06 12:34:59.031861] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:34:59.031924] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:34:59.031974] I [server-helpers.c:614:server_connection_destroy] 0-vstore-server: destroyed connection of rhs3-bb.lab.eng.blr.redhat.com-13915-2013/05/06-12:34:52:855118-vstore-client-0-0
[2013-05-06 12:35:57.880891] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 526389248, Bad file descriptor
[2013-05-06 12:35:57.880972] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1712: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.909136] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 530587648, Bad file descriptor
[2013-05-06 12:35:57.909201] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1715: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.911811] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 11404472320, Bad file descriptor
[2013-05-06 12:35:57.911863] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1718: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.914596] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 666689536, Bad file descriptor
[2013-05-06 12:35:57.914644] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1722: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:35:57.917624] E [posix.c:2135:posix_writev] 0-vstore-posix: write failed: offset 9275346944, Bad file descriptor
[2013-05-06 12:35:57.917808] I [server-rpc-fops.c:1439:server_writev_cbk] 0-vstore-server: 1725: WRITEV 1 (45228a74-2dbf-4871-9bf3-4e4550aaa7a8) ==> (Bad file descriptor)
[2013-05-06 12:52:56.316583] I [server.c:762:server_rpc_notify] 0-vstore-server: disconnecting connectionfrom rhs1-bb.lab.eng.blr.redhat.com-12009-2013/05/06-12:34:47:766675-vstore-client-0-0
[2013-05-06 12:52:56.316703] I [server-helpers.c:726:server_connection_put] 0-vstore-server: Shutting down connection rhs1-bb.lab.eng.blr.redhat.com-12009-2013/05/06-12:34:47:766675-vstore-client-0-0
[2013-05-06 12:52:56.349423] I [server-helpers.c:460:do_fd_cleanup] 0-vstore-server: fd cleanup on /f3e8bf4f-1791-4777-bb97-ab161efa7fcc/images/f87f3951-3c46-494e-be48-124ca38ee3fa/cca8ce16-c191-42a8-8e1e-bd7635bffe81
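
The posix_writev failures above are the brick's backing filesystem returning EBADF for a file descriptor that has already been destroyed (the old graph's connections, and with them its fds, were torn down). A minimal standalone C sketch of that failure mode, illustration only with a hypothetical /tmp path:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* hypothetical scratch file, not a real brick path */
            int fd = open("/tmp/ebadf-demo", O_CREAT | O_WRONLY, 0644);
            if (fd < 0)
                    return 1;

            /* simulate the server-side cleanup destroying the fd
             * behind the writer's back */
            close(fd);

            char buf[4096];
            memset(buf, 0, sizeof(buf));
            if (pwrite(fd, buf, sizeof(buf), 0) < 0)
                    fprintf(stderr, "WRITE => -1 (%s)\n", strerror(errno));
            return 0;
    }

Compiled and run, this prints WRITE => -1 (Bad file descriptor), matching the fuse_writev_cbk lines in the client log.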

Comment 4 Rejy M Cyriac 2013-05-16 09:55:18 UTC
Issue reproduced on glusterfs-server-3.4.0.8rhs-1.el6rhs.x86_64

Environment: RHEV+RHS
RHEVM: 3.2.0-10.21.master.el6ev 
Hypervisor: RHEL 6.4
RHS: 4 nodes running gluster*3.4.0.8rhs-1.el6rhs.x86_64
Volume Name: RHEV-BigBend_extra

Bricks were added to the volume and rebalance was started as given below:

----------------------------------------------------- 

[Thu May 16 13:54:00 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra start
volume rebalance: RHEV-BigBend_extra: success: Starting rebalance on volume RHEV-BigBend_extra has been successful.
ID: 35858114-cb13-48ce-a189-c499aa480810
[Thu May 16 13:54:51 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                5         3.0MB            13             2    in progress            59.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
     rhs-client15.lab.eng.blr.redhat.com                3         8.9KB            19             2      completed             6.00
volume rebalance: RHEV-BigBend_extra: success: 

....
 
[Thu May 16 14:04:36 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                7        20.0GB            17             2    in progress           677.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
     rhs-client15.lab.eng.blr.redhat.com                3         8.9KB            19             2      completed             6.00
volume rebalance: RHEV-BigBend_extra: success: 
[Thu May 16 14:06:08 root@rhs-client45:~ ] #gluster volume rebalance RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                9        45.0GB            23             2      completed           691.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             1.00
     rhs-client15.lab.eng.blr.redhat.com                3         8.9KB            19             2      completed             6.00
volume rebalance: RHEV-BigBend_extra: success: 

-----------------------------------------------------

Two VMs got paused during the operation. They were recoverable only after being forcefully stopped and started.

Comment 7 Rejy M Cyriac 2013-05-17 20:15:24 UTC
It is interesting to note that VMs being migrated during the rebalance operation seem to recover automatically from the issue reported here, as seen during the verification of BZ 923523 (comment 8).

Comment 8 Raghavendra Bhat 2013-06-12 13:26:04 UTC
The rebalance process on one of the nodes logged the following.


1) It is surprising how the inode became NULL, causing inode_ctx_get to fail.

2) Why was the node-uuid not obtained in the getxattr call?

[2013-05-06 12:34:58.468919] E [dht-helper.c:1054:dht_inode_ctx_get] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f38fb120c85] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f38fb10819e] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x34) [0x7f38fb1094d4]))) 0-vstore-dht: invalid argument: inode
[2013-05-06 12:34:58.468983] E [dht-helper.c:1073:dht_inode_ctx_set] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_lookup_linkfile_create_cbk+0x75) [0x7f38fb120c85] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_layout_preset+0x5e) [0x7f38fb10819e] (-->/usr/lib64/glusterfs/3.4.0.3rhs/xlator/cluster/distribute.so(dht_inode_ctx_layout_set+0x52) [0x7f38fb1094f2]))) 0-vstore-dht: invalid argument: inode
[2013-05-06 12:34:58.469142] E [dht-common.c:2100:dht_getxattr] 0-vstore-dht: layout is NULL
[2013-05-06 12:34:58.469215] E [dht-rebalance.c:1210:gf_defrag_migrate_data] 0-vstore-dht: Failed to get node-uuid for /f3e8bf4f-1791-4777-bb97-ab161efa7fcc/images/333561b6-2bc7-4bde-ae79-41b4a9ad56ee/5f4cacb7-fa3c-46ee-82f8-47a892113119.lease
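
For context on item 1: the "invalid argument: inode" lines come from argument validation at the top of the dht inode-ctx helpers; they fire whenever a NULL inode reaches them. A self-contained sketch of that pattern (an approximation, not the actual dht-helper.c code):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct inode inode_t;  /* opaque stand-in for the gluster type */

    static int inode_ctx_get_sketch(inode_t *inode, uint64_t *value)
    {
            if (!inode) {
                    /* the real helper logs through a calling-function
                     * logger, which is why each error line above
                     * carries a (-->...) backtrace */
                    fprintf(stderr, "invalid argument: inode\n");
                    return -1;
            }
            *value = 0;  /* the real code reads the per-xlator ctx here */
            return 0;
    }

    int main(void)
    {
            uint64_t v;
            /* a NULL inode reaching the helper reproduces the log
             * above; the open question in this comment is how the
             * inode became NULL in the first place */
            return inode_ctx_get_sketch(NULL, &v) == -1 ? 0 : 1;
    }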

Comment 10 Amar Tumballi 2013-07-23 08:06:58 UTC
Rejy/Shanks, there are a couple more fixes in rebalance now which should have fixed this issue in Big Bend. Can we please test this once more?

Comment 11 shishir gowda 2013-07-24 03:39:56 UTC
We have two fixes merged in the rebalance/remove-brick code path, bugs 976755 and 981949. In addition, bug 981708 is a client-side fix which could potentially affect this bug.
Could you please re-run these tests and check whether the issue is fixed? Please re-open the bug if it is hit again.

Comment 12 shylesh 2013-08-08 11:14:45 UTC
This issue is still reproducible on 3.4.0.18rhs-1.el6rhs.x86_64

RHS nodes
========
10.70.37.113
10.70.37.133


Mounted on 
=========
rhs-client36.lab.eng.blr.redhat.com

mount point
===========
/rhev/data-center/mnt/10.70.37.113:vmstore



Volume Name: vmstore
Type: Distributed-Replicate
Volume ID: 10b93f79-2a1d-4737-8632-05f57c97db93
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.113:/brick1/vss1
Brick2: 10.70.37.133:/brick1/vss1
Brick3: 10.70.37.113:/brick2/vss2
Brick4: 10.70.37.133:/brick2/vss2
Brick5: 10.70.37.113:/brick3/vss3
Brick6: 10.70.37.133:/brick3/vss3
Brick7: 10.70.37.113:/brick4/vss4
Brick8: 10.70.37.133:/brick4/vss4
Brick9: 10.70.37.113:/brick4/vss5
Brick10: 10.70.37.133:/brick5/vss5
Brick11: 10.70.37.113:/brick6/vss6
Brick12: 10.70.37.133:/brick6/vss6
Brick13: 10.70.37.113:/brick1/vss7
Brick14: 10.70.37.133:/brick1/vss7
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 36
storage.owner-gid: 36


48325] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.951735] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.951771] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311765: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:46.971392] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.975536] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.975575] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311773: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:46.997078] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.997968] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:46.998002] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311776: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.020290] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.020474] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.020508] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311778: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.038749] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.039092] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.039123] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311780: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.045422] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.047381] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.047412] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311782: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.053965] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.054327] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.054356] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311784: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.063849] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.064494] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.064523] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311786: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.073986] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.074109] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-5: remote operation failed: Bad file descriptor
[2013-08-08 10:52:47.074138] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 1311788: WRITE => -1 (Bad file descriptor)
[2013-08-08 10:52:47.083434] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-vmstore-client-4: remote operation failed: Bad file descriptor




sosreports @
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/923774/

Comment 13 Amar Tumballi 2013-08-13 18:46:38 UTC
https://code.engineering.redhat.com/gerrit/11398 fixes the issue.

Comment 14 shylesh 2013-08-14 14:04:01 UTC
While verifying this bug, the same steps led to ext4 corruption on one of the app VMs.

Impact: ext4 corruption in an app VM

Volume Name: vmstore
Type: Distributed-Replicate
Volume ID: 10b93f79-2a1d-4737-8632-05f57c97db93
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.37.113:/brick1/vss1
Brick2: 10.70.37.133:/brick1/vss1
Brick3: 10.70.37.113:/brick2/vss2
Brick4: 10.70.37.133:/brick2/vss2
Brick5: 10.70.37.113:/brick3/vss3
Brick6: 10.70.37.133:/brick3/vss3
Brick7: 10.70.37.113:/brick4/vss4
Brick8: 10.70.37.133:/brick4/vss4
Brick9: 10.70.37.113:/brick4/vss5
Brick10: 10.70.37.133:/brick5/vss5
Brick11: 10.70.37.113:/brick6/vss6
Brick12: 10.70.37.133:/brick6/vss6
Brick13: 10.70.37.113:/brick1/vss7
Brick14: 10.70.37.133:/brick1/vss7
Brick15: 10.70.37.113:/brick1/vss8
Brick16: 10.70.37.133:/brick1/vss8
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off



cluster info
============
RHS nodes
---------
10.70.37.113
10.70.37.133


Hypervisor
==========
rhs-client36.lab.eng.blr.redhat.com

Mount point
===========
/rhev/data-center/mnt/10.70.37.113:vmstore



Mount log messages
===================
[2013-08-14 12:08:13.805531] I [client.c:2103:client_rpc_notify] 0-vmstore-client-11: disconnected from 10.70.37.133:49170. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805540] E [afr-common.c:3832:afr_notify] 0-vmstore-replicate-5: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-08-14 12:08:13.805556] I [client.c:2103:client_rpc_notify] 0-vmstore-client-12: disconnected from 10.70.37.113:49164. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805574] I [client.c:2103:client_rpc_notify] 0-vmstore-client-13: disconnected from 10.70.37.133:49171. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-14 12:08:13.805583] E [afr-common.c:3832:afr_notify] 0-vmstore-replicate-6: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-08-14 12:08:13.806891] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (00000000-0000-0000-0000-000000000000)
[2013-08-14 12:08:13.807476] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (00000000-0000-0000-0000-000000000000)
[2013-08-14 12:08:13.813842] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in /05ba73ee-552a-4eb4-9368-6db52bac31ef. holes=1 overlaps=0 missing=0 down=0 misc=1
[2013-08-14 12:08:13.813876] W [dht-selfheal.c:916:dht_selfheal_directory] 1-vmstore-dht: 1 subvolumes have unrecoverable errors
[2013-08-14 12:08:13.814402] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in /05ba73ee-552a-4eb4-9368-6db52bac31ef. holes=1 overlaps=0 missing=0 down=0 misc=1
[2013-08-14 12:08:13.814421] W [dht-selfheal.c:916:dht_selfheal_directory] 1-vmstore-dht: 1 subvolumes have unrecoverable errors
[2013-08-14 12:08:13.815177] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (140eaaf5-c667-4a71-aef1-a69a50c249b0)
[2013-08-14 12:08:13.815222] I [dht-common.c:567:dht_revalidate_cbk] 1-vmstore-dht: subvolume vmstore-replicate-7 for /05ba73ee-552a-4eb4-9368-6db52bac31ef returned -1 (Permission denied)
[2013-08-14 12:08:13.815493] W [client-rpc-fops.c:2604:client3_3_lookup_cbk] 1-vmstore-client-14: remote operation failed: Permission denied. Path: /05ba73ee-552a-4eb4-9368-6db52bac31ef (140eaaf5-c667-4a71-aef1-a69a50c249b0)
[2013-08-14 12:08:13.815510] I [dht-common.c:567:dht_revalidate_cbk] 1-vmstore-dht: subvolume vmstore-replicate-7 for /05ba73ee-552a-4eb4-9368-6db52bac31ef returned -1 (Permission denied)
[2013-08-14 12:08:13.834774] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in <gfid:140eaaf5-c667-4a71-aef1-a69a50c249b0>. holes=1 overlaps=0 missing=1 down=0 misc=0
[2013-08-14 12:08:13.835080] I [dht-layout.c:633:dht_layout_normalize] 1-vmstore-dht: found anomalies in <gfid:140eaaf5-c667-4a71-aef1-a69a50c249b0>. holes=1 overlaps=0 missing=1 down=0 misc=0
[2013-08-14 12:08:13.835545] W [client-rpc-fops.c:519:client3_3_stat_cbk] 1-vmstore-client-14: remote operation failed: No such file or directory
[2013-08-14 12:08:13.836324] W [client-rpc-fops.c:807:client3_3_statfs_cbk] 1-vmstore-client-14: remote operation failed: No such file



There are some permission-denied errors on one of the bricks.

Brick15 and Brick16 are the newly added bricks; rebalance was invoked after they were added.

No VM pausing was seen, but ext4 corruption was seen on the VM.

(attached the ext4 corruption message snapshot)

Attached the sosreports.

Comment 15 shylesh 2013-08-14 14:05:05 UTC
Created attachment 786542 [details]
ext4 corruption snapshot

Comment 18 Raghavendra Bhat 2013-08-14 16:10:44 UTC
Created attachment 786599 [details]
program which can be run to test the bug
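
The attachment itself is not inlined here. As an illustration only (this is not the attached source, and the mount path is hypothetical), a tester of roughly this shape would surface the failure by writing continuously to a file on the fuse mount while rebalance migrates it:

    #define _GNU_SOURCE  /* for O_DIRECT on Linux */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            /* point this at a file on the fuse mount */
            const char *path = argc > 1 ? argv[1] : "/mnt/vmstore/testfile";

            /* O_DIRECT mimics the remote-dio/qemu I/O pattern */
            int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            void *buf;
            if (posix_memalign(&buf, 4096, 4096))
                    return 1;
            memset(buf, 'x', 4096);

            /* keep writing while a rebalance migrates the file; with
             * the bug present a write eventually fails with EBADF */
            for (off_t off = 0; off < (off_t)4096 * 1048576; off += 4096) {
                    if (pwrite(fd, buf, 4096, off) < 0) {
                            fprintf(stderr, "WRITE => -1 (%s) at offset %jd\n",
                                    strerror(errno), (intmax_t)off);
                            return 2;
                    }
            }
            return 0;
    }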

Comment 21 Raghavendra Bhat 2013-08-19 05:57:08 UTC
The gfid handle for the mentioned gfid can be found at /<brick-path>/.glusterfs/05/ba/05ba73ee-552a-4eb4-9368-6db52bac31ef
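
That location follows the .glusterfs handle layout: the first two hex characters of the gfid name the first directory level, the next two the second, and the full gfid names the handle itself. A small sketch of the derivation (the brick path is a placeholder):

    #include <stdio.h>

    int main(void)
    {
            const char *brick = "/brick1/vss8";  /* placeholder */
            const char *gfid  = "05ba73ee-552a-4eb4-9368-6db52bac31ef";
            char path[512];

            /* handle = <brick>/.glusterfs/<gfid[0:2]>/<gfid[2:4]>/<gfid> */
            snprintf(path, sizeof(path), "%s/.glusterfs/%.2s/%.2s/%s",
                     brick, gfid, gfid + 2, gfid);
            puts(path);  /* -> /brick1/vss8/.glusterfs/05/ba/05ba73ee-... */
            return 0;
    }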

Comment 22 shylesh 2013-08-19 12:18:18 UTC
Marking this bug as verified, as the original issue is no longer reproducible; opening a separate bug for the VM corruption issue.
Verified on 3.4.0.19rhs-2.el6rhs.x86_64.

Comment 23 Scott Haines 2013-09-23 22:29:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

