Bug 923555

Summary: [RHEV-RHS] - remove-brick operation on distribute-replicate RHS 2.1 volume, used as VM image store on RHEV, leads to VM corruption
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rejy M Cyriac <rcyriac>
Component: glusterfs
Assignee: shishir gowda <sgowda>
Status: CLOSED ERRATA
QA Contact: Rejy M Cyriac <rcyriac>
Severity: high
Priority: high
Version: 2.1
CC: amarts, cdhouch, grajaiya, kaushal, nsathyan, rcyriac, rhs-bugs, sdharane, sgowda, shmohan, surs, vbellur
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: glusterfs-3.4.0.19rhs-1
Doc Type: Bug Fix
Clones: 996474 (view as bug list)
Environment: virt rhev integration
Last Closed: 2013-09-23 22:35:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Bug Depends On: 963896
Bug Blocks: 996474

Description Rejy M Cyriac 2013-03-20 05:19:40 UTC
Description of problem:

Towards the end of a remove-brick operation on a distribute-replicate volume used as a VM image store on RHEV, a VM got paused and had to be manually recovered.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Add a distribute-replicate volume to RHEV as a POSIX-compliant FS (posixfs) storage domain.
2. Create and run VMs on the storage domain.
3. Start a remove-brick operation on the volume (see the command sketch after this list).
4. Keep the VMs running and accessing their disks until the remove-brick commit.
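
The remove-brick sequence exercised in steps 3 and 4 is the one shown in the transcripts later in this bug; a minimal sketch (volume, host, and brick names are placeholders):

# gluster volume remove-brick <VOLNAME> <HOST1>:/rhs/brick<N>/<VOLNAME> <HOST2>:/rhs/brick<N>/<VOLNAME> start
# gluster volume remove-brick <VOLNAME> <HOST1>:/rhs/brick<N>/<VOLNAME> <HOST2>:/rhs/brick<N>/<VOLNAME> status
# gluster volume remove-brick <VOLNAME> <HOST1>:/rhs/brick<N>/<VOLNAME> <HOST2>:/rhs/brick<N>/<VOLNAME> commit

'commit' is issued once 'status' reports the migration as completed, while VM I/O continues on the fuse mount throughout.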
  
Actual results:

The VM got paused and had to be manually recovered.

Expected results:

The functioning of the VMs should not be impacted during the operation.

Additional info:

Comment 2 Gowrishankar Rajaiyan 2013-03-20 21:28:56 UTC
Rejy,

Could you please mention the steps used to manually recover the VM ?

Comment 3 Rejy M Cyriac 2013-03-21 09:39:27 UTC
(In reply to comment #2)
> Rejy,
> 
> Could you please mention the steps used to manually recover the VM ?

The VM remained in the paused state until I manually clicked the option to run the VM, and then it came out of the paused state.

- rejy (rmc)

Comment 4 Rejy M Cyriac 2013-03-21 09:42:46 UTC
Version Info:

RHEV-M : 3.1.0-49.el6ev

Hypervisors:

RHEV-H 6.4 (20130306.2.el6_4)
RHEL 6.4
RHEL 6.3

RHS servers: RHS-2.0-20130317.0-RHS-x86_64-DVD1.iso

gluster related rpms:

glusterfs-fuse-3.3.0.6rhs-4.el6rhs.x86_64
vdsm-gluster-4.9.6-19.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-account-1.4.8-5.el6rhs.noarch
glusterfs-3.3.0.6rhs-4.el6rhs.x86_64
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
glusterfs-server-3.3.0.6rhs-4.el6rhs.x86_64
glusterfs-rdma-3.3.0.6rhs-4.el6rhs.x86_64
gluster-swift-object-1.4.8-5.el6rhs.noarch
gluster-swift-container-1.4.8-5.el6rhs.noarch
gluster-swift-doc-1.4.8-5.el6rhs.noarch
gluster-swift-1.4.8-5.el6rhs.noarch
gluster-swift-proxy-1.4.8-5.el6rhs.noarch
glusterfs-geo-replication-3.3.0.6rhs-4.el6rhs.x86_64


I have the sosreport from RHEV-M, Hypervisors, and RHS servers if required.

Comment 5 shishir gowda 2013-03-28 05:57:50 UTC
Please provide the sos reports

Comment 14 Rejy M Cyriac 2013-03-28 19:25:21 UTC
(In reply to comment #5)
> Please provide the sos reports

sosreport attached from all systems that were part of the environment

- rejy (rmc)

Comment 15 Rejy M Cyriac 2013-03-28 20:24:51 UTC
(In reply to comment #14)
> (In reply to comment #5)
> > Please provide the sos reports
> 
> sosreport attached from all systems that were part of the environment
> 
> - rejy (rmc)

I have to warn you that there may be a lot of information in the logs from the systems, since the issue occurred towards the end of a test run of 100+ test cases. Look towards the latter part of the logs for information regarding this issue.

---------

The next round of the same test on RHS-2.0-20130320.2-RHS-x86_64-DVD1.iso (glusterfs*-3.3.0.7rhs-1.el6rhs.x86_64) led to the Data Center and the Storage being unavailable for a long time, with the VMs initially accessible, but not accessible after shutdown. This issue is reported at Bug 928054 

- rejy (rmc)

Comment 16 Kaushal 2013-04-01 06:20:34 UTC
The following are log snippets from the sos-reports,

<snip1>

rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3317:[2013-03-20 10:31:13.253518] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3318:[2013-03-20 10:31:13.253941] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3319:[2013-03-20 10:31:13.254023] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3321:[2013-03-20 10:31:13.255275] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3322:[2013-03-20 10:31:13.255708] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3323:[2013-03-20 10:31:13.255761] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3349:[2013-03-20 10:32:13.359447] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3350:[2013-03-20 10:32:13.359962] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhel6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:3351:[2013-03-20 10:32:13.360090] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied</snip1>

<snip2>

rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2500:[2013-03-20 05:01:20.722058] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2501:[2013-03-20 05:01:20.722634] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2502:[2013-03-20 05:01:20.722734] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2504:[2013-03-20 05:01:20.723463] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2505:[2013-03-20 05:01:20.723943] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2506:[2013-03-20 05:01:20.723992] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2507:[2013-03-20 05:02:20.835389] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-1: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2508:[2013-03-20 05:02:20.835927] W [client3_1-fops.c:1120:client3_1_getxattr_cbk] 0-RHS_VM_imagestore-client-0: remote operation failed: Permission denied. Path: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta (8bde5e62-bd6d-4cd4-ac33-49d4747f3183). Key: trusted.glusterfs.dht.linkto
rhevh6.4/var/log/glusterfs/rhev-data-center-mnt-rhs-client45.lab.eng.blr.redhat.com:_RHS__VM__imagestore.log:2509:[2013-03-20 05:02:20.835996] E [dht-helper.c:652:dht_migration_complete_check_task] 0-RHS_VM_imagestore-dht: /b949cede-515e-483f-a15a-4983e2e5241c/images/94af07c1-3b89-4403-9b89-cdaac51a4c8f/d8320b26-d471-4a79-b6d9-38022b461f95.meta: failed to get the 'linkto' xattr Permission denied

</snip2>

From the above we see that the clients are getting 'Permission denied' when they perform a getxattr fop for the linkto xattr. This was caused by the acl translator incorrectly checking for permissions on an already opened fd. This has been fixed in RHS glusterfs version 3.3.0.7 as part of the fix for bug 918567.
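
For reference, a hedged way to inspect the linkto xattr directly on a brick backend (illustrative only; substitute the brick directory for the volume and the path reported in the log above):

# getfattr --absolute-names -n trusted.glusterfs.dht.linkto -e text /rhs/brick<N>/<VOLNAME>/<path-from-log>

A zero-byte file carrying this xattr, with mode '---------T', is a DHT linkto file pointing at the subvolume that holds the real data; the 'Permission denied' above was returned while the client tried to read exactly this key during its migration-complete checks.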

Comment 17 shishir gowda 2013-04-08 06:56:59 UTC
Is this bug still reproducible on version 3.3.0.7? The fix for bug 918567 has been merged to handle permission-denied issues for getxattr calls.

Comment 18 Rejy M Cyriac 2013-04-16 14:21:36 UTC
(In reply to comment #17)
> Is this bug still reproducible on version 3.3.0.7? The fix for bug 918567
> has been merged to handle permission-denied issues for getxattr calls.

The next round of the same test on RHS-2.0-20130320.2-RHS-x86_64-DVD1.iso (glusterfs*-3.3.0.7rhs-1.el6rhs.x86_64) led to the Data Center and the Storage being unavailable for a long time, with the VMs initially accessible, but not accessible after shutdown. This issue is reported at Bug 928054

Comment 19 Rejy M Cyriac 2013-05-16 12:50:15 UTC
Issue still reproducible on glusterfs*3.4.0.8rhs-1.el6rhs.x86_64

During the multiple runs of remove-brick operations, it was observed that the VMs were brought to a paused state in those operations that involved data migration.

Environment: RHEV+RHS
RHEVM: 3.2.0-10.21.master.el6ev 
Hypervisor: RHEL 6.4
RHS: 4 nodes running gluster*3.4.0.8rhs-1.el6rhs.x86_64
Volume Name: RHEV-BigBend_extra

---------------------------------------------------------------------

[Thu May 16 17:08:23 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client45.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client37.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra start
volume remove-brick start: success
ID: ec497c4e-4b5d-4761-990e-a6a080797116

[Thu May 16 17:10:27 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client45.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client37.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0      completed             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            17             0      completed             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
     rhs-client15.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

[Thu May 16 17:18:45 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client45.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client37.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[Thu May 16 17:19:19 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra start
volume remove-brick start: success
ID: ce758707-d15d-4a6c-b01c-693ff4e140b6

[Thu May 16 17:19:51 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            16             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                3         2.0MB             7             0    in progress             9.00
.....

[Thu May 16 17:28:54 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            16             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                4         5.0GB            16             0    in progress           576.00
[Thu May 16 17:29:27 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            16             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                6        30.0GB            18             0      completed           648.00

[Thu May 16 17:32:04 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[Thu May 16 17:37:01 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra start
volume remove-brick start: success
ID: c6231879-f33e-43f3-b1cf-b095fc351149

[Thu May 16 17:44:22 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick9/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0      completed             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            16             0      completed             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
     rhs-client15.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

[Thu May 16 17:44:34 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[Thu May 16 17:45:48 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra start
volume remove-brick start: success
ID: 0eae71da-3866-4ba6-bc60-32e7134cffb1

[Thu May 16 17:46:26 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            11             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                2         1.0MB             6             0    in progress             4.00

....

[Thu May 16 17:51:28 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            11             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                5        30.0GB            12             0      completed           381.00

[Thu May 16 17:53:02 root@rhs-client45:~ ] #gluster volume remove-brick RHEV-BigBend_extra rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/RHEV-BigBend_extra commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

---------------------------------------------------------------------

Comment 21 Rejy M Cyriac 2013-05-16 13:06:22 UTC
The VMs that were brought to paused state, described in comment 19, were not recoverable even after forcefully shutting them down.

On trying to start the VMs, the following type of messages was displayed, and the VMs remained down.

---------------------------------------------------------------------
	
2013-May-16, 18:32  |  Failed to run VM virtBB03 (User: admin@internal).               |  35d701bd  |  oVirt
2013-May-16, 18:32  |  Failed to run VM virtBB03 on Host rhs-gp-srv15.                 |  35d701bd  |  oVirt
2013-May-16, 18:32  |  VM virtBB03 is down. Exit message: 'truesize'.                  |            |  oVirt
2013-May-16, 18:32  |  VM virtBB03 was started by admin@internal (Host: rhs-gp-srv15). |  35d701bd  |  oVirt
---------------------------------------------------------------------
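
The 'truesize' exit message appears to originate from vdsm on the hypervisor; a hedged way to locate the corresponding traceback, assuming the default vdsm log location:

# grep -B 5 -A 20 -i truesize /var/log/vdsm/vdsm.log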

So this Bug has led to *VM Data Corruption*, and *VM loss*.

Comment 22 Rejy M Cyriac 2013-05-16 13:17:05 UTC
Additional Info:

The VMs that were brought to a paused state, described in comment 19, are no longer removable either, and the status of the VMs' disks is shown as 'Illegal'.

Comment 23 shishir gowda 2013-05-17 06:41:17 UTC
These error messages are seen at the tail end of the logs. The current graph count is 6 (the leading number in the log identifiers, such as '5-RHEV-BigBend_extra-client-13', is the graph id).

[2013-05-16 12:23:26.243298] E [client-handshake.c:1741:client_query_portmap_cbk] 3-RHEV-BigBend_extra-client-14: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-05-16 12:23:26.243361] W [socket.c:515:__socket_rwv] 3-RHEV-BigBend_extra-client-14: readv on 10.70.36.39:24007 failed (No data available)

[2013-05-16 12:23:26.247349] E [client-handshake.c:1741:client_query_portmap_cbk] 4-RHEV-BigBend_extra-client-14: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2013-05-16 12:23:26.247401] W [socket.c:515:__socket_rwv] 4-RHEV-BigBend_extra-client-14: readv on 10.70.36.39:24007 failed (No data available)

[2013-05-16 12:23:26.264549] W [socket.c:515:__socket_rwv] 5-RHEV-BigBend_extra-client-13: readv on 10.70.36.28:24007 failed (No data available)
[2013-05-16 12:23:26.264582] I [client.c:2103:client_rpc_notify] 5-RHEV-BigBend_extra-client-13: disconnected from 10.70.36.28:24007. Client process will keep trying to connect to glusterd until brick's port is available. 
[2013-05-16 12:23:26.268634] W [socket.c:515:__socket_rwv] 3-RHEV-BigBend_extra-client-12: readv on 10.70.36.69:24007 failed (No data available)
[2013-05-16 12:23:26.272808] W [socket.c:515:__socket_rwv] 4-RHEV-BigBend_extra-client-12: readv on 10.70.36.69:24007 failed (No data available)
[2013-05-16 12:23:27.277194] W [socket.c:515:__socket_rwv] 4-RHEV-BigBend_extra-client-13: readv on 10.70.36.61:24007 failed (No data available)

Comment 24 shishir gowda 2013-05-24 05:08:44 UTC
This issue, seen in release glusterfs*3.4.0.8rhs-1.el6rhs.x86_64, might be a duplicate of bug 963896. The regression could lead to migration of data away from subvolumes that are not under decommission. On a commit, the data in the brick being removed would be lost.

Comment 25 shishir gowda 2013-06-04 10:25:27 UTC
The fix for bug 963896 has been merged downstream. The issue was that remove-brick was marking the incorrect brick/subvolume as decommissioned, which leads to data loss after a remove-brick commit.
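
A hedged way to verify which bricks glusterd actually marked for decommission is to look for the 'decommissioned-bricks' option in the DHT section of the generated volfiles on the servers (volfile names and the exact option layout may differ between releases; this is a sketch, not an authoritative procedure):

# grep -r -A 2 "decommissioned-bricks" /var/lib/glusterd/vols/<VOLNAME>/

The bricks listed there should match the ones passed to 'remove-brick start'; a mismatch would reproduce the wrong-subvolume behaviour described in this comment.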

Comment 26 Rejy M Cyriac 2013-06-28 10:46:44 UTC
Issue remains.

VMs irrecoverably went into a paused state, during and after remove-brick operations, on a 10x2 distribute-replicate volume used as an image store in a RHEVM+RHS environment with:

RHEVM 3.3 : 3.3.0-0.4.master.el6ev
RHS 2.1 : 3.4.0.12rhs-1.el6rhs.x86_64
Hypervisor: RHEL 6.4 + glusterfs*-3.4.0.12rhs-1.el6rhs.x86_64

Two separate runs were attempted during the test.

In the first run, the 'remove-brick start' command finished very quickly, and the VMs went into a paused state after the 'remove-brick commit' command was run. It appears that the data on the removed bricks failed to be migrated. The VMs were irrecoverable from the paused state, or after being powered down. Given below is the output of commands from the RHS server.

---------------------------------------------------------------------------

[root@rhs-client45 ~]# gluster volume info
 
Volume Name: BendVol
Type: Distributed-Replicate
Volume ID: c2158e6b-4072-417a-8259-b9b073e0c3c4
Status: Started
Number of Bricks: 10 x 2 = 20
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick2: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick3: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick5: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick6: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick7: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick8: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick9: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick10: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick11: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick12: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick13: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick14: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick15: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick16: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick17: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Brick18: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Brick19: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Brick20: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[root@rhs-client45 ~]# ls /rhs/brick8/BendVol/
1071de86-7917-48a7-8063-7a9cc82c598f/ __DIRECT_IO_TEST__                    .glusterfs/   
                        
[root@rhs-client45 ~]# ls /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/
2146b854-28a9-473e-80ad-47224a798619  51af6f3d-4e84-4cf1-8a1f-2762e4b42487  94382a3a-8554-43cc-a3ee-ddc2b4d595ef  d81d9131-2c29-42d3-9bfa-d7071030e739
225cb764-22aa-435d-b802-794cc5e1bdc7  8f775a74-e300-4512-9ed8-a4d5b852eb3d  95c77039-aca1-43c4-bf20-46852fe6d1de

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/51af6f3d-4e84-4cf1-8a1f-2762e4b42487/
total 12G
-rw-rw----. 2 vdsm kvm 12G Jun 28 12:18 42ee340f-3086-4721-b726-4adf5966c9f6

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/8f775a74-e300-4512-9ed8-a4d5b852eb3d/
total 15G
-rw-rw----. 2 vdsm kvm 15G Jun 28 12:18 6840ccf1-9787-44e8-9907-6cad97a6e2ea

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/95c77039-aca1-43c4-bf20-46852fe6d1de/
total 4.0K
-rw-r--r--. 2 vdsm kvm 274 Jun 27 18:43 3a55e8ef-5a4d-4358-ac33-436f2cba2a13.meta

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/BendVol start
volume remove-brick start: success
ID: 9144c90b-f578-4129-a8fc-be10fff1dd2c

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/BendVol status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                1        0Bytes             2             0      completed             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            34             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/BendVol status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                1        0Bytes             2             0      completed             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes            34             0      completed             0.00
     rhs-client15.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client45.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client37.lab.eng.blr.redhat.com:/rhs/brick8/BendVol commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[root@rhs-client45 ~]# gluster volume info
 
Volume Name: BendVol
Type: Distributed-Replicate
Volume ID: c2158e6b-4072-417a-8259-b9b073e0c3c4
Status: Started
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick2: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick3: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick5: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick6: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick7: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick8: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick9: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick10: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick11: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick12: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick13: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick14: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick15: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick16: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick17: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Brick18: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[root@rhs-client45 ~]# gluster volume status
Status of volume: BendVol
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9702
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9786
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	10080
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/Be
ndVol							49155	Y	10217
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9711
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9795
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	10089
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/Be
ndVol							49156	Y	10226
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9720
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9804
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	10098
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/Be
ndVol							49157	Y	10235
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9729
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9813
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	10107
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/Be
ndVol							49158	Y	10244
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/B
endVol							49159	Y	10116
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/Be
ndVol							49159	Y	10253
NFS Server on localhost					2049	Y	19422
Self-heal Daemon on localhost				N/A	Y	19429
NFS Server on 9c1c5f38-19d0-475e-897c-d88f651a54ba	2049	Y	19440
Self-heal Daemon on 9c1c5f38-19d0-475e-897c-d88f651a54b
a							N/A	Y	19447
NFS Server on 49257f07-7344-4b00-9ff3-544959419579	2049	Y	20241
Self-heal Daemon on 49257f07-7344-4b00-9ff3-54495941957
9							N/A	Y	20248
NFS Server on 22b94b39-514d-4986-8e37-36322a08b9c1	2049	Y	20112
Self-heal Daemon on 22b94b39-514d-4986-8e37-36322a08b9c
1							N/A	Y	20119
 
There are no active volume tasks

[root@rhs-client45 ~]# -------------------------------------------------------------^C
 
[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/
total 0
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 19:57 2146b854-28a9-473e-80ad-47224a798619
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 18:41 225cb764-22aa-435d-b802-794cc5e1bdc7
drwxr-xr-x. 2 vdsm kvm 49 Jun 27 18:53 51af6f3d-4e84-4cf1-8a1f-2762e4b42487
drwxr-xr-x. 2 vdsm kvm 49 Jun 27 18:40 8f775a74-e300-4512-9ed8-a4d5b852eb3d
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 18:55 94382a3a-8554-43cc-a3ee-ddc2b4d595ef
drwxr-xr-x. 2 vdsm kvm 54 Jun 27 18:43 95c77039-aca1-43c4-bf20-46852fe6d1de
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 18:38 d81d9131-2c29-42d3-9bfa-d7071030e739

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/51af6f3d-4e84-4cf1-8a1f-2762e4b42487/
total 12G
-rw-rw----. 2 vdsm kvm 12G Jun 28 12:26 42ee340f-3086-4721-b726-4adf5966c9f6

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/8f775a74-e300-4512-9ed8-a4d5b852eb3d/
total 15G
-rw-rw----. 2 vdsm kvm 15G Jun 28 12:26 6840ccf1-9787-44e8-9907-6cad97a6e2ea

[root@rhs-client45 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/95c77039-aca1-43c4-bf20-46852fe6d1de/
total 4.0K
-rw-r--r--. 2 vdsm kvm 274 Jun 27 18:43 3a55e8ef-5a4d-4358-ac33-436f2cba2a13.meta

---------------------------------------------------------------------------
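
A minimal pre-commit sanity check, assuming the brick layout shown above: before issuing 'remove-brick ... commit', confirm that nothing except zero-byte linkto files (sticky-bit mode, '---------T') remains on the bricks being removed.

# find /rhs/brick8/BendVol -path '*/.glusterfs' -prune -o -type f ! -perm /01000 -size +0c -print

Any file printed by this (such as the 12G and 15G images listed above) is real data that was not migrated and would be orphaned by the commit.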

Given below are the messages noticed on the RHEVM UI during the events, and finally when one of the VMs was removed.

---------------------------------------------------------------------------

ID	139
Time	2013-Jun-28, 12:27
Message	VM Snaf4 has paused due to unknown storage error.

ID	139
Time	2013-Jun-28, 12:27
Message	VM VQ2 has paused due to unknown storage error.

ID	119
Time	2013-Jun-28, 12:33
Message	VM VQ2 is down. Exit message: 'truesize'.

ID	119
Time	2013-Jun-28, 12:34
Message	VM Snaf4 is down. Exit message: 'truesize'.

---------------------------------------------------------------------------

On the second run of the test, the 'remove-brick start' command took some time to finish, and during its run the VM got paused. The 'remove-brick commit' was run after the completion of the first stage. The VM was successfully resumed from the paused state, but after a shutdown, the VM was irrecoverable. Given below is the output of commands from the RHS server.

---------------------------------------------------------------------------

[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/
1071de86-7917-48a7-8063-7a9cc82c598f/ __DIRECT_IO_TEST__                    .glusterfs/
                           
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/
total 0
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 19:57 2146b854-28a9-473e-80ad-47224a798619
drwxr-xr-x. 2 vdsm kvm 54 Jun 27 18:42 225cb764-22aa-435d-b802-794cc5e1bdc7
drwxr-xr-x. 2 vdsm kvm 55 Jun 28 12:27 51af6f3d-4e84-4cf1-8a1f-2762e4b42487
drwxr-xr-x. 2 vdsm kvm 54 Jun 27 18:42 8f775a74-e300-4512-9ed8-a4d5b852eb3d
drwxr-xr-x. 2 vdsm kvm  6 Jun 27 18:55 94382a3a-8554-43cc-a3ee-ddc2b4d595ef
drwxr-xr-x. 2 vdsm kvm  6 Jun 28 12:27 95c77039-aca1-43c4-bf20-46852fe6d1de
drwxr-xr-x. 2 vdsm kvm 55 Jun 27 18:43 d81d9131-2c29-42d3-9bfa-d7071030e739

[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/225cb764-22aa-435d-b802-794cc5e1bdc7/
total 4.0K
-rw-r--r--. 2 vdsm kvm 274 Jun 27 18:42 6fdc1ab8-7f03-4726-9b07-2ef528f8132e.meta

[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/51af6f3d-4e84-4cf1-8a1f-2762e4b42487/
total 1.0M
-rw-rw----. 2 vdsm kvm 1.0M Jun 27 18:55 42ee340f-3086-4721-b726-4adf5966c9f6.lease

[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/8f775a74-e300-4512-9ed8-a4d5b852eb3d/
total 0
---------T. 2 root root 0 Jun 27 18:42 6840ccf1-9787-44e8-9907-6cad97a6e2ea.meta

[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/d81d9131-2c29-42d3-9bfa-d7071030e739/
total 1.0M
-rw-rw----. 2 vdsm kvm 1.0M Jun 27 18:43 52e9b2b6-f7c0-4e50-b333-51413571e487.lease


[root@rhs-client45 ~]# -----------------------------------------------------------------------^C
[root@rhs-client45 ~]# gluster volume info
 
Volume Name: BendVol
Type: Distributed-Replicate
Volume ID: c2158e6b-4072-417a-8259-b9b073e0c3c4
Status: Started
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick2: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick3: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick5: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick6: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick7: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick8: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick9: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick10: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick11: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick12: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick13: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick14: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick15: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick16: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick17: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Brick18: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[root@rhs-client45 ~]# gluster volume status
Status of volume: BendVol
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9702
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9786
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	10080
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/Be
ndVol							49155	Y	10217
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9711
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9795
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	10089
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/Be
ndVol							49156	Y	10226
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9720
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9804
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	10098
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/Be
ndVol							49157	Y	10235
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9729
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9813
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	10107
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/Be
ndVol							49158	Y	10244
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/B
endVol							49159	Y	10116
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/Be
ndVol							49159	Y	10253
NFS Server on localhost					2049	Y	19422
Self-heal Daemon on localhost				N/A	Y	19429
NFS Server on 49257f07-7344-4b00-9ff3-544959419579	2049	Y	20241
Self-heal Daemon on 49257f07-7344-4b00-9ff3-54495941957
9							N/A	Y	20248
NFS Server on 22b94b39-514d-4986-8e37-36322a08b9c1	2049	Y	20112
Self-heal Daemon on 22b94b39-514d-4986-8e37-36322a08b9c
1							N/A	Y	20119
NFS Server on 9c1c5f38-19d0-475e-897c-d88f651a54ba	2049	Y	19440
Self-heal Daemon on 9c1c5f38-19d0-475e-897c-d88f651a54b
a							N/A	Y	19447
 
There are no active volume tasks

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol start
volume remove-brick start: success
ID: c9d75e81-f6ec-4967-8e6b-2b24559d51bc

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
     rhs-client15.lab.eng.blr.redhat.com                6         2.0MB            13             0    in progress             9.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            30             0      completed             0.00

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
     rhs-client15.lab.eng.blr.redhat.com               15        12.0GB            36             0      completed           239.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes            30             0      completed             0.00

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

[root@rhs-client45 ~]# gluster volume remove-brick BendVol rhs-client15.lab.eng.blr.redhat.com:/rhs/brick8/BendVol rhs-client4.lab.eng.blr.redhat.com:/rhs/brick8/BendVol status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0    not started             0.00
     rhs-client37.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
     rhs-client15.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00
      rhs-client4.lab.eng.blr.redhat.com                0        0Bytes             0             0    not started             0.00

[root@rhs-client45 ~]# gluster volume info
 
Volume Name: BendVol
Type: Distributed-Replicate
Volume ID: c2158e6b-4072-417a-8259-b9b073e0c3c4
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick2: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick3: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/BendVol
Brick5: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick6: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick7: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick8: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/BendVol
Brick9: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick10: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick11: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick12: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/BendVol
Brick13: rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick14: rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick15: rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Brick16: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/BendVol
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[root@rhs-client45 ~]# gluster volume status
Status of volume: BendVol
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9702
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	9786
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick4/B
endVol							49155	Y	10080
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick4/Be
ndVol							49155	Y	10217
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9711
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	9795
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick5/B
endVol							49156	Y	10089
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick5/Be
ndVol							49156	Y	10226
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9720
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	9804
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick6/B
endVol							49157	Y	10098
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/Be
ndVol							49157	Y	10235
Brick rhs-client45.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9729
Brick rhs-client37.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	9813
Brick rhs-client15.lab.eng.blr.redhat.com:/rhs/brick7/B
endVol							49158	Y	10107
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/Be
ndVol							49158	Y	10244
NFS Server on localhost					2049	Y	19656
Self-heal Daemon on localhost				N/A	Y	19663
NFS Server on 9c1c5f38-19d0-475e-897c-d88f651a54ba	2049	Y	19634
Self-heal Daemon on 9c1c5f38-19d0-475e-897c-d88f651a54b
a							N/A	Y	19641
NFS Server on 49257f07-7344-4b00-9ff3-544959419579	2049	Y	20426
Self-heal Daemon on 49257f07-7344-4b00-9ff3-54495941957
9							N/A	Y	20433
NFS Server on 22b94b39-514d-4986-8e37-36322a08b9c1	2049	Y	20304
Self-heal Daemon on 22b94b39-514d-4986-8e37-36322a08b9c
1							N/A	Y	20311
 
There are no active volume tasks

[root@rhs-client15 ~]# ---------------------------------------------------^C
[root@rhs-client15 ~]# 
[root@rhs-client15 ~]# 
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/2146b854-28a9-473e-80ad-47224a798619/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/225cb764-22aa-435d-b802-794cc5e1bdc7/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/51af6f3d-4e84-4cf1-8a1f-2762e4b42487/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/8f775a74-e300-4512-9ed8-a4d5b852eb3d/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/94382a3a-8554-43cc-a3ee-ddc2b4d595ef/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/95c77039-aca1-43c4-bf20-46852fe6d1de/
total 0
[root@rhs-client15 ~]# ls -lh /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/d81d9131-2c29-42d3-9bfa-d7071030e739/
total 0
[root@rhs-client15 ~]#

---------------------------------------------------------------------------
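
The zero-byte entry with mode '---------T' shown for rhs-client15 before the second remove-brick run (6840ccf1-9787-44e8-9907-6cad97a6e2ea.meta) has the signature of a DHT linkto file. While such a file is present, a hedged way to dump its DHT xattrs on the brick (illustrative; the path is the one from the listing above):

# getfattr --absolute-names -d -m 'trusted.glusterfs.dht' -e hex /rhs/brick8/BendVol/1071de86-7917-48a7-8063-7a9cc82c598f/images/8f775a74-e300-4512-9ed8-a4d5b852eb3d/6840ccf1-9787-44e8-9907-6cad97a6e2ea.meta

The hex-decoded value of trusted.glusterfs.dht.linkto names the client subvolume that holds (or held) the real file.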

Given below are the messages noticed on the RHEVM UI during the events. There was no error during the removal of the VM in this case.

---------------------------------------------------------------------------

ID	139
Time	2013-Jun-28, 12:40
Message	VM VQ1 has paused due to unknown storage error.

ID	119
Time	2013-Jun-28, 12:49
Message	VM VQ1 is down. Exit message: 'truesize'.

---------------------------------------------------------------------------

Comment 34 Rejy M Cyriac 2013-07-10 15:44:11 UTC
This BZ was initially opened for RHS 2.0+, but it has now inadvertently evolved into handling another issue, also caused by the remove-brick operation, that is valid only on RHS 2.1 and results in the VMs getting corrupted.

So another BZ, 983145, has been opened to deal with the original issue afresh on RHS 2.0+; that issue is still reproducible, leads to intermittent instances of paused VMs, and may lead to loss of data that was not synced.

Comment 35 Rejy M Cyriac 2013-07-24 12:03:03 UTC
Reproduced issue on glusterfs-server-3.4.0.12rhs.beta6-1.el6rhs.x86_64

The relevant log and xattr information is given below:

------------------------------------------------------------------------

[2013-07-24 11:11:49.503538] D [afr-common.c:1385:afr_lookup_select_read_child] 0-Hacker-replicate-5: Source selected as 0 for /
[2013-07-24 11:11:49.503545] D [afr-common.c:1122:afr_lookup_build_response_params] 0-Hacker-replicate-5: Building lookup response from 0
[2013-07-24 11:11:49.503577] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-Hacker-replicate-4: pending_matrix: [ 0 0 ]
[2013-07-24 11:11:49.503586] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-Hacker-replicate-4: pending_matrix: [ 0 0 ]
[2013-07-24 11:11:49.503592] D [afr-self-heal-common.c:887:afr_mark_sources] 0-Hacker-replicate-4: Number of sources: 0
[2013-07-24 11:11:49.503598] D [afr-self-heal-data.c:929:afr_lookup_select_read_child_by_txn_type] 0-Hacker-replicate-4: returning read_child: 0
[2013-07-24 11:11:49.503605] D [afr-common.c:1385:afr_lookup_select_read_child] 0-Hacker-replicate-4: Source selected as 0 for /
[2013-07-24 11:11:49.503611] D [afr-common.c:1122:afr_lookup_build_response_params] 0-Hacker-replicate-4: Building lookup response from 0
[2013-07-24 11:11:49.503654] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-Hacker-replicate-6: pending_matrix: [ 0 0 ]
[2013-07-24 11:11:49.503663] D [afr-self-heal-common.c:138:afr_sh_print_pending_matrix] 0-Hacker-replicate-6: pending_matrix: [ 0 0 ]
[2013-07-24 11:11:49.503670] D [afr-self-heal-common.c:887:afr_mark_sources] 0-Hacker-replicate-6: Number of sources: 0
[2013-07-24 11:11:49.503676] D [afr-self-heal-data.c:929:afr_lookup_select_read_child_by_txn_type] 0-Hacker-replicate-6: returning read_child: 1
[2013-07-24 11:11:49.503682] D [afr-common.c:1385:afr_lookup_select_read_child] 0-Hacker-replicate-6: Source selected as 1 for /
[2013-07-24 11:11:49.503689] D [afr-common.c:1122:afr_lookup_build_response_params] 0-Hacker-replicate-6: Building lookup response from 1
[2013-07-24 11:11:49.503720] I [dht-rebalance.c:1106:gf_defrag_migrate_data] 0-Hacker-dht: migrate data called on /
[2013-07-24 11:11:49.504432] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-0: /: no entries found in Hacker-client-0
[2013-07-24 11:11:49.504494] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-0: /: no entries found in Hacker-client-1
[2013-07-24 11:11:49.504697] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-5: /: no entries found in Hacker-client-10
[2013-07-24 11:11:49.504724] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-2: /: no entries found in Hacker-client-5
[2013-07-24 11:11:49.504920] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-5: /: no entries found in Hacker-client-11
[2013-07-24 11:11:49.504944] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-4: /: no entries found in Hacker-client-8
[2013-07-24 11:11:49.504957] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-2: /: no entries found in Hacker-client-4
[2013-07-24 11:11:49.504972] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-1: /: no entries found in Hacker-client-3
[2013-07-24 11:11:49.504998] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-3: /: no entries found in Hacker-client-6
[2013-07-24 11:11:49.505011] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-1: /: no entries found in Hacker-client-2
[2013-07-24 11:11:49.505027] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-3: /: no entries found in Hacker-client-7
[2013-07-24 11:11:49.505043] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-6: /: no entries found in Hacker-client-12
[2013-07-24 11:11:49.505055] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-6: /: no entries found in Hacker-client-13
[2013-07-24 11:11:49.505082] D [afr-dir-read.c:126:afr_examine_dir_readdir_cbk] 0-Hacker-replicate-4: /: no entries found in Hacker-client-9
[2013-07-24 11:11:49.505160] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-0: Returning 0, call_child: 0, last_index: -1
[2013-07-24 11:11:49.505484] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-1: Returning 0, call_child: 1, last_index: -1
[2013-07-24 11:11:49.505783] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-2: Returning 0, call_child: 0, last_index: -1
[2013-07-24 11:11:49.505958] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-3: Returning 0, call_child: 1, last_index: -1
[2013-07-24 11:11:49.506242] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-4: Returning 0, call_child: 0, last_index: -1
[2013-07-24 11:11:49.506489] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-5: Returning 0, call_child: 0, last_index: -1
[2013-07-24 11:11:49.506652] D [afr-common.c:745:afr_get_call_child] 0-Hacker-replicate-6: Returning 0, call_child: 1, last_index: -1
[2013-07-24 11:11:49.507006] I [dht-rebalance.c:1311:gf_defrag_migrate_data] 0-Hacker-dht: Migration operation on dir / took 0.00 secs

------------------------------------------------------------------------

------------------------------------------------------------------------

[root@rhs-client4 glusterfs]# ls -l /rhs/brick*/Hacker
/rhs/brick1/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick3/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick5/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick7/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

[root@rhs-client4 glusterfs]# ls -l /rhs/brick*/Hacker/*
/rhs/brick1/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  28 Jul 24 10:30 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick3/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  29 Jul 24 16:41 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick5/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  34 Jul 24 16:28 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick7/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  21 Jul 24 10:42 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master


[root@rhs-client4 glusterfs]# getfattr -m . -d -e hex /rhs/brick*/Hacker/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000555555547ffffffd
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick3/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000d5555552ffffffff
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick5/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffeaaaaaaa7
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick7/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14


[root@rhs-client10 glusterfs]# getfattr -m . -d -e hex /rhs/brick*/Hacker/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000555555547ffffffd
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick3/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000d5555552ffffffff
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick5/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffeaaaaaaa7
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick7/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

[root@rhs-client15 ~]# getfattr -m . -d -e hex /rhs/brick*/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da/master/tasks/*
getfattr: /rhs/brick*/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da/master/tasks/*: No such file or directory
[root@rhs-client15 ~]# getfattr -m . -d -e hex /rhs/brick*/Hacker/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002aaaaaaa55555553
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick4/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000002aaaaaa9
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick6/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick8/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8d5555551
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

[root@rhs-client37 ~]# getfattr -m . -d -e hex /rhs/brick*/Hacker/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002aaaaaaa55555553
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick4/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000002aaaaaa9
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick6/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

# file: rhs/brick8/Hacker/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8d5555551
trusted.glusterfs.volume-id=0x50fa570f609c4bd89ebd53e4ff360c14

------------------------------------------------------------------------
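For readers following the layout values above, here is a minimal sketch of decoding a trusted.glusterfs.dht value. It assumes that the last two 32-bit words of the value encode the start and end of that brick's DHT hash range (this interpretation is an assumption, not something stated in this report):

# Minimal sketch, not part of the original report. Assumption: the last two
# 32-bit words of the trusted.glusterfs.dht value are the start and end of
# the brick's DHT hash range, stored in network byte order.
xattr=0x0000000100000000555555547ffffffd   # value seen on rhs/brick1/Hacker above
hex=${xattr#0x}
echo "hash range: 0x${hex:16:8} .. 0x${hex:24:8}"

Read this way, the bricks whose value ends in sixteen zero digits (brick7 on rhs-client4/rhs-client10 and brick6 on rhs-client15/rhs-client37) carry an empty hash range, which would be consistent with bricks that have been decommissioned by a remove-brick operation.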

Comment 37 Rejy M Cyriac 2013-07-24 12:19:33 UTC
The directory layout information from the RHS servers rhs-client10, rhs-client15, and rhs-client37 was missing from comment 35. Adding it here.

---------------------------------------------------------------

[root@rhs-client10 ~]# ls -l /rhs/brick*/Hacker/*
/rhs/brick1/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  16 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick3/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  29 Jul 24 16:41 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick5/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  21 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick7/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  19 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master
[root@rhs-client10 ~]# ls -l /rhs/brick*/Hacker
/rhs/brick1/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick3/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick5/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick7/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da
[root@rhs-client10 ~]# 

---------------------------------------------------------------

---------------------------------------------------------------

[root@rhs-client15 ~]# ls -l /rhs/brick*/Hacker/*
/rhs/brick2/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  19 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick4/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm   6 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick6/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm   6 Jul 24 16:04 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick8/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  18 Jul 24 16:04 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master
[root@rhs-client15 ~]# ls -l /rhs/brick*/Hacker
/rhs/brick2/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick4/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick6/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick8/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da
[root@rhs-client15 ~]# 

---------------------------------------------------------------

---------------------------------------------------------------

[root@rhs-client37 ~]# ls -l /rhs/brick*/Hacker/*
/rhs/brick2/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  19 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick4/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm   6 Jul 24 17:36 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick6/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm   6 Jul 24 16:04 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master

/rhs/brick8/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da:
total 0
drwxr-xr-x  2 vdsm kvm  18 Jul 24 16:04 dom_md
drwxr-xr-x 10 vdsm kvm 350 Jul 24 11:08 images
drwxr-xr-x  4 vdsm kvm  28 Jul 24 10:30 master
[root@rhs-client37 ~]# ls -l /rhs/brick*/Hacker
/rhs/brick2/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick4/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick6/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da

/rhs/brick8/Hacker:
total 0
drwxr-xr-x 5 vdsm kvm 45 Jul 24 10:30 fc59491a-b214-4023-a7d3-59ff6b0f25da
[root@rhs-client37 ~]# 

---------------------------------------------------------------

Comment 38 shishir gowda 2013-07-24 12:22:37 UTC
It looks like the rebalance/remove-brick process is not getting any entries reported in the readdir fop. I have verified the back-end bricks and can confirm that files/directories are present on all of them. It appears that the underlying subvolume is not returning any entries.
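One way to illustrate the mismatch described above (a sketch only, not the exact commands used in this report; the mount point is a placeholder) is to compare the entries present on a back-end brick with what a fresh FUSE client's readdir returns for the same directory:

# Sketch: compare brick contents with the client view via readdir.
mkdir -p /mnt/hacker-check
mount -t glusterfs rhs-client4:/Hacker /mnt/hacker-check
# entries as stored on one back-end brick
ls -la /rhs/brick1/Hacker/fc59491a-b214-4023-a7d3-59ff6b0f25da/images/
# entries as returned to the client through readdir
ls -la /mnt/hacker-check/fc59491a-b214-4023-a7d3-59ff6b0f25da/images/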

Comment 41 Rejy M Cyriac 2013-07-26 10:31:05 UTC
Since all earlier reproductions of the issue involved Red Hat Storage servers on physical systems, the issue was also tested in an environment with the Red Hat Storage servers installed on virtual machines.

The issue was *reproduced* in this set-up as well.

Comment 44 Amar Tumballi 2013-07-30 11:58:58 UTC
Spent some time on this issue today with Avati and Shishir, and it seems that this particular case has come up in Big Bend because of 'open-behind' (note that 'gluster volume set <VOL> group virt' did not disable open-behind). With build glusterfs-3.4.0.13rhs, open-behind is off by default on the volume, so can we get a final round of testing done on this?
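For anyone re-running this test, a sketch of how open-behind can be explicitly disabled and then checked via the Gluster CLI (the volume name is taken from the logs above; note that 'gluster volume info' lists only options that have been explicitly reconfigured, so a default-on open-behind may not appear in its output):

gluster volume set Hacker performance.open-behind off   # disable open-behind explicitly
gluster volume info Hacker | grep -i open-behind        # confirm the reconfigured value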

Comment 45 shishir gowda 2013-08-12 05:13:54 UTC
Bug 988262 seems to be similar to this one.

Comment 46 Sachidananda Urs 2013-08-12 12:25:55 UTC
As per Rich and Sayan's triage, removing the blocker flag.

Comment 48 Sachidananda Urs 2013-08-12 15:00:00 UTC
Rejy, the reason for removing this blocker is:
"Introduced by open-behind; not blocker, since not default nor recommended for virtualization case"

Comment 49 Rejy M Cyriac 2013-08-12 15:39:57 UTC
(In reply to Sachidananda Urs from comment #48)
> Rejy, the reason for removing this blocker is:
> "Introduced by open-behind; not blocker, since not default nor recommended
> for virtualization case"

The test environment that reproduces the bug uses the 'virt' group to set the volume options, as recommended. The rest of the volume options are at their defaults for the particular build. No volume options are being set manually.

If the default volume options have changed in the latest build, it needs to be tested whether that change fixes the reported issue. If a fix for the reported bug is considered to be in place with any new patch, the BZ needs to be moved to 'ON_QA', with information on the build that contains the patch.

Comment 50 Rejy M Cyriac 2013-08-12 15:43:17 UTC
(In reply to Rejy M Cyriac from comment #49)
> (In reply to Sachidananda Urs from comment #48)
> > Rejy, the reason for removing this blocker is:
> > "Introduced by open-behind; not blocker, since not default nor recommended
> > for virtualization case"
> 
> The test environment that reproduces the bug uses the 'virt' group to set
> the volume options, as recommended. The rest of the volume options are at
> their defaults for the particular build. No volume options are being set
> manually.
> 
> If the default volume options have changed in the latest build, it needs to
> be tested whether that change fixes the reported issue. If a fix for the
> reported bug is considered to be in place with any new patch, the BZ needs
> to be moved to 'ON_QA', with information on the build that contains the
> patch.

I need to add that two volume options are being set manually: the ones that set the user and group ownership to 36:36. These are also set as per the recommendation.

storage.owner-uid 36
storage.owner-gid 36

No other volume options are being set manually.
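
For reference, a sketch of how these options are applied via the Gluster CLI (the volume name is a placeholder taken from the logs in comment 35):

gluster volume set Hacker group virt             # apply the recommended 'virt' option group
gluster volume set Hacker storage.owner-uid 36   # vdsm user
gluster volume set Hacker storage.owner-gid 36   # kvm group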

Comment 52 Rejy M Cyriac 2013-08-22 13:17:16 UTC
Verified that the remove-brick operation now migrates the data as expected, and that the VMs stay online and available after completion of the operation.


Test Environment versions:
RHS - glusterfs-server-3.4.0.21rhs-1.el6rhs.x86_64
6X2 Distribute-Replicate Volume used as a Storage Domain
Red Hat Enterprise Virtualization Manager Version: 3.2.2-0.41.el6ev
RHEVH-6.4 Hypervisor with glusterfs-3.4.0.21rhs-1.el6_4.x86_64
RHEL-6.4 Hypervisor with glusterfs-3.4.0.21rhs-1.el6_4.x86_64
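
For reference, the remove-brick sequence exercised in this verification follows the standard Gluster CLI workflow; a sketch is given below (the volume name and the replica pair shown are illustrative placeholders modeled on earlier comments, not the exact bricks used here):

gluster volume remove-brick Hacker \
    rhs-client4:/rhs/brick7/Hacker rhs-client10:/rhs/brick7/Hacker start
gluster volume remove-brick Hacker \
    rhs-client4:/rhs/brick7/Hacker rhs-client10:/rhs/brick7/Hacker status
# commit only after the status output reports that data migration has completed
gluster volume remove-brick Hacker \
    rhs-client4:/rhs/brick7/Hacker rhs-client10:/rhs/brick7/Hacker commit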

Comment 53 Scott Haines 2013-09-23 22:35:13 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html