Bug 928054 - remove-brick operation led to Storage Domain and Data Center being brought down to non-responsive state for a long time
Keywords:
Status: CLOSED DUPLICATE of bug 924572
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: shishir gowda
QA Contact: Rejy M Cyriac
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-03-26 19:25 UTC by Rejy M Cyriac
Modified: 2013-12-09 01:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-09 09:52:15 UTC
Embargoed:



Description Rejy M Cyriac 2013-03-26 19:25:39 UTC
Description of problem:

On a 6x2 distributed-replicate volume used as the Storage Domain for a VM image store, two more replica pairs were added to make it 8x2, and a rebalance operation was performed. The VMs stayed online, and the Storage Domain and Data Center were healthy after the operation.

-----------------------------------------------------------

[root@rhs-client45 ~]# gluster volume add-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4
Add Brick successful
[root@rhs-client45 ~]# gluster volume info
 
Volume Name: virtVOL
Type: Distributed-Replicate
Volume ID: 689aa65d-b49a-42f8-a20f-6bac6e116d6b
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/VM_brick1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/VM_brick1
Brick3: rhs-client15.lab.eng.blr.redhat.com:/VM_brick1
Brick4: rhs-client10.lab.eng.blr.redhat.com:/VM_brick1
Brick5: rhs-client45.lab.eng.blr.redhat.com:/VM_brick2
Brick6: rhs-client37.lab.eng.blr.redhat.com:/VM_brick2
Brick7: rhs-client15.lab.eng.blr.redhat.com:/VM_brick2
Brick8: rhs-client10.lab.eng.blr.redhat.com:/VM_brick2
Brick9: rhs-client45.lab.eng.blr.redhat.com:/VM_brick3
Brick10: rhs-client37.lab.eng.blr.redhat.com:/VM_brick3
Brick11: rhs-client15.lab.eng.blr.redhat.com:/VM_brick3
Brick12: rhs-client10.lab.eng.blr.redhat.com:/VM_brick3
Brick13: rhs-client45.lab.eng.blr.redhat.com:/VM_brick4
Brick14: rhs-client37.lab.eng.blr.redhat.com:/VM_brick4
Brick15: rhs-client15.lab.eng.blr.redhat.com:/VM_brick4
Brick16: rhs-client10.lab.eng.blr.redhat.com:/VM_brick4
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off


[root@rhs-client45 ~]# gluster volume rebalance virtVOL start
Starting rebalance on volume virtVOL has been successful

[root@rhs-client45 ~]# gluster volume rebalance virtVOL status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                0            0            8            0    in progress
     rhs-client10.lab.eng.blr.redhat.com                0            0           24            0    in progress
     rhs-client15.lab.eng.blr.redhat.com                0            0            1            0    in progress
     rhs-client37.lab.eng.blr.redhat.com                0            0           24            0    in progress

[root@rhs-client45 ~]# gluster volume rebalance virtVOL status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                2      2097152           12            0    in progress
     rhs-client15.lab.eng.blr.redhat.com                5      3150839           30            4      completed
     rhs-client37.lab.eng.blr.redhat.com                0            0           25            0      completed
     rhs-client10.lab.eng.blr.redhat.com                0            0           25            0      completed


[root@rhs-client45 ~]# gluster volume rebalance virtVOL status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost               10  48322580942           36            1      completed
     rhs-client37.lab.eng.blr.redhat.com                0            0           25            0      completed
     rhs-client15.lab.eng.blr.redhat.com                5      3150839           30            4      completed
     rhs-client10.lab.eng.blr.redhat.com                0            0           25            0      completed

-----------------------------------------------------------

A remove-brick operation was then performed to remove one replica pair (an initial attempt that named bricks from two different replica subvolumes was rejected by the CLI). Judging by the status messages it appeared to go well, and the VMs stayed online.

-----------------------------------------------------------

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) n

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4 start
Bricks not from same subvol for replica

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 start
 
Remove Brick start successful

[root@rhs-client45 ~]# 
[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                2      1048576           20            0    in progress
     rhs-client37.lab.eng.blr.redhat.com                0            0           28            0      completed
     rhs-client15.lab.eng.blr.redhat.com                5      3150839           30            4    not started
     rhs-client10.lab.eng.blr.redhat.com                0            0           25            0    not started

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                9  10739796172           31            0      completed
     rhs-client37.lab.eng.blr.redhat.com                0            0           28            0      completed
     rhs-client15.lab.eng.blr.redhat.com                5      3150839           30            4    not started
     rhs-client10.lab.eng.blr.redhat.com                0            0           25            0    not started

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick commit successful

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client45.lab.eng.blr.redhat.com:/VM_brick4 rhs-client37.lab.eng.blr.redhat.com:/VM_brick4 status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                9  10739796172           31            0    not started
     rhs-client37.lab.eng.blr.redhat.com                0            0           28            0    not started
     rhs-client10.lab.eng.blr.redhat.com                0            0           25            0    not started
     rhs-client15.lab.eng.blr.redhat.com                5      3150839           30            4    not started
[root@rhs-client45 ~]# gluster volume info
 
Volume Name: virtVOL
Type: Distributed-Replicate
Volume ID: 689aa65d-b49a-42f8-a20f-6bac6e116d6b
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/VM_brick1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/VM_brick1
Brick3: rhs-client15.lab.eng.blr.redhat.com:/VM_brick1
Brick4: rhs-client10.lab.eng.blr.redhat.com:/VM_brick1
Brick5: rhs-client45.lab.eng.blr.redhat.com:/VM_brick2
Brick6: rhs-client37.lab.eng.blr.redhat.com:/VM_brick2
Brick7: rhs-client15.lab.eng.blr.redhat.com:/VM_brick2
Brick8: rhs-client10.lab.eng.blr.redhat.com:/VM_brick2
Brick9: rhs-client45.lab.eng.blr.redhat.com:/VM_brick3
Brick10: rhs-client37.lab.eng.blr.redhat.com:/VM_brick3
Brick11: rhs-client15.lab.eng.blr.redhat.com:/VM_brick3
Brick12: rhs-client10.lab.eng.blr.redhat.com:/VM_brick3
Brick13: rhs-client15.lab.eng.blr.redhat.com:/VM_brick4
Brick14: rhs-client10.lab.eng.blr.redhat.com:/VM_brick4
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
[root@rhs-client45 ~]# gluster volume status
Status of volume: virtVOL
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick rhs-client45.lab.eng.blr.redhat.com:/VM_brick1	24009	Y	6694
Brick rhs-client37.lab.eng.blr.redhat.com:/VM_brick1	24009	Y	6650
Brick rhs-client15.lab.eng.blr.redhat.com:/VM_brick1	24009	Y	6658
Brick rhs-client10.lab.eng.blr.redhat.com:/VM_brick1	24009	Y	6666
Brick rhs-client45.lab.eng.blr.redhat.com:/VM_brick2	24010	Y	6700
Brick rhs-client37.lab.eng.blr.redhat.com:/VM_brick2	24010	Y	6655
Brick rhs-client15.lab.eng.blr.redhat.com:/VM_brick2	24010	Y	6664
Brick rhs-client10.lab.eng.blr.redhat.com:/VM_brick2	24010	Y	6671
Brick rhs-client45.lab.eng.blr.redhat.com:/VM_brick3	24011	Y	6705
Brick rhs-client37.lab.eng.blr.redhat.com:/VM_brick3	24011	Y	6661
Brick rhs-client15.lab.eng.blr.redhat.com:/VM_brick3	24011	Y	6669
Brick rhs-client10.lab.eng.blr.redhat.com:/VM_brick3	24011	Y	6677
Brick rhs-client15.lab.eng.blr.redhat.com:/VM_brick4	24012	Y	7276
Brick rhs-client10.lab.eng.blr.redhat.com:/VM_brick4	24012	Y	7284
NFS Server on localhost					38467	Y	8433
Self-heal Daemon on localhost				N/A	Y	8439
NFS Server on rhs-client37.lab.eng.blr.redhat.com	38467	Y	8271
Self-heal Daemon on rhs-client37.lab.eng.blr.redhat.com	N/A	Y	8277
NFS Server on rhs-client15.lab.eng.blr.redhat.com	38467	Y	8349
Self-heal Daemon on rhs-client15.lab.eng.blr.redhat.com	N/A	Y	8355
NFS Server on rhs-client10.lab.eng.blr.redhat.com	38467	Y	8340
Self-heal Daemon on rhs-client10.lab.eng.blr.redhat.com	N/A	Y	8346
 
[root@rhs-client45 ~]# 

-----------------------------------------------------------

Suddenly the Data Center and the Storage Domain went down with "Invalid status" messages, and one of the hypervisors was brought to a non-responsive state. The VMs, however, remained accessible. RHEV-M events log (newest first):

-----------------------------------------------------------
	
2013-Mar-26, 23:49 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:49 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:48 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:48 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:47 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:47 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:46 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:46 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:45 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:45 - Failed to connect Host RHEV-H-6.4-rhs-gp-srv12 to Storage Pool SpaceMan
2013-Mar-26, 23:45 - Host RHEV-H-6.4-rhs-gp-srv12 cannot access one of the Storage Domains attached to the Data Center SpaceMan. Setting Host state to Non-Operational.
2013-Mar-26, 23:45 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:45 - Detected new Host RHEV-H-6.4-rhs-gp-srv12. Host state was set to Up.
2013-Mar-26, 23:43 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:43 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:42 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:42 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:41 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:41 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:40 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:40 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:40 - Failed to connect Host RHEV-H-6.4-rhs-gp-srv12 to Storage Pool SpaceMan
2013-Mar-26, 23:40 - Detected new Host RHEV-H-6.4-rhs-gp-srv12. Host state was set to Up.
2013-Mar-26, 23:40 - Host RHEV-H-6.4-rhs-gp-srv12 was autorecovered.
2013-Mar-26, 23:39 - VM yabadaba03 is down. Exit message: Migration succeeded
2013-Mar-26, 23:39 - VM writebackVM is down. Exit message: Migration succeeded
2013-Mar-26, 23:39 - Migration complete (VM: yabadaba03, Source Host: RHEV-H-6.4-rhs-gp-srv12).
2013-Mar-26, 23:39 - Migration complete (VM: writebackVM, Source Host: RHEV-H-6.4-rhs-gp-srv12).
2013-Mar-26, 23:39 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:39 - Failed to connect Host RHEV-H-6.4-rhs-gp-srv12 to Storage Pool SpaceMan
2013-Mar-26, 23:39 - Power Management test failed for Host RHEV-H-6.4-rhs-gp-srv12. There is no other host in the data center that can be used to test the power management settings.
2013-Mar-26, 23:39 - Host RHEV-H-6.4-rhs-gp-srv12 cannot access one of the Storage Domains attached to the Data Center SpaceMan. Setting Host state to Non-Operational.
2013-Mar-26, 23:39 - Invalid status on Data Center SpaceMan. Setting status to Non-Responsive.
2013-Mar-26, 23:39 - Detected new Host RHEV-H-6.4-rhs-gp-srv12. Host state was set to Up.
2013-Mar-26, 23:39 - Host RHEV-H-6.4-rhs-gp-srv12 is initializing. Message: Recovering from crash or Initializing
2013-Mar-26, 23:39 - Invalid status on Data Center SpaceMan. Setting Data Center status to Non-Responsive (On host RHEV-H-6.4-rhs-gp-srv12, Error: Error marking master storage domain).
-----------------------------------------------------------

All attempts to revive the Storage Domain and Data Center were futile. I then shut down all the VMs and rebooted the hypervisors, but that did not bring any relief either; the VMs could no longer be booted, since the master Storage Domain was down.
After about 40 minutes, the Storage Domain and the Data Center recovered on their own, and I brought the VMs back online.
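
For reference, below is a minimal sketch of the kind of check loop that could have been used to watch for the recovery from a hypervisor. The mount point and the vdsm log path are assumptions (RHEV defaults), not values taken from this setup; the storage-domain UUID is the one that appears in the glusterfs client logs later in this report.

-----------------------------------------------------------

# Hypothetical watch loop (bash), run on the hypervisor.
# Assumed paths: RHEV-style gluster mount under /rhev/data-center/mnt/ and the
# default vdsm log at /var/log/vdsm/vdsm.log.
MNT='/rhev/data-center/mnt/rhs-client45.lab.eng.blr.redhat.com:_virtVOL'   # assumed mount point
SD_UUID='79f2acbd-6f1c-4976-a8e6-c82a0073b6bb'                             # storage-domain UUID from the client logs

until stat "$MNT/$SD_UUID/dom_md/ids" >/dev/null 2>&1; do
    echo "$(date): storage domain still not accessible from this host"
    sleep 60
done
echo "$(date): storage domain is readable again"

# Correlate with vdsm-side errors for the same domain
grep "$SD_UUID" /var/log/vdsm/vdsm.log | tail -n 20

-----------------------------------------------------------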

To try to reproduce the issue, I went ahead and removed another replica pair. This time everything went smoothly, with no issues at all.

-----------------------------------------------------------

[root@rhs-client45 ~]# gluster volume info
 
Volume Name: virtVOL
Type: Distributed-Replicate
Volume ID: 689aa65d-b49a-42f8-a20f-6bac6e116d6b
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/VM_brick1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/VM_brick1
Brick3: rhs-client15.lab.eng.blr.redhat.com:/VM_brick1
Brick4: rhs-client10.lab.eng.blr.redhat.com:/VM_brick1
Brick5: rhs-client45.lab.eng.blr.redhat.com:/VM_brick2
Brick6: rhs-client37.lab.eng.blr.redhat.com:/VM_brick2
Brick7: rhs-client15.lab.eng.blr.redhat.com:/VM_brick2
Brick8: rhs-client10.lab.eng.blr.redhat.com:/VM_brick2
Brick9: rhs-client45.lab.eng.blr.redhat.com:/VM_brick3
Brick10: rhs-client37.lab.eng.blr.redhat.com:/VM_brick3
Brick11: rhs-client15.lab.eng.blr.redhat.com:/VM_brick3
Brick12: rhs-client10.lab.eng.blr.redhat.com:/VM_brick3
Brick13: rhs-client15.lab.eng.blr.redhat.com:/VM_brick4
Brick14: rhs-client10.lab.eng.blr.redhat.com:/VM_brick4
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4 start 
Remove Brick start successful

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4 status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                9  10739796172           31            0    not started
     rhs-client10.lab.eng.blr.redhat.com                0            0           28            0      completed
     rhs-client37.lab.eng.blr.redhat.com                0            0           28            0    not started
     rhs-client15.lab.eng.blr.redhat.com                4      3146298            8            0    in progress

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4 status
                                    Node Rebalanced-files          size       scanned      failures         status
                               ---------      -----------   -----------   -----------   -----------   ------------
                               localhost                9  10739796172           31            0    not started
     rhs-client15.lab.eng.blr.redhat.com               16  21499751897           28            0      completed
     rhs-client37.lab.eng.blr.redhat.com                0            0           28            0    not started
     rhs-client10.lab.eng.blr.redhat.com                0            0           28            0      completed

[root@rhs-client45 ~]# gluster volume remove-brick virtVOL rhs-client15.lab.eng.blr.redhat.com:/VM_brick4 rhs-client10.lab.eng.blr.redhat.com:/VM_brick4 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick commit successful

[root@rhs-client45 ~]# gluster volume info
 
Volume Name: virtVOL
Type: Distributed-Replicate
Volume ID: 689aa65d-b49a-42f8-a20f-6bac6e116d6b
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: rhs-client45.lab.eng.blr.redhat.com:/VM_brick1
Brick2: rhs-client37.lab.eng.blr.redhat.com:/VM_brick1
Brick3: rhs-client15.lab.eng.blr.redhat.com:/VM_brick1
Brick4: rhs-client10.lab.eng.blr.redhat.com:/VM_brick1
Brick5: rhs-client45.lab.eng.blr.redhat.com:/VM_brick2
Brick6: rhs-client37.lab.eng.blr.redhat.com:/VM_brick2
Brick7: rhs-client15.lab.eng.blr.redhat.com:/VM_brick2
Brick8: rhs-client10.lab.eng.blr.redhat.com:/VM_brick2
Brick9: rhs-client45.lab.eng.blr.redhat.com:/VM_brick3
Brick10: rhs-client37.lab.eng.blr.redhat.com:/VM_brick3
Brick11: rhs-client15.lab.eng.blr.redhat.com:/VM_brick3
Brick12: rhs-client10.lab.eng.blr.redhat.com:/VM_brick3
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: on
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
[root@rhs-client45 ~]# 
 
-----------------------------------------------------------

Version-Release number of selected component (if applicable):

RHEV-M:
3.1.0-50.el6ev

Hypervisors:
RHEV-H 6.4
RHEL 6.4
RHEL 6.3

RHS:
RHS-2.0-20130320.2-RHS-x86_64-DVD1.iso
glusterfs-server-3.3.0.7rhs-1.el6rhs.x86_64

How reproducible:
Occurred once
Not sure if reproducible
  
Actual results:

The remove-brick operation led to the Data Center and the Storage Domain being brought down for about 40 minutes.

Expected results:

The remove-brick operation must not impact the availability of the Data Center and Storage Domain.

Additional info:

Comment 9 shishir gowda 2013-04-23 10:28:51 UTC
Looks like operations are failing with "Permission denied" errors, which is what is leading to the failures seen here.

[2013-03-27 00:21:11.045151] W [fuse-bridge.c:725:fuse_fd_cbk] 0-glusterfs-fuse: 2401: OPEN() /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md/ids => -1 (Invalid argument)
[2013-03-27 00:21:17.454194] I [afr-self-heal-entry.c:2309:afr_sh_entry_fix] 0-virtVOL-replicate-2: /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md: Performing conservative merge
[2013-03-27 00:21:17.454300] I [afr-self-heal-entry.c:2309:afr_sh_entry_fix] 0-virtVOL-replicate-0: /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md: Performing conservative merge
[2013-03-27 00:21:17.466916] I [dht-common.c:997:dht_lookup_everywhere_cbk] 0-virtVOL-dht: deleting stale linkfile /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md/ids on virtVOL-replicate-2
[2013-03-27 00:21:17.467590] W [client3_1-fops.c:651:client3_1_unlink_cbk] 0-virtVOL-client-4: remote operation failed: Permission denied
[2013-03-27 00:21:17.467630] W [client3_1-fops.c:651:client3_1_unlink_cbk] 0-virtVOL-client-5: remote operation failed: Permission denied
[2013-03-27 00:21:17.468303] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-virtVOL-client-0: remote operation failed: Permission denied. Path: /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md/ids (e5572401-ce56-4c82-a4c1-5f54f6948f44)
[2013-03-27 00:21:17.468345] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-virtVOL-client-1: remote operation failed: Permission denied. Path: /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md/ids (e5572401-ce56-4c82-a4c1-5f54f6948f44)
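
As an illustration of how this "Permission denied" pattern could be checked, the sketch below compares the ownership and mode of the affected dom_md path on every brick against the volume's storage.owner-uid/gid setting (36). It is a minimal sketch: the brick-side path is inferred from the client log above and the brick roots in the volume info, and the host and brick lists are copied from this report.

# Sketch (bash): verify that the storage-domain directory and its 'ids' file are
# owned by uid/gid 36 on each brick, matching storage.owner-uid/gid on the volume.
DOMDIR='79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md'
for host in rhs-client45 rhs-client37 rhs-client15 rhs-client10; do
    for brick in /VM_brick1 /VM_brick2 /VM_brick3 /VM_brick4; do
        echo "== ${host}:${brick}"
        ssh "${host}.lab.eng.blr.redhat.com" \
            "stat -c '%u:%g %a %n' ${brick}/${DOMDIR} ${brick}/${DOMDIR}/ids 2>/dev/null"
    done
done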

Comment 10 shishir gowda 2013-05-03 07:25:45 UTC
The problem lies in dht_discover_cbk returning EINVAL when it gets ENOENT errors from newly added bricks; it should instead trigger a self-heal on those bricks.

Looks like a duplicate of bug 924572

[2013-03-26 18:15:03.372833] I [dht-layout.c:611:dht_layout_normalize] 1-virtVOL-dht: found anomalies in /79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md. holes=1 overlaps=1
[2013-03-26 18:15:03.377632] I [dht-layout.c:611:dht_layout_normalize] 1-virtVOL-dht: found anomalies in <gfid:4099f439-fc15-4379-bf25-8c15c401952d>. holes=0 overlaps=1
[2013-03-26 18:15:03.377691] W [fuse-resolve.c:152:fuse_resolve_gfid_cbk] 0-fuse: 4099f439-fc15-4379-bf25-8c15c401952d: failed to resolve (Invalid argument)
[2013-03-26 18:15:03.377707] E [fuse-bridge.c:555:fuse_getattr_resume] 0-glusterfs-fuse: 689633: GETATTR 139816465486164 (4099f439-fc15-4379-bf25-8c15c401952d) resolution failed
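
For reference, the layout anomalies reported above can also be inspected directly on the bricks by reading the DHT layout extended attribute of the affected directory. A minimal sketch for one brick is shown below (repeat per brick; the directory is the one named in the log above).

# Sketch: dump the DHT layout xattr of the directory flagged with holes/overlaps.
# Gaps or overlaps in the hash ranges encoded in the value correspond to the
# 'holes'/'overlaps' counts logged by dht_layout_normalize.
ssh rhs-client45.lab.eng.blr.redhat.com \
    getfattr -n trusted.glusterfs.dht -e hex \
    /VM_brick1/79f2acbd-6f1c-4976-a8e6-c82a0073b6bb/dom_md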

Comment 11 shishir gowda 2013-05-09 09:52:15 UTC
Marking this as a duplicate of bug 924572, as the root cause is dht_discover_complete returning EINVAL when layout anomalies are found.

*** This bug has been marked as a duplicate of bug 924572 ***

