Description of problem:
When 2 storage nodes in a distribute-replicate volume are brought down, 2 VMs in a storage domain move to the paused state for the following reasons:
1. Storage Permissions Problem
2. Storage I/O Problem
Version-Release number of selected component (if applicable):
[12/05/12 - 12:12:18 root@rhs-gp-srv12 ~]# rpm -qa | grep gluster
[12/05/12 - 12:12:23 root@rhs-gp-srv12 ~]# glusterfs --version
glusterfs 3.3.0rhsvirt1 built on Nov 7 2012 10:11:13
Steps to Reproduce:
1. Create a distribute-replicate volume for storing VMs (2x2).
2. Set the volume's group to "virt", and set storage.owner-uid and storage.owner-gid to 36 for the volume (a rough command sketch follows these steps).
3. Start the volume.
4. Create a Storage Domain using the volume and a host from RHEVM.
5. Create 5 VMs from RHEVM.
6. Add 2 bricks to the volume to change it to a 3x2 distribute-replicate volume.
7. Bring down storage-node1 and storage-node3.
8. Perform the following when the storage nodes are offline:
Create snapshots of all 5 VMs,
rhn_register all the VMs,
create new snapshots of all the VMs,
create 2 new VMs.
9. Bring storage-node1 and storage-node3 back online.
10. Initiate self-heal and start rebalance. While self-heal is in progress, run load on all the VMs.
11. Rebalance and self-heal of the files complete successfully.
12. Bring down storage-node2 and storage-node4.
Actual results:
As soon as storage-node2 and storage-node4 are brought offline, a few VMs move to the paused state.
Expected results:
The VMs should keep running.
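For reference, a rough sketch of the gluster CLI behind steps 1-3, 6, and 10 above; the node names and brick paths below are placeholders rather than values from this report, and only the volume name matches the volume info output that follows:

# step 1: create the 2x2 distribute-replicate volume (placeholder nodes/bricks)
gluster volume create vol-rhev-dis-rep replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 node4:/bricks/b1
# step 2: apply the "virt" option group and the 36:36 ownership expected by VDSM
gluster volume set vol-rhev-dis-rep group virt
gluster volume set vol-rhev-dis-rep storage.owner-uid 36
gluster volume set vol-rhev-dis-rep storage.owner-gid 36
# step 3: start the volume
gluster volume start vol-rhev-dis-rep
# step 6: grow the volume to 3x2 with one more replica pair
gluster volume add-brick vol-rhev-dis-rep node5:/bricks/b1 node6:/bricks/b1
# step 10: trigger self-heal and rebalance after the nodes come back
gluster volume heal vol-rhev-dis-rep full
gluster volume rebalance vol-rhev-dis-rep start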
[12/05/12 - 11:09:36 root@rhs-client1 ~]# gluster volume info
Volume Name: vol-rhev-dis-rep
Volume ID: fd2508b8-c160-410d-95df-a76fcd6609d8
Number of Bricks: 3 x 2 = 6
Shishir, can you have a look at this?
Shwetha, can we get sosreports from the storage nodes and the host (RHEVH) node?
DHT has not logged any failures, only layout-mismatch info messages.
AFR has a few error messages logged (host mount logs), as below:
[2012-12-04 17:13:28.198946] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 1-vol-rhev-dis-rep-replicate-1: background data missing-entry gfid self-heal failed on /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/9a85342e-a427-4054-be67-fd7c9c472630/867a1e13-5ea3-4eb2-90dc-ade876a54bb1
[2012-12-04 17:13:28.199026] I [afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 1-vol-rhev-dis-rep-replicate-1: Non blocking entrylks failed.
[2012-12-04 17:13:28.199048] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 1-vol-rhev-dis-rep-replicate-1: background data missing-entry gfid self-heal failed on /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/6258de39-c491-463b-a5bd-6d3cb7b14bd7/4643eb79-00ee-454b-bfc8-d2da498bdeab
[2012-12-04 17:13:28.237749] I [afr-inode-write.c:428:afr_open_fd_fix] 1-vol-rhev-dis-rep-replicate-1: Opening fd 0x21ab198
[2012-12-04 17:13:28.238110] W [client3_1-fops.c:473:client3_1_open_cbk] 1-vol-rhev-dis-rep-client-2: remote operation failed: No such file or directory. Path: /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/df1edfe6-4bca-445f-9d2b-42cefdb40895/2b3dbe90-e7f9-4d76-a3a9-7515bc943939 (6624f211-33ee-4355-bf62-6408d0f99ae3)
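As an aside, a rough sketch of how such failures could be enumerated from the CLI, assuming the glusterfs 3.3 self-heal commands (these commands are illustrative and not taken from the sosreport):

# files/gfids still pending self-heal
gluster volume heal vol-rhev-dis-rep info
# entries on which self-heal reported a failure
gluster volume heal vol-rhev-dis-rep info heal-failed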
Were the permissions of the 2 newly added bricks set to 36:36 before the add-brick?
The volume group is set to "virt". Hence, when the bricks were added, the uid and gid were automatically set to 36 for the newly added bricks.
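A quick sketch of how this could be verified on the storage nodes (the brick path below is a placeholder, not taken from this report):

# the uid/gid options recorded against the volume
gluster volume info vol-rhev-dis-rep | grep storage.owner
# ownership of the brick root itself, expected 36:36 (vdsm:kvm)
stat -c '%u:%g %n' /bricks/b1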
Can we please check if this is related to SELinux being enforcing on the hypervisor?
(In reply to comment #7)
> Can we please check if this is related to SELinux being enforcing on the
> hypervisor?
Audit logs are available in the sosreport.
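For reference, a minimal way to check this from the hypervisor shell, assuming standard RHEL audit tooling (commands are illustrative, not taken from the sosreport):

# current SELinux mode on the hypervisor
getenforce
# recent AVC denials, if any
ausearch -m avc -ts recent
# fallback if ausearch is unavailable
grep AVC /var/log/audit/audit.log | tail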
Assigning the bug to Shishir for now, for him to take a look at this.
Other than the self-heal error messages, the sosreports do not have any other messages that would help us proceed. Suspected it to be a duplicate of bug 883853, but the logs do not point to that.
Per the 01/31 tiger team meeting, marking as a release blocker.
*** This bug has been marked as a duplicate of bug 859589 ***