Bug 883716 - [RHEV-RHS] VM's moved to paused state due to storage permissions problem
Summary: [RHEV-RHS] VM's moved to paused state due to storage permissions problem
Status: CLOSED DUPLICATE of bug 859589
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Vijay Bellur
QA Contact: spandura
Depends On:
Reported: 2012-12-05 06:55 UTC by spandura
Modified: 2013-02-26 08:53 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2013-02-26 08:53:38 UTC
Target Upstream Version:


Description spandura 2012-12-05 06:55:35 UTC
Description of problem:
When 2 storage nodes in a distribute-replicate volume were brought down, 2 VMs in a storage domain moved to the paused state for the following reasons:

1. Storage Permissions Problem

2. Storage I/O Problem

Version-Release number of selected component (if applicable):
[12/05/12 - 12:12:18 root@rhs-gp-srv12 ~]# rpm -qa | grep gluster

[12/05/12 - 12:12:23 root@rhs-gp-srv12 ~]# glusterfs --version
glusterfs 3.3.0rhsvirt1 built on Nov  7 2012 10:11:13

How reproducible:

Steps to Reproduce:
1. Create a 2x2 distribute-replicate volume for storing VMs.

2. Set the group to "virt", storage.owner-uid and storage.owner-gid to 36 for the volume.

3. Start the volume. 

4. Create Storage Domain using the volume and a Host from RHEVM. 

5. Create 5 VMs from RHEVM.

6. Add 2 bricks to the volume to change it to a 3x2 distribute-replicate volume.

7. Bring down storage-node1 and storage-node3 

8. Perform the following while the storage nodes are offline:
    Create snapshots of all 5 VMs.
    rhn_register all the VMs.
    Run yum update.
    Create new snapshots of all the VMs.
    Create 2 new VMs.

9. Bring back storage-node1 and storage-node3 online. 

10. Initiate self-heal and start rebalance. While self-heal is in progress, run load on all the VMs.

11. Rebalance and self-heal of the files complete successfully.

12. Bring down storage-node2 and storage-node4.
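
For reference, steps 1-3, 6 and 10 roughly correspond to the gluster CLI calls built below. This is only a sketch, not the reporter's actual command history: the brick order is inferred from the `gluster volume info` output under Additional info, and `build_setup_commands` is a hypothetical helper that assembles the commands without running them.

```python
import shlex

VOLUME = "vol-rhev-dis-rep"

# Brick order is an assumption, inferred from the volume info in this report.
INITIAL_BRICKS = [
    "rhs-client1:/disk0",
    "rhs-client16:/disk0",
    "rhs-client17:/disk0",
    "rhs-client18:/disk0",
]
ADDED_BRICKS = ["rhs-client1:/disk1", "rhs-client16:/disk1"]


def build_setup_commands(volume, bricks, new_bricks):
    """Return argv lists for steps 1-3, 6 and 10 (nothing is executed)."""
    return [
        # Step 1: 2x2 distribute-replicate volume (4 bricks, replica 2).
        ["gluster", "volume", "create", volume, "replica", "2", *bricks],
        # Step 2: apply the "virt" group and set ownership to 36:36.
        ["gluster", "volume", "set", volume, "group", "virt"],
        ["gluster", "volume", "set", volume, "storage.owner-uid", "36"],
        ["gluster", "volume", "set", volume, "storage.owner-gid", "36"],
        # Step 3: start the volume.
        ["gluster", "volume", "start", volume],
        # Step 6: add one more replica pair, giving 3x2.
        ["gluster", "volume", "add-brick", volume, *new_bricks],
        # Step 10: trigger self-heal and start rebalance.
        ["gluster", "volume", "heal", volume],
        ["gluster", "volume", "rebalance", volume, "start"],
    ]


if __name__ == "__main__":
    for argv in build_setup_commands(VOLUME, INITIAL_BRICKS, ADDED_BRICKS):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real cluster each line would be run on one of the storage nodes.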
Actual results:
As soon as storage-node2 and storage-node4 were brought offline, a few VMs moved to the paused state.

Expected results:
The VMs should continue running.
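
The expectation can be illustrated with a small sketch. It assumes that storage-node1 through storage-node4 correspond to rhs-client1, rhs-client16, rhs-client17 and rhs-client18 (the report does not state this mapping), and relies on gluster grouping consecutive bricks into replica sets according to the replica count. The helper names are hypothetical.

```python
# Brick list as shown by "gluster volume info" in this report.
BRICKS = [
    "rhs-client1:/disk0",
    "rhs-client16:/disk0",
    "rhs-client17:/disk0",
    "rhs-client18:/disk0",
    "rhs-client1:/disk1",
    "rhs-client16:/disk1",
]

# Assumption: storage-node2 and storage-node4 are these two hosts.
OFFLINE_HOSTS = {"rhs-client16", "rhs-client18"}


def replica_sets(bricks, replica_count):
    """Group consecutive bricks into replica sets of the given size."""
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]


def surviving_bricks(bricks, replica_count, offline_hosts):
    """For each replica set, list the bricks whose host is still online."""
    survivors = []
    for group in replica_sets(bricks, replica_count):
        survivors.append(
            [b for b in group if b.split(":", 1)[0] not in offline_hosts]
        )
    return survivors


if __name__ == "__main__":
    for index, alive in enumerate(surviving_bricks(BRICKS, 2, OFFLINE_HOSTS)):
        print(f"replicate-{index}: online bricks = {alive}")
```

Under that assumed mapping every replica set keeps one brick online, which is why the VM images would be expected to stay accessible.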

Additional info:

[12/05/12 - 11:09:36 root@rhs-client1 ~]# gluster volume info
Volume Name: vol-rhev-dis-rep
Type: Distributed-Replicate
Volume ID: fd2508b8-c160-410d-95df-a76fcd6609d8
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Brick1: rhs-client1:/disk0
Brick2: rhs-client16:/disk0
Brick3: rhs-client17:/disk0
Brick4: rhs-client18:/disk0
Brick5: rhs-client1:/disk1
Brick6: rhs-client16:/disk1
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
storage.linux-aio: enable
cluster.eager-lock: enable
storage.owner-uid: 36
storage.owner-gid: 36
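
When comparing several nodes or runs, the output above can be checked mechanically. The following is a minimal sketch that parses the plain-text `gluster volume info` format shown here and confirms the ownership options; `parse_volume_info` and `ownership_options_ok` are hypothetical helpers, and the parser only handles the fields that appear in this report.

```python
SAMPLE = """\
Volume Name: vol-rhev-dis-rep
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Brick1: rhs-client1:/disk0
Brick2: rhs-client16:/disk0
Brick3: rhs-client17:/disk0
Brick4: rhs-client18:/disk0
Brick5: rhs-client1:/disk1
Brick6: rhs-client16:/disk1
Options Reconfigured:
cluster.eager-lock: enable
storage.owner-uid: 36
storage.owner-gid: 36
"""


def parse_volume_info(text):
    """Parse 'gluster volume info' text into fields, bricks and options."""
    info = {"bricks": [], "options": {}}
    in_options = False
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only; brick values contain a colon too.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Options Reconfigured":
            in_options = True
        elif in_options:
            info["options"][key] = value
        elif key.startswith("Brick") and key[5:].isdigit():
            info["bricks"].append(value)
        else:
            info[key] = value
    return info


def ownership_options_ok(info, uid="36", gid="36"):
    """True if storage.owner-uid and storage.owner-gid match the given ids."""
    options = info["options"]
    return (options.get("storage.owner-uid") == uid
            and options.get("storage.owner-gid") == gid)


if __name__ == "__main__":
    parsed = parse_volume_info(SAMPLE)
    print(parsed["Type"], len(parsed["bricks"]), ownership_options_ok(parsed))
```

Note that this only verifies the volume options; it says nothing about the ownership actually applied to the brick directories on disk.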

Comment 2 Amar Tumballi 2012-12-05 07:41:53 UTC
Shishir, can you have a look at this?

Shwetha, can we get sosreports from the storage nodes and the host (RHEVH) node?

Comment 4 shishir gowda 2012-12-05 09:34:28 UTC
DHT has not logged any failures, only layout-mismatch info logs.
AFR has logged a few error messages (in the host mount logs), as below:

[2012-12-04 17:13:28.198946] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 1-vol-rhev-dis-rep-replicate-1: background  data missing-entry gfid self-heal failed on /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/9a85342e-a427-4054-be67-fd7c9c472630/867a1e13-5ea3-4eb2-90dc-ade876a54bb1
[2012-12-04 17:13:28.199026] I [afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 1-vol-rhev-dis-rep-replicate-1: Non blocking entrylks failed.
[2012-12-04 17:13:28.199048] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 1-vol-rhev-dis-rep-replicate-1: background  data missing-entry gfid self-heal failed on /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/6258de39-c491-463b-a5bd-6d3cb7b14bd7/4643eb79-00ee-454b-bfc8-d2da498bdeab
[2012-12-04 17:13:28.237749] I [afr-inode-write.c:428:afr_open_fd_fix] 1-vol-rhev-dis-rep-replicate-1: Opening fd 0x21ab198
[2012-12-04 17:13:28.238110] W [client3_1-fops.c:473:client3_1_open_cbk] 1-vol-rhev-dis-rep-client-2: remote operation failed: No such file or directory. Path: /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/df1edfe6-4bca-445f-9d2b-42cefdb40895/2b3dbe90-e7f9-4d76-a3a9-7515bc943939 (6624f211-33ee-4355-bf62-6408d0f99ae3)
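
To pull the error-level entries out of a larger mount log, a small filter along these lines may help. It is a sketch that assumes every line follows the layout seen above (timestamp, one-letter level, source file:line:function, translator name, message); `parse_log` and `count_by_level` are hypothetical helpers, and lines that do not match are skipped.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): (?P<msg>.*)$"
)

# Three lines copied from the excerpt above.
SAMPLE = """\
[2012-12-04 17:13:28.198946] E [afr-self-heal-common.c:2160:afr_self_heal_completion_cbk] 1-vol-rhev-dis-rep-replicate-1: background  data missing-entry gfid self-heal failed on /0fda8a26-4329-4aab-a369-9e3e49b51f7e/images/9a85342e-a427-4054-be67-fd7c9c472630/867a1e13-5ea3-4eb2-90dc-ade876a54bb1
[2012-12-04 17:13:28.199026] I [afr-self-heal-common.c:1941:afr_sh_post_nb_entrylk_conflicting_sh_cbk] 1-vol-rhev-dis-rep-replicate-1: Non blocking entrylks failed.
[2012-12-04 17:13:28.237749] I [afr-inode-write.c:428:afr_open_fd_fix] 1-vol-rhev-dis-rep-replicate-1: Opening fd 0x21ab198
"""


def parse_log(text):
    """Return one dict per log line that matches the expected layout."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
    return records


def count_by_level(records):
    """Count records per severity letter (E, W, I, ...)."""
    return dict(Counter(r["level"] for r in records))


if __name__ == "__main__":
    parsed = parse_log(SAMPLE)
    print(count_by_level(parsed))
    for record in parsed:
        if record["level"] in ("E", "W"):
            print(record["ts"], record["xlator"], record["func"])
```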

Comment 5 shishir gowda 2012-12-05 09:37:47 UTC
Were the permissions of the 2 newly added bricks set to 36:36 before add-brick?

Comment 6 spandura 2012-12-05 11:35:37 UTC

The volume group is set to "virt". Hence, when the bricks were added, the uid and gid were automatically set to 36 for the newly added bricks.
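
One way to confirm this on a storage node is to stat the brick directories directly. The sketch below assumes the brick paths `/disk0` and `/disk1` from the volume info and would need to be run on each storage node; `ownership_mismatches` is a hypothetical helper, not part of gluster.

```python
import os

# Brick directories taken from the volume info; adjust per node.
BRICK_PATHS = ["/disk0", "/disk1"]


def ownership_mismatches(paths, uid=36, gid=36):
    """Return {path: (uid, gid)} for paths not owned by the expected ids."""
    mismatches = {}
    for path in paths:
        st = os.stat(path)
        if (st.st_uid, st.st_gid) != (uid, gid):
            mismatches[path] = (st.st_uid, st.st_gid)
    return mismatches


if __name__ == "__main__":
    # Only check paths that exist on this machine.
    present = [p for p in BRICK_PATHS if os.path.exists(p)]
    for path, owner in ownership_mismatches(present).items():
        print(f"{path}: owned by {owner[0]}:{owner[1]}, expected 36:36")
```

An empty result means every checked brick directory is owned by 36:36; files inside the bricks are not inspected.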

Comment 7 Vijay Bellur 2012-12-12 08:08:32 UTC
Can we please check whether this is related to SELinux being in enforcing mode on the hypervisor?

Comment 8 Gowrishankar Rajaiyan 2013-01-02 11:46:23 UTC
(In reply to comment #7)
> Can we please check if this is related to selinux being enforcing on the
> hypervisor?

Audit logs are available in sosreport.

Comment 9 Pranith Kumar K 2013-01-08 11:10:40 UTC
Assigning the bug to Shishir for now so that he can take a look at this.

Comment 10 shishir gowda 2013-01-24 05:11:15 UTC
Other than the self-heal error messages, the sosreports do not contain any messages that would help us proceed. I suspected this to be a duplicate of bug 883853, but the logs do not point to that.

Comment 11 Scott Haines 2013-01-31 04:10:33 UTC
Per 01/31 tiger team meeting, marking as release blocker.

Comment 12 Vijay Bellur 2013-02-26 08:53:38 UTC

*** This bug has been marked as a duplicate of bug 859589 ***
