Bug 912383

Summary: [QEMU/KVM-RHS] gluster self heal daemon is not operational after few operations on the volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterfs
Version: 2.0
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Reporter: SATHEESARAN <sasundar>
Assignee: vsomyaju
QA Contact: SATHEESARAN <sasundar>
CC: amarts, grajaiya, nsathyan, rhinduja, rhs-bugs, sasundar, shaines, spandura, vbellur
Doc Type: Bug Fix
Environment: virt qemu integration
Type: Bug
Last Closed: 2013-07-18 06:18:53 UTC

Attachments:
sosreport on which problem occurred

Description SATHEESARAN 2013-02-18 15:04:40 UTC
Created attachment 698915
sosreport on which problem occurred

Description of problem:
The gluster self-heal daemon is not operational after a few gluster operations such as add-brick and self-heal.
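One way to confirm whether the daemon is up (a sketch, assuming the glusterfs-3.3 CLI; <vol-name> is a placeholder):
   # volume status lists the Self-heal Daemon entries and whether they are online
   gluster volume status <vol-name>
   # the glustershd process can also be checked directly
   ps aux | grep glustershd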

The following error messages also appear in /var/log/glusterfs/glustershd.log:

[2013-02-18 16:21:58.853780] E [options.c:166:xlator_option_validate_bool] 0-rep-qcow2-replicate-0: option eager-lock ^A: '^A' is not a valid boolean value
[2013-02-18 16:21:58.853806] W [options.c:771:xl_opt_validate] 0-rep-qcow2-replicate-0: validate of eager-lock returned -1
[2013-02-18 16:21:58.853836] E [graph.c:272:glusterfs_graph_validate_options] 0-rep-qcow2-replicate-0: validation failed: option eager-lock ^A: '^A' is not a valid boolean value
[2013-02-18 16:21:58.853861] E [graph.c:476:glusterfs_graph_activate] 0-graph: validate options failed
[2013-02-18 16:21:58.854198] W [glusterfsd.c:924:cleanup_and_exit] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x335100f0c5] (-->/usr/sbin/glusterfs(mgmt_getspec_cbk+0xe0) [0x40ca30] (-->/usr/sbin/glusterfs(glusterfs_process_volfp+0x198) [0x405a88]))) 0-: received signum (0), shutting down
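
The '^A' in the messages above suggests the eager-lock value reached the generated volfile corrupted. A possible recovery sketch, assuming the bad value can simply be overwritten (not verified on this setup):
   # rewrite the option with a valid boolean, regenerating the volfiles
   gluster volume set <vol-name> cluster.eager-lock enable
   # force-restart the volume processes, including the self-heal daemon
   gluster volume start <vol-name> force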


Version-Release number of selected component (if applicable):
RHS2.0+ - [ glusterfs-3.3.0.5rhs40 ]

How reproducible:
Once; I have not tried it again.

Steps to Reproduce:
1. Create a 6X2 distributed replicate volume
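   A minimal sketch of the create step (host names and brick paths are illustrative; 12 bricks in all for 6x2):
   gluster volume create <vol-name> replica 2 \
       host1:/bricks/b1 host2:/bricks/b1 ... host5:/bricks/b6 host6:/bricks/b6
   gluster volume start <vol-name>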

2. Fuse mount the volume
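   For example (mount point is illustrative):
   mount -t glusterfs <server>:/<vol-name> /mnt/vmstore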

3. Create a qcow2 image on the volume
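   For example (image path and size are illustrative):
   qemu-img create -f qcow2 /mnt/vmstore/appvm.qcow2 50G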

4. Create a VM [appvm] using the image created above as its backing disk.
   (i.e) virt-install --name vm1 --ram 4096 --vcpus 4 --location <iso-location> \
         --disk path=<path-to-image-file-on-fuse-mnt>,format=qcow2,bus=virtio --vnc

5. Create a snapshot of the appvm while saving one of the RHS nodes.
   <snapshot> : virsh snapshot-create-as --name snap1 --domain appvm
   <saving VM> : virsh save <domain-name> --file <file-name-to-save>
NOTE: After saving, the VM will be shut down.

6. Create a few files from inside the VM once the snapshot is created.

7. Add a pair of bricks to the volume.
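   For a replica 2 volume the bricks are added as a pair, e.g. (host names illustrative):
   gluster volume add-brick <vol-name> host1:/bricks/new-brick host2:/bricks/new-brick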

8. Start the VM that was shut down earlier after saving
   (i.e) virsh start <domain-name>

9. Revert the appvm to the snapshot
   (i.e) virsh snapshot-revert --domain appvm --snapshotname snap1

10. Start the rebalance operation as follows:
   gluster volume rebalance <vol-name> fix-layout start
   gluster volume rebalance <vol-name> start

11. Also initiate self-heal:
   gluster volume heal <vol-name>

12. Check the status of the rebalance using the command:
   gluster volume rebalance <vol-name> status

13. Check the status of healing using the command:
   gluster volume heal <vol-name> info

14. Remove the bricks that were added earlier [safe removal]:
   gluster volume remove-brick <vol-name> <brick1> <brick2> start

15. Monitor the remove-brick operation as it migrates data off the bricks; the status should move from IN PROGRESS to COMPLETED.
   gluster volume remove-brick <vol-name> <brick1> <brick2> status

16. Commit the remove-brick operation:
   gluster volume remove-brick <vol-name> <brick1> <brick2> commit

17. After a few days [4 in my case], try healing the volume again:
    gluster volume heal <vol-name>

  
Actual results:

The self-heal daemon was not operational, and cluster.eager-lock was set to a control character [^A].

Expected results:

Self-heal should happen successfully.

Additional info:
Test bed design:
1. Create 6 logical volumes, each on a 550GB hard disk
2. Format all the logical volumes with XFS
3. Mount all the volumes under /home/rhsvms/rhsvm{1..6}
4. On 3 of the logical volumes, create four 100G raw images and one 50G raw image
5. On the other 3 logical volumes, create four 100G qcow2 images and one 50G raw image
6. Create 3 VMs, each using the 4 raw images as additional disks [bricks] and the 50GB disk for installation of RHS
7. Create 3 VMs, each using the 4 qcow2 images as additional disks [bricks] and the 50GB disk for installation of RHS
8. Create the 6x2 distributed-replicate volume with the following brick pairs:
   [ raw-raw, raw-raw, qcow2-qcow2, qcow2-qcow2, raw-qcow2, raw-qcow2 ]
9. Create 3 separate 1x2 replica volumes, with raw-raw, raw-qcow2, and qcow2-qcow2 brick pairs
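
For reference, the brick filesystems in steps 1-3 would be prepared along these lines (device and mount names illustrative; -i size=512 is the usual recommendation for gluster bricks):
   mkfs.xfs -i size=512 /dev/<vg>/<lv>
   mkdir -p /home/rhsvms/rhsvm1
   mount /dev/<vg>/<lv> /home/rhsvms/rhsvm1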

Comment 1 SATHEESARAN 2013-02-18 15:15:20 UTC
Gluster volume info command output,

Volume Name: distrep-vmstore
Type: Distributed-Replicate
Volume ID: b0df694d-aaea-4a2e-8ffd-7773dda51177
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.105:/bricks/dist-repl-brick1
Brick2: 10.70.37.157:/bricks/dist-repl-brick1
Brick3: 10.70.37.105:/bricks/spare-brick1
Brick4: 10.70.37.157:/bricks/spare-brick1
Brick5: 10.70.37.162:/bricks/dist-repl-brick1
Brick6: 10.70.37.112:/bricks/dist-repl-brick1
Brick7: 10.70.37.162:/bricks/spare-brick1
Brick8: 10.70.37.112:/bricks/spare-brick1
Brick9: 10.70.37.150:/bricks/dist-repl-brick1
Brick10: 10.70.37.124:/bricks/dist-repl-brick1
Brick11: 10.70.37.150:/bricks/spare-brick1
Brick12: 10.70.37.124:/bricks/spare-brick1
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
storage.linux-aio: disable
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 36
storage.owner-gid: 36
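
Given the '^A' error from glustershd, the value of eager-lock as stored on disk can be inspected directly (a diagnostic sketch; path assumes glusterd's default working directory):
   grep -r eager-lock /var/lib/glusterd/vols/distrep-vmstore/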

Comment 2 Scott Haines 2013-03-08 21:05:59 UTC
Targeting for Arches.

Comment 3 Scott Haines 2013-04-11 17:02:33 UTC
Per 04-10-2013 Storage bug triage meeting, targeting for Big Bend.

Comment 4 vsomyaju 2013-06-05 09:09:40 UTC
We were not able to reproduce this bug.
Can you please provide the steps to reproduce it?

Comment 5 SATHEESARAN 2013-06-06 09:48:40 UTC
Hi Venkatesh,

I also tried to reproduce this issue, but was not able to.

Comment 6 Nagaprasad Sathyanarayana 2013-07-18 06:18:53 UTC
Since QE is unable to reproduce this bug, closing this for now. If it is found in the future, we shall take it up.