Bug 912383 - [QEMU/KVM-RHS] gluster self heal daemon is not operational after few operations on the volume
Summary: [QEMU/KVM-RHS] gluster self heal daemon is not operational after few operations on the volume
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: vsomyaju
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-02-18 15:04 UTC by SATHEESARAN
Modified: 2013-07-18 06:18 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
virt qemu integration
Last Closed: 2013-07-18 06:18:53 UTC
Embargoed:


Attachments
sosreport on which the problem occurred (1.00 MB, application/x-xz)
2013-02-18 15:04 UTC, SATHEESARAN
no flags

Description SATHEESARAN 2013-02-18 15:04:40 UTC
Created attachment 698915 [details]
sosreport on which the problem occurred

Description of problem:
The gluster self-heal daemon is not operational after a few gluster operations such as add-brick and self-heal.

The following error messages also appear in /var/log/glusterfs/glustershd.log:

[2013-02-18 16:21:58.853780] E [options.c:166:xlator_option_validate_bool] 0-rep-qcow2-replicate-0: option eager-lock ^A: '^A' is not a valid boolean value
[2013-02-18 16:21:58.853806] W [options.c:771:xl_opt_validate] 0-rep-qcow2-replicate-0: validate of eager-lock returned -1
[2013-02-18 16:21:58.853836] E [graph.c:272:glusterfs_graph_validate_options] 0-rep-qcow2-replicate-0: validation failed: option eager-lock ^A: '^A' is not a valid boolean value
[2013-02-18 16:21:58.853861] E [graph.c:476:glusterfs_graph_activate] 0-graph: validate options failed
[2013-02-18 16:21:58.854198] W [glusterfsd.c:924:cleanup_and_exit] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5) [0x335100f0c5] (-->/usr/sbin/glusterfs(mgmt_getspec_cbk+0xe0) [0x40ca30] (-->/usr/sbin/glusterfs(glusterfs_process_volfp+0x198) [0x405a88]))) 0-: received signum (0), shutting down
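
The '^A' in the log above indicates that cluster.eager-lock reached the self-heal daemon graph as a control character rather than a boolean, so option validation fails and the daemon shuts down. A minimal check-and-recovery sketch, assuming the bad value can simply be overwritten (the volume name is a placeholder):

    # Confirm the stored value of the option
    gluster volume info <vol-name> | grep eager-lock
    # Re-set the option to a valid boolean
    gluster volume set <vol-name> cluster.eager-lock enable
    # "start ... force" is commonly used to respawn per-volume daemons such as the self-heal daemon
    gluster volume start <vol-name> force
    # Verify the self-heal daemon is listed as running again
    gluster volume status <vol-name>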


Version-Release number of selected component (if applicable):
RHS2.0+ - [ glusterfs-3.3.0.5rhs40 ]

How reproducible:
Once; I have not tried it again.

Steps to Reproduce:
1. Create a 6x2 distributed-replicate volume (see the example command sequence after these steps)

2. Fuse mount the volume

3. Create a qcow2 image on the volume

4. Create a VM [appvm] using the image created above as its backing disk.
   (i.e) virt-install --name vm1 --ram 4096 --vcpus 4 --location <iso-location>
 --disk path=<path-to-image-file-on-fuse-mnt>,format=qcow2,bus=virtio --vnc

5. Create a snapshot of the appvm, while saving one of the RHS nodes.
   <snapshot> : virsh snapshot-create-as --name snap1 --domain appvm
   <saving VM> : virsh save <domain-name> --file <file-name-to-save>
NOTE: After saving, the VM will be shut down

6. Create a few files from inside the VM, once the snapshot is created

7. Add a pair of bricks to the volume.

8. Start the VM which was shut down earlier after saving
   (i.e) virsh start <domain-name>

9. Start the snapshot-revert on the appvm
   (i.e) virsh snapshot-revert --domain appvm --snapshotname snap1

10. Start rebalancing operation as follows,
   gluster volume rebalance <vol-name> fix-layout start
   gluster volume rebalance <vol-name> start

11. Initiate self heal also, by
   gluster volume heal <vol-name>

12. Check for status of rebalance by using the command,
   gluster volume rebalance <vol-name> status

13. Check the status of healing, by using the command,
   gluster volume heal <vol-name> info
   gluster volume rebalance <vol-name> status

14. Then remove the bricks that were added earlier [safe removal]
   gluster volume remove-brick <vol-name> <brick> ... start

15. Monitor the remove-brick operation as it migrates data across the bricks; the status should move from IN PROGRESS to COMPLETED.
   gluster volume remove-brick <vol-name> status

16. Complete the remove-brick operation
   gluster volume remove-brick <vol-name> <brick> ... commit

17. After a few days [4 in my case], try healing the volume again
    gluster volume heal <vol-name>
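
For reference, a minimal command sketch for steps 1-3 above (host names, brick paths, image path, and image size are placeholders, not the exact values used in this setup):

    gluster volume create <vol-name> replica 2 \
        host1:/bricks/b1 host2:/bricks/b1 \
        host3:/bricks/b1 host4:/bricks/b1 \
        host5:/bricks/b1 host6:/bricks/b1 \
        host1:/bricks/b2 host2:/bricks/b2 \
        host3:/bricks/b2 host4:/bricks/b2 \
        host5:/bricks/b2 host6:/bricks/b2
    gluster volume start <vol-name>
    # FUSE-mount the volume and create the qcow2 image on it
    mount -t glusterfs host1:/<vol-name> /mnt/vmstore
    qemu-img create -f qcow2 /mnt/vmstore/appvm.qcow2 100G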

  
Actual results:

The self-heal daemon was not operational, and cluster.eager-lock was set to a control character [^A].

Expected results:

Self-heal should happen successfully.

Additional info:
Test bed design (a command sketch follows the list):
1. Create 6 logical volumes, each on a 550GB hard disk
2. Format all the logical volumes with XFS
3. Mount all the volumes under /home/rhsvms/rhsvm{1..6}
4. On 3 of the logical volumes, create four 100G raw images and one 50G raw image
5. On the other 3 logical volumes, create four 100G qcow2 images and one 50G raw image
6. Create 3 VMs, each using the 4 raw images as additional disks [bricks] and the 50GB disk for installation of RHS
7. Create 3 VMs, each using the 4 qcow2 images as additional disks [bricks] and the 50GB disk for installation of RHS
8. Create a 6x2 distributed-replicate volume, with the following pairs of disk combos:
 [ raw-raw, raw-raw, qcow2-qcow2, qcow2-qcow2, raw-qcow2, raw-qcow2 ]
9. Create 3 1x2 replica volumes, with raw-raw, raw-qcow2, and qcow2-qcow2 pairs
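
A hedged sketch of steps 1-4 of the test-bed preparation (volume group name, LV size, and file names are placeholders, not the exact values used here):

    # Carve a logical volume out of one 550GB disk (VG name is hypothetical)
    lvcreate -L 500G -n rhsvm1 rhs_vg
    # Format it with XFS and mount it under the path used above
    mkfs.xfs /dev/rhs_vg/rhsvm1
    mkdir -p /home/rhsvms/rhsvm1
    mount /dev/rhs_vg/rhsvm1 /home/rhsvms/rhsvm1
    # Create the backing images for one RHS VM (raw shown; qcow2 is analogous)
    qemu-img create -f raw /home/rhsvms/rhsvm1/brick-disk1.img 100G
    qemu-img create -f raw /home/rhsvms/rhsvm1/os-disk.img 50G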

Comment 1 SATHEESARAN 2013-02-18 15:15:20 UTC
Gluster volume info command output:

Volume Name: distrep-vmstore
Type: Distributed-Replicate
Volume ID: b0df694d-aaea-4a2e-8ffd-7773dda51177
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.105:/bricks/dist-repl-brick1
Brick2: 10.70.37.157:/bricks/dist-repl-brick1
Brick3: 10.70.37.105:/bricks/spare-brick1
Brick4: 10.70.37.157:/bricks/spare-brick1
Brick5: 10.70.37.162:/bricks/dist-repl-brick1
Brick6: 10.70.37.112:/bricks/dist-repl-brick1
Brick7: 10.70.37.162:/bricks/spare-brick1
Brick8: 10.70.37.112:/bricks/spare-brick1
Brick9: 10.70.37.150:/bricks/dist-repl-brick1
Brick10: 10.70.37.124:/bricks/dist-repl-brick1
Brick11: 10.70.37.150:/bricks/spare-brick1
Brick12: 10.70.37.124:/bricks/spare-brick1
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
storage.linux-aio: disable
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 36
storage.owner-gid: 36
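
For reference, the reconfigured options above look like the usual virt-store tuning; assuming they were applied individually rather than via a group profile, the commands would have been of the form:

    gluster volume set distrep-vmstore cluster.eager-lock enable
    gluster volume set distrep-vmstore network.remote-dio enable
    gluster volume set distrep-vmstore storage.owner-uid 36
    gluster volume set distrep-vmstore storage.owner-gid 36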

Comment 2 Scott Haines 2013-03-08 21:05:59 UTC
Targeting for Arches.

Comment 3 Scott Haines 2013-04-11 17:02:33 UTC
Per 04-10-2013 Storage bug triage meeting, targeting for Big Bend.

Comment 4 vsomyaju 2013-06-05 09:09:40 UTC
We were not able to reproduce this bug.
Could you please provide the steps to reproduce it?

Comment 5 SATHEESARAN 2013-06-06 09:48:40 UTC
Hi Venkatesh,

I also tried to reproduce this issue, but was not able to reproduce it.

Comment 6 Nagaprasad Sathyanarayana 2013-07-18 06:18:53 UTC
Since QE is unable to reproduce this bug, closing this for now. If it is found again in the future, we shall take it up.

