Bug 1004745
| Summary: | [RHS-RHOS] Snapshot of instances with cinder boot volumes stuck during self-heal and rebalance. | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Anush Shetty <ashetty> |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED EOL | QA Contact: | Anush Shetty <ashetty> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 2.1 | CC: | divya, grajaiya, pkarampu, rhs-bugs, rwheeler, smanjara, ssaha, storage-qa-internal, vagarwal, vbellur |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | virt rhos cinder rhs integration |
| Last Closed: | 2015-12-03 17:22:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Sosreports and statedumps here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1004745

Amar, this bug has been identified as a known issue for the Big Bend release. Please provide CCFR information in the Doc Text field.

Divya, as of now the RCA for the bug is not done, hence the summary of the bug itself serves as the CCFR.

I don't see any blocked locks or pending frames in the brick statedumps:

pk@localhost - ~/sos
12:56:01 :( ⚡ ls *dump* | xargs grep -i complete
pk@localhost - ~/sos
12:56:10 :( ⚡ ls *dump* | xargs grep -i blocked

Unfortunately no statedumps for the mount, rebalance, or glustershd processes are attached to the bug report, so we couldn't find where the fops could have been stuck. Could we try re-creating this issue?

Pranith

Tested on RHOS 4.0 with RHS 2.1 glusterfs-3.4.0.59rhs-1.el6_4.x86_64. With client-quorum enabled on the latest RHS version, I only brought down the second bricks in the cluster. Could not reproduce this issue.

1. Tested with an instance booted from a glance image: works fine, but the snapshot image takes time to upload.
2. Tested with an instance booted from a volume: creates a zero-byte snap that is unusable.

Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.
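Since the analysis above could not proceed without statedumps from the mount, rebalance, and glustershd processes, the following is a minimal sketch of how those dumps could be collected on a re-create; the dump directory and the pgrep patterns are assumptions about a typical setup, not taken from this report:

```
# Brick statedumps, requested through glusterd (written on each brick host,
# typically under /var/run/gluster):
gluster volume statedump cinder-vol
gluster volume statedump glance-vol

# FUSE mount, rebalance, and self-heal daemon processes write a statedump
# when they receive SIGUSR1; the pgrep patterns are illustrative:
kill -USR1 $(pgrep -f 'glusterfs.*cinder-vol')
kill -USR1 $(pgrep -f 'glusterfs.*glance-vol')
kill -USR1 $(pgrep -f glustershd)

# Then check the resulting dumps for stuck frames and blocked locks,
# the same way the brick dumps were checked above:
ls /var/run/gluster/*dump* | xargs grep -i complete
ls /var/run/gluster/*dump* | xargs grep -i blocked
```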
Description of problem:
While taking snapshots of nova instances with bootable cinder volumes, with self-heal and rebalance running, the snapshots of some of the instances were found to be stuck in the QUEUED state for more than 24 hours. Both the cinder volumes and the glance images were served out of RHS volumes. Efforts to terminate the nova instances didn't succeed, making them unusable.

Version-Release number of selected component (if applicable):
RHS: glusterfs-3.4.0.30rhs-2.el6rhs.x86_64
Openstack:
openstack-cinder-2013.1.2-3.el6ost.noarch
openstack-nova-compute-2013.1.2-4.el6ost.noarch

How reproducible:
Consistent

Steps to Reproduce:
1. Create two 2x2 Distributed-Replicate volumes, one for cinder and one for glance.
2. Tag the volumes with group virt, i.e.
   gluster volume set cinder-vol group virt
   gluster volume set glance-vol group virt
3. Set the storage.owner-uid and storage.owner-gid of glance-vol to 161:
   gluster volume set glance-vol storage.owner-uid 161
   gluster volume set glance-vol storage.owner-gid 161
4. Set the storage.owner-uid and storage.owner-gid of cinder-vol to 165:
   gluster volume set cinder-vol storage.owner-uid 165
   gluster volume set cinder-vol storage.owner-gid 165
5. Volume info:

Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 25b9729b-b326-4eb8-9068-961c67ee25c6
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/cinder1/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/cinder1/s2
Brick3: rhshdp03.lab.eng.blr.redhat.com:/cinder1/s3
Brick4: rhshdp04.lab.eng.blr.redhat.com:/cinder1/s4
Brick5: rhshdp03.lab.eng.blr.redhat.com:/cinder2/s5
Brick6: rhshdp04.lab.eng.blr.redhat.com:/cinder2/s6
Brick7: rhshdp01.lab.eng.blr.redhat.com:/cinder2/s7
Brick8: rhshdp02.lab.eng.blr.redhat.com:/cinder2/s8
Options Reconfigured:
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 165
storage.owner-gid: 165

Volume Name: glance-vol
Type: Distributed-Replicate
Volume ID: c3fe0412-9fec-4914-8fcc-648dc8632a2e
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/glance1/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/glance1/s2
Brick3: rhshdp03.lab.eng.blr.redhat.com:/glance1/s3
Brick4: rhshdp04.lab.eng.blr.redhat.com:/glance1/s4
Brick5: rhshdp03.lab.eng.blr.redhat.com:/glance3/s5
Brick6: rhshdp04.lab.eng.blr.redhat.com:/glance3/s6
Brick7: rhshdp01.lab.eng.blr.redhat.com:/glance3/s7
Brick8: rhshdp02.lab.eng.blr.redhat.com:/glance3/s8
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 161
storage.owner-gid: 161

6. Configure cinder to use the glusterfs volume:
a. # openstack-config --set /etc/cinder/cinder.conf DEFAULT volume_driver cinder.volume.drivers.glusterfs.GlusterfsDriver
   # openstack-config --set /etc/cinder/cinder.conf DEFAULT glusterfs_shares_config /etc/cinder/shares.conf
   # openstack-config --set /etc/cinder/cinder.conf DEFAULT glusterfs_mount_point_base /var/lib/cinder/volumes
b. # cat /etc/cinder/shares.conf
   rhshdp01.lab.eng.blr.redhat.com:cinder-vol
c. for i in api scheduler volume; do sudo service openstack-cinder-${i} restart; done
7. Mount the RHS glance volume on /var/lib/glance/images (a mount sketch follows below).
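A minimal sketch of step 7, assuming the glance volume is FUSE-mounted from the same server used in /etc/cinder/shares.conf (any server in the trusted pool would do):

```
# Mount the RHS glance volume where the glance filesystem store expects it;
# file ownership is already handled by the storage.owner-uid/gid options set above:
mount -t glusterfs rhshdp01.lab.eng.blr.redhat.com:/glance-vol /var/lib/glance/images

# Optional /etc/fstab entry so the mount survives a reboot:
# rhshdp01.lab.eng.blr.redhat.com:/glance-vol /var/lib/glance/images glusterfs defaults,_netdev 0 0
```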
8. Uploaded the RHEL 6.4 ISO from the OpenStack Horizon dashboard:
   http://download.eng.blr.redhat.com/pub/rhel/released/RHEL-6/6.4/Server/x86_64/iso/RHEL6.4-20130130.0-Server-x86_64-DVD1.iso
9. Created 10 cinder volumes of various sizes:

# cinder list
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
|                  ID                  | Status | Display Name | Size | Volume Type | Bootable |             Attached to              |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+
| 0680e544-4569-42f8-951a-43239211f944 | in-use |    vol_5     |  20  |     None    |  false   | 232e081b-0a78-4660-a3da-cbcded620a86 |
| 38b16c7a-0aa6-4411-b767-121adfd629e1 | in-use |    vol_9     |  10  |     None    |  false   | 602fdbcf-35f7-4842-9153-2ef486100232 |
| 4570c78c-e6be-4b25-912d-ef516f9bebaa | in-use |    vol_1     |  60  |     None    |  false   | d0f2fb56-501b-415e-8259-4c3c68d5d3c3 |
| 4f60a78e-c28d-46da-be16-030f2fc98e97 | in-use |    vol_2     |  60  |     None    |  false   | b5315199-cf5b-49e4-a38e-915b50398060 |
| 55344707-5d38-42e6-b64e-657c1392bf33 | in-use |    vol_7     |  10  |     None    |  false   | 28f115ed-4035-49a4-9a0e-20172145558e |
| 81562845-7ce0-4638-a9b6-35288757cb33 | in-use |    vol_10    |  10  |     None    |  false   | 81f3eeb9-4dc9-4123-aaf1-3a7361c8ab79 |
| 82b693a9-1551-4cf6-9565-04c88afa9697 | in-use |    vol_6     |  20  |     None    |  false   | 15478a16-f291-427a-89ae-9b2117dcba65 |
| cf2c0134-3435-43d6-a534-9993b28e212e | in-use |    vol_4     |  20  |     None    |  false   | 64096361-a9d7-4cd6-902b-cc078792f6ad |
| d860fb21-5dfc-4f96-ba4f-486a507d3539 | in-use |    vol_8     |  10  |     None    |  false   | a49085f0-3400-4667-ad9b-f24ce4acbc4e |
| e427682f-0d8f-474b-b0ab-22038e7421ed | in-use |    vol_3     |  20  |     None    |  false   | 845e13a7-87dc-4431-85d1-de8faa3387d5 |
+--------------------------------------+--------+--------------+------+-------------+----------+--------------------------------------+

10. Created 10 instances using the cinder volumes as boot volumes (a boot sketch follows below).
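The exact boot command for step 10 is not recorded in the report; a plausible sketch using the nova CLI of that release, with the flavor, volume ID, and instance name as placeholders, is:

```
# Boot an instance from an existing cinder volume: map the volume to vda;
# the trailing 0 means the volume is not deleted when the instance is terminated.
# <vol-id>, the flavor, and the instance name are placeholders.
nova boot --flavor m1.medium \
    --block-device-mapping vda=<vol-id>:::0 \
    vm_1
```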
11. After creating the instances, ran the snapshot command to create a snapshot of each of the instances:

# nova image-create <instance-id> <snap-name>
# nova image-list
+--------------------------------------+----------+--------+--------------------------------------+
|                  ID                  |   Name   | Status |                Server                |
+--------------------------------------+----------+--------+--------------------------------------+
| 5717fe64-b592-40c3-abb8-396305944d89 | rhel-iso | ACTIVE |                                      |
| 4d324180-86b8-4e77-af5e-fdba70c99833 |  snap_1  | SAVING | d0f2fb56-501b-415e-8259-4c3c68d5d3c3 |
| d729e8da-3d27-4e45-8071-044134ac8e0b | snap_10  | ACTIVE | 81f3eeb9-4dc9-4123-aaf1-3a7361c8ab79 |
| 3ae40bc6-32e7-42ef-846d-f3bfa2deabf0 |  snap_2  | SAVING | b5315199-cf5b-49e4-a38e-915b50398060 |
| a2912a53-5c92-4e3b-9a2b-85a76da40eac |  snap_3  | SAVING | 845e13a7-87dc-4431-85d1-de8faa3387d5 |
| 73ca28da-05f6-4bea-b93b-8dde2f14bb83 |  snap_4  | SAVING | 64096361-a9d7-4cd6-902b-cc078792f6ad |
| 1cd7ae75-f212-45bd-a4f2-c7b84e276597 |  snap_5  | SAVING | 232e081b-0a78-4660-a3da-cbcded620a86 |
| e0c88433-e0c6-4db3-b5c7-f12ff7ec0072 |  snap_6  | SAVING | 15478a16-f291-427a-89ae-9b2117dcba65 |
| e4e3eca7-7108-4d25-9186-610866537eed |  snap_7  | SAVING | 28f115ed-4035-49a4-9a0e-20172145558e |
| 5e46a462-e29a-497f-b633-b1d036fe68e8 |  snap_8  | SAVING | a49085f0-3400-4667-ad9b-f24ce4acbc4e |
| 6513a0a4-d150-40e2-b472-90245937b842 |  snap_9  | ACTIVE | 602fdbcf-35f7-4842-9153-2ef486100232 |
+--------------------------------------+----------+--------+--------------------------------------+

12. While the snapshots were being taken, 2 brick machines, rhshdp02.lab.eng.blr.redhat.com and rhshdp03.lab.eng.blr.redhat.com, were powered off and brought back up again.
13. 4 bricks were added to glance-vol and cinder-vol using add-brick, and then rebalance was run (a rebalance sketch follows after the expected results). The snapshot process had not completed yet.

gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/cinder2/s5 rhshdp04.lab.eng.blr.redhat.com:/cinder2/s6 rhshdp01.lab.eng.blr.redhat.com:/cinder2/s7 rhshdp02.lab.eng.blr.redhat.com:/cinder2/s8
gluster volume add-brick glance-vol rhshdp03.lab.eng.blr.redhat.com:/glance3/s5 rhshdp04.lab.eng.blr.redhat.com:/glance3/s6 rhshdp01.lab.eng.blr.redhat.com:/glance3/s7 rhshdp02.lab.eng.blr.redhat.com:/glance3/s8

Actual results:
Snapshots for 2 instances succeeded; the rest of the snapshot processes have been in the 'Queued' state for more than 24 hours.

Expected results:
Snapshots should be successful.
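Step 13 lists only the add-brick commands; the rebalance that followed would have been started and monitored with the standard gluster CLI, roughly:

```
# Kick off rebalance on both expanded volumes, then poll until completed:
gluster volume rebalance cinder-vol start
gluster volume rebalance glance-vol start

gluster volume rebalance cinder-vol status
gluster volume rebalance glance-vol status
```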
Additional info:

1) The snapshot images get saved on the RHS glance-vol (a quick check of what actually landed there is sketched below).

2) # gluster volume heal cinder-vol info
Gathering list of entries to be healed on volume cinder-vol has been successful

Brick rhshdp01.lab.eng.blr.redhat.com:/cinder1/s1
Number of entries: 0

Brick rhshdp02.lab.eng.blr.redhat.com:/cinder1/s2
Number of entries: 0

Brick rhshdp03.lab.eng.blr.redhat.com:/cinder1/s3
Number of entries: 0

Brick rhshdp04.lab.eng.blr.redhat.com:/cinder1/s4
Number of entries: 0

Brick rhshdp03.lab.eng.blr.redhat.com:/cinder2/s5
Number of entries: 0

Brick rhshdp04.lab.eng.blr.redhat.com:/cinder2/s6
Number of entries: 0

Brick rhshdp01.lab.eng.blr.redhat.com:/cinder2/s7
Number of entries: 0

Brick rhshdp02.lab.eng.blr.redhat.com:/cinder2/s8
Number of entries: 0

[root@rhshdp01 ~]# gluster volume heal glance-vol info
Gathering list of entries to be healed on volume glance-vol has been successful

Brick rhshdp01.lab.eng.blr.redhat.com:/glance1/s1
Number of entries: 0

Brick rhshdp02.lab.eng.blr.redhat.com:/glance1/s2
Number of entries: 0

Brick rhshdp03.lab.eng.blr.redhat.com:/glance1/s3
Number of entries: 0

Brick rhshdp04.lab.eng.blr.redhat.com:/glance1/s4
Number of entries: 0

Brick rhshdp03.lab.eng.blr.redhat.com:/glance3/s5
Number of entries: 0

Brick rhshdp04.lab.eng.blr.redhat.com:/glance3/s6
Number of entries: 0

Brick rhshdp01.lab.eng.blr.redhat.com:/glance3/s7
Number of entries: 0

Brick rhshdp02.lab.eng.blr.redhat.com:/glance3/s8
Number of entries: 0

[root@rhshdp01 ~]# gluster volume rebalance cinder-vol status; gluster volume rebalance glance-vol status;
                                   Node   Rebalanced-files          size       scanned      failures       skipped         status   run time in secs
                              ---------        -----------   -----------   -----------   -----------   -----------   ------------   ----------------
                              localhost                  3        80.0GB            13             0             0      completed            1220.00
        rhshdp04.lab.eng.blr.redhat.com                  0        0Bytes            10             0             0      completed               0.00
        rhshdp02.lab.eng.blr.redhat.com                  0        0Bytes            10             0             0      completed               0.00
        rhshdp03.lab.eng.blr.redhat.com                  0        0Bytes            10             0             0      completed               0.00
volume rebalance: cinder-vol: success:
                                   Node   Rebalanced-files          size       scanned      failures       skipped         status   run time in secs
                              ---------        -----------   -----------   -----------   -----------   -----------   ------------   ----------------
                              localhost                  1         3.5GB             4             0             0      completed              94.00
        rhshdp04.lab.eng.blr.redhat.com                  0        0Bytes             3             0             0      completed               0.00
        rhshdp02.lab.eng.blr.redhat.com                  0        0Bytes             3             0             0      completed               0.00
        rhshdp03.lab.eng.blr.redhat.com                  1         3.5GB             4             0             0      completed              91.00
volume rebalance: glance-vol: success:
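Because the snapshot images land on glance-vol (point 1 above) and a later comment reports zero-byte snapshots, a quick way to see how far a stuck upload got is to correlate an image ID from `nova image-list` with the file on the glance mount. This is a sketch assuming the default filesystem store, which names image files after the image UUID:

```
# Show glance's view of one of the stuck snapshots, then the file it maps to;
# the UUID here is one of the SAVING images from the nova image-list output above:
glance image-show 4d324180-86b8-4e77-af5e-fdba70c99833
ls -lh /var/lib/glance/images/4d324180-86b8-4e77-af5e-fdba70c99833

# A zero-byte or missing file for an image stuck in SAVING/QUEUED means the
# upload from the compute node never made progress.
```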