| Summary: | [RHS-RHOS] mkfs.ext4 hangs at "Creating journal" on cinder volume attached to an instance during rebalance with self-heal | |||
|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | shilpa <smanjara> | |
| Component: | glusterfs | Assignee: | Anand Avati <aavati> | |
| Status: | CLOSED ERRATA | QA Contact: | shilpa <smanjara> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 2.1 | CC: | amarts, ashetty, chrisw, grajaiya, pkarampu, rhs-bugs, sgowda, shaines, vagarwal, vbellur | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.4.0.27rhs-1 | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1002399 (view as bug list) | Environment: | virt rhos cinder rhs integration | |
| Last Closed: | 2013-09-23 22:36:04 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Bug Depends On: | ||||
| Bug Blocks: | 1002399 | |||
sosreports in: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/999528

This issue is reproducible on glusterfs-3.4.0.21rhs-1.el6rhs.x86_64:
1) Created 10 cinder volumes hosted on a 1x2 pure replica volume.
2) Added 2 more bricks to the volume.
3) Started rebalance.
4) Brought down one of the bricks in each replica pair and brought it up again.
5) While the rebalance was on, attached the volumes to instances and started formatting the cinder volumes with ext4.
6) One of the cinder volumes was stuck with the issue mentioned above.
7) Tried detaching the volume from the instance; that hung as well.

Requesting blocker since this is reproducible.

shanks, are the logs similar to the one in the description too?

As discussed in the Big Bend blocker daily readout, assigning it to Avati for further insight.

Client: rhs-client28.lab.eng.blr.redhat.com
Server:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/cinder1/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/cinder1/s2
Brick3: rhshdp03.lab.eng.blr.redhat.com:/cinder1/s3
Brick4: rhshdp04.lab.eng.blr.redhat.com:/cinder1/s4

Cinder volume which hung:
# . keystonerc_admin
# cinder list
| fdbec220-002a-4f8a-a719-461a0d67f08e | detaching | vol_1 | 10 | None | false | 3346b176-5a6e-4a2a-9008-fd812831aff4 |

Please let me know if you need more info.

Verified on 3.4.0.30rhs. The hang is not seen anymore.
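The `cinder list` output above shows the volume wedged in the `detaching` state. A minimal sketch for pulling the status column out of that pipe-delimited table; the sample row is embedded verbatim from this report since a live cinder endpoint is assumed unavailable (on a deployment you would pipe `cinder list` in instead):

```shell
# Extract the status field for a given volume ID from `cinder list`
# table output (fields are '|'-delimited: $2 = ID, $3 = status).
parse_cinder_status() {
  awk -F'|' -v id="$1" '$2 ~ id { gsub(/ /, "", $3); print $3 }'
}

printf '%s\n' \
  '| fdbec220-002a-4f8a-a719-461a0d67f08e | detaching | vol_1 | 10 | None | false | 3346b176-5a6e-4a2a-9008-fd812831aff4 |' \
  | parse_cinder_status fdbec220-002a-4f8a-a719-461a0d67f08e
# -> detaching
```

A volume that stays in `detaching` (rather than returning to `available`) is the symptom reported here.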
Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 7d981582-e190-44e2-b51f-f2750380dcb2
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: 10.70.37.168:/rhs/brick2/c1
Brick2: 10.70.37.74:/rhs/brick2/c2
Brick3: 10.70.37.220:/rhs/brick2/c7
Brick4: 10.70.37.203:/rhs/brick2/c8
Brick5: 10.70.37.168:/rhs/brick2/c9
Brick6: 10.70.37.74:/rhs/brick2/c10
Brick7: 10.70.37.168:/rhs/brick2/c5
Brick8: 10.70.37.74:/rhs/brick2/c6
Brick9: 10.70.37.220:/rhs/brick2/c3
Brick10: 10.70.37.203:/rhs/brick2/c4
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-gid: 165
storage.owner-uid: 165

Tested mkfs.ext4 on volume volume-343b84a1-44ad-467b-a661-edbb94f29824, attached to a rhel-6.4 instance and located on brick c1.

[root@rhs-vm1 brick2]# ls -lR c1 c5 c9
c1:
total 0
---------T 2 root root 0 Sep 3 15:21 volume-18152182-7c18-4b2d-a26e-9b82e27ab86c
-rw-rw-rw- 2 qemu qemu 10737418240 Sep 3 15:29 volume-343b84a1-44ad-467b-a661-edbb94f29824
-rw-rwSrwT 2 qemu qemu 10737418240 Sep 3 15:27 volume-f879e2d9-7be5-4644-afae-12edeeab2942

Did an add-brick first, then brought down the brick processes on the replica pair 10.70.37.74 and brought them back up. Did a remove-brick on cinder-vol:

[root@rhs-vm1 brick2]# gluster v remove-brick cinder-vol 10.70.37.168:/rhs/brick2/c1 10.70.37.74:/rhs/brick2/c2 start

During the migration of volume-343b84a1-44ad-467b-a661-edbb94f29824, brought down the brick processes on 10.70.37.74 and brought them back after a few seconds. Ran mkfs.ext4 on the attached volume while the migration was still running. Found NO hang; the formatting completed successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html |
Description of problem:
mkfs.ext4 hangs forever at "Creating journal" on a cinder volume attached to an instance when rebalance and self-heal are running concurrently. It remains hung even after the rebalance completes successfully. I could not find any way to recover from this situation and ended up losing the instance.

Version-Release number of selected component (if applicable):
RHS: glusterfs-3.4.0.20rhs
RHOS: Grizzly-2013.1.2

How reproducible:
Tested once

Steps to Reproduce:
1. Create two 6x2 distribute-replicate volumes (glance-vol and cinder-vol) across 4 RHS nodes with 3 bricks on each:
# gluster volume create cinder-vol replica 2 <brick1> .... <brick12>
# gluster volume create glance-vol replica 2 <brick1> .... <brick12>
2. Tag these volumes with group virt:
# gluster volume set cinder-vol group virt
# gluster volume set glance-vol group virt
3. Set storage.owner-uid and storage.owner-gid as below:
# gluster volume set cinder-vol storage.owner-uid 165
# gluster volume set cinder-vol storage.owner-gid 165
# gluster volume set glance-vol storage.owner-uid 161
# gluster volume set glance-vol storage.owner-gid 161
4. Configure glance and cinder to use the gluster volumes created above.
For glance, change the path that Glance uses for its file system store in /etc/glance/glance-api.conf:
filesystem_store_datadir = /mnt/gluster/glance/images
5. Fuse mount the glance volume:
# mount -t glusterfs 10.70.37.168:glance-vol /mnt/gluster
# mkdir -p /mnt/gluster/glance/images
6. For cinder, edit /etc/cinder/cinder.conf:
# openstack-config --set /etc/cinder/cinder.conf volume_driver cinder.volume.drivers.glusterfs.GlusterfsDriver
# openstack-config --set /etc/cinder/cinder.conf glusterfs_shares_config /etc/cinder/shares.conf
# openstack-config --set /etc/cinder/cinder.conf glusterfs_mount_point_base /var/lib/cinder/images
7. Add a volume entry for the volume created in step 1 to /etc/cinder/shares.conf.
The content of this file should be:
10.70.37.168:cinder-vol
8. Restart the cinder and glance services.
9. Verify that the cinder volumes are mounted automatically, using the mount command.
10. Create a glance image and create an instance.
11. Create single/multiple cinder volumes and attach them to a running instance.
12. In the VM instance, check for the volume being attached as a disk using fdisk -l.
13. Locate the file path of the cinder volume on the RHS nodes and do a remove-brick start on the bricks involved.
14. While the rebalance is in progress, bring down the bricks of one of the nodes and bring them back up.
15. Log in to the VM instance and run mkfs.ext4 on the disk volume:
# mkfs.ext4 /dev/vdc
16. Check the rebalance status.
17. Check gluster volume heal <volume> info.
18. Check the progress of mkfs.ext4 in the VM.

Actual results:
mkfs.ext4 hangs forever at "Creating journal" and does not complete.

Expected results:
mkfs.ext4 should complete without any delay.

Additional info:
Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 7d981582-e190-44e2-b51f-f2750380dcb2
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.168:/rhs/brick2/c1
Brick2: 10.70.37.74:/rhs/brick2/c2
Brick3: 10.70.37.220:/rhs/brick2/c7
Brick4: 10.70.37.203:/rhs/brick2/c8
Brick5: 10.70.37.168:/rhs/brick2/c9
Brick6: 10.70.37.74:/rhs/brick2/c10
Brick7: 10.70.37.168:/rhs/brick2/c5
Brick8: 10.70.37.74:/rhs/brick2/c6
Brick9: 10.70.37.220:/rhs/brick2/c3
Brick10: 10.70.37.203:/rhs/brick2/c4
Brick11: 10.70.37.220:/rhs/brick2/c11
Brick12: 10.70.37.203:/rhs/brick2/c12
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-gid: 165
storage.owner-uid: 165

The file in question is volume-06aa2d15-a030-453d-9a8b-3b75a5f164db.
The output below shows the file being migrated:

[root@rhs-vm1 brick2]# ls -lR c1 13 c5 c9
ls: cannot access 13: No such file or directory
c1:
total 5398352
-rw-rwSrwT 2 qemu qemu 10737418240 Aug 21 14:25 volume-06aa2d15-a030-453d-9a8b-3b75a5f164db
-rw-rw-rw- 2 qemu qemu 107374182400 Aug 21 15:47 volume-53c03932-70fb-4bb1-abf0-8ce145dd8ce0
c5:
total 0
---------T 2 root root 10737418240 Aug 21 15:45 volume-06aa2d15-a030-453d-9a8b-3b75a5f164db
-rw-rw-rw- 2 root root 10737418240 Aug 21 14:25 volume-7889ef4b-b4d5-4d4b-a365-3754aec6e5a6
c9:
total 0
-rw-rw-rw- 2 root root 10737418240 Aug 19 14:07 volume-54e3733b-7ede-422d-a109-7113dd9f3673

[root@rhs-vm1 brick2]# gluster v remove-brick cinder-vol 10.70.37.168:/rhs/brick2/c1 10.70.37.74:/rhs/brick2/c2 status
Node          Rebalanced-files  size    scanned  failures  skipped  status       run-time in secs
------------  ----------------  ------  -------  --------  -------  -----------  ----------------
localhost     2                 20.0GB  3        0         0        in progress  2782.00
10.70.37.220  0                 0Bytes  0        0         0        not started  0.00
10.70.37.203  0                 0Bytes  0        0         0        not started  0.00
10.70.37.74   0                 0Bytes  0        0         0        not started  0.00

[2013-08-21 10:21:57.259561] W [socket.c:522:__socket_rwv] 0-cinder-vol-client-1: readv on 10.70.37.74:49162 failed (No data available)
[2013-08-21 10:21:57.259642] I [client.c:2103:client_rpc_notify] 0-cinder-vol-client-1: disconnected from 10.70.37.74:49162. Client process will keep trying to connect to glusterd until brick's port is available.
[2013-08-21 10:21:57.259983] W [socket.c:522:__socket_rwv] 0-cinder-vol-client-7: writev on 10.70.37.74:49174 failed (Connection reset by peer)
[2013-08-21 10:21:57.260123] W [socket.c:522:__socket_rwv] 0-cinder-vol-client-5: readv on 10.70.37.74:49164 failed (No data available)
[2013-08-21 10:21:57.260221] I [client.c:2103:client_rpc_notify] 0-cinder-vol-client-5: disconnected from 10.70.37.74:49164. Client process will keep trying to connect to glusterd until brick's port is available.
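The `---------T` mode string in the brick listings above is a DHT link file: a zero-permission, sticky-bit placeholder left on the source or target brick while the data file migrates. A small sketch for flagging such entries in `ls -l` output of a brick directory; the sample lines are taken from this report (on a live brick you would pipe the real listing in):

```shell
# Print the names of DHT link files (mode exactly "---------T")
# from `ls -l` style input.
find_linkfiles() {
  awk '$1 == "---------T" { print $NF }'
}

printf '%s\n' \
  '---------T 2 root root 10737418240 Aug 21 15:45 volume-06aa2d15-a030-453d-9a8b-3b75a5f164db' \
  '-rw-rw-rw- 2 root root 10737418240 Aug 21 14:25 volume-7889ef4b-b4d5-4d4b-a365-3754aec6e5a6' \
  | find_linkfiles
# -> volume-06aa2d15-a030-453d-9a8b-3b75a5f164db
```

Seeing the hung volume's file with this mode on c5 while the regular copy sits on c1 is consistent with the file being mid-migration when the hang occurred.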
[2013-08-21 10:21:57.260271] W [socket.c:522:__socket_rwv] 0-cinder-vol-client-7: readv on 10.70.37.74:49174 failed (No data available)
[2013-08-21 10:21:57.397666] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x37f120df55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37f120da93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37f120d9ae]))) 0-cinder-vol-client-7: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2013-08-21 10:21:55.652408 (xid=0x117260x)
[2013-08-21 10:21:57.397709] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 0-cinder-vol-client-7: remote operation failed: Transport endpoint is not connected
[2013-08-21 10:21:57.397778] W [afr-transaction.c:1497:afr_changelog_fsync_cbk] 0-cinder-vol-replicate-3: fsync(00000000-0000-0000-0000-000000000000) failed on subvolume cinder-vol-client-7. Transaction was WRITE
[2013-08-21 10:21:57.397854] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x37f120df55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x37f120da93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x37f120d9ae]))) 0-cinder-vol-client-7: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2013-08-21 10:21:55.700303 (xid=0x117309x)
[2013-08-21 10:21:57.397854] W [client-rpc-fops.c:984:client3_3_fsync_cbk] 0-cinder-vol-client-7: remote operation failed: Transport endpoint is not connected
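The repeated `forced unwinding frame ... op(FSYNC(16))` entries show in-flight fsync calls being errored out when the brick connection to client-7 dropped, which lines up with mkfs.ext4 stalling at the journal-creation step. A quick way to tally such events per operation type in a client log; the sample lines embed shortened copies of the log entries above (on a live system you would feed the actual client log file instead):

```shell
# Count forced-unwind events per gluster operation, e.g. op(FSYNC(16)),
# in glusterfs client log text on stdin.
count_unwinds() {
  grep -o 'op([A-Z]*([0-9]*))' | sort | uniq -c
}

printf '%s\n' \
  '0-cinder-vol-client-7: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2013-08-21 10:21:55.652408 (xid=0x117260x)' \
  '0-cinder-vol-client-7: forced unwinding frame type(GlusterFS 3.3) op(FSYNC(16)) called at 2013-08-21 10:21:55.700303 (xid=0x117309x)' \
  | count_unwinds
```

For these two sample lines the tally is 2 for op(FSYNC(16)); a high FSYNC count during a brick outage is the signature worth looking for when triaging this hang.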