Bug 988262

Summary: | [QEMU/KVM-RHS] EXT4 filesystem corruption & device IO error, in application VMs, after rebalance operation | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | SATHEESARAN <sasundar>
Component: | glusterfs | Assignee: | shishir gowda <sgowda>
Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 2.1 | CC: | aavati, amarts, grajaiya, nsathyan, pkarampu, rhs-bugs, sasundar, sdharane, sgowda, surs, vbellur
Target Milestone: | --- | Keywords: | TestBlocker
Target Release: | --- | Flags: | sasundar: needinfo-, sasundar: needinfo-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | glusterfs-3.4.0.19rhs-1 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: | virt qemu integration
Last Closed: | 2013-09-23 22:29:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description SATHEESARAN 2013-07-25 08:23:30 UTC

Created attachment 778110 [details]
EXT4 corruption as seen in appvm6
1. Info from Backend
====================

[Thu Jul 25 08:32:29 UTC 2013 root.37.200:~ ] # ls -lh /rhs/brick{1,2,3,4}/*
/rhs/brick1/dir1:
total 0
---------T 2 qemu qemu 0 Jul 25 00:28 appvm5

/rhs/brick2/dir2:
total 0
---------T 2 qemu qemu 0 Jul 25 07:47 appvm1

/rhs/brick3/dir3:
total 0
---------T 2 qemu qemu 0 Jul 25 07:47 appvm6.img
---------T 2 qemu qemu 0 Jul 25 07:47 appvm7.img

/rhs/brick4/spare1:
total 31G
---------T 2 qemu qemu 21G Jul 24 21:57 appvm2
---------T 2 qemu qemu 21G Jul 24 21:57 appvm3
---------T 2 qemu qemu 21G Jul 24 21:57 appvm4
-rw-r--r-- 2 qemu qemu 20G Jul 24 22:44 appvm6.img
-rw-r--r-- 2 qemu qemu 20G Jul 24 23:40 appvm7.img
---------T 2 qemu qemu 20G Jul 24 21:57 appvm8.img

[Thu Jul 25 08:35:00 UTC 2013 root.37.82:~ ] # ls -lh /rhs/brick{1,2,3,4}/*
/rhs/brick1/dir1:
total 0
---------T 2 qemu qemu 0 Jul 25 00:28 appvm5

/rhs/brick2/dir2:
total 0
---------T 2 qemu qemu 0 Jul 25 07:47 appvm1

/rhs/brick3/dir3:
total 0
---------T 2 qemu qemu 0 Jul 25 07:47 appvm6.img
---------T 2 qemu qemu 0 Jul 25 07:47 appvm7.img

/rhs/brick4/spare1:
total 31G
---------T 2 qemu qemu 21G Jul 24 21:57 appvm2
---------T 2 qemu qemu 21G Jul 24 21:57 appvm3
---------T 2 qemu qemu 21G Jul 24 21:57 appvm4
-rw-r--r-- 2 qemu qemu 20G Jul 24 22:44 appvm6.img
-rw-r--r-- 2 qemu qemu 20G Jul 24 23:40 appvm7.img
---------T 2 qemu qemu 20G Jul 24 21:57 appvm8.img

[Thu Jul 25 08:35:54 UTC 2013 root.37.204:~ ] # ls -lh /rhs/brick{1,2,3}/*
/rhs/brick1/dir1:
total 21G
-rw-r--r-- 2 qemu qemu 21G Jul 24 22:18 appvm1

/rhs/brick2/dir2:
total 76G
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm2
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm3
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm4
-rw-r--r-- 2 qemu qemu 20G Jul 25 08:36 appvm8.img

/rhs/brick3/dir3:
total 21G
-rw-r--r-- 2 qemu qemu 21G Jul 24 21:15 appvm5

[Thu Jul 25 08:36:47 UTC 2013 root.37.188:~ ] # ls -lh /rhs/brick{1,2,3}/*
/rhs/brick1/dir1:
total 21G
-rw-r--r-- 2 qemu qemu 21G Jul 24 22:18 appvm1

/rhs/brick2/dir2:
total 76G
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm2
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm3
-rw-r--r-- 2 qemu qemu 21G Jul 25 08:36 appvm4
-rw-r--r-- 2 qemu qemu 20G Jul 25 08:36 appvm8.img

/rhs/brick3/dir3:
total 21G
-rw-r--r-- 2 qemu qemu 21G Jul 24 21:15 appvm5
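The zero-byte ---------T entries above are DHT link files: sticky-bit placeholders left on a brick that point at the subvolume actually holding the data. A hedged way to confirm that for any one of them (the xattr name is standard; the path is taken from the listing above):

# Sketch: a DHT link file carries trusted.glusterfs.dht.linkto naming
# the subvolume where the real file lives.
getfattr -n trusted.glusterfs.dht.linkto -e text /rhs/brick1/dir1/appvm5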
2. xattrs on bricks
===================

[Thu Jul 25 08:39:04 UTC 2013 root.37.200:~ ] # getfattr -d -m . -e hex /rhs/brick{1,2,3,4}/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000492492486db6db6b
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick3/dir3
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000006db6db6c9249248f
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick4/spare1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000024924923
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

[Thu Jul 25 08:39:45 UTC 2013 root.37.82:~ ] # getfattr -d -m . -e hex /rhs/brick{1,2,3,4}/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000492492486db6db6b
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick3/dir3
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000006db6db6c9249248f
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick4/spare1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000024924923
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

[Thu Jul 25 08:40:44 UTC 2013 root.37.204:~ ] # getfattr -d -m . -e hex /rhs/brick{1,2,3}/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.afr.drvol-client-6=0x000000000000000000000000
trusted.afr.drvol-client-7=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000092492490b6db6db3
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick2/dir2
trusted.afr.drvol-client-8=0x000000000000000000000000
trusted.afr.drvol-client-9=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000b6db6db4db6db6d7
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick3/dir3
trusted.afr.drvol-client-10=0x000000000000000000000000
trusted.afr.drvol-client-11=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000db6db6d8ffffffff
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

[Thu Jul 25 08:41:18 UTC 2013 root.37.188:~ ] # getfattr -d -m . -e hex /rhs/brick{1,2,3}/*
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.afr.drvol-client-6=0x000000000000000000000000
trusted.afr.drvol-client-7=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000092492490b6db6db3
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick2/dir2
trusted.afr.drvol-client-8=0x000000000000000000000000
trusted.afr.drvol-client-9=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000b6db6db4db6db6d7
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

# file: rhs/brick3/dir3
trusted.afr.drvol-client-10=0x000000000000000000000000
trusted.afr.drvol-client-11=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000db6db6d8ffffffff
trusted.glusterfs.volume-id=0x84ba2d553f404bb48ae99db8f0e98b9f

3. Rebalance status
===================

[Thu Jul 25 08:39:12 UTC 2013 root.37.200:~ ] # gluster volume rebalance drvol status
Node          Rebalanced-files    size      scanned    failures    status       run time in secs
---------     ----------------    ------    -------    --------    ---------    ----------------
localhost     3                   60.0GB    11         0           completed    11580.00
10.70.37.82   0                   0Bytes    8          0           completed    1.00
10.70.37.188  0                   0Bytes    8          0           completed    1.00
10.70.37.204  1                   20.0GB    9          4           completed    2493.00
volume rebalance: drvol: success:

Observation
===========
I could see the following in the rebalance error log (/var/log/glusterfs/drvol-rebalance.log) on 10.70.37.82:

<snip>
[2013-07-24 21:15:36.297303] I [afr-common.c:2181:afr_discovery_cbk] 0-drvol-replicate-4: selecting local read_child drvol-client-8
[2013-07-24 21:15:36.300858] I [dht-layout.c:636:dht_layout_normalize] 0-drvol-dht: found anomalies in /. holes=1 overlaps=2 missing=0 down=0 misc=0
</snip>

(In reply to SATHEESARAN from comment #4)
> Observation
> ===========
> I could see the following in the rebalance error
> log (/var/log/glusterfs/drvol-rebalance.log) on 10.70.37.82:
>
> <snip>
> [2013-07-24 21:15:36.297303] I [afr-common.c:2181:afr_discovery_cbk]
> 0-drvol-replicate-4: selecting local read_child drvol-client-8
> [2013-07-24 21:15:36.300858] I [dht-layout.c:636:dht_layout_normalize]
> 0-drvol-dht: found anomalies in /. holes=1 overlaps=2 missing=0 down=0 misc=0
> </snip>

I could see the same errors in the client-side (10.70.36.32) log (/var/log/glusterfs/mnt-dr-vmstore.log) as well:

<snip>
[2013-07-24 21:14:34.885009] I [dht-layout.c:636:dht_layout_normalize] 1-drvol-dht: found anomalies in /. holes=1 overlaps=0 missing=0 down=1 misc=0
</snip>
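The "holes/overlaps" messages refer to the per-directory hash ranges stored in trusted.glusterfs.dht, dumped in section 2 above. A hedged sketch for eyeballing those ranges, assuming the common on-disk encoding of four big-endian 32-bit words (count, hash type, start, stop):

# Sketch: print the DHT hash range each brick claims for the directory.
# Assumes trusted.glusterfs.dht is 16 bytes: cnt, type, start, stop.
# Sorting the output by start offset makes holes and overlaps visible.
for d in /rhs/brick{1,2,3,4}/*; do
    x=$(getfattr --only-values -n trusted.glusterfs.dht -e hex "$d" 2>/dev/null) || continue
    printf '%-24s start=0x%s stop=0x%s\n' "$d" "${x:18:8}" "${x:26:8}"
done | sort -t= -k2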
Seems to be an fd migration issue:

[2013-07-24 21:14:35.034612] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf2fb4) to newfd (0xe645fc) (inode-gfid:16b92d60-1621-4d0b-9752-c090505da239)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.036723] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf2edc) to newfd (0xe64668) (inode-gfid:f60fc19c-627b-4c1a-95de-3d47f45fcab5)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.039936] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf2f48) to newfd (0xe646d4) (inode-gfid:cf62eaa8-7923-4cc9-b145-29f7fd359c86)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.084415] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf3020) to newfd (0xe64740) (inode-gfid:cb5dc5f7-9f42-4712-84e7-3e85aa03ffe0)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.088213] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf30f8) to newfd (0xe647ac) (inode-gfid:4f080784-cf1d-4f5d-9422-1fe8f2dac8be)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.268606] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf3164) to newfd (0xe64818) (inode-gfid:500bdfeb-d5bd-458d-8111-76da7428eef4)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.270745] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf308c) to newfd (0xe64884) (inode-gfid:f24d7373-b952-4f63-a258-a60fbd15c017)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:35.272783] I [fuse-bridge.c:4892:fuse_migrate_fd_open] 0-glusterfs-fuse: migrated basefd (0xbf323c) to newfd (0xe648f0) (inode-gfid:4f9990da-74b9-4526-8f1c-fd4a52726cd7)(old-subvolume:drvol-0 new-subvolume:drvol-1)
[2013-07-24 21:14:38.449142] I [rpc-clnt.c:1675:rpc_clnt_reconfig] 1-drvol-client-12: changing port to 49162 (from 0)
.........
[2013-07-24 21:15:15.649436] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-drvol-client-6: remote operation failed: Bad file descriptor
[2013-07-24 21:15:15.656641] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-drvol-client-7: remote operation failed: Bad file descriptor
[2013-07-24 21:15:15.656680] W [fuse-bridge.c:2631:fuse_writev_cbk] 0-glusterfs-fuse: 2490994: WRITE => -1 (Bad file descriptor)
[2013-07-24 21:15:15.671073] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-drvol-client-6: remote operation failed: Bad file descriptor
[2013-07-24 21:15:15.679869] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-drvol-client-7: remote operation failed: Bad file descriptor

Server logs (note that gfid 4f080784-cf1d-4f5d-9422-1fe8f2dac8be in the failing writes at the end is one of the fds migrated above):

[2013-07-24 18:13:07.294544] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from rhs-client8.lab.eng.blr.redhat.com-26442-2013/07/24-18:12:45:38843-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 18:24:15.488082] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from rhs-client8.lab.eng.blr.redhat.com-26442-2013/07/24-18:12:45:38843-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 18:24:15.488232] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection rhs-client8.lab.eng.blr.redhat.com-26442-2013/07/24-18:12:45:38843-drvol-client-6-0
[2013-07-24 18:24:15.488280] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of rhs-client8.lab.eng.blr.redhat.com-26442-2013/07/24-18:12:45:38843-drvol-client-6-0
[2013-07-24 18:28:54.770477] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from rhs-client8.lab.eng.blr.redhat.com-7995-2013/07/24-18:28:32:669057-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:14:56.482542] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from rhs-client8.lab.eng.blr.redhat.com-7995-2013/07/24-18:28:32:669057-drvol-client-6-1 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:14:56.581174] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4658-2013/07/24-18:09:15:362646-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:14:56.581251] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4658-2013/07/24-18:09:15:362646-drvol-client-6-0
[2013-07-24 21:14:56.581286] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4658-2013/07/24-18:09:15:362646-drvol-client-6-0
[2013-07-24 21:14:57.656278] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4613-2013/07/24-18:08:49:227528-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:14:57.656368] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4613-2013/07/24-18:08:49:227528-drvol-client-6-0
[2013-07-24 21:14:57.656408] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4613-2013/07/24-18:08:49:227528-drvol-client-6-0
[2013-07-24 21:14:59.010729] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2013-07-24 21:14:59.062835] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4587-2013/07/24-18:09:16:565428-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:14:59.062931] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4587-2013/07/24-18:09:16:565428-drvol-client-6-0
[2013-07-24 21:14:59.062974] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4587-2013/07/24-18:09:16:565428-drvol-client-6-0
[2013-07-24 21:14:59.240440] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4290-2013/07/24-18:09:16:917203-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:14:59.240491] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4290-2013/07/24-18:09:16:917203-drvol-client-6-0
[2013-07-24 21:14:59.240608] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4290-2013/07/24-18:09:16:917203-drvol-client-6-0
[2013-07-24 21:14:59.351492] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4226-2013/07/24-18:09:16:766361-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:14:59.351609] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4226-2013/07/24-18:09:16:766361-drvol-client-6-0
[2013-07-24 21:14:59.351640] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4226-2013/07/24-18:09:16:766361-drvol-client-6-0
[2013-07-24 21:15:00.082308] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4547-2013/07/24-18:08:49:538873-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:15:00.082378] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4547-2013/07/24-18:08:49:538873-drvol-client-6-0
[2013-07-24 21:15:00.082432] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4547-2013/07/24-18:08:49:538873-drvol-client-6-0
[2013-07-24 21:15:00.266400] I [glusterfsd-mgmt.c:1545:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2013-07-24 21:15:00.279650] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6493-2013/07/24-21:14:58:232896-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:00.308110] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4248-2013/07/24-18:08:49:885957-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:15:00.308168] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4248-2013/07/24-18:08:49:885957-drvol-client-6-0
[2013-07-24 21:15:00.308199] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4248-2013/07/24-18:08:49:885957-drvol-client-6-0
[2013-07-24 21:15:00.587673] I [server.c:773:server_rpc_notify] 0-drvol-server: disconnecting connection from localhost.localdomain-4186-2013/07/24-18:08:50:262229-drvol-client-6-0, Number of pending operations: 1
[2013-07-24 21:15:00.587758] I [server-helpers.c:752:server_connection_put] 0-drvol-server: Shutting down connection localhost.localdomain-4186-2013/07/24-18:08:50:262229-drvol-client-6-0
[2013-07-24 21:15:00.587998] I [server-helpers.c:585:server_log_conn_destroy] 0-drvol-server: destroyed connection of localhost.localdomain-4186-2013/07/24-18:08:50:262229-drvol-client-6-0
[2013-07-24 21:15:01.346994] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6485-2013/07/24-21:14:57:218199-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:01.423570] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6353-2013/07/24-21:14:59:723372-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:01.440864] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6041-2013/07/24-21:15:01:305761-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:01.467482] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6360-2013/07/24-21:15:00:737721-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:01.636060] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6034-2013/07/24-21:15:00:312372-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:01.731995] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-5982-2013/07/24-21:15:01:315529-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:02.068744] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-5975-2013/07/24-21:15:00:345084-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:36.490013] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6530-2013/07/24-21:15:26:4843-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:36.616477] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6380-2013/07/24-21:15:31:101716-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:36.929038] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6001-2013/07/24-21:15:31:344034-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:37.251672] I [server-handshake.c:569:server_setvolume] 0-drvol-server: accepted client from localhost.localdomain-6065-2013/07/24-21:15:31:451247-drvol-client-6-0 (version: 3.4.0.12rhs.beta6)
[2013-07-24 21:15:37.314065] E [posix.c:2150:posix_writev] 0-drvol-posix: write failed: offset 6094622720, Bad file descriptor
[2013-07-24 21:15:37.314102] I [server-rpc-fops.c:1439:server_writev_cbk] 0-drvol-server: 2322: WRITEV 1 (4f080784-cf1d-4f5d-9422-1fe8f2dac8be) ==> (Bad file descriptor)
[2013-07-24 21:15:37.335589] E [posix.c:2150:posix_writev] 0-drvol-posix: write failed: offset 6094622720, Bad file descriptor
[2013-07-24 21:15:37.335669] I [server-rpc-fops.c:1439:server_writev_cbk] 0-drvol-server: 2325: WRITEV 1 (4f080784-cf1d-4f5d-9422-1fe8f2dac8be) ==> (Bad file descriptor)
[2013-07-24 21:15:37.358656] E [posix.c:2150:posix_writev] 0-drvol-posix: write failed: offset 6095114240, Bad file descriptor

Spent some time on this issue today with Avati and Shishir, and it seems this particular case has come up in Big Bend now because of 'open-behind' (and, if you notice, 'gluster volume set <VOL> group virt' didn't disable open-behind). With build glusterfs-3.4.0.13rhs, open-behind is off by default on the volume, so can we get a final round of testing done on this?
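For reference, a hedged sketch of how one might confirm the open-behind state on the volume and force it off explicitly; the volume name drvol is taken from the logs above, and note that 'gluster volume info' lists only options that were explicitly reconfigured, so an absent entry means the build's default applies:

# Check whether open-behind was explicitly set on the volume; if it is
# absent from the output, the build's default is in effect.
gluster volume info drvol | grep -i open-behind

# Force it off explicitly rather than relying on the default
# (harmless if it is already off).
gluster volume set drvol performance.open-behind off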
Amar, yes. I will test this particular scenario with the latest glusterfs [3.4.0.14rhs-1] and will update this bug.

Can you please update the bug once the test has run without open-behind enabled (in the latest glusterfs, post 3.4.0.13, open-behind is off by default)?

Shishir, I will update soon after testing this.

I re-tested the same with glusterfs-3.4.0.17rhs-1, after installing the ISO RHS-2.1-20130806.n.2-RHS-x86_64-DVD1.iso, and I am still seeing the same issue in 2 of 7 VMs.

Steps Performed
===============
1. Create a 6x2 distributed-replicate volume on a 4-node cluster, each node having 3 bricks, i.e.
   gluster volume create <vol-name> replica 2 <brick1> .. <brick12>
2. Set the volume with profile virt, i.e.
   gluster volume set <vol-name> group virt
3. Set owner-uid and owner-gid to 107 and 107, i.e.
   gluster volume set <vol-name> storage.owner-uid 107
   gluster volume set <vol-name> storage.owner-gid 107
4. Start the volume, i.e.
   gluster volume start <vol-name>
5. Fuse mount the volume on the client [RHEL 6.4 in my case], i.e.
   mount.glusterfs <RHS-Server>:<vol-name> <mount-point>
6. Create 4 qcow2 images (thick provisioned) and 3 raw images, each of size 20GB, i.e.
   for i in 1 2 3 4; do qemu-img create -f qcow2 -o preallocation=metadata <mount-point>/appvm${i}.qcow2 20G; done
   for i in 5 6 7; do qemu-img create -f raw <mount-point>/appvm${i}.img 20G; done
7. Create 7 VMs running RHEL 6.4, each using an image created in step 6, i.e.
   for i in 1 2 3 4; do virt-install --name appvm$i --ram 2048 --vcpus 2 --pxe --disk path=/mnt/dr-vmstore/appvm${i}.qcow2,format=qcow2,bus=virtio,device=disk,io=threads --vnc
8. Execute the script from https://bugzilla.redhat.com/show_bug.cgi?id=983737#c3, which simulates IO running inside the VMs
9. Add a pair of bricks on the RHS server, i.e.
   gluster volume add-brick <vol-name> <brick1> <brick2>
10. Start the rebalance, i.e.
    gluster volume rebalance <vol-name> start
11. Check the rebalance status, i.e.
    gluster volume rebalance <vol-name> status

(A consolidated sketch of these commands follows this list.)
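A hedged consolidation of the steps above into one script; the node names (node1..node4) and brick paths are illustrative placeholders, not the exact hosts used in this setup:

#!/bin/bash
# Sketch of the reproduction flow; node1..node4 and brick paths are
# placeholders. Replica pairs are spread across nodes, as in the bug setup.
VOL=dr-vmstore
MNT=/mnt/dr-vmstore

gluster volume create $VOL replica 2 \
    node1:/rhs/brick1/dir1 node2:/rhs/brick1/dir1 \
    node1:/rhs/brick2/dir2 node2:/rhs/brick2/dir2 \
    node1:/rhs/brick3/dir3 node2:/rhs/brick3/dir3 \
    node3:/rhs/brick1/dir1 node4:/rhs/brick1/dir1 \
    node3:/rhs/brick2/dir2 node4:/rhs/brick2/dir2 \
    node3:/rhs/brick3/dir3 node4:/rhs/brick3/dir3
gluster volume set $VOL group virt
gluster volume set $VOL storage.owner-uid 107
gluster volume set $VOL storage.owner-gid 107
gluster volume start $VOL

# On the RHEL 6.4 client:
mount.glusterfs node1:$VOL $MNT
for i in 1 2 3 4; do
    qemu-img create -f qcow2 -o preallocation=metadata $MNT/appvm${i}.qcow2 20G
done
for i in 5 6 7; do
    qemu-img create -f raw $MNT/appvm${i}.img 20G
done
# ...install the 7 VMs and start the IO workload inside them, then:

gluster volume add-brick $VOL node1:/rhs/brick4/newbrick1 node2:/rhs/brick4/newbrick1
gluster volume rebalance $VOL start
gluster volume rebalance $VOL status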
Additional Information
======================

1. RHS Nodes
============
10.70.37.188
10.70.37.73
10.70.37.106
10.70.37.204

2. Client [RHEL 6.4]
====================
10.70.36.32 - rhs-client8.lab.eng.blr.redhat.com

3. Mount point
==============
Volume is mounted at /mnt/dr-vmstore

4. All commands are executed from the RHS node 10.70.37.188

5. Console logs on Client
=========================
[Wed Aug 7 08:14:23 UTC 2013 root.36.32:~ ] # mkdir /mnt/dr-vmstore
[Wed Aug 7 08:14:29 UTC 2013 root.36.32:~ ] # ls -ld /mnt/dr-vmstore
drwxr-xr-x. 2 root root 4096 Aug 7 13:44 /mnt/dr-vmstore
[Wed Aug 7 08:14:34 UTC 2013 root.36.32:~ ] # mount.glusterfs 10.70.37.188:dr-vmstore /mnt/dr-vmstore
[Wed Aug 7 08:15:08 UTC 2013 root.36.32:~ ] # ls -ld /mnt/dr-vmstore
drwxr-xr-x. 3 qemu qemu 138 Aug 7 13:33 /mnt/dr-vmstore
[Wed Aug 7 08:15:11 UTC 2013 root.36.32:~ ] # df -Th
Filesystem                        Type            Size  Used Avail Use% Mounted on
/dev/mapper/vg_rhsclient8-lv_root ext4             50G  3.2G   44G   7% /
tmpfs                             tmpfs           7.8G     0  7.8G   0% /dev/shm
/dev/sda1                         ext4            485M   65M  396M  14% /boot
/dev/mapper/vg_rhsclient8-lv_home ext4            1.8T  7.7G  1.7T   1% /home
10.70.37.188:dr-vmstore           fuse.glusterfs  510G  206M  510G   1% /mnt/dr-vmstore
[Wed Aug 7 08:16:01 UTC 2013 root.36.32:~ ] # mount
/dev/mapper/vg_rhsclient8-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_rhsclient8-lv_home on /home type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.70.37.188:dr-vmstore on /mnt/dr-vmstore type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)
[Wed Aug 7 08:16:05 UTC 2013 root.36.32:~ ] # ps aux | grep dr-vmstore
root 14875 0.2 0.4 336800 71748 ? Ssl 13:45 0:00 /usr/sbin/glusterfs --volfile-id=dr-vmstore --volfile-server=10.70.37.188 /mnt/dr-vmstore
root 14900 0.0 0.0 103244 856 pts/0 S+ 13:46 0:00 grep dr-vmstore

6. RHS Console logs on 10.70.37.188
===================================
[Wed Aug 7 08:13:51 UTC 2013 satheesaran@unused:~/sas/simplefs ] # ssh root.37.188
Last login: Wed Aug 7 07:26:57 2013 from dhcp-0-179.blr.redhat.com
[Wed Aug 7 09:22:41 UTC 2013 root.37.188:~ ] # gluster volume info dr-vmstore

Volume Name: dr-vmstore
Type: Distributed-Replicate
Volume ID: 6b5b534a-d714-4d6e-91b2-71423a8ff7f1
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.188:/rhs/brick1/dir1
Brick2: 10.70.37.73:/rhs/brick1/dir1
Brick3: 10.70.37.188:/rhs/brick2/dir2
Brick4: 10.70.37.73:/rhs/brick2/dir2
Brick5: 10.70.37.188:/rhs/brick3/dir3
Brick6: 10.70.37.73:/rhs/brick3/dir3
Brick7: 10.70.37.106:/rhs/brick1/dir1
Brick8: 10.70.37.204:/rhs/brick1/dir1
Brick9: 10.70.37.106:/rhs/brick2/dir2
Brick10: 10.70.37.204:/rhs/brick2/dir2
Brick11: 10.70.37.106:/rhs/brick3/dir3
Brick12: 10.70.37.204:/rhs/brick3/dir3
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Wed Aug 7 09:22:51 UTC 2013 root.37.188:~ ] # gluster volume status dr-vmstore
Status of volume: dr-vmstore
Gluster process                          Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.188:/rhs/brick1/dir1      49152   Y       4657
Brick 10.70.37.73:/rhs/brick1/dir1       49152   Y       2865
Brick 10.70.37.188:/rhs/brick2/dir2      49153   Y       4668
Brick 10.70.37.73:/rhs/brick2/dir2       49153   Y       2876
Brick 10.70.37.188:/rhs/brick3/dir3      49154   Y       4679
Brick 10.70.37.73:/rhs/brick3/dir3       49154   Y       2887
Brick 10.70.37.106:/rhs/brick1/dir1      49152   Y       4405
Brick 10.70.37.204:/rhs/brick1/dir1      49152   Y       4364
Brick 10.70.37.106:/rhs/brick2/dir2      49153   Y       4416
Brick 10.70.37.204:/rhs/brick2/dir2      49153   Y       4375
Brick 10.70.37.106:/rhs/brick3/dir3      49154   Y       4427
Brick 10.70.37.204:/rhs/brick3/dir3      49154   Y       4387
NFS Server on localhost                  2049    Y       4752
Self-heal Daemon on localhost            N/A     Y       4700
NFS Server on 10.70.37.204               2049    Y       4454
Self-heal Daemon on 10.70.37.204         N/A     Y       4407
NFS Server on 10.70.37.106               2049    Y       4497
Self-heal Daemon on 10.70.37.106         N/A     Y       4448
NFS Server on 10.70.37.73                2049    Y       2954
Self-heal Daemon on 10.70.37.73          N/A     Y       2906

There are no active volume tasks

[Wed Aug 7 09:23:02 UTC 2013 root.37.188:~ ] # gluster volume add-brick dr-vmstore 10.70.37.188:/rhs/brick4/newbrick1 10.70.37.73:/rhs/brick4/newbrick1
volume add-brick: success
[Wed Aug 7 09:24:46 UTC 2013 root.37.188:~ ] # gluster volume rebalance dr-vmstore start
volume rebalance: dr-vmstore: success: Starting rebalance on volume dr-vmstore has been successful.
ID: bfd9c472-2e89-4f88-b7e1-76bf427f2ef2
[Wed Aug 7 09:25:17 UTC 2013 root.37.188:~ ] # gluster volume rebalance dr-vmstore status
Node          Rebalanced-files    size      scanned    failures    skipped    status         run time in secs
---------     ----------------    ------    -------    --------    -------    -----------    ----------------
localhost     0                   0Bytes    1          0           0          in progress    9.00
10.70.37.73   0                   0Bytes    7          0           0          completed      0.00
10.70.37.106  0                   0Bytes    5          0           0          in progress    9.00
10.70.37.204  0                   0Bytes    7          0           0          completed      2.00
volume rebalance: dr-vmstore: success:

[Wed Aug 7 09:25:26 UTC 2013 root.37.188:~ ] # gluster volume status dr-vmstore
Status of volume: dr-vmstore
Gluster process                           Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.188:/rhs/brick1/dir1       49152   Y       4657
Brick 10.70.37.73:/rhs/brick1/dir1        49152   Y       2865
Brick 10.70.37.188:/rhs/brick2/dir2       49153   Y       4668
Brick 10.70.37.73:/rhs/brick2/dir2        49153   Y       2876
Brick 10.70.37.188:/rhs/brick3/dir3       49154   Y       4679
Brick 10.70.37.73:/rhs/brick3/dir3        49154   Y       2887
Brick 10.70.37.106:/rhs/brick1/dir1       49152   Y       4405
Brick 10.70.37.204:/rhs/brick1/dir1       49152   Y       4364
Brick 10.70.37.106:/rhs/brick2/dir2       49153   Y       4416
Brick 10.70.37.204:/rhs/brick2/dir2       49153   Y       4375
Brick 10.70.37.106:/rhs/brick3/dir3       49154   Y       4427
Brick 10.70.37.204:/rhs/brick3/dir3       49154   Y       4387
Brick 10.70.37.188:/rhs/brick4/newbrick1  49156   Y       32034
Brick 10.70.37.73:/rhs/brick4/newbrick1   49155   Y       11254
NFS Server on localhost                   2049    Y       32046
Self-heal Daemon on localhost             N/A     Y       32053
NFS Server on 10.70.37.204                2049    Y       31672
Self-heal Daemon on 10.70.37.204          N/A     Y       31679
NFS Server on 10.70.37.73                 2049    Y       11710
Self-heal Daemon on 10.70.37.73           N/A     Y       12086
NFS Server on 10.70.37.106                2049    Y       31713
Self-heal Daemon on 10.70.37.106          N/A     Y       31720

     Task                                      ID    Status
     ----                                      --    ------
Rebalance    bfd9c472-2e89-4f88-b7e1-76bf427f2ef2    1

[Wed Aug 7 09:25:46 UTC 2013 root.37.188:~ ] # gluster volume info dr-vmstore

Volume Name: dr-vmstore
Type: Distributed-Replicate
Volume ID: 6b5b534a-d714-4d6e-91b2-71423a8ff7f1
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.188:/rhs/brick1/dir1
Brick2: 10.70.37.73:/rhs/brick1/dir1
Brick3: 10.70.37.188:/rhs/brick2/dir2
Brick4: 10.70.37.73:/rhs/brick2/dir2
Brick5: 10.70.37.188:/rhs/brick3/dir3
Brick6: 10.70.37.73:/rhs/brick3/dir3
Brick7: 10.70.37.106:/rhs/brick1/dir1
Brick8: 10.70.37.204:/rhs/brick1/dir1
Brick9: 10.70.37.106:/rhs/brick2/dir2
Brick10: 10.70.37.204:/rhs/brick2/dir2
Brick11: 10.70.37.106:/rhs/brick3/dir3
Brick12: 10.70.37.204:/rhs/brick3/dir3
Brick13: 10.70.37.188:/rhs/brick4/newbrick1
Brick14: 10.70.37.73:/rhs/brick4/newbrick1
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

7. Partition information for all VMs
====================================
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                           252:0    0   20G  0 disk
├─vda1                        252:1    0  500M  0 part /boot
└─vda2                        252:2    0 19.5G  0 part
  ├─vg_appvm2-lv_root (dm-0)  253:0    0 15.6G  0 lvm  /
  └─vg_appvm2-lv_swap (dm-1)  253:1    0    4G  0 lvm  [SWAP]

8. A screenshot of the VMs showing the EXT4 error is attached.

Rebalance is still in progress; I will upload the sosreports once it completes.

Created attachment 783801 [details]
EXT4 corruption as seen in appvm1
Created attachment 783802 [details]
EXT4 corruption as seen in appvm5
Observation
===========
I see some errors in the fuse mount log (/var/log/glusterfs/mnt-dr-vmstore.log):

[2013-08-07 11:08:26.158105] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 2898608: WRITE => -1 (Bad file descriptor)
[2013-08-07 11:08:26.163813] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-dr-vmstore-client-6: remote operation failed: Bad file descriptor
[2013-08-07 11:08:26.167022] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-dr-vmstore-client-7: remote operation failed: Bad file descriptor
[2013-08-07 11:08:26.167070] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 2898613: WRITE => -1 (Bad file descriptor)
[2013-08-07 11:08:26.172347] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-dr-vmstore-client-6: remote operation failed: Bad file descriptor
[2013-08-07 11:08:26.190515] W [client-rpc-fops.c:866:client3_3_writev_cbk] 1-dr-vmstore-client-7: remote operation failed: Bad file descriptor
[2013-08-07 11:08:26.190553] W [fuse-bridge.c:2695:fuse_writev_cbk] 0-glusterfs-fuse: 2898616: WRITE => -1 (Bad file descriptor)

Additional Information
======================

1. Files in all bricks
======================

NODE 1
======
[Wed Aug 7 15:53:09 UTC 2013 root.37.188:~ ] # ls -Rl /rhs/brick*
/rhs/brick1:
total 0
drwxr-xr-x 3 qemu qemu 59 Aug 7 10:15 dir1

/rhs/brick1/dir1:
total 24002736
-rw-r--r-- 2 qemu qemu 21478375424 Aug 7 10:29 appvm4.qcow2
-rw-r--r-- 2 qemu qemu 21474836480 Aug 7 10:29 appvm5.img

/rhs/brick2:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 13:28 dir2

/rhs/brick2/dir2:
total 0

/rhs/brick3:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 08:03 dir3

/rhs/brick3/dir3:
total 0

/rhs/brick4:
total 0
drwxr-xr-x 3 qemu qemu 76 Aug 7 09:25 newbrick1

/rhs/brick4/newbrick1:
total 53080120
-rw-r--r-- 2 qemu qemu 21478375424 Aug 7 12:42 appvm2.qcow2
-rw-r--r-- 2 qemu qemu 21474836480 Aug 7 11:30 appvm6.img
-rw-r--r-- 2 qemu qemu 21474836480 Aug 7 11:40 appvm7.img

NODE 2
======
[Wed Aug 7 15:57:59 UTC 2013 root.37.73:~ ] # ls -Rlh /rhs/brick*
/rhs/brick1:
total 0
drwxr-xr-x 3 qemu qemu 59 Aug 7 10:15 dir1

/rhs/brick1/dir1:
total 23G
-rw-r--r-- 2 qemu qemu 21G Aug 7 10:29 appvm4.qcow2
-rw-r--r-- 2 qemu qemu 20G Aug 7 10:29 appvm5.img

/rhs/brick2:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 13:28 dir2

/rhs/brick2/dir2:
total 0

/rhs/brick3:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 08:03 dir3

/rhs/brick3/dir3:
total 0

/rhs/brick4:
total 0
drwxr-xr-x 3 qemu qemu 76 Aug 7 09:25 newbrick1

/rhs/brick4/newbrick1:
total 51G
-rw-r--r-- 2 qemu qemu 21G Aug 7 12:42 appvm2.qcow2
-rw-r--r-- 2 qemu qemu 20G Aug 7 11:30 appvm6.img
-rw-r--r-- 2 qemu qemu 20G Aug 7 11:40 appvm7.img

NODE 3
======
[Wed Aug 7 15:58:00 UTC 2013 root.37.106:~ ] # ls -Rlh /rhs/brick*
/rhs/brick1:
total 0
drwxr-xr-x 3 qemu qemu 61 Aug 7 12:26 dir1

/rhs/brick1/dir1:
total 24G
-rw-r--r-- 2 qemu qemu 21G Aug 7 09:25 appvm1.qcow2
-rw-r--r-- 2 qemu qemu 21G Aug 7 12:26 appvm3.qcow2

/rhs/brick2:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 08:03 dir2

/rhs/brick2/dir2:
total 0

/rhs/brick3:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 13:06 dir3

/rhs/brick3/dir3:
total 0

/rhs/brick4:
total 0

NODE 4
======
[Wed Aug 7 15:57:59 UTC 2013 root.37.204:~ ] # ls -Rlh /rhs/brick*
/rhs/brick1:
total 0
drwxr-xr-x 3 qemu qemu 61 Aug 7 12:26 dir1

/rhs/brick1/dir1:
total 24G
-rw-r--r-- 2 qemu qemu 21G Aug 7 09:25 appvm1.qcow2
-rw-r--r-- 2 qemu qemu 21G Aug 7 12:26 appvm3.qcow2

/rhs/brick2:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 08:03 dir2

/rhs/brick2/dir2:
total 0

/rhs/brick3:
total 0
drwxr-xr-x 3 qemu qemu 23 Aug 7 13:06 dir3

/rhs/brick3/dir3:
total 0

/rhs/brick4:
total 0

2. xattrs on all bricks
=======================

NODE-1
======
[Wed Aug 7 16:05:06 UTC 2013 root.37.188:~ ] # for i in 1 2 3 ;do getfattr -d -m . -e hex /rhs/brick${i}/dir${i};done
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000db6db6d8ffffffff
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000092492490b6db6db3
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick3/dir3
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000b6db6db4db6db6d7
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

[Wed Aug 7 16:08:13 UTC 2013 root.37.188:~ ] # getfattr -d -m . -e hex /rhs/brick4/newbrick1/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick4/newbrick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000006db6db6c9249248f
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

NODE-2
======
[Wed Aug 7 16:05:05 UTC 2013 root.37.73:~ ] # for i in 1 2 3 ;do getfattr -d -m . -e hex /rhs/brick${i}/dir${i};done
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000db6db6d8ffffffff
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000092492490b6db6db3
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick3/dir3
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000b6db6db4db6db6d7
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

[Wed Aug 7 16:08:13 UTC 2013 root.37.73:~ ] # getfattr -d -m . -e hex /rhs/brick4/newbrick1/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick4/newbrick1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000006db6db6c9249248f
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

NODE-3
======
[Wed Aug 7 16:05:06 UTC 2013 root.37.106:~ ] # for i in 1 2 3 ;do getfattr -d -m . -e hex /rhs/brick${i}/dir${i};done
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000492492486db6db6b
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000024924923
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick3/dir3
trusted.afr.dr-vmstore-client-10=0x000000000000000000000000
trusted.afr.dr-vmstore-client-11=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

NODE-4
======
[Wed Aug 7 16:05:05 UTC 2013 root.37.204:~ ] # for i in 1 2 3 ;do getfattr -d -m . -e hex /rhs/brick${i}/dir${i};done
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dir1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000492492486db6db6b
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick2/dir2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000024924923
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

# file: rhs/brick3/dir3
trusted.afr.dr-vmstore-client-10=0x000000000000000000000000
trusted.afr.dr-vmstore-client-11=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x6b5b534ad7144d6e91b271423a8ff7f1

3. Corrupted VMs
================
I could see FS errors in 6 VMs out of 7 (all except appvm6).

4. Rebalance status post its completion
=======================================
[Wed Aug 7 15:07:22 UTC 2013 root.37.188:~ ] # gluster volume rebalance dr-vmstore status
Node          Rebalanced-files    size      scanned    failures    skipped    status       run time in secs
---------     ----------------    ------    -------    --------    -------    ---------    ----------------
localhost     4                   80.0GB    9          0           0          completed    14585.00
10.70.37.73   0                   0Bytes    7          0           0          completed    0.00
10.70.37.106  3                   60.0GB    10         0           0          completed    13278.00
10.70.37.204  0                   0Bytes    7          0           0          completed    2.00
volume rebalance: dr-vmstore: success:

5. Since the sosreports are more than 20MB, I have copied them to
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/988262-1/

6. Rebooting a VM affected by the EXT4 FS IO errors boots it up after its filesystems are checked. A screenshot of the same is attached.

Created attachment 783989 [details]
Filesystem checking, while rebooting application VM affected with EXT4 FS IO errors while rebalance
https://code.engineering.redhat.com/gerrit/11398 fixes the issue.

Created attachment 788342 [details]
tar-ed sosreports
I tested this bug with glusterfs-3.4.0.20rhs-2 and there were no EXT4 IO errors in the application VMs. But I see 2 issues:

1. https://bugzilla.redhat.com/show_bug.cgi?id=979901 - is this still expected in Big Bend?
2. Migration took much longer than usual in my setup: about ~9.5 hours for 80GB.

Info for issue 2:

[Tue Aug 20 06:11:52 UTC 2013 root.37.216:~ ] # gluster volume status
Status of volume: dr-vmstore
Gluster process                          Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.216:/rhs/brick1/dir1      49152   Y       3007
Brick 10.70.37.98:/rhs/brick1/dir1       49152   Y       2982
Brick 10.70.37.216:/rhs/brick2/dir2      49153   Y       3018
Brick 10.70.37.98:/rhs/brick2/dir2       49153   Y       2993
Brick 10.70.37.216:/rhs/brick3/dir3      49154   Y       3029
Brick 10.70.37.98:/rhs/brick3/dir3       49154   Y       3004
Brick 10.70.37.123:/rhs/brick1/dir1      49152   Y       2924
Brick 10.70.37.164:/rhs/brick1/dir1      49152   Y       2976
Brick 10.70.37.123:/rhs/brick2/dir2      49153   Y       2935
Brick 10.70.37.164:/rhs/brick2/dir2      49153   Y       2987
Brick 10.70.37.123:/rhs/brick3/dir3      49154   Y       2946
Brick 10.70.37.164:/rhs/brick3/dir3      49154   Y       2998
Brick 10.70.37.216:/rhs/brick4/newdir1   49155   Y       3817
Brick 10.70.37.98:/rhs/brick4/newdir1    49155   Y       3771
NFS Server on localhost                  2049    Y       3829
Self-heal Daemon on localhost            N/A     Y       3845
NFS Server on 10.70.37.164               2049    Y       3961
Self-heal Daemon on 10.70.37.164         N/A     Y       3968
NFS Server on 10.70.37.123               2049    Y       3912
Self-heal Daemon on 10.70.37.123         N/A     Y       3919
NFS Server on 10.70.37.98                2049    Y       3783
Self-heal Daemon on 10.70.37.98          N/A     Y       3790

     Task                                      ID    Status
     ----                                      --    ------
Rebalance    190512d5-1a68-49ad-9db4-13162a642bd6    3

[Tue Aug 20 06:51:11 UTC 2013 root.37.216:~ ] # gluster volume info

Volume Name: dr-vmstore
Type: Distributed-Replicate
Volume ID: 28ce4337-c3a3-4f2f-9db7-b0e5c2c57959
Status: Started
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.216:/rhs/brick1/dir1
Brick2: 10.70.37.98:/rhs/brick1/dir1
Brick3: 10.70.37.216:/rhs/brick2/dir2
Brick4: 10.70.37.98:/rhs/brick2/dir2
Brick5: 10.70.37.216:/rhs/brick3/dir3
Brick6: 10.70.37.98:/rhs/brick3/dir3
Brick7: 10.70.37.123:/rhs/brick1/dir1
Brick8: 10.70.37.164:/rhs/brick1/dir1
Brick9: 10.70.37.123:/rhs/brick2/dir2
Brick10: 10.70.37.164:/rhs/brick2/dir2
Brick11: 10.70.37.123:/rhs/brick3/dir3
Brick12: 10.70.37.164:/rhs/brick3/dir3
Brick13: 10.70.37.216:/rhs/brick4/newdir1
Brick14: 10.70.37.98:/rhs/brick4/newdir1
Options Reconfigured:
storage.owner-gid: 107
storage.owner-uid: 107
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off

[Tue Aug 20 06:51:18 UTC 2013 root.37.216:~ ] # gluster volume rebalance dr-vmstore status
Node          Rebalanced-files    size      scanned    failures    skipped    status       run time in secs
---------     ----------------    ------    -------    --------    -------    ---------    ----------------
localhost     4                   80.0GB    9          0           0          completed    34151.00
10.70.37.164  0                   0Bytes    7          0           0          completed    0.00
10.70.37.123  3                   60.0GB    9          0           0          completed    23658.00
10.70.37.98   0                   0Bytes    7          0           0          completed    0.00
volume rebalance: dr-vmstore: success:

RHS Node Details
================
10.70.37.216
10.70.37.98
10.70.37.123
10.70.37.164

RHEL 6.4 Client: 10.70.36.32
gluster volume fuse mounted at: /mnt/dr-vmstore

sosreports are attached and also available at
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/988262-2/

Is a rebalance duration of ~9.5 hours for 80GB expected?
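For context, a back-of-envelope rate implied by the status output above (the slowest node moved 80GB in 34151 seconds):

# Rough effective throughput of the rebalance, from the figures above:
# 80GB = 81920MB, moved in 34151s.
echo "scale=2; 80 * 1024 / 34151" | bc    # prints ~2.39 (MB/s)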
Setting NEEDINFO to address the above query.

Marking this bug as VERIFIED, as there are no more EXT4 FS IO errors seen in application VMs while a rebalance of the glusterfs volume is in progress. The issue mentioned in comment 22 will probably be handled in a separate bug. Tested with glusterfs-3.4.0.20rhs-2.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html