Description of problem:
=======================
Getting I/O errors inside the VM instance while its files are being migrated from the source bricks to the destination bricks.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-api-3.7.1-11

How reproducible:

Steps to Reproduce:
===================
1. Create a 2x2 volume and mount it on the OpenStack machine using NFS (a command sketch for steps 1 and 2 is included after the console output below).
2. Use the gluster volume for storing VM images and VM instances.
3. Create a VM instance.
4. Add 4 new bricks to the volume.
5. Log in to the VM and start some I/O. While the I/O is going on, remove the old bricks; while files are migrating from the old bricks to the new bricks, an I/O error is seen on the VM.

Actual results:
===============
Even after the file migration completes, running ls on the VM returns an I/O error.

Expected results:
=================
There should not be any I/O errors during or after the migration of files from the source to the destination bricks.

Additional info:
================

[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol status glance1
Status of volume: glance1
Gluster process                                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1    49162  0    Y  31756
Brick rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1    49164  0    Y  22033
Brick rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1    49161  0    Y  22558
Brick rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1     49162  0    Y  21218
NFS Server on localhost                                            2049   0    Y  5288
Self-heal Daemon on localhost                                      N/A    N/A  Y  5296
NFS Server on rhs-client40.lab.eng.blr.redhat.com                  2049   0    Y  26647
Self-heal Daemon on rhs-client40.lab.eng.blr.redhat.com            N/A    N/A  Y  26655
NFS Server on rhs-client21.lab.eng.blr.redhat.com                  2049   0    Y  27001
Self-heal Daemon on rhs-client21.lab.eng.blr.redhat.com            N/A    N/A  Y  27009
NFS Server on rhs-client4.lab.eng.blr.redhat.com                   2049   0    Y  25681
Self-heal Daemon on rhs-client4.lab.eng.blr.redhat.com             N/A    N/A  Y  25689

Task Status of Volume glance1
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol info glance1

Volume Name: glance1
Type: Distributed-Replicate
Volume ID: 92491e7c-0e1b-45a3-b219-432d1877f37b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick2: rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick3: rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Options Reconfigured:
cluster.self-heal-daemon: on
performance.readdir-ahead: on

Adding new bricks:
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol add-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11
volume add-brick: success

Removing old bricks:
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol remove-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 start
volume remove-brick start: success
ID: bff0a058-52a2-4738-b82a-cce9e3b4d6ff

On the VM:
[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
dd: error writing ‘/data/file’: Input/output error
676+0 records in
675+0 records out
708132864 bytes (708 MB) copied, 67.9697 s, 10.4 MB/s
[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
-bash: /usr/bin/sudo: Input/output error
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$
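For reference, steps 1 and 2 of the reproduction roughly correspond to the commands below. This is only a sketch: the exact commands were not captured in this report, the hostnames and brick paths are taken from the volume info above, and the NFS mount point on the OpenStack node is an assumption.

# Step 1 (assumed): create the 2x2 distributed-replicate volume and start it
gluster volume create glance1 replica 2 \
    rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
gluster volume start glance1

# Step 2 (assumed): mount the volume over gluster-NFS (NFSv3 only) on the
# OpenStack node; the mount point /var/lib/glance/images is hypothetical
mount -t nfs -o vers=3 rhs-client39.lab.eng.blr.redhat.com:/glance1 /var/lib/glance/images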
sosreports are available on rhsqe-repo.lab.eng.blr.redhat.com in the following location /home/repo/sosreports/bug.1276062
NOTE: You get into this issue ONLY if you remove all the bricks in a replica set (which people may never do consciously). Hence, this is really an edge case and may be a candidate for documentation (if not already documented).
The rebalance logs show some EIO messages returned by AFR:

[2015-10-27 09:22:39.886186] E [MSGID: 108008] [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.886260] W [MSGID: 109023] [dht-rebalance.c:1076:dht_migrate_file] 0-glance1-dht: Migrate file failed:/glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to get xattr from glance1-replicate-3 (Invalid argument)
[2015-10-27 09:22:39.886396] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.887923] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.888262] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.888288] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 on glance1-replicate-3
[2015-10-27 09:22:39.888319] E [MSGID: 101046] [afr-inode-write.c:1534:afr_fsetxattr] 0-glance1-replicate-0: setxattr dict is null
[2015-10-27 09:22:39.888533] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.889482] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.889855] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.889873] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 on glance1-replicate-3
[2015-10-27 09:22:39.891221] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.891487] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b in glance1-replicate-3 (Input/output error)

Setting a NeedInfo on Ravi to see if this is a known issue.
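If this is an actual split-brain rather than a spurious log, it should also be reported from any server node by the heal-info command. A minimal sketch only (this was not run as part of this report; volume name taken from above):

gluster volume heal glance1 info split-brain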
It is difficult to figure out the exact failure as the client logs are not available.
(In reply to Nithya Balachandran from comment #6)
> The rebalance logs show some EIO messages returned by AFR:
>
> [2015-10-27 09:22:39.886186] E [MSGID: 108008]
> [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing
> GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed.
> [Input/output error]
>
> Setting a NeedInfo on Ravi to see if this is a known issue.

We had some known spurious split-brain logs that were fixed via BZ 1411625 some time back, where getfattr failed with EIO spuriously. But here the gfid is all zeroes, which is strange. This probably needs to be tested with the latest gluster bits to see if the issue is reproducible.
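To help distinguish a genuine split-brain from a spurious one when re-testing, the AFR changelog xattrs of an affected file can be inspected directly on the bricks of the replica set that logs the error. A sketch only; the image file is one of those named in the rebalance log, and the brick path (one of the newly added bricks, assumed to back glance1-replicate-3) is an assumption:

# run on each brick of the affected replica pair, assuming the file exists there
getfattr -d -m . -e hex /rhs/brick7/glance1-11/glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62
# non-zero trusted.afr.glance1-client-* counters blaming the other brick on
# both bricks of the pair would indicate an actual split-brain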
Created attachment 1416622 [details]
Fuse mount logs (logrotated)

Fuse mount logs (the logrotated ones).
Created attachment 1416623 [details]
Fuse mount logs (continued)

Here is the rest of the fuse mount logs; refer to the previous attachment for the logrotated ones. Most of the errors can be seen in these fuse mount logs.
Created attachment 1416625 [details]
sosreport from the hypervisor

The gluster volume is fuse mounted on the hypervisor at /mnt/test.