Bug 1276062 - Getting IO error while VM instance is migrating from source to destination brick [NEEDINFO]
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: distribute
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
Whiteboard: dht-IO-rebalance, dht-fops-while-reba...
Keywords: ZStream
Depends On:
Blocks:
 
Reported: 2015-10-28 10:44 EDT by RajeshReddy
Modified: 2017-10-18 01:39 EDT
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Flags: tdesala: needinfo? (sasundar)


Attachments: None
Description RajeshReddy 2015-10-28 10:44:58 EDT
Description of problem:
===================
Getting IO error while VM instance is migrating from source to destination brick 

Version-Release number of selected component (if applicable):
================
glusterfs-api-3.7.1-11

How reproducible:


Steps to Reproduce:
===================
1. Create a 2x2 distributed-replicate volume and mount it on an OpenStack machine using NFS
2. Use the gluster volume for storing VM images and VM instances
3. Create a VM instance
4. Add 4 new bricks to the volume
5. Log in to the VM and start some IO. While the IO is in progress, remove the old bricks; while files are migrating from the old bricks to the new bricks, IO errors are seen on the VM. (The corresponding commands are sketched below; the actual commands used are also under Additional info.)
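
For reference, a minimal sketch of the commands behind these steps; the host names, brick paths, and mount point here are placeholders (the actual volume layout and the add-brick/remove-brick commands used are shown under Additional info below):

# Step 1: create and start a 2x2 distributed-replicate volume, then mount it
# over NFS (gluster NFS exports the volume over NFSv3) on the OpenStack node
gluster volume create glance1 replica 2 \
    host1:/rhs/brick6/glance1-1 host2:/rhs/brick6/glance1-1 \
    host3:/rhs/brick6/glance1-1 host4:/rhs/brick6/glance1-1
gluster volume start glance1
mount -t nfs -o vers=3 host1:/glance1 /mnt/glance1

# Step 4: expand the volume with four new bricks
gluster volume add-brick glance1 \
    host1:/rhs/brick7/glance1-11 host2:/rhs/brick7/glance1-11 \
    host3:/rhs/brick7/glance1-11 host4:/rhs/brick7/glance1-11

# Step 5: decommission the old bricks, which starts migrating their files
gluster volume remove-brick glance1 \
    host1:/rhs/brick6/glance1-1 host2:/rhs/brick6/glance1-1 \
    host3:/rhs/brick6/glance1-1 host4:/rhs/brick6/glance1-1 start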

Actual results:
===============
Even after the file migration completes, running ls on the VM still returns an IO error.

Expected results:
===================
There should be no IO errors during or after migration of files from the source bricks to the destination bricks.


Additional info:
=================
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol status glance1
Status of volume: glance1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client39.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49162     0          Y       31756
Brick rhs-client40.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49164     0          Y       22033
Brick rhs-client21.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49161     0          Y       22558
Brick rhs-client4.lab.eng.blr.redhat.com:/r
hs/brick6/glance1-1                         49162     0          Y       21218
NFS Server on localhost                     2049      0          Y       5288 
Self-heal Daemon on localhost               N/A       N/A        Y       5296 
NFS Server on rhs-client40.lab.eng.blr.redh
at.com                                      2049      0          Y       26647
Self-heal Daemon on rhs-client40.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       26655
NFS Server on rhs-client21.lab.eng.blr.redh
at.com                                      2049      0          Y       27001
Self-heal Daemon on rhs-client21.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       27009
NFS Server on rhs-client4.lab.eng.blr.redha
t.com                                       2049      0          Y       25681
Self-heal Daemon on rhs-client4.lab.eng.blr
.redhat.com                                 N/A       N/A        Y       25689
 
Task Status of Volume glance1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol info glance1
 
Volume Name: glance1
Type: Distributed-Replicate
Volume ID: 92491e7c-0e1b-45a3-b219-432d1877f37b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick2: rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick3: rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Options Reconfigured:
cluster.self-heal-daemon: on
performance.readdir-ahead: on


Adding new bricks


[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol add-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 
volume add-brick: success

Removing old bricks

[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol remove-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 start 
volume remove-brick start: success
ID: bff0a058-52a2-4738-b82a-cce9e3b4d6ff
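
(For completeness, a sketch of how such a decommission is usually tracked and finished; these commands were not part of the original report.)

# check migration progress on all nodes and wait for "completed"
gluster volume remove-brick glance1 \
    rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 \
    rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 status

# once every node reports "completed", finalize the removal by rerunning the
# same command with "commit" in place of "status"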

On the VM:

[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
dd: error writing ‘/data/file’: Input/output error
676+0 records in
675+0 records out
708132864 bytes (708 MB) copied, 67.9697 s, 10.4 MB/s
[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
-bash: /usr/bin/sudo: Input/output error
[centos@centos-instance-2 data]$ 
[centos@centos-instance-2 data]$ 
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$
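
When the guest starts returning these errors, the underlying block-layer failures are normally visible in the guest kernel log; a generic check inside the VM (not captured in the original report) would be:

dmesg | grep -i 'i/o error'      # block-layer errors on the virtual disk
grep ' / ' /proc/mounts          # check whether the root fs was remounted read-only (ro)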
Comment 3 RajeshReddy 2015-10-29 08:27:12 EDT
Sosreports are available on rhsqe-repo.lab.eng.blr.redhat.com at /home/repo/sosreports/bug.1276062
Comment 4 Anoop 2015-11-04 01:57:52 EST
NOTE: You run into this issue ONLY if you remove all the bricks in a replica set (which people would rarely, if ever, do consciously). Hence, this is very much an edge case and may be a candidate for documentation (if it is not documented already).
Comment 6 Nithya Balachandran 2017-08-16 05:37:46 EDT
The rebalance logs show some EIO messages returned by AFR:


[2015-10-27 09:22:39.886186] E [MSGID: 108008] [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.886260] W [MSGID: 109023] [dht-rebalance.c:1076:dht_migrate_file] 0-glance1-dht: Migrate file failed:/glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to get xattr from glance1-replicate-3 (Invalid argument)
[2015-10-27 09:22:39.886396] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.887923] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.888262] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.888288] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 on glance1-replicate-3
[2015-10-27 09:22:39.888319] E [MSGID: 101046] [afr-inode-write.c:1534:afr_fsetxattr] 0-glance1-replicate-0: setxattr dict is null
[2015-10-27 09:22:39.888533] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.889482] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.889855] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.889873] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 on glance1-replicate-3
[2015-10-27 09:22:39.891221] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.891487] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b in glance1-replicate-3 (Input/output error)


Setting a NeedInfo on Ravi to see if this is a known issue.
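
Since AFR reports split-brain here, a quick way to check whether any files on the volume are actually flagged as split-brain (assuming the cluster is still in this state) would be:

gluster volume heal glance1 info split-brain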
Comment 7 Nithya Balachandran 2017-08-16 05:39:08 EDT
It is difficult to figure out the exact failure as the client logs are not available.
Comment 8 Ravishankar N 2017-08-16 07:38:10 EDT
(In reply to Nithya Balachandran from comment #6)
> The rebalance logs show some EIO messages returned by AFR:
> 
> 
> [2015-10-27 09:22:39.886186] E [MSGID: 108008]
> [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing
> GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed.
> [Input/output error]
> Setting a NeedInfo on Ravi to see if this is a known issue.

We had some known spurious split-brain log messages, where getfattr failed with EIO spuriously; those were fixed via BZ 1411625 some time back. But here the gfid is all zeroes, which is strange. This probably needs to be retested with the latest gluster bits to see whether the issue is reproducible.
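
(To tell a genuine split-brain apart from the spurious logs mentioned above, the AFR pending-changelog xattrs can also be inspected directly on the brick backend. A sketch using one of the image paths from the rebalance log; the brick path is an assumption, and the check would need to be run on both bricks of the affected replica pair:)

getfattr -d -m . -e hex \
    /rhs/brick6/glance1-1/glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62
# non-zero trusted.afr.glance1-client-* values on both replicas indicate
# conflicting pending changes; trusted.gfid shows the file's actual gfid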
