Bug 1276062 - Getting IO error while VM instance is migrating from source to destination brick
Summary: Getting IO error while VM instance is migrating from source to destination brick
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: SATHEESARAN
URL:
Whiteboard: dht-IO-rebalance, dht-fops-while-reba...
Depends On:
Blocks:
 
Reported: 2015-10-28 14:44 UTC by RajeshReddy
Modified: 2018-04-16 17:59 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-16 17:59:22 UTC
Embargoed:


Attachments (Terms of Use)
Fuse mount logs logrotated (14.33 KB, text/plain) - 2018-04-03 07:13 UTC, SATHEESARAN
Fuse mount logs continued (14.33 KB, text/plain) - 2018-04-03 07:14 UTC, SATHEESARAN
sosreport from hypervisor (9.30 MB, application/x-xz) - 2018-04-03 07:21 UTC, SATHEESARAN

Description RajeshReddy 2015-10-28 14:44:58 UTC
Description of problem:
===================
Getting an IO error on the VM while its image files are being migrated from the source brick to the destination brick

Version-Release number of selected component (if applicable):
================
glusterfs-api-3.7.1-11

How reproducible:


Steps to Reproduce:
===================
1. Create a 2x2 volume and mount it on the OpenStack machine using NFS (see the command sketch after this list)
2. Use the gluster volume for storing VM images and VM instances
3. Create a VM instance
4. Add 4 new bricks to the volume
5. Log in to the VM and start some IO. While the IO is in progress, remove the old bricks; while files are migrating from the old bricks to the new bricks, an IO error is seen on the VM.
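
For reference, steps 1 and 2 correspond to commands roughly like the following. This is only a sketch: the hostnames and mount point are illustrative, the brick paths are modelled on the ones listed below, and the actual add-brick/remove-brick commands used are shown under Additional info.

# gluster volume create glance1 replica 2 \
    server1:/rhs/brick6/glance1-1 server2:/rhs/brick6/glance1-1 \
    server3:/rhs/brick6/glance1-1 server4:/rhs/brick6/glance1-1
# gluster volume start glance1
# mount -t nfs -o vers=3 server1:/glance1 /mnt/glance1   # gluster NFS exports NFSv3 only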

Actual results:
===============
Even after file migration completes, running ls inside the VM still returns an IO error.

Expected results:
===================
There should not be any IO error during or after migration of the files from the source to the destination bricks.


Additional info:
=================
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol status glance1
Status of volume: glance1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client39.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49162     0          Y       31756
Brick rhs-client40.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49164     0          Y       22033
Brick rhs-client21.lab.eng.blr.redhat.com:/
rhs/brick6/glance1-1                        49161     0          Y       22558
Brick rhs-client4.lab.eng.blr.redhat.com:/r
hs/brick6/glance1-1                         49162     0          Y       21218
NFS Server on localhost                     2049      0          Y       5288 
Self-heal Daemon on localhost               N/A       N/A        Y       5296 
NFS Server on rhs-client40.lab.eng.blr.redh
at.com                                      2049      0          Y       26647
Self-heal Daemon on rhs-client40.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       26655
NFS Server on rhs-client21.lab.eng.blr.redh
at.com                                      2049      0          Y       27001
Self-heal Daemon on rhs-client21.lab.eng.bl
r.redhat.com                                N/A       N/A        Y       27009
NFS Server on rhs-client4.lab.eng.blr.redha
t.com                                       2049      0          Y       25681
Self-heal Daemon on rhs-client4.lab.eng.blr
.redhat.com                                 N/A       N/A        Y       25689
 
Task Status of Volume glance1
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol info glance1
 
Volume Name: glance1
Type: Distributed-Replicate
Volume ID: 92491e7c-0e1b-45a3-b219-432d1877f37b
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick2: rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick3: rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Brick4: rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1
Options Reconfigured:
cluster.self-heal-daemon: on
performance.readdir-ahead: on


Adding new bricks


[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol add-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick7/glance1-11 
volume add-brick: success

Removing old bricks

[root@rhs-client39 edd8a842-5476-4dfc-911c-40060430d41d]# gluster vol remove-brick glance1 rhs-client39.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client40.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client21.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 rhs-client4.lab.eng.blr.redhat.com:/rhs/brick6/glance1-1 start 
volume remove-brick start: success
ID: bff0a058-52a2-4738-b82a-cce9e3b4d6ff
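
The data migration kicked off by "remove-brick start" is normally monitored, and the operation finalized, with the standard status/commit steps. A sketch with the same four old bricks (list abbreviated here as <old bricks>):

# gluster vol remove-brick glance1 <old bricks> status
# gluster vol remove-brick glance1 <old bricks> commit   # only after status reports completed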

IO from inside the VM

[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
dd: error writing ‘/data/file’: Input/output error
676+0 records in
675+0 records out
708132864 bytes (708 MB) copied, 67.9697 s, 10.4 MB/s
[centos@centos-instance-2 data]$ sudo dd if=/dev/urandom of=/data/file bs=1M count=1024
-bash: /usr/bin/sudo: Input/output error
[centos@centos-instance-2 data]$ 
[centos@centos-instance-2 data]$ 
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$ ls
-bash: /usr/bin/ls: Input/output error
[centos@centos-instance-2 data]$

Comment 3 RajeshReddy 2015-10-29 12:27:12 UTC
sosreports are available on rhsqe-repo.lab.eng.blr.redhat.com at /home/repo/sosreports/bug.1276062

Comment 4 Anoop 2015-11-04 06:57:52 UTC
NOTE: You run into this issue ONLY if you remove all the bricks in a replica set (which people may never do consciously). Hence, this is a very rare edge case and may be a candidate for documentation (if not documented already).

Comment 6 Nithya Balachandran 2017-08-16 09:37:46 UTC
The rebalance logs show some EIO messages returned by AFR:


[2015-10-27 09:22:39.886186] E [MSGID: 108008] [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.886260] W [MSGID: 109023] [dht-rebalance.c:1076:dht_migrate_file] 0-glance1-dht: Migrate file failed:/glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to get xattr from glance1-replicate-3 (Invalid argument)
[2015-10-27 09:22:39.886396] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.887923] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.888262] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.888288] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62 on glance1-replicate-3
[2015-10-27 09:22:39.888319] E [MSGID: 101046] [afr-inode-write.c:1534:afr_fsetxattr] 0-glance1-replicate-0: setxattr dict is null
[2015-10-27 09:22:39.888533] W [MSGID: 109023] [dht-rebalance.c:546:__dht_rebalance_create_dst_file] 0-glance1-dht: /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b: failed to set xattr on glance1-replicate-0 (Cannot allocate memory)
[2015-10-27 09:22:39.889482] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.889855] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 in glance1-replicate-3 (Input/output error)
[2015-10-27 09:22:39.889873] E [MSGID: 109023] [dht-rebalance.c:1098:dht_migrate_file] 0-glance1-dht: Migrate file failed: failed to open /glance/images/d6fb9845-fdfe-4139-83c7-7e90b3072824 on glance1-replicate-3
[2015-10-27 09:22:39.891221] E [MSGID: 108008] [afr-transaction.c:1984:afr_transaction] 0-glance1-replicate-3: Failing SETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed. [Input/output error]
[2015-10-27 09:22:39.891487] E [MSGID: 109023] [dht-rebalance.c:792:__dht_rebalance_open_src_file] 0-glance1-dht: failed to set xattr on /glance/images/45edba02-7b69-4161-ade7-047a1d5f2e9b in glance1-replicate-3 (Input/output error)


Setting a NeedInfo on Ravi to see if this is a known issue.
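
If this is retested, the split-brain state that AFR reports in the messages above can be cross-checked with the standard heal commands (a sketch, not part of the original report):

# gluster volume heal glance1 info split-brain   # entries AFR currently flags as split-brain
# gluster volume heal glance1 info               # all entries pending heal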

Comment 7 Nithya Balachandran 2017-08-16 09:39:08 UTC
It is difficult to figure out the exact failure as the client logs are not available.

Comment 8 Ravishankar N 2017-08-16 11:38:10 UTC
(In reply to Nithya Balachandran from comment #6)
> The rebalance logs show some EIO messages returned by AFR:
> 
> 
> [2015-10-27 09:22:39.886186] E [MSGID: 108008]
> [afr-read-txn.c:76:afr_read_txn_refresh_done] 0-glance1-replicate-3: Failing
> GETXATTR on gfid 00000000-0000-0000-0000-000000000000: split-brain observed.
> [Input/output error]
> Setting a NeedInfo on Ravi to see if this is a known issue.

We had some known spurious split-brain log messages, where getfattr failed with EIO spuriously, that were fixed via BZ 1411625 some time back. But here the gfid is all zeroes, which is strange. This probably needs to be retested with the latest gluster bits to see if the issue is reproducible.
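
When retesting, the on-disk state of one of the affected files can also be inspected directly on a brick to see which gfid and AFR pending xattrs are actually present. A sketch, using one of the image files named in the rebalance log and an illustrative brick path:

# getfattr -d -m . -e hex /rhs/brick6/glance1-1/glance/images/ad92693e-3c51-408e-ae5a-85ce73a9dc62

This dumps trusted.gfid and any trusted.afr.* xattrs for that file on that brick.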

Comment 12 SATHEESARAN 2018-04-03 07:13:16 UTC
Created attachment 1416622 [details]
Fuse mount logs logrotated

Fuse mount logs (logrotated ones)

Comment 13 SATHEESARAN 2018-04-03 07:14:16 UTC
Created attachment 1416623 [details]
Fuse mount logs continued

Here is the rest of the fuse mount logs. Refer to the previous attachment for the logrotated logs. Most of the errors can be seen in these fuse mount logs.

Comment 14 SATHEESARAN 2018-04-03 07:21:13 UTC
Created attachment 1416625 [details]
sosreport from hypervisor

The gluster volume is FUSE mounted on the hypervisor at /mnt/test.

