Bug 980081 - DHT: Kernel untar fails on the mount point after add-brick
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Assigned To: shishir gowda
QA Contact: shylesh
Keywords: Reopened, TestBlocker
Depends On:
Blocks:
 
Reported: 2013-07-01 07:37 EDT by shylesh
Modified: 2014-11-12 07:31 EST
CC: 7 users

See Also:
Fixed In Version: glusterfs-3.4.0.15rhs-1
Doc Type: Bug Fix
Doc Text:
Cause: Continuous add-brick while I/O is in progress on the mount point might lead to applications seeing failures. Consequence: Applications might see I/O errors. Fix: Running the rebalance command after each add-brick should help. Result:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-12 07:31:12 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
sosreport from machine "shakthiman" (2.91 MB, application/x-xz)
2013-07-31 08:09 EDT, Vijaykumar Koppad
sosreport from machine "snow" (3.35 MB, application/x-xz)
2013-07-31 08:10 EDT, Vijaykumar Koppad
sosreport from machine "spartacus" (2.37 MB, application/x-xz)
2013-07-31 08:10 EDT, Vijaykumar Koppad
sosreport from machine "stark" (3.91 MB, application/x-xz)
2013-07-31 08:10 EDT, Vijaykumar Koppad

Description shylesh 2013-07-01 07:37:19 EDT
Description of problem:
Adding bricks while a kernel untar is in progress on the mount point makes the untar fail.

Version-Release number of selected component (if applicable):
[root@beta2 ~]# rpm -qa | grep gluster
gluster-swift-container-1.4.8-4.el6.noarch
vdsm-gluster-4.10.2-22.5.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.4.0.12rhs.beta1-1.el6rhs.x86_64
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.4.0.12rhs.beta1-1.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
glusterfs-server-3.4.0.12rhs.beta1-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.12rhs.beta1-1.el6rhs.x86_64
gluster-swift-object-1.4.8-4.el6.noarch
glusterfs-fuse-3.4.0.12rhs.beta1-1.el6rhs.x86_64


How reproducible:


Steps to Reproduce:
1. Created a 2-brick distributed volume.
2. Started a kernel untar on the mount point.
3. Kept adding bricks while the untar was running (see the sketch below).
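
A minimal shell sketch of these steps, reusing the host names and brick paths shown elsewhere in this report (the tarball location is a placeholder, not taken from the original report):

# On a server node: create and start a 2-brick distributed volume
gluster volume create test 10.70.35.62:/brick1/t1 10.70.35.64:/brick1/t2
gluster volume start test

# On the client: mount the volume and start a kernel untar
mount -t glusterfs 10.70.35.62:/test /testmount
cd /testmount && tar -xJf /root/linux-2.6.32.61.tar.xz &

# Back on the server: keep adding bricks while the untar runs,
# with no rebalance in between
gluster volume add-brick test 10.70.35.62:/brick1/t3
gluster volume add-brick test 10.70.35.62:/brick1/t4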

Actual results:
The kernel untar fails with "Bad file descriptor" read errors, and tar exits with "Unexpected EOF in archive".

Expected results:
The kernel untar should complete successfully while bricks are being added.


Additional info:

RHS nodes
==========
10.70.35.62
10.70.35.64

mount point
===========
10.70.35.203

add-brick command executed from 10.70.35.62

mount path
==========
/testmount



[root@beta1 ~]# gluster v info
 
Volume Name: test
Type: Distribute
Volume ID: fb2b8a8c-f3a1-4787-9a36-633ee608f7b8
Status: Started
Number of Bricks: 20
Transport-type: tcp
Bricks:
Brick1: 10.70.35.62:/brick1/t1
Brick2: 10.70.35.64:/brick1/t2
Brick3: 10.70.35.62:/brick1/t3
Brick4: 10.70.35.62:/brick1/t4
Brick5: 10.70.35.62:/brick1/t5
Brick6: 10.70.35.64:/brick1/t6
Brick7: 10.70.35.62:/brick1/t7
Brick8: 10.70.35.64:/brick1/t8
Brick9: 10.70.35.62:/brick1/t9
Brick10: 10.70.35.62:/brick2/t10
Brick11: 10.70.35.64:/brick2/t10
Brick12: 10.70.35.62:/brick2/t11
Brick13: 10.70.35.64:/brick2/t12
Brick14: 10.70.35.62:/brick2/t13
Brick15: 10.70.35.64:/brick2/t14
Brick16: 10.70.35.62:/brick2/t15
Brick17: 10.70.35.64:/brick2/t16
Brick18: 10.70.35.62:/brick2/t17
Brick19: 10.70.35.64:/brick2/t18
Brick20: 10.70.35.62:/brick2/t19



tar failures
============

linux-2.6.32.61/arch/arm/mach-realview/include/mach/debug-macro.S
xz: (stdin): Read error: Bad file descriptor
linux-2.6.32.61/arch/arm/mach-realview/include/mach/entry-macro.S
linux-2.6.32.61/arch/arm/mach-realview/include/mach/gpio.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/hardware.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/io.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs-eb.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs-pb1176.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs-pb11mp.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs-pba8.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs-pbx.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/irqs.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/memory.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/platform.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/smp.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/system.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/timex.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/uncompress.h
linux-2.6.32.61/arch/arm/mach-realview/include/mach/vmalloc.h
linux-2.6.32.61/arch/arm/mach-realview/localtimer.c
linux-2.6.32.61/arch/arm/mach-realview/platsmp.c
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now




mnt log
===========

[2013-07-01 10:09:39.374220] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 3-test-client-16: remote operation failed: No such file or directory. Path: /linux-2.6.32.61/arch/arm/mach-realview/include/mach
[2013-07-01 10:09:39.374336] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 3-test-client-18: remote operation failed: No such file or directory. Path: /linux-2.6.32.61/arch/arm/mach-realview/include/mach
[2013-07-01 10:09:39.474168] W [fuse-resolve.c:530:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x9e280c inode-gfid:599f6e3c-d3e9-43e0-a21e-b22545636591) did not complete, failing fop with EBADF (old-subvolume:test-2 new-subvolume:test-3)
[2013-07-01 10:09:39.474295] W [fuse-bridge.c:2572:fuse_readv_resume] 0-glusterfs-fuse: 51840: READ() inode migration of (null) failed (Bad file descriptor)
[2013-07-01 10:09:39.474351] W [fuse-resolve.c:530:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x9e280c inode-gfid:599f6e3c-d3e9-43e0-a21e-b22545636591) did not complete, failing fop with EBADF (old-subvolume:test-2 new-subvolume:test-3)
[2013-07-01 10:09:39.474368] W [fuse-bridge.c:2572:fuse_readv_resume] 0-glusterfs-fuse: 51841: READ() inode migration of (null) failed (Bad file descriptor)
[2013-07-01 10:09:39.478344] W [fuse-resolve.c:530:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x9e280c inode-gfid:599f6e3c-d3e9-43e0-a21e-b22545636591) did not complete, failing fop with EBADF (old-subvolume:test-2 new-subvolume:test-3)
[2013-07-01 10:09:39.478376] W [fuse-bridge.c:2723:fuse_flush_resume] 0-glusterfs-fuse: 51844: FLUSH() inode migration of (null) failed (Bad file descriptor)
[2013-07-01 10:09:39.479667] W [defaults.c:1291:default_release] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/mount/fuse.so(+0x21c88) [0x7fd7f21f7c88] (-->/usr/lib64/glusterfs/3.4.0.12rhs.beta1/xlator/mount/fuse.so(+0xb1ca) [0x7fd7f21e11ca] (-->/usr/lib64/libglusterfs.so.0(fd_unref+0x13b) [0x7fd7f3e1725b]))) 0-fuse: xlator does not implement release_cbk
[2013-07-01 10:09:39.721121] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 3-test-client-16: remote operation failed: No such file or directory



Attaching the sosreport.
Comment 2 shylesh 2013-07-01 07:41:39 EDT
dht layout
==========

10.70.35.62
===============
# file: brick1/t1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000555555547ffffffd
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t3
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8d5555551
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t4
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000d5555552ffffffff
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t5
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000002aaaaaa9
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t7
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t9
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t10
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t11
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t13
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t15
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t17
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t19
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8




10.70.35.64
============
[root@beta2 ~]# less out
# file: brick1/t2
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffeaaaaaaa7
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t6
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002aaaaaaa55555553
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick1/t8
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t10
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t12
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t14
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t16
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8

# file: brick2/t18
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0xfb2b8a8cf3a147879a36633ee608f7b8
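
The layouts above can be dumped with getfattr; a sketch, with my reading of the xattr encoding in the comments (the field breakdown is an interpretation, not something stated in this report):

# Dump the DHT layout xattrs of a brick directory in hex
getfattr -d -m . -e hex /brick1/t1

# trusted.glusterfs.dht=0x0000000100000000555555547ffffffd appears to be
# four 32-bit big-endian fields; the last two are the hash range assigned
# to this brick: start 0x55555554, end 0x7ffffffd. Bricks listed above
# with no trusted.glusterfs.dht xattr have no layout range yet, which is
# what you get when add-brick is not followed by a rebalance/fix-layout.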
Comment 4 shishir gowda 2013-07-05 02:39:01 EDT
Looks like a graph-switch fd migration issue.

Additionally, a 2-brick dht volume was expanded to a 20-brick dht volume through continuous add-bricks while the kernel untar was in progress.

No rebalance was done in between any of the add-bricks.

I think we should lower the priority.

[2013-07-01 10:09:39.474168] W [fuse-resolve.c:530:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x9e280c inode-gfid:599f6e3c-d3e9-43e0-a21e-b22545636591) did not complete, failing fop with EBADF (old-subvolume:test-2 new-subvolume:test-3)
[2013-07-01 10:09:39.474295] W [fuse-bridge.c:2572:fuse_readv_resume] 0-glusterfs-fuse: 51840: READ() inode migration of (null) failed (Bad file descriptor)
[2013-07-01 10:09:39.474351] W [fuse-resolve.c:530:fuse_resolve_fd] 0-fuse-resolve: migration of basefd (ptr:0x9e280c inode-gfid:599f6e3c-d3e9-43e0-a21e-b22545636591) did not complete, failing fop with EBADF (old-subvolume:test-2 new-subvolume:test-3)
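
To spot this failure mode in a client log, grepping for the basefd-migration warnings quoted above works; a sketch (the log file name is a typical FUSE mount log path, an assumption, not taken from this report):

# Count failed basefd migrations after graph switches
grep -cE 'migration of basefd.*EBADF' /var/log/glusterfs/testmount.log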
Comment 5 Vijaykumar Koppad 2013-07-24 05:20:46 EDT
I was able to hit this while doing add-brick and file creation in parallel.

These were the logs from the client log file:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

[2013-07-24 08:59:03.952391] W [fuse-resolve.c:145:fuse_resolve_gfid_cbk] 0-fuse: eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4: failed to resolve (Resource temporarily unavailable)
[2013-07-24 08:59:03.952399] E [fuse-bridge.c:1098:fuse_getattr_resume] 0-glusterfs-fuse: 26637142: GETATTR 140051937293260 (eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4) resolution failed
[2013-07-24 08:59:03.953410] I [dht-layout.c:636:dht_layout_normalize] 2-master-dht: found anomalies in <gfid:eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4>. holes=1 overlaps=0 missing=1 down=0 misc=0
[2013-07-24 08:59:03.953438] W [dht-common.c:213:dht_discover_complete] 2-master-dht: normalizing failed on <gfid:eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4> (overlaps/holes present)
[2013-07-24 08:59:03.953462] W [fuse-resolve.c:145:fuse_resolve_gfid_cbk] 0-fuse: eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4: failed to resolve (Resource temporarily unavailable)
[2013-07-24 08:59:03.953475] E [fuse-bridge.c:1098:fuse_getattr_resume] 0-glusterfs-fuse: 26637143: GETATTR 140051937293260 (eccc06e4-f2e1-493a-8ced-de2aa8d8c9c4) resolution failed
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
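
A sketch of that parallel workload, with an illustrative file count and brick path (both assumptions, not taken from this report):

# On the client: keep creating files on the mount point
for i in $(seq 1 10000); do
    dd if=/dev/zero of=/testmount/file.$i bs=64k count=1 2>/dev/null
done &

# On a server node, in parallel: add a brick to the volume
gluster volume add-brick test 10.70.35.64:/brick2/t20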
Comment 7 Amar Tumballi 2013-07-24 08:29:47 EDT
Taking the 'blocker' flag off this bug, as we decided we don't approve of doing 'add-brick' in a loop. Also marking as 'medium' priority.
Comment 9 Vijaykumar Koppad 2013-07-24 10:44:01 EDT
Amar, 

    I was able to hit it while doing add-brick and creating files in parallel. I wasn't doing add-brick in a loop.
Comment 10 shishir gowda 2013-07-25 08:24:54 EDT
Please provide the sos-reports
Comment 11 shylesh 2013-07-25 08:58:02 EDT
(In reply to shishir gowda from comment #10)
> Please provide the sos-reports

Please check the URL provided in comment 1.
Comment 12 shishir gowda 2013-07-29 08:13:18 EDT
I had already analyzed the sos-reports provided in comment 1 and updated the bug accordingly. The request for sos-reports was directed at Vijaykumar, since he said he was able to hit the bug in the earlier scenario, which is why the blocker flag was raised.
Please provide the sos-reports.
Comment 13 Amar Tumballi 2013-07-30 07:56:09 EDT
It has been argued that we don't support 'add-brick' in a loop without running 'rebalance' in between. Hence removing the 'blocker' flag; please re-open if we hit any issue with add-brick (technically, it makes sense to close this bug as NOTABUG and open another issue if comment #9 is still valid).
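
For reference, a sketch of the supported expansion workflow this implies, rebalancing after each add-brick instead of adding bricks back to back (the status-polling loop is illustrative, not taken from this report):

# Add one brick, then rebalance before adding the next
gluster volume add-brick test 10.70.35.62:/brick1/t3
gluster volume rebalance test start

# Wait for the rebalance to finish before the next add-brick
until gluster volume rebalance test status | grep -q completed; do
    sleep 10
done
gluster volume add-brick test 10.70.35.62:/brick1/t4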
Comment 15 Vijaykumar Koppad 2013-07-31 08:09:06 EDT
Created attachment 781071 [details]
sosreport from machine "shakthiman"
Comment 16 Vijaykumar Koppad 2013-07-31 08:10:12 EDT
Created attachment 781072 [details]
sosreport from machine "snow"
Comment 17 Vijaykumar Koppad 2013-07-31 08:10:21 EDT
Created attachment 781073 [details]
sosreport from machine "spartacus"
Comment 18 Vijaykumar Koppad 2013-07-31 08:10:49 EDT
Created attachment 781075 [details]
sosreport from machine "stark"
Comment 19 Amar Tumballi 2013-08-05 14:58:20 EDT
Koppad, I recommend closing this original bug and opening a new issue (as the steps described in the 'description' of this bug are no longer valid).

Anyway, since we fixed an issue with add-brick on distribute volumes, marking this as ON_QA.
Comment 20 shylesh 2013-08-07 05:02:03 EDT
Verified on 3.4.0.17rhs-1.el6rhs.x86_64
Comment 21 Scott Haines 2013-09-23 18:29:50 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
