Bug 1463907
Summary: | Application VMs, with the disk images on replica 3 volume, paused post rebalance
---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage
Component: | distribute
Status: | CLOSED ERRATA
Severity: | high
Priority: | unspecified
Version: | rhgs-3.2
Target Milestone: | ---
Target Release: | RHGS 3.3.0
Hardware: | x86_64
OS: | Linux
Whiteboard: | 3.3.0-devel-freeze-exception
Fixed In Version: | glusterfs-3.8.4-39
Reporter: | SATHEESARAN <sasundar>
Assignee: | Nithya Balachandran <nbalacha>
QA Contact: | SATHEESARAN <sasundar>
Docs Contact: |
CC: | amukherj, knarra, nbalacha, pkarampu, rcyriac, rhinduja, rhs-bugs, sasundar, spalai, storage-qa-internal
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2017-09-21 04:59:42 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: | 1465075
Bug Blocks: | 1035040, 1367266, 1367285, 1411323, 1417151, 1434653, 1485863
Attachments: |
Description
SATHEESARAN
2017-06-22 05:47:03 UTC
Created attachment 1290500 [details]
fuse mount logs from the client/hypervisor
Created attachment 1290502 [details]
posix/brick logs from node1
RC:

Clue-1: The write failed with EBADF, which can happen if the file is closed or if the file is not open for writing, as per the manpage. So check whether the file on the brick is opened without write permission:

[root@dhcp37-189 ~]# ps aux | grep glusterfsd
root 12428 0.3 1.2 1353864 48576 ? Ssl Jun21 4:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick1-b1 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick1-b1.pid -S /var/run/gluster/a128baef734cbf634c6d7182dc6d44e2.socket --brick-name /gluster/brick1/b1 -l /var/log/glusterfs/bricks/gluster-brick1-b1.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49152 --xlator-option rep3vol-server.listen-port=49152
root 12909 0.3 0.9 1353996 38620 ? Ssl Jun21 2:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick2-b2 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick2-b2.pid -S /var/run/gluster/16c36bb8d1f7cc85afef32f4257b8ad7.socket --brick-name /gluster/brick2/b2 -l /var/log/glusterfs/bricks/gluster-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49153 --xlator-option rep3vol-server.listen-port=49153

State of the fd on the server:

[root@dhcp37-189 ~]# ls -l /proc/12909/fd -l
...
lr-x------. 1 root root 64 Jun 22 10:48 21 -> /gluster/brick2/b2/.glusterfs/f6/2d/f62d019f-903a-4f0e-bcb9-6e931994df4a
...
[root@dhcp37-189 ~]# ls /proc/12909/fdinfo/21
/proc/12909/fdinfo/21
[root@dhcp37-189 ~]# cat /proc/12909/fdinfo/21
pos:    0
flags:  0100000   <<--------- file is opened without write permission
mnt_id: 121

Clue-2/guess: The write failed on all the bricks with EBADF, so the client must have sent the open without any write flags.
Because it happened after rebalance, my guess was that either dht didn't send the right flags, or it sent a fop on an fd without any open on the file. I saw the following code in dht, in dht_rebalance_inprogress_task():

        /* flags for open are stripped down to allow following the
         * new location of the file, otherwise we can get EEXIST or
         * truncate the file again as rebalance is moving the data */
        ret = syncop_open (dst_node, &tmp_loc,
                           (iter_fd->flags & ~(O_CREAT | O_EXCL | O_TRUNC)),
                           iter_fd, NULL, NULL);
        if (ret < 0) {

So dht is sending the flags when migration happens.

Clue-3: When AFR has to do an open-fd heal, it sends the open with fdctx->flags, which will be zero if no open ever came on the fd for that afr subvolume. So looked at the fd-ctx for that fd in gdb:

Breakpoint 1, client_priv_dump (this=0x7fe0b420d370) at client.c:2830
...
#5  0x00007fe0b2e8219e in meta_readv (frame=0x7fe0a4594fe0, this=0x7fe0b4283ed0, fd=0x7fe0a401e990, size=131072, offset=0, flags=32768, xdata=0x0) at meta.c:74
74          META_FOP (fd->inode, readv, frame, this, fd, size, offset, flags, xdata);
(gdb) p this->itable
$1 = (inode_table_t *) 0x7fe0ac03d220
...
2855        list_for_each_entry(tmp, &conf->saved_fds, sfd_pos) {
(gdb)
2834        int i = 0;
(gdb) p tmp->gfid
$2 = "\366-\001\237\220:O\016\274\271n\223\031\224\337J"
...
(gdb) p inode_find($1, $2)
$4 = (inode_t *) 0x7fe0a0006550
(gdb) p fd_lookup ($4, 0)
$6 = (fd_t *) 0x7fe09c01e610
(gdb) p *$6
$7 = {pid = 15340, flags = 32770, refcount = 2, inode_list = {next = 0x7fe0a00065a8, prev = 0x7fe0a00065a8}, inode = 0x7fe0a0006550, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, _ctx = 0x7fe09c01deb0, xl_count = 16, lk_ctx = 0x7fe0a40439b0, anonymous = _gf_false}
(gdb) p $6->_ctx[0]
$8 = {{key = 94836635403776, xl_key = 0x5640dff70600}, {value1 = 140602666716848, ptr1 = 0x7fe09c010eb0}}
(gdb) p (xlator_t*) $6->_ctx[0].xl_key
$9 = (xlator_t *) 0x5640dff70600
(gdb) p $9->name
$10 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[0].xl_key)->name
$11 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[1].xl_key)->name
$12 = 0x7fe0b41112a0 "rep3vol-replicate-1"
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[1].value1)
$13 = (afr_fd_ctx_t *) 0x7fe0ac27ccd0
(gdb) p *$13
$14 = {pre_op_done = {0x7fe0ac1287a0, 0x7fe0ac190a70, 0x7fe0ac020ef0}, inherited = {0, 0, 0}, on_disk = {0, 0, 0}, opened_on = 0x7fe0ac1e9990, lock_piggyback = 0x7fe0ac08e920, lock_acquired = 0x7fe0ac07fbb0, flags = 0, delay_lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, delay_timer = 0x0, delay_frame = 0x0, witnessed_unstable_write = _gf_false, open_fd_count = 0, eager_locked = {next = 0x7fe0ac27cd60, prev = 0x7fe0ac27cd60}, readdir_subvol = -1}
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->name
$15 = 0x7fe0b427bcb0 "rep3vol"
(gdb) p ((xlator_t*) $6->_ctx[3].xl_key)->name
$16 = 0x7fe0b40524c0 "rep3vol-replicate-0"
(gdb) p ((xlator_t*) $6->_ctx[4].xl_key)->name
$17 = 0x7fe0b4049f40 "rep3vol-client-1"
(gdb) p ((xlator_t*) $6->_ctx[5].xl_key)->name
$18 = 0x7fe0b411f4a0 "rep3vol-client-2"
(gdb) p ((xlator_t*) $6->_ctx[6].xl_key)->name
$19 = 0x7fe0b429a130 "rep3vol-client-0"
(gdb) p ((xlator_t*) $6->_ctx[7].xl_key)->name
$20 = 0x7fe0b40a7d40 "rep3vol-shard"
(gdb) p ((xlator_t*) $6->_ctx[8].xl_key)->name
$21 = 0x7fe0b42d6cf0 "rep3vol-client-4"
(gdb) p ((xlator_t*) $6->_ctx[9].xl_key)->name
$22 = 0x7fe0b41b2700 "rep3vol-client-3"
(gdb) p ((xlator_t*) $6->_ctx[10].xl_key)->name
$23 = 0x7fe0b4047020 "rep3vol-client-5"
(gdb) p ((xlator_t*) $6->_ctx[11].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[12].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[13].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[14].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[15].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->type
$24 = 0x7fe0b40dc620 "debug/io-stats"
(gdb) p /x $2
$25 = {0xf6, 0x2d, 0x1, 0x9f, 0x90, 0x3a, 0x4f, 0xe, 0xbc, 0xb9, 0x6e, 0x93, 0x19, 0x94, 0xdf, 0x4a}
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[3].value1)
$26 = (afr_fd_ctx_t *) 0x7fe0a426ca20
(gdb) p $26->flags
$27 = 32770   <<<----- replicate-0
(gdb) p $13->flags
$28 = 0       <<<---- replicate-1

Based on the fdctx above, on replicate-0 the file is opened with the correct flags, whereas on replicate-1 an open was never sent. dht doesn't have any fd-ctx, which means dht never got a chance to set "fd_ctx->opened_on_dst"; because dht sent the fop on an fd that was never opened on afr, afr does an open-fd heal with ctx->flags all zero, so the fd gets opened as RDONLY.

upstream patch : https://review.gluster.org/#/c/17630/

Created attachment 1293168 [details]
fuse mount logs for debug-build
Created attachment 1293169 [details]
posix logs for debug-build from one of the servers
From the client logs:

[2017-06-29 11:02:37.071384] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-3: remote operation failed [Bad file descriptor]
[2017-06-29 11:02:37.071745] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-4: remote operation failed [Bad file descriptor]

The last writev call on replicate-1 before this error was:

[2017-06-29 11:02:37.058364] I [MSGID: 0] [dht-inode-write.c:195:dht_writev] 2-vmstore-dht: fop on fd 0x7effcc0274c0 on vmstore-replicate-1

On GDB-ing into the client process to check the fd_t struct:

(gdb) p (fd_t*)0x7effcc0274c0
$1 = (fd_t *) 0x7effcc0274c0
(gdb) p *(fd_t*)0x7effcc0274c0
$2 = {pid = 9190, flags = 32770, refcount = 1, inode_list = {next = 0x7effd8008368, prev = 0x7effd8008368}, inode = 0x7effd8008310, lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, _ctx = 0x7effcc0285e0, xl_count = 16, lk_ctx = 0x7effdc03a740, anonymous = _gf_false}

On checking the fd context, we see the same thing as Pranith did in comment#5.
The fd has been opened on replicate-1 with flags = 0:

(gdb) p *(xlator_t*) $2->_ctx[3].xl_key
$24 = {name = 0x7effe02b4e30 "vmstore-replicate-0", type = 0x7effe02f99a0 "cluster/replicate", instance_name = 0x0, next = 0x7effe02a1bf0, prev = 0x7effe01ca3b0, parents = 0x7effe0312b20, children = 0x7effe0037430, options = 0x7effe02b56f0, dlhandle = 0x7effe800de10, fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe02fa150, prev = 0x7effe02fa150}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe0423510, init_succeeded = 1 '\001', private = 0x7effe0423280, mem_acct = 0x7effe0081130, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}
(gdb) p *(xlator_t*) $2->_ctx[1].xl_key
$25 = {name = 0x7effe013f2b0 "vmstore-replicate-1", type = 0x7effe013f320 "cluster/replicate", instance_name = 0x0, next = 0x7effe0474e70, prev = 0x7effe0046da0, parents = 0x7effe0229ed0, children = 0x7effe040b2c0, options = 0x7effe00bd670, dlhandle = 0x7effe800de10, fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe03700d0, prev = 0x7effe03700d0}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe041d4f0, init_succeeded = 1 '\001', private = 0x7effe00c2520, mem_acct = 0x7effe0185d80, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, is_autoloaded = _gf_false, volfile_id = 0x0}
(gdb) p ((afr_fd_ctx_t*) $2->_ctx[3].value1)->flags
$26 = 32770
(gdb) p ((afr_fd_ctx_t*) $2->_ctx[1].value1)->flags
$27 = 0

There is no value in the fd_ctx for the DHT xlator, which implies that no rebalance check task (Phase1/Phase2 migration check) has been performed on this inode/fd.

Getting the gfid from the fd->inode:

(gdb) p *(inode_t*)0x7effd8008310
$30 = {table = 0x7effe04521f0, gfid = "\035\022\370\005\341\200J\222\246I\"\324~S\n\b", lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, nlookup = 2, fd_count = 1, ref = 1, ia_type = IA_IFREG, fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}
(gdb) p/x *(inode_t*)0x7effd8008310
$31 = {table = 0x7effe04521f0, gfid = {0x1d, 0x12, 0xf8, 0x5, 0xe1, 0x80, 0x4a, 0x92, 0xa6, 0x49, 0x22, 0xd4, 0x7e, 0x53, 0xa, 0x8}, lock = {spinlock = 0x0, mutex = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {__prev = 0x0, __next = 0x0}}, __size = {0x0 <repeats 40 times>}, __align = 0x0}}, nlookup = 0x2, fd_count = 0x1, ref = 0x1, ia_type = 0x1, fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}

On checking the brick, the file with this gfid is vm4.img:

# file: vm4.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x1d12f805e1804a92a64922d47e530a08
trusted.glusterfs.shard.block-size=0x0000000000400000
trusted.glusterfs.shard.file-size=0x00000003c0000000000000000000000000000000000bcf700000000000000000

From the rebalance logs on dhcp37-163, vm4.img was migrated at:

[2017-06-29 11:02:33.991118] I [dht-rebalance.c:1515:dht_migrate_file] 0-vmstore-dht: /vm4.img: attempting to move from vmstore-replicate-0 to vmstore-replicate-1
...
[2017-06-29 11:02:36.796394] I [MSGID: 109022] [dht-rebalance.c:2137:dht_migrate_file] 0-vmstore-dht: completed migration of /vm4.img from subvolume vmstore-replicate-0 to vmstore-replicate-1

In the mnt-log:

[2017-06-29 11:02:35.959532] I [dht-layout.c:875:dht_layout_preset] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x3828a) [0x7effee98828a] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x373e2) [0x7effee9873e2] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x72ae) [0x7effee9572ae] ) 2-vmstore-dht: updating cached subvol for 1d12f805-e180-4a92-a649-22d47e530a08 to vmstore-replicate-1

At this point the cached subvol has been updated in the dht inode ctx; however, the fd was not opened on replicate-1 by the dht xlator. The next writev fop on this fd fails on replicate-1 with EBADF. https://review.gluster.org/#/c/17630/ should fix this particular issue. However, the sharding translator uses xattrops as well, and dht_xattrop/dht_fxattrop does not currently perform the phase1/phase2 checks. This is another area which needs to be fixed.
We'd need the following two patches to fix this bug:
https://review.gluster.org/#/c/17630/
https://review.gluster.org/17731

upstream patch : https://review.gluster.org/#/c/17976/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/114677
upstream release-3.12 patch: https://review.gluster.org/17995

Tested with RHGS 3.3.0 interim build (glusterfs-3.8.4-39.elrhgs) with the following steps:
1. Created a replica 3 volume (1x3) and optimized it for the vmstore usecase
2. Fuse mounted the volume
3. Created 5 VM images on the volume
4. Initiated creation of 5 VMs
5. While VM installation was in progress, added 3 more bricks to convert the 1x3 volume to a 2x3 volume
6. Started rebalance

Observation: VMs are healthy after rebalance completes. Repeated the test for 25 iterations and there were no problems.

Note: Repeated the test with the remove-brick scenario, and there were no issues either.

*** Bug 1286150 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774