Bug 1463907

Summary: Application VMs, with the disk images on a replica 3 volume, paused post rebalance
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, knarra, nbalacha, pkarampu, rcyriac, rhinduja, rhs-bugs, sasundar, spalai, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard: 3.3.0-devel-freeze-exception
Fixed In Version: glusterfs-3.8.4-39
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 04:59:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1465075
Bug Blocks: 1035040, 1367266, 1367285, 1411323, 1417151, 1434653, 1485863
Attachments:
fuse mount logs from the client/hypervisor
posix/brick logs from node1
fuse mount logs for debug-build
posix logs for debug-build from one of the servers

Description SATHEESARAN 2017-06-22 05:47:03 UTC
Description of problem:
-----------------------
A replica 3 volume is created across 3 RHGS machines and fuse-mounted on a RHEL 7 hypervisor. 5 VM disk images are created on the fuse mount. While OS installation is in progress in those VMs, 3 more bricks are added and a rebalance is triggered.

Once the rebalance completed, one of the VMs went into the 'paused' state.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.3.0 interim ( glusterfs-3.8.4-28.el7rhgs )

How reproducible:
-----------------
1/1 ( I haven't tried to reproduce it )

Steps to Reproduce:
-------------------
1. Create replica 3 volume, optimize the volume for virt store and start it
2. Fuse mount the volume on the RHEL 7.3 client
3. Create 5 VM images on the fuse mounted gluster volume
4. Install OS ( RHEL 7.3 ) on all those VMs
5. While the installation is in progress, add 3 more bricks and trigger rebalance

Actual results:
---------------
Post rebalance, one of the VMs went into the 'paused' state.

Expected results:
-----------------
None of the VMs should get 'paused'.

Comment 2 SATHEESARAN 2017-06-22 05:56:56 UTC
Created attachment 1290500 [details]
fuse mount logs from the client/hypervisor

Comment 3 SATHEESARAN 2017-06-22 05:58:00 UTC
Created attachment 1290502 [details]
posix/brick logs from node1

Comment 5 Pranith Kumar K 2017-06-22 06:53:20 UTC
RC:

Clue-1: the write failed with EBADF which, per the write(2) man page, means the fd is either not a valid file descriptor or is not open for writing.
So check whether the file on the brick is opened without write permission:
[root@dhcp37-189 ~]# ps aux | grep glusterfsd
root     12428  0.3  1.2 1353864 48576 ?       Ssl  Jun21   4:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick1-b1 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick1-b1.pid -S /var/run/gluster/a128baef734cbf634c6d7182dc6d44e2.socket --brick-name /gluster/brick1/b1 -l /var/log/glusterfs/bricks/gluster-brick1-b1.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49152 --xlator-option rep3vol-server.listen-port=49152
root     12909  0.3  0.9 1353996 38620 ?       Ssl  Jun21   2:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick2-b2 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick2-b2.pid -S /var/run/gluster/16c36bb8d1f7cc85afef32f4257b8ad7.socket --brick-name /gluster/brick2/b2 -l /var/log/glusterfs/bricks/gluster-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49153 --xlator-option rep3vol-server.listen-port=49153

State of fd on the server:
[root@dhcp37-189 ~]# ls -l /proc/12909/fd -l
...
lr-x------. 1 root root 64 Jun 22 10:48 21 -> /gluster/brick2/b2/.glusterfs/f6/2d/f62d019f-903a-4f0e-bcb9-6e931994df4a
...
[root@dhcp37-189 ~]# ls /proc/12909/fdinfo/21
/proc/12909/fdinfo/21
[root@dhcp37-189 ~]# cat /proc/12909/fdinfo/21
pos:	0
flags:	0100000 <<--------- file is opened without write permission
mnt_id:	121
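
As a quick sanity check of Clue-1, here is a minimal standalone C example (plain POSIX, not gluster code): a write(2) on an fd whose access mode is O_RDONLY fails with EBADF. Note that 0100000 (octal) in the fdinfo above is just the kernel's O_LARGEFILE bit on x86_64, so the access-mode bits are 0, i.e. O_RDONLY.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Open (creating it if needed) a scratch file read-only, mirroring the
         * access mode seen in /proc/<pid>/fdinfo on the brick. */
        int fd = open("/tmp/ebadf-demo", O_RDONLY | O_CREAT, 0644);
        if (fd < 0)
                return 1;

        printf("access mode is O_RDONLY: %s\n",
               (fcntl(fd, F_GETFL) & O_ACCMODE) == O_RDONLY ? "yes" : "no");

        /* Writing through a read-only fd fails exactly like the brick-side
         * writev: errno is EBADF ("Bad file descriptor"). */
        if (write(fd, "x", 1) < 0)
                printf("write failed: %s\n", strerror(errno));

        close(fd);
        return 0;
}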

Clue-2 (guess):
This write failed on all the bricks with EBADF, so the client must have sent the open without any write flags. Because it happened after rebalance, my guess was that dht either didn't send the right flags or sent a fop on an fd that was never opened on the file.
I saw the following code in dht:
In dht_rebalance_inprogress_task()
                /* flags for open are stripped down to allow following the
                 * new location of the file, otherwise we can get EEXIST or
                 * truncate the file again as rebalance is moving the data */
                ret = syncop_open (dst_node, &tmp_loc,
                                  (iter_fd->flags &
                                   ~(O_CREAT | O_EXCL | O_TRUNC)),
                                   iter_fd, NULL, NULL);
                if (ret < 0) {

So dht does send the open flags during migration, stripping only O_CREAT/O_EXCL/O_TRUNC, which preserves the access mode.
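
For clarity, a tiny illustrative check (not gluster code) that the mask above only removes the creation/truncation flags, so the access mode of the original open survives:

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
        int orig = O_RDWR | O_TRUNC;                      /* hypothetical original open flags */
        int sent = orig & ~(O_CREAT | O_EXCL | O_TRUNC);  /* flags dht opens the destination with */

        printf("orig=0%o sent=0%o access mode preserved: %s\n",
               orig, sent,
               (sent & O_ACCMODE) == (orig & O_ACCMODE) ? "yes" : "no");
        return 0;
}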

Clue-3:
When AFR has to do an open-fd heal, it sends the open with fdctx->flags, which will be zero if an open never reached that afr subvolume on this fd.

So I looked at the fd-ctx for that fd in gdb:

Breakpoint 1, client_priv_dump (this=0x7fe0b420d370) at client.c:2830
...
#5  0x00007fe0b2e8219e in meta_readv (frame=0x7fe0a4594fe0, this=0x7fe0b4283ed0, 
    fd=0x7fe0a401e990, size=131072, offset=0, flags=32768, xdata=0x0) at meta.c:74
74		META_FOP (fd->inode, readv, frame, this, fd, size, offset, flags, xdata);
(gdb) p this->itable
$1 = (inode_table_t *) 0x7fe0ac03d220
...
2855	        list_for_each_entry(tmp, &conf->saved_fds, sfd_pos) {
(gdb) 
2834	        int                     i = 0;
(gdb) p tmp->gfid
$2 = "\366-\001\237\220:O\016\274\271n\223\031\224\337J"
...
(gdb) p inode_find($1, $2)
$4 = (inode_t *) 0x7fe0a0006550
(gdb) p fd_lookup ($4, 0)
$6 = (fd_t *) 0x7fe09c01e610
(gdb) p *$6
$7 = {pid = 15340, flags = 32770, refcount = 2, inode_list = {next = 0x7fe0a00065a8, 
    prev = 0x7fe0a00065a8}, inode = 0x7fe0a0006550, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
      __align = 0}}, _ctx = 0x7fe09c01deb0, xl_count = 16, lk_ctx = 0x7fe0a40439b0, 
  anonymous = _gf_false}
(gdb) p $6->_ctx[0]
$8 = {{key = 94836635403776, xl_key = 0x5640dff70600}, {value1 = 140602666716848, 
    ptr1 = 0x7fe09c010eb0}}
(gdb) p(xlator_t*) $6->_ctx[0].xl_key
$9 = (xlator_t *) 0x5640dff70600
(gdb) p $9->name
$10 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[0].xl_key)->name
$11 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[1].xl_key)->name
$12 = 0x7fe0b41112a0 "rep3vol-replicate-1"
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[1].value1)
$13 = (afr_fd_ctx_t *) 0x7fe0ac27ccd0
(gdb) p *$13
$14 = {pre_op_done = {0x7fe0ac1287a0, 0x7fe0ac190a70, 0x7fe0ac020ef0}, inherited = {0, 0, 0}, 
  on_disk = {0, 0, 0}, opened_on = 0x7fe0ac1e9990, lock_piggyback = 0x7fe0ac08e920, 
  lock_acquired = 0x7fe0ac07fbb0, flags = 0, delay_lock = {__data = {__lock = 0, __count = 0, 
      __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, 
        __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, delay_timer = 0x0, 
  delay_frame = 0x0, witnessed_unstable_write = _gf_false, open_fd_count = 0, eager_locked = {
    next = 0x7fe0ac27cd60, prev = 0x7fe0ac27cd60}, readdir_subvol = -1}
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->name
$15 = 0x7fe0b427bcb0 "rep3vol"
(gdb) p ((xlator_t*) $6->_ctx[3].xl_key)->name
$16 = 0x7fe0b40524c0 "rep3vol-replicate-0"
(gdb) p ((xlator_t*) $6->_ctx[4].xl_key)->name
$17 = 0x7fe0b4049f40 "rep3vol-client-1"
(gdb) p ((xlator_t*) $6->_ctx[5].xl_key)->name
$18 = 0x7fe0b411f4a0 "rep3vol-client-2"
(gdb) p ((xlator_t*) $6->_ctx[6].xl_key)->name
$19 = 0x7fe0b429a130 "rep3vol-client-0"
(gdb) p ((xlator_t*) $6->_ctx[7].xl_key)->name
$20 = 0x7fe0b40a7d40 "rep3vol-shard"
(gdb) p ((xlator_t*) $6->_ctx[8].xl_key)->name
$21 = 0x7fe0b42d6cf0 "rep3vol-client-4"
(gdb) p ((xlator_t*) $6->_ctx[9].xl_key)->name
$22 = 0x7fe0b41b2700 "rep3vol-client-3"
(gdb) p ((xlator_t*) $6->_ctx[10].xl_key)->name
$23 = 0x7fe0b4047020 "rep3vol-client-5"
(gdb) p ((xlator_t*) $6->_ctx[11].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[12].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[13].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[14].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[15].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->type
$24 = 0x7fe0b40dc620 "debug/io-stats"
(gdb) p /x $2
$25 = {0xf6, 0x2d, 0x1, 0x9f, 0x90, 0x3a, 0x4f, 0xe, 0xbc, 0xb9, 0x6e, 0x93, 0x19, 0x94, 
  0xdf, 0x4a}
(gdb) 
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[3].value1)
$26 = (afr_fd_ctx_t *) 0x7fe0a426ca20
(gdb) p $26->flags
$27 = 32770 <<<----- replicate-0
(gdb) p $13->flags 
$28 = 0 <<<---- replicate-1

Based on the fdctx above:
On replicate-0 the file is opened with the correct flags (32770), whereas on replicate-1 no open was ever sent. dht has no fd-ctx on this fd, which means dht never got a chance to set "fd_ctx->opened_on_dst": dht sent the fop on an fd that was never opened on that afr subvolume, so afr performed an open-fd heal with ctx->flags all zero and the fd got opened as O_RDONLY.
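
As a side note on the two values above, they decode as follows on Linux x86_64 (a quick standalone check; the kernel's O_LARGEFILE value is spelled out because glibc typically defines O_LARGEFILE as 0 on 64-bit builds):

#include <fcntl.h>
#include <stdio.h>

#define KERNEL_O_LARGEFILE 0100000  /* kernel flag value on x86_64 */

static void decode(int flags)
{
        printf("%d (0%o): %s%s\n", flags, flags,
               (flags & O_ACCMODE) == O_RDWR   ? "O_RDWR"   :
               (flags & O_ACCMODE) == O_WRONLY ? "O_WRONLY" : "O_RDONLY",
               (flags & KERNEL_O_LARGEFILE) ? " | O_LARGEFILE" : "");
}

int main(void)
{
        decode(32770);  /* fd-ctx flags on replicate-0 */
        decode(0);      /* fd-ctx flags on replicate-1: no open ever recorded */
        return 0;
}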

Comment 9 Atin Mukherjee 2017-06-27 03:49:51 UTC
upstream patch : https://review.gluster.org/#/c/17630/

Comment 15 SATHEESARAN 2017-06-30 09:30:53 UTC
Created attachment 1293168 [details]
fuse mount logs for debug-build

Comment 16 SATHEESARAN 2017-06-30 09:31:59 UTC
Created attachment 1293169 [details]
posix logs for debug-build from one of the servers

Comment 17 Nithya Balachandran 2017-07-03 16:25:15 UTC
From the client logs:

[2017-06-29 11:02:37.071384] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-3: remote operation failed [Bad file descriptor]
[2017-06-29 11:02:37.071745] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-4: remote operation failed [Bad file descriptor]


The last writev call on replicate-1 before this error was:

[2017-06-29 11:02:37.058364] I [MSGID: 0] [dht-inode-write.c:195:dht_writev] 2-vmstore-dht: fop on fd 0x7effcc0274c0 on vmstore-replicate-1



On GDB-ing into the client process to check the fd_t struct:


(gdb) p (fd_t*)0x7effcc0274c0
$1 = (fd_t *) 0x7effcc0274c0
(gdb) p *(fd_t*)0x7effcc0274c0
$2 = {pid = 9190, flags = 32770, refcount = 1, inode_list = {next = 0x7effd8008368, prev = 0x7effd8008368}, inode = 0x7effd8008310, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}, _ctx = 0x7effcc0285e0, xl_count = 16, lk_ctx = 0x7effdc03a740, anonymous = _gf_false}


On checking the fd context, we see the same thing Pranith saw in comment #5: the fd has been opened on replicate-1 with flags = 0.


(gdb) p *(xlator_t*) $2->_ctx[3].xl_key
$24 = {name = 0x7effe02b4e30 "vmstore-replicate-0", type = 0x7effe02f99a0 "cluster/replicate", instance_name = 0x0, next = 0x7effe02a1bf0, 
  prev = 0x7effe01ca3b0, parents = 0x7effe0312b20, children = 0x7effe0037430, options = 0x7effe02b56f0, dlhandle = 0x7effe800de10, 
  fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe02fa150, 
    prev = 0x7effe02fa150}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, 
  mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, 
      std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe0423510, 
  init_succeeded = 1 '\001', private = 0x7effe0423280, mem_acct = 0x7effe0081130, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, 
  is_autoloaded = _gf_false, volfile_id = 0x0}



(gdb) p *(xlator_t*) $2->_ctx[1].xl_key
$25 = {name = 0x7effe013f2b0 "vmstore-replicate-1", type = 0x7effe013f320 "cluster/replicate", instance_name = 0x0, next = 0x7effe0474e70, 
  prev = 0x7effe0046da0, parents = 0x7effe0229ed0, children = 0x7effe040b2c0, options = 0x7effe00bd670, dlhandle = 0x7effe800de10, 
  fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe03700d0, 
    prev = 0x7effe03700d0}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, 
  mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, 
      std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe041d4f0, 
  init_succeeded = 1 '\001', private = 0x7effe00c2520, mem_acct = 0x7effe0185d80, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, 
  is_autoloaded = _gf_false, volfile_id = 0x0}



(gdb) p ((afr_fd_ctx_t*) $2->_ctx[3].value1)->flags
$26 = 32770

(gdb) p ((afr_fd_ctx_t*) $2->_ctx[1].value1)->flags
$27 = 0


There is no value in the fd_ctx for the DHT xlator, which implies that no rebalance check task (Phase1/Phase2 migration check) has been performed on this inode/fd.

Getting the gfid from the fd->inode:

(gdb) p *(inode_t*)0x7effd8008310
$30 = {table = 0x7effe04521f0, gfid = "\035\022\370\005\341\200J\222\246I\"\324~S\n\b", lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, 
        __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
      __align = 0}}, nlookup = 2, fd_count = 1, ref = 1, ia_type = IA_IFREG, fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {
    next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, 
    prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}
(gdb) p/x *(inode_t*)0x7effd8008310
$31 = {table = 0x7effe04521f0, gfid = {0x1d, 0x12, 0xf8, 0x5, 0xe1, 0x80, 0x4a, 0x92, 0xa6, 0x49, 0x22, 0xd4, 0x7e, 0x53, 0xa, 0x8}, lock = {
    spinlock = 0x0, mutex = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = {0x0 <repeats 40 times>}, __align = 0x0}}, nlookup = 0x2, fd_count = 0x1, ref = 0x1, ia_type = 0x1, 
  fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, 
    prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}



On checking the brick, the file with this gfid is vm4.img:

# file: vm4.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x1d12f805e1804a92a64922d47e530a08
trusted.glusterfs.shard.block-size=0x0000000000400000
trusted.glusterfs.shard.file-size=0x00000003c0000000000000000000000000000000000bcf700000000000000000



From the rebalance logs on dhcp37-163, vm4.img was migrated at:

[2017-06-29 11:02:33.991118] I [dht-rebalance.c:1515:dht_migrate_file] 0-vmstore-dht: /vm4.img: attempting to move from vmstore-replicate-0 to vmstore-replicate-1
 ...
[2017-06-29 11:02:36.796394] I [MSGID: 109022] [dht-rebalance.c:2137:dht_migrate_file] 0-vmstore-dht: completed migration of /vm4.img from subvolume vmstore-replicate-0 to vmstore-replicate-1


In the mnt-log:

[2017-06-29 11:02:35.959532] I [dht-layout.c:875:dht_layout_preset] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x3828a) [0x7effee98828a] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x373e2) [0x7effee9873e2] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x72ae) [0x7effee9572ae] ) 2-vmstore-dht: updating cached subvol for 1d12f805-e180-4a92-a649-22d47e530a08 to vmstore-replicate-1


At this point the cached subvol has been updated in the dht inode ctx; however, the fd was not opened on replicate-1 by the dht xlator.


The next writev fop on this fd fails on replicate-1 with EBADF.

Comment 18 Nithya Balachandran 2017-07-03 16:30:22 UTC
https://review.gluster.org/#/c/17630/ should fix this particular issue.

However, the sharding translator uses xattrops as well, and dht_xattrop/dht_fxattrop do not currently perform the phase1/phase2 checks. This is another area that needs to be fixed.
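
For readers not following the patch link, the general direction of the fix, as I understand it from the analysis above, is for dht to check before winding an fd-based fop whether the fd has already been opened on the current cached subvolume, and to open it there itself (with the original flags) if not. A rough, hypothetical sketch only; every type and helper name below is invented for illustration and is not the code from the upstream patches:

#include <fcntl.h>
#include <stddef.h>

struct demo_subvol { int id; };             /* stands in for a dht subvolume */

struct demo_fd {
        int flags;                          /* flags from the application's open */
        struct demo_subvol *opened_on;      /* subvol this fd is known to be open on */
};

/* Hypothetical helper: in the real translator this would wind an open fop to
 * the subvolume; here it only records the fact. */
static int demo_open_on_subvol(struct demo_fd *fd, struct demo_subvol *subvol, int flags)
{
        (void)flags;
        fd->opened_on = subvol;
        return 0;
}

/* Before winding an fd-based fop, make sure the fd is actually open on the
 * (possibly new) cached subvolume, reusing the original access mode and
 * stripping only the creation flags, as dht_rebalance_inprogress_task does. */
static int demo_check_and_open_fd_on_subvol(struct demo_fd *fd, struct demo_subvol *cached)
{
        if (fd->opened_on == cached)
                return 0;
        return demo_open_on_subvol(fd, cached,
                                   fd->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
}

int main(void)
{
        struct demo_subvol replicate1 = { .id = 1 };  /* new cached subvol after migration */
        struct demo_fd vm_image_fd = { .flags = O_RDWR, .opened_on = NULL };

        /* vm4.img was migrated, so the cached subvol changed; open the fd there
         * first so the replica records O_RDWR and later writes don't hit EBADF. */
        return demo_check_and_open_fd_on_subvol(&vm_image_fd, &replicate1);
}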

Comment 28 Atin Mukherjee 2017-07-10 04:44:06 UTC
We'd need the following two patches to fix this bug:

https://review.gluster.org/#/c/17630/
https://review.gluster.org/17731

Comment 34 Atin Mukherjee 2017-08-07 06:40:21 UTC
upstream patch : https://review.gluster.org/#/c/17976/

Comment 35 Atin Mukherjee 2017-08-08 12:05:34 UTC
Downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/114677


upstream release-3.12 patch:
https://review.gluster.org/17995

Comment 39 SATHEESARAN 2017-08-13 01:30:10 UTC
Tested with the RHGS 3.3.0 interim build ( glusterfs-3.8.4-39.el7rhgs ) with the following steps:
1. Created a replica 3 volume (1x3) and optimized it for the vmstore use case
2. Fuse mounted the volume
3. Created 5 VM images on the volume
4. Initiated creation of 5 VMs
5. While VM installation was in progress, added 3 more bricks to convert the 1x3 volume to 2x3
6. Started rebalance

Observation:

VMs are healthy after the rebalance completes. Repeated the test for 25 iterations with no problems.

Note: Repeated the test with the remove-brick scenario as well, and there were no issues there either.

Comment 40 Nithya Balachandran 2017-08-16 10:28:36 UTC
*** Bug 1286150 has been marked as a duplicate of this bug. ***

Comment 42 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774