Bug 1463907

Summary: Application VMs, with the disk images on a replica 3 volume, paused post rebalance
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, knarra, nbalacha, pkarampu, rcyriac, rhinduja, rhs-bugs, sasundar, spalai, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.3.0
Hardware: x86_64
OS: Linux
Whiteboard: 3.3.0-devel-freeze-exception
Fixed In Version: glusterfs-3.8.4-39
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-21 04:59:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1465075
Bug Blocks: 1035040, 1367266, 1367285, 1411323, 1417151, 1434653, 1485863
Attachments:
fuse mount logs from the client/hypervisor
posix/brick logs from node1
fuse mount logs for debug-build
posix logs for debug-build from one of the servers

Description SATHEESARAN 2017-06-22 05:47:03 UTC
Description of problem:
-----------------------
A replica 3 volume is created across 3 RHGS machines and fuse-mounted on a RHEL 7 hypervisor. 5 VM disk images are created on the fuse mount. While OS installation is in progress in those VMs, 3 more bricks are added and a rebalance is triggered.

Once the rebalance completed, one of the VMs went into the 'paused' state.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.3.0 interim ( glusterfs-3.8.4-28.el7rhgs )

How reproducible:
-----------------
1/1 ( I haven't tried to reproduce it )

Steps to Reproduce:
-------------------
1. Create replica 3 volume, optimize the volume for virt store and start it
2. Fuse mount the volume on the RHEL 7.3 client
3. Create 5 VM images on the fuse mounted gluster volume
4. Install OS ( RHEL 7.3 ) on all those VMs
5. While the installation is in progress, add 3 more bricks and trigger rebalance

Actual results:
---------------
Post rebalance, one of the VMs went into the 'paused' state.

Expected results:
-----------------
None of the VMs should get 'paused'.

Comment 2 SATHEESARAN 2017-06-22 05:56:56 UTC
Created attachment 1290500 [details]
fuse mount logs from the client/hypervisor

Comment 3 SATHEESARAN 2017-06-22 05:58:00 UTC
Created attachment 1290502 [details]
posix/brick logs from node1

Comment 5 Pranith Kumar K 2017-06-22 06:53:20 UTC
RC:

Clue-1: the write failed with EBADF which, per the write(2) man page, means the fd is either not a valid file descriptor or is not open for writing.
So check whether the file on the brick is opened without write permission:
[root@dhcp37-189 ~]# ps aux | grep glusterfsd
root     12428  0.3  1.2 1353864 48576 ?       Ssl  Jun21   4:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick1-b1 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick1-b1.pid -S /var/run/gluster/a128baef734cbf634c6d7182dc6d44e2.socket --brick-name /gluster/brick1/b1 -l /var/log/glusterfs/bricks/gluster-brick1-b1.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49152 --xlator-option rep3vol-server.listen-port=49152
root     12909  0.3  0.9 1353996 38620 ?       Ssl  Jun21   2:25 /usr/sbin/glusterfsd -s dhcp37-189.lab.eng.blr.redhat.com --volfile-id rep3vol.dhcp37-189.lab.eng.blr.redhat.com.gluster-brick2-b2 -p /var/lib/glusterd/vols/rep3vol/run/dhcp37-189.lab.eng.blr.redhat.com-gluster-brick2-b2.pid -S /var/run/gluster/16c36bb8d1f7cc85afef32f4257b8ad7.socket --brick-name /gluster/brick2/b2 -l /var/log/glusterfs/bricks/gluster-brick2-b2.log --xlator-option *-posix.glusterd-uuid=d68ce036-52a9-458a-991a-7994b8c2a80b --brick-port 49153 --xlator-option rep3vol-server.listen-port=49153

State of fd on the server:
[root@dhcp37-189 ~]# ls -l /proc/12909/fd -l
...
lr-x------. 1 root root 64 Jun 22 10:48 21 -> /gluster/brick2/b2/.glusterfs/f6/2d/f62d019f-903a-4f0e-bcb9-6e931994df4a
...
[root@dhcp37-189 ~]# ls /proc/12909/fdinfo/21
/proc/12909/fdinfo/21
[root@dhcp37-189 ~]# cat /proc/12909/fdinfo/21
pos:	0
flags:	0100000 <<--------- file is opened without write permission
mnt_id:	121
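
As a quick sanity check of Clue-1, here is a minimal standalone C example (plain POSIX, not gluster code): a write(2) on an fd whose access mode is O_RDONLY fails with EBADF. Note that 0100000 (octal) in the fdinfo above is just the kernel's O_LARGEFILE bit on x86_64, so the access-mode bits are 0, i.e. O_RDONLY.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* Open (creating it if needed) a scratch file read-only, mirroring the
         * access mode seen in /proc/<pid>/fdinfo on the brick. */
        int fd = open("/tmp/ebadf-demo", O_RDONLY | O_CREAT, 0644);
        if (fd < 0)
                return 1;

        printf("access mode is O_RDONLY: %s\n",
               (fcntl(fd, F_GETFL) & O_ACCMODE) == O_RDONLY ? "yes" : "no");

        /* Writing through a read-only fd fails exactly like the brick-side
         * writev: errno is EBADF ("Bad file descriptor"). */
        if (write(fd, "x", 1) < 0)
                printf("write failed: %s\n", strerror(errno));

        close(fd);
        return 0;
}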

Clue-2 (guess):
This write failed on all the bricks with EBADF, so the client must have sent the open without any write flags. Because it happened after rebalance, my guess was that dht either didn't send the right flags or sent a fop on an fd that was never opened on the file.
I saw the following code in dht:
In dht_rebalance_inprogress_task()
                /* flags for open are stripped down to allow following the
                 * new location of the file, otherwise we can get EEXIST or
                 * truncate the file again as rebalance is moving the data */
                ret = syncop_open (dst_node, &tmp_loc,
                                  (iter_fd->flags &
                                   ~(O_CREAT | O_EXCL | O_TRUNC)),
                                   iter_fd, NULL, NULL);
                if (ret < 0) {

So dht does send the open flags during migration, stripping only O_CREAT/O_EXCL/O_TRUNC, which preserves the access mode.
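
For clarity, a tiny illustrative check (not gluster code) that the mask above only removes the creation/truncation flags, so the access mode of the original open survives:

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
        int orig = O_RDWR | O_TRUNC;                      /* hypothetical original open flags */
        int sent = orig & ~(O_CREAT | O_EXCL | O_TRUNC);  /* flags dht opens the destination with */

        printf("orig=0%o sent=0%o access mode preserved: %s\n",
               orig, sent,
               (sent & O_ACCMODE) == (orig & O_ACCMODE) ? "yes" : "no");
        return 0;
}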

Clue-3:
When AFR has to do an open-fd heal, it sends the open with fdctx->flags, which will be zero if an open never reached that afr subvolume on this fd.

So I looked at the fd-ctx for that fd in gdb:

Breakpoint 1, client_priv_dump (this=0x7fe0b420d370) at client.c:2830
...
#5  0x00007fe0b2e8219e in meta_readv (frame=0x7fe0a4594fe0, this=0x7fe0b4283ed0, 
    fd=0x7fe0a401e990, size=131072, offset=0, flags=32768, xdata=0x0) at meta.c:74
74		META_FOP (fd->inode, readv, frame, this, fd, size, offset, flags, xdata);
(gdb) p this->itable
$1 = (inode_table_t *) 0x7fe0ac03d220
...
2855	        list_for_each_entry(tmp, &conf->saved_fds, sfd_pos) {
(gdb) 
2834	        int                     i = 0;
(gdb) p tmp->gfid
$2 = "\366-\001\237\220:O\016\274\271n\223\031\224\337J"
...
(gdb) p inode_find($1, $2)
$4 = (inode_t *) 0x7fe0a0006550
(gdb) p fd_lookup ($4, 0)
$6 = (fd_t *) 0x7fe09c01e610
(gdb) p *$6
$7 = {pid = 15340, flags = 32770, refcount = 2, inode_list = {next = 0x7fe0a00065a8, 
    prev = 0x7fe0a00065a8}, inode = 0x7fe0a0006550, lock = {spinlock = 0, mutex = {__data = {
        __lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, 
        __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
      __align = 0}}, _ctx = 0x7fe09c01deb0, xl_count = 16, lk_ctx = 0x7fe0a40439b0, 
  anonymous = _gf_false}
(gdb) p $6->_ctx[0]
$8 = {{key = 94836635403776, xl_key = 0x5640dff70600}, {value1 = 140602666716848, 
    ptr1 = 0x7fe09c010eb0}}
(gdb) p(xlator_t*) $6->_ctx[0].xl_key
$9 = (xlator_t *) 0x5640dff70600
(gdb) p $9->name
$10 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[0].xl_key)->name
$11 = 0x5640dff711a0 "fuse"
(gdb) p ((xlator_t*) $6->_ctx[1].xl_key)->name
$12 = 0x7fe0b41112a0 "rep3vol-replicate-1"
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[1].value1)
$13 = (afr_fd_ctx_t *) 0x7fe0ac27ccd0
(gdb) p *$13
$14 = {pre_op_done = {0x7fe0ac1287a0, 0x7fe0ac190a70, 0x7fe0ac020ef0}, inherited = {0, 0, 0}, 
  on_disk = {0, 0, 0}, opened_on = 0x7fe0ac1e9990, lock_piggyback = 0x7fe0ac08e920, 
  lock_acquired = 0x7fe0ac07fbb0, flags = 0, delay_lock = {__data = {__lock = 0, __count = 0, 
      __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, 
        __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, delay_timer = 0x0, 
  delay_frame = 0x0, witnessed_unstable_write = _gf_false, open_fd_count = 0, eager_locked = {
    next = 0x7fe0ac27cd60, prev = 0x7fe0ac27cd60}, readdir_subvol = -1}
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->name
$15 = 0x7fe0b427bcb0 "rep3vol"
(gdb) p ((xlator_t*) $6->_ctx[3].xl_key)->name
$16 = 0x7fe0b40524c0 "rep3vol-replicate-0"
(gdb) p ((xlator_t*) $6->_ctx[4].xl_key)->name
$17 = 0x7fe0b4049f40 "rep3vol-client-1"
(gdb) p ((xlator_t*) $6->_ctx[5].xl_key)->name
$18 = 0x7fe0b411f4a0 "rep3vol-client-2"
(gdb) p ((xlator_t*) $6->_ctx[6].xl_key)->name
$19 = 0x7fe0b429a130 "rep3vol-client-0"
(gdb) p ((xlator_t*) $6->_ctx[7].xl_key)->name
$20 = 0x7fe0b40a7d40 "rep3vol-shard"
(gdb) p ((xlator_t*) $6->_ctx[8].xl_key)->name
$21 = 0x7fe0b42d6cf0 "rep3vol-client-4"
(gdb) p ((xlator_t*) $6->_ctx[9].xl_key)->name
$22 = 0x7fe0b41b2700 "rep3vol-client-3"
(gdb) p ((xlator_t*) $6->_ctx[10].xl_key)->name
$23 = 0x7fe0b4047020 "rep3vol-client-5"
(gdb) p ((xlator_t*) $6->_ctx[11].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[12].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[13].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[14].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[15].xl_key)->name
Cannot access memory at address 0x0
(gdb) p ((xlator_t*) $6->_ctx[2].xl_key)->type
$24 = 0x7fe0b40dc620 "debug/io-stats"
(gdb) p /x $2
$25 = {0xf6, 0x2d, 0x1, 0x9f, 0x90, 0x3a, 0x4f, 0xe, 0xbc, 0xb9, 0x6e, 0x93, 0x19, 0x94, 
  0xdf, 0x4a}
(gdb) 
(gdb) p ((afr_fd_ctx_t*) $6->_ctx[3].value1)
$26 = (afr_fd_ctx_t *) 0x7fe0a426ca20
(gdb) p $26->flags
$27 = 32770 <<<----- replicate-0
(gdb) p $13->flags 
$28 = 0 <<<---- replicate-1

Based on the fdctx above:
On replicate-0 the file is opened with the correct flags (32770), whereas on replicate-1 no open was ever sent. dht has no fd-ctx on this fd, which means dht never got a chance to set "fd_ctx->opened_on_dst": dht sent the fop on an fd that was never opened on that afr subvolume, so afr performed an open-fd heal with ctx->flags all zero and the fd got opened as O_RDONLY.
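
As a side note on the two values above, they decode as follows on Linux x86_64 (a quick standalone check; the kernel's O_LARGEFILE value is spelled out because glibc typically defines O_LARGEFILE as 0 on 64-bit builds):

#include <fcntl.h>
#include <stdio.h>

#define KERNEL_O_LARGEFILE 0100000  /* kernel flag value on x86_64 */

static void decode(int flags)
{
        printf("%d (0%o): %s%s\n", flags, flags,
               (flags & O_ACCMODE) == O_RDWR   ? "O_RDWR"   :
               (flags & O_ACCMODE) == O_WRONLY ? "O_WRONLY" : "O_RDONLY",
               (flags & KERNEL_O_LARGEFILE) ? " | O_LARGEFILE" : "");
}

int main(void)
{
        decode(32770);  /* fd-ctx flags on replicate-0 */
        decode(0);      /* fd-ctx flags on replicate-1: no open ever recorded */
        return 0;
}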

Comment 9 Atin Mukherjee 2017-06-27 03:49:51 UTC
upstream patch : https://review.gluster.org/#/c/17630/

Comment 15 SATHEESARAN 2017-06-30 09:30:53 UTC
Created attachment 1293168 [details]
fuse mount logs for debug-build

Comment 16 SATHEESARAN 2017-06-30 09:31:59 UTC
Created attachment 1293169 [details]
posix logs for debug-build from one of the servers

Comment 17 Nithya Balachandran 2017-07-03 16:25:15 UTC
From the client logs:

[2017-06-29 11:02:37.071384] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-3: remote operation failed [Bad file descriptor]
[2017-06-29 11:02:37.071745] W [MSGID: 114031] [client-rpc-fops.c:855:client3_3_writev_cbk] 2-vmstore-client-4: remote operation failed [Bad file descriptor]


The last writev call on replicate-1 before this error was:

[2017-06-29 11:02:37.058364] I [MSGID: 0] [dht-inode-write.c:195:dht_writev] 2-vmstore-dht: fop on fd 0x7effcc0274c0 on vmstore-replicate-1



On GDB-ing into the client process to check the fd_t struct:


(gdb) p (fd_t*)0x7effcc0274c0
$1 = (fd_t *) 0x7effcc0274c0
(gdb) p *(fd_t*)0x7effcc0274c0
$2 = {pid = 9190, flags = 32770, refcount = 1, inode_list = {next = 0x7effd8008368, prev = 0x7effd8008368}, inode = 0x7effd8008310, lock = {spinlock = 0, 
    mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = '\000' <repeats 39 times>, __align = 0}}, _ctx = 0x7effcc0285e0, xl_count = 16, lk_ctx = 0x7effdc03a740, anonymous = _gf_false}


On checking the fd context, we see the same thing Pranith saw in comment #5: the fd has been opened on replicate-1 with flags = 0.


(gdb) p *(xlator_t*) $2->_ctx[3].xl_key
$24 = {name = 0x7effe02b4e30 "vmstore-replicate-0", type = 0x7effe02f99a0 "cluster/replicate", instance_name = 0x0, next = 0x7effe02a1bf0, 
  prev = 0x7effe01ca3b0, parents = 0x7effe0312b20, children = 0x7effe0037430, options = 0x7effe02b56f0, dlhandle = 0x7effe800de10, 
  fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe02fa150, 
    prev = 0x7effe02fa150}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, 
  mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, 
      std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe0423510, 
  init_succeeded = 1 '\001', private = 0x7effe0423280, mem_acct = 0x7effe0081130, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, 
  is_autoloaded = _gf_false, volfile_id = 0x0}



(gdb) p *(xlator_t*) $2->_ctx[1].xl_key
$25 = {name = 0x7effe013f2b0 "vmstore-replicate-1", type = 0x7effe013f320 "cluster/replicate", instance_name = 0x0, next = 0x7effe0474e70, 
  prev = 0x7effe0046da0, parents = 0x7effe0229ed0, children = 0x7effe040b2c0, options = 0x7effe00bd670, dlhandle = 0x7effe800de10, 
  fops = 0x7effeee4b1a0 <fops>, cbks = 0x7effeee4b0e0 <cbks>, dumpops = 0x7effeee4b140 <dumpops>, volume_options = {next = 0x7effe03700d0, 
    prev = 0x7effe03700d0}, fini = 0x7effeec38120 <fini>, init = 0x7effeec37470 <init>, reconfigure = 0x7effeec36da0 <reconfigure>, 
  mem_acct_init = 0x7effeec36cd0 <mem_acct_init>, notify = 0x7effeec36c70 <notify>, loglevel = GF_LOG_NONE, latencies = {{min = 0, max = 0, total = 0, 
      std = 0, mean = 0, count = 0} <repeats 55 times>}, history = 0x0, ctx = 0x557a5814c010, graph = 0x7effe0086e50, itable = 0x7effe041d4f0, 
  init_succeeded = 1 '\001', private = 0x7effe00c2520, mem_acct = 0x7effe0185d80, winds = 0, switched = 0 '\000', local_pool = 0x7efffca5aa58 <pools+280>, 
  is_autoloaded = _gf_false, volfile_id = 0x0}



(gdb) p ((afr_fd_ctx_t*) $2->_ctx[3].value1)->flags
$26 = 32770

(gdb) p ((afr_fd_ctx_t*) $2->_ctx[1].value1)->flags
$27 = 0


There is no value in the fd_ctx for the DHT xlator, which implies that no rebalance check task (Phase1/Phase2 migration check) has been performed on this inode/fd.

Getting the gfid from the fd->inode:

(gdb) p *(inode_t*)0x7effd8008310
$30 = {table = 0x7effe04521f0, gfid = "\035\022\370\005\341\200J\222\246I\"\324~S\n\b", lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, 
        __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, 
      __align = 0}}, nlookup = 2, fd_count = 1, ref = 1, ia_type = IA_IFREG, fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {
    next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, 
    prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}
(gdb) p/x *(inode_t*)0x7effd8008310
$31 = {table = 0x7effe04521f0, gfid = {0x1d, 0x12, 0xf8, 0x5, 0xe1, 0x80, 0x4a, 0x92, 0xa6, 0x49, 0x22, 0xd4, 0x7e, 0x53, 0xa, 0x8}, lock = {
    spinlock = 0x0, mutex = {__data = {__lock = 0x0, __count = 0x0, __owner = 0x0, __nusers = 0x0, __kind = 0x0, __spins = 0x0, __elision = 0x0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = {0x0 <repeats 40 times>}, __align = 0x0}}, nlookup = 0x2, fd_count = 0x1, ref = 0x1, ia_type = 0x1, 
  fd_list = {next = 0x7effcc0274d0, prev = 0x7effcc0274d0}, dentry_list = {next = 0x7effe033e6a0, prev = 0x7effe033e6a0}, hash = {next = 0x7effe072f000, 
    prev = 0x7effe072f000}, list = {next = 0x7effd8000c58, prev = 0x7effe000cab8}, _ctx = 0x7effd8011bd0}



On checking the brick, the file with this gfid is vm4.img:

# file: vm4.img
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x1d12f805e1804a92a64922d47e530a08
trusted.glusterfs.shard.block-size=0x0000000000400000
trusted.glusterfs.shard.file-size=0x00000003c0000000000000000000000000000000000bcf700000000000000000



From the rebalance logs on dhcp37-163, vm4.img was migrated at:

[2017-06-29 11:02:33.991118] I [dht-rebalance.c:1515:dht_migrate_file] 0-vmstore-dht: /vm4.img: attempting to move from vmstore-replicate-0 to vmstore-replicate-1
 ...
[2017-06-29 11:02:36.796394] I [MSGID: 109022] [dht-rebalance.c:2137:dht_migrate_file] 0-vmstore-dht: completed migration of /vm4.img from subvolume vmstore-replicate-0 to vmstore-replicate-1


In the mnt-log:

[2017-06-29 11:02:35.959532] I [dht-layout.c:875:dht_layout_preset] (-->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x3828a) [0x7effee98828a] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x373e2) [0x7effee9873e2] -->/usr/lib64/glusterfs/3.8.4/xlator/cluster/distribute.so(+0x72ae) [0x7effee9572ae] ) 2-vmstore-dht: updating cached subvol for 1d12f805-e180-4a92-a649-22d47e530a08 to vmstore-replicate-1


At this point the cached subvol has been updated in the dht inode ctx; however, the fd was not opened on replicate-1 by the dht xlator.


The next writev fop on this fd fails on replicate-1 with EBADF.

Comment 18 Nithya Balachandran 2017-07-03 16:30:22 UTC
https://review.gluster.org/#/c/17630/ should fix this particular issue.

However, the sharding translator uses xattrops as well, and dht_xattrop/dht_fxattrop do not currently perform the phase1/phase2 checks. This is another area that needs to be fixed.
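
For readers not following the patch link, the general direction of the fix, as I understand it from the analysis above, is for dht to check before winding an fd-based fop whether the fd has already been opened on the current cached subvolume, and to open it there itself (with the original flags) if not. A rough, hypothetical sketch only; every type and helper name below is invented for illustration and is not the code from the upstream patches:

#include <fcntl.h>
#include <stddef.h>

struct demo_subvol { int id; };             /* stands in for a dht subvolume */

struct demo_fd {
        int flags;                          /* flags from the application's open */
        struct demo_subvol *opened_on;      /* subvol this fd is known to be open on */
};

/* Hypothetical helper: in the real translator this would wind an open fop to
 * the subvolume; here it only records the fact. */
static int demo_open_on_subvol(struct demo_fd *fd, struct demo_subvol *subvol, int flags)
{
        (void)flags;
        fd->opened_on = subvol;
        return 0;
}

/* Before winding an fd-based fop, make sure the fd is actually open on the
 * (possibly new) cached subvolume, reusing the original access mode and
 * stripping only the creation flags, as dht_rebalance_inprogress_task does. */
static int demo_check_and_open_fd_on_subvol(struct demo_fd *fd, struct demo_subvol *cached)
{
        if (fd->opened_on == cached)
                return 0;
        return demo_open_on_subvol(fd, cached,
                                   fd->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
}

int main(void)
{
        struct demo_subvol replicate1 = { .id = 1 };  /* new cached subvol after migration */
        struct demo_fd vm_image_fd = { .flags = O_RDWR, .opened_on = NULL };

        /* vm4.img was migrated, so the cached subvol changed; open the fd there
         * first so the replica records O_RDWR and later writes don't hit EBADF. */
        return demo_check_and_open_fd_on_subvol(&vm_image_fd, &replicate1);
}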

Comment 28 Atin Mukherjee 2017-07-10 04:44:06 UTC
We'd need the following two patches to fix this bug:

https://review.gluster.org/#/c/17630/
https://review.gluster.org/17731

Comment 34 Atin Mukherjee 2017-08-07 06:40:21 UTC
upstream patch : https://review.gluster.org/#/c/17976/

Comment 35 Atin Mukherjee 2017-08-08 12:05:34 UTC
Downstream patch:
https://code.engineering.redhat.com/gerrit/#/c/114677


upstream release-3.12 patch:
https://review.gluster.org/17995

Comment 39 SATHEESARAN 2017-08-13 01:30:10 UTC
Tested with the RHGS 3.3.0 interim build ( glusterfs-3.8.4-39.el7rhgs ) with the following steps:
1. Created a replica 3 volume (1x3) and optimized it for the vmstore use case
2. Fuse mounted the volume
3. Created 5 VM images on the volume
4. Initiated creation of 5 VMs
5. While VM installation was in progress, added 3 more bricks to convert the 1x3 volume to 2x3
6. Started rebalance

Observation:

VMs are healthy after the rebalance completes. Repeated the test for 25 iterations with no problems.

Note: Repeated the test with the remove-brick scenario as well, and there were no issues there either.

Comment 40 Nithya Balachandran 2017-08-16 10:28:36 UTC
*** Bug 1286150 has been marked as a duplicate of this bug. ***

Comment 42 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774