Bug 1120456 - rebalance is not resulting in the hash layout changes being available to nfs client
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-nfs
Version: 2.1
Hardware: aarch64
OS: Linux
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Raghavendra G
QA Contact: shylesh
Depends On:
Blocks: 1125824 1138393 1139997 1140338
Reported: 2014-07-17 01:55 UTC by Paul Cuzner
Modified: 2015-05-15 17:44 UTC (History)
10 users

Fixed In Version: glusterfs-
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1125824 (view as bug list)
Last Closed: 2014-09-22 19:44:30 UTC
Target Upstream Version:

Attachments
nfs log from one of the rhs nodes providing the nfs connection for the client (437.10 KB, application/gzip)
2014-07-17 01:55 UTC, Paul Cuzner
nfs log from the other rhs node providing nfs connectivity (550.98 KB, application/gzip)
2014-07-17 02:04 UTC, Paul Cuzner
rebalance log from rhs5 node (1.03 MB, text/plain)
2014-07-17 23:06 UTC, Paul Cuzner
rebalance log from rhs7 node (1000.11 KB, text/plain)
2014-07-17 23:07 UTC, Paul Cuzner
Test case reproducing the problem (1.64 KB, text/plain)
2014-08-12 01:52 UTC, Shyamsundar

System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Paul Cuzner 2014-07-17 01:55:59 UTC
Created attachment 918559 [details]
nfs log from one of the rhs nodes providing the nfs connection for the client

Description of problem:
Testing volume expansion and rebalance on a volume used by Splunk for cold data resulted in files that could not be copied or deleted. Originally I had a 4-brick dist-repl volume, and expanded this to an 8-brick configuration by:
- running add-brick
- running rebalance start

The rebalance was executed during Splunk activity (writes of cold buckets to the volume and reads across buckets during up to 36 concurrent search sessions). The rebalance completed successfully.
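For reference, the expansion steps above correspond to gluster CLI commands along these lines. The brick hostnames and paths here are illustrative placeholders, not the actual ones from this setup (the volume name splunkRepl is taken from the gdb output below):

```shell
# Expand the dist-repl volume by adding new bricks in replica pairs
# (hostnames and brick paths are placeholders)
gluster volume add-brick splunkRepl \
    rhs5:/bricks/b5 rhs6:/bricks/b6 rhs7:/bricks/b7 rhs8:/bricks/b8

# Recompute the hash layout and migrate data onto the new bricks
gluster volume rebalance splunkRepl start

# Monitor progress until the rebalance reports completed
gluster volume rebalance splunkRepl status
```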

However, two problems were identified following the rebalance:

1. a subsequent benchmark test that attempts to refresh the environment by deleting existing files failed (nfs.log attached)
2. the migration of data from one of the indexers to the RHS volume started to fail, leaving the data on local disk instead of migrating to the NFS-mounted RHS volume. Splunk continued, but this is an error.

Version-Release number of selected component (if applicable):
rhs 2.1u2, glusterfs

How reproducible:

Steps to Reproduce:
1. any attempt to delete the files listed in the nfs.log fails

[root@focil-rhs1 rawdata]# pwd
[root@focil-rhs1 rawdata]# rm slicesv2.dat 
rm: remove regular file `slicesv2.dat'? y
rm: cannot remove `slicesv2.dat': Invalid argument

Actual results:
file deletion fails.

Expected results:
file access/manipulation following rebalance should work

Additional info:
Avati has had a look at the system in the Cisco lab and verified that the hash layout, although changed on the filesystem itself, is not being used by the NFS translator's in-memory copy.

nfs.log from rhs6 node is attached. The issue with Splunk happened at 03:30 PDT, which corresponds to the hash/REMOVE failures listed in the nfs.log from 10:30 UTC.

The systems are currently available for debugging if required.

I'm marking this as Urgent - since this is critical to our work with Splunk/Cisco.

Comment 2 Paul Cuzner 2014-07-17 02:04:09 UTC
Created attachment 918560 [details]
nfs log from the other rhs node providing nfs connectivity

Comment 3 Paul Cuzner 2014-07-17 02:07:35 UTC
Also, the date to check for in the logs is 2014-07-16.

Comment 4 Anand Avati 2014-07-17 04:05:04 UTC
Here are some of my observations:

After rebalance completion, the backend of this "faulty" directory (i.e. /opt/sbk/splunk/var/lib/splunk/test_streaming/colddb/db_1412354979_1408724759_476/rawdata) seems to be OK -

Dir: /rhs1-streaming/db_1412354979_1408724759_476/rawdata

This indicates that the rebalance finished properly.

However, on inspecting the in-memory layout for that directory's inode in the NFS server, it appears that DHT has set a FILE's layout on the directory. This is an excerpt from the gdb session at a breakpoint in dht_unlink:

(gdb) p *loc
$10 = {path = 0x2044ad0 "<gfid:7907d41f-ade3-4315-8d71-551518922ec0>/db_1412354979_1408724759_476/rawdata/slicesv2.dat", name = 0x2044b21 "slicesv2.dat", 
  inode = 0x7f87e44e81cc, parent = 0x7f87e44e8130, gfid = "\n)*\206\267rB\231\356 \375\062rD", pargfid = "\235\030\304i\017\023K3\200\244\256\060\203H\026."}

(gdb) p/x loc->parent->gfid
$11 = {0x9d, 0x18, 0xc4, 0x69, 0xf, 0x13, 0x4b, 0x33, 0x80, 0xa4, 0xae, 0x30, 0x83, 0x48, 0x16, 0x2e}

(gdb) p *loc->parent
$12 = {table = 0x1650500, gfid = "\235\030\304i\017\023K3\200\244\256\060\203H\026.", lock = 1, nlookup = 2, fd_count = 0, ref = 5, ia_type = IA_IFDIR, 
  fd_list = {next = 0x7f87e44e8168, prev = 0x7f87e44e8168}, dentry_list = {next = 0x7f87e3b6b308, prev = 0x7f87e3b6b308}, hash = {next = 0x7f87e3a532f0, 
    prev = 0x7f87e4509d70}, list = {next = 0x7f87e44e7910, prev = 0x7f87e44e85dc}, _ctx = 0x1a96440}

(gdb) p *layout
$4 = {spread_cnt = 0, cnt = 1, preset = 1, gen = 0, type = 0, ref = 17028183, search_unhashed = 0, list = 0x16f66e0}

(gdb) p layout->list[0]
$5 = {err = 0, start = 0, stop = 0, xlator = 0x161f2c0}

(gdb) p layout->list[0].xlator->name
$8 = 0x161b780 "splunkRepl-replicate-1"

Note that the parent dir's layout has preset=1 and cnt=1, typical of a FILE inode's preset layout. This also correlates with the NFS logs, where pretty much every hash value is reported as not falling in range (since a FILE layout's range is 0 to 0).

Further investigation is needed to find out why the dir inode ended up with a FILE preset layout. That is very likely the root cause of the overall problem.
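To make the failure mode concrete, here is a hypothetical Python sketch of the hashed-subvolume selection that DHT performs when resolving a directory entry (names and structures are simplified for illustration and are not the actual GlusterFS code). With a healthy directory layout, every name hash lands in some subvolume's range; with the single-entry FILE preset layout seen above (start = stop = 0), practically every hash falls outside all ranges, matching the "hash not in range" errors in the NFS log:

```python
# Hypothetical, simplified model of DHT hashed-subvolume selection.
# Not the actual GlusterFS source; ranges and names are illustrative.

def subvol_for_hash(layout, name_hash):
    """Return the subvolume whose [start, stop] range covers name_hash,
    or None if no range matches (the NFS log's 'hash not in range' case)."""
    for entry in layout:
        if entry["start"] <= name_hash <= entry["stop"]:
            return entry["xlator"]
    return None

# A healthy two-subvolume directory layout splits the 32-bit hash space:
dir_layout = [
    {"xlator": "replicate-0", "start": 0x00000000, "stop": 0x7FFFFFFF},
    {"xlator": "replicate-1", "start": 0x80000000, "stop": 0xFFFFFFFF},
]

# The faulty in-memory state from the gdb session: cnt=1, start=0, stop=0.
file_preset_layout = [
    {"xlator": "splunkRepl-replicate-1", "start": 0, "stop": 0},
]

h = 0x9D18C469  # an arbitrary name hash
print(subvol_for_hash(dir_layout, h))          # -> replicate-1
print(subvol_for_hash(file_preset_layout, h))  # -> None
```

With the FILE preset layout, only a name hashing to exactly 0 would resolve, which is why essentially every REMOVE against the directory failed.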

Comment 5 Anand Avati 2014-07-17 04:13:28 UTC
Another noteworthy observation: when rm /opt/sbk/splunk/var/lib/splunk/test_streaming/colddb/db_1412354979_1408724759_476/rawdata/filename was attempted from the NFS client, there were NO breakpoint hits on dht_lookup, even on the ancestor inodes in the path. There were hits on dht_access and dht_stat on both the parent dir (the one with the FILE layout) and the file.

This could be aggravating the effect, as the layout never gets an opportunity to be refreshed (and "fixed" automatically).
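The mechanism described above can be sketched as a minimal model (assumed, simplified behavior, not the actual DHT code): LOOKUP is the operation that re-reads and heals a directory's in-memory layout, while fops like access and stat simply consume whatever layout is cached, so a client that never issues LOOKUPs keeps hitting the stale layout indefinitely:

```python
# Simplified sketch of the assumed caching behavior (not GlusterFS source):
# a directory inode whose cached layout is refreshed only on LOOKUP, while
# access/stat-style fops use the cached layout as-is.

class DirInode:
    def __init__(self):
        # Stale state left behind after rebalance: a FILE preset layout
        self.layout = "file-preset-layout"

    def lookup(self):
        """LOOKUP re-reads the layout from the bricks, healing the cache."""
        self.layout = "fresh-dir-layout"
        return self.layout

    def access(self):
        """access/stat consume the cached layout without validating it."""
        return self.layout

d = DirInode()
print(d.access())  # file-preset-layout: access alone never heals it
d.lookup()         # what a fresh mount's LOOKUP traffic would do
print(d.access())  # fresh-dir-layout: the refreshed layout is now used
```

This matches the later observation (Comment 8/9) that simply remounting the volume, which issues fresh LOOKUPs, made the rm succeed.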

Comment 6 Paul Cuzner 2014-07-17 23:06:31 UTC
Created attachment 918857 [details]
rebalance log from rhs5 node

Comment 7 Paul Cuzner 2014-07-17 23:07:08 UTC
Created attachment 918858 [details]
rebalance log from rhs7 node

Comment 8 Paul Cuzner 2014-07-17 23:37:43 UTC
To allow the testing on this platform to move forward I had to mount the gluster vol to a different mount point to keep the application happy. I tested the rm against the slices* file mentioned above to verify that everything was still the same - but the rm now worked! 

Perhaps the mount action refreshed the in-memory layout? So it looks like I unintentionally broke the 'reproducer' :(

I've left the remaining files in place attached to /opt/sbk/rhs1-streaming on the rhs1 node.

Comment 9 Anand Avati 2014-07-18 00:01:30 UTC
A new mount (client) would have performed LOOKUP operations and that would have very likely refreshed the in-mem layout. The reproducer was "stably" reproducing because LOOKUPs were not coming from the client (as mentioned in my previous comment).

Comment 10 Sayan Saha 2014-07-24 19:59:10 UTC
Proposing this as a blocker for Denali as it needs to be addressed for a key workload.

Comment 15 Shyamsundar 2014-08-12 01:52:57 UTC
Created attachment 925892 [details]
Test case reproducing the problem

I was able to reproduce this problem on the 2.1 code base with a similar internal state in DHT (as seen by Avati in comment #4).

The reproduction steps are in the attached test script; it fails for a random directory or two while listing or unlinking them.

We now need to try this with the fix for dht_access mentioned in the similar bug #1121099 and see whether the problem goes away.

We also need to test the same on 3.0, but since the problem is in a consistently reproducible state, we should be able to get to the bottom of it sooner, at least from a troubleshooting perspective.

Comment 16 Shyamsundar 2014-08-12 17:54:32 UTC
Upstream patch submitted here: http://review.gluster.org/8462

The issue on RHS 3.0 was not as severe, due to a fix in layout setting on nameless lookup in DHT, but there were still stale-layout errors when accessing some directories. These are also fixed by the changes made to the code.

Once this is reviewed and accepted upstream, it will be ported to RHS 3.0 (and maybe 2.1 as well?)

Comment 18 Atin Mukherjee 2014-09-19 12:13:30 UTC
Gluster-server version

[root@rhssvm-swift2 ~]# gluster --version
glusterfs built on Sep  3 2014 10:13:12
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

glusterfs-client version

[root@rhs-client10 10]# glusterfs --version
glusterfs built on Sep  3 2014 10:13:11
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Ran the script attached by Shyam with a small correction to testcase 11 (ref: http://review.gluster.org/#/c/8462/6/tests/bugs/bug-1125824.t) and all testcases passed.

Hence, this is verified.

Comment 20 errata-xmlrpc 2014-09-22 19:44:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

