Created attachment 918559 [details]
nfs log from one of the rhs nodes providing the nfs connection for the client

Description of problem:
Testing volume expansion and rebalance on a volume used by Splunk for cold data resulted in files being unable to be copied/deleted.

Originally I had a 4-brick dist-repl volume, and expanded this to an 8-brick configuration by:
- running add-brick
- running rebalance start

The rebalance was executed during Splunk activity (writes of cold buckets to the volume, and reads across buckets during up to 36 concurrent search sessions). The rebalance completed successfully. However, two problems have been identified following the rebalance:

1. a subsequent benchmark test that attempts to refresh the environment by deleting existing files failed (nfs.log attached)
2. the migration of data from one of the indexers to the RHS volume started to fail, leaving the data on local disk instead of migrating to the nfs-mounted RHS volume. Splunk continued, but this is an error.

Version-Release number of selected component (if applicable):
rhs 2.1u2, glusterfs 3.4.0.59rhs

How reproducible:

Steps to Reproduce:
1. any attempt to delete the files listed in the nfs.log fails, e.g.

[root@focil-rhs1 rawdata]# pwd
/opt/sbk/splunk/var/lib/splunk/test_streaming/colddb/db_1412354979_1408724759_476/rawdata
[root@focil-rhs1 rawdata]# rm slicesv2.dat
rm: remove regular file `slicesv2.dat'? y
rm: cannot remove `slicesv2.dat': Invalid argument

Actual results:
file deletion fails.

Expected results:
file access/manipulation following rebalance should work.

Additional info:
Avati has had a look at the system in the Cisco lab and verified that the hash layout, although changed on the filesystem itself, is not being used by the NFS translator's in-memory copy.

nfs.log from the rhs6 node is attached. The issue with Splunk happened at 03:30 PDT, which corresponds to the hash/REMOVE failures listed in the nfs.log from 10:30 UTC.
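For reference, the expansion sequence described above amounts to something like the following. This is a hedged sketch, not the exact commands run: the volume name "splunkRepl" and the brick paths are inferred from the xattr dump later in this bug, and replica counts are assumed from the 4-brick -> 8-brick dist-repl description. It requires a live gluster cluster.

```shell
# Grow the dist-repl volume by two replica pairs (4 bricks), 2x2 -> 4x2.
# Names/paths are illustrative, taken from the backend dump in this bug.
gluster volume add-brick splunkRepl \
    focil-rhs5:/rhs/brick2/splunkRepl focil-rhs8:/rhs/brick2/splunkRepl \
    focil-rhs6:/rhs/brick2/splunkRepl focil-rhs7:/rhs/brick2/splunkRepl

# Recompute the directory hash layouts and migrate data onto the new bricks,
# then poll until the rebalance reports "completed" on all nodes.
gluster volume rebalance splunkRepl start
gluster volume rebalance splunkRepl status
```

Note this was run while Splunk was actively reading and writing, which is the scenario in which the stale in-memory layout was observed afterwards.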
The systems are currently available for debugging if required.

I'm marking this as Urgent, since this is critical to our work with Splunk/Cisco.
Created attachment 918560 [details] nfs log from the other rhs node providing nfs connectivity
Also, the date to check for in the logs is 2014-07-16.
Here are some of my observations:

After rebalance completion, the backend of this "faulty" directory (i.e. /opt/sbk/splunk/var/lib/splunk/test_streaming/colddb/db_1412354979_1408724759_476/rawdata) seems to be OK:

Dir: /rhs1-streaming/db_1412354979_1408724759_476/rawdata
trusted.gfid=0x9d18c4690f134b3380a4ae308348162e

focil-rhs5:/rhs/brick1/splunk      trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
focil-rhs6:/rhs/brick1/splunk      trusted.glusterfs.dht=0x00000001000000007ffffffebffffffc
focil-rhs7:/rhs/brick1/splunk      trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
focil-rhs8:/rhs/brick1/splunk      trusted.glusterfs.dht=0x00000001000000003fffffff7ffffffd
focil-rhs5:/rhs/brick2/splunkRepl  trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
focil-rhs8:/rhs/brick2/splunkRepl  trusted.glusterfs.dht=0x0000000100000000bffffffdffffffff
focil-rhs6:/rhs/brick2/splunkRepl  trusted.glusterfs.dht=0x0000000100000000000000003ffffffe
focil-rhs7:/rhs/brick2/splunkRepl  trusted.glusterfs.dht=0x0000000100000000000000003ffffffe

This indicates that the rebalance finished properly. However, on inspecting the in-memory layout for that dir inode in the NFS server, it appears that DHT has set a FILE's layout on the directory.
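To double-check that the on-disk layout above is indeed healthy, here is a minimal decoding sketch. The word order {count, type, start, stop} (four big-endian 32-bit words) is an assumption based on DHT's disk-layout format; the hex values and brick names are copied from the dump above, one per replica pair.

```python
# Hedged sketch: decode the trusted.glusterfs.dht values above, assuming
# each is four big-endian 32-bit words: {count, type, start, stop}.
import struct

dht_xattrs = {
    "focil-rhs5:/rhs/brick1/splunk":     "00000001000000007ffffffebffffffc",
    "focil-rhs7:/rhs/brick1/splunk":     "00000001000000003fffffff7ffffffd",
    "focil-rhs5:/rhs/brick2/splunkRepl": "0000000100000000bffffffdffffffff",
    "focil-rhs6:/rhs/brick2/splunkRepl": "0000000100000000000000003ffffffe",
}

def decode_dht_xattr(hexval):
    count, dht_type, start, stop = struct.unpack(">IIII", bytes.fromhex(hexval))
    return count, dht_type, start, stop

# Sorted, the four ranges should tile the whole 32-bit hash space
# with no gaps or overlaps -- i.e. the rebalance fixed the layout.
ranges = sorted(decode_dht_xattr(v)[2:] for v in dht_xattrs.values())
assert ranges[0][0] == 0x00000000 and ranges[-1][1] == 0xFFFFFFFF
for (_, stop), (start, _) in zip(ranges, ranges[1:]):
    assert start == stop + 1, "gap/overlap between adjacent ranges"
print([(hex(s), hex(e)) for s, e in ranges])
```

Running this confirms the four ranges are contiguous and cover 0x00000000 through 0xffffffff, which is why the backend looks correct even though the in-memory copy is wrong.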
This is an excerpt from the gdb session during a break point at dht_unlink:

(gdb) p *loc
$10 = {path = 0x2044ad0 "<gfid:7907d41f-ade3-4315-8d71-551518922ec0>/db_1412354979_1408724759_476/rawdata/slicesv2.dat",
  name = 0x2044b21 "slicesv2.dat", inode = 0x7f87e44e81cc, parent = 0x7f87e44e8130,
  gfid = "\n)*\206\267rB\231\356 \375\062rD", pargfid = "\235\030\304i\017\023K3\200\244\256\060\203H\026."}

(gdb) p/x loc->parent->gfid
$11 = {0x9d, 0x18, 0xc4, 0x69, 0xf, 0x13, 0x4b, 0x33, 0x80, 0xa4, 0xae, 0x30, 0x83, 0x48, 0x16, 0x2e}

(gdb) p *loc->parent
$12 = {table = 0x1650500, gfid = "\235\030\304i\017\023K3\200\244\256\060\203H\026.", lock = 1, nlookup = 2,
  fd_count = 0, ref = 5, ia_type = IA_IFDIR, fd_list = {next = 0x7f87e44e8168, prev = 0x7f87e44e8168},
  dentry_list = {next = 0x7f87e3b6b308, prev = 0x7f87e3b6b308}, hash = {next = 0x7f87e3a532f0, prev = 0x7f87e4509d70},
  list = {next = 0x7f87e44e7910, prev = 0x7f87e44e85dc}, _ctx = 0x1a96440}

(gdb) p *layout
$4 = {spread_cnt = 0, cnt = 1, preset = 1, gen = 0, type = 0, ref = 17028183, search_unhashed = 0, list = 0x16f66e0}

(gdb) p layout->list[0]
$5 = {err = 0, start = 0, stop = 0, xlator = 0x161f2c0}

(gdb) p layout->list[0].xlator->name
$8 = 0x161b780 "splunkRepl-replicate-1"

Note that the parent dir's layout has preset=1 and cnt=1, typical of a FILE inode's preset layout. This also correlates with the NFS logs, where pretty much every hash value is reported to not fall in range (as the FILE layout range is from 0 to 0).

Further investigation is needed to find out why the dir inode ended up with a FILE preset layout. That is very likely the root cause of the overall problem.
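The "hash does not fall in range" effect described above can be illustrated with a small sketch. This is not the actual glusterfs code, just a toy model of the layout search: a directory layout's entries tile the 32-bit hash space, while a FILE preset layout (cnt=1, start=stop=0, as seen in gdb) covers only hash 0, so every real name hash misses. The directory layout's subvolume names here are hypothetical; "splunkRepl-replicate-1" is taken from the gdb output.

```python
# Toy model (not the glusterfs implementation) of DHT's layout search:
# return the subvolume whose [start, stop] range contains the name hash.

def layout_search(entries, name_hash):
    """entries: list of (start, stop, subvol) tuples."""
    for start, stop, subvol in entries:
        if start <= name_hash <= stop:
            return subvol
    return None  # miss -> the "hash ... does not fall in range" error path

# A healthy directory layout tiles the full 32-bit space (names illustrative):
dir_layout = [(0x00000000, 0x7fffffff, "replicate-0"),
              (0x80000000, 0xffffffff, "replicate-1")]

# The FILE preset layout observed in gdb: cnt=1, start=0, stop=0.
file_preset_layout = [(0x00000000, 0x00000000, "splunkRepl-replicate-1")]

name_hash = 0x9d18c469  # any nonzero hash value
assert layout_search(dir_layout, name_hash) == "replicate-1"
assert layout_search(file_preset_layout, name_hash) is None  # every name misses
```

With the directory inode carrying the second kind of layout in memory, every REMOVE/lookup-by-name computes a nonzero hash, finds no matching range, and fails, which matches the flood of range errors in nfs.log.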
Another noteworthy observation: from the NFS client, when rm /opt/sbk/splunk/var/lib/splunk/test_streaming/colddb/db_1412354979_1408724759_476/rawdata/filename was attempted, there were NO break point hits on dht_lookup, even on the ancestor inodes in the path. There were hits on dht_access and dht_stat on both the parent dir (the one with the FILE layout) and the file. This could be aggravating the effect, as the layout is not getting an opportunity to be refreshed (and "get fixed" automatically).
Created attachment 918857 [details] rebalance log from rhs5 node
Created attachment 918858 [details] rebalance log from rhs7 node
To allow the testing on this platform to move forward I had to mount the gluster vol at a different mount point to keep the application happy. I tested rm against the slices* file mentioned above to verify that everything was still the same - but the rm now worked! Perhaps the mount action refreshed the in-memory layout? So it looks like I unintentionally broke the 'reproducer' :(

I've left the remaining files in place, attached to /opt/sbk/rhs1-streaming on the rhs1 node.
A new mount (client) would have performed LOOKUP operations and that would have very likely refreshed the in-mem layout. The reproducer was "stably" reproducing because LOOKUPs were not coming from the client (as mentioned in my previous comment).
Proposing this as a blocker for Denali as it needs to be addressed for a key workload.
Created attachment 925892 [details]
Test case reproducing the problem

I was able to reproduce this problem on the 2.1 code base with a similar internal state in DHT (as seen by Avati in comment #4). The reproduction steps are in the attached test script, which fails for a random directory or two while listing or unlinking them.

We now need to try this with the fix for dht_access as mentioned in the other similar bug #1121099 and see if the problem goes away. We also need to test the same on 3.0, but as it is in a consistently reproducible state, we should be able to get to the bottom of this sooner, at least from a troubleshooting perspective.
Upstream patch submitted here: http://review.gluster.org/8462

The issue with RHS 3.0 was not as severe, due to a potential fix in layout setting on nameless lookup in DHT, but there were still stale errors on accessing some directories etc. These are also fixed with the changes made to the code. Once this is reviewed and accepted upstream, it will be ported to RHS 3.0 (and maybe 2.1 as well?)
Gluster-server version
======================
[root@rhssvm-swift2 ~]# gluster --version
glusterfs 3.6.0.28 built on Sep  3 2014 10:13:12
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

glusterfs-client version
========================
[root@rhs-client10 10]# glusterfs --version
glusterfs 3.6.0.28 built on Sep  3 2014 10:13:11
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation.

Ran the script attached by Shyam with a small correction to testcase 11 (ref: http://review.gluster.org/#/c/8462/6/tests/bugs/bug-1125824.t) and all testcases passed. Hence, this is verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html