Description of problem:
An IO hang is seen on an NFS-Ganesha mount during file operations such as file creation, directory creation and lookups.
Version-Release number of selected component (if applicable):
Yet to be determined.
Steps to Reproduce:
1. Create an EC (disperse) volume
2. Mount the volume using the NFS-Ganesha v3 protocol
3. Create 100000 files and 10000 directories with multiple sub-directories
4. Run a lookup on the mount point
5. Allow file and directory creation to complete
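The workload in steps 3 and 4 can be sketched as a small POSIX shell script. MOUNT, FILES and DIRS are assumptions (scaled down here for illustration); the bug report used 100000 files and 10000 directories against the ganesha mount:

```shell
#!/bin/sh
# Sketch of the reproduction workload (steps 3-4). MOUNT, FILES and
# DIRS are assumptions; point MOUNT at the NFS-Ganesha mount and scale
# the counts up to the 100000/10000 used in the bug report.
MOUNT=${MOUNT:-./ec-workload}
FILES=${FILES:-100}
DIRS=${DIRS:-20}

mkdir -p "$MOUNT"

# step 3a: directories, each with a sub-directory
i=0
while [ "$i" -lt "$DIRS" ]; do
    mkdir -p "$MOUNT/dir$i/sub$i"
    i=$((i + 1))
done

# step 3b: empty files spread round-robin across the directories
i=0
while [ "$i" -lt "$FILES" ]; do
    : > "$MOUNT/dir$((i % DIRS))/file$i"
    i=$((i + 1))
done

# step 4: lookup on the mount point (this is where the hang was seen)
ls -lRt "$MOUNT" > /dev/null
```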
Actual results:
IO operations hang.
Expected results:
No disruption to IO should be seen.
sosreports and statedumps will be attached shortly
sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1330997/
Changing the summary to reflect the issue: EC volume - IO hang seen on ganesha mount with file ops
*** Bug 1334860 has been marked as a duplicate of this bug. ***
*** Bug 1339465 has been marked as a duplicate of this bug. ***
While verifying this bug I hit Bug 1342426 - self heal daemon killed due to oom kills on a dist-disperse volume using nfs ganesha.
Hence verification of this bug is blocked until Bug 1342426 is fixed.
Restarted validating this bug after Bug 1342426 ("self heal daemon killed due to oom kills on a dist-disperse volume using nfs ganesha") was fixed.
I again hit an issue with ls -lRt hanging, and also a stale file handle error along with an oom kill of the ganesha process. Raised Bug 1344675 - Stale file handle seen on the mount of dist-disperse volume when doing IOs with nfs-ganesha protocol - for the same.
So, after discussing with stakeholders, I unmounted the volume and cleaned the client side,
then remounted the same volume on only one client and issued an ls -lRt on the root of the mount.
The ls -lRt hung (observed for at least 1 hr and still not responding by end of day).
[root@dhcp35-126 ~]# mkdir /mnt/ec-nfsganesha;mount -t nfs -o vers=3 10.70.35.220:/ec-nfsganesha /mnt/ec-nfsganesha
[root@dhcp35-126 ~]# cd /mnt/ec-nfsganesha
[root@dhcp35-126 ec-nfsganesha]# ls -lRt |& tee -a refreshed.ls.log
A plain 'ls' command on the root directory seems to have worked fine. The only issue is with 'ls -lRt' deep directory lookups. As part of readdir(/plus), NFS-Ganesha caches all dirents along with their attributes.
The issue seems to be that, since the volume contains millions of files, it takes a lot of time for NFS-Ganesha to cache all of them and then respond. Also, in the case of disperse volumes, we see that for each file there are lookup requests sent to almost all the ec sub-volumes (bricks), adding to the latency. Attached the pkt trace.
Any thoughts on the deadlock of locks mentioned above?
Disabling readdirp at one layer doesn't fully disable it: both md-cache and dht still do readdirp, so we need to disable it there too. I will take a look at the setup regarding the locks.
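A sketch of what turning readdirp off at each layer could look like. The volume name "ecvol" is a placeholder, and the option names are assumptions based on glusterfs volume options; they should be checked against `gluster volume set help` on the installed version:

```shell
# Hypothetical sketch: disable readdirp in the caching/translator layers.
# "ecvol" is a placeholder; verify option names with `gluster volume set help`.
gluster volume set ecvol performance.force-readdirp off   # md-cache layer
gluster volume set ecvol dht.force-readdirp off           # dht layer
# for a fuse mount, readdirp would also need to be disabled at mount time:
# mount -t glusterfs -o use-readdirp=no server:/ecvol /mnt/ecvol
```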
Did you get a chance to look at the setup, Pranith?
Yes, there is a stale lock. I am trying to find the reasons why it could get into this state. I will update the bz as soon as I find something. This is multi-threaded code and there could be races. I need to find which race could lead to this state. May take a while to find the problem
TL;DR: I recreated the hang on my laptop.
The locks were acquired at the time the bricks were going down because of ping timeouts. 4 of the 6 bricks went down at that time. 2 of the 6 bricks have locks which are not being unlocked for some reason and were left stale. It took almost 7 hours to come to this conclusion :-). Had to look at statedumps, and took a statedump of the nfs-ganesha process as well using gdb.
Steps to recreate the issue:
1) Create a plain disperse volume
2) Put a breakpoint at ec_wind_inodelk
3) From the fuse mount, issue ls -laR <mount>
4) As soon as the breakpoint is hit in gdb, kill 4 of the 6 bricks from another terminal
5) Quit gdb
6) Wait a second or two and confirm whether there are stale locks on the remaining bricks
7) In my case there were, so I issued ls -laR on the mount and it hung
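One way to check for the stale locks in step 6 is to take a brick statedump (`gluster volume statedump <volname>`, which writes dumps under /var/run/gluster by default) and grep it for inodelk entries that stay ACTIVE. The dump fragment below is hand-written to illustrate the shape of such an entry, not captured output:

```shell
# Illustrative check for step 6. The statedump fragment is hand-written
# to show the shape of a lingering (stale) inodelk entry; real dumps
# land under /var/run/gluster after `gluster volume statedump <volname>`.
mkdir -p ./statedumps
cat > ./statedumps/brick.dump <<'EOF'
[xlator.features.locks.ecvol-locks.inode]
path=/dir0/file1
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid=4242
EOF

# an inodelk that stays ACTIVE long after the client should have released
# it is a candidate stale lock
grep -c 'inodelk.*(ACTIVE)' ./statedumps/brick.dump
```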
Posted the patch upstream: http://review.gluster.org/14703
This is a day-1 bug in ec. I was able to recreate it on 3.1.1 as well. I don't remember seeing disconnects the first time we looked at this bz, so I think it has a different root cause.
It seems the ec stale-lock issue is *only* hit if a brick disconnection (ping-timer expiry) happens. Question for Nagpavan: are you hitting brick disconnections every time while verifying this bug, and if so, does that mean you are blocked on verifying it?
If Nagpavan does confirm this, can you shed some light on why we are seeing the ping-timer expiry? Does it have anything to do with the access protocol?
Pranith and I looked into the client and brick logs.
1. Clients were able to reconnect without any issues after the disconnect.
2. There is not much information in the logs to help figure out what caused the ping-timer expiry.
To summarize, ping-timer expiry is not a problem in rpc/transport. It can be caused by many things, one of which is the poller thread not being able to respond to ping requests. We need more investigation, and the bug need not be in the rpc/transport layer. What would help here is:
1. Brick logs at DEBUG log-level, so that we can try to analyse what the brick process is doing during ping expiry.
2. The exact steps/workload during which the ping timer expired.
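For item 1, the brick log level can be raised per volume via the volume options; "ecvol" below is a placeholder volume name:

```shell
# raise brick logs to DEBUG for the affected volume (placeholder name),
# then revert to INFO once the ping-timer-expiry window is captured
gluster volume set ecvol diagnostics.brick-log-level DEBUG
# ... reproduce the workload ...
gluster volume set ecvol diagnostics.brick-log-level INFO
```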
I didn't observe any brick disconnections.
Also, the dd IO was successfully going on (when I ran it as part of validation before comment 20). Only ls -lRt was hanging (with tar untar failing after certain iterations, due to an oom kill of nfs-ganesha).
Unable to mark this as verified for the following reason:
the writes were not having issues, but an ls -lRt was either hanging or started to display contents only after a very long delay of more than 15 hrs.
Hence the use case is incomplete.
Marking this bug as blocked on verification due to the below BZs:
1344675 - Stale file handle seen on the mount of dist-disperse volume when doing IOs with nfs-ganesha protocol
1345911 - locks on file in dist-disperse not released leading to IO hangs
1345909 - ls -lRt taking very long time to display contents on an dist-disperse volume with nfs ganesha protocol
As the fix is already available in the latest release of RHGS, closing this bug.