Description of problem:
Occasionally, running pidof for process X hangs in access_process_vm forever.

Version-Release number of selected component (if applicable):
2.6.9-67

How reproducible:
Within a day or two, even though the pidof call rate is very low (roughly once per 5 minutes).

Steps to Reproduce:
Nokia flexiserver specific: the test case fails over the cluster NFS heads. The failover script uses pidof to find already-running NFS tasks. We get a hang roughly once every 100 runs.
Created attachment 299335 [details] crash backtrace of pidof while hanging
A slow console and the hardware watchdog prevented us from getting any sysrq-t output from this hardware, but that is now solved. We should get a complete task list from the next occurrence.
OK, thanks. BTW, is this reproducible on my system? The system is obviously stuck here:

------------------------------------
int access_process_vm(...)
{
        struct mm_struct *mm;
        struct vm_area_struct *vma;
        struct page *page;
        void *old_buf = buf;

        mm = get_task_mm(tsk);
        if (!mm)
                return 0;

>>>     down_read(&mm->mmap_sem);
-------------------------------------

But there are hundreds of other down_write(&...->mmap_sem) calls on that architecture that could cause this problem... Can you get an AltSysrq-T when this happens so I can see what the process that holds the semaphore is doing???

Larry
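For illustration, here is a minimal userspace analogue (a sketch, not kernel code; a pthread_rwlock_t stands in for mm->mmap_sem) of why the down_read() above never returns if some other task holds the semaphore for writing and never releases it:

------------------------------------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

/* stands in for a task holding mmap_sem for write and never releasing it */
static void *writer(void *arg)
{
        pthread_rwlock_wrlock(&mmap_sem);
        printf("writer: holding the lock, now blocking forever\n");
        pause();                /* never unlocks */
        return NULL;
}

/* stands in for pidof hitting access_process_vm() -> down_read() */
static void *reader(void *arg)
{
        printf("reader: taking the lock for read (down_read)...\n");
        pthread_rwlock_rdlock(&mmap_sem);   /* hangs here, as in the backtrace */
        printf("reader: never reached\n");
        return NULL;
}

int main(void)
{
        pthread_t w, r;

        pthread_create(&w, NULL, writer, NULL);
        sleep(1);               /* let the writer win the lock first */
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(r, NULL);  /* hangs by design, like the stuck pidof */
        return 0;
}
-------------------------------------

Build with gcc -pthread; the program hangs forever, which is the point.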
Sysrq-t is in the works. With any luck we'll get it tomorrow morning. I'll try to put together a reproducer test case on one of the RHTS systems. I'm feeling lucky..
The most basic imaginable test case (killing/starting hordes of processes and running pidof against them) does not seem to reproduce it; a rough sketch of that test follows below. I'm betting this may have something to do with the mount being done just prior to checking for leftover NFS tasks. Just a guess, though.
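For the record, the test case was roughly along these lines (a hypothetical sketch, not the exact test; the 'sleep' target name is made up):

------------------------------------
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int i;

        for (;;) {
                /* start a horde of short-lived children */
                for (i = 0; i < 50; i++) {
                        pid_t pid = fork();
                        if (pid == 0) {
                                execl("/bin/sleep", "sleep", "1", (char *)NULL);
                                _exit(1);
                        }
                }
                /* race pidof against the children while they come and go */
                if (system("pidof sleep > /dev/null") == -1)
                        break;
                /* reap everything, then start over */
                while (waitpid(-1, NULL, 0) > 0)
                        ;
        }
        return 0;
}
-------------------------------------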
Created attachment 300059 [details] partial sysrq
This is tricky. The watchdog is not the cause of the reset; it has to be something else. We can only get partial sysrq's.
Created attachment 300244 [details] Probably complete backtrace from Crash
Created attachment 300249 [details] Another crash-bt
NOTE: attachment 300249 is verified to be from a NON-RECOVERABLE occurrence.
Umm, could it be that one of the tasks running on top of NFS holds the semaphore that is required for NFS to start up? :)
I suspect this problem was introduced in linux-2.6.9-futex.patch:

* Thu May 10 2007 Jason Baron <jbaron> [2.6.9-55.2]
- fix for futex()/FUTEX_WAIT race condition (Ernie Petrides) [217067]

Can you try kernel-2.6.9-55.1 and see if the problem goes away???

Larry
Yeah, no (obvious) luck with the NFS guess.
Hmm, IMHO my initial guess may still be valid. It may be that task 11748 holds mmap_sem for write in sys_mmap, blocking just about everyone else. That task cannot proceed because NFS is not up, and since NFS startup itself is hanging with 'pidof' waiting for that same semaphore, we have a deadlock (sketched below). So we'll try both cases: we'll remove the patch Larry suggested on one system, and on the other we'll move the pidof call to a point where basic NFS is already up. I'm willing to bet Larry a cup of machine coffee on this one :)
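To make the suspected cycle concrete, a minimal userspace sketch (hypothetical; pthreads stand in for the kernel primitives, and the thread names and pid 11748 are taken from the theory above):

------------------------------------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  lk       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   nfs_up   = PTHREAD_COND_INITIALIZER;
static int server_running = 0;
static pthread_t pidof_thread;

/* "task 11748": NFS client task in sys_mmap(), holding mmap_sem for
 * write, unable to proceed until the NFS server is back up */
static void *nfs_client_task(void *arg)
{
        pthread_rwlock_wrlock(&mmap_sem);
        pthread_mutex_lock(&lk);
        while (!server_running)
                pthread_cond_wait(&nfs_up, &lk);   /* A waits on C */
        pthread_mutex_unlock(&lk);
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

/* "pidof": reads /proc/11748/cmdline -> access_process_vm -> down_read */
static void *pidof_task(void *arg)
{
        sleep(1);                          /* let 11748 grab the sem first */
        pthread_rwlock_rdlock(&mmap_sem);  /* B waits on A */
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

/* "NFS startup script": runs pidof before starting the server */
static void *nfs_startup(void *arg)
{
        pthread_join(pidof_thread, NULL);  /* C waits on B: cycle closed */
        pthread_mutex_lock(&lk);
        server_running = 1;                /* never reached */
        pthread_cond_broadcast(&nfs_up);
        pthread_mutex_unlock(&lk);
        return NULL;
}

int main(void)
{
        pthread_t a, c;

        pthread_create(&a, NULL, nfs_client_task, NULL);
        pthread_create(&pidof_thread, NULL, pidof_task, NULL);
        pthread_create(&c, NULL, nfs_startup, NULL);
        sleep(5);
        printf("still deadlocked: A waits on C, B waits on A, C waits on B\n");
        return 0;
}
-------------------------------------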
It took a day to find a second cluster for testing with the 55.1 kernel, but we found one and will set it up on Monday and start the test. System 1 has been testing the fix that moves the pidof call to a point where nfsd/mountd are already up, and the fault has not shown up yet. If that holds, the implications are yet to be properly understood: it may mean that the whole NFS failover concept is flaky, at least when the NFS client and server end up in the same node.
We have not seen this bug again since the pidof call was moved to a point where nfsd/mountd are already up. We'll keep the test running for another day to be 'sure'. A build with the 55.1 kernel is also ready but has not been installed yet.
Verified: the hang does not show up after moving the pidof call.
Verified: using the 55.1 kernel does not resolve this issue.
OK, so what does this all mean? Is the whole failover logic flawed when NFS is in the picture???

Larry
To me it means that, with bad luck, tasks running on top of an NFS mount may deadlock the system when the NFS server itself migrates to the same node. I take it not too many people are doing this..
To summarize: if NFS server migration from an external host to the local node happens while a local NFS client task is holding mmap_sem, we can get a deadlock. The pidof calls in the NFS server startup iterate over all tasks (via /proc/<pid>/cmdline) and stop once they hit that task, and that task can never proceed because the server never comes back up (see the sketch below). Larry, any major holes in this theory?
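For illustration, a minimal sketch of what those pidof calls do (a hypothetical simplification; real pidof matches the exact program name rather than using strstr): each read() of /proc/<pid>/cmdline goes through access_process_vm() and takes that task's mmap_sem for read, so a single wedged task stalls the whole scan:

------------------------------------
#include <ctype.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *name = argc > 1 ? argv[1] : "nfsd";
        DIR *proc = opendir("/proc");
        struct dirent *de;
        char path[64], buf[256];

        if (!proc)
                return 1;
        while ((de = readdir(proc)) != NULL) {
                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;
                snprintf(path, sizeof(path), "/proc/%s/cmdline", de->d_name);
                int fd = open(path, O_RDONLY);
                if (fd < 0)
                        continue;
                /* this read takes the target task's mmap_sem for read
                 * in the kernel, and can hang forever on a wedged task */
                ssize_t n = read(fd, buf, sizeof(buf) - 1);
                close(fd);
                if (n > 0) {
                        buf[n] = '\0';
                        if (strstr(buf, name))
                                printf("%s\n", de->d_name);
                }
        }
        closedir(proc);
        return 0;
}
-------------------------------------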