Bug 239372
Summary: | automount consumes excessive CPU time | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | IBM Bug Proxy <bugproxy> | ||||||||
Component: | realtime-kernel | Assignee: | Guy Streeter <streeter> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | |||||||||
Severity: | medium | Docs Contact: | |||||||||
Priority: | medium | ||||||||||
Version: | 1.0 | CC: | nobody | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | 2.6.21-9el5rt | Doc Type: | Bug Fix | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2007-05-25 15:54:55 UTC | Type: | --- | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Attachments: |
|
Description
IBM Bug Proxy
2007-05-07 22:15:19 UTC
I forgot to mention this: The problem doesn't happen with the standard RHEL5 kernel. ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-08 08:24 EDT ------- I forgot to mention this: The problem doesn't happen with the standard RHEL5 kernel. ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-10 18:14 EDT ------- Reproduced w/ 2.6.21-2.el5rt kernel (from Clark's alpha repo) changed: What |Removed |Added ---------------------------------------------------------------------------- Status|OPEN |ASSIGNED ------- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-15 04:28 EDT ------- I have a slight clue on what could be going wrong here. I have noticed that automount daemon spawns a few threads and it is always one of these threads that consumes a lot of CPU. I then noticed that the cuplrit thread always seems to be doing a pthread_cond_timedwait(). Comparing with the sources of automount, I noted that it is usually one of the pthread_cond_wait()s in autofs-5.0.1/daemon/state.c, line numbers 931 or 967. The call to clock_gettime we see in strace output is probably the one in pthread_cond_timedwait() code in glibc. I suspect there is some problem with the kernel/glibc communication for pthread_cond_timedwait() that could be causing this problem. ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-15 07:40 EDT ------- This is getting interesting now. It looks like pthread_cond_timedwait() is going bonkers! The code in automount calls pthread_cond_timedwait() in a loop with a timeout of 1 second. On -rt kernel, this seems to be returning immediately and hence it makes a large number of pthread_cond_timedwait() calls. Hence it consumes a lot of CPU time. On mainline kernel, since pthread_cond_timedwait() times out after 1 second, it makes just 1 call per second. I need to figure out what is going wrong with pthread_cond_timedwait() on -rt. The problem could be in the kernel or at kernel/glibc interface. ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-16 07:57 EDT ------- I will attach a simple testcase that demonstrates the problem. It mimics what the code in automount is doing. The problem could be because of the way time() and clock_gettime() are used together and/or behavior of these calls on -rt. The testcase calls pthread_cond_timedwait 10 times, each time with a timeout of 1 second. This test runs for around 10 seconds on vanilla kernel, but finishes within a second on -rt. I have some ideas about what is going wrong here, but I also need to figure out how it is working fine on vanilla kernel before concluding anything. Created attachment 154809 [details]
timedwait2.c
----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-16 07:57 EDT ------- Testcase to demonstrate the problem ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-17 04:13 EDT ------- Some more info about the problem: The code in automount calls time() to get the current time, adds 1 to it and passes the result as timeout value to pthread_cond_timedwait. Since time() has a grannularity of 1 second, the timeout value could be anything between 1 nanosecond and 1 second. The code in pthread_cond_timedwait calls clock_gettime() (CLOCK_REALTIME) to get the current value of clock. It then compares the timeout with the current value and if the timeout has already elapsed, it just returns ETIMEDOUT without going into the kernel. This should happen rarely. However, on -rt, the values returned by time() and clock_gettime() don't seem to match. Hence pthread_cond_wait often finds that the timeout has elapsed and returns ETIMEDOUT immediately. This is the cause of the problem. I will attach another testcase that can demonstrate the problem much better. It does clock_gettime() to read CLOCK_REALTIME first. It then calls time() to get the time. When we run this test a few times, we notice that the time returned by time() is sometimes older than what is returned by clock_gettime(), even though it is called later. We need to findout which of the calls is reporting wrong time and why. Created attachment 154898 [details]
timediff.c
----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-17 04:15 EDT ------- New testcase. Better demonstrates the problem Run this testcase repeatedly and observe if it prints "ERROR!!!" ----- Additional Comments From srinivds.com (prefers email at srinivasa.com) 2007-05-17 08:25 EDT ------- What I could observe is, "time()" (in testcase) is implemented in glibc as well as in systemcall. We are hitting this error only when "time()" maps to glibc call and not when it maps to a systemcall. So problem is in glibc implementaion of "time()" Thanks Srinivasa Ds ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-17 13:32 EDT ------- Srinivasa: Hmm. Was this using strace? Sounds like its a vsyscall issue. I'll take a look at it. ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-17 14:19 EDT ------- Yup. I'm pretty sure its a vsyscall issue. I'll have a patch for it shortly. Created attachment 154970 [details]
vtime.patch
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-17 21:40 EDT ------- Fix for vsyscall time() This patch forces vtime() to call vgettimeofday() instead of just reading xtime, just as the time() syscall does. ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-17 22:02 EDT ------- Patch tested and sent to Ingo. ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-17 22:28 EDT ------- Patch also rediffed against 2.6.22-rc1 and sent to Andrew and Andi. ----- Additional Comments From srinivds.com (prefers email at srinivasa.com) 2007-05-18 06:06 EDT ------- Thanks Jhon Patch is working fine. ----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-05-18 19:03 EDT ------- Patch included in 2.6.21-rt4. Additionally Andrew has picked it up. Can you verify that the problem goes away in a kernel built with the -rt4 or later patch? ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-23 02:33 EDT ------- I tested with 2.6.21-4.el5rt kernel and the problem is still there. automount still consumes a lot of cpu time. Which -rt patch is this kernel built on? The -rt4 patch was added in 2.6.21-9el5rt. I don't know how you're supposed to get that one. If you tested your own build with the -rt4 patch, that's good enough for me. ----- Additional Comments From sripathi.com (prefers email at sripathik.com) 2007-05-24 12:32 EDT ------- I have tested this with 2.6.21-14.el5rt kernel (which I believe contains Ingo's patch-2.6.21-rt7) and the problem is no more seen. automount is behaving sanely. Thanks, Sripathi. |