Bug 239372 - automount consumes excessive CPU time
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Assigned To: Guy Streeter
Reported: 2007-05-07 18:15 EDT by IBM Bug Proxy
Modified: 2016-02-09 20:32 EST
CC List: 1 user

Fixed In Version: 2.6.21-9el5rt
Doc Type: Bug Fix
Last Closed: 2007-05-25 11:54:55 EDT


Attachments
timedwait2.c (792 bytes, text/plain), 2007-05-16 08:01 EDT, IBM Bug Proxy
timediff.c (495 bytes, text/plain), 2007-05-17 04:20 EDT, IBM Bug Proxy
vtime.patch (711 bytes, text/plain), 2007-05-17 21:45 EDT, IBM Bug Proxy


External Trackers
IBM Linux Technology Center 34435

Description IBM Bug Proxy 2007-05-07 18:15:19 EDT
LTC Owner is: jstultz@us.ibm.com
LTC Originator is: sripathi@in.ibm.com


The automount daemon seems to consume excessive CPU time on RHEL5-RT. Just log
into a RHEL5-RT box and run "top": automount shows up near the top of the list,
sometimes consuming up to 30% of CPU time. I have seen this on rt-cypress,
rt-aspen, llm50, llm38, etc.

Machine type: LS20, LS21.
Kernel: 2.6.20-0119.rt8
glibc: glibc-2.5-12
automount version: Linux automount version 5.0.1-0.rc2.42

I ran strace on automount and noticed that it keeps making the following syscalls:

2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 321418}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 422253}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 503257}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 582864}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 663029}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 743195}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 823081}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 902967}) = 0
2604  futex(0x555555783a20, FUTEX_WAKE, 1) = 0
2604  clock_gettime(CLOCK_REALTIME, {1178333881, 983412}) = 0
Comment 1 Sripathi Kodi 2007-05-08 08:25:35 EDT
I forgot to mention this: The problem doesn't happen with the standard RHEL5 kernel.
Comment 3 IBM Bug Proxy 2007-05-10 18:16:45 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-10 18:14 EDT -------
Reproduced w/ 2.6.21-2.el5rt kernel (from Clark's alpha repo) 
Comment 4 IBM Bug Proxy 2007-05-15 04:30:46 EDT
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED




------- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-15 04:28 EDT -------
I have a slight clue on what could be going wrong here.

I have noticed that the automount daemon spawns a few threads, and it is always
one of these threads that consumes a lot of CPU. I then noticed that the culprit
thread always seems to be doing a pthread_cond_timedwait(). Comparing with the
sources of automount, I noted that it is usually one of the
pthread_cond_timedwait() calls in autofs-5.0.1/daemon/state.c, line 931 or 967.
The clock_gettime call we see in the strace output is probably the one in the
pthread_cond_timedwait() code in glibc.

I suspect there is some problem with the kernel/glibc communication for
pthread_cond_timedwait() that could be causing this problem. 
Comment 5 IBM Bug Proxy 2007-05-15 07:45:41 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-15 07:40 EDT -------
This is getting interesting now. It looks like pthread_cond_timedwait() is going
bonkers!

The code in automount calls pthread_cond_timedwait() in a loop with a timeout of
1 second. On the -rt kernel, this seems to return immediately, so it makes a
huge number of pthread_cond_timedwait() calls and burns a lot of CPU time. On
the mainline kernel, since pthread_cond_timedwait() times out after 1 second as
expected, it makes just one call per second.

I need to figure out what is going wrong with pthread_cond_timedwait() on -rt.
The problem could be in the kernel or at kernel/glibc interface. 
Comment 6 IBM Bug Proxy 2007-05-16 08:00:46 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-16 07:57 EDT -------
I will attach a simple testcase that demonstrates the problem. It mimics what
the code in automount is doing. The problem could be caused by the way time()
and clock_gettime() are used together, and/or by the behavior of these calls
on -rt.

The testcase calls pthread_cond_timedwait 10 times, each time with a timeout of
1 second. This test runs for around 10 seconds on the vanilla kernel, but
finishes within a second on -rt.

I have some ideas about what is going wrong here, but I also need to figure out
how it is working fine on vanilla kernel before concluding anything. 
Comment 7 IBM Bug Proxy 2007-05-16 08:01:18 EDT
Created attachment 154809 [details]
timedwait2.c
Comment 8 IBM Bug Proxy 2007-05-16 08:01:23 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-16 07:57 EDT -------
 
Testcase to demonstrate the problem 
Comment 9 IBM Bug Proxy 2007-05-17 04:15:31 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-17 04:13 EDT -------
Some more info about the problem:

The code in automount calls time() to get the current time, adds 1 to it and
passes the result as the timeout value to pthread_cond_timedwait. Since time()
has a granularity of 1 second, the actual timeout could be anything between 1
nanosecond and 1 second in the future.

The code in pthread_cond_timedwait calls clock_gettime() (CLOCK_REALTIME) to get
the current value of clock. It then compares the timeout with the current value
and if the timeout has already elapsed, it just returns ETIMEDOUT without going
into the kernel. This should happen rarely.

However, on -rt, the values returned by time() and clock_gettime() don't seem
to match. Hence pthread_cond_timedwait often finds that the timeout has already
elapsed and returns ETIMEDOUT immediately. This is the cause of the problem.

I will attach another testcase that demonstrates the problem much better. It
first calls clock_gettime() to read CLOCK_REALTIME, then calls time() to get
the time. When we run this test a few times, we notice that the time returned
by time() is sometimes older than what clock_gettime() returned, even though
time() is called later. We need to find out which of the two calls is
reporting the wrong time, and why.
Comment 10 IBM Bug Proxy 2007-05-17 04:20:50 EDT
Created attachment 154898 [details]
timediff.c
Comment 11 IBM Bug Proxy 2007-05-17 04:20:55 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-17 04:15 EDT -------
 
New testcase. Better demonstrates the problem

Run this testcase repeatedly and observe if it prints "ERROR!!!" 
Comment 12 IBM Bug Proxy 2007-05-17 08:30:25 EDT
----- Additional Comments From srinivds@in.ibm.com (prefers email at srinivasa@in.ibm.com)  2007-05-17 08:25 EDT -------
What I could observe is that time() (in the testcase) is implemented both in
glibc and as a system call. We hit this error only when time() maps to the
glibc call, not when it maps to the system call. So the problem is in the
glibc implementation of time().

Thanks,
 Srinivasa Ds 
Comment 13 IBM Bug Proxy 2007-05-17 13:35:46 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-17 13:32 EDT -------
Srinivasa: Hmm, was this using strace? Sounds like it's a vsyscall issue. I'll
take a look at it. 
Comment 14 IBM Bug Proxy 2007-05-17 14:25:25 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-17 14:19 EDT -------
Yup, I'm pretty sure it's a vsyscall issue. I'll have a patch for it shortly. 
Comment 15 IBM Bug Proxy 2007-05-17 21:45:20 EDT
Created attachment 154970 [details]
vtime.patch
Comment 16 IBM Bug Proxy 2007-05-17 21:45:25 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-17 21:40 EDT -------
 
Fix for vsyscall time()

This patch forces vtime() to call vgettimeofday() instead of just reading
xtime, just as the time() syscall does. 
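The patch is attached rather than quoted inline; as a pseudocode sketch of the idea (hypothetical names, not the actual patch hunks), the before/after looks like:

```c
/* Before: the x86_64 vsyscall time() read the tick-updated xtime value
 * directly, so its seconds count could lag behind the interpolated clock
 * used by gettimeofday()/clock_gettime(). */
time_t vtime(time_t *t)
{
    time_t secs = __xtime.tv_sec;   /* tick granularity: may be behind */
    if (t)
        *t = secs;
    return secs;
}

/* After: derive the seconds from vgettimeofday(), the same interpolated
 * path the other time interfaces use, so they can no longer disagree. */
time_t vtime(time_t *t)
{
    struct timeval tv;
    vgettimeofday(&tv, NULL);
    if (t)
        *t = tv.tv_sec;
    return tv.tv_sec;
}
```

This mirrors what the time() syscall proper already does, which is why the bug only appeared when the vsyscall path was used.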
Comment 17 IBM Bug Proxy 2007-05-17 22:05:39 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-17 22:02 EDT -------
Patch tested and sent to Ingo. 
Comment 18 IBM Bug Proxy 2007-05-17 22:30:43 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-17 22:28 EDT -------
Patch also rediffed against 2.6.22-rc1 and sent to Andrew and Andi. 
Comment 19 IBM Bug Proxy 2007-05-18 06:10:36 EDT
----- Additional Comments From srinivds@in.ibm.com (prefers email at srinivasa@in.ibm.com)  2007-05-18 06:06 EDT -------
Thanks John
 Patch is working fine. 
Comment 20 IBM Bug Proxy 2007-05-18 19:05:58 EDT
----- Additional Comments From jstultz@us.ibm.com (prefers email at johnstul@us.ibm.com)  2007-05-18 19:03 EDT -------
Patch included in 2.6.21-rt4. Additionally Andrew has picked it up. 
Comment 21 Guy Streeter 2007-05-22 11:36:42 EDT
Can you verify that the problem goes away in a kernel built with the -rt4 or
later patch?
Comment 22 IBM Bug Proxy 2007-05-23 13:56:50 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-23 02:33 EDT -------
I tested with the 2.6.21-4.el5rt kernel and the problem is still there.
automount still consumes a lot of CPU time. Which -rt patch is this kernel
built on? 
Comment 23 Guy Streeter 2007-05-23 14:59:25 EDT
The -rt4 patch was added in 2.6.21-9el5rt. I don't know how you're supposed to
get that one.
If you tested your own build with the -rt4 patch, that's good enough for me.
Comment 24 IBM Bug Proxy 2007-05-24 12:35:34 EDT
----- Additional Comments From sripathi@in.ibm.com (prefers email at sripathik@in.ibm.com)  2007-05-24 12:32 EDT -------
I have tested this with the 2.6.21-14.el5rt kernel (which I believe contains
Ingo's patch-2.6.21-rt7) and the problem is no longer seen. automount is
behaving sanely.
Thanks,
Sripathi. 
