Description of problem:

I have been chasing an issue with the upstream and Fedora kernels for a
week or so. The problem is with the CONFIG_DEBUG_SPINLOCK option -- when
turned on, the resulting kernel requires 20 minutes to build a kernel on a
64p/64G ia64 system. For comparison, the same build requires only 2
minutes on a similarly configured 2.6.9 series kernel.

Version-Release number of selected component (if applicable):
Latest Fedora kernels.

How reproducible:
100%

Steps to Reproduce:
1. boot Fedora
2. download and install source
3. compile

Actual results:

Below is a list of build times, using

    time (make -j oldconfig; make -j compressed; make -j modules; \
          make -j modules_install; make -j install)

for the 2.6.9-33.EL RHEL4 U3 RC3 kernel, for 2.6.16-rc4 without
CONFIG_DEBUG_SPINLOCK, and for 2.6.16-rc4 with CONFIG_DEBUG_SPINLOCK. You
can see the huge increase in system time on the 64p running 2.6.16-rc4
with CONFIG_DEBUG_SPINLOCK compared to 2.6.16-rc4 without it: 958 minutes
vs. 13 minutes.

2.6.9-33.EL with CONFIG_DEBUG_SPINLOCK

    64p  real  2m01.013s  user 58m9.504s   sys 6m36.283s
    32p  real  4m13.665s  user 58m8.381s   sys 5m24.072s
    16p  real  6m41.907s  user 64m45.513s  sys 5m47.168s
     8p  real 10m30.895s  user 62m36.766s  sys 4m49.389s
     4p  real 18m22.062s  user 62m6.599s   sys 4m5.615s
     2p  real 33m31.121s  user 54m32.272s  sys 4m46.661s

2.6.16-rc4 without CONFIG_DEBUG_SPINLOCK

    64p  real  4m29.869s  user 56m37.336s  sys 13m5.512s
    32p  real  4m46.434s  user 57m10.280s  sys 6m52.700s
    16p  real  6m11.018s  user 57m7.368s   sys 5m36.380s
     8p  real 10m20.800s  user 61m56.780s  sys 5m26.372s
     4p  real 18m58.029s  user 62m8.304s   sys 4m58.764s
     2p  real 37m37.040s  user 63m19.824s  sys 4m54.532s

2.6.16-rc4 with CONFIG_DEBUG_SPINLOCK

    64p  real 21m25.129s  user 54m59.256s  sys 958m37.320s
    32p  real  4m39.060s  user 57m10.264s  sys 22m14.432s
    16p  real  6m58.702s  user 63m28.012s  sys 6m51.596s
     8p  real  9m33.040s  user 56m45.364s  sys 4m56.840s
     4p  real 19m51.676s  user 62m45.048s  sys 5m0.296s
     2p  real 33m29.849s  user 56m39.272s  sys 4m26.068s

Expected results:

This is tough to answer -- it depends on the expected behaviour of
CONFIG_DEBUG_SPINLOCK. If the behaviour is correct (in that it doesn't
scale well ... after all, it is a _DEBUG_ option), then the reported
behaviour is okay. However, CONFIG_DEBUG_SPINLOCK should not be turned on
in production kernels.

Additional info:
Some data from my HP 64p (ia64) Integrity 1TB system. The command in both
cases was: time make -j 70

Compiling a kernel under RHEL4U3:

    real 1m3.786s
    user 26m38.878s
    sys  2m46.416s

Compiling under upstream 2.6.16-rc5 w/CONFIG_DEBUG_SPINLOCK:

    real 1m12.617s
    user 26m3.748s
    sys  9m18.908s

It is slower, but not as bad as what you were seeing...

- Doug
After a discussion with Kimball, he pointed out that the issue is probably
not any sort of cacheline ping-pong, but rather the __delay(1) used in
lib/spinlock_debug.c:__spin_lock_debug():

    for (i = 0; i < loops_per_jiffy * HZ; i++) {
        if (__raw_spin_trylock(&lock->raw_lock))
            return;
        __delay(1);
    }

Removing this __delay cuts down the time drastically.

Ingo, I'm probably missing something basic here -- but is it really
necessary to do a __delay(1)? Why not make the "loops_per_jiffy * HZ"
value tunable at boot time and allow users to set a value (rough sketch
below) -- wouldn't that be the same as attempting to get close to one
second?

Just some thoughts,

P.
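P.S. To make the boot-time tunable idea concrete, here is a minimal
sketch; the parameter name "spin_debug_loops" is made up for
illustration, it is not existing kernel code:

    /* 0 means "use the default loops_per_jiffy * HZ" */
    static unsigned long spin_debug_loops;

    static int __init spin_debug_loops_setup(char *str)
    {
            spin_debug_loops = simple_strtoul(str, NULL, 0);
            return 1;
    }
    __setup("spin_debug_loops=", spin_debug_loops_setup);

The bound in __spin_lock_debug() would then become
"spin_debug_loops ? spin_debug_loops : loops_per_jiffy * HZ", settable
from the kernel command line with spin_debug_loops=<n>.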
With the __delay(1) removed, building a kernel:

    real 2m44.371s
    user 38m0.008s
    sys  51m19.492s

We're still seeing an impact; however, that may be due to other CONFIG
options, etc.

P.
The more I think about this, the more I believe the correct solution is to
remove the __delay(1). The loop

    for (i = 0; i < loops_per_jiffy * HZ; i++)

already iterates for approximately a second. I do not see any advantage to
delaying for a single clock tick on ia64.

Manoj -- what are the ramifications of removing the __delay on ppc & ppc64?

P.
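For clarity, with the __delay(1) dropped the debug loop reduces to a plain
trylock spin -- the same code quoted above, minus the delay:

    for (i = 0; i < loops_per_jiffy * HZ; i++)
        if (__raw_spin_trylock(&lock->raw_lock))
            return;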
Prarit, have you seen this LKML discussion?

http://lkml.org/lkml/2006/2/6/381
> http://lkml.org/lkml/2006/2/6/381

Hmmm ... interesting. Ingo is saying that loops_per_jiffy is the wrong
metric to use -- and I agree. I think the correct metric to use here is
the TSC? IIRC most arches have a TSC-ticks-per-second value that can be
determined by querying the CPU ...

P.
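To sketch out what I mean -- this is hypothetical, not existing code, and
"cycles_per_sec" stands in for a per-arch calibration value (the ITC
frequency on ia64, the TSC rate on x86):

    #include <linux/timex.h>        /* cycles_t, get_cycles() */

    cycles_t start = get_cycles();

    for (;;) {
            if (__raw_spin_trylock(&lock->raw_lock))
                    return;
            cpu_relax();
            if (get_cycles() - start > cycles_per_sec) {
                    /* ~1 second of wall-clock spinning: suspected lockup */
                    printk(KERN_ERR "BUG: spinlock lockup suspected\n");
                    start = get_cycles();
            }
    }

That would avoid both the __delay(1) and the loops_per_jiffy guesswork,
since the timeout is measured in real cycles rather than delay-loop
iterations.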
Changed the __delay(1) to a cpu_relax() in the latest builds. Let me know
how it works out.
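In other words, the inner loop now reads (sketch of the change, based on
the 2.6.16-rc4 loop quoted above):

    for (i = 0; i < loops_per_jiffy * HZ; i++) {
        if (__raw_spin_trylock(&lock->raw_lock))
            return;
        cpu_relax();        /* was: __delay(1) */
    }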
The most recent rawhide kernel, with cpu_relax() instead of __delay(1), on
the 64p ... is actually worse, not better :/

    real 21m32.134s
    user 56m47.696s
    sys  1016m2.616s