Bug 866988 - kswapd using 100%cpu for extended period on i686 kvm vhost - rawhide
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: i686 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 875103
Reported: 2012-10-16 09:40 EDT by John Ellson
Modified: 2013-04-05 15:14 EDT (History)
Doc Type: Bug Fix
Last Closed: 2013-04-05 15:14:03 EDT
Type: Bug

Attachments
/proc/vmstat while kswapd0 at 100%cpu (1.95 KB, text/plain)
2012-11-30 07:07 EST, John Ellson
/proc/zoneinfo with kswapd0 at 100% cpu (3.09 KB, text/plain)
2012-11-30 07:10 EST, John Ellson
kswapd.txt output from perf command with kswapd0 at 100% cpu (55.62 KB, text/plain)
2012-11-30 07:24 EST, John Ellson

Description John Ellson 2012-10-16 09:40:02 EDT
Description of problem:
kswapd using 100%cpu for extended period on i686 kvm vhost - rawhide

top - 13:32:16 up 2 days, 15:55,  2 users,  load average: 1.05, 1.03, 1.13
Tasks:  99 total,   3 running,  96 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 99.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.3 hi,  0.3 si,  0.3 st
KiB Mem:   1019468 total,   813236 used,   206232 free,    42276 buffers
KiB Swap:  2064380 total,     3216 used,  2061164 free,   388560 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
   25 root      20   0     0    0    0 R  98.4  0.0   2681:19 kswapd0         

Version-Release number of selected component (if applicable):
kernel-3.7.0-0.rc0.git6.2.fc19.i686 

How reproducible:
Seen more than one time, but now I can't make it happen with just a reboot.

Steps to Reproduce:
1.reboot
2..... not sure....
3.
  
Actual results:
100%cpu - vhost consumes more than fair share from parent host

Expected results:
<5% cpu when idling

Additional info:
Comment 1 John Ellson 2012-10-16 09:45:53 EDT
This bug has also been seen by others:

    https://lkml.org/lkml/2012/10/12/206
Comment 2 kevin martin 2012-10-16 10:30:32 EDT
My usage today (over the last hour or so I've been watching it) on an x86_64 laptop with over 1GB of free mem (please look at the lkml posting to see what they patched to fix this problem):

Linux toshiba 3.7.0-0.rc0.git5.2.fc19.x86_64 #1 SMP Wed Oct 10 21:40:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux


top - 09:27:24 up 3 days, 22:44, 10 users,  load average: 2.05, 1.83, 1.78
Tasks: 224 total,   2 running, 222 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us, 26.2 sy,  0.0 ni, 71.8 id,  0.3 wa,  0.6 hi,  0.2 si,  0.0 st
KiB Mem:   3952540 total,  2810540 used,  1142000 free,     3680 buffers
KiB Swap:  4095996 total,   691856 used,  3404140 free,   133020 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                               
   41 root      20   0     0    0    0 R  97.6  0.0   3951:28 kswapd0                                                                               
31192 root      20   0     0    0    0 S   3.3  0.0   1:01.91 kworker/0:2                                                                           
 4201 root      20   0  294m 137m 4284 S   1.3  3.6  81:15.03 X                                                                                     
 4356 kevinm    20   0  583m 8812 4664 S   1.3  0.2   2:33.43 Terminal                                                                              
 8023 kevinm    20   0 1184m 360m  21m S   1.3  9.3   1228:39 thunderbird                                                                           
31541 kevinm    20   0 15700 1584 1056 R   1.0  0.0   0:10.60 top                                                                                   
22750 kevinm    20   0  807m  76m  18m S   0.7  2.0   9:46.60 chrome                                                                                
31201 root      20   0     0    0    0 S   0.7  0.0   0:11.68 kworker/1:2                                                                           
22846 kevinm    20   0  954m  43m 7968 S   0.3  1.1   1:03.76 chrome                                                                                
23013 kevinm    20   0 1088m 152m  16m S   0.3  4.0   4:09.33 chrome                                                                                
23021 kevinm    20   0 1093m 153m  18m S   0.3  4.0   4:26.71 chrome                                                                                
23130 kevinm    20   0  983m  52m  10m S   0.3  1.4   2:02.83 chrome                                                                                
23258 kevinm    20   0 1067m  75m  10m S   0.3  1.9   2:11.97 chrome                                                                                
    1 root      20   0 48572 1388 1144 S   0.0  0.0   0:07.42 systemd                                                                               
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.43 kthreadd
Comment 3 Mikhael Goikhman 2012-10-28 12:11:17 EDT
I still see the same problem with the latest rawhide kernel.

It takes a few hours until kswapd0 starts to eat CPU, then after a short period my machine is frozen completely (no network, keyboard, mouse). The same repeats after a forced reboot. "swapoff -a" does not help. Half of the memory is free.

Linux sparta 3.7.0-0.rc2.git3.1.fc19.i686 #1 SMP Thu Oct 25 20:11:21 UTC 2012 i686 i686 i386 GNU/Linux

Waiting to see a kernel in rawhide without this critical problem.
Comment 4 Thorsten Leemhuis 2012-10-30 15:31:13 EDT
If anyone here wants to help getting this fixed see:
http://article.gmane.org/gmane.linux.kernel.mm/88631

I'm building vanilla kernel images with these two patches now. If anyone wants to try them, let me know.
Comment 5 Thorsten Leemhuis 2012-10-30 18:30:44 EDT
(In reply to comment #4)
>
> I'm building vanilla kernel images with these two patches now. If anyone
> wants to try them, let me know.

Find them at http://thl.fedorapeople.org/kswap-issue/ (x86-64 only)
Comment 6 Josh Boyer 2012-11-12 08:59:39 EST
*** Bug 875103 has been marked as a duplicate of this bug. ***
Comment 7 Josh Boyer 2012-11-12 09:03:16 EST
This is the whole thread:

http://thread.gmane.org/gmane.linux.kernel.mm/87193/focus=89462

I believe there are going to be two reverts as a result of this.  Once they go in, I'll ask people to report on the issue with a kernel that contains them.
Comment 8 Bruno Wolff III 2012-11-12 11:32:58 EST
I seem to be able to get this to happen pretty consistently by running a yum update and an rsync at the same time.
Comment 9 John Ellson 2012-11-14 08:29:27 EST
Happened again last night for me with:
   kernel-3.7.0-0.rc5.git0.1.fc19.x86_64
on a kvm vhost with 1 cpu, 1G ram.

When it happens the machine is inaccessible by ssh and console, and has to be forced off and restarted.

Seems to be load related, "yum updates" often trigger it.
Comment 10 Josh Boyer 2012-11-17 07:21:22 EST
3.7-rc6 has one of the reverts included.

The other is still being discussed upstream with a possible different patch being included instead of a revert.
Comment 11 Bruno Wolff III 2012-11-17 14:14:40 EST
So far things are looking better. kswapd0 has only used about 6 seconds of cpu over 3 hours. I was doing some stuff that typically triggered the issue, though not the worst case.
Comment 12 kmike 2012-11-18 00:55:55 EST
Do you think the issue with kswapd0 eating 100% CPU after several suspend/resume cycles is the same as this bug?

I only see it after the 3rd or 4th resume over the night, and it's happening very reliably. I'm pretty sure there is no I/O activity going on.

I'm on FC16, and I see it with these kernels:
kernel-3.6.5-2.fc16.i686
kernel-3.6.6-1.fc16.i686
Comment 13 Bruno Wolff III 2012-11-18 10:25:30 EST
There still might be some issue. My kswapd0 process has accumulated 97 minutes of cpu time in less than a day. However I don't know what was happening when the cpu resource was being used at high rates. The case where I was seeing it before doesn't seem to be triggering it.
Comment 14 John Ellson 2012-11-18 10:43:12 EST
I've had kernel-3.7.0-0.rc6.git0.1.i686, and x86_64, running for the last day without problems, but when I did a "yum update" this morning the problem re-occurred on both vhosts.

The i686 vhost is still accessible from the console, and top shows kswapd0 at 100% cpu.

The x86_64 vhost locks up completely and is inaccessible even from the console.
Comment 15 Thorsten Leemhuis 2012-11-18 11:56:48 EST
I built a kernel with the patch from
https://lkml.org/lkml/2012/11/12/113
That's the combination of 1 + 2, which Mel mentions in 
http://thread.gmane.org/gmane.linux.kernel.mm/87193/focus=89504
(the patch "1" is the one that was reverted for rc6, hence I didn't have to apply that anymore)

Find the kernel rpms for testing at http://thl.fedorapeople.org/kswap-issue/
Comment 16 John Ellson 2012-11-20 08:04:42 EST
Still getting lockups with
kernel-3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.1.fc18.x86_64.rpm
(from link in Comment #15)
Comment 17 Thorsten Leemhuis 2012-11-20 08:25:54 EST
(In reply to comment #16)
> Still getting lockups with
> kernel-3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.1.fc18.x86_64.rpm
> (from link in Comment #15)

I saw problems (from some point on everything went slower until the system was barely usable) here today, too, after running kernel-3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.1.fc18.x86_64.rpm fine for a few hours. A kernel without patch "2" from Mel had started to show problems after only 2 hours, but maybe I've just been really unlucky there and the problem was the same.

Anyway, due to above problems I built and uploaded kernel-3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.3.fc18.x86_64.rpm now, which contains the patch "3" that Mel mentions in 
http://thread.gmane.org/gmane.linux.kernel.mm/87193/focus=89504 

Running it here now on one of the machines where I see the problem.
Comment 18 Thorsten Leemhuis 2012-11-20 08:26:35 EST
(In reply to comment #17)
> Anyway, due to above problems I built and uploaded

To be precise: I uploaded it to http://thl.fedorapeople.org/kswap-issue/ again
Comment 19 Josh Boyer 2012-11-20 08:44:19 EST
It would be very good if you reported your results to the upstream thread.  I know Mel is waiting for feedback on which approach to take.  "1" is a given as it already is in Linus' tree, but 2 vs. 3 is still outstanding afaik.
Comment 20 Bruno Wolff III 2012-11-20 10:08:01 EST
For my rawhide system kswapd took another big jump in cpu time (over 400 minutes now, when it was a bit over 100 yesterday). However, I haven't noticed this happening while using the system, so I am not sure what is triggering the start (and end) of the high cpu use. yum updates with rsync running don't appear to set it off any more.
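One way to catch when the jumps start and stop would be to sample kswapd0's accumulated CPU time at a fixed interval. A minimal Python sketch, assuming the usual /proc/<pid>/stat layout (utime and stime are fields 14 and 15, in clock ticks). The function names, the 10 second interval and the 50% threshold are all made up for illustration and do not come from any tool mentioned in this bug:

```python
import os
import time

def cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')'. The remaining fields start at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime

def find_pid(name):
    """Return the PID of the first process whose comm equals name, else None."""
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as f:
                    if f.read().strip() == name:
                        return int(entry)
            except OSError:
                continue  # process exited while we were scanning
    return None

def watch(name="kswapd0", interval=10.0, threshold=0.5):
    """Print a timestamped line whenever the process exceeds threshold of one CPU."""
    pid = find_pid(name)
    if pid is None:
        raise SystemExit(f"{name} not found")
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second

    def read():
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks(f.read())

    last = read()
    while True:
        time.sleep(interval)
        now = read()
        share = (now - last) / hz / interval
        if share >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, f"{name} used {share:.0%} of a CPU")
        last = now
```

Calling watch() in a terminal (or redirecting its output to a file) and leaving it running should give timestamps that can be matched against whatever else was going on at the time.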
Comment 21 John Ellson 2012-11-20 11:14:29 EST
Today I was able to trigger kswapd0 into high cpu usage on the i686 vhost
with kernel-3.7.0-0.rc6.git1.1.fc19.i686, by running a build and a yum update
at the same time.   On the i686 the cpu doesn't lock up and the machine is still accessible.

A similar load did *not* trigger the problem on the x86_64 vhost with kernel-3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.3.fc18.x86_64 (from Comment #18)

Not conclusive, but a good sign.
Comment 22 Bruno Wolff III 2012-11-20 11:28:05 EST
I just saw kswapd0 running using 93% of a cpu. It didn't last all that long though. I had two rsyncs and an md sync running and a couple of large memory processes mostly idle.
Comment 23 Josh Boyer 2012-11-20 11:43:05 EST
(In reply to comment #22)
> I just saw kswapd0 running using 93% of a cpu. It didn't last all that long
> though. I had two rsyncs and an md sync running and a couple of large memory
> processes mostly idle.

Noting which kernel explicitly would be helpful.  There are several out there.
Comment 24 Bruno Wolff III 2012-11-20 11:49:28 EST
I'm still using 3.7.0-0.rc6.git0.1.fc19.i686.PAE. But I'll be switching to 3.7.0-0.rc6.git1.1.fc19.i686.PAE (from the rawhide nodebug repo) this afternoon.
Comment 25 Josh Boyer 2012-11-20 11:51:04 EST
(In reply to comment #24)
> I'm still using 3.7.0-0.rc6.git0.1.fc19.i686.PAE. But I'll be switching to
> 3.7.0-0.rc6.git1.1.fc19.i686.PAE (from the rawhide nodebug repo) this
> afternoon.

OK, thank you.  Note that those kernels only have the initial revert that went into -rc6 and not the other patches in question that Thorsten has built kernels for.  At this point, I wouldn't expect kswapd issues to be totally gone on rawhide kernels.
Comment 26 Bruno Wolff III 2012-11-20 16:37:43 EST
3.7.0-0.rc6.git1.1.fc19.i686.PAE (from rawhide nodebug repo) does in fact exhibit this problem. I saw kswapd0 running up a lot of time while using yum without much else going on.
Comment 27 John Ellson 2012-11-21 06:08:52 EST
100% cpu kswapd0 happened within 6 hours on the i686 vhost using kernel-3.7.0-0.rc6.git1.5.fc19.i686.   I don't know what the trigger was, but this time I don't think it was yum.

The x86_64 version of the same kernel, running for the same time, is not yet showing any problems.
Comment 28 John Ellson 2012-11-21 10:05:38 EST
.... kernel-3.7.0-0.rc6.git1.5.fc19.x86_64 just crashed after a yum update.
Comment 29 Thorsten Leemhuis 2012-11-21 12:57:28 EST
Just FYI & to make sure everybody is up2date:

Mel in http://article.gmane.org/gmane.linux.kernel.mm/90502 pointed out: "There is also a potential accounting bug that could be affecting this." 

From the description in https://lkml.org/lkml/2012/11/20/613 it looks a lot like that's the problem on one of the two machines where I saw VM problems. Building a patched kernel right now and will give it a try tomorrow.

On the second machine (the one where I definitely saw the problem that started this bug) I'm running a kernel with the "riskier" patch right now (the one in 3.7.0-0.rc6.git0.1.vanilla.mainline.knurd.tmp.3.fc18.x86_64.rpm) and everything seems fine so far.
Comment 30 Mikhael Goikhman 2012-11-24 07:33:49 EST
It is a month after my last report (comment #3). kswapd still pretty quickly starts to use 100% of CPU and I am forced to reboot daily. Tried with all rawhide i686 kernels available, including the latest 3.7.0-0.rc6.git2.1.fc19.i686.

Starting with rc6 kernels, sometimes there is a situation when kswapd uses 100% CPU, but still allows other processes to work relatively ok for a short period (hours). But usually this just makes X drawing and event handling impossibly slow and eventually freezes everything, just like a month ago. I don't use desktops. Just fvwm, xterms and firefox with several tabs. I do "yum update" daily that may or may not trigger the problem.

If this kernel bug is not solved in a week, can please somebody provide an alternative rawhide i686 kernel (possibly with the kernel snapshot from 2 months ago)? Thank you.
Comment 31 Bruno Wolff III 2012-11-24 10:01:37 EST
You should be able to use f18 kernels in rawhide without a problem.
Comment 32 Thorsten Leemhuis 2012-11-24 15:15:06 EST
(In reply to comment #30)
> If this kernel bug is not solved in a week, can please somebody provide an
> alternative rawhide i686 kernel (possibly with the kernel snapshot from 2
> months ago)? Thank you.

If you want to help getting this solved it would be great if you could try these:

http://kojipkgs.fedoraproject.org//work/tasks/3876/4723876/kernel-3.7.0-0.rc6.git1.4.knurd.mel.riskier.1.fc19.i686.rpm
http://kojipkgs.fedoraproject.org//work/tasks/3870/4723870/kernel-3.7.0-0.rc6.git1.4.knurd.mel.safe.1.fc19.i686.rpm

Mel afaics is still waiting for feedback on how to properly solve the issue and your input could be what's needed to make a decision, as it seems you still can reproduce it quite easily. These kernels contain the patches "2" and "3" he mentions on http://thread.gmane.org/gmane.linux.kernel.mm/87193/focus=89504
(Patch 1 was merged a few days ago; and these kernels contain a patch for a different, but related issue, too)

P.S.: For other kernel variants follow the links on these pages
http://koji.fedoraproject.org/koji/taskinfo?taskID=4723872
http://koji.fedoraproject.org/koji/taskinfo?taskID=4723867
Comment 33 John Ellson 2012-11-26 06:18:11 EST
No luck. The problem reoccurred within 24 hours on the i686 vhost with:
    kernel-3.7.0-0.rc6.git1.4.knurd.mel.riskier.1.fc19.i686

I was not able to trigger the problem on x86_64 vhost with a similar load (build + yum) with:
    kernel-3.7.0-0.rc6.git1.4.knurd.mel.riskier.1.fc19.x86_64


Switching to the "safe" kernels now ...
Comment 34 Thorsten Leemhuis 2012-11-27 04:06:07 EST
FYI, Mel's "safer" patch made it to mainline a few hours ago:
http://git.kernel.org/linus/82b212f40059bffd6808c07266a942d444d5558a

Will likely be part of the next rawhide build
Comment 35 John Ellson 2012-11-27 10:07:22 EST
Good...I'll just belatedly add that I have not seen any problems after running
    kernel-3.7.0-0.rc6.git1.4.knurd.mel.safe.1.fc19.i686
and
    kernel-3.7.0-0.rc6.git1.4.knurd.mel.safe.1.fc19.x86_64
for over 24 hours with my usual loads.
Comment 36 Mikhael Goikhman 2012-11-27 19:20:14 EST
Running 3.7.0-0.rc6.git1.4.knurd.mel.safe.1.fc19.i686 for 3 days without seeing this problem. There were some stack traces shown on boot however. Will try the next rawhide kernel.
Comment 37 Thorsten Leemhuis 2012-11-28 08:40:36 EST
(In reply to comment #34)
> FYI, Mel's "safer" patch made it to mainline a few hours ago:
> http://git.kernel.org/linus/82b212f40059bffd6808c07266a942d444d5558a
> 
> Will likely be part of the next rawhide build

And it looks like it will be removed again soon. For details see
http://thread.gmane.org/gmane.linux.kernel.mm/90911/ and
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91010

If anybody here wants to help testing the solution that Mel proposed in the latter mail, then simply grab kernels from this koji scratch build and give them a try: http://koji.fedoraproject.org/koji/taskinfo?taskID=4737301

Note, these kernels do not contain all of those patches that Fedora normally adds; but they should work fine on f18 and f19
Comment 38 Thorsten Leemhuis 2012-11-28 12:50:08 EST
(In reply to comment #37)
> (In reply to comment #34)
> > FYI, Mel's "safer" patch made it to mainline a few hours ago:
> > http://git.kernel.org/linus/82b212f40059bffd6808c07266a942d444d5558a
> > 
> > Will likely be part of the next rawhide build
> 
> And it looks like it will be removed again soon. For details see
> http://thread.gmane.org/gmane.linux.kernel.mm/90911/ and
> http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91010

There is a kernel currently building that additionally contains the patch Mel mentioned in http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91038

Should be ready in an hour or so (depending on the load of the builders):
http://koji.fedoraproject.org/koji/taskinfo?taskID=4738252
Comment 39 John Ellson 2012-11-29 09:38:45 EST
Re: Comment #38.  No luck.

The problem of 100%cpu usage by kswapd0 still exists in kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.2.fc18.x86_64

The problem reoccurred within 12 hours.
Comment 40 Bruno Wolff III 2012-11-29 10:05:16 EST
After about a day and a half my kswapd has only accumulated 39 seconds of CPU time. This is with the 3.7.0-0.rc7.git1.2.fc19.i686.PAE kernel from the rawhide nodebug repo.
Comment 41 John Ellson 2012-11-29 10:15:35 EST
Correction to Comment #39

Sorry, that should have said: kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.2.fc18.i686

The x86_64 host hasn't shown any problem yet.
Comment 42 Thorsten Leemhuis 2012-11-30 02:23:58 EST
John, could you read http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153 please?

Quoting one part:
> As the system is still responsive when this happens, any chance he
> could capture /proc/zoneinfo and /proc/vmstat when kswapd goes
> haywire?
>
> Or even run perf record -a -g sleep 5; perf report > kswapd.txt?
> 
> Preferably with this patch applied, to rule out faulty lowmem
> protection:

I'm building a kernel with that patch if you want to give it a try
Comment 43 Thorsten Leemhuis 2012-11-30 03:24:14 EST
(In reply to comment #42)
> I'm building a kernel with that patch if you want to give it a try

Find it at http://koji.fedoraproject.org/koji/taskinfo?taskID=4743064
Comment 44 John Ellson 2012-11-30 06:30:13 EST
Since Comment #41, I did not see the problem reoccur using kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.2.fc18.i686 (~20hours).

Re Comment #43, I've now installed and booted:
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686
and 
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.x86_64
and will monitor today.     I apologise that I am leaving town for a week this afternoon, so I only have about 10 hours for this to happen.

Is the syntax of:
   perf record -a -g sleep 5; perf report > kswapd.txt
correct?   It gives me:   
  callchain: Unknown -g option value: sleep

Johannes Weiner asked:
> This requires somebody to wake up kswapd regularly, though and from
> his report it's not quite clear to me if kswapd gets stuck or just has
> really high CPU usage while the system is still under load. 

I suspect that the problem is triggered by high load (yum update, and/or a build job), but once triggered the kswapd0 @ 100% cpu continues indefinitely even after all the loads have gone.
Comment 45 John Ellson 2012-11-30 07:07:26 EST
Created attachment 654960 [details]
/proc/vmstat while kswapd0 at 100%cpu
Comment 46 John Ellson 2012-11-30 07:10:44 EST
Created attachment 654961 [details]
/proc/zoneinfo with kswapd0 at 100% cpu
Comment 47 John Ellson 2012-11-30 07:12:55 EST
Well that didn't take very long...

Using kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.i686 on a kvm i686 vhost,  I did a  "make -j" and that triggered the problem (my normal scheduled builds don't use -j, so perhaps they don't trigger the problem as easily.)

I captured /proc/zoneinfo and /proc/vmstat  after the build had completed and the system was nominally idle.   I've attached them to this report.

I'll leave the system in its current state and run the perf later .. if someone can help me with the correct syntax?
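A single /proc/vmstat capture shows totals since boot, so two captures a few seconds apart may be more telling while the system is nominally idle. A rough Python sketch of that comparison follows. The helper names are hypothetical, and which counters matter is left to whoever reads the output:

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters

def changed(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] != before[name]}

def snapshot(path="/proc/vmstat"):
    """Read and parse the current counters from path."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking first = snapshot(), waiting a few seconds, then printing changed(first, snapshot()) lists only the counters that moved in between. On an otherwise idle system, scan-related counters that keep climbing would be consistent with kswapd still doing work.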
Comment 48 John Ellson 2012-11-30 07:24:31 EST
Created attachment 654977 [details]
kswapd.txt output from perf command with kswapd0 at 100% cpu
Comment 49 Thorsten Leemhuis 2012-11-30 07:26:32 EST
(In reply to comment #44)
>
> Is the syntax of:
>    perf record -a -g sleep 5; perf report > kswapd.txt
> correct?   It gives me:   
>   callchain: Unknown -g option value: sleep

Seems the behaviour of "-g" changed. From a quick look at the options I'm not sure how to exactly emulate the old behaviour with the new syntax.

Old:
    -g, --call-graph      do call-graph (stack chain/backtrace) recording
New:
    -g, --call-graph <mode[,dump_size]>
                          do call-graph (stack chain/backtrace) recording: [fp]

Not sure, maybe as a workaround try to use the older perf (the one from the F18 repos, not the one built together with the kernel)
Comment 50 John Ellson 2012-11-30 07:27:47 EST
Apparently "-g" can take a string parameter, so it was confused about "sleep"

Reordering -a -g seemed to work:

[root@rawhide ~]# perf record -g -a sleep 5; perf report > kswapd.txt
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.179 MB perf.data (~95183 samples) ]
no symbols found in /usr/bin/sleep, maybe install a debug package?
/usr/lib/libc-2.16.90.so was updated (is prelink enabled?). Restart the long running apps that use it!
[root@rawhide ~]# 


Result attached.
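For anyone puzzled by why the order matters: an option with an optional argument grabs the next word unless that word looks like another option. The toy model below uses Python's argparse to show the effect. It is not perf's real option parser, "toy-record" is a made-up name, and the "fp" default is only borrowed from the help text quoted in comment #49:

```python
import argparse

def build_parser():
    """Build a toy parser with a flag, an optional-argument option and a command."""
    parser = argparse.ArgumentParser(prog="toy-record")
    parser.add_argument("-a", action="store_true", help="system-wide")
    # -g accepts an optional mode; "fp" is used when no mode is given.
    parser.add_argument("-g", nargs="?", const="fp", default=None, metavar="MODE")
    parser.add_argument("command", nargs="*")
    return parser

def parse(argv):
    """Return (call-graph mode, command words) for the given argument list."""
    ns = build_parser().parse_args(argv)
    return ns.g, ns.command
```

With "-a -g sleep 5" the toy parser hands "sleep" to -g as its mode and leaves only "5" as the command. With "-g -a sleep 5" the word after -g is another option, so -g falls back to its default and "sleep 5" stays intact.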
Comment 51 John Ellson 2012-11-30 10:25:30 EST
I tried a couple of times, but have not been able to trigger the problem on the x86_64 vhost using kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.3.fc18.x86_64

In fact, I haven't seen the problem on x86_64 for some time.   It used to have a different character anyway - it used to SEGV and lock up the machine completely.

So, maybe the remaining bug is i686-only ?

Would it help to see if I can trigger the bug in the PAE kernel?
Comment 52 John Ellson 2012-11-30 11:10:12 EST
Not conclusive, but I tried multiple times with my "make -j" builds and concurrent "yum updates", and still can't trigger the problem on the PAE kernel.

I've no idea if this could possibly be relevant.  It's just that Bruno was reporting different symptoms using PAE (e.g. in Comment #40).
Comment 53 John Ellson 2012-11-30 11:46:26 EST
I went back to the non-PAE, i686 kernel and rechecked that the same load will re-trigger the problem relatively easily, and it does.

So, AFAICT, the problem only exists on i686 non-PAE kernels.

Not on i686 with-PAE, or x86_64
Comment 54 Bruno Wolff III 2012-11-30 12:18:34 EST
I had seen it on a PAE machine, but I have been running 3.7.0-0.rc7.git1.2.fc19.i686.PAE continuously for 2 and 1/2 days now and kswapd has only accumulated 1 minute and 5 seconds of CPU time. So the patches in that kernel seem to have fixed the problem for me.
Comment 55 Thorsten Leemhuis 2012-12-01 03:24:47 EST
(In reply to comment #54)
> I had seen it on a PAE machine, but I have been running
> 3.7.0-0.rc7.git1.2.fc19.i686.PAE continuously for 2 and 1/2 days now and
> kswapd has only accumulated 1 minute and 5 seconds of CPU time. So the
> patches in that kernel seem to have fixed the problem for me.

Bruno, the patch that likely helped your case was reverted last night in mainline (http://git.kernel.org/linus/a50915394f1fc02c2861d3b7ce7014788aa5066e ), as it had been foreseeable already; see Comment 37 

Anyway, it seems Hannes, thanks to John's data, was able to find a likely reason; see
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91300 

I'm building a kernel right now that contains the new patch from Hannes.
Comment 56 Thorsten Leemhuis 2012-12-01 04:46:46 EST
(In reply to comment #55)
> I'm building a kernel right now that contains the new patch from Hannes.

Find it at http://koji.fedoraproject.org/koji/taskinfo?taskID=4746265

Note, that kernel doesn't contain the patch mentioned in Comment 42 -- I assume that patch was just meant to make debugging easier
Comment 57 John Ellson 2012-12-02 17:35:30 EST
I just installed kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and ran my usual load that triggers the problem.  OK so far.  I'll check again in 24hours, but looking good so far.
Comment 58 Bruno Wolff III 2012-12-03 09:24:08 EST
I am seeing bursts of kswapd activity again with 3.7.0-0.rc7.git3.2.fc19.i686.PAE from the rawhide nodebug repo. I think this kernel likely has the set of patches that were planned for the 3.7 release.
I am seeing kswapd running at 90+% of a cpu while doing a yum update and an rsync.

top - 08:23:14 up  9:27,  8 users,  load average: 2.62, 2.62, 2.06
Tasks: 197 total,   2 running, 194 sleeping,   0 stopped,   1 zombie
%Cpu0  :  1.9 us, 28.8 sy,  0.0 ni,  0.0 id, 69.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  3.6 us, 89.1 sy,  0.0 ni,  5.5 id,  0.0 wa,  0.0 hi,  1.8 si,  0.0 st
KiB Mem:   2065708 total,  1894136 used,   171572 free,   382916 buffers
KiB Swap: 10482612 total,    25584 used, 10457028 free,   872336 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND           
   30 root      20   0     0    0    0 R  93.8  0.0  20:36.94 kswapd0           
 5773 root      20   0     0    0    0 S   5.3  0.0   0:05.50 kworker/0:2       
31279 root      20   0  256m 236m 8296 D   5.3 11.7   4:54.71 yum
Comment 59 Thorsten Leemhuis 2012-12-03 09:26:52 EST
(In reply to comment #58)
> I am seeing bursts of kswapd activity again with
> 3.7.0-0.rc7.git3.2.fc19.i686.PAE from the rawhide nodebug repo. I think this
> kernel likely has the set of patches that were planned for the 3.7 release.

Correct.

> I am seeing kswapd running at 90+% of a cpu while doing a yum update and an
> rsync.

Could you give the one from http://koji.fedoraproject.org/koji/taskinfo?taskID=4746265 a try and see if it helps?
Comment 60 Bruno Wolff III 2012-12-03 10:57:16 EST
I can't try it until tonight as the machine won't reboot (successfully) without someone being at the console.
Debug kernels slow down my disk I/O a lot. I don't know if that will have any impact on being able to trigger kswapd to go cpu bound.
Comment 61 Thorsten Leemhuis 2012-12-03 11:21:32 EST
(In reply to comment #60)
> Debug kernels slow down my disk I/O a lot. I don't know if that will have
> any impact on being able to trigger kswapd to go cpu bound.

My kernels are similar to release/nodebug kernels, so only basic debugging options enabled
Comment 62 John Ellson 2012-12-03 20:13:32 EST
Good news.

I've now been running both
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686
and
  kernel-3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.x86_64
for over 24hours with no evidence of problems with kswapd
Comment 63 Bruno Wolff III 2012-12-04 12:44:56 EST
So far with 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE kswapd is only getting small amounts of time when the system has lots of memory and cpu use. With the previous kernel I was trying I wasn't reliably triggering the kswapd high cpu usage, so I am not claiming victory yet.
Comment 64 Thorsten Leemhuis 2012-12-05 04:41:19 EST
TWIMC: I've rebased my test kernels to rc8, which contains some of the patches that are in 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18 

Find 3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18 here(¹):
http://koji.fedoraproject.org/koji/taskinfo?taskID=4758387

It now contains only these two patches:
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91153
http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91300

John, it would be good if you could give it a try, as the former of those two patches was not in 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18; it shouldn't matter much (see http://thread.gmane.org/gmane.linux.kernel.mm/90911/focus=91425 ), but might be good to know

(¹) sorry for the confusion, forgot to do a s/kswap.4/kswap.5/ :-/
Comment 65 John Ellson 2012-12-05 18:33:32 EST
(Sorry for the delay. I'm on vacation in the UK this week.)

First, the 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18 were still looking good (i.e. no kswapd0 lockups) after ~48hours.

I've just rebooted into:
  kernel-3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18.i686
and
  kernel-3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18.x86_64
and run my load test that used to trigger the problem.

No problems so far.  I'll check back again in ~24hours.

One other observation.   When watching the previous kernel with "top", I didn't see any lockups, but I did see kswapd0 at 30% cpu at times.  With this latest kernel I didn't see kswapd0 above 2%.   Both i686 and x86_64 were similar in this respect.    I haven't attempted to repeat this test to be 100% certain about this observation.
Comment 66 Bruno Wolff III 2012-12-06 12:37:47 EST
I am now at a bit over 2 and 1/2 days and kswapd has accumulated 1 minute 53 seconds of CPU time with 3.7.0-0.rc7.git1.2.van.main.knurd.kswap.4.fc18.i686.PAE. This seems reasonable.
Comment 67 John Ellson 2012-12-06 16:15:32 EST
Both kernels:
    kernel-3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18.i686
    kernel-3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18.x86_64
still ok after ~24hours

I just reran my load test, and this time I did notice kswapd0 briefly at 24%cpu,
so I suspect my observation in Comment #65 should just be ignored.
Comment 68 Mikhael Goikhman 2012-12-10 18:18:46 EST
Running 3.7.0-0.rc8.git0.1.van.main.knurd.kswap.4.fc18.i686 for 4 days, no problem so far, the run time of kswapd0 is 0:00.28.

Prior to this, ran the latest to that moment rawhide kernel 3.7.0-0.rc7.git3.1.fc19.i686 and easily got the kswapd0 problem.
Comment 69 Bruno Wolff III 2012-12-13 15:16:44 EST
3.7.0-1.fc19.i686.PAE is looking good. I have accumulated 31 seconds of CPU time for kswapd in 42 hours of uptime.
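To put the CPU-time figures reported in this bug on a common scale, they can be turned into a share of uptime. A small Python sketch, assuming top's TIME+ column reads as minutes:seconds (with optional hundredths); the function names are just for illustration:

```python
def top_time_to_seconds(time_plus):
    """Convert a top TIME+ value such as '2681:19' or '20:36.94' to seconds."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) * 60 + float(seconds)

def cpu_share(cpu_seconds, uptime_seconds):
    """Fraction of uptime that the process spent on a CPU."""
    return cpu_seconds / uptime_seconds

# Original report: TIME+ 2681:19 after 2 days, 15:55 of uptime.
broken = cpu_share(top_time_to_seconds("2681:19"), (2 * 24 + 15) * 3600 + 55 * 60)

# This comment: 31 seconds of CPU time in 42 hours of uptime.
fixed = cpu_share(31, 42 * 3600)

print(f"original report: {broken:.1%}, 3.7.0-1.fc19: {fixed:.3%}")
```

By that measure kswapd0 was on a CPU for roughly 70% of the uptime in the original report, against about 0.02% here.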
Comment 70 Fedora End Of Life 2013-04-03 11:41:44 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
