Bug 1060865 - Dozens of uninterruptable processes and 75-100% CPU waiting for IO [NEEDINFO]
Summary: Dozens of uninterruptable processes and 75-100% CPU waiting for IO
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-02-03 19:13 UTC by Håkon Løvdal
Modified: 2014-06-23 14:41 UTC (History)
5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-06-23 14:41:14 UTC
Type: Bug
Embargoed:
jforbes: needinfo?


Attachments
Capture from vmstat covering one light freeze (26.84 KB, text/plain)
2014-02-03 19:13 UTC, Håkon Løvdal
Capture from vmstat covering another light freeze (23.04 KB, text/plain)
2014-02-04 16:56 UTC, Håkon Løvdal
Capture from vmstat covering a serious problem occurrence (1.26 MB, application/gzip)
2014-03-01 05:06 UTC, Håkon Løvdal
Plot of a subset of the vmstat values (14.47 KB, image/png)
2014-03-01 05:11 UTC, Håkon Løvdal
Capture from vmstat with kernel 3.13.5-103.fc19.x86_64 (185.69 KB, application/octet-stream)
2014-03-14 16:30 UTC, Håkon Løvdal

Description Håkon Løvdal 2014-02-03 19:13:23 UTC
Created attachment 858802 [details]
Capture from vmstat covering one light freeze

Short summary:

From time to time (the likelihood and severity increase over time), something triggers the kernel to spend almost all its time waiting for IO while a lot of processes are blocked. The machine eventually recovers (after minutes or possibly hours), but during the slowdown it is, for all interactive purposes, frozen.

Details:

Right after boot everything is really snappy and there are no signs of any slowdown. Over time the likelihood of grind-to-a-halt stops increases. In the beginning these stops are "just" near-halts, in that the UI becomes very unresponsive (e.g. the mouse pointer is barely updated, and any action like Alt-Tab takes possibly minutes to complete). But the machine eventually recovers, and then I can continue using it normally (until the next freeze...).

In the beginning these stops are somewhat light and do not last too long, but eventually the length and severity increase, and continuing to use the machine becomes a sort of Russian roulette where suddenly something is triggered that takes hours to recover from. The freezes seem to be triggered by disk activity or memory allocation (?). Opening a new tab in Firefox, Opera or Chrome can, for instance, be the trigger.

I have started running vmstat with its output captured continuously to try to analyse what is going on. The symptoms observed are that during these freezes the number of processes in uninterruptible sleep is very high (typically in the range 10 to 30), and the percentage of CPU time spent waiting for IO is typically in the range of 75% to 100%. The machine has 4 CPUs. Very typically the buffers are drained to almost nothing once this starts, but not always; I have also seen cases where there is little effect on buffers. The worst load average I have observed is 72 (but this is with sampling only once per minute, so it might actually have been higher than that).


I am attaching a short log showing an example of such a light freeze. It was triggered when I did a git rebase, removing one commit 14 commits behind HEAD. This was in some web GUI code where I also had "grunt server" running, which triggers a recompile every time the source code is changed. During the rebase this of course happens over and over again very rapidly, so there is some intense disk activity going on, which triggered this behaviour.

You can clearly see the behaviour from the end of the second minute in, continuing for around two and a half minutes. The tiny period with some IO-wait percentage after that is me doing some normal disk activity (checking out a new branch or something, I do not remember exactly), but notice that there are at most three blocked processes here, so this was not part of the freeze.

This is not a new problem. While it has lately turned into a very serious issue that occurs quite frequently, and which I have invested some time trying to figure out, I think I have experienced this problem (less frequently and with less impact) for as far back as one to two years. While this now hits my laptop at work consistently, I have experienced it on my private machines at home as well (at least I believe it to be the same problem; I long suspected horrible swap performance, but from my investigations with vmstat I can see that swap is not involved at all).
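The freeze symptoms above (many processes in uninterruptible sleep plus very high IO-wait) can be picked out of a vmstat capture mechanically. A minimal sketch, assuming the capture is in a file named vmstat.log (hypothetical name) and uses the default `vmstat -n 1` column layout, where column 2 is "b" and column 16 is "wa":

```shell
# Flag vmstat samples that match the described freeze: at least 10
# processes in uninterruptible sleep (column 2, "b") and at least 75%
# CPU waiting for IO (column 16, "wa"). The numeric guard on $2 skips
# the two header lines vmstat prints.
awk '$2 ~ /^[0-9]+$/ && NF == 16 && $2 >= 10 && $16 >= 75 { print NR, $0 }' vmstat.log
```

The 10/75 thresholds are taken straight from the symptom description; they are a starting point, not a calibrated rule.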

Comment 1 Håkon Løvdal 2014-02-04 16:56:57 UTC
Created attachment 859252 [details]
Capture from vmstat covering another light freeze

Another log. Similar scenario as the previous log.

The two lines

 0 10 441668 414040   3752 359076    0    0  1560    40  444 1334  1  0 74 25
 1  1 441668 411140   3752 358516   32    0  5420     0 1067 4342  8  4 22 67

at the end were triggered by me switching to another virtual workspace (to make a copy of the capture file), i.e. just normal short-term please-be-patient-while-I-read-something-from-disk unresponsiveness.

Comment 2 Håkon Løvdal 2014-03-01 05:06:31 UTC
Created attachment 869276 [details]
Capture from vmstat covering a serious problem occurrence

I have previously attached logs of very mild occurrences of the problem. Here is an example of one of the worst occurrences.

The log is a capture of running

	while uptime; do env LC_ALL=C date; free -m; cat /proc/sys/fs/file-nr; vmstat -n 1 60; done

Compare the responsiveness at the beginning of the log with the end. Under normal conditions, ten rounds of the while loop take ten minutes. At the end they take two hours!

$ grep Feb typescript.vmstat.026 | head
Tue Feb 11 15:00:07 CET 2014
Tue Feb 11 15:01:06 CET 2014
Tue Feb 11 15:02:05 CET 2014
Tue Feb 11 15:03:04 CET 2014
Tue Feb 11 15:04:04 CET 2014
Tue Feb 11 15:05:03 CET 2014
Tue Feb 11 15:06:02 CET 2014
Tue Feb 11 15:07:01 CET 2014
Tue Feb 11 15:08:00 CET 2014
Tue Feb 11 15:08:59 CET 2014
$ grep Feb typescript.vmstat.026 | tail
Thu Feb 13 07:19:19 CET 2014
Thu Feb 13 07:21:39 CET 2014
Thu Feb 13 07:29:41 CET 2014
Thu Feb 13 07:52:10 CET 2014
Thu Feb 13 07:54:40 CET 2014
Thu Feb 13 08:51:53 CET 2014
Thu Feb 13 08:53:05 CET 2014
Thu Feb 13 08:54:12 CET 2014
Thu Feb 13 08:55:43 CET 2014
Thu Feb 13 09:12:36 CET 2014
$
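The slowdown can also be quantified directly from those timestamps: each iteration of the while loop normally takes about 60 seconds, so the gap between successive date lines shows how badly the machine stalled. A sketch, assuming GNU date and the capture file name from the log:

```shell
# Print the number of seconds between each pair of successive timestamps
# in the capture; values far above ~60 mark iterations where the loop
# (and hence the machine) was stalled.
grep Feb typescript.vmstat.026 \
  | while read -r stamp; do date -d "$stamp" +%s; done \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
```

For the tail of the log above, the gap from 07:29:41 to 07:52:10 alone is about 22 minutes for what should be a one-minute iteration.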

This occurrence started Wednesday at 11:32, and I let the machine sit throughout the day in the hope that it would recover (borrowing a colleague's PC meanwhile). It did not, and when it still had not recovered the next morning I rebooted by cutting the power.

This is with a stock 3.13.0 kernel compiled manually, i.e. not the kernel from the official Fedora kernel package, just to eliminate any potential problem with the patches applied there.

Comment 3 Håkon Løvdal 2014-03-01 05:11:42 UTC
Created attachment 869277 [details]
Plot of a subset of the vmstat values

The graphs for wa, id and b clearly show the problem starting at around sample number 7500. There are also some insanely high spikes for in and cs, although these occur some time after the problem starts, so they are probably more an effect than a cause.
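A plot like the attached one can be prepared from the raw capture by pulling out just the columns of interest. A sketch, assuming hypothetical file names and the default vmstat layout (column 2 = b, 15 = id, 16 = wa):

```shell
# Extract sample number plus the b, id and wa columns from a vmstat
# capture into a whitespace-separated file that gnuplot (or any plotting
# tool) can consume; the numeric guard on $2 skips vmstat's header lines.
awk '$2 ~ /^[0-9]+$/ && NF == 16 { print NR, $2, $15, $16 }' vmstat.log > vmstat-plot.dat
```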

I have been running Linux since 1994 and have even tried swapping to /dev/fd0 back in the day. This is like nothing I have ever experienced before. I assume I am just unlucky enough to trigger this problem through my usage pattern, because it cannot be a common problem; otherwise it would be well known, with serious effort invested by the kernel developers in solving it. So even though it probably only hits me, can you please put a little investigation effort in? My work PC is next to unusable unless I reboot daily or even more often.

Are there any debug configs or patches I should apply for testing?

Comment 4 Håkon Løvdal 2014-03-01 05:14:39 UTC
Another clear sign that something is definitely not right: look at the last two minutes of the log typescript.vmstat.026. First the machine is constantly locked at "exactly" 75% CPU IO-wait + 25% user (also the case for the several preceding minutes). Then the last minute is nothing but 99 or 100% IO-wait (before the power was cut).

It is impossible that this is a normal usage pattern! There is clearly a bug/problem somewhere.


 08:55:27 up 1 day, 17:59, 21 users,  load average: 23,69, 30,80, 36,46
Thu Feb 13 08:55:43 CET 2014
             total       used       free     shared    buffers     cached
Mem:          3628       3516        112          0          0        125
-/+ buffers/cache:       3390        237
Swap:         8199       2899       5300
21568	0	367857
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 30 2969164 112304    196 128104   14   16   465    21   14   26  8  1 51 41
 1 31 2969136 112432    188 127608   52    0  3948     4 2722 1197 24  1  0 74
 1 30 2969136 112084    184 128332   28    0  3116     0 1297  602 25  1  0 75
 1 30 2969112 112704    188 127420   20    0  5944    28 1894  866 24  1  0 74
 2 26 2969112 110728    184 129540   76    0  7876     0 2860 1220 24  1  0 74
 1 27 2969100 112036    192 127876  100    0  4072     4 1472  800 25  2  0 74
 1 28 2969100 111440    196 128504   76    0  4760    56 1264  532 25  1  0 75
 1 30 2969096 113056    196 126616   64    0 11340    28 2153  877 24  2  0 74
 1 28 2969068 111844    204 128124    0    0  4428     0 2912 1319 24  1  0 74
 1 29 2969068 111844    208 128020   28    0  3884     8 1295  667 24  1  0 74
 2 29 2969064 111844    356 127844   60    0  3432     0 1429  551 25  1  0 74
 1 30 2969056 112464    460 127228   56    0  4760    16 1348  597 24  1  0 74
 1 28 2969036 111584    464 128136   88    0  6736     4 1590  745 24  1  0 75
 2 27 2969008 112312    312 127716   64    0  5688     0 2019  808 24  1  0 74
 1 26 2968992 112016    248 127384   56    0  4744     0 1729  765 24  1  0 74
 4 25 2968964 111504    248 128356   56    0  5068     0 1514  611 25  1  0 74
 1 27 2968964 112024    196 127700   40    0  5188     0 1781  719 24  2  0 74
 1 26 2968960 112364    196 127000   72    0  4148     0 1479  646 25  1  0 75
 1 29 2968932 110284    196 128936   88    0  5676     4 1444  672 25  1  0 74
 2 27 2968904 112240    196 126904  112    0  4616     0 1798  703 25  1  0 74
 3 26 2968892 111052    196 127544   60    0  6256     0 1319  573 24  1  0 75
 2 27 2968840 111552    196 127188   48    0  4556     0 1657  664 24  1  0 74
 1 26 2968800 111328    196 127164   28    0  4072     0 1429  628 25  1  0 74
 1 25 2970120 114976    196 130048   72 1328  4712  1328 55590  552 24  4  0 73
 1 29 2970084 116696    196 128004   76    0  4304     0 1435  572 25  2  0 74
 1 30 2970048 116512    196 128236   56    0  5164     0 1341  551 24  1  0 74
 1 32 2970044 116216    196 128208   56    0  4236     0 1283  585 25  1  0 75
 1 30 2970032 116836    196 128492   40    0  3492     0 1310  526 24  1  0 74
 1 30 2970024 114976    196 130568   56    0  4020     0 1257  517 24  1  0 74
 1 31 2970020 117276    196 128572   44    0  3380     0 1322  547 25  1  0 74
 2 29 2970012 114356    200 131852   56    0  3492     0 1223  505 24  1  0 75
 1 28 2970012 111380    200 135476   80    0  3664     0 1256  506 25  1  0 74
 1 22 2970012 107908    200 138992   64    0  3440     0 1202  521 25  1  0 75
 2 22 2970012 115844    204 129504   84    0  2296     0 1372  547 25  1  0 74
 1 23 2969920 118340    196 126320   84    0  2492     0 1421  583 25  2  0 74
 1 21 2969908 119068    196 125848   36    0  2188     0 1415  653 25  1  0 74
 2 22 2969908 118572    196 126428   56    0  2064     0 1298  593 25  1  0 74
 1 29 2969908 118076    192 127832   84    0  2964     0 1268  589 25  1  0 74
 1 28 2969908 116340    192 129020   72    0  3212     0 1301  569 24  1  0 75
 2 26 2969908 112496    192 132596   32    0  3728     0 1210  497 25  1  0 75
 1 28 2969900 116836    192 127816   32    0  4484     0 1521  757 25  2  0 74
 1 28 2969900 116588    192 128804   12    0  3768     0 1258  522 25  1  0 75
 1 23 2969900 113860    192 131348   16    0  3996     0 1259  576 25  1  0 74
 1 24 2969888 111596    192 133948   72    0  2492     0 1209  492 25  1  0 75
 1 26 2969888 110080    192 134812   64    0  2796     0 1244  572 25  1  0 74
 1 25 2969868 115564    192 129092   72    0  2224     0 1436  583 25  1  0 74
 1 26 2969856 116804    192 127488   72    0  3364     0 1478  596 25  2  0 74
 1 27 2969856 114572    192 130444   28    0  3632     0 1207  484 25  1  0 75
 1 26 2969856 110480    192 134628   84    0  4204     0 1311  683 25  1  0 74
 1 23 2969824 116676    192 127320  120    0  3160     0 1436  593 25  1  0 74
 1 27 2969816 114740    192 129592   20    0  4044     0 1268  582 25  1  0 74
 1 26 2969800 114572    196 130180   40    0  3320     0 1347  538 24  1  0 75
 2 26 2969780 110108    196 134732   16    0  4616     0 1218  522 25  1  0 74
 1 24 2969776 114572    196 129372   88    0  4088     0 1459  600 25  1  0 74
 1 23 2969776 113704    196 130256   96    0  3840     0 1440  621 24  1  0 74
 1 22 2969776 113952    196 129788   88    0  3396     0 1315  561 25  1  0 74
 2 24 2969776 110356    196 133828   72    0  3872     0 1212  522 25  1  0 75
 2 24 2969776 114868    196 128608   72    0  3664     0 1420  600 25  1  0 74
 1 26 2969776 115924    196 127764  112    0  3808     0 1424  606 24  1  0 74
 2 24 2969776 115112    196 128676  112    0  3208     0 1350  578 24  2  0 74
 09:12:16 up 1 day, 18:16, 21 users,  load average: 49,73, 47,78, 43,55
Thu Feb 13 09:12:36 CET 2014
             total       used       free     shared    buffers     cached
Mem:          3628       3522        105          0          0        133
-/+ buffers/cache:       3388        240
Swap:         8199       2904       5295
21536	0	367857
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0 49 2974148 113764    204 132128   14   16   467    21   16   27  8  1 50 41
 1 49 2974144 111408    204 134496   40    0  2404     0  223  305  0  0  0 100
 0 45 2974012 108688    344 137128   68    0  2828     0  282  383  0  0  0 99
 0 47 2974012 105836    352 140404   64    0  3224    36  199  346  0  1  0 100
 0 46 2974012 116316    228 129224   96    0  4920    88  371  601  0  1  0 99
 0 53 2974012 117244    328 128584    0    0  5116     4  492  693  0  1  0 99
 0 53 2974012 114144    328 131940   32    0  3284     0  256  357  0  0  0 100
 0 50 2974012 110424    328 135784   64    0  4140     0  244  395  0  1  0 100
 1 50 2974012 106580    328 139476   64    0  3580     0  237  394  0  0  0 100
 0 49 2974012 115620    236 128984   64    0  3488     0  272  436  0  1  0 99
 0 51 2974012 118008    224 126208   64    0  3428     4  307  410  0  1  0 99
 0 46 2974012 115880    224 129260   32    0  3488     0  239  365  0  0  0 100
 1 47 2974012 112160    224 133152   32    0  3800     0  216  327  0  0  0 100
 0 48 2974012 116740    224 128056    0    0  8248     4  354  551  1  1  0 99
 0 46 2974012 115756    224 129756    0    0  3280     0  454  675  0  0  0 100
 0 46 2974012 112904    224 131828   48    0  2344     0  195  292  0  0  0 100
 0 45 2974000 110920    224 134352   52    0  2640     0  211  365  0  1  0 99
 0 44 2974000 107448    224 137852   52    0  3644     0  278  487  0  0  0 100
 0 41 2973996 114512    220 129116    0    0  2912     0  277  456  0  1  0 99
 0 39 2973976 117184    220 126048   72    0  4128     0  489  809  0  1  0 99
 0 38 2973956 116616    220 127752   64    0  3028     0  228  392  0  0  0 100
 0 38 2973956 113764    220 130700   32    0  3116     0  173  303  0  0  0 100
 0 36 2973956 111160    220 133224    0    0  2364     0  179  346  0  0  0 100
 0 38 2973952 115628    220 129124    0    0  3064     0  272  453  0  1  0 99
 1 40 2973952 112904    220 132320   20    0  3316     0  210  338  0  0  0 100
 0 40 2973952 108936    220 137140   24    0  4624     0  381  333  5  0  0 95
 0 39 2973944 114444    220 129396  104    0  3464     0  324  580  0  1  0 99
 0 36 2973936 117936    220 126292    0    0  4488     0  387  557  0  1  0 99
 0 39 2973932 113060    220 131420   56    0  4476     0  266  364  0  0  0 100
 0 36 2973928 110084    220 134924    0    0  3448     0  244  336  0  1  0 99
 0 34 2973924 113556    220 131460    0    0  3276     0  318  364  2  1  0 98
 0 35 2973924 110456    220 134340  124    0  3472     0  192  291  0  0  0 100
 0 35 2973924 107480    224 137620  176    0  3144     0  216  301  0  0  0 100
 0 34 2973924 106980    224 137296  104    0  1916     0  250  388  0  0  0 100
 0 34 2973924 116280    220 127784  100    0  1812     0  292  450  0  1  0 99
 0 35 2973920 116900    220 126364  104    0  1492     0  338  520  0  1  0 99
 1 33 2973920 117896    220 124924    0    0  2136     0  257  427  0  0  0 100
 0 36 2973904 115044    220 128304   80    0  3052     0  219  332  0  0  0 100
 0 36 2973904 111572    220 132004   20    0  3708     0  217  337  0  1  0 100
 1 36 2973888 114172    220 130076  136    0  3788     0  255  344  0  0  0 99
 0 36 2973888 111324    220 133096    0    0  3108     0  228  284  0  0  0 99
 0 36 2973884 108100    220 136320   32    0  3440     0  269  413  0  0  0 100
 1 44 2973776 114400    220 128748  788    0  2748     0  258  437  0  1  0 99
 0 45 2973644 115588    220 125800 1296    0  4080     0  477  456  5  1  0 94
 0 45 2973456 116248    220 124596 1852    0  6888     0  322  466  0  1  0 99
 0 50 2973012 113524    220 125616 1364    0  6372     0  616  908  0  1  0 99
 0 53 2973712 112904    220 126048 1084  840  4624   840  295  462  0  1  0 99
 0 53 2974404 113028    220 126420  304  764  4184   764  279  417  0  0  0 100
 0 52 2974328 112892    220 125708  644    0  4160     0  255  390  0  1  0 99
 1 53 2974276 112036    220 126544  496    0  4784     0  283  432  0  0  0 100
 0 52 2974252 111416    220 126060  580    8  4720     8  283  429  0  1  0 99
 0 51 2974092 111612    220 125636  332    0  4260     0  285  430  0  0  0 100
 1 42 2974080 111168    220 125324   12    0  3084     0  299  423  0  1  0 99
 0 41 2974072 111412    220 125884   36    0  4936     4  334  405  2  1  0 98
 0 42 2974072 111396    216 126304    0    0  3840    32  368  623  0  1  0 99
 0 42 2974200 111064    208 126112   32  128  4852
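The "locked at 25% user + 75% IO-wait" pattern in the block above can be confirmed numerically by averaging the us and wa columns (13 and 16 in the default layout) over a capture. A sketch, with a hypothetical file name:

```shell
# Average the us (column 13) and wa (column 16) percentages over all
# data samples in a vmstat capture, skipping the header lines; on the
# block above this should come out near us=25, wa=75.
awk '$2 ~ /^[0-9]+$/ && NF == 16 { us += $13; wa += $16; n++ }
     END { if (n) printf "avg us=%.1f wa=%.1f over %d samples\n", us/n, wa/n, n }' vmstat.log
```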

Comment 5 Justin M. Forbes 2014-03-10 14:51:43 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.13.5-100.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 6 Håkon Løvdal 2014-03-14 16:30:02 UTC
Created attachment 874513 [details]
Capture from vmstat with kernel 3.13.5-103.fc19.x86_64

The problem starts after "Fri Mar 14 13:36:55 CET 2014", going on and off a little until "Fri Mar 14 13:56:06 CET 2014", after which the situation is permanently poor.
I rebooted at the end as I did not have time to let it recover by itself.

Overall this was far from one of the worst occurrences I have experienced; the machine was only very, very unresponsive, not totally frozen.

Comment 7 Håkon Løvdal 2014-03-14 16:32:18 UTC
Notice: I will soon lose access to this computer where I get the issue almost daily, so if there are any patches I should try out, by all means do not wait!

Comment 8 Justin M. Forbes 2014-05-21 19:31:03 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.14.4-100.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 9 Justin M. Forbes 2014-06-23 14:41:14 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

