Created attachment 858802 [details]
Capture from vmstat covering one light freeze

Short summary: From time to time (with likelihood and severity increasing over time) something triggers the kernel to spend almost all of its time waiting for IO while a large number of processes are blocked. The machine eventually recovers (after minutes, or possibly hours), but during the slowdown it is, for all interactive purposes, frozen.

Details: Right after boot everything is really snappy and there are no signs of any slowdown. Over time, the likelihood of these grind-to-a-halt stops increases. In the beginning they are "just" near-stops, in that the UI becomes very unresponsive (e.g. the mouse pointer is barely updated, and any action like Alt-Tab can take minutes to complete), but the machine eventually recovers and I can continue using it normally, until the next freeze. Eventually the length and severity increase, and continuing to use the machine becomes a sort of Russian roulette where something is suddenly triggered that takes hours to recover from.

The freezes seem to be triggered by disk activity or memory allocation (?). Opening a new tab in Firefox, Opera or Chrome can, for instance, be the trigger.

I have started running vmstat with its output captured continuously, to try to analyse what is going on. The symptoms observed during these freezes are that the number of processes in uninterruptible sleep is very high (typically in the range 10 to 30) and the percentage of CPU time spent waiting for IO is typically in the range of 75% to 100%. The machine has 4 CPUs. Typically the buffers are drained to almost nothing once this starts, but not always; I have also seen cases where there is little effect on the buffers.
The worst load average I have observed is 72 (and since I was sampling only once per minute it might actually have been higher than that).

I am attaching a short log showing an example of such a light freeze. It was triggered when I did a git rebase, removing one commit 14 commits behind HEAD. This was in some web GUI code where I also had "grunt server" running, which triggers a recompile every time the source code changes. During the rebase this of course happens over and over again very rapidly, so there was some intense disk activity going on, which triggered this behaviour. You can clearly see the behaviour starting near the end of the second minute, and it continues for around two and a half minutes. The short period with some IO-wait percentage after that is me doing some normal disk activity (checking out a new branch or something, I do not remember exactly); note that there are at most three blocked processes there, so that was not part of the freeze.

This is not a new problem. While it has lately turned into a very serious issue that occurs quite frequently, and into which I have invested some investigation time, I think I have experienced this problem (less frequently and with less impact) for as far back as one to two years. While it now hits my laptop at work consistently, I have experienced it on my private machines at home as well (at least I believe it to be the same problem; for a long time I suspected horrible swap performance, but from my vmstat investigations I can see that swap is not involved at all).
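A small helper along these lines (my own sketch, not something from this report) could scan a vmstat capture and flag the samples matching the freeze signature described above. The thresholds are my guesses based on the symptoms: at least 10 blocked processes, or at least 75% of CPU time in iowait. It assumes the 16-column `vmstat -n` layout seen in the attached logs.

```shell
# flag_freezes: read a vmstat capture on stdin, print samples that look
# like a freeze.  Column 2 is b (processes in uninterruptible sleep),
# column 16 is wa (%CPU waiting for IO); thresholds are assumptions.
flag_freezes() {
    awk 'NF == 16 && $1 ~ /^[0-9]+$/ {
        b = $2; wa = $16                      # blocked procs, %iowait
        if (b >= 10 || wa >= 75)
            print "FREEZE? b=" b " wa=" wa ": " $0
    }'
}
```

Usage would be something like `flag_freezes < typescript.vmstat.026` to pull the frozen intervals out of a long capture.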
Created attachment 859252 [details]
Capture from vmstat covering another light freeze

Another log, covering a similar scenario to the previous one. The two lines

 0 10 441668 414040 3752 359076    0    0  1560    40  444 1334  1  0 74 25
 1  1 441668 411140 3752 358516   32    0  5420     0 1067 4342  8  4 22 67

at the end were triggered by me switching to another virtual workspace (to make a copy of the capture file), i.e. just the normal, short-lived please-be-patient-while-I-read-something-from-disk unresponsiveness.
Created attachment 869276 [details]
Capture from vmstat covering a serious problem occurrence

I have previously attached logs with super-mild occurrences of the problem. Here is an example of the worst kind. The log is a capture of running

while uptime; do env LC_ALL=C date; free -m; cat /proc/sys/fs/file-nr; vmstat -n 1 60; done

Compare the responsiveness at the beginning of the log with the end. Under normal conditions, ten rounds of the while loop take ten minutes. At the end they take two hours!

$ grep Feb typescript.vmstat.026 | head
Tue Feb 11 15:00:07 CET 2014
Tue Feb 11 15:01:06 CET 2014
Tue Feb 11 15:02:05 CET 2014
Tue Feb 11 15:03:04 CET 2014
Tue Feb 11 15:04:04 CET 2014
Tue Feb 11 15:05:03 CET 2014
Tue Feb 11 15:06:02 CET 2014
Tue Feb 11 15:07:01 CET 2014
Tue Feb 11 15:08:00 CET 2014
Tue Feb 11 15:08:59 CET 2014
$ grep Feb typescript.vmstat.026 | tail
Thu Feb 13 07:19:19 CET 2014
Thu Feb 13 07:21:39 CET 2014
Thu Feb 13 07:29:41 CET 2014
Thu Feb 13 07:52:10 CET 2014
Thu Feb 13 07:54:40 CET 2014
Thu Feb 13 08:51:53 CET 2014
Thu Feb 13 08:53:05 CET 2014
Thu Feb 13 08:54:12 CET 2014
Thu Feb 13 08:55:43 CET 2014
Thu Feb 13 09:12:36 CET 2014
$

This occurrence started Wednesday at 11:32 and I let it sit throughout the day in the hope that it would recover (borrowing a colleague's PC meanwhile). It did not, and when it still had not recovered the next morning I rebooted by cutting the power.

This is with a stock 3.13.0 kernel compiled manually, i.e. not the kernel from the official Fedora kernel package, just to eliminate any potential problem with the patches applied there.
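The per-round slowdown visible in the grep output above can be quantified with a sketch like this (my own helper, not part of the original capture): it converts the timestamp lines into epoch seconds and prints the duration of each round. It assumes GNU date and the LC_ALL=C date format shown above, e.g. "Tue Feb 11 15:00:07 CET 2014".

```shell
# round_durations: read the capture (or its grep'd date lines) on stdin
# and print seconds elapsed between consecutive rounds of the while loop.
round_durations() {
    grep -E '^[A-Z][a-z]{2} [A-Z][a-z]{2} ' |
    while read -r stamp; do
        date -d "$stamp" +%s             # epoch seconds for each round
    done |
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}
```

Run as `round_durations < typescript.vmstat.026`: under normal conditions every value should be about 60; in the tail of this log the rounds stretch to over an hour.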
Created attachment 869277 [details]
Plot of a subset of the vmstat values

The graphs for wa, id and b show the problem very clearly, starting at around sample number 7500. There are also some insanely high spikes for in and cs, although these occur some time after the problem starts, so they are probably more an effect than a cause.

I have been running Linux since 1994 and have even tried swapping to /dev/fd0 back in the day. This is like nothing I have ever experienced before. I assume I am just unlucky enough to trigger this problem through my usage pattern, because it cannot be a common one; otherwise it would have been well-known news, with serious effort invested by the kernel developers to solve it. So even though it probably only hits me, could you please put a little investigation effort in? My work PC is next to unusable unless I reboot daily or even more often. Are there any debug configs or patches I should apply for testing?
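For reference, the plotted columns can be pulled out of a capture with something like the following (my own sketch; the report does not say how the plot was actually produced). It emits sample number, b, id and wa as CSV that gnuplot or a spreadsheet can consume, again assuming the 16-column `vmstat -n` layout used in these logs.

```shell
# vmstat_to_csv: read a vmstat capture on stdin, write CSV with the
# columns used in the attached plot (b = blocked, id = %idle, wa = %iowait).
vmstat_to_csv() {
    awk 'BEGIN { print "sample,b,id,wa" }
         NF == 16 && $1 ~ /^[0-9]+$/ {
             print ++n "," $2 "," $15 "," $16
         }'
}
```

For example: `vmstat_to_csv < typescript.vmstat.026 > vmstat.csv`.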
Another clear sign that something is definitely not right: look at the last two minutes of the log typescript.vmstat.026. First the machine is constantly locked at "exactly" 75% CPU waiting + 25% user (as it had been for several minutes preceding). Then the last minute is nothing but 99 or 100% waiting (before the power was cut). It is impossible that this is a normal usage pattern! There is clearly a bug/problem somewhere.

08:55:27 up 1 day, 17:59, 21 users, load average: 23,69, 30,80, 36,46
Thu Feb 13 08:55:43 CET 2014
             total       used       free     shared    buffers     cached
Mem:          3628       3516        112          0          0        125
-/+ buffers/cache:       3390        237
Swap:         8199       2899       5300
21568 0 367857
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2 30 2969164 112304 196 128104 14 16 465 21 14 26 8 1 51 41
 1 31 2969136 112432 188 127608 52 0 3948 4 2722 1197 24 1 0 74
 1 30 2969136 112084 184 128332 28 0 3116 0 1297 602 25 1 0 75
 1 30 2969112 112704 188 127420 20 0 5944 28 1894 866 24 1 0 74
 2 26 2969112 110728 184 129540 76 0 7876 0 2860 1220 24 1 0 74
 1 27 2969100 112036 192 127876 100 0 4072 4 1472 800 25 2 0 74
 1 28 2969100 111440 196 128504 76 0 4760 56 1264 532 25 1 0 75
 1 30 2969096 113056 196 126616 64 0 11340 28 2153 877 24 2 0 74
 1 28 2969068 111844 204 128124 0 0 4428 0 2912 1319 24 1 0 74
 1 29 2969068 111844 208 128020 28 0 3884 8 1295 667 24 1 0 74
 2 29 2969064 111844 356 127844 60 0 3432 0 1429 551 25 1 0 74
 1 30 2969056 112464 460 127228 56 0 4760 16 1348 597 24 1 0 74
 1 28 2969036 111584 464 128136 88 0 6736 4 1590 745 24 1 0 75
 2 27 2969008 112312 312 127716 64 0 5688 0 2019 808 24 1 0 74
 1 26 2968992 112016 248 127384 56 0 4744 0 1729 765 24 1 0 74
 4 25 2968964 111504 248 128356 56 0 5068 0 1514 611 25 1 0 74
 1 27 2968964 112024 196 127700 40 0 5188 0 1781 719 24 2 0 74
 1 26 2968960 112364 196 127000 72 0 4148 0 1479 646 25 1 0 75
 1 29 2968932 110284 196 128936 88 0 5676 4 1444 672 25 1 0 74
 2 27 2968904 112240 196 126904 112 0 4616 0 1798 703 25 1 0 74
 3 26 2968892 111052 196 127544 60 0 6256 0 1319 573 24 1 0 75
 2 27 2968840 111552 196 127188 48 0 4556 0 1657 664 24 1 0 74
 1 26 2968800 111328 196 127164 28 0 4072 0 1429 628 25 1 0 74
 1 25 2970120 114976 196 130048 72 1328 4712 1328 55590 552 24 4 0 73
 1 29 2970084 116696 196 128004 76 0 4304 0 1435 572 25 2 0 74
 1 30 2970048 116512 196 128236 56 0 5164 0 1341 551 24 1 0 74
 1 32 2970044 116216 196 128208 56 0 4236 0 1283 585 25 1 0 75
 1 30 2970032 116836 196 128492 40 0 3492 0 1310 526 24 1 0 74
 1 30 2970024 114976 196 130568 56 0 4020 0 1257 517 24 1 0 74
 1 31 2970020 117276 196 128572 44 0 3380 0 1322 547 25 1 0 74
 2 29 2970012 114356 200 131852 56 0 3492 0 1223 505 24 1 0 75
 1 28 2970012 111380 200 135476 80 0 3664 0 1256 506 25 1 0 74
 1 22 2970012 107908 200 138992 64 0 3440 0 1202 521 25 1 0 75
 2 22 2970012 115844 204 129504 84 0 2296 0 1372 547 25 1 0 74
 1 23 2969920 118340 196 126320 84 0 2492 0 1421 583 25 2 0 74
 1 21 2969908 119068 196 125848 36 0 2188 0 1415 653 25 1 0 74
 2 22 2969908 118572 196 126428 56 0 2064 0 1298 593 25 1 0 74
 1 29 2969908 118076 192 127832 84 0 2964 0 1268 589 25 1 0 74
 1 28 2969908 116340 192 129020 72 0 3212 0 1301 569 24 1 0 75
 2 26 2969908 112496 192 132596 32 0 3728 0 1210 497 25 1 0 75
 1 28 2969900 116836 192 127816 32 0 4484 0 1521 757 25 2 0 74
 1 28 2969900 116588 192 128804 12 0 3768 0 1258 522 25 1 0 75
 1 23 2969900 113860 192 131348 16 0 3996 0 1259 576 25 1 0 74
 1 24 2969888 111596 192 133948 72 0 2492 0 1209 492 25 1 0 75
 1 26 2969888 110080 192 134812 64 0 2796 0 1244 572 25 1 0 74
 1 25 2969868 115564 192 129092 72 0 2224 0 1436 583 25 1 0 74
 1 26 2969856 116804 192 127488 72 0 3364 0 1478 596 25 2 0 74
 1 27 2969856 114572 192 130444 28 0 3632 0 1207 484 25 1 0 75
 1 26 2969856 110480 192 134628 84 0 4204 0 1311 683 25 1 0 74
 1 23 2969824 116676 192 127320 120 0 3160 0 1436 593 25 1 0 74
 1 27 2969816 114740 192 129592 20 0 4044 0 1268 582 25 1 0 74
 1 26 2969800 114572 196 130180 40 0 3320 0 1347 538 24 1 0 75
 2 26 2969780 110108 196 134732 16 0 4616 0 1218 522 25 1 0 74
 1 24 2969776 114572 196 129372 88 0 4088 0 1459 600 25 1 0 74
 1 23 2969776 113704 196 130256 96 0 3840 0 1440 621 24 1 0 74
 1 22 2969776 113952 196 129788 88 0 3396 0 1315 561 25 1 0 74
 2 24 2969776 110356 196 133828 72 0 3872 0 1212 522 25 1 0 75
 2 24 2969776 114868 196 128608 72 0 3664 0 1420 600 25 1 0 74
 1 26 2969776 115924 196 127764 112 0 3808 0 1424 606 24 1 0 74
 2 24 2969776 115112 196 128676 112 0 3208 0 1350 578 24 2 0 74

09:12:16 up 1 day, 18:16, 21 users, load average: 49,73, 47,78, 43,55
Thu Feb 13 09:12:36 CET 2014
             total       used       free     shared    buffers     cached
Mem:          3628       3522        105          0          0        133
-/+ buffers/cache:       3388        240
Swap:         8199       2904       5295
21536 0 367857
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0 49 2974148 113764 204 132128 14 16 467 21 16 27 8 1 50 41
 1 49 2974144 111408 204 134496 40 0 2404 0 223 305 0 0 0 100
 0 45 2974012 108688 344 137128 68 0 2828 0 282 383 0 0 0 99
 0 47 2974012 105836 352 140404 64 0 3224 36 199 346 0 1 0 100
 0 46 2974012 116316 228 129224 96 0 4920 88 371 601 0 1 0 99
 0 53 2974012 117244 328 128584 0 0 5116 4 492 693 0 1 0 99
 0 53 2974012 114144 328 131940 32 0 3284 0 256 357 0 0 0 100
 0 50 2974012 110424 328 135784 64 0 4140 0 244 395 0 1 0 100
 1 50 2974012 106580 328 139476 64 0 3580 0 237 394 0 0 0 100
 0 49 2974012 115620 236 128984 64 0 3488 0 272 436 0 1 0 99
 0 51 2974012 118008 224 126208 64 0 3428 4 307 410 0 1 0 99
 0 46 2974012 115880 224 129260 32 0 3488 0 239 365 0 0 0 100
 1 47 2974012 112160 224 133152 32 0 3800 0 216 327 0 0 0 100
 0 48 2974012 116740 224 128056 0 0 8248 4 354 551 1 1 0 99
 0 46 2974012 115756 224 129756 0 0 3280 0 454 675 0 0 0 100
 0 46 2974012 112904 224 131828 48 0 2344 0 195 292 0 0 0 100
 0 45 2974000 110920 224 134352 52 0 2640 0 211 365 0 1 0 99
 0 44 2974000 107448 224 137852 52 0 3644 0 278 487 0 0 0 100
 0 41 2973996 114512 220 129116 0 0 2912 0 277 456 0 1 0 99
 0 39 2973976 117184 220 126048 72 0 4128 0 489 809 0 1 0 99
 0 38 2973956 116616 220 127752 64 0 3028 0 228 392 0 0 0 100
 0 38 2973956 113764 220 130700 32 0 3116 0 173 303 0 0 0 100
 0 36 2973956 111160 220 133224 0 0 2364 0 179 346 0 0 0 100
 0 38 2973952 115628 220 129124 0 0 3064 0 272 453 0 1 0 99
 1 40 2973952 112904 220 132320 20 0 3316 0 210 338 0 0 0 100
 0 40 2973952 108936 220 137140 24 0 4624 0 381 333 5 0 0 95
 0 39 2973944 114444 220 129396 104 0 3464 0 324 580 0 1 0 99
 0 36 2973936 117936 220 126292 0 0 4488 0 387 557 0 1 0 99
 0 39 2973932 113060 220 131420 56 0 4476 0 266 364 0 0 0 100
 0 36 2973928 110084 220 134924 0 0 3448 0 244 336 0 1 0 99
 0 34 2973924 113556 220 131460 0 0 3276 0 318 364 2 1 0 98
 0 35 2973924 110456 220 134340 124 0 3472 0 192 291 0 0 0 100
 0 35 2973924 107480 224 137620 176 0 3144 0 216 301 0 0 0 100
 0 34 2973924 106980 224 137296 104 0 1916 0 250 388 0 0 0 100
 0 34 2973924 116280 220 127784 100 0 1812 0 292 450 0 1 0 99
 0 35 2973920 116900 220 126364 104 0 1492 0 338 520 0 1 0 99
 1 33 2973920 117896 220 124924 0 0 2136 0 257 427 0 0 0 100
 0 36 2973904 115044 220 128304 80 0 3052 0 219 332 0 0 0 100
 0 36 2973904 111572 220 132004 20 0 3708 0 217 337 0 1 0 100
 1 36 2973888 114172 220 130076 136 0 3788 0 255 344 0 0 0 99
 0 36 2973888 111324 220 133096 0 0 3108 0 228 284 0 0 0 99
 0 36 2973884 108100 220 136320 32 0 3440 0 269 413 0 0 0 100
 1 44 2973776 114400 220 128748 788 0 2748 0 258 437 0 1 0 99
 0 45 2973644 115588 220 125800 1296 0 4080 0 477 456 5 1 0 94
 0 45 2973456 116248 220 124596 1852 0 6888 0 322 466 0 1 0 99
 0 50 2973012 113524 220 125616 1364 0 6372 0 616 908 0 1 0 99
 0 53 2973712 112904 220 126048 1084 840 4624 840 295 462 0 1 0 99
 0 53 2974404 113028 220 126420 304 764 4184 764 279 417 0 0 0 100
 0 52 2974328 112892 220 125708 644 0 4160 0 255 390 0 1 0 99
 1 53 2974276 112036 220 126544 496 0 4784 0 283 432 0 0 0 100
 0 52 2974252 111416 220 126060 580 8 4720 8 283 429 0 1 0 99
 0 51 2974092 111612 220 125636 332 0 4260 0 285 430 0 0 0 100
 1 42 2974080 111168 220 125324 12 0 3084 0 299 423 0 1 0 99
 0 41 2974072 111412 220 125884 36 0 4936 4 334 405 2 1 0 98
 0 42 2974072 111396 216 126304 0 0 3840 32 368 623 0 1 0 99
 0 42 2974200 111064 208 126112 32 128 4852
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.13.5-100.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel. If you experience different issues, please open a new bug report for those.
Created attachment 874513 [details]
Capture from vmstat with kernel 3.13.5-103.fc19.x86_64

The problem starts after "Fri Mar 14 13:36:55 CET 2014", going on and off a little until "Fri Mar 14 13:56:06 CET 2014", after which the situation is permanently poor. I rebooted at the end, as I had no time to let it recover by itself. Overall this was far from one of the worst occurrences I have experienced: only very, very unresponsive, not totally frozen.
Notice: I will soon lose access to the computer where I hit this issue almost daily, so if there are any patches I should try out, by all means do not wait!
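In the meantime, one possible way to capture more debug data during a freeze (my own suggestion, not something requested in this bug) is to record which processes are stuck in uninterruptible sleep, using only standard procfs files:

```shell
# list_blocked: print the names of processes currently in state D
# (uninterruptible sleep).  The procfs root is a parameter so the
# function can also be pointed at a copied snapshot of /proc.
list_blocked() {
    root=${1:-/proc}
    for status in "$root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk '$1 == "Name:"  { name = $2 }
             $1 == "State:" { st = $2 }
             END { if (st == "D") print name }' "$status"
    done
}
```

Running this once per second alongside vmstat would show *which* tasks pile up in D state; as root, `echo w > /proc/sysrq-trigger` would additionally dump their kernel stacks to dmesg, which is usually what kernel developers ask for in hangs like this.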
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.14.4-100.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 20 and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.