Bug 453257
Summary: | Seemingly non-working process is messing with process scheduler.
---|---
Product: | [Fedora] Fedora
Component: | kernel
Version: | 9
Hardware: | x86_64
OS: | Linux
Status: | CLOSED NEXTRELEASE
Severity: | high
Priority: | low
Reporter: | Ivo Sarak <ivo>
Assignee: | Kernel Maintainer List <kernel-maint>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | clodoaldo.pinto.neto
Doc Type: | Bug Fix
Last Closed: | 2008-10-01 06:37:15 UTC
Description
Ivo Sarak
2008-06-28 12:17:14 UTC
Possibly a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=447975
See comment #1, and try those commands. If that solves it, it's a dupe.

I have no CPU speed scaling enabled or installed:

    [root@haskaa cpu]# ls -la
    total 0
    drwxr-xr-x  7 root root    0 2008-06-28 19:44 .
    drwxr-xr-x 12 root root    0 2008-06-28 19:44 ..
    drwxr-xr-x  4 root root    0 2008-06-28 19:44 cpu0
    drwxr-xr-x  4 root root    0 2008-06-28 19:44 cpu1
    drwxr-xr-x  4 root root    0 2008-06-28 19:44 cpu2
    drwxr-xr-x  4 root root    0 2008-06-28 19:44 cpu3
    drwxr-xr-x  2 root root    0 2008-06-28 19:44 cpuidle
    -rw-r--r--  1 root root 4096 2008-06-28 19:44 sched_mc_power_savings
    [root@haskaa cpu]# cd cpu0
    [root@haskaa cpu0]# ls -la
    total 0
    drwxr-xr-x 4 root root    0 2008-06-28 19:44 .
    drwxr-xr-x 7 root root    0 2008-06-28 19:44 ..
    drwxr-xr-x 6 root root    0 2008-06-28 19:44 cache
    -r-------- 1 root root 4096 2008-06-28 19:44 crash_notes
    drwxr-xr-x 2 root root    0 2008-06-28 19:44 topology
    [root@haskaa cpu0]#

The latest F9 kernel still has that issue:

    [root@ragana ~]# uname -a
    Linux ragana 2.6.25.9-76.fc9.x86_64 #1 SMP Fri Jun 27 15:58:30 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
    [root@ragana ~]#

If my description of the issue is too vague, then to summarize: I have a couple of issues with the 2.6.25 series kernels bundled with Fedora 9 (when running the FAH client):

1. The task scheduler forgets about the presence of FahCore_*.exe. It reports that there is no CPU activity (the system is 99% idle), and whenever another task asks for some CPU time it will likely get nothing, because the FahCore_*.exe processes are already taking all of it. The usual "features" I observe: I am unable to log in to the system or reboot it. Starting X11 or even some console application will likely fail as well. Even a plain simple BELL is likely to get stuck in an endless beep. Killing the FahCore_*.exe processes fixes these issues; you can then start the FAH client up again and wait a couple of (tens of) hours to hit the very same issue again.
2. System-wide CPU utilization by FahCore_*.exe over 4 cores is less than 50%, or as low as 20..30% (with a1 or a2 cores; it does not make much difference).

I have these issues on both of my Phenom-based Fedora 9 systems. Whether it is the SMP client or a uni-proc one does not make much difference. Maybe it is a Fedora 9 only "feature", or maybe not.

I'll add a session showing that issue:

1. tty1 is stuck at log in;
2. tty2 is running "top" and stating that the system is 99% idle;
3. tty3 is stuck at starting X11.

Everything is back to normal after FahCore_a1.exe is killed.

Created attachment 311360 [details]
session log
The system is reported to be 99% idle, but it is actually still loaded by
FahCore_a1.exe. I am unable to log in or start X11 until I kill all running
FahCore_a1.exe processes.
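One way to cross-check the "99% idle" reading against what a FahCore process is actually consuming is to compare two snapshots of /proc/stat with the same process's own tick counters. The following is only a minimal sketch, not something used in this report: the function names are hypothetical, and it assumes the Linux /proc layout in which the first line of /proc/stat lists user, nice, system, idle, iowait, irq, softirq and steal ticks, and fields 14 and 15 of /proc/<pid>/stat hold utime and stime.

```python
from pathlib import Path


def cpu_times(stat_line):
    """Return (total_ticks, idle_ticks) from the aggregate 'cpu' line of /proc/stat."""
    # Use only the first eight counters (user .. steal); guest time is already
    # included in user time, so adding it would double count.
    fields = [int(x) for x in stat_line.split()[1:9]]
    idle = sum(fields[3:5])  # idle + iowait
    return sum(fields), idle


def cpu_busy_fraction(before, after):
    """System-wide busy fraction between two 'cpu' lines of /proc/stat."""
    total0, idle0 = cpu_times(before)
    total1, idle1 = cpu_times(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 1.0 - (idle1 - idle0) / elapsed


def process_ticks(pid_stat):
    """utime + stime (in clock ticks) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' and count fields from the state column.
    rest = pid_stat.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


def snapshot(pid):
    """Read the aggregate cpu line and the tick count of one process (Linux only)."""
    system_line = Path("/proc/stat").read_text().splitlines()[0]
    return system_line, process_ticks(Path(f"/proc/{pid}/stat").read_text())
```

Taking `snapshot(pid)` twice, some seconds apart, gives both views for the same interval. If the process's tick count keeps growing by a large share of the elapsed total while `cpu_busy_fraction` stays near zero, the system-wide accounting and the per-process accounting disagree, which is the symptom described in this report.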
I've managed to see the following messages during one of those "99% idle sessions":

    Clocksource tsc unstable (delta = 145296364583 ns)
    ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
    ata1.00: cmd c8/00:08:1b:48:b8/00:00:00:00:00/e0 tag 0 dma 4096 in
             res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
    ata1.00: status: { DRDY }
    ata1: soft resetting link
    ata1.00: configured for UDMA/100
    ata1.01: configured for UDMA/33
    ata1: EH complete
    sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
    sd 0:0:0:0: [sda] Write Protect is off
    sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
    sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
    sd 0:0:0:0: [sda] Write Protect is off
    sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
    sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    NETDEV WATCHDOG: eth0: transmit timed out
    r8169: eth0: link up

Created attachment 311597 [details]
session log with 3 FAH clients running
This session, from my second AMD Phenom-based Fedora 9 system, has 3 FAH clients
running: 1 SMP and 2 uni-proc ones. One uni-proc client ("CPU2") has been dropped
from the task scheduler list and gets restarted during the course of the session.
There are a couple of places where commands were terminated by pressing ctrl+c;
this was needed because their execution had stalled.
Is the FAH client actually getting work done when this starts happening? Is there some way to check how much progress it is making with its assigned task?

Yes, the FAH client is actually still processing and getting the work done as usual. Last time I checked, the processing speed was unaffected by that issue. By all accounts I see, the FAH client (the FAH cores, actually) is getting 100% of CPU time.

I currently have one set of FahCore_82.exe missing in action, but the messages file has a lot of errors to report: http://ra.vendomar.ee/~ivo/messages.txt

    14842 ?        S      0:00 /bin/bash ./FaH
    14846 ?        Sl     0:24 /home/ivo/folding2/foldingathome/CPU2/fah6 -betateam -forceasm -verbosity 9
    14856 ?        S      0:00 /bin/bash ./FaH
    14858 ?        Sl     0:25 /home/ivo/folding2/foldingathome/CPU3/fah6 -betateam -forceasm -verbosity 9
    16537 ?        Ss     0:00 SCREEN
    16538 pts/2    Ss     0:00 /bin/bash
    18284 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 14858 -version 602
    18285 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 14858 -version 602
    18286 ?        RN   292:26 ./FahCore_82.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 14858 -version 602
    18287 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 14858 -version 602
    18873 pts/2    S+     0:00 /bin/bash ./FaH
    18874 pts/2    Sl+    0:00 /home/ivo/foldingathome/CPU1/fah6 -smp -verbosity 9 -betateam
    18878 pts/2    S+     0:00 ./mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18874 -version 602
    18879 pts/2    SNl+  61:11 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18874 -version 602
    18880 pts/2    SNl+  54:51 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18874 -version 602
    18881 pts/2    SNl+  53:44 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18874 -version 602
    18882 pts/2    SNl+  53:04 ./FahCore_a1.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 18874 -version 602
    18898 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 14846 -version 602
    18899 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 14846 -version 602
    18900 ?        RN    33:21 ./FahCore_82.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 14846 -version 602
    18901 ?        SN     0:00 ./FahCore_82.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 14846 -version 602

    top - 14:37:44 up 19 days, 2 min, 6 users, load average: 6.07, 5.46, 5.02
    Tasks: 127 total, 5 running, 122 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0%us, 2.1%sy, 78.4%ni, 19.0%id, 0.0%wa, 0.0%hi, 0.6%si, 0.0%st
    Mem: 4062920k total, 2748876k used, 1314044k free, 334148k buffers
    Swap: 4096564k total, 0k used, 4096564k free, 2019788k cached

      PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    18286 ivo   39  19 15944 2068 1088 R 92.2  0.1 294:33.01 FahCore_82.exe
    18879 ivo   39  19 77832  30m 2048 S 41.3  0.8  62:05.33 FahCore_a1.exe
    18880 ivo   39  19 74428  28m 1836 S 36.6  0.7  55:39.70 FahCore_a1.exe
    18881 ivo   39  19 74372  28m 1836 S 36.0  0.7  54:32.04 FahCore_a1.exe
    18882 ivo   39  19 74624  28m 1844 S 35.6  0.7  53:51.29 FahCore_a1.exe
        1 root  20   0  4048  892  648 S  0.0  0.0   0:01.91 init
        2 root  15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
        3 root  RT  -5     0    0    0 S  0.0  0.0   0:00.15 migration/0
        4 root  15  -5     0    0    0 S  0.0  0.0   0:00.24 ksoftirqd/0
        5 root  RT  -5     0    0    0 S  0.0  0.0   0:11.34 watchdog/0
        6 root  RT  -5     0    0    0 S  0.0  0.0   0:00.15 migration/1
        7 root  15  -5     0    0    0 S  0.0  0.0   0:00.10 ksoftirqd/1
        8 root  RT  -5     0    0    0 S  0.0  0.0   0:00.76 watchdog/1
        9 root  RT  -5     0    0    0 S  0.0  0.0   0:00.10 migration/2

It seems the FAH client is not running at full swing after all, but at 1/3 of full power once it has gone into hiding.

New FAH core in action, ~75% CPU utilization:

[08:27:00] Working on Unit 01 [July 24 08:27:00]
[08:27:00] + Working ...
[08:27:00] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 7536 -version 602'
[08:27:00]
[08:27:00] *------------------------------*
[08:27:00] Folding@Home Gromacs SMP Core
[08:27:00] Version 2.00 (Fri Jul 4 13:52:30 PDT 2008)
[08:27:00]
[08:27:00] Preparing to commence simulation
[08:27:00] - Ensuring status. Please wait.
[08:27:09] - Assembly optimizations manually forced on.
[08:27:09] - Not checking prior termination.
[08:27:10] - Expanded 4291742 -> 24742709 (decompressed 576.5 percent)
[08:27:10] Called DecompressByteArray: compressed_data_size=4291742 data_size=24742709, decompressed_data_size=24742709 diff=0
[08:27:10] - Digital signature verified
[08:27:10]
[08:27:10] Project: 2662 (Run 1, Clone 259, Gen 0)
[08:27:10]
[08:27:11] Assembly optimizations on if available.
[08:27:11] Entering M.D.
[08:37:20] Completed 2500 out of 250000 steps (1%)
[08:47:12] Completed 5000 out of 250000 steps (2%)
[08:57:05] Completed 7500 out of 250000 steps (3%)
[09:06:41] Completed 10000 out of 250000 steps (4%)
[09:16:19] Completed 12500 out of 250000 steps (5%)
[09:26:13] Completed 15000 out of 250000 steps (6%)
[09:35:54] Completed 17500 out of 250000 steps (7%)
[09:45:38] Completed 20000 out of 250000 steps (8%)
[09:55:28] Completed 22500 out of 250000 steps (9%)
[10:05:21] Completed 25000 out of 250000 steps (10%)
[10:15:25] Completed 27500 out of 250000 steps (11%)
[10:25:21] Completed 30000 out of 250000 steps (12%)
[10:35:02] Completed 32500 out of 250000 steps (13%)
[10:44:46] Completed 35000 out of 250000 steps (14%)
[10:54:27] Completed 37500 out of 250000 steps (15%)
[11:04:08] Completed 40000 out of 250000 steps (16%)
[11:13:46] Completed 42500 out of 250000 steps (17%)
[11:23:28] Completed 45000 out of 250000 steps (18%)
[11:33:10] Completed 47500 out of 250000 steps (19%)
[11:42:50] Completed 50000 out of 250000 steps (20%)
[11:52:32] Completed 52500 out of 250000 steps (21%)
[12:02:12] Completed 55000 out of 250000 steps (22%)
[12:11:54] Completed 57500 out of 250000 steps (23%)
[12:21:34] Completed 60000 out of 250000 steps (24%)
[12:31:17] Completed 62500 out of 250000 steps (25%)
[12:40:59] Completed 65000 out of 250000 steps (26%)
[12:50:38] Completed 67500 out of 250000 steps (27%)
[13:00:23] Completed 70000 out of 250000 steps (28%)
[13:10:07] Completed 72500 out of 250000 steps (29%)
[13:19:48] Completed 75000 out of 250000 steps (30%)
[13:29:34] Completed 77500 out of 250000 steps (31%)
[13:39:14] Completed 80000 out of 250000 steps (32%)
[13:48:56] Completed 82500 out of 250000 steps (33%)
[13:58:38] Completed 85000 out of 250000 steps (34%)
[14:08:24] Completed 87500 out of 250000 steps (35%)
[14:18:05] Completed 90000 out of 250000 steps (36%)
[14:27:00] - Autosending finished units...
[14:27:00] Trying to send all finished work units
[14:27:00] + No unsent completed units remaining.
[14:27:00] - Autosend completed
[14:27:49] Completed 92500 out of 250000 steps (37%)
[14:37:32] Completed 95000 out of 250000 steps (38%)
[14:47:20] Completed 97500 out of 250000 steps (39%)
[14:57:00] Completed 100000 out of 250000 steps (40%)
[15:06:39] Completed 102500 out of 250000 steps (41%)
[15:16:21] Completed 105000 out of 250000 steps (42%)
[15:26:01] Completed 107500 out of 250000 steps (43%)
[15:35:41] Completed 110000 out of 250000 steps (44%)
[15:47:44] Completed 112500 out of 250000 steps (45%)
[16:17:41] Completed 115000 out of 250000 steps (46%)
[16:47:38] Completed 117500 out of 250000 steps (47%)
[17:17:48] Completed 120000 out of 250000 steps (48%)
[17:47:48] Completed 122500 out of 250000 steps (49%)
[18:17:46] Completed 125000 out of 250000 steps (50%)
[18:47:45] Completed 127500 out of 250000 steps (51%)
[19:17:43] Completed 130000 out of 250000 steps (52%)
[19:47:41] Completed 132500 out of 250000 steps (53%)
[20:17:39] Completed 135000 out of 250000 steps (54%)
[20:27:00] - Autosending finished units...
[20:27:00] Trying to send all finished work units
[20:27:00] + No unsent completed units remaining.
[20:27:00] - Autosend completed
[20:47:38] Completed 137500 out of 250000 steps (55%)
[21:17:36] Completed 140000 out of 250000 steps (56%)
[21:47:36] Completed 142500 out of 250000 steps (57%)
[22:17:35] Completed 145000 out of 250000 steps (58%)
[22:47:34] Completed 147500 out of 250000 steps (59%)
[23:17:33] Completed 150000 out of 250000 steps (60%)
[23:27:20] Completed 152500 out of 250000 steps (61%)
[23:37:02] Completed 155000 out of 250000 steps (62%)
[23:46:48] Completed 157500 out of 250000 steps (63%)
[23:56:32] Completed 160000 out of 250000 steps (64%)
[00:06:13] Completed 162500 out of 250000 steps (65%)
[00:15:57] Completed 165000 out of 250000 steps (66%)
[00:25:41] Completed 167500 out of 250000 steps (67%)
[00:35:24] Completed 170000 out of 250000 steps (68%)
[00:45:07] Completed 172500 out of 250000 steps (69%)
[00:54:50] Completed 175000 out of 250000 steps (70%)
[01:04:30] Completed 177500 out of 250000 steps (71%)
[01:14:10] Completed 180000 out of 250000 steps (72%)
[01:23:54] Completed 182500 out of 250000 steps (73%)
[01:33:36] Completed 185000 out of 250000 steps (74%)
[01:43:16] Completed 187500 out of 250000 steps (75%)
[01:52:58] Completed 190000 out of 250000 steps (76%)
[02:02:44] Completed 192500 out of 250000 steps (77%)
[02:12:22] Completed 195000 out of 250000 steps (78%)
[02:22:05] Completed 197500 out of 250000 steps (79%)
[02:27:00] - Autosending finished units...
[02:27:00] Trying to send all finished work units
[02:27:00] + No unsent completed units remaining.
[02:27:00] - Autosend completed
[02:31:46] Completed 200000 out of 250000 steps (80%)
[02:41:29] Completed 202500 out of 250000 steps (81%)
[02:51:11] Completed 205000 out of 250000 steps (82%)
[03:00:57] Completed 207500 out of 250000 steps (83%)
[03:10:40] Completed 210000 out of 250000 steps (84%)
[03:20:24] Completed 212500 out of 250000 steps (85%)
[03:30:05] Completed 215000 out of 250000 steps (86%)
[03:39:45] Completed 217500 out of 250000 steps (87%)
[03:49:30] Completed 220000 out of 250000 steps (88%)

Here the system went 99% idle and FAH client production dropped to 1/3:

[04:18:31] Completed 222500 out of 250000 steps (89%)
[04:48:27] Completed 225000 out of 250000 steps (90%)
[05:18:25] Completed 227500 out of 250000 steps (91%)
[05:48:26] Completed 230000 out of 250000 steps (92%)
[06:18:28] Completed 232500 out of 250000 steps (93%)
[06:48:31] Completed 235000 out of 250000 steps (94%)

Changed the bug Summary to reflect actual state.

The issue is still unchanged under kernel 2.6.25.11-97.fc9.x86_64.

There is hope: kernel-2.6.27-0.238.rc2.fc10.x86_64 from Rawhide does not have that issue (at least not yet).

The latest kernel-2.6.27-0.244.rc2.git1.fc10.x86_64 dies, throwing a lot of exceptions. These are likely network related: "03:02.0 Ethernet controller: ADMtek NC100 Network Everywhere Fast Ethernet 10/100 (rev 11)" (is the tulip driver buggy?).

I think this is finally fixed in 2.6.26.5-37.fc9.

kernel-2.6.26.5-39.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.5-39.fc9

kernel-2.6.26.5-39.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8089

kernel-2.6.26.5-44.fc9 has been submitted as an update for Fedora 9.
http://admin.fedoraproject.org/updates/kernel-2.6.26.5-44.fc9

kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F9/FEDORA-2008-8283

kernel-2.6.26.5-45.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.
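For anyone re-testing with a newer kernel, the slowdown visible in the FAH log quoted earlier in this report can be quantified from the "Completed ... steps" timestamps alone. The snippet below is only a sketch under stated assumptions: the function name `step_intervals` is hypothetical, the log timestamps carry no date, and consecutive progress reports are assumed to be less than 24 hours apart so that a negative difference can be treated as a midnight rollover.

```python
import re

# Matches progress lines such as:
# [03:49:30] Completed 220000 out of 250000 steps (88%)
STEP_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\] Completed (\d+) out of (\d+) steps"
)


def step_intervals(log_text):
    """Return (steps_done, seconds_since_previous_report) for each progress line."""
    intervals = []
    previous = None
    for match in STEP_RE.finditer(log_text):
        hours, minutes, seconds, done, _total = map(int, match.groups())
        now = hours * 3600 + minutes * 60 + seconds
        if previous is not None:
            # No date in the timestamps: wrap around midnight with a modulo.
            intervals.append((done, (now - previous) % 86400))
        previous = now
    return intervals


sample = """
[03:39:45] Completed 217500 out of 250000 steps (87%)
[03:49:30] Completed 220000 out of 250000 steps (88%)
[04:18:31] Completed 222500 out of 250000 steps (89%)
[04:48:27] Completed 225000 out of 250000 steps (90%)
"""

for done, secs in step_intervals(sample):
    print(f"{done:>7} steps: {secs / 60:5.1f} min since previous report")
```

On the four sample lines taken from the log above, the interval per 2500 steps goes from under 10 minutes to about 30 minutes, which matches the "1/3 of full power" observation. Because the pattern is searched anywhere in the text, the same function also works on a log pasted as one long line.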