Description of problem:
After some time (approx. 1-2 days) my server has many processes in state "D". Mostly they are postfix, dovecot, crond and mysql. They block my server, which hangs after some time. The problem started after the upgrade to Fedora 10; on Fedora 8 there were no problems with the same configuration.
Version-Release number of selected component (if applicable):
[root@ns ~]# rpm -q kernel kernel-PAE
kernel-2.6.27.9-159.fc10.x86_64
kernel-PAE-2.6.27.7-134.fc10.i686
kernel-PAE-2.6.27.9-159.fc10.i686
The same problem occurs with all of these kernels.
How reproducible:
Every 1-2 days on my machine.
Steps to Reproduce:
Unable to reproduce on another machine; I can only wait.
Actual results:
Before debugging there was nothing special in dmesg or messages. Thanks to nirik there are some logs: http://www.salstar.sk/fedora-error/
Expected results:
A working server.
Additional info:
Curiously, after reboot all my logged data is gone. This one is a copy created before the reboot of my machine. The "sync" command also fails once the problem has started. Maybe something is wrong with the Xen blk driver. My dom0 is a fully updated Fedora 8. The guest is a paravirtualized guest on LVM disk storage.
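For anyone trying to confirm the same symptom, the processes in state "D" can be listed straight from /proc. The following is only a minimal sketch, assuming a Linux guest with /proc mounted; the helper names `parse_stat` and `d_state_processes` are made up for this illustration and do not come from any tool mentioned in this report.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren].strip())
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """Return [(pid, comm), ...] for tasks in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `d_state_processes()` every minute or so and printing the result to the console would show whether the list of stuck postfix, dovecot, crond and mysql processes keeps growing before the hang.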
After downgrading to the fc8 kernel my system has been up for more than 3 days. I think there is something wrong with the fc10 kernels.
This report looks very similar to my problem: http://lkml.indiana.edu/hypermail/linux/kernel/0812.3/00438.html I am using online nightly backups with LVM snapshots.
I'm having the same issue. I have a RH/Centos 5.3 dom0 with a Fedora 10 domU. With no load the domU can run for hours without any problem, but once there is a heavy load the uninterruptible sleep (D) processes start popping up. Then just about any command that requires disk access results in a frozen shell, and eventually the whole domU just locks up. I'll see if I can come up with an easy way to reproduce this. I suspect that initiating a large file copy may do it.
(In reply to comment #3)
> a heavy load the uninterruptible sleep (D) processes start popping up. Then
> just about any command that requires disk access results in a frozen shell, and
> eventually the whole domU just locks up.
Are you sure that all processes that write freeze? In my situation only processes that try to sync data to disk freeze. For example, the command
dmesg > /tmp/dmesg
ends without a problem and I can see the /tmp/dmesg file with all its content OK, but after reboot this file is missing. It does not matter whether it's /tmp, /root or another directory: all writes are saved to cache and do not go to disk.
> I'll see if I can come up with an easy way to reproduce this. I suspect that
> initiating a large file copy may do it.
Good luck, I can't. My problem occurs only on 2 machines with the same hardware. Maybe there is something wrong with this PC. I plan to update all BIOSes (RAID and motherboard) on these servers, but I have to coordinate this with the hardware supplier. Other servers with different hardware work well. Attaching my lshw configuration.
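One way to check the "writes stay in cache" observation is to watch the Dirty and Writeback counters in /proc/meminfo. This is a rough sketch only, assuming the usual "Name:   value kB" layout of that file; the function name `dirty_and_writeback_kb` is hypothetical.

```python
def dirty_and_writeback_kb(meminfo_text):
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        # Match the field name exactly so "WritebackTmp" is not picked up.
        if sep and name in wanted:
            result[name] = int(rest.split()[0])
    return result
```

Feeding it the contents of /proc/meminfo at regular intervals should show Dirty growing and never draining if data really is not reaching the disk; if the counters behave normally, the cause is probably elsewhere.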
Created attachment 338883 Hardware configuration
Jan/Chris - I wonder could either of you try an F11Beta guest and see if it has the same problem?
Created attachment 339442 oops from console
After approx. 3 days my server was down again. Here are the currently used packages:
Apr 11 06:48:00 Updated: kernel-firmware-2.6.29.1-54.fc11.noarch
Apr 11 06:48:22 Installed: kernel-PAE-2.6.29.1-54.fc11.i686
Apr 11 06:48:25 Installed: kernel-PAE-2.6.29.1-54.fc11.i686
For 3 days there were no problems and everything worked well. I also logged the "sync" run time, but it was between 0 and 3 sec from start until the hang. After the hang, my "xm con" displayed many oopses (all that could be grabbed are attached):
BUG: soft lockup - CPU#1 stuck for 61s! [crond:3288]
Modules linked in: ipv6 xen_netfront pcspkr xen_blkfront
Pid: 3288, comm: crond Tainted: G D (2.6.29.1-54.fc11.i686.PAE #1)
EIP: 0061:[<c04023a7>] EFLAGS: 00000206 CPU: 1
EIP is at _stext+0x3a7/0x1000
EAX: 00000000 EBX: 00000003 ECX: cb8a9bdc EDX: cb8a9bec
ESI: e04a8790 EDI: 0d1dc961 EBP: cb8a9bfc ESP: cb8a9bd8
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
CR0: 8005003b CR2: b6106000 CR3: 1dfd7000 CR4: 00002620
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000000
Call Trace:
[<c05c37fa>] ? xen_poll_irq+0x45/0x55
[<c0407613>] xen_spin_lock_slow+0x131/0x1f1
[<c0407780>] __xen_spin_lock+0xad/0xdf
[<c04077cc>] xen_spin_lock+0xa/0xc
[<c0715e51>] _spin_lock+0xd/0x10
[<c049b910>] page_referenced+0x58/0x11b
[<c048b352>] shrink_active_list+0x13f/0x313
[<c0488686>] ? get_dirty_limits+0x21/0x2c4
[<c043cc6f>] ? do_softirq+0x68/0x7e
[<c040b00e>] ? do_IRQ+0x97/0xad
[<c048c1c6>] shrink_zone+0x292/0x2a4
[<c048cd76>] do_try_to_free_pages+0x1ee/0x31b
[<c048cf87>] try_to_free_pages+0x62/0x6a
[<c048afb3>] ? isolate_pages_global+0x0/0x199
[<c04876ef>] __alloc_pages_internal+0x22f/0x386
[<c04264d9>] pte_alloc_one+0x1c/0x3f
[<c0493b9b>] __pte_alloc+0x16/0xaf
[<c0494740>] copy_page_range+0x1b5/0x52f
[<c0422d2d>] ? pvclock_clocksource_read+0x4e/0xd8
[<c0422d2d>] ? pvclock_clocksource_read+0x4e/0xd8
[<c0436313>] dup_mm+0x21b/0x2e4
[<c0436d5b>] copy_process+0x952/0x102e
[<c043754f>] do_fork+0x118/0x2a7
[<c0407e32>] sys_clone+0x24/0x26
[<c040955e>] syscall_call+0x7/0xb
Chris, can you try to update the firmware on your machine? Can you also attach your hardware configuration (use lshw or at least dmidecode on dom0)?
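The "sync" run-time logging mentioned above could look roughly like the sketch below. It is not the script actually used on this server; `timed_run` and `report` are hypothetical names, and the 60 second timeout is an arbitrary assumption. The child is started with `subprocess.Popen` and is deliberately not waited for after a timeout, because a process stuck in state D ignores SIGKILL and waiting for it would hang the logger too.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s):
    """Run cmd and return (elapsed_seconds, finished_within_timeout)."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        # kill() has no effect on a task in uninterruptible sleep,
        # so we send the signal but do not wait for the child afterwards.
        proc.kill()
        finished = False
    return time.monotonic() - start, finished


def report(cmd=("sync",), timeout_s=60.0, out=sys.stdout):
    """Write one timestamped line with the run time of cmd to out."""
    elapsed, finished = timed_run(list(cmd), timeout_s)
    status = "ok" if finished else "TIMEOUT"
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    out.write("%s %s %.2fs %s\n" % (stamp, " ".join(cmd), elapsed, status))
```

Since files written after the problem starts are lost on reboot, the output of `report()` is probably more useful on the Xen console or sent to a remote host than in a local log file.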
(In reply to comment #7)
> Created an attachment (id=339442) [details]
> oops from console
This looks like a completely separate issue? Please file another bug report.
So, you didn't see any uninterruptible sleep process with the 2.6.29 guest?
> This looks like a completely separate issue? Please file another bug report
But I can't test it any further with the FC11 development kernel. It's a production machine.
> So, you didn't see any uninterruptible sleep process with the 2.6.29 guest?
With the older kernel it was normal for this server to work for 2-20 days. I can't say whether it works until it has run for at least 7 days on this server, and maybe another 7 days on another one, where I can't test this before it succeeds on server 1.
2 days after the BIOS update my virtual machine is dead again. This is on the virtual console:
BUG: soft lockup - CPU#2 stuck for 61s! [smtp:12481]
Modules linked in: ipv6 pcspkr xen_netfront xen_blkfront [last unloaded: scsi_wait_scan]
Pid: 12481, comm: smtp Tainted: G D (2.6.27.21-170.2.56.fc10.i686.PAE #1)
EIP: 0061:[<c04023a7>] EFLAGS: 00200202 CPU: 2
EIP is at _stext+0x3a7/0x1000
EAX: 00000000 EBX: 00000003 ECX: ea87ce60 EDX: 00000010
ESI: ecd4e16c EDI: 00000000 EBP: ea87ce80 ESP: ea87ce5c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
CR0: 8005003b CR2: 00195fb0 CR3: 00816000 CR4: 00002620
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000000
[<c057978e>] ? xen_poll_irq+0x40/0x50
[<c040696d>] xen_spin_lock_slow+0x57/0x91
[<c04069d8>] xen_spin_lock+0x31/0x659
[<c06b71a9>] _spin_lock+0x8/0xb
[<c0485a8e>] unlink_file_vma+0x1d/0x6d
[<c048446a>] free_pgtables+0x4e/0x94
[<c0485722>] exit_mmap+0x89/0xe5
[<c0430e17>] mmput+0x37/0x86
[<c043420b>] exit_mm+0xeb/0xf3
[<c0435ad4>] do_exit+0x1cc/0x744
[<c0466d1e>] ? audit_syscall_entry+0xf9/0x123
[<c04360bc>] do_group_exit+0x70/0x97
[<c04360f6>] sys_exit_group+0x13/0x17
[<c0408c8a>] syscall_call+0x7/0xb
=======================
I have lots of similar messages; I can attach them all if requested. Is this a different problem too? Is it the same problem as in attachment https://bugzilla.redhat.com/attachment.cgi?id=339442 ? If yes, I can test this and report a new bug. What should I put in the subject for this bug? I have no idea about similar bugs.
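With many similar messages on the console, a short summary may be easier to attach than every trace. The sketch below assumes the console output has been saved as text and that the lockup lines follow the "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]" form seen in this report; `soft_lockups` and `lockups_per_command` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#2 stuck for 61s! [smtp:12481]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def soft_lockups(console_text):
    """Return a list of (cpu, seconds, comm, pid) for every lockup line."""
    return [
        (int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
        for m in SOFT_LOCKUP.finditer(console_text)
    ]


def lockups_per_command(console_text):
    """Count soft lockup reports per command name."""
    return Counter(comm for _cpu, _secs, comm, _pid in soft_lockups(console_text))
```

The per-command counts only show which processes were reported; they do not by themselves say whether the traces share the same spinlock path.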
Is this still an issue with the 2.6.30 F11 kernels?
I can't upgrade to F11 now. My plans are to upgrade these servers to F11 host and F10 guest using KVM. All my servers on those machines (2 hosts) are in production and I can't test unstable things.
This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.