Bug 478414

Summary: Server down after 1-2 days, many of processes in state "D". (xen domU)
Product: [Fedora] Fedora
Component: kernel
Version: 10
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: high
Reporter: Jan ONDREJ <ondrejj>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: chris_rhbugzilla, jforbes, kernel-maint, markmc, mathieu-acct, mjw, virt-maint
Doc Type: Bug Fix
Last Closed: 2009-12-18 07:25:51 UTC
Bug Blocks: 480594
Attachments:
  Hardware configuration (flags: none)
  oops from console (flags: none)

Description Jan ONDREJ 2008-12-29 19:16:36 UTC
Description of problem:
After some time (approx. 1-2 days) my server has many processes in state "D". Mostly they are postfix, dovecot, crond and mysql.
They block my server, which hangs after some time.
The problem started after the upgrade to Fedora 10; on Fedora 8 there were no problems with the same configuration.

Version-Release number of selected component (if applicable):
[root@ns ~]# rpm -q kernel kernel-PAE
kernel-2.6.27.9-159.fc10.x86_64
kernel-PAE-2.6.27.7-134.fc10.i686
kernel-PAE-2.6.27.9-159.fc10.i686
Same problem for all these kernels.

How reproducible:
Every 1-2 days on my machine.

Steps to Reproduce:
Unable to reproduce on another machine, I can only wait.
  
Actual results:
Before debugging there was nothing special in dmesg or messages. 

Thanks to nirik there are some logs:
  http://www.salstar.sk/fedora-error/

Expected results:
A working server.

Additional info:
Curiously, after a reboot all my logged data is gone. The logs above are a copy made before I rebooted the machine.
The "sync" command also fails once this problem has started.

Maybe something is wrong with the xen blk driver. My dom0 is a fully updated Fedora 8.
The guest is a paravirtualized guest on LVM disk storage.
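
For anyone trying to confirm the same symptom, here is a minimal sketch that lists the processes currently in state "D" by reading /proc/<pid>/stat. The function names parse_stat and blocked_processes are made up for this example, and it assumes a Linux guest with /proc mounted.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

Run periodically (for example from a loop in a screen session rather than from crond, which is itself one of the affected daemons), this shows which processes pile up first.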

Comment 1 Jan ONDREJ 2009-01-05 12:14:23 UTC
After downgrading to the fc8 kernel my system has been up for more than 3 days.
I think there is something wrong with fc10 kernels.

Comment 2 Jan ONDREJ 2009-01-06 20:16:12 UTC
This problem looks very similar to my problem:
  http://lkml.indiana.edu/hypermail/linux/kernel/0812.3/00438.html

I am using online nightly backups with LVM snapshots.

Comment 3 Chris 2009-04-09 08:27:13 UTC
I'm having the same issue.  I have a RH/Centos 5.3 dom0 with a Fedora 10 domU.  With no load the domU can run for hours without any problem, but once there is a heavy load the uninterruptible sleep (D) processes start popping up.  Then just about any command that requires disk access results in a frozen shell, and eventually the whole domU just locks up.

I'll see if I can come up with an easy way to reproduce this.  I suspect that initiating a large file copy may do it.

Comment 4 Jan ONDREJ 2009-04-09 09:04:25 UTC
(In reply to comment #3)
> a heavy load the uninterruptible sleep (D) processes start popping up.  Then
> just about any command that requires disk access results in a frozen shell, and
> eventually the whole domU just locks up.

Are you sure that all processes that write freeze? In my situation only processes that try to sync data to disk freeze. For example, the command:
  dmesg > /tmp/dmesg
ends without a problem and I can see the /tmp/dmesg file with all its content OK, but after a reboot this file is missing. It does not matter whether it's /tmp, /root or another directory; all writes are saved to cache and do not go to disk.
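
One way to watch the "writes stay in cache" behaviour described above is to track the Dirty and Writeback counters in /proc/meminfo. This is only a sketch: the helper name dirty_kib is invented here, and a steadily growing Dirty value is a hint, not proof, that data is not reaching the disk.

```python
def dirty_kib(meminfo_text):
    """Return the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty": None, "Writeback": None}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        # Exact match on the field name, so "WritebackTmp" is not picked up.
        if name in wanted and rest.split():
            wanted[name] = int(rest.split()[0])
    return wanted


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(dirty_kib(f.read()))
```

A field that is missing from the input is reported as None rather than 0, so a parsing problem is not mistaken for "nothing dirty".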

> I'll see if I can come up with an easy way to reproduce this.  I suspect that
> initiating a large file copy may do it.  

Good luck, I can't.

My problem occurs only on 2 machines with the same hardware. Maybe there is something wrong with this hardware. I plan to update all BIOSes (RAID and motherboard) on these servers, but I have to coordinate this with the hardware supplier.

Other servers with different hardware work well.

Attaching my lshw configuration.

Comment 5 Jan ONDREJ 2009-04-09 09:05:00 UTC
Created attachment 338883 [details]
Hardware configuration

Comment 6 Mark McLoughlin 2009-04-09 15:44:18 UTC
Jan/Chris - I wonder could either of you try an F11Beta guest and see if it has the same problem?

Comment 7 Jan ONDREJ 2009-04-14 07:50:42 UTC
Created attachment 339442 [details]
oops from console

Comment 8 Jan ONDREJ 2009-04-14 07:57:34 UTC
After approx. 3 days my server was down again. Here are the currently used packages:

Apr 11 06:48:00 Updated: kernel-firmware-2.6.29.1-54.fc11.noarch
Apr 11 06:48:22 Installed: kernel-PAE-2.6.29.1-54.fc11.i686
Apr 11 06:48:25 Installed: kernel-PAE-2.6.29.1-54.fc11.i686

For 3 days there were no problems and everything worked well. I also logged the "sync" run time, and it stayed between 0 and 3 sec from start until the hang.
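
A sketch of how such a "sync" run-time measurement could be scripted is shown here; it is not the script actually used on this server. The name time_command and the 60-second deadline are assumptions. The helper polls instead of waiting, because a process stuck in state "D" ignores signals and a blocking wait would hang the monitor as well.

```python
import subprocess
import time


def time_command(cmd, deadline_s, poll_s=0.05):
    """Run cmd; return its run time in seconds, or None if it is still
    running when deadline_s has passed."""
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    while time.monotonic() - start < deadline_s:
        if proc.poll() is not None:
            return time.monotonic() - start
        time.sleep(poll_s)
    # Do not wait() here: a process in uninterruptible sleep cannot be
    # killed, so waiting for it would block this script too.
    proc.kill()
    return None


if __name__ == "__main__":
    elapsed = time_command(["sync"], deadline_s=60)
    print("sync hung" if elapsed is None else "sync took %.1f s" % elapsed)
```

Note that subprocess.run() with a timeout is deliberately avoided, since on POSIX it kills the child and then waits for it to exit.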

After the hang, my "xm con" displayed many oopses (all that could be captured are attached):

BUG: soft lockup - CPU#1 stuck for 61s! [crond:3288]
Modules linked in: ipv6 xen_netfront pcspkr xen_blkfront

Pid: 3288, comm: crond Tainted: G      D    (2.6.29.1-54.fc11.i686.PAE #1)
EIP: 0061:[<c04023a7>] EFLAGS: 00000206 CPU: 1
EIP is at _stext+0x3a7/0x1000
EAX: 00000000 EBX: 00000003 ECX: cb8a9bdc EDX: cb8a9bec
ESI: e04a8790 EDI: 0d1dc961 EBP: cb8a9bfc ESP: cb8a9bd8
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
CR0: 8005003b CR2: b6106000 CR3: 1dfd7000 CR4: 00002620
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000000
Call Trace:
 [<c05c37fa>] ? xen_poll_irq+0x45/0x55
 [<c0407613>] xen_spin_lock_slow+0x131/0x1f1
 [<c0407780>] __xen_spin_lock+0xad/0xdf
 [<c04077cc>] xen_spin_lock+0xa/0xc
 [<c0715e51>] _spin_lock+0xd/0x10
 [<c049b910>] page_referenced+0x58/0x11b
 [<c048b352>] shrink_active_list+0x13f/0x313
 [<c0488686>] ? get_dirty_limits+0x21/0x2c4
 [<c043cc6f>] ? do_softirq+0x68/0x7e
 [<c040b00e>] ? do_IRQ+0x97/0xad
 [<c048c1c6>] shrink_zone+0x292/0x2a4
 [<c048cd76>] do_try_to_free_pages+0x1ee/0x31b
 [<c048cf87>] try_to_free_pages+0x62/0x6a
 [<c048afb3>] ? isolate_pages_global+0x0/0x199
 [<c04876ef>] __alloc_pages_internal+0x22f/0x386
 [<c04264d9>] pte_alloc_one+0x1c/0x3f
 [<c0493b9b>] __pte_alloc+0x16/0xaf
 [<c0494740>] copy_page_range+0x1b5/0x52f
 [<c0422d2d>] ? pvclock_clocksource_read+0x4e/0xd8
 [<c0422d2d>] ? pvclock_clocksource_read+0x4e/0xd8
 [<c0436313>] dup_mm+0x21b/0x2e4
 [<c0436d5b>] copy_process+0x952/0x102e
 [<c043754f>] do_fork+0x118/0x2a7
 [<c0407e32>] sys_clone+0x24/0x26
 [<c040955e>] syscall_call+0x7/0xb
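
When a console capture contains many of these reports, a small parser helps to see which CPUs and commands are affected. The sketch below matches only the "BUG: soft lockup" header line in the format shown above; the names LOCKUP_RE and parse_lockups are invented for this example.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 61s! [crond:3288]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(console_text):
    """Return one dict per soft-lockup header found in the console text."""
    reports = []
    for m in LOCKUP_RE.finditer(console_text):
        reports.append({
            "cpu": int(m.group("cpu")),
            "seconds": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        })
    return reports


if __name__ == "__main__":
    import sys
    for report in parse_lockups(sys.stdin.read()):
        print(report)
```

Feeding it a saved console log on standard input prints one line per lockup report.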

Chris, can you try updating the firmware on your machine?
Can you also attach your hardware configuration (use lshw or at least dmidecode on dom0)?

Comment 9 Mark McLoughlin 2009-04-19 15:11:01 UTC
(In reply to comment #7)
> Created an attachment (id=339442) [details]
> oops from console  

This looks like a completely separate issue? Please file another bug report.

So, you didn't see any uninterruptible sleep processes with the 2.6.29 guest?

Comment 10 Jan ONDREJ 2009-04-19 15:25:11 UTC
> This looks like a completely separate issue? Please file another bug report

But I can't test any more with the FC11 development kernel. It's a production machine.

> So, you didn't see any uninterruptible sleep process with the 2.6.29 guest?  

With the older kernel it was normal for this server to work for 2-20 days.
I can't say whether it works until it has run for at least 7 days on this server and maybe another 7 days on the other one, where I can't test this before it succeeds on server 1.

Comment 11 Jan ONDREJ 2009-05-15 12:26:26 UTC
2 days after the BIOS update my virtual machine is dead again. This is on the virtual console:

BUG: soft lockup - CPU#2 stuck for 61s! [smtp:12481]
Modules linked in: ipv6 pcspkr xen_netfront xen_blkfront [last unloaded: scsi_wait_scan]

Pid: 12481, comm: smtp Tainted: G      D   (2.6.27.21-170.2.56.fc10.i686.PAE #1)
EIP: 0061:[<c04023a7>] EFLAGS: 00200202 CPU: 2
EIP is at _stext+0x3a7/0x1000
EAX: 00000000 EBX: 00000003 ECX: ea87ce60 EDX: 00000010
ESI: ecd4e16c EDI: 00000000 EBP: ea87ce80 ESP: ea87ce5c
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
CR0: 8005003b CR2: 00195fb0 CR3: 00816000 CR4: 00002620
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000000
 [<c057978e>] ? xen_poll_irq+0x40/0x50
 [<c040696d>] xen_spin_lock_slow+0x57/0x91
 [<c04069d8>] xen_spin_lock+0x31/0x659
 [<c06b71a9>] _spin_lock+0x8/0xb
 [<c0485a8e>] unlink_file_vma+0x1d/0x6d
 [<c048446a>] free_pgtables+0x4e/0x94
 [<c0485722>] exit_mmap+0x89/0xe5
 [<c0430e17>] mmput+0x37/0x86
 [<c043420b>] exit_mm+0xeb/0xf3
 [<c0435ad4>] do_exit+0x1cc/0x744
 [<c0466d1e>] ? audit_syscall_entry+0xf9/0x123
 [<c04360bc>] do_group_exit+0x70/0x97
 [<c04360f6>] sys_exit_group+0x13/0x17
 [<c0408c8a>] syscall_call+0x7/0xb
 =======================

I have lots of similar messages; I can attach them all if requested.
Is this a different problem too? Is it the same problem as in attachment https://bugzilla.redhat.com/attachment.cgi?id=339442 ?
If yes, I can test this and report a new bug.
What should I put in the subject for this bug? I have no idea about similar bugs.
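
To compare two traces like the ones in comment 8 and comment 11, one rough approach is to extract the function names and look at what they share. The sketch below uses invented names (FRAME_RE, frames, common_frames) and by default skips frames marked "?", which the kernel prints for addresses found on the stack that it could not verify as part of the call chain. Shared frames only show where both CPUs were spinning, not why.

```python
import re

# Matches e.g. " [<c0407613>] xen_spin_lock_slow+0x131/0x1f1"
# and         " [<c05c37fa>] ? xen_poll_irq+0x45/0x55"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(trace_text, include_unreliable=False):
    """Return the function names from a kernel call trace, in order."""
    names = []
    for m in FRAME_RE.finditer(trace_text):
        if m.group(1) and not include_unreliable:
            continue  # frame marked '?': not a verified caller
        names.append(m.group(2))
    return names


def common_frames(trace_a, trace_b):
    """Return the frames of trace_a that also appear in trace_b."""
    in_b = set(frames(trace_b))
    return [name for name in frames(trace_a) if name in in_b]
```

For the two traces in this bug, the shared frames are the spinlock entry points (xen_spin_lock_slow, xen_spin_lock, _spin_lock), while the callers above them differ.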

Comment 12 Justin M. Forbes 2009-09-22 15:30:00 UTC
Is this still an issue with the 2.6.30 F11 kernels?

Comment 13 Jan ONDREJ 2009-09-22 16:02:56 UTC
I can't upgrade to F11 now.

My plans are to upgrade these servers to F11 host and F10 guest using KVM.

All my servers on those machines (2 hosts) are in production and I can't test unstable things.

Comment 14 Bug Zapper 2009-11-18 07:49:53 UTC
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 10 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Bug Zapper 2009-12-18 07:25:51 UTC
Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.