699684 – System freeze with 2.6.35.12-*.fc14.i686.PAE

Bug 699684 - System freeze with 2.6.35.12-*.fc14.i686.PAE

Summary: System freeze with 2.6.35.12-*.fc14.i686.PAE

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	14
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2011-04-26 11:12 UTC by Jan ONDREJ
Modified:	2011-09-06 23:58 UTC (History)
CC List:	5 users (show)
Fixed In Version:	kernel-2.6.35.14-96.fc14
Clone Of:
Environment:
Last Closed:	2011-09-06 23:58:57 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Backport of commit a79e53d85683c6dd9f99c90511028adc2043031f (5.10 KB, text/plain) 2011-06-23 11:23 UTC, Chuck Ebbert	no flags	Details
View All

Description Jan ONDREJ 2011-04-26 11:12:59 UTC

Description of problem:
After update to 2.6.35.12 kernel my kvm virtual machine (guest) hangs at random times.

Version-Release number of selected component (if applicable):
craching guest:
kernel-PAE-2.6.35.12-88.fc14.i686
kernel-PAE-2.6.35.12-90.fc14.i686
host:
kernel-2.6.34.8-68.fc13.x86_64
qemu-kvm-0.13.0-1.fc13.x86_64

How reproducible:
aprox. 5 crashes until now, 2 virtual machines

Steps to Reproduce:
Not sure how to reproduce this. For me it's enough to boot it and wait some hours.
  
Actual results:
Bacause system is dead, no ping to this machine, virtual VGA console is blank, serial console dead too, but after one hang I see this on serial console output:

[96758.097381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
[96758.097381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500 task.ti=dbbf8000)
[96758.097381] Stack:
[96758.097381] Call Trace:
[96758.097381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00 31 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90 f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
[96823.595381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
[96823.595381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500 task.ti=dbbf8000)
[96823.595381] Stack:
[96823.595381] Call Trace:
[96823.595381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00 31
 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90 f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d


Expected results:
no crash

Additional info:
What else I can do to debug this? Now one virtual machine is running 2 days, but second died 2 times yesterday.
After boot to 2.6.35.11 everything works well again, both 2 stable 2.6.35.12 kernels dies.

Comment 1 Chuck Ebbert 2011-04-27 01:22:33 UTC

(In reply to comment #1)
> [96758.097381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
> [96758.097381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500
> task.ti=dbbf8000)
> [96758.097381] Stack:
> [96758.097381] Call Trace:
> [96758.097381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00
> 31 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90
> f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
> [96823.595381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
> [96823.595381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500
> task.ti=dbbf8000)
> [96823.595381] Stack:
> [96823.595381] Call Trace:
> [96823.595381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00
> 31
>  c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90
> f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
> 

Boot the kernel with the "ignore_loglevel" kernel option to get the entire trace, and try to catch it on the serial console again.

Comment 2 Jan ONDREJ 2011-04-29 06:16:52 UTC

> Boot the kernel with the "ignore_loglevel" kernel option to get the entire
> trace, and try to catch it on the serial console again.

Nothing logged this time, only server was dead.
May be previous message was nothing with this bug.

How I can collect more information?

Comment 3 Jan ONDREJ 2011-05-03 10:42:38 UTC

Next crash. Whey trying to strace qemu-kvm process, only this is displayed repeatedly:

timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 210462}}) = 0
select(19, [4 7 10 14 15 17 18], [], [], {1, 0}) = 1 (in [15], left {0, 999997})
read(15, "\1\0\0\0\0\0\0\0", 512)       = 8
select(19, [4 7 10 14 15 17 18], [], [], {1, 0}) = 1 (in [17], left {0, 999917})
read(17, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4a55a0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f78adbfe4a0}, 8) = 0
write(16, "\1\0\0\0\0\0\0\0", 8)        = 8
read(17, 0x7fffc0ba65f0, 128)           = -1 EAGAIN (Resource temporarily unavailable)

ls -l /proc/PID/fd:

lr-x------. 1 qemu qemu 64 máj  3 12:41 0 -> /dev/null
l-wx------. 1 qemu qemu 64 máj  3 12:41 1 -> /var/log/libvirt/qemu/stats.log
lrwx------. 1 qemu qemu 64 máj  3 12:41 10 -> anon_inode:[signalfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 11 -> /dev/dm-70
lrwx------. 1 qemu qemu 64 máj  3 12:41 12 -> anon_inode:kvm-vcpu
lrwx------. 1 qemu qemu 64 máj  3 12:41 13 -> anon_inode:kvm-vcpu
lrwx------. 1 qemu qemu 64 máj  3 12:41 14 -> socket:[7424220]
lrwx------. 1 qemu qemu 64 máj  3 12:41 15 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 16 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 17 -> anon_inode:[signalfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 18 -> socket:[8167931]
l-wx------. 1 qemu qemu 64 máj  3 12:41 2 -> /var/log/libvirt/qemu/stats.log
lrwx------. 1 qemu qemu 64 máj  3 12:41 3 -> socket:[7424152]
lrwx------. 1 qemu qemu 64 máj  3 12:41 4 -> /dev/ptmx
lrwx------. 1 qemu qemu 64 máj  3 12:41 5 -> /dev/kvm
lrwx------. 1 qemu qemu 64 máj  3 12:41 6 -> anon_inode:kvm-vm
lrwx------. 1 qemu qemu 64 máj  3 12:41 63 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 64 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 65 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 66 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 7 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 8 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 9 -> /dev/dm-0

How to collect more data?

Comment 4 Chuck Ebbert 2011-05-03 20:47:28 UTC

Can you see if this still happens in 2.6.35.13-91 ?

Comment 5 Jan ONDREJ 2011-05-10 10:03:25 UTC

4 days uptime with this new kernel, so looks this was already fixed.
Closing bug until it happens again.

[root@www ~]# uptime
 12:02:11 up 4 days,  3:28,  1 user,  load average: 0.17, 0.23, 0.28
[root@www ~]# uname -a
Linux www.upjs.sk 2.6.35.13-91.fc14.i686 #1 SMP Tue May 3 13:36:36 UTC 2011 i686 i686 i386 GNU/Linux

Comment 6 Jan ONDREJ 2011-05-25 19:35:53 UTC

Looks like I tested it with non-PAE kernel and it works with standard 32bit non PAE kernel, but hangs again after switch to PAE.

This kernel died 2 times today:

Linux www.upjs.sk 2.6.35.13-91.fc14.i686.PAE #1 SMP Tue May 3 13:29:55 UTC 2011 i686 i686 i386 GNU/Linux

Reopening bug. How to collect more information?

Comment 7 Jan ONDREJ 2011-05-26 18:50:09 UTC

Looks like this is definitelly an PAE problem. All non-PAE 32bit and 64bit kernels works well.

My host system: 2.6.34.8-68.fc13.x86_64  (qemu virtualization)

Last working guest kernel: 2.6.35.11-83.fc14.i686.PAE

Problematic PAE kernels:
  kernel-PAE-2.6.35.12-90.fc14.i686
  kernel-PAE-2.6.35.13-91.fc14.i686

What changed in kernel-PAE-2.6.35.12 ?

Comment 8 Chuck Ebbert 2011-06-04 13:43:43 UTC

Apparently this is caused by:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=4981d01eada5354d81c8929d5b2836829ba3df7b

The above went in 2.6.35.12, but it looks like this commit is also needed:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=a79e53d85683c6dd9f99c90511028adc2043031f

That one got silently dropped because it needs backporting to 2.6.35.

Comment 9 Jan ONDREJ 2011-06-22 19:57:27 UTC

Any progress with this bug in current fc14 kernels?

Comment 10 Chuck Ebbert 2011-06-23 11:23:52 UTC

Created attachment 506192 [details]
Backport of commit a79e53d85683c6dd9f99c90511028adc2043031f

Comment 11 Jan ONDREJ 2011-07-26 07:11:20 UTC

Ping? Still nothing? Can I help?

Comment 12 Jan ONDREJ 2011-08-23 05:59:57 UTC

kernel-2.6.35.14-95.fc14 is out. Is this problem solved?

Comment 13 Chuck Ebbert 2011-08-24 14:45:52 UTC

This patch was not included because I forgot to change the bug status to show that a patch was available. Sorry about that. The fix will be in 2.6.35.14-96.

Comment 14 Fedora Update System 2011-09-01 15:25:08 UTC

kernel-2.6.35.14-96.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14

Comment 15 Fedora Update System 2011-09-02 05:30:01 UTC

Package kernel-2.6.35.14-96.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.35.14-96.fc14'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
then log in and leave karma (feedback).

Comment 16 Fedora Update System 2011-09-06 23:58:07 UTC

kernel-2.6.35.14-96.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.