Bug 699684

Summary: System freeze with 2.6.35.12-*.fc14.i686.PAE
Product: [Fedora] Fedora Reporter: Jan ONDREJ <ondrejj>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 14CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---Keywords: Reopened
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-2.6.35.14-96.fc14 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-06 23:58:57 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Backport of commit a79e53d85683c6dd9f99c90511028adc2043031f none

Description Jan ONDREJ 2011-04-26 11:12:59 UTC
Description of problem:
After update to 2.6.35.12 kernel my kvm virtual machine (guest) hangs at random times.

Version-Release number of selected component (if applicable):
craching guest:
kernel-PAE-2.6.35.12-88.fc14.i686
kernel-PAE-2.6.35.12-90.fc14.i686
host:
kernel-2.6.34.8-68.fc13.x86_64
qemu-kvm-0.13.0-1.fc13.x86_64

How reproducible:
aprox. 5 crashes until now, 2 virtual machines

Steps to Reproduce:
Not sure how to reproduce this. For me it's enough to boot it and wait some hours.
  
Actual results:
Bacause system is dead, no ping to this machine, virtual VGA console is blank, serial console dead too, but after one hang I see this on serial console output:

[96758.097381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
[96758.097381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500 task.ti=dbbf8000)
[96758.097381] Stack:
[96758.097381] Call Trace:
[96758.097381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00 31 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90 f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
[96823.595381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
[96823.595381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500 task.ti=dbbf8000)
[96823.595381] Stack:
[96823.595381] Call Trace:
[96823.595381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00 31
 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90 f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d


Expected results:
no crash

Additional info:
What else I can do to debug this? Now one virtual machine is running 2 days, but second died 2 times yesterday.
After boot to 2.6.35.11 everything works well again, both 2 stable 2.6.35.12 kernels dies.

Comment 1 Chuck Ebbert 2011-04-27 01:22:33 UTC
(In reply to comment #1)
> [96758.097381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
> [96758.097381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500
> task.ti=dbbf8000)
> [96758.097381] Stack:
> [96758.097381] Call Trace:
> [96758.097381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00
> 31 c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90
> f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
> [96823.595381] BUG: soft lockup - CPU#3 stuck for 61s! [httpd:32030]
> [96823.595381] Process httpd (pid: 32030, ti=dbbf8000 task=f50fe500
> task.ti=dbbf8000)
> [96823.595381] Stack:
> [96823.595381] Call Trace:
> [96823.595381] Code: 83 fa 03 74 05 83 fa 13 75 0f 8d 54 01 18 b9 02 00 00 00
> 31
>  c0 89 d7 f3 ab b8 01 00 00 00 5f 5d c3 55 89 e5 0f 1f 44 00 00 eb 02 <f3> 90
> f6 00 01 75 f9 5d c3 55 89 e5 53 0f 1f 44 00 00 89 c3 8d
> 

Boot the kernel with the "ignore_loglevel" kernel option to get the entire trace, and try to catch it on the serial console again.

Comment 2 Jan ONDREJ 2011-04-29 06:16:52 UTC
> Boot the kernel with the "ignore_loglevel" kernel option to get the entire
> trace, and try to catch it on the serial console again.

Nothing logged this time, only server was dead.
May be previous message was nothing with this bug.

How I can collect more information?

Comment 3 Jan ONDREJ 2011-05-03 10:42:38 UTC
Next crash. Whey trying to strace qemu-kvm process, only this is displayed repeatedly:

timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 250000}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 210462}}) = 0
select(19, [4 7 10 14 15 17 18], [], [], {1, 0}) = 1 (in [15], left {0, 999997})
read(15, "\1\0\0\0\0\0\0\0", 512)       = 8
select(19, [4 7 10 14 15 17 18], [], [], {1, 0}) = 1 (in [17], left {0, 999917})
read(17, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0"..., 128) = 128
rt_sigaction(SIGALRM, NULL, {0x4a55a0, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f78adbfe4a0}, 8) = 0
write(16, "\1\0\0\0\0\0\0\0", 8)        = 8
read(17, 0x7fffc0ba65f0, 128)           = -1 EAGAIN (Resource temporarily unavailable)

ls -l /proc/PID/fd:

lr-x------. 1 qemu qemu 64 máj  3 12:41 0 -> /dev/null
l-wx------. 1 qemu qemu 64 máj  3 12:41 1 -> /var/log/libvirt/qemu/stats.log
lrwx------. 1 qemu qemu 64 máj  3 12:41 10 -> anon_inode:[signalfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 11 -> /dev/dm-70
lrwx------. 1 qemu qemu 64 máj  3 12:41 12 -> anon_inode:kvm-vcpu
lrwx------. 1 qemu qemu 64 máj  3 12:41 13 -> anon_inode:kvm-vcpu
lrwx------. 1 qemu qemu 64 máj  3 12:41 14 -> socket:[7424220]
lrwx------. 1 qemu qemu 64 máj  3 12:41 15 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 16 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 17 -> anon_inode:[signalfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 18 -> socket:[8167931]
l-wx------. 1 qemu qemu 64 máj  3 12:41 2 -> /var/log/libvirt/qemu/stats.log
lrwx------. 1 qemu qemu 64 máj  3 12:41 3 -> socket:[7424152]
lrwx------. 1 qemu qemu 64 máj  3 12:41 4 -> /dev/ptmx
lrwx------. 1 qemu qemu 64 máj  3 12:41 5 -> /dev/kvm
lrwx------. 1 qemu qemu 64 máj  3 12:41 6 -> anon_inode:kvm-vm
lrwx------. 1 qemu qemu 64 máj  3 12:41 63 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 64 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 65 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 66 -> /dev/net/tun
lrwx------. 1 qemu qemu 64 máj  3 12:41 7 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 8 -> anon_inode:[eventfd]
lrwx------. 1 qemu qemu 64 máj  3 12:41 9 -> /dev/dm-0

How to collect more data?

Comment 4 Chuck Ebbert 2011-05-03 20:47:28 UTC
Can you see if this still happens in 2.6.35.13-91 ?

Comment 5 Jan ONDREJ 2011-05-10 10:03:25 UTC
4 days uptime with this new kernel, so looks this was already fixed.
Closing bug until it happens again.

[root@www ~]# uptime
 12:02:11 up 4 days,  3:28,  1 user,  load average: 0.17, 0.23, 0.28
[root@www ~]# uname -a
Linux www.upjs.sk 2.6.35.13-91.fc14.i686 #1 SMP Tue May 3 13:36:36 UTC 2011 i686 i686 i386 GNU/Linux

Comment 6 Jan ONDREJ 2011-05-25 19:35:53 UTC
Looks like I tested it with non-PAE kernel and it works with standard 32bit non PAE kernel, but hangs again after switch to PAE.

This kernel died 2 times today:

Linux www.upjs.sk 2.6.35.13-91.fc14.i686.PAE #1 SMP Tue May 3 13:29:55 UTC 2011 i686 i686 i386 GNU/Linux

Reopening bug. How to collect more information?

Comment 7 Jan ONDREJ 2011-05-26 18:50:09 UTC
Looks like this is definitelly an PAE problem. All non-PAE 32bit and 64bit kernels works well.

My host system: 2.6.34.8-68.fc13.x86_64  (qemu virtualization)

Last working guest kernel: 2.6.35.11-83.fc14.i686.PAE

Problematic PAE kernels:
  kernel-PAE-2.6.35.12-90.fc14.i686
  kernel-PAE-2.6.35.13-91.fc14.i686

What changed in kernel-PAE-2.6.35.12 ?

Comment 8 Chuck Ebbert 2011-06-04 13:43:43 UTC
Apparently this is caused by:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=4981d01eada5354d81c8929d5b2836829ba3df7b

The above went in 2.6.35.12, but it looks like this commit is also needed:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=a79e53d85683c6dd9f99c90511028adc2043031f

That one got silently dropped because it needs backporting to 2.6.35.

Comment 9 Jan ONDREJ 2011-06-22 19:57:27 UTC
Any progress with this bug in current fc14 kernels?

Comment 10 Chuck Ebbert 2011-06-23 11:23:52 UTC
Created attachment 506192 [details]
Backport of commit a79e53d85683c6dd9f99c90511028adc2043031f

Comment 11 Jan ONDREJ 2011-07-26 07:11:20 UTC
Ping? Still nothing? Can I help?

Comment 12 Jan ONDREJ 2011-08-23 05:59:57 UTC
kernel-2.6.35.14-95.fc14 is out. Is this problem solved?

Comment 13 Chuck Ebbert 2011-08-24 14:45:52 UTC
This patch was not included because I forgot to change the bug status to show that a patch was available. Sorry about that. The fix will be in 2.6.35.14-96.

Comment 14 Fedora Update System 2011-09-01 15:25:08 UTC
kernel-2.6.35.14-96.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14

Comment 15 Fedora Update System 2011-09-02 05:30:01 UTC
Package kernel-2.6.35.14-96.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-2.6.35.14-96.fc14'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/kernel-2.6.35.14-96.fc14
then log in and leave karma (feedback).

Comment 16 Fedora Update System 2011-09-06 23:58:07 UTC
kernel-2.6.35.14-96.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.