464449 – semi crash ? related to 2.6.27-0.352.rc7.git1.fc10.x86_64 ?

Summary: semi crash ? related to 2.6.27-0.352.rc7.git1.fc10.x86_64 ?

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	10
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Airlie
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-09-29 07:31 UTC by Donald Cohen
Modified:	2009-01-14 04:18 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-14 04:18:47 UTC
Type:	---
Embargoed:
Dependent Products:

Description Donald Cohen 2008-09-29 07:31:56 UTC

Description of problem:
I've recently downloaded .352 and seen the following twice today.
As you'll see, it's hard to tell exactly what's going on; if you have any ideas about how to find out more, let me know.
I've now rebooted to the previous .323 to see whether it happens there (and if not to get a little more stability).
My current mode of operation is to boot to runlevel 3, log in as root, and run startx.
The first sign of trouble is that my system monitor gnome applet (I'm using gnome) dies.
I get a popup window saying it died, but that if I restart (something - don't remember exactly what) it will come back. I click on the restart option.
I don't recall seeing it return, but shortly after that I lost contact with the machine: the mouse didn't move, and I think the keyboard was also not responding. I tried things like Ctrl-Alt-F1 (c-a-f1) with no effect.
I was able to ping the machine from another machine, but could not ssh into it.
I couldn't think of anything better than rebooting. On doing so, I saw no trace of the problems in the logs.

Probably unrelated to the above, I now notice (in .323) that I can no longer do a startx from terminal 1, switch back to terminal 1 (c-a-f1), and then return to X with c-a-f7; that kills X. Back in terminal 1 I see:
window manager warning: Fatal IO error 11 (resource temporarily unavailable) on display ':0.0'.
xinit: connection to X server lost
BTW I do now have a core file (50M) for this, if that will help.

How reproducible:
Don't know - I hope to never see it again but it has happened twice today.

Comment 1 Donald Cohen 2008-09-29 15:02:06 UTC

Some more info about c-a-f7 - killing X
- it also happens in .352
- I did startx 2>> startx.log and captured this:
Backtrace:
0: /usr/bin/X(xf86SigHandler+0x65) [0x47a105]
1: /lib64/libc.so.6 [0x360d033130]
2: /usr/lib64/xorg/modules/drivers//radeon_drv.so [0x1dcb66c]
3: /usr/lib64/xorg/modules/drivers//radeon_drv.so(radeon_update_dri_buffers+0xf2) [0x1dcb822]
4: /usr/lib64/xorg/modules/drivers//radeon_drv.so(RADEONEnterVT+0x79) [0x1da1cf9]
5: /usr/bin/X [0x481bc2]
6: /usr/bin/X(xf86Wakeup+0x444) [0x47ae54]
7: /usr/bin/X(WakeupHandler+0x4b) [0x44a47b]
8: /usr/bin/X(WaitForSomething+0x1ef) [0x4e480f]
9: /usr/bin/X(Dispatch+0x7f) [0x4465ff]
10: /usr/bin/X(main+0x45d) [0x42ccbd]
11: /lib64/libc.so.6(__libc_start_main+0xe6) [0x360d01e566]
12: /usr/bin/X [0x42c099]

Fatal server error:
Caught signal 11.  Server aborting

====
After booting to .323 I left the machine overnight and found it unresponsive, but with the caps lock light flashing. Again there was nothing relevant in the logs, so I imagine it could be the same problem, although in this case I couldn't even get a response from ping.
I wonder whether it's related to suspend/hibernate.
I think neither of these is supposed to happen automatically but they don't seem to work.  That is, suspend seems to turn the machine off but the power button doesn't return it to the previous state.
For that matter there are lots of other problems, such as sound, odd mouse artifacts, but I wouldn't expect those to affect sshd or ping.

Comment 2 Donald Cohen 2008-09-30 05:01:08 UTC

Just got the freeze again in .352, but this time I can ssh in.
See what you make of this:
$ tail /var/log/messages
Sep 29 21:38:32 number11 kernel: ALSA sound/pci/hda/hda_intel.c:1404: azx_pcm_prepare: bufsize=0x10000, format=0x4011
Sep 29 21:38:32 number11 kernel: ALSA sound/pci/hda/hda_codec.c:716: hda_codec_setup_stream: NID=0x21, stream=0x5, channel=0, format=0x4011
Sep 29 21:38:32 number11 kernel: ALSA sound/pci/hda/hda_codec.c:716: hda_codec_setup_stream: NID=0x10, stream=0x5, channel=0, format=0x4011
Sep 29 21:38:32 number11 kernel: ALSA sound/pci/hda/hda_codec.c:716: hda_codec_setup_stream: NID=0x11, stream=0x5, channel=0, format=0x4011
Sep 29 21:44:52 number11 init: tty2 main process (2737) killed by TERM signal
Sep 29 21:44:52 number11 init: tty5 main process (2736) killed by TERM signal
Sep 29 21:44:52 number11 init: tty3 main process (2738) killed by TERM signal
Sep 29 21:44:52 number11 init: tty6 main process (2740) killed by TERM signal
Sep 29 21:44:52 number11 init: tty4 main process (2735) killed by TERM signal
Sep 29 21:44:52 number11 init: rc6 main process (15608) killed by TERM signal


The clock on the screen reads 21:44:23.
In this case I again got the popup window, which I've now recorded:
 "System Monitor" has quit unexpectedly 
 If you reload a panel object, it will automatically be added back to 
 the panel 
 [Don't Reload]  [Reload]
But I studiously ignored it.  Everything (other than the System Monitor applet) continued to work for several more hours.
Can't think of anything to do now but reboot...

Comment 3 Christopher D. Stover 2008-11-15 22:22:01 UTC

Are you still experiencing this issue in newer kernels, Donald?

Comment 4 Donald Cohen 2008-11-16 00:35:51 UTC

At the moment I don't even have that machine.  I sent it back for repair because I suspect there's a hardware component to this problem, but I can't tell what it is.  So far I don't think the manufacturer (HP) has been able to find anything wrong.
Do you have any idea what causes things like the caps lock flashing?
Is this even under control of the OS ?
I hope to get it back soon and then I will try at least c-a-f7 - I have not tried that since I sent the last report.
I've not seen the monitor quitting recently.
There are a lot of problems here.  If you have any idea what single hardware problem could be causing them, and even better, how to test it, please let me know.  If there's anything in particular you want me to try, also let me know.
When I get the machine back I'll do a yum update (if there's a newer fedora release I'll start from that) and then wait to see what happens.  Should I report any problems (or lack thereof) back to here?

Comment 5 Christopher D. Stover 2008-11-16 04:03:09 UTC

You certainly do have a variety of problems...  There seem to be a lot of problems related to Radeon cards and the X server.  Have you tried booting the kernel with the nomodeset option?  I think this has been used as a workaround by some people.  There is also a known issue with people being unable to resume from standby.  You may want to look at https://bugzilla.redhat.com/show_bug.cgi?id=464896 (I haven't read that entire bug report).  The video artifacts you mention usually lead me to believe there is a problem with the video card, but it's obviously difficult to narrow it down to that with so many problems.  Finally, I believe the OS could make your caps lock light blink on and off.  I don't want to mark this as a duplicate or assign it until we can find out more when you get your computer back.  I wish I could help you more, but let me know how things go when you try again.

Comment 6 Bug Zapper 2008-11-26 03:16:02 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle.
Changing version to '10'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 7 Donald Cohen 2008-12-10 22:39:08 UTC

I just got the machine back.  I've upgraded to Fedora 10 and done a yum update.
I still have the same problems.  I still suspect hardware, but HP can't find anything wrong.  I'm inclined to ask for another machine to see whether it works there.  In the meantime, I see various interesting things in the logs.
Here's one:
Dec 10 13:29:13 number11 kernel: ------------[ cut here ]------------
Dec 10 13:29:13 number11 kernel: WARNING: at lib/list_debug.c:51 list_del+0x64/0x85()
Dec 10 13:29:13 number11 kernel: list_del corruption. next->prev should be ffffe20000578028, but was 0600000011000000
Dec 10 13:29:13 number11 kernel: Modules linked in: fuse bridge stp bnep sco l2cap bluetooth sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand powernow_k8 freq_table dm_multipath uinput snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq arc4 snd_seq_device ecb snd_pcm_oss crypto_blkcipher snd_mixer_oss snd_pcm ath5k uvcvideo ata_generic snd_timer pata_acpi snd_page_alloc r8169 snd_hwdep compat_ioctl32 mac80211 shpchp i2c_piix4 videodev snd i2c_core pata_atiixp mii v4l1_compat usb_storage pcspkr soundcore video joydev wmi cfg80211 output battery ac [last unloaded: scsi_wait_scan]
Dec 10 13:29:13 number11 kernel: Pid: 28749, comm: yum Not tainted 2.6.27.5-117.fc10.x86_64 #1
Dec 10 13:29:13 number11 kernel:
Dec 10 13:29:13 number11 kernel: Call Trace:
Dec 10 13:29:13 number11 kernel: [<ffffffff81041623>] warn_slowpath+0x8c/0xb5
Dec 10 13:29:13 number11 kernel: [<ffffffff81112664>] ? ext3_mark_iloc_dirty+0x2b4/0x31f
Dec 10 13:29:13 number11 kernel: [<ffffffff81097a8a>] ? mark_page_accessed+0x5a/0x66
Dec 10 13:29:13 number11 kernel: [<ffffffff810550bb>] ? bit_waitqueue+0x12/0x9f
Dec 10 13:29:13 number11 kernel: [<ffffffff81055194>] ? wake_up_bit+0x1e/0x23
Dec 10 13:29:13 number11 kernel: [<ffffffff81122e4c>] ? do_get_write_access+0x3c4/0x404
Dec 10 13:29:13 number11 kernel: [<ffffffff81121d38>] ? journal_dirty_metadata+0x3d/0xef
Dec 10 13:29:13 number11 kernel: [<ffffffff811708a8>] list_del+0x64/0x85
Dec 10 13:29:13 number11 kernel: [<ffffffff8109281a>] __rmqueue_smallest+0x81/0x14d
Dec 10 13:29:13 number11 kernel: [<ffffffff81092905>] __rmqueue+0x1f/0x1de
Dec 10 13:29:13 number11 kernel: [<ffffffff81112333>] ? brelse+0xe/0x10
Dec 10 13:29:13 number11 kernel: [<ffffffff81092b10>] rmqueue_bulk+0x4c/0x99
Dec 10 13:29:13 number11 kernel: [<ffffffff81094628>] get_page_from_freelist+0x373/0x6a8
Dec 10 13:29:13 number11 kernel: [<ffffffff81331c7e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
Dec 10 13:29:13 number11 kernel: [<ffffffff81046a8f>] ? _local_bh_enable+0x96/0xab
Dec 10 13:29:13 number11 kernel: [<ffffffff81094c93>] __alloc_pages_internal+0xfe/0x457
Dec 10 13:29:13 number11 kernel: [<ffffffff810b23ae>] alloc_pages_current+0xb9/0xc2
Dec 10 13:29:13 number11 kernel: [<ffffffff8108ecf3>] __page_cache_alloc+0x67/0x6c
Dec 10 13:29:13 number11 kernel: [<ffffffff8108ee0f>] __grab_cache_page+0x39/0x7b
Dec 10 13:29:13 number11 kernel: [<ffffffff81113f57>] ext3_write_begin+0x65/0x1a4
Dec 10 13:29:13 number11 kernel: [<ffffffff8116acc3>] ? __up_read+0x7a/0x83
Dec 10 13:29:13 number11 kernel: [<ffffffff811209d5>] ? ext3_xattr_get+0x1e1/0x262
Dec 10 13:29:13 number11 kernel: [<ffffffff8108f8a2>] generic_file_buffered_write+0x14b/0x638
Dec 10 13:29:13 number11 kernel: [<ffffffff810d721a>] ? mnt_drop_write+0x82/0x143
Dec 10 13:29:13 number11 kernel: [<ffffffff810d5545>] ? mnt_want_write+0x77/0x8d
Dec 10 13:29:13 number11 kernel: [<ffffffff810901a3>] __generic_file_aio_write_nolock+0x25e/0x292
Dec 10 13:29:13 number11 kernel: [<ffffffff8129979a>] ? sock_recvmsg+0xca/0xe3
Dec 10 13:29:13 number11 kernel: [<ffffffff81090955>] generic_file_aio_write+0x67/0xc3
Dec 10 13:29:13 number11 kernel: [<ffffffff81110f43>] ext3_file_write+0x1e/0x9f
Dec 10 13:29:13 number11 kernel: [<ffffffff810bfc9e>] do_sync_write+0xe7/0x12d
Dec 10 13:29:13 number11 kernel: [<ffffffff81055199>] ? autoremove_wake_function+0x0/0x38
Dec 10 13:29:13 number11 kernel: [<ffffffff8103e2ad>] ? finish_task_switch+0x31/0xc9
Dec 10 13:29:13 number11 kernel: [<ffffffff81142a3b>] ? selinux_file_permission+0xaf/0xb8
Dec 10 13:29:13 number11 kernel: [<ffffffff8113b85c>] ? security_file_permission+0x11/0x13
Dec 10 13:29:13 number11 kernel: [<ffffffff810c055a>] vfs_write+0xab/0x105
Dec 10 13:29:13 number11 kernel: [<ffffffff810c0678>] sys_write+0x47/0x6f
Dec 10 13:29:13 number11 kernel: [<ffffffff8101024a>] system_call_fastpath+0x16/0x1b
Dec 10 13:29:13 number11 kernel:
Dec 10 13:29:13 number11 kernel: ---[ end trace 4c26e42aff35e489 ]---

Here's another 
Dec 10 12:39:50 number11 kernel: SELinux: WARNING: inside open_file_mask_to_av with unknown mode:c1b6
I see a lot of those.
Some with other modes.

And these
Dec 10 10:33:05 number11 kernel: usb 1-4: reset high speed USB device using ehci_hcd and address 3
Dec 10 10:33:20 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:33:35 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:33:35 number11 kernel: usb 1-4: reset high speed USB device using ehci_hcd and address 3
Dec 10 10:33:46 number11 kernel: SELinux: WARNING: inside open_file_mask_to_av with unknown mode:c1b6
Dec 10 10:33:50 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:34:06 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:34:06 number11 kernel: usb 1-4: reset high speed USB device using ehci_hcd and address 3
Dec 10 10:34:16 number11 kernel: usb 1-4: device not accepting address 3, error -110
Dec 10 10:34:16 number11 kernel: usb 1-4: reset high speed USB device using ehci_hcd and address 3
Dec 10 10:34:27 number11 kernel: usb 1-4: device not accepting address 3, error -110
Dec 10 10:34:27 number11 kernel: usb 1-4: USB disconnect, address 3
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: Device offlined - not ready after error recovery
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] READ CAPACITY failed
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Sense not available.
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Write Protect is off
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Assuming drive cache: write through
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] READ CAPACITY failed
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Sense not available.
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: rejecting I/O to dead device
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Write Protect is off
Dec 10 10:34:27 number11 kernel: scsi 6:0:0:0: [sdb] Assuming drive cache: write through
Dec 10 10:34:27 number11 kernel: usb 1-4: new high speed USB device using ehci_hcd and address 4
Dec 10 10:34:42 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:34:57 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:34:57 number11 kernel: usb 1-4: new high speed USB device using ehci_hcd and address 5
Dec 10 10:35:13 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:35:28 number11 kernel: usb 1-4: device descriptor read/64, error -110
Dec 10 10:35:28 number11 kernel: usb 1-4: new high speed USB device using ehci_hcd and address 6
Dec 10 10:35:38 number11 kernel: usb 1-4: device not accepting address 6, error -110
Dec 10 10:35:38 number11 kernel: usb 1-4: new high speed USB device using ehci_hcd and address 7
Dec 10 10:35:49 number11 kernel: usb 1-4: device not accepting address 7, error -110
Dec 10 10:35:49 number11 kernel: hub 1-0:1.0: unable to enumerate USB device on port 4
Dec 10 10:35:49 number11 kernel: usb 4-1: new full speed USB device using ohci_hcd and address 2
Dec 10 10:35:49 number11 kernel: usb 4-1: not running at top speed; connect to a high speed hub
Dec 10 10:35:49 number11 kernel: usb 4-1: configuration #1 chosen from 1 choice
Dec 10 10:35:49 number11 kernel: scsi9 : SCSI emulation for USB Mass Storage devices
Dec 10 10:35:49 number11 kernel: usb 4-1: New USB device found, idVendor=0bda, idProduct=0158
Dec 10 10:35:49 number11 kernel: usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Dec 10 10:35:49 number11 kernel: usb 4-1: Product: USB2.0-CRW
Dec 10 10:35:49 number11 kernel: usb 4-1: Manufacturer: Generic
Dec 10 10:35:49 number11 kernel: usb 4-1: SerialNumber: 20071114173400000
Dec 10 10:35:54 number11 kernel: scsi 9:0:0:0: Direct-Access     Generic- Multi-Card       1.00 PQ: 0 ANSI: 0 CCS
Dec 10 10:35:54 number11 kernel: sd 9:0:0:0: [sdb] Attached SCSI removable disk
Dec 10 10:35:54 number11 kernel: sd 9:0:0:0: Attached scsi generic sg2 type 0

The most reliable way to get a crash seems to be yum update.
I'll stop here and commit before I try that again.
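
For reading the USB lines above: the kernel reports failures as negative errno values, and on Linux 110 is ETIMEDOUT, i.e. the device on port 1-4 did not answer in time. A minimal sketch for decoding such codes follows; the helper name decode_kernel_errno is made up for illustration, and the mapping it prints is that of whatever platform runs it.

```python
import errno
import os


def decode_kernel_errno(code):
    """Map a kernel-style negative error code (e.g. -110) to (name, message)."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # "error -110" is taken from the usb 1-4 lines in the log.
    print(decode_kernel_errno(-110))
```

A timeout during descriptor reads fits either a flaky device or a flaky port, so on its own it does not settle the hardware-versus-software question.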

Comment 8 Donald Cohen 2008-12-11 18:35:56 UTC

I've now tried adding nomodeset to the end of the line starting with kernel ...
Is this the right place?  I can't find any doc that describes it.
Anyhow, things seemed to work for a while (which is often true anyway), but I now get a popup window saying 
 The panel encountered a problem while loading "OAFIID:GNOME_CPUFreqApplet".
 Do you want to delete the applet from your configuration?
I clicked on "don't delete", and now the mouse moves, but clicking or dragging does nothing.  The clock has disappeared, so I can't tell whether it would still be running.
The machine does respond to ping, but not to ssh.
c-a-f1 clears the screen, but the normal terminal does not appear (just black), and at that point no other c-a-fx has any visible effect.
On reboot I see nothing interesting in the log.
The panel problem, btw, is similar to what I've seen before (see earlier comments) and seems a good early warning that a crash is on the way -- especially if I respond to the popup.
So my current impression is that nomodeset isn't making much difference.
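
On the "is this the right place?" question: the end of the kernel line is the usual place for boot parameters, and one way to confirm that the option reached the running kernel is to look at /proc/cmdline after rebooting. A minimal sketch, assuming a Linux system; the function names are made up for illustration.

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as a `name=value` token."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    # /proc/cmdline holds the parameters the running kernel was booted with.
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        print("nomodeset active:", has_kernel_param(read_cmdline(), "nomodeset"))
    except OSError as exc:
        print("could not read kernel command line:", exc)
```

If this prints False after a reboot, the option was most likely added to a different boot entry than the one that was actually booted.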

Comment 9 Donald Cohen 2008-12-11 18:48:10 UTC

Even better yet, the problem above seems to be somewhat reproducible!
I tried running the same thing as before (Eclipse, then start a Java app and do the same thing in it): no popup from the panel, but I did see the panel disappear and then reappear, and now I'm in the same state.
So what can I do to get some more info about what's going on just before the crash?
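
Since nothing relevant reaches the logs before the freeze, one possible approach is to record a small heartbeat to disk every few seconds and force each sample out with fsync, so the last entry shows roughly when the machine stopped and how much memory was free. This is only a sketch under assumptions: the file name, the interval, and the choice of MemFree and SwapFree are arbitrary, and a hard hang can still lose the final sample.

```python
import os
import time


def parse_meminfo(text, keys=("MemFree", "SwapFree")):
    """Extract selected fields (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if name in keys and fields:
            values[name] = int(fields[0])
    return values


def log_heartbeat(log_path, meminfo_path="/proc/meminfo"):
    """Append one timestamped sample and force it to disk."""
    with open(meminfo_path) as handle:
        sample = parse_meminfo(handle.read())
    with open(log_path, "a") as log:
        log.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), sample))
        log.flush()
        # fsync so the sample is on disk even if the machine freezes next.
        os.fsync(log.fileno())


def run(log_path, interval_seconds, count):
    """Take `count` samples, sleeping `interval_seconds` between them."""
    for _ in range(count):
        log_heartbeat(log_path)
        time.sleep(interval_seconds)
```

Something like run("heartbeat.log", 5, 10000) could be started from a text console before launching X and reproducing the problem.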

Comment 10 Donald Cohen 2008-12-16 23:41:05 UTC

I have found a reliable way to crash my machine.  
I run clisp, built for 64 bit machines, and simply start allocating memory. 
It may be important that it's in 64 bit mode and that it writes (initializes) the memory it allocates.  It may also be important that I have an ATI Radeon HD 3200, or even that the display is 1680 x 1050 pixels.
The program seems to work, but in combination with X Windows it causes problems.  If I run it inside X Windows, I see problems when it has allocated about 1.4 GB.  If I run it outside X Windows I can allocate much more, but then X Windows cannot start, even after I exit the program.
I hope this gives someone a better idea of what the underlying problem is.
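
For anyone who wants to try something similar without clisp, here is a rough stand-in. It is not the program used above, only a minimal sketch of the same idea: allocate memory in chunks and write to every page so the kernel has to back it with real memory. The 4096-byte page size and the function name are assumptions.

```python
def allocate_and_touch(chunk_mib, max_chunks, page_size=4096):
    """Allocate up to max_chunks buffers of chunk_mib MiB, writing to every page."""
    chunks = []
    chunk_bytes = chunk_mib * 1024 * 1024
    for index in range(max_chunks):
        buf = bytearray(chunk_bytes)
        # Write one byte per page so the allocation is actually backed by memory.
        for offset in range(0, chunk_bytes, page_size):
            buf[offset] = 1
        # Keep a reference so the buffer is not freed while the test runs.
        chunks.append(buf)
        print("allocated %d MiB" % ((index + 1) * chunk_mib), flush=True)
    return sum(len(chunk) for chunk in chunks)
```

For example, allocate_and_touch(64, 32) works up to 2 GiB in 64 MiB steps; the last line printed before trouble starts gives a rough figure to compare with the 1.4 GB mentioned above.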

Comment 11 Donald Cohen 2009-01-14 02:38:11 UTC

I've now returned that machine (replaced by a different model) so there seems no prospect of determining whether this bug, if it is indeed a software bug, is fixed.  And I suspect a hardware problem, even though hp doesn't know how to test for it.  So I suggest closing this bug.

Comment 12 Christopher D. Stover 2009-01-14 04:18:47 UTC

I hope the new machine works out for you, Donald.
