178339 – kernel crash reliably triggered by pm-suspend

Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Assignee:	Dave Jones
QA Contact:	Brian Brock

Reported:	2006-01-19 16:11 UTC by Andy Burns
Modified:	2015-01-04 22:24 UTC
CC List:	2 users
Last Closed:	2006-01-28 00:07:49 UTC

Description Andy Burns 2006-01-19 16:11:50 UTC

Description of problem:

Running pm-suspend triggers a kernel oops; pm-suspend itself dies with a
segmentation fault.

Version-Release number of selected component (if applicable):

kernel 2.6.15-1.1860_FC5
This is new behaviour with this version.

How reproducible:

100% for me

Steps to Reproduce:
1. service syslog stop (probably unnecessary)
2. pm-suspend
  
Actual results:

# pm-suspend
Freezing cpus ...
int3: 0000 [1] SMP
last sysfs file: /power/state
CPU 1
Modules linked in: ipv6 ppdev autofs4 rfcomm l2cap bluetooth sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
iptable_filter ip_tables x_tables raid0 video button battery ac lp parport_pc
parport nvram ohci1394 ieee1394 uhci_hcd ehci_hcd saa7134 video_buf
compat_ioctl32 v4l2_common v4l1_compat ir_kbd_i2c ir_common videodev e100 mii
snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 i2c_core snd_timer snd
hw_random soundcore snd_page_alloc dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd
raid1 ahci libata sd_mod scsi_mod
Pid: 2925, comm: pm-suspend Not tainted 2.6.15-1.1860_FC5 #1
RIP: 0010:[<ffffffff8055644b>] <ffffffff8055644b>{pageset_cpuup_callback+1}
RSP: 0018:ffff81003e117db0  EFLAGS: 00000282
RAX: 0000000000000001 RBX: ffffffff803c74e0 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffffffff803c74e0
RBP: 0000000000000001 R08: ffffffff8053aa68 R09: 0000000000000004
R10: 0000000000000002 R11: 0000000000000004 R12: 0000000000000005
R13: 0000000000000003 R14: 0000000000000003 R15: ffff81003e117f50
FS:  00002b6beb43dd30(0000) GS:ffff81003fe16268(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b6bee7da000 CR3: 0000000031d96000 CR4: 00000000000006e0
Process pm-suspend (pid: 2925, threadinfo ffff81003e116000, task ffff8100339f6080)
Stack: ffffffff803403ae 0000000000000001 0000000000000001 0000000000000003
       ffffffff8014b7bf ffff81003e117e38 ffffffff801465fd 0000000000000296
       0000000000000296 0000000000000000
Call Trace: <ffffffff803403ae>{notifier_call_chain+28}
       <ffffffff8014b7bf>{cpu_down+96} <ffffffff801465fd>{remove_wait_queue+17}
       <ffffffff80254435>{vt_waitactive+150} <ffffffff8015359f>{disable_nonboot_cpus+82}
       <ffffffff801505e1>{enter_state+161} <ffffffff801507e1>{state_store+113}
       <ffffffff801be79b>{sysfs_write_file+201} <ffffffff80180af0>{vfs_write+206}
       <ffffffff801810a2>{sys_write+69} <ffffffff8010ab34>{tracesys+209}

Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
RIP <ffffffff8055644b>{pageset_cpuup_callback+1} RSP <ffff81003e117db0>
 Segmentation fault



Additional info:

The machine is not dead at this point, but the monitor on the console is in DPMS
sleep and won't wake up. Trying a second pm-suspend ties up the console (waiting
for a lock?) but still doesn't kill the machine entirely ...
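
For reference, the Code: line in the trace above can be checked mechanically.
0xcc is the one-byte x86 INT3 (breakpoint) opcode, which matches the "int3:"
trap reported at pageset_cpuup_callback+1. The sketch below is only an
illustration, assuming a POSIX shell; the helper name all_int3 is hypothetical
and not part of any tool mentioned in this report. It confirms that the bytes
dumped around RIP are all breakpoint instructions rather than ordinary function
code. Why the function's text looks like that is not established by the trace
itself.

```shell
# Code: line copied from the oops above.
code_line='Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc'

# all_int3 (hypothetical helper): succeed only if the line holds at least
# one opcode byte and every byte is "cc" (the x86 INT3 opcode).
all_int3() {
    bytes=${1#Code:}          # drop the "Code:" prefix
    seen=0
    for b in $bytes; do       # unquoted on purpose: split on whitespace
        [ "$b" = "cc" ] || return 1
        seen=1
    done
    [ "$seen" -eq 1 ]
}

if all_int3 "$code_line"; then
    echo "every byte is 0xcc (int3)"
else
    echo "code bytes are not all int3"
fi
```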

Comment 1 Andy Burns 2006-01-19 16:34:40 UTC

Just to be clear, this didn't happen in 2.6.15-1.1859_FC5 but does in
2.6.15-1.1860_FC5.

Comment 2 Andy Burns 2006-01-19 19:18:51 UTC

Also present in 2.6.15-1.1861_FC5.

Comment 3 Andy Burns 2006-01-27 12:52:21 UTC

I've been trying most rawhide kernels; the crash still exists in
2.6.15-1.1872_FC5.

# init 3
INIT: Switching to runlevel: 3
INIT: Sending processes the TERM signal
Starting readahead_early:  Starting background readahead: [  OK  ]
Starting irqbalance:  [  OK  ]
Starting lm_sensors:  [  OK  ]

# modprobe -r button

# service syslog stop
Shutting down kernel logger: [  OK  ]
Shutting down system logger: [  OK  ]

# pm-suspend
Freezing cpus ...
int3: 0000 [1] SMP
last sysfs file: /power/state
CPU 0
Modules linked in: radeon drm ipv6 ppdev autofs4 rfcomm l2cap sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
iptable_filter ip_tables x_tables video battery ac lp parport_pc parport nvram
hci_usb bluetooth ehci_hcd ohci1394 ieee1394 uhci_hcd snd_hda_intel saa7134
snd_hda_codec video_buf snd_seq_dummy compat_ioctl32 v4l2_common v4l1_compat
snd_seq_oss snd_seq_midi_event ir_kbd_i2c snd_seq e100 snd_seq_device ir_common
snd_pcm_oss snd_mixer_oss mii videodev snd_pcm snd_timer snd i2c_i801 hw_random
soundcore i2c_core snd_page_alloc dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd
ahci libata sd_mod scsi_mod
Pid: 3373, comm: pm-suspend Not tainted 2.6.15-1.1872_FC5 #1
RIP: 0010:[<ffffffff80558435>] <ffffffff80558435>{pageset_cpuup_callback+1}
RSP: 0018:ffff81002802fdb0  EFLAGS: 00000286
RAX: 0000000000000001 RBX: ffffffff803c8560 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffffffff803c8560
RBP: 0000000000000001 R08: ffffffff8053cae8 R09: 0000000000000004
R10: 0000000000000002 R11: 0000000000000004 R12: 0000000000000005
R13: 0000000000000003 R14: 0000000000000003 R15: ffff81002802ff50
FS:  00002aee15c8cd30(0000) GS:ffffffff8051a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aee19023000 CR3: 00000000292a1000 CR4: 00000000000006e0
Process pm-suspend (pid: 3373, threadinfo ffff81002802e000, task ffff810026b38040)
Stack: ffffffff80341296 0000000000000001 0000000000000001 0000000000000003
       ffffffff8014b803 ffff81002802fe38 ffffffff80146641 0000000000000296
       0000000000000296 0000000000000000
Call Trace: <ffffffff80341296>{notifier_call_chain+28}
       <ffffffff8014b803>{cpu_down+96} <ffffffff80146641>{remove_wait_queue+17}
       <ffffffff80255149>{vt_waitactive+150} <ffffffff801535e3>{disable_nonboot_cpus+82}
       <ffffffff80150625>{enter_state+161} <ffffffff80150825>{state_store+113}
       <ffffffff801bf5c3>{sysfs_write_file+201} <ffffffff80180d38>{vfs_write+206}
       <ffffffff801812ea>{sys_write+69} <ffffffff8010a906>{system_call+126}

Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
RIP <ffffffff80558435>{pageset_cpuup_callback+1} RSP <ffff81002802fdb0>
 Segmentation fault
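
Both traces list the same functions, apart from the final syscall entry point
(tracesys in the first, system_call in the second). As a rough aid for comparing
traces like these, the sketch below strips the addresses and offsets from pasted
trace text and prints one function name per line. It assumes a POSIX shell with
tr and sed; trace_symbols is a hypothetical helper name.

```shell
# trace_symbols (hypothetical helper): read oops text on stdin and print
# the symbol name from every "<address>{symbol+offset}" entry, one per line.
trace_symbols() {
    tr ' ' '\n' | sed -n 's/.*{\([^+}]*\)+.*}/\1/p'
}

# Example with the first two Call Trace lines from the 1872 kernel.
trace_symbols <<'EOF'
Call Trace: <ffffffff80341296>{notifier_call_chain+28}
       <ffffffff8014b803>{cpu_down+96} <ffffffff80146641>{remove_wait_queue+17}
EOF
```

Saving the output for each kernel to a file and running diff on the two files
shows at a glance whether the crash path has changed between builds.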

Comment 4 Dave Jones 2006-01-27 20:40:28 UTC

Just checked in a fix for this; it'll show up at
http://people.redhat.com/davej/kernels/Fedora/devel in an hour or so.

Comment 5 Andy Burns 2006-01-27 23:51:02 UTC

Ah, so "sexing up" the title got your attention. I was beginning to think you'd
gone on the "jolly" to NZ after all ;-)

I have a davej.repo file that is normally disabled, but which I use with

yum update --enablerepo davej

in situations like this. It has worked in the past, but tonight it refused; it
looks like your repodata was updated at the same time as the .RPMs. I tried
cleaning the metadata but no joy, so in the end I installed with

rpm -i http://people.redhat.com/davej/kernels/Fedora/devel/RPMS.kernel/kernel-2.6.15-1.1881_FC5.x86_64.rpm

and all went ok.

I can now do a pm-suspend (without switching to runlevel 3, stopping syslog or
doing rmmod button like I've needed in the past). From a serial console I see:

Freezing cpus ...
Breaking affinity for irq 4
Breaking affinity for irq 14
Breaking affinity for irq 66
CPU 1 is now offline
migration_cost=9
CPU1 is down
Stopping tasks: ===========================================================|

The monitor goes into DPMS sleep and the machine powers off :-)

The machine wakes up in response to a key press on the PS/2 keyboard, but the
monitor stays in DPMS sleep, ethernet doesn't reply to pings, the keyboard LEDs
are *NOT* flashing (which they used to do), and I get a big splurge of rubbish
on the COM1: port, which is not intelligible at either 9600 or 115200 baud.

Should COM1: be reset to the same speed it had when suspended? Is it likely to
be a kernel panic/oops that is being sent there?

I can't seem to find the "tricks" about vbetool and acpi=bios_mode3 or whatever
it is that I was going to try to get video back ...

P.S. I am running a newer BIOS flash than when testing a couple of weeks ago,
which claims to have improved S3 resuming.

Getting there ...

Comment 6 Dave Jones 2006-01-30 01:45:40 UTC

The serial splurge is an odd one. Hmm.
From memory, I think the 8250/serial layer lacks suspend/resume hooks to reinit
the device, which could explain this.  Although... during the kernel boot
/before/ we do the resume, the 8250 should have been set up. So unless you
changed some params after initial boot, but before suspend, I'm puzzled.

The fact that we're dumping anything at all is also a little disturbing.
I'm concerned that it's another oops.

acpi_sleep=s3_bios is probably the boot command line option you were trying to
remember?  There are some handy hints in Documentation/power/video.txt in the
kernel source (or the kernel-doc rpm).
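
To confirm which acpi_sleep value the running kernel was actually booted with,
the command line in /proc/cmdline can be inspected. A minimal sketch, assuming
a POSIX shell with tr and sed; acpi_sleep_opts is a hypothetical helper name,
and s3_bios is simply the value suggested above:

```shell
# acpi_sleep_opts (hypothetical helper): print the value of any acpi_sleep=
# option found in the kernel command line passed as the first argument.
acpi_sleep_opts() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^acpi_sleep=//p'
}

# Check the running kernel, if /proc/cmdline is available.
if [ -r /proc/cmdline ]; then
    acpi_sleep_opts "$(cat /proc/cmdline)"
fi
```

No output means the option was not passed at boot, for example because it was
added to the wrong kernel entry in the bootloader configuration.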

Comment 7 Andy Burns 2006-01-30 09:37:55 UTC

Given that the initial problem is fixed, I suppose I ought to open a new bug for
the resume?

I am setting console=ttyS0,115200 on the command line so, as you say, I would
expect it to be reset by the kernel before resuming. Is it worth trying any/all
speeds? Are there any command line options to force a different driver that
treats the UART as a "less ancient" flavour with better suspend/resume?

Previously the PS/2 LEDs flashed, which I thought was a panic indicator. Since
I'm not getting the flashing now, does that mean much? I'll dig through the docs
and play with the command line settings ...
