User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1 NOTICE: The kernel in question was tainted by NVIDIA module. End of notice. Report: I ran suspend (as usual) with this command, from tty1: # echo disk > /sys/power/state This seems to have triggered a WARNING, although I did not notice it at the time: Aug 17 23:34:35 localhost acpid: client has disconnected Aug 17 23:34:35 localhost kernel: PM: Syncing filesystems ... done. Aug 17 23:34:35 localhost acpid: client has disconnected Aug 17 23:34:35 localhost kernel: Freezing user space processes ... (elapsed 0.00 seconds) done. Aug 17 23:34:35 localhost kernel: Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done. Aug 17 23:34:35 localhost kernel: PM: Shrinking memory... done (66905 pages freed) Aug 17 23:34:35 localhost kernel: PM: Freed 267620 kbytes in 1.56 seconds (171.55 MB/s) Aug 17 23:34:35 localhost kernel: Suspending console(s) Aug 17 23:34:35 localhost kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:07:00.1 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:06:00.0 disabled Aug 17 23:34:35 localhost kernel: NVRM: RmPowerManagement: 3 Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1f.2 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1f.1 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1d.7 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1d.2 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1d.1 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1d.0 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1b.0 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1a.7 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1a.1 disabled Aug 17 23:34:35 localhost kernel: ACPI: PCI interrupt for device 0000:00:1a.0 disabled Aug 17 23:34:35 localhost kernel: ACPI: Preparing to enter system sleep state S4 Aug 17 23:34:35 localhost kernel: Disabling non-boot CPUs ... Aug 17 23:34:35 localhost kernel: NOHZ: local_softirq_pending 100 Aug 17 23:34:35 localhost kernel: CPU 1 is now offline Aug 17 23:34:35 localhost kernel: SMP alternatives: switching to UP code Aug 17 23:34:35 localhost kernel: CPU1 is down Aug 17 23:34:35 localhost kernel: ------------[ cut here ]------------ Aug 17 23:34:35 localhost kernel: WARNING: at arch/x86/kernel/smp_32.c:565 native_smp_call_function_mask+0 Aug 17 23:34:35 localhost kernel: Modules linked in: oprofile osst mtdchar usbserial ftl ircomm_tty ircomm Aug 17 23:34:35 localhost kernel: snd joydev i2c_i801 soundcore i2c_core ssb iTCO_wdt iTCO_vendor_support Aug 17 23:34:35 localhost kernel: Pid: 10027, comm: bash Tainted: P 2.6.25.11-97.fc9.i686 #1 Aug 17 23:34:35 localhost kernel: [<c0426bc5>] warn_on_slowpath+0x47/0x73 Aug 17 23:34:35 localhost kernel: [<c04f3862>] ? number+0x106/0x1c0 Aug 17 23:34:35 localhost kernel: [<c062d4f3>] ? kprobe_flush_task+0x56/0x90 Aug 17 23:34:35 localhost kernel: [<c0423121>] ? hrtick_set+0x80/0xe5 Aug 17 23:34:35 localhost kernel: [<c06298f2>] ? schedule+0x6a9/0x6db Aug 17 23:34:35 localhost kernel: [<c04f4104>] ? vsnprintf+0x3e7/0x420 Aug 17 23:34:35 localhost kernel: [<c041c1af>] ? __wake_up_locked+0x11/0x15 Aug 17 23:34:35 localhost kernel: [<c04133b5>] native_smp_call_function_mask+0x32/0xdc Aug 17 23:34:35 localhost kernel: [<f8e9186a>] ? nmi_cpu_stop+0x0/0x1b [oprofile] Aug 17 23:34:35 localhost kernel: [<f8e9186a>] ? nmi_cpu_stop+0x0/0x1b [oprofile] Aug 17 23:34:35 localhost kernel: [<f8e9186a>] ? nmi_cpu_stop+0x0/0x1b [oprofile] Aug 17 23:34:35 localhost kernel: [<c0413b78>] smp_call_function+0x18/0x1d Aug 17 23:34:35 localhost kernel: [<c042acbe>] on_each_cpu+0x12/0x36 Aug 17 23:34:35 localhost kernel: [<f8e91911>] nmi_suspend+0x1c/0x21 [oprofile] Aug 17 23:34:35 localhost kernel: [<c0560ad5>] sysdev_suspend+0x67/0x175 Aug 17 23:34:35 localhost kernel: [<c0564cdb>] device_power_down+0xa9/0xb1 Aug 17 23:34:35 localhost kernel: [<c04485b0>] hibernation_snapshot+0xc7/0x176 Aug 17 23:34:35 localhost kernel: [<c04486fc>] hibernate+0x9d/0x169 Aug 17 23:34:35 localhost kernel: [<c04473b3>] ? state_store+0x0/0xa8 Aug 17 23:34:35 localhost kernel: [<c0447409>] state_store+0x56/0xa8 Aug 17 23:34:35 localhost kernel: [<c04473b3>] ? state_store+0x0/0xa8 Aug 17 23:34:35 localhost kernel: [<c04f05b1>] kobj_attr_store+0x1a/0x22 Aug 17 23:34:35 localhost kernel: [<c04bb19a>] sysfs_write_file+0xb4/0xdf Aug 17 23:34:35 localhost kernel: [<c04bb0e6>] ? sysfs_write_file+0x0/0xdf Aug 17 23:34:35 localhost kernel: [<c048272f>] vfs_write+0x8a/0x12e Aug 17 23:34:35 localhost kernel: [<c048286c>] sys_write+0x3b/0x60 Aug 17 23:34:35 localhost kernel: [<c0405bf2>] syscall_call+0x7/0xb Aug 17 23:34:35 localhost kernel: [<c0620000>] ? agp_amd64_probe+0x134/0x3ee Aug 17 23:34:35 localhost kernel: ======================= Aug 17 23:34:35 localhost kernel: ---[ end trace 94272564199f17c5 ]--- Aug 17 23:34:35 localhost kernel: PM: Creating hibernation image: Aug 17 23:34:35 localhost kernel: PM: Need to copy 127181 pages Aug 17 23:34:35 localhost kernel: Intel machine check architecture supported. Aug 17 23:34:35 localhost kernel: Intel machine check reporting enabled on CPU#0. Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: Enabling non-boot CPUs ... Aug 17 23:34:35 localhost kernel: SMP alternatives: switching to SMP code Aug 17 23:34:35 localhost kernel: Booting processor 1/1 ip 4000 Aug 17 23:34:35 localhost kernel: CPU 1 irqstacks, hard=c07b9000 soft=c0799000 Aug 17 23:34:35 localhost kernel: Initializing CPU#1 Aug 17 23:34:35 localhost kernel: Calibrating delay using timer specific routine.. 2926.04 BogoMIPS (lpj=1 Aug 17 23:34:35 localhost kernel: CPU: L1 I cache: 32K, L1 D cache: 32K Aug 17 23:34:35 localhost kernel: CPU: L2 cache: 1024K Aug 17 23:34:35 localhost kernel: CPU: Physical Processor ID: 0 Aug 17 23:34:35 localhost kernel: CPU: Processor Core ID: 1 Aug 17 23:34:35 localhost kernel: Intel machine check architecture supported. Aug 17 23:34:35 localhost kernel: Intel machine check reporting enabled on CPU#1. Aug 17 23:34:35 localhost kernel: CPU1: Intel(R) Pentium(R) Dual CPU T2310 @ 1.46GHz stepping 0d Aug 17 23:34:35 localhost kernel: CPU1 is up Aug 17 23:34:35 localhost kernel: ACPI: EC: acpi_ec_wait timeout, status = 8, expect_event = 1 Aug 17 23:34:35 localhost kernel: ACPI: EC: read timeout, command = 130 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 16 (level, low) -> IRQ 16 Aug 17 23:34:35 localhost kernel: usb usb3: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 21 (level, low) -> IRQ 21 Aug 17 23:34:35 localhost kernel: usb usb4: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1a.7[C] -> GSI 18 (level, low) -> IRQ 18 Aug 17 23:34:35 localhost kernel: usb usb1: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ehci_hcd 0000:00:1a.7: debug port 1 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 22 (level, low) -> IRQ 22 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1d.0[A] -> GSI 23 (level, low) -> IRQ 23 Aug 17 23:34:35 localhost kernel: usb usb5: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1d.1[B] -> GSI 19 (level, low) -> IRQ 19 Aug 17 23:34:35 localhost kernel: usb usb6: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1d.2[C] -> GSI 18 (level, low) -> IRQ 18 Aug 17 23:34:35 localhost kernel: usb usb7: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 23 (level, low) -> IRQ 23 Aug 17 23:34:35 localhost kernel: usb usb2: root hub lost power or was reset Aug 17 23:34:35 localhost kernel: ehci_hcd 0000:00:1d.7: debug port 1 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19 Aug 17 23:34:35 localhost kernel: NVRM: RmPowerManagement: 5 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: ata2: SATA link down (SStatus 0 SControl 300) Aug 17 23:34:35 localhost kernel: ata4.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out Aug 17 23:34:35 localhost kernel: ata4.00: ACPI cmd ef/03:42:00:00:00:a0 filtered out Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Aug 17 23:34:35 localhost kernel: ata1.00: configured for UDMA/100 Aug 17 23:34:35 localhost kernel: sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors (160042 MB) Aug 17 23:34:35 localhost kernel: sd 0:0:0:0: [sda] Write Protect is off Aug 17 23:34:35 localhost kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't sup Aug 17 23:34:35 localhost kernel: ata4.00: configured for UDMA/33 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 19 (level, low) -> IRQ 19 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: ACPI: PCI Interrupt 0000:07:00.1[B] -> GSI 22 (level, low) -> IRQ 22 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: sd 0:0:0:0: [sda] Starting disk Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: b43-phy0: Loading firmware version 410.2160 (2007-05-26 15:32:10) Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: Restarting tasks ... WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: usb 2-4: USB disconnect, address 4 Aug 17 23:34:35 localhost kernel: done. Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: usb 2-4: new high speed USB device using ehci_hcd and address 5 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: usb 2-4: configuration #1 chosen from 1 choice Aug 17 23:34:35 localhost kernel: uvcvideo: Found UVC 1.00 device Acer CrystalEye webcam (064e:a101) Aug 17 23:34:35 localhost kernel: input: Acer CrystalEye webcam as /devices/pci0000:00/0000:00:1d.7/usb2/2 Aug 17 23:34:35 localhost kernel: usb 2-4: New USB device found, idVendor=064e, idProduct=a101 Aug 17 23:34:35 localhost kernel: usb 2-4: New USB device strings: Mfr=2, Product=1, SerialNumber=3 Aug 17 23:34:35 localhost kernel: usb 2-4: Product: Acer CrystalEye webcam Aug 17 23:34:35 localhost kernel: usb 2-4: Manufacturer: SuYin Aug 17 23:34:35 localhost kernel: usb 2-4: SerialNumber: CN0314-OV03-VA-R02.00.00 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:35 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:36 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:36 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:36 localhost kernel: WQ on CPU0, prefer CPU1 Aug 17 23:34:36 localhost kernel: WQ on CPU0, prefer CPU1 [...] So it resumed all right, with the exception of the "WQ on CPU0, prefer CPU1" message which hit my log a total of 373801 times (!) in about 12 hours, during which the machine was mostly fully operational (I did not notice that anything was wrong). Then I inserted a CD, but it wouldn't mount. Again, I found this in the log: Aug 18 17:31:39 localhost kernel: sr 3:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK Aug 18 17:31:39 localhost kernel: sr 3:0:0:0: [sr0] Sense Key : Medium Error [current] Aug 18 17:31:39 localhost kernel: Info fld=0x46ac0 Aug 18 17:31:39 localhost kernel: sr 3:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error Aug 18 17:31:39 localhost kernel: end_request: I/O error, dev sr0, sector 1157888 Aug 18 17:31:39 localhost kernel: Buffer I/O error on device sr0, logical block 289472 ... Aug 18 17:31:46 localhost kernel: WQ on CPU0, prefer CPU1 Aug 18 17:31:46 localhost kernel: sr 3:0:0:0: [sr0] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK Aug 18 17:31:46 localhost kernel: sr 3:0:0:0: [sr0] Sense Key : Medium Error [current] Aug 18 17:31:46 localhost kernel: Info fld=0x46ac0 Aug 18 17:31:46 localhost kernel: sr 3:0:0:0: [sr0] Add. Sense: L-EC uncorrectable error Aug 18 17:31:46 localhost kernel: end_request: I/O error, dev sr0, sector 1157888 Aug 18 17:31:46 localhost kernel: Buffer I/O error on device sr0, logical block 289472 Aug 18 17:31:46 localhost kernel: WQ on CPU0, prefer CPU1 When trying to mount the disk, # mount /dev/cdrom /mnt I would get some strange error about stale NFS. Running "strace" on mount showed that open("/etc/mtab") failed with -ESTALE. So I rebooted. When the machine came back, it started fsck, and then I had to do a "manual" fsck, which found a lot of problems, including orphaned blocks, blocks owned by multiple inodes, wrong refcounts, wrong filetypes (/etc/mtab was "file type 1 instead of 6", etc.) Luckily, after this, the machine booted all right again. I had been using oprofile to profile a userspace program. The fact that oprofile shows up in the WARNING backtrace seems significant to me. Maybe oprofile + suspend is the culprit? Note that I've been using suspend/resume with this kernel (2.6.25.11-97.fc9.i686 + NVIDIA module) without problems, so I am inclined to believe that this is NOT the problem. Moreover, I always run suspend from the console to avoid switching modes on suspend/resume, since this has been problematic in the past. Reproducible: Didn't try Steps to Reproduce: (This is guesswork, but I am reluctant to try to reproduce the failure because of the disk corruption): 1. Run oprofile on userspace program (no kernel profiling) 2. Suspend 3. Resume
I don't know if this is related to the corruption that I saw, but I managed to get this oprofile warning on the latest -git kernel, at this commit (+ one unrelated build fix): commit b09c3e3f1710b554348c98e78fbf4a661918779a Author: Huang Weiyi <weiyi.huang> Date: Sun Aug 17 07:51:05 2008 +0800 BUG: using smp_processor_id() in preemptible [00000000] code: oprofiled/4133 APIC error on CPU1: 00(40) APIC error on CPU0: 00(40) APIC error on CPU1: 40(40) APIC error on CPU0: 40(40) caller is get_stagger+0x9/0x30 Pid: 4133, comm: oprofiled Not tainted 2.6.27-rc3-00415-g122c9e0 #9 [<c037228e>] debug_smp_processor_id+0xce/0xd0 [<c0583d19>] get_stagger+0x9/0x30 [<c05843de>] p4_fill_in_addresses+0x1e/0x3a0 [<c058313d>] nmi_setup+0xcd/0x190 [<c05811fa>] oprofile_setup+0x3a/0xc0 [<c05820a6>] event_buffer_open+0x56/0x80 [<c01a22c4>] __dentry_open+0xf4/0x1f0 [<c01a2407>] nameidata_to_filp+0x47/0x60 [<c0582050>] ? event_buffer_open+0x0/0x80 [<c01ae1bc>] do_filp_open+0x18c/0x6d0 [<c015626d>] ? put_lock_stats+0xd/0x30 [<c01b883c>] ? alloc_fd+0xdc/0x100 [<c0371bb6>] ? _raw_spin_unlock+0x46/0x80 [<c066af57>] ? _spin_unlock+0x27/0x50 [<c01a208b>] do_sys_open+0x4b/0xe0 [<c0362da4>] ? trace_hardirqs_on_thunk+0xc/0x10 [<c01a2189>] sys_open+0x29/0x40 [<c01040db>] sysenter_do_call+0x12/0x3f ======================= I will also report this directly to the kernel people, however. Steps to reproduce: 1. oprofile --reset 2. oprofile --setup --no-vmlinux --image=/bin/bash 3. oprofile --start
The warning in smp_32.c is: WARN_ON(irqs_disabled());
drivers/oprofile/cpu_buffer.c: printk("WQ on CPU%d, prefer CPU%d\n",
Hi, I started a thread on LKML (http://lkml.org/lkml/2008/8/18/459), which is related, but not exactly the same. Andi Kleen posted a patch in response, which is now in mainline -git (it fixes one of the warnings in oprofile code that appeared during the suspend phase). I've been testing the oprofile + suspend a couple of more times, but the disk corruption was not seen. This makes me think it was the fault of the bad CD. I might try it again after doing a backup. Thanks for looking into it! Vegard
kernel-2.6.26.3-27.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.26.3-27.fc9
kernel-2.6.26.3-29.fc9 has been submitted as an update for Fedora 9. http://admin.fedoraproject.org/updates/kernel-2.6.26.3-29.fc9
kernel-2.6.26.3-29.fc9 has been pushed to the Fedora 9 stable repository. If problems still persist, please make note of it in this bug report.