Bug 817227 - Random system crash
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance

Reported: 2012-04-28 02:49 EDT by George R. Goffe
Modified: 2012-11-19 09:20 EST (History)
Last Closed: 2012-11-14 15:28:15 EST

Attachments
tar.gz with /var/log/messages and uname -a output. (96.03 KB, application/x-gzip)
2012-05-27 14:33 EDT, George R. Goffe
tar.gz with /var/log/messages and uname -a output and other cmd output. (662.50 KB, application/x-gzip)
2012-05-29 22:06 EDT, George R. Goffe
jpeg screenshot of first system-config-kdump popup. (167.26 KB, image/jpeg)
2012-06-09 00:33 EDT, George R. Goffe
/var/log/messages of hdd disappearing (1.84 KB, application/x-gzip)
2012-06-14 05:13 EDT, George R. Goffe
output of lspci and lsusb commands (9.09 KB, application/x-gzip)
2012-06-15 13:23 EDT, George R. Goffe
gzip'd copy of /var/log/messages during start of kdump (2.35 KB, application/x-gzip)
2012-06-27 00:02 EDT, George R. Goffe

Description George R. Goffe 2012-04-28 02:49:20 EDT
Description of problem:

I'm experiencing random system crashes.

System is up to date as of 04/27/2012. System just powered down... again.

Version-Release number of selected component (if applicable):


How reproducible:

Unknown.

Steps to Reproduce:
1. Unknown

Actual results:

System just crashes without warning and powers off.

Expected results:

Certainly not this behavior.

Additional info:

I have experienced this frequently in the past few weeks but have thought/hoped that a new kernel would solve the problem. It has not. I tend to run "yum update" but not immediately reboot. Could this be the cause of the problem?

I have /var/log/messages* available; dmesg and lspci output available. 

I'm willing to set traps and/or accept any advice to help shoot this problem.

kernel version == 3.3.2-6.fc16.x86_64.debug #1 SMP Sat Apr 21 12:20:13 UTC 2012 x86_64 GNU/Linux

Here's the most current few entries from the last command.

     11 reboot   system boot  3.3.2-6.fc16.x86 Fri Apr 27 23:33 - 23:40  (00:06)    
     12 root     tty1                          Fri Apr 27 23:37 - crash  (00:-3)    
     13 reboot   system boot  3.3.2-6.fc16.x86 Fri Apr 27 23:31 - 23:36  (00:05)    
     14 root     tty1                          Fri Apr 27 23:35 - crash  (00:-3)    
     15 root     pts/9        :0               Fri Apr 27 23:09 - crash  (00:21)    
     16 root     pts/6        :0               Fri Apr 27 23:09 - crash  (00:21)    
     17 root     pts/5        :0               Fri Apr 27 23:09 - crash  (00:21)    
     18 root     pts/2        :0               Fri Apr 27 23:09 - crash  (00:21)    
     19 root     pts/8        :0               Fri Apr 27 23:09 - crash  (00:21)    
     20 root     pts/7        :0               Fri Apr 27 23:09 - crash  (00:21)
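
A minimal sketch of how the `last` output above could be summarised, assuming the usual layout in which a session killed by a crash ends in "- crash" and each boot is recorded as a "reboot" line. The helper name summarize_last is hypothetical, and it expects plain `last` output on stdin (without the extra leading numbers shown above):

```shell
# summarize_last (hypothetical helper): reads `last` output on stdin and
# counts sessions that ended in a crash and the number of boot records.
summarize_last() {
    awk '
        / - crash /    { crashes++ }   # session was cut short by a crash
        $1 == "reboot" { boots++ }     # one record per system boot
        END { printf "crashes=%d boots=%d\n", crashes, boots }
    '
}

# Run against the live login records only if `last` is installed.
if command -v last >/dev/null 2>&1; then
    last 2>/dev/null | summarize_last
fi
```

Comparing the two counts over a few days gives a rough idea of how many boots followed an unclean shutdown.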
Comment 1 Dave Jones 2012-05-14 16:08:40 EDT
Is there any stack trace in the messages file that might give a clue as to what happened?
Comment 2 George R. Goffe 2012-05-15 16:05:46 EDT
Dave,

I saw no stack traces or other "strange" messages. I may have missed some messages due to not being sure what to look for. I haven't seen the crash since I made this bug report. If you'd like me to do something special, please let me know. I typically leave this system up. I am willing to run a debug kernel (runs slowly by the way) and look for debug info if that will help.

I'm continually updating my system's software so problems sometimes come and go. I'm on 3.3.5-2.fc16.x86_64 (non debug) but was using 3.3.4-1*debug. It seems like the problem was appearing after having run a "yum update" command but not yet rebooted.

By the way, I'm experimenting with ZFS (quite an elegant FS I think) built from their source code (zfsonlinux.org). Their build process has problems though. Their facility for building from a debug kernel but aiming at a non-debug kernel is broken. Among other things, they look for a debug rpm. Current Fedora gives both debug and non-debug kernels in one rpm I believe.

Maybe I'm too eager to help the "cause"?

Please let me know if you want me to do something about this crash or other things.

Regards,

George...
Comment 3 George R. Goffe 2012-05-27 14:33:19 EDT
Created attachment 587107 [details]
tar.gz with /var/log/messages and uname -a output.

This problem happened again early this morning:

reboot   system boot  3.3.6-3.fc16.x86 Sun May 27 10:41 - 10:57  (00:16)
reboot   system boot  3.3.6-3.fc16.x86 Sun May 27 10:37 - 10:45  (00:07)
root     tty1                          Sun May 27 10:42 - crash  (00:-5)
root     pts/25       :0               Sat May 26 17:07 - 02:32  (09:24)
root     pts/26       :0               Sat May 26 15:52 - crash  (18:45)
root     pts/25       :0               Sat May 26 15:47 - 16:04  (00:16)
root     pts/24       :0               Sat May 26 14:54 - crash  (19:43)
Comment 4 George R. Goffe 2012-05-27 21:19:57 EDT
Yet another crash.

Is anyone looking at these? I could sure use a little help please.

reboot   system boot  3.3.7-1.fc16.x86 Sun May 27 16:48 - 16:57  (00:08)    
root     tty1                          Sun May 27 16:53 - crash  (00:-5)    

System was unattended, nothing particular running... just userspace file editing and curl execution.
Comment 5 Dave Jones 2012-05-29 10:03:19 EDT
nothing too unusual in the kernel logs.
Can you try running the kernel-debug build for a while. That might give additional log messages which could give some clues.
Comment 6 George R. Goffe 2012-05-29 12:31:14 EDT
Dave,

I'm back to running the debug kernel now. I'll attach any logs I get when/if this problem re-appears. So far it's stable now... We'll see.

I did post more info to this bug (https://bugzilla.redhat.com/show_bug.cgi?id=808795). These problems might all have the same root cause since they seem to be related to USB issues.

Regards,

George...

# uptime
 09:19am  up 1 day 15:58,  15 users,  load average: 1.50, 1.58, 1.57
# uname -a
Linux joker.sleazegate.com 3.3.7-1.fc16.x86_64.debug #1 SMP Tue May 22 13:52:13 UTC 2012 x86_64 GNU/Linux
Comment 7 George R. Goffe 2012-05-29 22:06:14 EDT
Created attachment 587575 [details]
tar.gz with /var/log/messages and uname -a output and other cmd output.

Dave,

I was sitting at the console just a few minutes ago when it went dark (it's day time here...). The system had crashed AND powered itself off. No recursive boot retries. Is this due to the debug kernel? There were no messages about this crash at all... ANYWHERE. I powered the system back up and noticed that the KDE Display attributes for the monitor were not as they had been (1920x1200 was missing). I rebooted immediately and right at the end of all the messages from reboot there was a trace displayed. It's not in /var/log/messages or anywhere else that I can tell. 

I have a tar.gz file containing all the files you might want to look at that I can think of. Initially, this latest crash did NOT appear in the last command output. After the reboot (KDE Display settings related) to try to fix the display problem, the crash DOES appear. Throughout all this booting I did NOT make any system changes.

Again, I was editing some files and had several Firefox nightly instances running.

George...
Comment 8 George R. Goffe 2012-06-02 18:38:37 EDT
Anyone,

I have had two of these crashes today, within about an hour... :(

I seem to be able to readily re-create this problem as it appears to be load related. The busier the disk the sooner the failure.

Does anyone have any suggestions? Is there a trap to get kernel dumps?

I'm still collecting doc for each failure but all the doc appears to be the same.

Thanks,

George...
Comment 9 George R. Goffe 2012-06-02 23:59:50 EDT
What can I say. It's definitely re-creatable... This is #3 for today.

George...


last
root     pts/10       :0               Sat Jun  2 20:51   still logged in   
root     pts/9        :0               Sat Jun  2 20:50   still logged in   
root     pts/6        :0               Sat Jun  2 20:50   still logged in   
root     pts/5        :0               Sat Jun  2 20:50   still logged in   
root     pts/2        :0               Sat Jun  2 20:50   still logged in   
root     pts/8        :0               Sat Jun  2 20:50   still logged in   
root     pts/7        :0               Sat Jun  2 20:50   still logged in   
root     pts/1        :0               Sat Jun  2 20:50   still logged in   
root     pts/4        :0               Sat Jun  2 20:50   still logged in   
root     pts/3        :0               Sat Jun  2 20:50   still logged in   
root     pts/0        :0               Sat Jun  2 20:50   still logged in                                                                                              
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 20:45 - 20:58  (00:13)                                                                                               
root     tty1                          Sat Jun  2 20:49 - crash  (00:-4)                                                                                               
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 20:42 - 20:49  (00:06)                                                                                               
root     tty1                          Sat Jun  2 20:43 - crash  (00:00)                                                                                               
root     pts/11       :0               Sat Jun  2 20:41 - crash  (00:01)                                                                                               
root     pts/10                        Sat Jun  2 19:24 - crash  (01:18)                                                                                               
root     pts/9        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/6        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/5        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/2        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/8        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/7        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/1        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/4        :0               Sat Jun  2 15:49 - crash  (04:53)                                                                                               
root     pts/3        :0               Sat Jun  2 15:48 - crash  (04:53)                                                                                               
root     pts/0        :0               Sat Jun  2 15:48 - crash  (04:53)                                                                                               
root     tty1                          Sat Jun  2 15:48 - 20:43  (04:54)                                                                                               
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 15:43 - 20:49  (05:05)                                                                                               
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 15:41 - 15:47  (00:05)                                                                                               
root     tty1                          Sat Jun  2 15:42 - crash  (00:00)                                                                                               
root     pts/10       :0               Sat Jun  2 15:29 - crash  (00:12)                                                                                               
root     pts/9        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/6        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/5        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/2        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/8        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/7        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/1        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/4        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/3        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
root     pts/0        :0               Sat Jun  2 15:28 - crash  (00:13)                                                                                               
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 15:22 - 15:47  (00:24)                                                                                               
root     tty1                          Sat Jun  2 15:27 - crash  (00:-4)                                                                                               
root     pts/0                         Sat Jun  2 15:24 - down   (00:00)                                                                                               
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 15:18 - 15:24  (00:05)                                                                                               
root     tty1                          Sat Jun  2 15:23 - crash  (00:-5)                                                                                               
root     pts/9        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/6        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/5        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/2        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/8        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/7        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/1        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/4        :0               Sat Jun  2 14:47 - crash  (00:31)                                                                                               
root     pts/3        :0               Sat Jun  2 14:47 - crash  (00:31)    
root     pts/0        :0               Sat Jun  2 14:47 - crash  (00:31)    
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 14:42 - 15:24  (00:42)    
root     tty1                          Sat Jun  2 14:46 - crash  (00:-4)    
reboot   system boot  3.3.7-1.fc16.x86 Sat Jun  2 14:37 - 14:46  (00:08)    
root     tty1                          Sat Jun  2 14:43 - crash  (00:-5)    
root     pts/18       :0               Sat Jun  2 12:29 - crash  (02:07)    
root     pts/17       :0               Sat Jun  2 12:27 - crash  (02:09)    
root     pts/16       :0               Sat Jun  2 12:15 - crash  (02:22)    
root     pts/15       :0               Sat Jun  2 01:19 - crash  (13:18)
Comment 10 Lukáš Czerner 2012-06-07 02:31:36 EDT
Hi George,

Getting a crash report is really important to actually see what the issue is. However, the fact that the machine simply powers off makes me think that this is actually a hardware-related problem, since we do not power off on a crash, but we can reboot if configured properly.

What does the command

sysctl kernel.panic

tell you? If it is zero, it should not even attempt to reboot the machine.
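
A minimal sketch of how the value is usually read, assuming the documented sysctl semantics (0 means no automatic reboot, a positive value is a delay in seconds, a negative value means reboot at once). The helper name describe_panic_timeout is hypothetical:

```shell
# describe_panic_timeout (hypothetical helper): explains a kernel.panic value.
describe_panic_timeout() {
    case "$1" in
        0)  echo "no automatic reboot after a panic" ;;
        -*) echo "immediate reboot after a panic" ;;
        *)  echo "reboot $1 seconds after a panic" ;;
    esac
}

# The same value that `sysctl kernel.panic` prints is exposed under /proc.
if [ -r /proc/sys/kernel/panic ]; then
    describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
fi
```

None of these settings powers the machine off, which is why a silent power-off points away from the panic path.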

Anyway, if you have the chance to test the same setup on a different machine, this would certainly help, since right now it really looks like a hardware issue.

Also, I've noticed that you're using a USB 3 device, so for testing purposes could you stop using this device and see if both problems still occur? (This might be related to your Bug 823190.)

Thanks!
-Lukas
Comment 11 George R. Goffe 2012-06-07 18:11:11 EDT
Lukas,

The kernel.panic value is currently zero, I'm running a non debug kernel at the moment.

This laptop is less than a year old.

I've set /etc/sysconfig/sysctl with this value: ENABLE_SYSRQ="yes"

I'm also experimenting with lkcd but their latest source doesn't compile on this system which is the only running system I have at the moment.

I'll switch from the usb 3 port and attempt to re-create the problem.

George...
Comment 12 George R. Goffe 2012-06-08 01:00:42 EDT
Lukas,

Can you point me to a current document on how to capture linux/Fedora core files and do analysis please?

Regards,

George...
Comment 13 Lukáš Czerner 2012-06-08 02:23:42 EDT
George,

there are plenty of pointers on the web you can use. For example

http://docs.fedoraproject.org/en-US/Fedora/14/html/Deployment_Guide/ch-kdump.html

If you have kdump configured properly, all your kernel core dumps should be in /var/crash. However, the problem is that your machine actually powers off, so I really doubt that the core has been captured at all.

If you cannot catch a core dump after you experience the crash again, even with kdump properly configured, then try to reproduce this on a different machine with the same setup. It still very much looks like a hw issue.

-Lukas
Comment 14 George R. Goffe 2012-06-09 00:33:06 EDT
Created attachment 590545 [details]
jpeg screenshot of first system-config-kdump popup.

Lukas,

I'm attempting to configure kdump by running system-config-kdump. I get a couple of popups, one of which says "Don't know how to configure your boot loader." The other appears in this attachment. I also get the enclosed console messages. Sigh... 

It appears broken. Should I make a bug report?

Thanks,

George...

(system-config-kdump.py:23087): Gtk-WARNING **: GtkSpinButton: setting an adjustment with non-zero page size is deprecated
/usr/share/system-config-kdump/system-config-kdump.py:341: GtkWarning: IA__gtk_radio_button_set_group: assertion `!g_slist_find (group, radio_button)' failed
  self.xml = gtk.glade.XML ("/usr/share/system-config-kdump/system-config-kdump.glade", domain=DOMAIN)
Comment 15 Lukáš Czerner 2012-06-12 05:37:53 EDT
George,

I am sorry for the delay. I missed the information in Comment 11 where you say that kernel.panic is zero. So the machine should not even attempt to reboot on kernel panic. And given that the machine just powers off, I think that there is nothing we can gain by configuring kdump. You'll never get to the kdump because the machine just powers off.

So I definitely think this is a HW issue. Have you tried to reproduce this problem on a different machine with the same setup? Is it reproducible even when not using a USB3 device?

Thanks!
-Lukas
Comment 16 George R. Goffe 2012-06-12 15:40:13 EDT
Lukas,

I have been running on a different USB port since your suggestion. NO PROBLEMS!

I have no other Linux capable systems here.

I was planning on switching back to the USB3 port with all the devices (disk and keyboard/mouse). I could just as well switch the keyboard/mouse OR the disk to USB3 depending on your preference.

Regards,

George...
Comment 17 George R. Goffe 2012-06-13 02:56:57 EDT
Lukas,

I made the switch to USB2 about 6 days ago and experienced my first instance of the disk "disappearing". This is where the disk switches from /dev/sdb1 to /dev/sdc1. This breaks anything using /dev/sdb1 including journaling. The system did NOT power itself off as when USB3 was used AND the drive was made busy.

George...
Comment 18 George R. Goffe 2012-06-13 04:13:17 EDT
Lukas,

I forgot to say when this latest disappearance happened. It was a few hours ago... NOTHING special going on.

George...
Comment 19 George R. Goffe 2012-06-14 05:13:01 EDT
Created attachment 591781 [details]
/var/log/messages of hdd disappearing

Lukas,

The hdd disappeared again, on USB2... Could this laptop be thermaling out? I would not want to have it sitting on my lap as it's a bit warm.

I looked in the BIOS for any logs but didn't see any.

Your thoughts?

George...
Comment 20 Don Zickus 2012-06-14 11:47:15 EDT
Hi George,

If this is a USB problem, can you attach the output of lspci -vvv and lsusb -v?  I need to see what USB host you are using.

Though a usb driver shouldn't be able to power down a machine unless it was configured to do so on panic somehow.  Kdump still might work as it takes a different path during panic.  Unfortunately, I do not think kdump is all that stable in F16.

For kdump, forget the gui tool.  The only thing you need to do to get kdump working is edit '/etc/grub.conf' and add 'crashkernel=128M' at the end of the kernel line.  For example:



title Red Hat Enterprise Linux Workstation (3.5.0-rc1)
        root (hd0,0)
        kernel /vmlinuz-3.5.0-rc1 ro root=/dev/mapper/vg_bluefish-lv_root rd_LVM_LV=vg_bluefish/lv_swap rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_bluefish/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM crashkernel=128M
                     ^^^^^^^^^^^^^^^^
        initrd /initramfs-3.5.0-rc1.img


and then reboot to pick up the new parameter.  If you have 'kexec-tools' installed, you should be able to 'service kdump start'.  If that fails, you may have to attach the output of 'sh -x /etc/init.d/kdump start' so I can see why.
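
After the reboot, a rough way to confirm that the parameter took effect, sketched under the assumption that the running kernel exposes /proc/cmdline and /sys/kernel/kexec_crash_loaded. The helper name has_crashkernel is hypothetical:

```shell
# has_crashkernel (hypothetical helper): succeeds if the kernel command line
# passed as a single string contains a crashkernel= reservation.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *)                 return 1 ;;
    esac
}

if [ -r /proc/cmdline ]; then
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is on the running kernel's command line"
    else
        echo "crashkernel= is missing from the running kernel's command line"
    fi
fi

# 1 means a crash (kdump) kernel is loaded, 0 means it is not.
if [ -r /sys/kernel/kexec_crash_loaded ]; then
    echo "kexec_crash_loaded=$(cat /sys/kernel/kexec_crash_loaded)"
fi
```

If the parameter is present but kexec_crash_loaded reads 0, the memory was reserved but the kdump service did not manage to load the capture kernel.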

The problem with attaching the /var/log/messages as you have been doing (and using a laptop) is that you really do not capture any panic messages that get displayed which is really the critical thing we need.

If kdump doesn't work, we might be able to use pstore if your laptop has UEFI support in the BIOS (which it may not).

Cheers,
Don
Comment 21 George R. Goffe 2012-06-15 13:23:01 EDT
Created attachment 592189 [details]
output of lspci and lsusb commands

Don,

Thank you for looking at this bug and your suggestions.

lsusb/lspci output attached.

I had already set the crashkernel=128M parameter manually since system-config-kdump failed, bug fixed now. As far as I know, F16 uses grub2 exclusively? Anyway, the file is /etc/grub2.cfg. Here's the two entries:


menuentry 'Fedora (3.3.8-1.fc16.x86_64.debug)' --class fedora --class gnu-linux --class gnu --class os {
        savedefault
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        set root='(hd0,msdos1)'
        search --no-floppy --fs-uuid --set=root 8f706844-54f5-447c-b70a-622cdcc4b018
        echo 'Loading Fedora (3.3.8-1.fc16.x86_64.debug)'
        linux   /vmlinuz-3.3.8-1.fc16.x86_64.debug root=/dev/sda2 ro rhgb nomodeset   rd.plymouth=0 plymouth.enable=0 crashkernel=128m SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us
        echo 'Loading initial ramdisk ...'
        initrd /initramfs-3.3.8-1.fc16.x86_64.debug.img
}
menuentry 'Fedora (3.3.8-1.fc16.x86_64)' --class fedora --class gnu-linux --class gnu --class os {
        savedefault
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        set root='(hd0,msdos1)'
        search --no-floppy --fs-uuid --set=root 8f706844-54f5-447c-b70a-622cdcc4b018
        echo 'Loading Fedora (3.3.8-1.fc16.x86_64)'
        linux   /vmlinuz-3.3.8-1.fc16.x86_64 root=/dev/sda2 ro rhgb nomodeset   rd.plymouth=0 plymouth.enable=0 crashkernel=128m SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us
        echo 'Loading initial ramdisk ...'
        initrd /initramfs-3.3.8-1.fc16.x86_64.img
}

When I ran "service kdump start" the system "thought" for a while and then returned to the prompt with NO messages. "service kdump status" reports:

service kdump status
Redirecting to /bin/systemctl  status kdump.service
kdump.service - Crash recovery kernel arming
          Loaded: loaded (/lib/systemd/system/kdump.service; static)
          Active: active (exited) since Fri, 15 Jun 2012 09:55:19 -0700; 7min ago
         Process: 14014 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
          CGroup: name=systemd:/system/kdump.service


The request for "sh -x /etc/init.d/kdump start" fails because there's no file to "sh". Trying "sh -x /usr/bin/kdumpctl start" says it's already started so I'm guessing kdump started.

George...
Comment 22 Don Zickus 2012-06-15 14:15:55 EDT
Hi George,

So it looks promising that kdump is up and running.  We can test it by 

#echo c > /proc/sysrq-trigger

This will cause your machine to immediately crash, start the kdump kernel, save the vmcore into /var/crash/ on your disk and reboot.

Once the laptop has rebooted, you can verify kdump worked by checking in /var/crash for a directory that has a timestamp within a few minutes of the current time and a vmcore file inside.  If so, kdump is configured correctly and works.
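
That check can be sketched as below. The helper name find_recent_vmcore and the 10-minute window are assumptions for illustration; run it only after the machine has come back up:

```shell
# find_recent_vmcore (hypothetical helper)
#   $1 = crash directory (default /var/crash)
#   $2 = maximum age in minutes (default 10)
# Prints any vmcore file modified within that window.
find_recent_vmcore() {
    dir="${1:-/var/crash}"
    max_age="${2:-10}"
    [ -d "$dir" ] || return 1
    # -H follows the directory itself if it is a symbolic link.
    find -H "$dir" -type f -name 'vmcore*' -mmin "-$max_age" -print
}

if found="$(find_recent_vmcore /var/crash 10)" && [ -n "$found" ]; then
    echo "kdump wrote: $found"
else
    echo "no recent vmcore under /var/crash"
fi
```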

After that you can try and cause the machine to do its weird behaviour again.  Hopefully this time it reboots instead of powering off.  And hopefully it drops a vmcore into the /var/crash area.  If so, we might get a clue about what is going on.

I can tell you the next steps if all the above is successful (as it involves getting kernel-debuginfo, kernel-debuginfo-common and crash).

=======
Looking through your attachments, it seems you have a Fresco Logic USB3 controller.  This is known to be flaky.  There was a fix for flaky ISO behaviour on USB headsets; that behaviour may cause similar problems with a hard drive too.  Regardless, I was going to recommend updating to 3.3.8-1 to pick it up.

However, based on your comment above it seems like you already did that.  You might be able to migrate your stuff to USB3 again and it might work now.

That does not address your hard drive disappearing on USB2.  That is strange.  It is almost as if USB is getting a false 'remove' event while you are using the drive (causing /dev/sdb1 to stick around) and then immediately getting an 'insert' event, which causes /dev/sdc1 to show up because /dev/sdb1 is still in use.

Let's deal with one problem at a time here.  See if you can set up kdump and migrate back to USB3 and see if those problems are resolved.  If so, we can move back to the usb2 issue.

Cheers,
Don
Comment 23 George R. Goffe 2012-06-16 17:29:25 EDT
Don,

I just tried twice to make a dump but none appeared in /var/crash which is a link to another mounted (hopefully) file system.

service kdump does not start during boot if that matters.

echo c > /proc/sysrq-trigger did produce a 57 line stack trace but did not write anything to /var/crash which is a link to another mounted file system.

Your thoughts please?

Regards,

George...
Comment 24 Don Zickus 2012-06-18 09:55:27 EDT
Hi George,

Hmm, maybe kdump is more broken than I thought on F16.  I am surprised that /var/crash is a link.  Where does it point to and what filesystem is that mounted on? (ls -l in /var and df -h can probably give those answers).
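
One way to collect both answers at once, as a sketch that assumes GNU readlink is available. The helper name crash_dir_target is hypothetical:

```shell
# crash_dir_target (hypothetical helper): prints the real directory that a
# possibly symlinked crash directory resolves to.
crash_dir_target() {
    readlink -f "${1:-/var/crash}"
}

if [ -e /var/crash ]; then
    ls -ld /var/crash                       # shows the link itself
    df -h "$(crash_dir_target /var/crash)"  # filesystem behind the link
fi
```

If the target lives on a filesystem that the capture kernel does not mount, the dump has nowhere to go even when kdump itself starts.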

service kdump should start.  Though I was told it might be kdumpctl in F16.  Looking through kexec-tools package tree, it seems like it was fully converted to systemd.  I am not sure the leftover files were cleaned up properly, which might explain why 'service kdump' does not start.

The 57-line stack trace seems right.  But it should boot a second kernel and show signs of dumping a vmcore somewhere.  Unless the stack trace shows something else is broken.  You don't have an easy way to capture that, do you?  Say a camera phone or something?

Cheers,
Don
Comment 25 George R. Goffe 2012-06-19 01:08:07 EDT
Don,

My /var has 581M left in it. I pointed it to my last partition (/export/home) due to possible space problems. All my major file systems are in separate partitions (/ /boot /var /opt /usr /export/home (Solaris remnant)).

About the stack trace: alas, no camera or phone. If I had another system I'd try to point the console through the network to the other system. Maybe I can borrow one...

How long should it take for the second kernel to boot?

Starting kdump gives:

joker bash-4.2 /var# service kdump start
Redirecting to /bin/systemctl  start kdump.service

rpm -q kexec-tools gives: kexec-tools-2.0.2-29.fc16.x86_64

George...
Comment 26 Don Zickus 2012-06-26 14:05:44 EDT
Hi George,

Sorry for the neglect.  Our kexec guys say f16 isn't supported.  They have been using rawhide (f18) packages to run kdump on f16.  Boooo..

You can grab a src rpm here:

http://kojipkgs.fedoraproject.org//packages/kexec-tools/2.0.3/50.fc18/src/kexec-tools-2.0.3-50.fc18.src.rpm

rpm -ivh <src.rpm>

and then build it (make sure you have the latest dracut)

cd ~/rpmbuild; rpmbuild -ba SPECS/kexec-tools.spec

Though you might need some dependencies to install. :-(

I don't have an easy way to throw F16 on a machine over here otherwise I would try and send you something.

I know this sucks.  All this work to (hopefully) get a useful stack dump..

Cheers,
Don
Comment 27 George R. Goffe 2012-06-26 23:41:46 EDT
Don,

I am just happy to get some attention for this bug. I'm building now...

I'll post whatever I get.

The build required: 

zlib-static elfutils-devel-static

George...
Comment 28 George R. Goffe 2012-06-26 23:53:18 EDT
Don,

Also needed dracut-network.

I had built the latest gcc, and this rpmbuild process failed when I used that compiler. I went back to the distributed version and it worked great.

I have rpms in the RPMS directory and am running rpm -ivh kexec*

George...
Comment 29 George R. Goffe 2012-06-27 00:02:49 EDT
Created attachment 594652 [details]
gzip'd copy of /var/log/messages during start of kdump

Don,

I've included some /var/log/messages relating to starting kdump.

Do I have the right versions of dracut parts?

George...

rpm -qa | grep dracut
dracut-013-22.fc16.noarch
dracut-network-013-22.fc16.noarch
Comment 30 Don Zickus 2012-06-27 09:43:37 EDT
Hi George,

koji is showing me that you do not have the latest dracut stuff.  Not sure if it wasn't released or stuck in updates-testing.

http://koji.fedoraproject.org/koji/buildinfo?buildID=322276

The kdump maintainers suggested using the latest dracut stuff to resolve those errors you see.  Let me know if that helps.

Thanks for your patience.

Cheers,
Don
Comment 31 George R. Goffe 2012-06-29 01:26:00 EDT
Don,

The good news is that I'm able to get kernel dumps now.

The bad news is that I'm getting more failures now but no crashes or power off's without msgs. Here's a sample of the latest from /var/log/messages. Notice the "unhandled error" message? Is this after the problem has happened? I was just copying files to this drive. It's on usb2, right?

I don't know what to do. Sigh. Any thoughts? 

Additionally, I have looked at /var/log messages and see that the date/time between last line before a boot/crash is usually greater than the actual time of the boot crash. Does this matter? It looks like my system clock is goofy?

Regards,

George...

00025044 Jun 28 21:10:45 joker pulseaudio[2050]: alsa-sink.c: We were woken up with POLLOUT set -- however a subsequent snd_pcm_avail() returned 0 or another value < min_avail.
00025045 Jun 28 21:13:30 joker kernel: [  305.198216] kworker/u:0 used greatest stack depth: 1960 bytes left
00025046 Jun 28 22:04:53 joker kernel: [ 3384.642897] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
00025047 Jun 28 22:06:58 joker kernel: [ 3510.327919] usb 2-1.4.4: reset high-speed USB device number 6 using ehci_hcd
00025048 Jun 28 22:07:07 joker kernel: [ 3519.123077] sd 6:0:0:0: Device offlined - not ready after error recovery
00025049 Jun 28 22:07:07 joker kernel: [ 3519.123104] sd 6:0:0:0: [sdb] Unhandled error code <<<<<<<<<<<<<<<<<<<<<<<<<<?????
00025050 Jun 28 22:07:07 joker kernel: [ 3519.123107] sd 6:0:0:0: [sdb]  Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
00025051 Jun 28 22:07:07 joker kernel: [ 3519.123111] sd 6:0:0:0: [sdb] CDB: Read(10): 28 00 24 c0 00 3f 00 00 08 00
00025052 Jun 28 22:07:07 joker kernel: [ 3519.123124] end_request: I/O error, dev sdb, sector 616562751
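
A side note on the excerpt above: the CDB bytes on the Read(10) line encode the starting block and the transfer length, so they can be checked against the sector in the end_request line. Below is a minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, big-endian 4-byte LBA in bytes 2-5, big-endian 2-byte block count in bytes 7-8). The function name decode_read10 is made up for illustration and is not part of any tool mentioned in this bug.

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    # CDB copied from the kernel message in this comment.
    print(decode_read10("28 00 24 c0 00 3f 00 00 08 00"))  # (616562751, 8)
```

For the CDB logged above this gives LBA 616562751 and a length of 8 blocks, which is the same sector the kernel reports in the I/O error line.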

# this one looks reasonable

 Jun 24 03:39:02 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="979" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
 Jun 25 00:08:25 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="979" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 25 01:17:29 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 25 01:17:29 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1053" x-info="http://www.rsyslog.com"] start
 Jun 25 11:03:06 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1053" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 25 06:37:15 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 25 06:37:15 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1057" x-info="http://www.rsyslog.com"] start
 Jun 25 15:50:17 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1057" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 25 11:05:24 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 25 11:05:24 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1040" x-info="http://www.rsyslog.com"] start
 Jun 25 20:11:38 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1040" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 25 11:13:55 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 25 11:13:55 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1047" x-info="http://www.rsyslog.com"] start
 Jun 26 12:11:03 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1047" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 26 03:40:45 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 26 03:40:45 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1072" x-info="http://www.rsyslog.com"] start
 Jun 26 12:59:57 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1072" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 26 04:07:47 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one looks reasonable

 Jun 26 04:07:47 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1053" x-info="http://www.rsyslog.com"] start
 Jun 27 17:19:02 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 27 17:19:02 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="883" x-info="http://www.rsyslog.com"] start
 Jun 28 02:22:29 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="883" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 27 17:20:35 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one looks reasonable

 Jun 27 17:20:35 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1173" x-info="http://www.rsyslog.com"] start
 Jun 27 17:24:28 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 27 17:24:28 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1058" x-info="http://www.rsyslog.com"] start
 Jun 28 02:31:22 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1058" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 27 17:30:13 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 27 17:30:13 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="958" x-info="http://www.rsyslog.com"] start
 Jun 28 13:22:30 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="958" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 28 04:20:59 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.

# this one is goofy

 Jun 28 04:20:59 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1063" x-info="http://www.rsyslog.com"] start
 Jun 28 13:56:32 joker rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1063" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jun 28 12:06:11 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.
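
The "goofy" entries above are the places where a log line carries an earlier timestamp than the line before it. A rough Python sketch for finding such spots automatically is below. It assumes the traditional syslog prefix "Mon DD HH:MM:SS" with English month names, and it assumes all lines are from one year, since syslog does not record it. The function names are hypothetical.

```python
from datetime import datetime


def parse_syslog_time(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def backward_jumps(lines, year=2012):
    """Return (previous_line, line, timedelta) wherever time goes backwards."""
    jumps = []
    previous = None  # (timestamp, line) of the last line that parsed
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and annotations
        try:
            current = parse_syslog_time(line, year)
        except ValueError:
            continue  # not a syslog-style line
        if previous is not None and current < previous[0]:
            jumps.append((previous[1], line, previous[0] - current))
        previous = (current, line)
    return jumps


if __name__ == "__main__":
    # Lines abridged from the listing in this comment.
    sample = [
        "Jun 25 01:17:29 joker rsyslogd: start",
        "Jun 25 11:03:06 joker rsyslogd: exiting on signal 15.",
        "Jun 25 06:37:15 joker kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    ]
    for before, after, delta in backward_jumps(sample):
        print(f"clock went back by {delta}:\n  {before}\n  {after}")
```

Run over the whole of /var/log/messages, the size of each backward step would show whether the jumps share a common offset or vary from boot to boot.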
Comment 32 Don Zickus 2012-06-29 11:50:42 EDT
(In reply to comment #31)
> Don,
> 
> The good news is that I'm able to get kernel dumps now.

Hi George,

Great to hear kdump is working now.  But I am confused by your other statements; perhaps you can help clarify them for me...

> 
> The bad news is that I'm getting more failures now but no crashes or power
> off's without msgs. Here's a sample of the latest from /var/log/messages.

What do you mean by 'without msgs'?  Are you still getting crashes/power offs with msgs?

> Notice the "unhandled error" message? Is this after the problem has
> happened? I was just copying files to this drive. It's on usb2, right?

Whatever sdb is.  It seems like the drive disappeared unexpectedly and the kernel is trying to clean up the mess.  That is not unusual in these circumstances, and the kernel should recover properly, so I am not worried about the messages themselves but about the fact that the drive disappeared suddenly.

> 
> I don't know what to do. Sigh. Any thoughts? 
> 
> Additionally, I have looked at /var/log messages and see that the date/time
> between last line before a boot/crash is usually greater than the actual
> time of the boot crash. Does this matter? It looks like my system clock is
> goofy?

What does 'last line before a boot/crash' mean?  Are you getting crashes and if so where/when?

Can you attach the whole /var/log/messages?  Are you getting vmcores in /var/crash now that correspond with the timestamps below that you consider 'goofy'?
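
To line the dumps up against the log timestamps, something like the following Python sketch could list each vmcore with its modification time. It assumes the dumps are stored under /var/crash and that the dump file names start with "vmcore"; the function name find_vmcores is only illustrative.

```python
import os
from datetime import datetime


def find_vmcores(root="/var/crash"):
    """Return a sorted list of (modification_time, path) for vmcore files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("vmcore"):
                path = os.path.join(dirpath, name)
                mtime = datetime.fromtimestamp(os.path.getmtime(path))
                found.append((mtime, path))
    return sorted(found)


if __name__ == "__main__":
    for mtime, path in find_vmcores():
        print(mtime.strftime("%b %d %H:%M:%S"), path)
```

The times are printed in the same "Mon DD HH:MM:SS" form that /var/log/messages uses, so they can be compared by eye. Note that the file times come from the same system clock, so they are only as reliable as that clock.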

Thanks,
Don
Comment 33 George R. Goffe 2012-06-29 16:22:24 EDT
Don,

The system clock in this machine seems to be having trouble. I updated the clock every 15 seconds and displayed the time. It appears to be wandering. I was only updating the clock at boot time. I think this is the source of my clock problem.
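
A rough way to record this kind of check is sketched below in Python. It samples the wall clock alongside the monotonic clock and reports how far the wall clock has moved relative to the monotonic one since the first sample. This is only a sketch under assumptions: both clocks come from the same kernel timekeeping, so it shows steps applied to the wall clock (for example by a manual or NTP-driven reset) rather than drift of the hardware itself, which would need an external reference. The function names and the default of five samples are made up for illustration.

```python
import time


def clock_offsets(samples):
    """Given (wall, monotonic) pairs, return wall-clock offset vs. the first sample."""
    wall0, mono0 = samples[0]
    return [(wall - wall0) - (mono - mono0) for wall, mono in samples]


def sample_clocks(count=5, interval=15.0):
    """Collect `count` (wall, monotonic) readings, `interval` seconds apart."""
    samples = []
    for i in range(count):
        samples.append((time.time(), time.monotonic()))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Calling clock_offsets(sample_clocks()) returns one offset in seconds per sample. Values that stay near zero mean the wall clock was not stepped during the run, while a sudden change marks the point where it was reset.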

Could this clock problem be related to this disk problem?

George...
Comment 34 George R. Goffe 2012-07-05 20:09:51 EDT
Don,

I thought I would try a different distribution and picked Centos as my victim. I used the netinstall cd and used the "rescue" option. This mounted all my file systems in a chroot environment. I made several external drives busy as I have for this bug but NONE of them failed. NONE! Hmmmm...

I was expecting no changes in the behavior but this has me thinking that it's definitely a Fedora problem. What are your thoughts?

Regards,

George...
Comment 35 Don Zickus 2012-07-06 09:16:36 EDT
(In reply to comment #34)
> Don,
> 
> I thought I would try a different distribution and picked Centos as my
> victim. I used the netinstall cd and used the "rescue" option. This mounted
> all my file systems in a chroot environment. I made several external drives
> busy as I have for this bug but NONE of them failed. NONE! Hmmmm...
> 
> I was expecting no changes in the behavior but this has me thinking that
> it's definitely a Fedora problem. What are your thoughts?
> 
> Regards,
> 
> George...

Hi George,

I think you are running into too many problems. :-)  Can you go back and answer my questions from comment #32?  I am trying to avoid diving into too many of your problems and to focus on the initial one.

As for the clock issue, again I would need the output of dmesg and /var/log/messages to see what is going on.  If you disable NTP does the clock problem go away?

Cheers,
Don
Comment 36 nimbus9 2012-09-09 16:44:36 EDT
I just wanted to add that I am affected by this bug as well and experience seemingly random shutdowns that occur for the most part when I am actively using the computer such as browsing the internet. 

The only unusual symptom is that I have a Logitech webcam that works but dirties up the logs with error messages when plugged in (for example, "ALSA sound/usb/clock.c:243 current rate 2817 is different from the runtime rate 48000"), yet shows no other sign of anything being wrong. The crashes occur whether or not that device is plugged in, and the logs are quiet when it is physically disconnected from the computer, which I tend to do when it is not in use.

I hope this isn't more information than needed but here are some outputs that might be relevant:

http://pastebin.com/mbQwzT54
Comment 37 Dave Jones 2012-10-23 11:41:17 EDT
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
Comment 38 Justin M. Forbes 2012-11-14 15:28:15 EST
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.
Comment 39 Vadym 2012-11-19 09:20:33 EST
Hello. Please reopen this bug; I am seeing it too. For some time now my server has been crashing. First the remote console freezes, then the server reboots itself.
Hardware: Supermicro X8DTL, 24 GB RAM, 2 SATA HDDs (Samsung), 2 Xeon 5500 CPUs
OS: CentOS 6.3, kernel 2.6.32-279.14.1.el6.x86_64
cPanel
What I checked:
Compiled a new kernel without installing it.
Created a partition in memory, put the kernel source on it, and compiled with as many threads as there are CPU cores.
When the build reached the security part, the server crashed, every time.
Replaced ALL the hardware and got the same error.
Without any load, the server crashes about once every 24 hours, at random.
