Bug 285721

Summary: tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
Product: [Fedora] Fedora
Component: kernel
Version: 7
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: medium
Reporter: Matěj Cepl <mcepl>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: chris.brown, mcepl, peterm
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Fixed In Version: f8
Doc Type: Bug Fix
Last Closed: 2008-01-14 06:27:05 UTC
Bug Blocks: 715452
Attachments:
Description | Flags
output of dmesg command | none
output of dmesg | none
/var/log/messages | none
/var/log/messages | none
output of dmesg | none
lspci -vvvxxxx after suspend | none
output of dmesg | none
lspci -vvvxxxx after fresh reboot | none
output of dmesg after hibernation | none
output of lspci -vvvvxxxx | none
/var/log/messages after resume from hibernation | none

Description Matěj Cepl 2007-09-11 10:06:26 UTC
Description of problem:

When suspending my computer with pm-hibernate --quirk-vbe-post I get this error
message on console (and in dmesg):

tg3: tg3_abort_hw timed out for eth0, TX_MODE_ENABLE will not clear
MAC_TX_MODE=ffffffff

After resume, I have to modprobe -v -r not only my wifi driver (iwl3945) but tg3
as well in order to make wifi work.

How reproducible:
100%

Steps to Reproduce:
1. suspend/resume cycle

Actual results:
error message on console/dmesg and no wireless after resume, have to rmmod both
wifi and Ethernet driver

Expected results:
just works

Additional info:
I have
SUSPEND_MODULES="kvm_intel kvm" in /etc/pm/config.d/unload_modules
(adding iwl3945 there as well didn't help; I will now try with both iwl3945
and tg3).

Comment 1 Matěj Cepl 2007-09-11 10:06:26 UTC
Created attachment 192361 [details]
output of dmesg command

Comment 2 Matěj Cepl 2007-09-11 10:11:37 UTC
Of course, I am not sure, what component this should go in -- kernel, hal?

Comment 3 Phil Knirsch 2007-09-11 12:39:09 UTC
Hm, that sounds more like the tg3 driver has a problem when you hibernate using
this quirk.

Have you tried hibernating without the quirk? Or doesn't the machine hibernate
properly when you leave it out?

But in general, it's a kernel module spitting out an error message, so I rather
think this is a kernel problem; therefore I'm reassigning it to kernel.

Thanks,

Read ya, Phil

Comment 4 Matěj Cepl 2007-09-25 13:47:39 UTC
Yes, I tried hibernating with every possible quirk and with none at all, and it
prints this message every time.

Comment 5 Christopher Brown 2007-10-03 13:19:02 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

From the 2.6.23-rc3 Changelog:

commit 3e0c95fd648c0d3175b9ff2232597d0b02eb7d46
Author: Michael Chan <mchan>
Date:   Fri Aug 3 20:56:54 2007 -0700

    [TG3]: Fix suspend/resume problem.
    
    Joachim Deguara <joachim.deguara> reported that tg3 devices
    would not resume properly if the device was shutdown before the system
    was suspended.  In such scenario where the netif_running state is 0,
    tg3_suspend() would not save the PCI state and so the memory enable bit
    and bus master enable bit would be lost.
    
    We fix this by always saving and restoring the PCI state in
    tg3_suspend() and tg3_resume() regardless of netif_running() state.
    
    Signed-off-by: Michael Chan <mchan>
    Signed-off-by: David S. Miller <davem>

Matej, can you test with a kernel based off this?

Also, could you clear up whether you are suspend/resuming or hibernate/waking?
You mention suspend/resume but then indicate you are running pm-hibernate. Do
you still see this issue with pm-suspend?

Cheers
Chris

Comment 6 Chuck Ebbert 2007-10-03 23:45:54 UTC
Patch queued for next kernel update.

Comment 7 Matěj Cepl 2007-10-04 08:53:57 UTC
(In reply to comment #5)
> Matej, can you test with a kernel based off this?

Is there RPM with the code somewhere -- I don't do Red Hat kernel building all
the time ... :-) Besides, I don't see where is the appropriate patch anyway.

Comment 8 Christopher Brown 2007-10-04 11:55:11 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > Matej, can you test with a kernel based off this?
> 
> Is there RPM with the code somewhere -- I don't do Red Hat kernel building all
> the time ... :-)

Understandable. 

> Besides, I don't see where is the appropriate patch anyway.

You can either run:

# yum update kernel --enablerepo=development --nogpgcheck

which will pull the latest rawhide kernel (should be based on 2.6.23), or let me
know what arch you are running and I'll do a scratch build for you in Koji.

Comment 9 Matěj Cepl 2007-10-04 23:47:07 UTC
Tried kernel-2.6.23-0.217.rc9.git1.fc8 and the results are no good:

a) iwl3945 driver didn't work with NetworkManager-0.6.5-7.fc7 (it did with
kernel-2.6.22.9-91.fc7), so ifup had to be used.
b) the tg3 warning message went away, but after suspend there was no wireless
card whatsoever (I have the computer at home, so I have no chance to actually
test wired Ethernet), and there were some backtraces in dmesg (see attached).

Comment 10 Matěj Cepl 2007-10-04 23:47:32 UTC
Created attachment 216801 [details]
output of dmesg

Comment 11 Matěj Cepl 2007-10-04 23:49:14 UTC
Created attachment 216811 [details]
/var/log/messages

Comment 12 Christopher Brown 2007-10-05 10:58:04 UTC
Matej,

As the original issue with tg3 appears resolved in 2.6.23, I'm changing the
subject to reflect that. It might even be worth filing a new bug for the
wireless but to be honest there is so much work going into NetworkManager and
intel wifi at the moment it will likely get lost in the noise.

I'd suggest your best option would be to test with NetworkManager (again from
development if you can) and see if this helps. In the meantime I'll re-assign to
the wireless team. For brevity here is the backtrace which is related to your
touchpad rather than network driver issues:

=============================================
[ INFO: possible recursive locking detected ]
2.6.23-0.217.rc9.git1.fc8 #1
---------------------------------------------
kseriod/253 is trying to acquire lock:
 (&ps2dev->cmd_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24

but task is already holding lock:
 (&ps2dev->cmd_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24

other info that might help us debug this:
4 locks held by kseriod/253:
 #0:  (serio_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24
 #1:  (&serio->drv_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24
 #2:  (psmouse_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24
 #3:  (&ps2dev->cmd_mutex){--..}, at: [<c0631cb3>] mutex_lock+0x21/0x24

stack backtrace:
 [<c0406463>] show_trace_log_lvl+0x1a/0x2f
 [<c0406e4d>] show_trace+0x12/0x14
 [<c0406e65>] dump_stack+0x16/0x18
 [<c0449c56>] __lock_acquire+0x189/0xc67
 [<c044abae>] lock_acquire+0x7b/0x9e
 [<c0631ac0>] __mutex_lock_slowpath+0x10a/0x2dc
 [<c0631cb3>] mutex_lock+0x21/0x24
 [<c059bc3f>] ps2_command+0x92/0x30e
 [<c05a23c6>] psmouse_sliced_command+0x1c/0x5a
 [<c05a46eb>] synaptics_pt_write+0x21/0x46
 [<c059ba14>] ps2_sendbyte+0x39/0xcb
 [<c059bcbe>] ps2_command+0x111/0x30e
 [<c05a2001>] psmouse_probe+0x1d/0x6c
 [<c05a314d>] psmouse_connect+0xf8/0x20c
 [<c05993e0>] serio_connect_driver+0x1e/0x2e
 [<c0599406>] serio_driver_probe+0x16/0x18
 [<c05767bd>] driver_probe_device+0xf2/0x173
 [<c0576846>] __device_attach+0x8/0xa
 [<c0575b92>] bus_for_each_drv+0x3c/0x67
 [<c05768dc>] device_attach+0x75/0x8a
 [<c059941c>] serio_find_driver+0x14/0x3c
 [<c0599f59>] serio_thread+0x166/0x2b9
 [<c043e7f7>] kthread+0x3b/0x64
 [<c0405ee3>] kernel_thread_helper+0x7/0x10
 =======================

The logs then indicate the device coming back up and dhclient restarting a few
times before sleeping. NetworkManager fares no better later on, it would seem.

Comment 13 John W. Linville 2007-10-05 13:02:44 UTC
FWIW, it is really bad form to simply hijack a bug for another problem, rename 
it, and expect it to be treated as an extension of the same bug.  It isn't 
clear to me that this is the same issue at all, and having synaptics 
backtraces and tg3 stuff in what now claims to be an iwl3945 bug just creates 
confusion...

Comment 14 Matěj Cepl 2007-10-05 13:24:27 UTC
Sure, being a bugmaster, I thought that kernel folks have different mores ;-).
No, seriously, should I file a new bug?

Comment 15 John W. Linville 2007-10-05 13:35:18 UTC
Matej, you weren't the one I was hoping to educate. :-)  Yes, please open a 
new bug.  Restoring the name and assignee of this one (and closing it, if 
appropriate) seems like a good idea too.

Comment 16 Matěj Cepl 2007-10-05 19:56:07 UTC
OK, so let's close this bug as CLOSED/RAWHIDE. On Monday I will upgrade my
computer to F8test3, update that to the latest Rawhide, and file all the bugs
I eventually find, and you will fix them. Is it a deal? :-)

Comment 17 Christopher Brown 2007-10-05 20:40:03 UTC
(In reply to comment #13)
> FWIW, it is really bad form to simply hijack a bug for another problem, rename 
> it, and expect it to be treated as an extension of the same bug.  It isn't 
> clear to me that this is the same issue at all, and having synaptics 
> backtraces and tg3 stuff in what now claims to be an iwl3945 bug just creates 
> confusion...

I'm no hijacker; you must have me confused with someone else. I don't *expect*
it to be treated as an extension, just that the underlying issue may be the
same. However, the initial tg3 errors were resolved, so I felt a change of
subject was appropriate. It's your call whether to ask the reporter to file a
new bug or continue on this one. Please don't accuse me of hijacking; as
indicated, I am attempting to triage kernel bugs.

(In reply to comment #16)
> OK, so let's close this bug as CLOSED/RAWHIDE. On Monday I will upgrade my
> computer to F8test3, update that to the latest Rawhide, and file all the bugs
> I eventually find, and you will fix them. Is it a deal? :-)

Deal.

Cheers
Chris

Comment 18 Matěj Cepl 2007-10-10 20:11:13 UTC
Sorry guys, could I have my original summary back (I believe many people search
by the error they get in logs)? Unfortunately, I also have to reopen this.

I have upgraded to full Rawhide, so I now have kernel
2.6.23-0.224.rc9.git6.fc8 and
NetworkManager-0.7.0-0.3.svn2914.fc8.

Unfortunately, the error message is back.

Comment 19 Matěj Cepl 2007-10-10 20:25:57 UTC
Created attachment 223371 [details]
/var/log/messages

This /var/log/messages covers both a suspend/resume cycle and a reboot.

After suspend/resume cycle, there is no network (neither wireless nor wired). I
will file a different bug about this.

Comment 20 Matěj Cepl 2007-10-10 21:49:21 UTC
Created attachment 223451 [details]
output of dmesg

after restart of the computer (network works)

Comment 21 Andy Gospodarek 2007-10-11 21:00:46 UTC
Matej, Can you attach the before and after suspend output of `lspci -xxxvvv` for
this system when you get the tg3 failure?  Thanks!
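
One possible way to collect what is requested here, as a rough sketch: save a dump before and after the suspend cycle and diff the two. The helper names `capture_lspci` and `compare_lspci` and the `lspci-<label>.txt` file names are made up for illustration; `lspci -xxxvvv` needs root to read the full config space.

```shell
#!/bin/sh
# Hypothetical helper: save a full PCI config-space dump under a label.
# -xxx dumps the whole config space (root only), -vvv is maximum verbosity.
capture_lspci() {
    lspci -xxxvvv > "lspci-$1.txt"
}

# Hypothetical helper: print a unified diff of two saved dumps.
# diff exits 1 when the files differ; treat that as success, keep real errors.
compare_lspci() {
    diff -u "$1" "$2" || [ $? -eq 1 ]
}
```

Typical use would be `capture_lspci before`, then a suspend/resume cycle, then `capture_lspci after` and `compare_lspci lspci-before.txt lspci-after.txt`. Attaching both full dumps, as asked, is still the safer option since the diff hides context.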

Comment 22 Matěj Cepl 2007-10-13 13:22:06 UTC
Created attachment 226381 [details]
lspci -vvvxxxx after suspend

After resuming from suspend to RAM I again got a very nice collection of
crashes, non-functional drivers, etc. When I ran 'modprobe -v -r iwl3945 tg3',
the situation very quickly turned into a working wireless network, even with
NetworkManager (not having wired Ethernet at hand, I could not test the real
functionality of the tg3 driver).
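
The workaround described in this comment could be wrapped roughly as below. This is only a sketch: the function name and the `MODPROBE` override are hypothetical, and the explicit reload step is an assumption (the comment only mentions removing the modules; udev or NetworkManager may load them again on their own).

```shell
#!/bin/sh
# Assumption: MODPROBE may be overridden (e.g. with a stub) for a dry run.
MODPROBE=${MODPROBE:-modprobe}

# Hypothetical helper: remove both drivers, then load them again.
reload_net_modules() {
    # Remove both modules in one call, as in the comment above.
    $MODPROBE -v -r iwl3945 tg3 || return 1
    # Reload explicitly; this step is an assumption, not from the report.
    $MODPROBE -v tg3 && $MODPROBE -v iwl3945
}
```

Run `reload_net_modules` as root after resume.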

Comment 23 Matěj Cepl 2007-10-13 13:23:18 UTC
Created attachment 226391 [details]
output of dmesg

I think there are some parts of this which may be interesting. BTW, I am
currently using the kernel-2.6.23-6.fc8 package.

Comment 24 Matěj Cepl 2007-10-13 13:30:58 UTC
Created attachment 226401 [details]
lspci -vvvxxxx after fresh reboot

Comment 25 Matěj Cepl 2007-10-13 13:31:27 UTC
I think that should be it.

Comment 26 Matěj Cepl 2007-10-14 21:48:40 UTC
Created attachment 226761 [details]
output of dmesg after hibernation

I have tried hibernate (suspend to disk; the previous data were from suspend to
RAM), and after resume the results were as bad as with suspend to RAM.
Actually, I didn't manage to get the network working at all and had to reboot
the computer to get a net connection.

This is the output of dmesg, where I see some interesting backtraces.

Comment 27 Matěj Cepl 2007-10-14 21:49:10 UTC
Created attachment 226771 [details]
output of lspci -vvvvxxxx

Comment 28 Matěj Cepl 2007-10-14 21:52:28 UTC
Created attachment 226781 [details]
/var/log/messages after resume from hibernation

Comment 29 Christopher Brown 2008-01-13 23:30:41 UTC
Hello Matej,

Any improvements with recent kernel updates? There have been plenty of wireless
driver updates that may have resolved this issue for you. You can also try adding:

SUSPEND_MODULES="iwl3945 tg3"

to /etc/pm/config.d/unload_modules

which might help things a bit.
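
A small sketch of setting this up from a shell, in case it helps. The helper name `set_suspend_modules` is made up; the file name and variable are the ones given above. Note that it overwrites the file, so any modules already listed there (kvm_intel and kvm in the original report) would need to be passed again.

```shell
#!/bin/sh
# Hypothetical helper: write the pm-utils config fragment naming the modules
# to unload before suspend. $1 is the config directory, the rest are modules.
set_suspend_modules() {
    conf_dir=$1
    shift
    # "$*" joins the remaining arguments with single spaces.
    printf 'SUSPEND_MODULES="%s"\n' "$*" > "$conf_dir/unload_modules"
}
```

As root, `set_suspend_modules /etc/pm/config.d iwl3945 tg3` would produce the line suggested above.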

Cheers
Chris

Comment 30 Matěj Cepl 2008-01-14 06:27:05 UTC
I cannot find any error messages in any log now. So let's CLOSE this for
now, and I will reopen it if ever needed again.