Bug 386411
Summary: problem with tulip driver on x86_64
Product: Fedora
Component: kernel
Version: 8
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: low
Reporter: Gian Paolo Mureddu <gmureddu>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jlayton, jonstanley, peterm
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2009-01-09 07:26:18 UTC
Description
Gian Paolo Mureddu, 2007-11-16 07:37:50 UTC
Exactly what kernel are you seeing this on? Would it be possible to force a crash on this box and collect a coredump (via kdump)? If so we should be able to look at that and see what the box is doing when it's hung like this...

I knew I was forgetting something... Kernel version is 2.6.23.1-49.fc8. How exactly would I go about doing this coredump with kdump?

This is the only doc I can quickly find. It's for FC6, but F8 should be very similar. http://fedoraproject.org/wiki/FC6KdumpKexecHowTo When the box locks up, you'll need to intentionally crash it; for that, you can use sysrq-c: http://kbase.redhat.com/faq/FAQ_80_5559.shtm An even more helpful thing to do would be to come up with a simple way to reproduce this (i.e. some way to reproduce it from the console that doesn't involve X). BTW: what do you mean by a "locally mounted" SMB share? Are the server and client the same machine? If so, can you reproduce this with the host acting in only one role (server or client)? That might help narrow things down for a reproducer...

By locally mounted I mean what I said about the share being mounted via fstab, to be accessed through a mountpoint like a regular partition. Sorry if I wasn't clear on that. I'll try to reproduce the error when I get back home and can further test this; meanwhile I'll read up on what you linked and try to get something (in all my years of Linux usage, I never knew how to invoke sysrq; guess it's never too late to learn ;) )... To see if this is only in X or not, I'll try to simply run a "du -sh" on the mounted cifs share and see if that can trigger the problem.

Ok, so the server is a different physical machine. It might not hurt to know what sort of server this is...

What do you mean by "what sort of server"? It is a Fedora 5 machine built from spare parts of some of my old computers.
It's been running fine since I installed FC5 on it, and has given me no problems with clients being Windows 2000/XP computers and Fedora 5, 6, & 7 installations on my main rig. Problems started occurring with Werewolf for some reason.

Mainly I just wanted an idea of the OS + samba version. It might be good to post the version of samba being used there (some earlier versions had problems with unix extensions). In fact... it might be interesting to mount up this cifs share with unix extensions disabled and see if the problem is still reproducible. Unmount all cifs shares, then:

# modprobe cifs
# echo 0 > /proc/fs/cifs/LinuxExtensionsEnabled

...then mount up the cifs share and try to reproduce the problem. This will disable posix extensions, so you won't see proper file modes, owners, etc. I assume that won't matter much for this though.

That information is in my original report. The samba version on the server is samba-3.0.24-7.fc5 (client and common). I'll try that when I get home, but I find it odd that this only started to happen with Werewolf. I have been unable to come up with a testing methodology that could yield the same results as what I've been experiencing from a pure text environment, without X. I need some suggestions on commands which can scan the whole "mount" in the hopes of triggering the error. As soon as I'm able to reproduce this from a runlevel 3 environment, I'll try mounting the share without the Unix extensions as you suggested and test again. Thanks again for your continued interest in this matter.

I finished reading the kdump and sysrq documentation you pointed at, thank you. If I understand correctly, to know exactly what the machine is doing when it is "hung", the procedure should be:

1. Configure kdump and kexec to generate the appropriate kernel dump.
2. Start the task that causes the hang to occur.
3. When the system "feels hung", cause the system to crash, forcing a reboot due to kexec and kdump, which will flush the memory to /var/crash/<date>/vmcore.
4. Reboot into the normal state.

Now the question: what exactly do I do with the dumped vmcore? The article you linked me to states that you could do some forensics on the kernel by installing the debuginfo package of the kernel and analysing that against the vmcore using the crash command, but I wouldn't know what I'd be looking for. Should I attach the vmcore file to this bugzilla report?

On the other hand... Thinking a bit more about the methodology to cause this problem from init 3: are you familiar with, or know of, any tools that can retrieve metadata from digital audio files (mainly .mp3, .ogg, and .flac) from the command line? In such a case, I could write a little script to list all the contents of the mount, then filter out the media files (flac, ogg, mp3), and run a for-in loop with such a tool to retrieve the metadata to /dev/null, in the hope that heavy I/O would trigger the issue. I was able to reproduce the problem using EasyTag, as it recursively searches the given path for any supported media file type; basically the same behavior as Amarok and Rhythmbox.

> I have been unable to come up with a testing methodology which could yield the
> same results as what I've been experiencing from a pure text environment,

Rather than concerning yourself with that at the moment, it might be easier to just disable unix extensions and do a quick test with that. If it works fine, then we'll know that the problem is confined to that area of the code. Some earlier versions of samba have unix extensions enabled by default but have problems with them, so I somewhat suspect that the problem may be there.

> I need some suggestions on commands which can scan the whole "mount"
> in the hopes of triggering the error.
Maybe a "find" combined with something that does a read of the header info on each file? I'm not sure... again, I suspect that this problem is related to posix extensions, so I think disabling them would be the best initial test. (That might also give you a workaround until we can determine the cause and fix...)

Some updates: earlier in the day I was simply copying some files (~70 MB) to the mount with Nautilus and the system came to a crawl. Input speed was *severely* compromised (both mouse and keyboard, in both X and a VT text console), however top didn't show anything of particular interest. As I write this I'm going to test disabling the Unix extensions in the cifs kernel module... Will report back as soon as I have done this and tested with the very same files, first in Nautilus and then in a VT text console.

Just did that, tried copying the files again, and this time the system hang occurred much faster. I remember a while back Samba (or rather gnome-vfs-smb) having some serious problems with D-Bus, causing the SMB transaction to eat a LOT of CPU, but never to trash a system. I'm not able to do more testing for the night, so I'll do it tomorrow when I get more time. This time I'll go for the full sysrq->kexec->kdump methodology... Apparently this is not a bug in Samba, but rather the kernel. When I have the vmcore created, I'll report back.

> Apparently this is not a bug in Samba, but rather the kernel.
Yes, if the box is locking up when accessing a CIFS mount, then it's likely a
kernel bug. The question is whether there's something about this server that's
helping to trigger the problem. It sounds like there may not be.
With a core we might be able to tell more. Even better -- before you crash the
box, do a couple of sysrq-t's. That should dump the stack of each thread on the
box to the ring buffer and may help us determine whether the box is really
stuck, or just very busy.
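For reference, the "find combined with a header read" console reproducer suggested above could be sketched as a small script. This is only an illustrative sketch: the mountpoint path is hypothetical, and dd stands in for a real tag-reading tool.

```shell
# scan_share: walk a mountpoint and read the first 64 KB of every media
# file found, generating sustained read I/O from a plain console (no X).
# Takes the mountpoint as its only argument.
scan_share() {
    find "$1" -type f \( -name '*.mp3' -o -name '*.ogg' -o -name '*.flac' \) \
        -exec dd if={} of=/dev/null bs=64k count=1 \; 2>/dev/null
}

# Example invocation (mountpoint is hypothetical):
# scan_share /mnt/share
```

Reading only the first block of each file roughly mimics what a tag scanner like EasyTag does, which is the access pattern that reproduced the hang.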
So when things slow to a crawl like this, are you able to switch out of X to a VT and run commands? If so, is the VT similarly unresponsive? ...Also, if you do the same 70 MB copy of the files to the mount with "cp" from a console rather than nautilus, do you get the same slowdown?

For the record, I mounted up a samba share from an F7 box with ~10GB of ogg/mp3/flac files and had easytag walk the share. I didn't see any hanging or significant slowdown. The samba server was samba-3.0.27-0.fc7 and the client is running 2.6.23.1-49.fc8. There are some differences between my test rig and yours though: my client is x86 (rather than x86_64) and the server is F7 instead of F5. I'm starting to wonder if this may be something lower in the network stack (network driver maybe). Once we get a core it may be easier to tell...

As for where to upload the core: it'll likely be too big to attach to the BZ. We have an anonymous FTP server that you can upload it to, but I'll need to get details about how to access it. Once I do that I'll post them here...

The anonymous FTP server is dropbox.redhat.com. After logging in, cd to "incoming" and upload the bzip2'ed core file with a unique filename. I recommend: bz386411-vmcore.1.bz2. Once it's uploaded, post a comment here and I'll have a look when I'm able...

A quick summary of the events up to now: I have been unsuccessful at creating a kernel dump; most likely I'm doing something wrong.
Here's what I'm doing:

1. Use kexec to load the kernel image and initrd like this (note that both rhgb and quiet are removed from the command line at boot; only a 3 is placed after 'ro root=LABEL=/' to boot straight into runlevel 3):

kexec -l /boot/vmlinuz-`uname -r` --initrd=/boot/initrd-`uname -r`.img --command-line="`cat /proc/cmdline`"

According to the available documentation, it is recommended to append the crashkernel argument too, so the --command-line argument to kexec ends up like this:

--command-line="`cat /proc/cmdline` crashkernel=128M@16M"

2. For these testing purposes, I have enabled kernel.sysrq = 1 in /etc/sysctl.conf, so I don't have to type 'echo "$CMD" > /proc/sysrq-trigger' and can simply sysrq the system.

3. Maybe I did not correctly understand the instructions given for generating a vmcore dump. What I tried to do was load the kernel image, then immediately reboot into it, but the system hung while coming back up again. At first I thought it was the 'crashkernel=128M@16M' argument, so I tried again without it. Note that I've tried this procedure with two different initrds, initrd-`uname -r`.img and initrd-`uname -r`kdump.img. I do not recall, nor did I write down, the last message on the screen when this happens, but I'll try again and record it... At any rate, my question is: should I reboot immediately into the kexec kernel and then crash the system while it's being hogged down by the operation, or should I first generate a sysrq call and then reboot, or how exactly should I do it? I have not been able to get a vmcore in /var/crash/<date>/ yet.

4.
>> I'm starting to wonder if this may be something lower in the network stack
>> (network driver maybe). Once we get a core it may be easier to tell...
Well, these are my thoughts too, as there seems to be a problem with the device driver (and I want to leave QUITE CLEAR this problem was not present in F7 Moonshine with basically the same kernel version and same architecture), where the system gets sudden disconnections. The interface will have an IP but the machine is unable to get any data through the network link, and if brought down, the interface cannot get an IP from the DHCP server (running in a LinkSys router, assigned by MAC). Looking for further information in /var/log/messages and dmesg, I see this (note the combined output with the CIFS operation):

NETDEV WATCHDOG: eth0: transmit timed out
0000:00:09.0: tulip_stop_rxtx() failed (CSR5 0xfc07c057 CSR6 0xff970111)
CIFS VFS: Write2 ret -112, wrote 0

However, this is seemingly random and not necessarily with CIFS only, but also while, for instance, streaming a video off the web, like a youtube video or others. It also happens while the connection is idle (which I thought was another issue altogether).

5. The slowing to a crawl has not only happened when accessing the CIFS mount, but also while doing totally unrelated things like watching local videos, for instance a DVD with video files (the one it happened with had all the FUDCon videos I've got). Something "strange" happened with it too, a few minutes before the system came to a crawl: the drive wasn't reading any data off the disc and the video player would run out of buffer, sort of like a DMA problem; then a few minutes later the system came to a crawl and eventually locked up. Not sure if this is related, or if it is a whole other issue. While this happened the CIFS share was mounted.

6.
>> Samba server was samba-3.0.27-0.fc7 and client is running 2.6.23.1-49.fc8.
>> There are some differences between my test rig and yours though. My client
>> is x86 (rather than x86_64) and the server is F7 instead of F5.
Just tested with an x86 laptop running F8 with the latest kernel; the only difference from the desktop is the use of AIGLX and Compiz (I have been unable to get the fglrx driver working right, but that's a WHOLE other issue, local to fglrx). This DOES NOT show on it, same server, so by now I'm pretty sure it is local to x86_64... Will test this system more thoroughly, though. I have to start this laptop with the kernel command line arguments 'acpi=off nolapic noapic' for it to both boot and be able to do networking. Dunno if this has any effect on this; I'd doubt it. I'll gather more information and report back as soon as I have something.

Ok, so here is what I get with kexec when I try to reboot into the kernel image loaded into memory; the boot process "freezes", i.e., I'm not able to boot into that image:

--snip--
Kernel command line: ro root=LABEL=/ 3
initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
TSC calibrated against PM_TIMER
time.c: Detected 1799.790 MHz processor
Console colour: VGA+ 80x25
Console [TTY0] enabled
Checking aperture
CPU0 aperture @ e8000000 size 128 MB
Memory: 1026300k/1048512k available (2442K kernel code, 21820 reserved, 1381K data, 324K init)
SLUB: Genslabs=23, HWalign=64, Order=0-1, MinObjects=4, CPUs=1, Nodes=1
--------

Guess I will have to build a vanilla kernel without any patches and see how it behaves with all the problems I've encountered thus far.

> Well these are my thoughts too, as there seems to be a problem with the device
> driver (and I want to leave QUITE CLEAR this problem was not present in F7
> Moonshine with basically the same kernel version and same architecture), where
> the system gets sudden disconnections.

This sounds very significant.

> This DOES NOT show on it, same server, so by now I'm pretty sure it is
> local to x86_64

Actually, I suspect more that there's a real problem with the network hardware or driver, rather than something arch specific.
Could you install the "sos" package and run an sosreport on this machine? Would it be possible to drop a different network card (preferably a different type of card) into this box and test with that? That would help rule the network card driver and hardware in or out. Given what you've stated about the network card showing issues, I think it would be best to focus on that for now. Let's set aside the coredump for the moment and see what we can determine about the underlying network stack...

Running the sosreport tool right now. I'll see if I can get a different type of NIC (I've got a spare, but it's identical to this one, and the one currently in the system works just fine with different live systems and the F7 install that dual boots on this machine).

Created attachment 263251 [details] dump of /var/log/messages for today

Ok, I've run a full sosreport on the machine, just so you know exactly what's in it, and hopefully can spot a problem or two. I'm sending the file as gmureddu.18112007.1.tar.bz2; it's about ~58 MB.

Ok, it has got to be the network stack or the driver (or both). Just for kicks I decided to give the previous kernel a shot (which I had to re-install), 2.6.23.1-42.fc8. This kernel doesn't exhibit the performance degradation and utter unresponsiveness of the system, but does rather quickly show the "sudden disconnection" problem; however, in this case there is nothing printed to dmesg (nor to /var/log/messages) regarding the event. The problem is that I still can't boot the system with kexec, neither with the regular initrd nor with the kdump initrd; it stops at the very same spot that -49.fc8 does.

The kdump problem sounds like it may be a different bug. You might want to open a separate BZ for that... Between those 2 kernel revs there's only been 1 CIFS patch:

- CIFS: fix corruption when server returns EAGAIN (#357001)

...and that one is a small but obviously correct fix (and without it cifs can cause random memory corruption).
There are some other driver and networking fixes in there too, so one of those may be at fault. Actually, since you have a "working" and a "non-working" kernel, it might be best to build a -49.fc8 kernel that does not have these patches:

linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch
linux-2.6-cifs-fix-incomplete-rcv.patch
linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch

...and see if the problem is still reproducible. If so, then that should tell us that they are not a factor. If not, then we can focus on those patches and see why they might be causing an issue...

Hmm... the sosreport seems to be corrupt:

$ bzip2 -tvv gmureddu.18112007.1.tar.bz2
...
[108: huff+mtf rt+rld]
[109: huff+mtf rt+rld]
[110: huff+mtf file ends unexpectedly

I think the best course of action at this point is to build a kernel without those 3 cifs patches and test it. Gian, would you be able to do that? If the problem goes away, then it may be a combination of factors at play here: possible problems with the lower networking stack that are somehow tickling a bug in CIFS.

Looks like there isn't any difference between the tulip drivers on the latest F7 and latest F8 kernels, so I'm not guessing tulip is completely to blame. Your hunch is probably correct, Jeff.

(In reply to comment #31)
> Hmm...the sosreport seems to be corrupt:
>
> $ bzip2 -tvv gmureddu.18112007.1.tar.bz2
> ...
> [108: huff+mtf rt+rld]
> [109: huff+mtf rt+rld]
> [110: huff+mtf file ends unexpectedly
>
> I think the best course of action at this point is to build a kernel without
> those 3 cifs patches and test it. Gian, would you be able to do that?

I think so, yeah. I'll only have to install the necessary infrastructure packages (rpmbuild, etc.) and get the kernel.src.rpm; when I have it built, I'll let you know. I'm posting this from F8-Live-i686, but since it doesn't feature a full installation of Samba, I can't "mount" the share and as such can't perform any of the tests. Also, it features the -42.fc8 kernel...
However, the "sudden disconnection" issue has not appeared after some heavy network use (through gnome-vfs-smb and FTP). This is of course inconclusive, as I'd have to test with a fully installed i686 system and see if I can recreate the problem. Will see how it goes with the custom kernel recompile; I even thought of building a vanilla 2.6.23.1 kernel as a reference, for completeness' sake.

Posting this from a freshly booted 2.6.23.8 vanilla kernel, without any patches applied. Not one of the symptoms shows with it. The connection stays strong and CIFS shares don't cause the system to crawl. I may hold on to this kernel until the next official kernel release.

Vanilla kernels don't tell me much, since that still leaves a lot of patches that could be candidates for the problem. My suggestion would again be to build a stock -49.fc8 kernel without these three patches:

linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch
linux-2.6-cifs-fix-incomplete-rcv.patch
linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch

If you're not sure how to do that, then let me know and I'll build one for you.

I'll build a test kernel later tonight and report back. I do know how to do that; I only need to tweak the .spec a bit, and I'll let you know what happens without those patches. The vanilla kernel I built was to rule out hardware failure. Having now established that I don't have a hardware failure, it must come down to some kind of interaction between these patches and the NIC driver (somehow).

Did as asked, built the kernel and installed it. I did it in a rather quick process, so I only used rpmbuild -bp to patch the kernel and built by hand (make bzImage modules modules_install install). The results of the installed kernel are actually worse than the original 2.6.23.1-49.fc8: while CIFS share access was better tolerated, the problem is still present, and eventually the system crashes.
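For anyone repeating this test, the three suspect patches could be disabled in the spec roughly as follows. This is only a sketch: the helper name and the spec path are hypothetical, and only the patch names come from this thread.

```shell
# disable_cifs_patches: comment out every line in a kernel spec file that
# mentions one of the three suspect CIFS patches, so that neither the
# PatchNNN: declaration nor the line applying it remains active.
disable_cifs_patches() {
    spec=$1
    for p in linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch \
             linux-2.6-cifs-fix-incomplete-rcv.patch \
             linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch; do
        # prefix any line mentioning this patch with '#'
        sed -i "/$p/s/^/#/" "$spec"
    done
}

# Usage (path is hypothetical):
# disable_cifs_patches ~/rpmbuild/SPECS/kernel.spec
```

One caveat with this approach: rpm macros can expand even inside spec comments, so checking the resulting spec with a trial rpmbuild -bp before a full build would be prudent.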
Also, interactivity is severely affected: the mouse pointer and keyboard lag (even in a text environment), and X has an overall CPU use of ~30% (not seen with a custom vanilla kernel, or with the stock -42 and -49.fc8 kernels unless "triggering" the issue at hand). The issue is not present with the custom vanilla 2.6.23.8 I built before. Since I can't troubleshoot this issue (kdump is not working either) and there's already a new kernel pushed out, I'll test with that, and if the problem persists in the new kernel, I'll try my best to get kdump working this time. I haven't the slightest clue as to what might be causing this; scheduler, maybe?

Ok, now this is getting to be more of an issue... Updated to the latest 2.6.23.8-63 kernel, and even though it took a while to present the problem (and there was no longer a problem with interactivity), the problem with CIFS shares still persists. At this point I'm not so sure it is CIFS itself, but rather something that it triggers inside the kernel. Tried again to get a kernel dump and wasn't able to. This time, when asking Amarok to perform a full library rebuild, it actually got very far (45%) before X locked up; then when I was finally able to get to a VT, I was able to log in, but no shell prompt would come up. This time, however, sysrq-b worked. So I went again and tried to get a kernel dump... Just like before, upon booting the kexec image, the process stops and doesn't finish loading. Three kernels in a row; I'd say it is one huge problem, which vanilla kernels don't exhibit. So what do I do about this? How do I generate a kdump so it is finally clear what might be going on?

> This time when asking Amarok to perform a full library rebuild, it actually
> got very far (45%) before X locked up, then when I was finally able to get to a
> VT, I was able to log in, but no shell prompt would come up.

With a problem like this, I'd suggest having a root shell already logged in on a VT and ready to go.
> So I went again, and tried to get a kernel dump this time... Just like
> before, upon booting the kexec image, the process stops and doesn't finish
> loading.

If you trigger a crash when the kernel is "healthy", does it also hang like this? Or does kdump only hang when you try to trigger a core with the box in this state? If it's hanging whenever you try to do a kdump, then that's almost certainly a separate issue, and I suggest opening a new BZ for it so that you can work with the people who specialize in kdump to get it resolved. If not, then that would suggest that the two problems are related (perhaps some hardware that's misbehaving?). If so, then focusing on getting kdump working might shed some light on resolving the general lockups.

In the meantime, since you're having trouble getting a core, the best thing to do might be to just get some sysrq info when the box is in this state. Do a sysrq-t, wait a few seconds and then do another, and then do sysrq-w. Have an already-logged-in root shell ready on a VT before getting the box into this state, and then do a:

# dmesg -s 131072 > /tmp/dmesg.out

...alternately, you can also set up a serial console to collect the info. If you're lucky, it may even get logged to /var/log/messages, but that's often spotty.

No word on this in the last couple of weeks; reducing severity to "high".

Nothing in a month here. In the meantime F8 has rebased to 2.6.23.9; can you see if you are still having the problem? If I don't see a response in one month, I'll have to close this as INSUFFICIENT_DATA.

Well, I can say that this problem seems to stem from the combination of 64 bits, the driver, and the network stack. Since originally posting this, I have acquired a new system (running fine thus far), and used the old system for a 32-bit installation of F8. The 32-bit installation has no issues whatsoever on the very same hardware configuration that originally showed the problem, so maybe this is simply a 64-bit issue in the tulip driver?

Ok, thanks.
I'm going to hand this off to Andy, since it sounds like a problem with the network driver is at the root of this. Once we have an idea of what the problem there is, there may be a problem with CIFS too that's tickled by the networking issue. Andy, let me know if I can be of help.

I don't think I have any tulip cards in a 64-bit machine, however (I may personally have an ancient tulip card around somewhere though).

This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.