Bug 386411
Summary: problem with tulip driver on x86_64
Product: Fedora
Component: kernel
Version: 8
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: low
Reporter: Gian Paolo Mureddu <gmureddu>
Assignee: Andy Gospodarek <agospoda>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: jlayton, jonstanley, peterm
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2009-01-09 07:26:18 UTC
Description
Gian Paolo Mureddu, 2007-11-16 07:37:50 UTC
Exactly what kernel are you seeing this on? Would it be possible to force a crash on this box and collect a coredump (via kdump)? If so we should be able to look at that and see what the box is doing when it's hung like this...

I knew I was forgetting something... Kernel version is 2.6.23.1-49.fc8. How exactly would I go about doing this coredump with kdump?

This is the only doc I can quickly find. It's for FC6, but F8 should be very similar. http://fedoraproject.org/wiki/FC6KdumpKexecHowTo When the box locks up, you'll need to intentionally crash it; for that, you can use sysrq-c: http://kbase.redhat.com/faq/FAQ_80_5559.shtm An even more helpful thing to do would be to come up with a simple way to reproduce this (i.e. some way to reproduce it from the console that doesn't involve X). BTW: what do you mean by a "locally mounted" SMB share? Are the server and client the same machine? If so, can you reproduce this with the host acting in only one role (server or client)? That might help narrow things down for a reproducer...

By locally mounted I mean what I said about the share being mounted via fstab, to be accessed through a mountpoint like a regular partition. Sorry if I wasn't clear on that. I'll try to reproduce the error when I get back home and can further test this; meanwhile I'll read up on what you linked and try to get something (in all my years of Linux usage, I never knew how to invoke sysrq; guess it's never too late to learn ;) )... To see if this is only in X or not, I'll try to simply run a "du -sh" on the mounted cifs share and see if that can trigger the problem.

Ok, so the server is a different physical machine. It might not hurt to know what sort of server this is...

What do you mean by "what sort of server"? It is a Fedora 5 machine built from spare parts of some of my old computers.
It's been running fine since I installed FC5 on it, and has given me no problems with clients being Windows 2000/XP computers and Fedora 5, 6, & 7 installations on my main rig. Problems started occurring with Werewolf for some reason.

Mainly I just wanted an idea of the OS + samba version. It might be good to post the version of samba being used there (some earlier versions had problems with unix extensions). In fact... it might be interesting to mount up this cifs share with unix extensions disabled and see if the problem is still reproducible. Unmount all cifs shares, then:

# modprobe cifs
# echo 0 > /proc/fs/cifs/LinuxExtensionsEnabled

...then mount up the cifs share and try to reproduce the problem. This will disable posix extensions, so you won't see proper file modes, owners, etc. I assume that won't matter much for this though.

That information is in my original report. The samba version on the server is samba-3.0.24-7.fc5 (client and common). I'll try that when I get home, but I find it odd that this only started to happen with Werewolf. I have been unable to come up with a testing methodology that could yield the same results as what I've been experiencing from a pure text environment, without X. I need some suggestions on commands which can scan the whole "mount" in the hopes of triggering the error. As soon as I'm able to reproduce this from a runlevel 3 environment, I'll try mounting the share without the Unix extensions as you suggested and test again. Thanks again for your continued interest in this matter.

I finished reading the kdump and sysrq documentation you pointed at, thank you. If I understand correctly, to know exactly what the machine is doing when it is "hung", the procedure should be:

1. Configure kdump and kexec to generate the appropriate kernel dump.
2. Start the task that causes the hang to occur.
3. When the system "feels hung", cause the system to crash, forcing a reboot due to kexec and kdump, which will flush the memory to /var/crash/<date>/vmcore.
4. Reboot into the normal state.

Now the question: what exactly do I do with the dumped vmcore? The article you linked me to states that you could do some forensics on the kernel by installing the debuginfo package of the kernel and analysing that against the vmcore using the crash command, but I wouldn't know what I'd be looking for. Should I attach the vmcore file to this bugzilla report?

On the other hand... Thinking a bit more about the methodology to cause this problem from init 3: are you familiar with, or know of, any tools that can retrieve metadata from digital audio files (mainly .mp3, .ogg, and .flac) from the command line? In such a case, I could write a little script to list all the contents of the mount, then filter out the media files (flac, ogg, mp3), and run a for-in loop with such a tool to retrieve the metadata to /dev/null, in the hope that heavy I/O would trigger the issue. I was able to reproduce the problem using EasyTag, as it recursively searches the given path for any supported media file type; basically the same behavior as Amarok and Rhythmbox.

> I have been unable to come up with a testing methodology which could yield the
> same results as what I've been experiencing from a pure text environment,

Rather than concerning yourself with that at the moment, it might be easier to just disable unix extensions and do a quick test with that. If it works fine, then we'll know that the problem is confined to that area of the code. Some earlier versions of samba have unix extensions enabled by default but have problems with them, so I somewhat suspect that the problem may be there.

> I need some suggestions on commands which can scan the whole "mount"
> in the hopes of triggering the error.
Maybe a "find" combined with something that does a read of the header info on each file? I'm not sure... again, I suspect that this problem is related to posix extensions, so I think disabling them would be the best initial test. (That might also give you a workaround until we can determine the cause and fix...)

Some updates: earlier in the day I was simply copying some files (~70 MB) to the mount with Nautilus and the system came to a crawl. Input speed was *severely* compromised (both mouse and keyboard, in both X and a VT text console), however top didn't show anything of particular interest. As I write this I'm going to test disabling the Unix extensions in the cifs kernel module... Will report back as soon as I have done this and tested with the very same files, first in Nautilus and then in a VT text console.

Just did that, tried copying the files again, and this time the system hang occurred much faster. I remember a while back Samba (or rather gnome-vfs-smb) having some serious problems with D-Bus, causing the SMB transaction to eat a LOT of CPU, but never to trash a system. I'm not able to do more testing for the night, so I'll do it tomorrow when I get more time. This time I'll go for the full sysrq->kexec->kdump methodology... Apparently this is not a bug in Samba, but rather the kernel. When I have the vmcore created, I'll report back.

> Apparently this is not a bug in Samba, but rather the kernel.
Yes, if the box is locking up when accessing a CIFS mount, then it's likely a
kernel bug. The question is whether there's something about this server that's
helping to trigger the problem. It sounds like there may not be.
With a core we might be able to tell more. Even better -- before you crash the
box, do a couple of sysrq-t's. That should dump the stack of each thread on the
box to the ring buffer and may help us determine whether the box is really
stuck, or just very busy.
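For reference, the "find combined with a header read" console reproducer suggested above could be sketched as a small script. This is only an illustrative sketch: the mountpoint path is hypothetical, and dd stands in for a real tag-reading tool.

```shell
# scan_share: walk a mountpoint and read the first 64 KB of every media
# file found, generating sustained read I/O from a plain console (no X).
# Takes the mountpoint as its only argument.
scan_share() {
    find "$1" -type f \( -name '*.mp3' -o -name '*.ogg' -o -name '*.flac' \) \
        -exec dd if={} of=/dev/null bs=64k count=1 \; 2>/dev/null
}

# Example invocation (mountpoint is hypothetical):
# scan_share /mnt/share
```

Reading only the first block of each file roughly mimics what a tag scanner like EasyTag does, which is the access pattern that reproduced the hang.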
So when things slow to a crawl like this, are you able to switch out of X to a VT and run commands? If so, is the VT similarly unresponsive? ...Also, if you do the same 70 MB copy of the files to the mount with "cp" from a console rather than nautilus, do you get the same slowdown?

For the record, I mounted up a samba share from an F7 box with ~10GB of ogg/mp3/flac files and had easytag walk the share. I didn't see any hanging or significant slowdown. The samba server was samba-3.0.27-0.fc7 and the client is running 2.6.23.1-49.fc8. There are some differences between my test rig and yours though: my client is x86 (rather than x86_64) and the server is F7 instead of F5. I'm starting to wonder if this may be something lower in the network stack (network driver maybe). Once we get a core it may be easier to tell...

As for where to upload the core: it'll likely be too big to attach to the BZ. We have an anonymous FTP server that you can upload it to, but I'll need to get details about how to access it. Once I do that I'll post them here...

The anonymous FTP server is dropbox.redhat.com. After logging in, cd to "incoming" and upload the bzip2'ed core file with a unique filename. I recommend: bz386411-vmcore.1.bz2. Once it's uploaded, post a comment here and I'll have a look when I'm able...

A quick summary of the events up to now: I have been unsuccessful at creating a kernel dump; most likely I'm doing something wrong.
Here's what I'm doing:

1. Use kexec to load the kernel image and initrd like this (note that both rhgb and quiet are removed from the command line at boot; only a 3 is placed after 'ro root=LABEL=/' to boot straight into runlevel 3):

kexec -l /boot/vmlinuz-`uname -r` --initrd=/boot/initrd-`uname -r`.img --command-line="`cat /proc/cmdline`"

According to the available documentation, it is recommended to append the crashkernel argument too, so the --command-line argument to kexec ends up like this:

--command-line="`cat /proc/cmdline` crashkernel=128M@16M"

2. For these testing purposes, I have enabled kernel.sysrq = 1 in /etc/sysctl.conf, so I don't have to type 'echo "$CMD" > /proc/sysrq-trigger' and can simply sysrq the system.

3. Maybe I did not correctly understand the instructions given for generating a vmcore dump. What I tried to do was load the kernel image, then immediately reboot into it, but the system hung while coming back up again. At first I thought it was the 'crashkernel=128M@16M' argument, so I tried again without it. Note that I've tried this procedure with two different initrds, initrd-`uname -r`.img and initrd-`uname -r`kdump.img. I do not recall, nor did I write down, the last message on the screen when this happens, but I'll try again and record it... At any rate, my question is: should I reboot immediately into the kexec kernel and then crash the system while it's being hogged down by the operation, or should I first generate a sysrq call and then reboot, or how exactly should I do it? I have not been able to get a vmcore in /var/crash/<date>/ yet.

4.
>> I'm starting to wonder if this may be something lower in the network stack
>> (network driver maybe). Once we get a core it may be easier to tell...
Well, these are my thoughts too, as there seems to be a problem with the device driver (and I want to leave QUITE CLEAR this problem was not present in F7 Moonshine with basically the same kernel version and same architecture), where the system gets sudden disconnections. The interface will have an IP but the machine is unable to get any data through the network link, and if brought down, the interface cannot get an IP from the DHCP server (running in a LinkSys router, assigned by MAC). Looking for further information in /var/log/messages and dmesg, I see this (note the combined output with the CIFS operation):

NETDEV WATCHDOG: eth0: transmit timed out
0000:00:09.0: tulip_stop_rxtx() failed (CSR5 0xfc07c057 CSR6 0xff970111)
CIFS VFS: Write2 ret -112, wrote 0

However, this is seemingly random and not necessarily with CIFS only, but also while, for instance, streaming a video off the web, like a youtube video or others. It also happens while the connection is idle (which I thought was another issue altogether).

5. The slowing to a crawl has not only happened when accessing the CIFS mount, but also while doing totally unrelated things like watching local videos, for instance a DVD with video files (the one it happened with had all the FUDCon videos I've got). Something "strange" happened with it too, a few minutes before the system came to a crawl: the drive wasn't reading any data off the disc and the video player would run out of buffer, sort of like a DMA problem; then a few minutes later the system came to a crawl and eventually locked up. Not sure if this is related, or if it is a whole other issue. While this happened the CIFS share was mounted.

6.
>> Samba server was samba-3.0.27-0.fc7 and client is running 2.6.23.1-49.fc8.
>> There are some differences between my test rig and yours though. My client
>> is x86 (rather than x86_64) and the server is F7 instead of F5.
Just tested with an x86 laptop running F8 with the latest kernel; the only difference from the desktop is the use of AIGLX and Compiz (I have been unable to get the fglrx driver working right, but that's a WHOLE other issue, local to fglrx). This DOES NOT show on it, same server, so by now I'm pretty sure it is local to x86_64... Will test this system more thoroughly, though. I have to start this laptop with the kernel command line arguments 'acpi=off nolapic noapic' for it to both boot and be able to do networking. Dunno if this has any effect on this; I'd doubt it. I'll gather more information and report back as soon as I have something.

Ok, so here is what I get with kexec when I try to reboot into the kernel image loaded into memory; the boot process "freezes", i.e., I'm not able to boot into that image:

--snip--
Kernel command line: ro root=LABEL=/ 3
initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
TSC calibrated against PM_TIMER
time.c: Detected 1799.790 MHz processor
Console colour: VGA+ 80x25
Console [TTY0] enabled
Checking aperture
CPU0 aperture @ e8000000 size 128 MB
Memory: 1026300k/1048512k available (2442K kernel code, 21820 reserved, 1381K data, 324K init)
SLUB: Genslabs=23, HWalign=64, Order=0-1, MinObjects=4, CPUs=1, Nodes=1
--------

Guess I will have to build a vanilla kernel without any patches and see how it behaves with all the problems I've encountered thus far.

> Well these are my thoughts too, as there seems to be a problem with the device
> driver (and I want to leave QUITE CLEAR this problem was not present in F7
> Moonshine with basically the same kernel version and same architecture), where
> the system gets sudden disconnections.

This sounds very significant.

> This DOES NOT show on it, same server, so by now I'm pretty sure it is
> local to x86_64

Actually, I suspect more that there's a real problem with the network hardware or driver, rather than something arch specific.
Could you install the "sos" package and run an sosreport on this machine? Would it be possible to drop a different network card (preferably a different type of card) into this box and test with that? That would help rule the network card driver and hardware in or out. Given what you've stated about the network card showing issues, I think it would be best to focus on that for now. Let's set aside the coredump for the moment and see what we can determine about the underlying network stack...

Running the sosreport tool right now. I'll see if I can get a different type of NIC (I've got a spare, but it's identical to this one, and the one currently in the system works just fine with different live systems and the F7 install that dual boots on this machine).

Created attachment 263251 [details] dump of /var/log/messages for today

Ok, I've run a full sosreport on the machine, just so you know exactly what's in it, and hopefully can spot a problem or two. I'm sending the file as gmureddu.18112007.1.tar.bz2; it's about ~58 MB.

Ok, it has got to be the network stack or the driver (or both). Just for kicks I decided to give the previous kernel a shot (which I had to re-install), 2.6.23.1-42.fc8. This kernel doesn't exhibit the performance degradation and utter unresponsiveness of the system, but does rather quickly show the "sudden disconnection" problem; however, in this case there is nothing printed to dmesg (nor to /var/log/messages) regarding the event. The problem is that I still can't boot the system with kexec, neither with the regular initrd nor with the kdump initrd; it stops at the very same spot that -49.fc8 does.

The kdump problem sounds like it may be a different bug. You might want to open a separate BZ for that... Between those 2 kernel revs there's only been 1 CIFS patch:

- CIFS: fix corruption when server returns EAGAIN (#357001)

...and that one is a small but obviously correct fix (and without it cifs can cause random memory corruption).
There are some other driver and networking fixes in there too, so one of those may be at fault. Actually, since you have a "working" and a "non-working" kernel, it might be best to build a -49.fc8 kernel that does not have these patches:

linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch
linux-2.6-cifs-fix-incomplete-rcv.patch
linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch

...and see if the problem is still reproducible. If so, then that should tell us that they are not a factor. If not, then we can focus on those patches and see why they might be causing an issue...

Hmm... the sosreport seems to be corrupt:

$ bzip2 -tvv gmureddu.18112007.1.tar.bz2
...
[108: huff+mtf rt+rld]
[109: huff+mtf rt+rld]
[110: huff+mtf file ends unexpectedly

I think the best course of action at this point is to build a kernel without those 3 cifs patches and test it. Gian, would you be able to do that? If the problem goes away, then it may be a combination of factors at play here: possible problems with the lower networking stack that are somehow tickling a bug in CIFS.

Looks like there isn't any difference between the tulip drivers on the latest F7 and latest F8 kernels, so I'm not guessing tulip is completely to blame. Your hunch is probably correct, Jeff.

(In reply to comment #31)
> Hmm...the sosreport seems to be corrupt:
>
> $ bzip2 -tvv gmureddu.18112007.1.tar.bz2
> ...
> [108: huff+mtf rt+rld]
> [109: huff+mtf rt+rld]
> [110: huff+mtf file ends unexpectedly
>
> I think the best course of action at this point is to build a kernel without
> those 3 cifs patches and test it. Gian, would you be able to do that?

I think so, yeah. I'll only have to install the necessary infrastructure packages (rpmbuild, etc.) and get the kernel.src.rpm; when I have it built, I'll let you know. I'm posting this from F8-Live-i686, but since it doesn't feature a full installation of Samba, I can't "mount" the share and as such can't perform any of the tests. Also, it features the -42.fc8 kernel...
However, the "sudden disconnection" issue has not appeared after some heavy network use (through gnome-vfs-smb and FTP). This is of course inconclusive, as I'd have to test with a fully installed i686 system and see if I can recreate the problem. Will see how it goes with the custom kernel recompile; I even thought of building a vanilla 2.6.23.1 kernel as a reference, for completeness' sake.

Posting this from a freshly booted 2.6.23.8 vanilla kernel, without any patches applied. Not one of the symptoms shows with it. The connection stays strong and CIFS shares don't cause the system to crawl. I may hold on to this kernel until the next official kernel release.

Vanilla kernels don't tell me much, since that still leaves a lot of patches that could be candidates for the problem. My suggestion would again be to build a stock -49.fc8 kernel without these three patches:

linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch
linux-2.6-cifs-fix-incomplete-rcv.patch
linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch

If you're not sure how to do that, then let me know and I'll build one for you.

I'll build a test kernel later tonight and report back. I do know how to do that; I only need to tweak the .spec a bit, and I'll let you know what happens without those patches. The vanilla kernel I built was to rule out hardware failure. Having now established that I don't have a hardware failure, it must come down to some kind of interaction between these patches and the NIC driver (somehow).

Did as asked, built the kernel and installed it. I did it in a rather quick process, so I only used rpmbuild -bp to patch the kernel and built by hand (make bzImage modules modules_install install). The results of the installed kernel are actually worse than the original 2.6.23.1-49.fc8: while CIFS share access was better tolerated, the problem is still present, and eventually the system crashes.
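For anyone repeating this test, the three suspect patches could be disabled in the spec roughly as follows. This is only a sketch: the helper name and the spec path are hypothetical, and only the patch names come from this thread.

```shell
# disable_cifs_patches: comment out every line in a kernel spec file that
# mentions one of the three suspect CIFS patches, so that neither the
# PatchNNN: declaration nor the line applying it remains active.
disable_cifs_patches() {
    spec=$1
    for p in linux-2.6-cifs-fix-bad-handling-of-EAGAIN.patch \
             linux-2.6-cifs-fix-incomplete-rcv.patch \
             linux-2.6-cifs-typo-in-cifs_reconnect-fix.patch; do
        # prefix any line mentioning this patch with '#'
        sed -i "/$p/s/^/#/" "$spec"
    done
}

# Usage (path is hypothetical):
# disable_cifs_patches ~/rpmbuild/SPECS/kernel.spec
```

One caveat with this approach: rpm macros can expand even inside spec comments, so checking the resulting spec with a trial rpmbuild -bp before a full build would be prudent.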
Also, interactivity is severely affected: the mouse pointer and keyboard lag (even in a text environment), and X has an overall CPU use of ~30% (not seen with a custom vanilla kernel, or with the stock -42 and -49.fc8 kernels unless "triggering" the issue at hand). The issue is not present with the custom vanilla 2.6.23.8 I built before. Since I can't troubleshoot this issue (kdump is not working either) and there's already a new kernel pushed out, I'll test with that, and if the problem persists in the new kernel, I'll try my best to get kdump working this time. I haven't the slightest clue as to what might be causing this; scheduler, maybe?

Ok, now this is getting to be more of an issue... Updated to the latest 2.6.23.8-63 kernel, and even though it took a while to present the problem (and there was no longer a problem with interactivity), the problem with CIFS shares still persists. At this point I'm not so sure it is CIFS itself, but rather something that it triggers inside the kernel. Tried again to get a kernel dump and wasn't able to. This time, when asking Amarok to perform a full library rebuild, it actually got very far (45%) before X locked up; then when I was finally able to get to a VT, I was able to log in, but no shell prompt would come up. This time, however, sysrq-b worked. So I went again and tried to get a kernel dump... Just like before, upon booting the kexec image, the process stops and doesn't finish loading. Three kernels in a row; I'd say it is one huge problem, which vanilla kernels don't exhibit. So what do I do about this? How do I generate a kdump so it is finally clear what might be going on?

> This time when asking Amarok to perform a full library rebuild, it actually
> got very far (45%) before X locked up, then when I was finally able to get to a
> VT, I was able to log in, but no shell prompt would come up.

With a problem like this, I'd suggest having a root shell already logged in on a VT and ready to go.
> So I went again, and tried to get a kernel dump this time... Just like
> before, upon booting the kexec image, the process stops and doesn't finish
> loading.

If you trigger a crash when the kernel is "healthy", does it also hang like this? Or does kdump only hang when you try to trigger a core with the box in this state? If it's hanging whenever you try to do a kdump, then that's almost certainly a separate issue, and I suggest opening a new BZ for it so that you can work with the people who specialize in kdump to get it resolved. If not, then that would suggest that the two problems are related (perhaps some hardware that's misbehaving?). If so, then focusing on getting kdump working might shed some light on resolving the general lockups.

In the meantime, since you're having trouble getting a core, the best thing to do might be to just get some sysrq info when the box is in this state. Do a sysrq-t, wait a few seconds and then do another, and then do sysrq-w. Have an already-logged-in root shell ready on a VT before getting the box into this state, and then do a:

# dmesg -s 131072 > /tmp/dmesg.out

...alternately, you can also set up a serial console to collect the info. If you're lucky, it may even get logged to /var/log/messages, but that's often spotty.

No word on this in the last couple of weeks; reducing severity to "high".

Nothing in a month here. In the meantime F8 has rebased to 2.6.23.9; can you see if you are still having the problem? If I don't see a response in one month, I'll have to close this as INSUFFICIENT_DATA.

Well, I can say that this problem seems to stem from the combination of 64 bits, the driver, and the network stack. Since originally posting this, I have acquired a new system (running fine thus far), and used the old system for a 32-bit installation of F8. The 32-bit installation has no issues whatsoever on the very same hardware configuration that originally showed the problem, so maybe this is simply a 64-bit issue in the tulip driver?

Ok, thanks.
I'm going to hand this off to Andy, since it sounds like a problem with the network driver is at the root of this. Once we have an idea of what the problem there is, there may be a problem with CIFS too that's tickled by the networking issue. Andy, let me know if I can be of help.

I don't think I have any tulip cards in a 64-bit machine, however (I may personally have an ancient tulip card around somewhere though).

This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.