During the installation of packages to the hard drive (after setting up an NFS install and selecting the default partition scheme and packages), the install process stops while installing the gdb package. If a simpler install is selected (without X, GNOME, or the development tools), the process stops while installing the gtk2 package. In either case, the install does not complete.

Hardware Environment: QS20
CPU type: Cell
Special hardware:
- Spidernet driver as the Ethernet module.

Steps to reproduce the error:
1. Netboot the Cell machine.
2. Configure an NFS install.
3. Configure the partition scheme.
4. Select the default packages.

I successfully installed Fedora 7 on a QS20 using the FTP method in text mode. Any other combination, such as NFS/vnc, NFS/text, or FTP/vnc, would fail.

This is a Spidernet issue. As I was trying to install SDK 2.1 from an NFS mount point, Spidernet error messages kept dumping on the minicom console. The SDK install never completes.

... buf_size=x00000980 next_descr_addr=x8f3f3f60 result_size=x00000600 valid_size=x000005ec data_status=x7000fa00 data_error=x02100000 which=250
got descriptor chain end interrupt, restarting DMAC A.
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
........
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
eth0: bad status, cmd_status=x48800203 buf_addr=x987c7080 buf_size=x00000980 next_descr_addr=x99f86140 result_size=x00000600 valid_size=x000005ec data_status=x70008d00 data_error=x02100000 which=9
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
printk: 95 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 14 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 12 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 21 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 18 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 14 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
........

The work-around:
- Force install the 2.6.20-be0711.0.20070319 kernel (back level) from Cell SDK 2.1, which has the spidernet driver built into the kernel (CONFIG_SPIDER_NET=y).
- Reboot.
- Mount the NFS server and install SDK 2.1 without any problem.

Linas Vepstas and James are looking to see exactly what version of the Spidernet driver was picked up by Fedora 7. It is possible a later version exists that will solve this issue. This assumes of course that it is really a Spidernet problem.

There are two spidernet patches that have been submitted and accepted for 2.6.21, that are not in 2.6.20, that seem likely to fix the problem. There are also 7 other patches that went into spidernet that would not affect this bug, but fix things for mii on celleb. I would advise testing with the spidernet taken from the current 2.6.21 release candidate tree. You'd need to grab spider_net.c, spider_net.h and sungem_phy.c (to get the bcm45xx/genmii changes).

While trying to install the latest driver, I observed the following: if I install with 'rpm -ivv /nfs_mount/my_package.rpm', it crashes with the error described in #7. If I copy the packages to a local directory on the blade and install with 'rpm -ivv /local_dir/my_package.rpm', the install succeeds.

The 2.6.20-be0711.0.20070319 Cell BE kernel works just fine; it is built with these spidernet patches:
spider-fix-eth_zlen.patch
spidernet-add-support-for-celleb.diff
spidernet-autoneg-medium.diff
spidernet-autoneg-support-for-celleb.diff
spidernet-avoid-double-free.diff
spidernet-ipfrag-nfs.diff
spidernet-linas-2-sync-with-mainline.diff
spidernet-linas-3-split-descr.diff
spidernet-linas-4-tx-race.diff
spidernet-linas-5-nother-race.diff
spidernet-linas-6-typo.diff
spidernet-linas.diff
spidernet-load-firmware-when-open.diff
spidernet-queue-drain.diff
spidernet-remove-txram-full-logging.diff
spidernet-sungem-update-4.diff
Created attachment 152510 [details] /var/log/messages
changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |RETTIG.com

------- Additional Comments From mlui.com 2007-04-12 21:44 EDT -------
Oliver, adding you to cc list.
----- Additional Comments From mlui.com 2007-04-12 21:48 EDT ------- Note that the two latest Fedora 7 install bugs, this one and bug #33892 are linked to NFS. These two bugs can be related. This bug is an install problem of QS20 via NFS where FTP and HTTP install worked, while #33892 is a NFS boot from a remote harddrive for QS21.
----- Additional Comments From mlui.com 2007-04-12 22:50 EDT ------- Abe, was the successful install on our PPC box(es) via NFS? Should be a good comparison point.
changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rolf_schmidt.com

------- Additional Comments From RETTIG.com 2007-04-13 05:03 EDT -------
This is a QS20 issue AFAIK. I am forwarding this to Rolf Schmidt to handle (LTC).
------- Additional Comments From abareval.com 2007-04-13 09:53 EDT -------
(In reply to comment #15)
> Abe, was the successful install on our PPC box(es) via NFS? Should be a good
> comparison point.

Yep, FC7 installed successfully via NFS on PPC boxes.
----- Additional Comments From mlui.com 2007-04-13 11:41 EDT ------- Did a little experiment on NFS as suggested by Jim. NFS-mounted the install source to cell8, a successful QS20 install with F7-test3 via FTP. Repeatedly copying the source to this QS20 to see if anything breaks in NFS. It's been 30 minutes and still going strong. Will continue to update the status of this experiment.
----- Additional Comments From linas.com (prefers email at linas.com) 2007-04-13 12:11 EDT -------
The primary problem, from what I can tell, is that the kernel currently in FC7, which is 2.6.21-rc1, does not include a number of the patches for spidernet. The newer kernel 2.6.21-rc2 does include these patches.

Recall that "rc" stands for "release candidate". Since there are now much newer rc's (kernel.org is up to rc6 as of yesterday), it can be understood that rc1 has a number of bugs that are fixed in later rc's. The spidernet issues can be considered to be one of these "bugs".

I can only hope that FC7, before going golden, will rebase on a stable kernel, such as 2.6.21, instead of one of its release candidates, right? Can someone verify that this will occur? If so, then there won't be anything to "fix" here, right?
----- Additional Comments From abareval.com 2007-04-13 12:19 EDT ------- ***Correction*** To access cell10 console, you can use "telnet con3.ltc.austin.ibm.com 7006", please let me know if you plan on logging on to this telnet session beforehand. Regards, Abraham Arevalo
----- Additional Comments From mlui.com 2007-04-13 12:35 EDT -------
Note that in the last experiment we were using the Cell kernel, which has the spidernet patches:

Linux cell8.ltc.austin.ibm.com 2.6.20-CBE #1 SMP Wed Mar 21 10:24:39 CET 2007 ppc64 ppc64 ppc64 GNU/Linux

Next we are trying the test with the test3 kernel.
----- Additional Comments From jklewis.com 2007-04-13 12:49 EDT ------- Monza, please use caution when mentioning "Spidernet patches". We have not created any new patches to deal with this problem. We are still trying to determine where the problem is (Spidernet or NFS). To be clear, some NFS tests were run but on the wrong kernel. We are now switching back to the stock (install) kernel and running them again. This should help us isolate where the problem really is.
----- Additional Comments From jklewis.com 2007-04-13 16:34 EDT ------- I wanted to see if NFS works with the stock kernel. If it doesn't then I believe that means we have found where the problem is. Some tests were run, but it's now not clear which kernel was used. Installs using FTP and HTTP work according to a few people. This uses Spidernet of course. Fedora Core 6 installs worked fine using NFS. It fails with F7. The Spidernet driver in F7 is at least as good as the one in FC6, and so NFS should work as it did before, UNLESS something else changed. I could be wrong, but I think it's a stretch to believe an NFS fix is now needed in Spidernet on F7 when it wasn't needed on FC6.
----- Additional Comments From mlui.com 2007-04-13 17:54 EDT ------- Unfortunately the second NFS test we ran was on an updated kernel, 2.6.20-1.3059.fc7. We will have to force install the older kernel to try it again. We are planning to do it this coming Sunday and will post the result.
----- Additional Comments From mlui.com 2007-04-14 11:49 EDT ------- Ran a few experiments, including copying the gdb and gtk rpms over to cell8 via a mount point 100 times. They were run on the original test3 kernel (2.6.20-1.3023.fc7) and also finished without problems. Note that the network traffic went through the sitewide network in all these and previous experiments.
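A repeated-copy test like the one described above could be scripted roughly as follows. This is only a sketch, not the script the test team used: the function names and the checksum step are assumptions added for illustration, and the source path would be an rpm on the NFS mount point. A hang of the kind reported in this bug would show up as the loop never returning, so in practice it would be run with an external timeout.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeat_copy(source: Path, dest_dir: Path, rounds: int = 100) -> int:
    """Copy `source` into `dest_dir` `rounds` times.

    Returns the number of copies whose checksum differs from the source.
    (Hypothetical helper; the checksum comparison is an addition, the
    original experiment only reports that the copies finished.)
    """
    expected = sha256_of(source)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    mismatches = 0
    for _ in range(rounds):
        shutil.copyfile(source, target)
        if sha256_of(target) != expected:
            mismatches += 1
    return mismatches
```

Note that the source checksum is read once up front, so later reads may be served from the client's page cache rather than the network; the copy itself is what exercises the NFS path.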
----- Additional Comments From mlui.com 2007-04-15 22:56 EDT ------- Jim, we've found that the problem occurs only on eth0, not on eth1, when using lewis7 as the file server. Could you please comment on how the setup of eth0 differs from that of eth1? Joseph will post more information. Thanks.
----- Additional Comments From mlui.com 2007-04-15 23:02 EDT ------- Ken and Thinh, the bug appears only when the install goes through eth0, a local network, which I believe is connected to a slower switch (100M). eth1, on the other hand, is connected to sitewide and probably to a faster (1G) switch. Still need Jim's confirmation though. How was your network set up when you saw this bug?
----- Additional Comments From joseferr.com 2007-04-15 23:21 EDT -------
Following the steps of Monza and Jim, I started doing tests using NFS mount points.

I used an NFS mount from a machine (lewis7) using eth1 and its main IP, and started copying files of different sizes (RPM packages from a Fedora 7 ISO) to the cell machine (cell8). No problems were found, and the copies ran for over an hour. I've used 2.6.20-1.3023-f7, 2.6.20-1.3059.f7 and 2.6.20-CBE kernels.

Then I tried using eth0 and the same machines, using lewis7's private IP. This was the same setup used during the install process that generated this bug. With this setup, the problem appeared with both f7 kernels. Just a few seconds after the file copies started, the cp stops and the following output is shown:

Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
got descriptor chain end interrupt, restarting DMAC A.
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
got descriptor chain end interrupt, restarting DMAC A.
eth0: bad status, cmd_status=x48800203 buf_addr=x9e814080 buf_size=x00000980 next_descr_addr=x9e7502c0 result_size=x00000600 valid_size=x000005ec data_status=x70001500 data_error=x02100000 which=21
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
printk: 2 messages suppressed.

With that in mind, I retried the install of F7-Test3 (6.92), but this time using eth1 as the network card. The machines used were cell9 and lewis7 (using 9.* IPs). The NFS/text install completed successfully, installing 1082 packages and over 1.5 GB of data.
To rule out the bug being specific to cell8, I tried the NFS install again on cell9 using eth0 (private network). The install process fails as reported earlier.

With cell8 installed and running F7 (an FTP install using eth1), we could see that both eths use the spidernet module and are set up using 1000Mbit connections. I'm adding as attachments the outputs of lsmod, dmesg and ethtool for each of the kernels being used.

Monza Lui has more info about these tests, using FTP instead of NFS mounts.

One other thing that was found is that if the network is restarted (service network restart), it shows these errors:

[root@cell8 ~]# service network restart
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1:
eth1: bad status, cmd_status=x40800009 buf_addr=xbeec7080 buf_size=x00000980 next_descr_addr=xbe9b6020 result_size=x00000080 valid_size=x00000042 data_status=x4000a500 data_error=x00004000 which=0
eth1: bad status, cmd_status=x40800109 buf_addr=xbe98e080 buf_size=x00000980 next_descr_addr=xbe9b6040 result_size=x00000080 valid_size=x00000042 data_status=x4000a600 data_error=x00004000 which=1
eth1: bad status, cmd_status=x40800109 buf_addr=xa0a7d080 buf_size=x00000980 next_descr_addr=xbe9b6060 result_size=x00000080 valid_size=x00000042 data_status=x4000a700 data_error=x00004000 which=2
eth1: bad status, cmd_status=x40800109 buf_addr=xa0b6c080 buf_size=x00000980 next_descr_addr=xbe9b6080 result_size=x00000080 valid_size=x00000042 data_status=x4000a800 data_error=x00004000 which=3
eth1: bad status, cmd_status=x40800109 buf_addr=xbe8d4080 buf_size=x00000980 next_descr_addr=xbe9b60a0 result_size=x00000080 valid_size=x00000042 data_status=x4000a900 data_error=x00004000 which=4
eth1: bad status, cmd_status=x40800109 buf_addr=xbe990080 buf_size=x00000980 next_descr_addr=xbe9b60c0 result_size=x00000080 valid_size=x00000042 data_status=x4000aa00 data_error=x00004000 which=5
eth1: bad status, cmd_status=x40800109 buf_addr=xbed04080 buf_size=x00000980 next_descr_addr=xbe9b60e0 result_size=x00000080 valid_size=x00000042 data_status=x4000ab00 data_error=x00004000 which=6
eth1: bad status, cmd_status=x40800009 buf_addr=xbea0c080 buf_size=x00000980 next_descr_addr=xbe9b6100 result_size=x00000080 valid_size=x00000042 data_status=x4000ac00 data_error=x00004000 which=7
eth1: bad status, cmd_status=x40800009 buf_addr=xbe944080 buf_size=x00000980 next_descr_addr=xbe9b6120 result_size=x00000080 valid_size=x00000042 data_status=x4000ad00 data_error=x00004000 which=8
[ OK ]
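When comparing many of these "bad status" dumps across kernels, it can help to decode them into numbers first. The following is a minimal sketch that assumes only the text layout seen in this report (an `ethN: bad status,` prefix, `name=x<hex>` fields, and a decimal `which=` index); the function name is hypothetical, and no meaning is assigned to the individual status bits.

```python
import re

# One record runs from "ethN: bad status," up to the next such prefix or end of text.
RECORD = re.compile(r"(eth\d+): bad status,(.*?)(?=eth\d+: bad status,|\Z)", re.S)
# Fields in the dump are printed as name=x<hex digits> (no "0" before the "x").
HEX_FIELD = re.compile(r"\b(\w+)=x([0-9a-fA-F]+)")
# The descriptor index is printed in decimal.
WHICH = re.compile(r"\bwhich=(\d+)")


def parse_bad_status(text: str) -> list:
    """Decode spidernet "bad status" dumps into a list of dicts of integers."""
    records = []
    for match in RECORD.finditer(text):
        record = {"iface": match.group(1)}
        body = match.group(2)
        for name, value in HEX_FIELD.findall(body):
            record[name] = int(value, 16)
        which = WHICH.search(body)
        if which:
            record["which"] = int(which.group(1))
        records.append(record)
    return records
```

Fed the eth1 dump above, this yields one dict per descriptor, which makes it straightforward to check, for example, whether `data_error` is the same in every record of a run.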
Created attachment 152660 [details] 2.6.20-CBE.info
----- Additional Comments From joseferr.com 2007-04-15 23:24 EDT ------- Info using CBE kernel
Created attachment 152661 [details] 2.6.20-1.3023.fc7.info
----- Additional Comments From joseferr.com 2007-04-15 23:26 EDT ------- Info using 3023.fc7 kernel
Created attachment 152662 [details] 2.6.20-1.3059.fc7.info
----- Additional Comments From joseferr.com 2007-04-15 23:27 EDT ------- Info using 3059.fc7 kernel
------- Additional Comments From brenohl.com 2007-04-16 09:15 EDT -------
(In reply to comment #31)
> Spider RX RAM full, incoming packets might be discarded!
> Spider RX RAM full, incoming packets might be discarded!
> got descriptor chain end interrupt, restarting DMAC A.

Searching for this error message in the kernel source code, I found the following:

source code file: drivers/net/spider_net.c
function: spider_net_handle_error_irq(struct spider_net_card *card, u32 status_reg)

This function checks all the status registers, and if it finds the SPIDER_NET_GRMFLLINT bit set, it reports the error:

    case SPIDER_NET_GRMFLLINT:
            if (netif_msg_intr(card) && net_ratelimit())
                    pr_debug("Spider RX RAM full, incoming packets "
                             "might be discarded!\n");
            spider_net_rx_irq_off(card);
            tasklet_schedule(&card->rxram_full_tl);
            show_error = 0;
            break;
----- Additional Comments From joseferr.com 2007-04-16 09:53 EDT -------
[root@cell8 VolGroup00]# ethtool -i eth0
driver: spidernet
version: 2.0 A
firmware-version: no information
bus-info: 0001:00:03.0

[root@cell8 VolGroup00]# ethtool -i eth1
driver: spidernet
version: 2.0 A
firmware-version: no information
bus-info: 0002:00:03.0

[root@cell8 VolGroup00]# lspci
0000:00:0a.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host Controller (rev 02)
0001:00:03.0 Ethernet controller: Toshiba America Unknown device 01b3 (rev 02)
0002:00:03.0 Ethernet controller: Toshiba America Unknown device 01b3 (rev 02)
----- Additional Comments From jklewis.com 2007-04-16 11:23 EDT ------- Excellent work by the test team. I'll try to add some helpful info here: eth0 is our gigabit interface. The server, lewis7, has an e1000 that is connected directly to the top switch in our BladeCenter. Obviously this is a whole lot faster than using the site interface (eth1) which is at 100. Note that this is the same configuration that worked installing FC6 via NFS. The error messages are indeed coming from the Spidernet driver. However, this does not necessarily mean Spidernet is at fault. In many cases something else has changed in the kernel that has slowed the timing down, and these have caused Spidernet to have trouble keeping up. When FTP and/or HTTP was used to perform a successful install, which interface was used? If it was eth1 please try again using eth0. At this point I still suspect something other than Spidernet to be causing this problem. At some point, in the install tree, we need to replace Spidernet with the latest version. I do not know the procedures for replacing files in a distro, or building RPMs. Anyone?
----- Additional Comments From linas.com (prefers email at linas.com) 2007-04-16 13:21 EDT ------- ARE THOSE MESSAGES FROM THE PATCHED DRIVER, OR THE UNPATCHED DRIVER? ONE OF THE SPIDERNET PATCHES MISSING FROM FC7 IS A PATCH THAT FIXES A BUG THAT TRIGGERS RX RAM FULL MESSAGES!
------- Additional Comments From joseferr.com 2007-04-16 13:37 EDT -------
(In reply to comment #37)
> Excellent work by the test team. I'll try to add some helpful info here:
>
> eth0 is our gigabit interface. The server, lewis7, has an e1000 that is
> connected directly to the top switch in our BladeCenter. Obviously this is a
> whole lot faster than using the site interface (eth1) which is at 100. Note
> that this is the same configuration that worked installing FC6 via NFS.

Very interesting, Jim. We were thinking exactly the opposite: that eth0 was at 100 and eth1 at 1000. With small or average size files (< 100 MB), eth0 fails and eth1 is fine. eth1 only fails when copying large files, like the F7 ISO.

Anyway, shouldn't eth1 on the cell machine be reported as using a 100Mbit, half-duplex network? This intrigues me.

[root@cell8 ~]# ethtool eth1
Settings for eth1:
        Supported ports: [ FIBRE ]
        Supported link modes: 1000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes: 1000baseT/Full
        Advertised auto-negotiation: No
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00007fff (32767)

> The error messages are indeed coming from the Spidernet driver. However, this
> does not necessarily mean Spidernet is at fault. In many cases something else
> has changed in the kernel that has slowed the timing down, and these have
> caused Spidernet to have trouble keeping up.
>
> When FTP and/or HTTP was used to perform a successful install, which interface
> was used? If it was eth1 please try again using eth0.

Our FTP install used eth1. But Monza did some tests with FTP and some big files, and they failed on both eths. I can try to install using eth0, but I'm almost certain it will fail too, as these tests did.

> At this point I still suspect something other than Spidernet to be causing
> this problem. At some point, in the install tree, we need to replace Spidernet
> with the latest version. I do not know the procedures for replacing files in a
> distro, or building RPMs. Anyone?
>

Breno and I are trying to create a netboot image containing a newer kernel and the patches, so we can retry the install process and see what happens.
------- Additional Comments From joseferr.com 2007-04-16 13:41 EDT -------
(In reply to comment #38)
> ARE THOSE MESSAGES FROM THE PATCHED DRIVER, OR THE UNPATCHED DRIVER?
>
> ONE OF THE SPIDERNET PATCHES MISSING FROM FC7 IS A PATCH THAT FIXES
> A BUG THAT TRIGGERS RX RAM FULL MESSAGES!

They are from the F7 kernels: 2.6.20-1.3023, used during the install (the one that comes with the ISO), and the latest update in the F7 repositories, 2.6.20-1.3059. The 2.6.20-CBE kernel (that comes with SDK 2.1) does not show these messages, and completes these NFS and FTP tests without problems on both eths.
----- Additional Comments From jklewis.com 2007-04-16 14:28 EDT ------- Joseph, at one time we did have it setup with eth0 the site and eth1 the private. Unfortunately, you can't netboot from eth1 and so netbooting on eth0 was very tricky as there were consistent collisions with the other servers in the lab. Switching the interfaces to how we have them now solved these problems, and also allowed for faster installs. Not sure why you brought up half-duplex. I would recommend not ever using it for various reasons. Under no circumstances should you ever mix duplex on the same network. All modern day interfaces and switches use full-duplex. If you run a test that copies a file larger than about 2 GB from somewhere to the Cell blade you are going to hit Bug # 29975. According to what I have seen this bug is NOT going to be fixed. On a Cell blade ethtool will always show a speed of 1000 Mb/s regardless of what speed is really there. I believe that's because it is showing you the speed of the switch in the BladeCenter. I have given Monza some pointers on how to create a new boot kernel.
----- Additional Comments From mlui.com 2007-04-16 14:38 EDT -------
To summarize what we have found so far:

Installing F7-test3 via eth1 works on both cell8 and cell9 (both QS20) via NFS, FTP, and HTTP. Installing the same via eth0, however, fails at least via NFS on both cells. Note that eth0 and eth1 are built-in spidernet cards on the QS20. The difference between eth0 and eth1 is that eth0 is connected to a local network with a faster interface (1 Gbit) while eth1 is connected to sitewide (100 Mbit).

We did some testing after the cell blades were installed. We copied a 4GB file from two remote servers, lewis7 which is available through both eths and ppc64flp1 which is available only through eth0. For NFS, we mounted the remote file system on the cell blade before copying. For FTP, we ftp'd to the remote server and did an mget. Here are the results so far:

Network_card  Protocol  File_server  Kernel    Result
eth0          NFS       lewis7       F7-test3  Hang
eth1          NFS       lewis7       F7-test3  OK
eth1          NFS       ppc64flp1    F7-test3  OK
eth0          FTP       lewis7       F7-test3  Fail very quickly
eth1          FTP       lewis7       F7-test3  Fail after a while
eth1          FTP       ppc64flp1    F7-test3  OK

We are currently trying to build a new set of boot images that includes some spidernet patches to see if they fix the install problem.
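To see which factor lines up with the failures, the six results above can be tallied per column. The sketch below simply re-encodes the rows reported in this comment; the helper name is hypothetical, and anything other than "OK" is counted as a failure.

```python
from collections import defaultdict

# (network card, protocol, file server, result), copied from the results above.
# The kernel column is omitted because every row used F7-test3.
RESULTS = [
    ("eth0", "NFS", "lewis7", "Hang"),
    ("eth1", "NFS", "lewis7", "OK"),
    ("eth1", "NFS", "ppc64flp1", "OK"),
    ("eth0", "FTP", "lewis7", "Fail very quickly"),
    ("eth1", "FTP", "lewis7", "Fail after a while"),
    ("eth1", "FTP", "ppc64flp1", "OK"),
]


def failures_by(column: int, rows=RESULTS) -> dict:
    """Return {value: (failed, total)} for the given column index (0, 1 or 2)."""
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row[column]][1] += 1
        if row[3] != "OK":
            counts[row[column]][0] += 1
    return {key: tuple(value) for key, value in counts.items()}
```

Grouping by card (`failures_by(0)`) shows both eth0 runs failing, while grouping by server (`failures_by(2)`) shows that every failure involved lewis7. With only six runs this is suggestive rather than conclusive.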
----- Additional Comments From joseferr.com 2007-04-16 15:09 EDT ------- > lewis7 which is available through > both eths and ppc64flp1 which is available only through eth0. correction: ppc64flp1 is available only through *eth1* (as the table shows).
----- Additional Comments From mlui.com 2007-04-16 15:37 EDT ------- Ken said that his cell blade was also connected to a 1 Gbit switch when the install bug happened. This does not contradict what we've found.
----- Additional Comments From mlui.com 2007-04-16 23:09 EDT ------- Linas, could you please provide us a spidernet.ko with the patches compiled in? We took the F7-test3 (2.6.20-1.3023.fc7) kernel, applied 13 patches we found from SDK2.1 Cell BE kernel, and recompiled. Got a new spidernet.ko but it didn't work. We are currently trying to create a new install kernel with patched spidernet.ko to see if the patches fix the problem. That's why we need a new spidernet.ko. Thanks.
----- Additional Comments From mlui.com 2007-04-16 23:11 EDT ------- Joseph, did we try installing via eth0 via FTP?
----- Additional Comments From mlui.com 2007-04-16 23:25 EDT ------- Joseph, never mind, we didn't as lewis7 did not have nfs server running.
------- Additional Comments From joseferr.com 2007-04-17 06:38 EDT -------
(In reply to comment #47)
> Joseph, never mind, we didn't as lewis7 did not have nfs server running.

FTP using eth0 doesn't work either. It stops right after providing the IP and directory, at the message:

+----------------------------+ Retrieving +----------------------------+
|                                                                      |
| Retrieving images/minstg2.img...                                     |
|                                                                      |
+----------------------------------------------------------------------+

At this point, it stops. Using eth1 works.
----- Additional Comments From brenohl.com 2007-04-17 16:46 EDT ------- I've tested the problem (NFS) with the 2.6.21-rc4 and it worked as expected. I'll mount the netboot right now in order to assure that the problem was fixed in this kernel version.
As Fedora is the distro of choice for any development on the Cell architecture, I cannot stress enough how important it is to have this fixed in FC7. Is there something we can do to help with this bug?
----- Additional Comments From mlui.com 2007-04-22 17:15 EDT ------- Linas, I remember you told me there was one NFS related patch that is not upstream yet. Is it still the case?
----- Additional Comments From mlui.com 2007-04-22 17:20 EDT ------- The Fedora team, will these spidernet patches be in test4?
there are no patches attached to this bugzilla, so it's unlikely. (test4 froze last week). We can take them on for GA however, but we'll need them diffed against a current (2.6.21-rc7 at time of writing) tree.
----- Additional Comments From brenohl.com 2007-04-23 15:39 EDT ------- Hi, The kernel 2.6.21-rc7 has all the spidernet patches except one patch, which is already in the vanilla kernel, and will be available on the rc8 kernel release. Thanks for the response.
Linus has indicated that he's reluctant to do an -rc8, so it's likely that we'll see a .21 any day now. If that remaining patch isn't in -rc7-git5, please attach it, and I'll get it into the Fedora builds.
----- Additional Comments From linas.com (prefers email at linas.com) 2007-04-23 16:15 EDT ------- The nfs patch is not in linux-2.6.20-rc7 but it is in linux-2.6.20-rc7-git6 commit 33bdeec80649f2eab36039f63d69c65378493cbe
------- Additional Comments From mlui.com 2007-04-23 16:40 EDT -------
(In reply to comment #69)
> It's been very difficult to work on this bug due to the lack of hardware
> available for us to debug. I understand that the team does not have many
> machines to perform the tests,

Please note that we are very limited in hardware and we are on a very tight schedule in testing. Continual freeze is this Thursday, and the team worked last weekend. Even so, we did provide the hardware more than a couple of times to debug this problem.

> however if this is a ship issue, or even a
> blocker as it has been originally opened,we'd need a machine available for more
> time. Since there is a workaround for this, I believe we can lower this bug's
> severity.

This bug should stay as a ship issue, as defined by bugzilla's severity definition.

As per our conversation online, I believe the best way to go about this is for your team to try out the boot image on a ppc64 box. If it boots up, then we try it out on cell. We just want to minimize the disruption to our testing. Hope you understand.
----- Additional Comments From mlui.com 2007-04-23 16:44 EDT ------- For the record, the boot image should be created for ppc64, not cell, as Fedora 7 has no install tree for cell, only for ppc. Sorry for the last post; it should have been internal only.
I'll be rebasing to linux-2.6.20-rc7-git6 soon, so we should pick up that NFS change for F7.
Created attachment 153314 [details] x
----- Additional Comments From linas.com (prefers email at linas.com) 2007-04-23 17:03 EDT ------- patch that fixes NFS hang; this patch is now in linux-2.6.20-rc7-git6
----- Additional Comments From brenohl.com 2007-04-24 16:40 EDT ------- Hi. I am getting two errors. If I boot the image on a cell machine, the anaconda loader can't bring the spidernet up, as described in comment #60. If I try to boot a default ppc64 kernel build on a ppc64 box, I get a 0x700 error, which is an illegal instruction. I'll attach the boot log here. Does anyone have any tips? Thanks
Created attachment 153385 [details] error.txt
----- Additional Comments From brenohl.com 2007-04-24 16:41 EDT ------- the boot log
----- Additional Comments From brenohl.com 2007-04-25 16:42 EDT -------
Hi people,

I booted the monza image and debugged it. The dmesg didn't show any attempt to load a spidernet module. Then I uncompressed the modules.cgz into a directory, loaded the spidernet.ko manually, and got the following error:

<4>ksign: module signed with unknown public key
<4>- signature keyid: 6b3cae08dc6d60bc ver=3
<3>Module signed with unknown public key

I think we have two problems: the first is the image directory layout, which seems wrong for loading the module, and the second is the module error, which prevents it from being loaded.

Linas, do you know what is going wrong with the public key? I'll attach the dmesg output.
Created attachment 153456 [details] log.txt
----- Additional Comments From brenohl.com 2007-04-25 16:43 EDT ------- the dmesg output. Look at the modprobe error too.
----- Additional Comments From brenohl.com 2007-04-25 16:45 EDT -------
The error message I got when loading it manually:

sh-3.2# insmod spidernet.ko
insmod: cannot insert `spidernet.ko': Required key not available (-1): Required key not available
----- Additional Comments From brenohl.com 2007-04-25 16:50 EDT ------- Looking deeper into the dmesg output, I could see that the loader tried to load the spidernet module and also failed with the same unknown key error. So I think the image layout is good.
----- Additional Comments From mlui.com 2007-04-26 22:35 EDT -------
Test4 arrived today; did an NFS install on cell14 via eth0. It got stuck at:

    68%
    997 of 1186 packages completed
    Installing libgnomeui - 2.18.1-2.fc7.ppc (3 MB)
    GNOME base GUI library

tcpdump -i on the server side (lewis7) shows the following. It looks like cell14 (192.168.2.15) is still trying to contact lewis7 (192.168.2.4).

21:15:12.810217 arp who-has 192.168.2.4 tell 192.168.2.15
21:15:12.810229 arp reply 192.168.2.4 is-at 00:0e:0c:a0:90:1c (oui Unknown)
21:15:13.810144 arp who-has 192.168.2.4 tell 192.168.2.15
21:15:13.810159 arp reply 192.168.2.4 is-at 00:0e:0c:a0:90:1c (oui Unknown)
21:15:14.147403 802.1d config 8000.00:16:ca:9a:e7:00.8013 root 8000.00:16:ca:9a:e7:00 pathcost 0 age 0 max 20 hello 2 fdelay 15

However, cell14 is not pingable from lewis7. Will try the same install via eth1, from lewis7 by NFS mount, and see what happens.
----- Additional Comments From mlui.com 2007-04-26 22:59 EDT ------- Breno, could you please check if all the patches are in test4? Thanks.
(In reply to comment #60) > ----- Additional Comments From mlui.com 2007-04-26 22:59 EDT ------- > Breno, could you please check if all the patches are in test4? Thanks. That would be unlikely. The test4 kernel froze long ago. Try tomorrow's rawhide, _if_ it has a kernel-2.6.20-1.3105.fc7 or later. That seems to be when linux-2.6.20-rc7-git6 was merged, according to http://cvs.fedora.redhat.com/viewcvs/rpms/kernel/devel/kernel-2.6.spec?view=log
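To check whether a given rawhide kernel meets the "kernel-2.6.20-1.3105.fc7 or later" condition, the package names can be compared field by field. This is a simplified sketch and not RPM's full version-comparison algorithm: it assumes names of the exact form `kernel-<version>-<release>.fcN`, and the helper names are invented for illustration.

```python
import re

# Minimum build named in the comment above.
MINIMUM = "kernel-2.6.20-1.3105.fc7"


def kernel_nvr_key(name):
    """Turn 'kernel-2.6.20-1.3105.fc7' into a tuple of ints for comparison.

    Only the numeric fields before the '.fcN' dist tag are used. This is an
    assumption that holds for the names quoted in this report, nothing more.
    """
    match = re.fullmatch(r"kernel-([\d.]+)-([\d.]+)\.fc\d+", name)
    if match is None:
        raise ValueError(f"unrecognized kernel package name: {name}")
    version, release = match.groups()
    return tuple(int(part) for part in (version + "." + release).split("."))


def is_new_enough(name, minimum=MINIMUM):
    """Return True if `name` is the minimum build or a later one."""
    return kernel_nvr_key(name) >= kernel_nvr_key(minimum)


# kernel-2.6.21-1.3116.fc7 is the rawhide build tested later in this report.
print(is_new_enough("kernel-2.6.21-1.3116.fc7"))
```

On an installed system, `rpm -q kernel` gives the name to feed in, with the architecture suffix stripped first.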
----- Additional Comments From mlui.com 2007-04-27 21:18 EDT -------
We tested with today's rawhide, which is rebased to 2.6.21 (kernel-2.6.21-1.3116.fc7), unlike test4, which is still based on 2.6.20. Breno verified that the spidernet patches are already in this rawhide code, but the install still hangs. We are trying to collect logs, etc.
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Summary |RH236298- Fedora 7 Test 3   |RH236298- Fedora 7 Test 3
           |install fails using 1Gbit   |NFS and FTP install fails
           |network on QS20 machine     |using 1Gbit network on QS20
           |                            |machine
------- Additional Comments From mlui.com 2007-04-28 11:08 EDT -------
As Ken and Thinh found with F7 test2/3, HTTP install via the 1Gb network works in the rebased F7 (rawhide from 0427 with kernel-2.6.21-1.3116.fc7), but neither FTP nor NFS install does. Verified on cell16.
Obviously there's not a lot we can do to help with this until our QS2[01] hardware arrives. Closing UPSTREAM for now.
------- Additional Comments From mlui.com 2007-04-28 11:30 EDT ------- (In reply to comment #109) > ------- Additional Comments From dwmw2 2007-04-28 11:22 EST ------ > Obviously there's not a lot we can do to help with this until our QS2[01] > hardware arrives. Closing UPSTREAM for now. Robbie, could you please comment on this? I believe RedHat has at least one QS20 right?
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Status  |NEEDINFO                    |OPEN
------- Additional Comments From mlui.com 2007-04-29 01:51 EDT -------
Dropped to xmon after an NFS install hang and got the following:
mon> t
[c0000000021cfcc0] c0000000002a7c08 .__handle_sysrq+0xe8/0x1c0
[c0000000021cfd70] c0000000002ab608 .hvc_poll+0x1a0/0x2d4
[c0000000021cfe50] c0000000002abd38 .khvcd+0x90/0x16c
[c0000000021cfee0] c000000000096b0c .kthread+0x124/0x174
[c0000000021cff90] c000000000029284 .kernel_thread+0x4c/0x68
2:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = c0000000021cfc50   R17 = 0000000000000000
R02 = c0000000006fb668   R18 = 0000000000000000
R03 = c0000000021cfae0   R19 = 4000000001410000
R04 = c00000000073b148   R20 = c0000000005a6dd8
R05 = c00000000073b178   R21 = 00000000019b7048
R06 = c0000000006c6438   R22 = 0000000000000000
R07 = c0000000006c6618   R23 = 0000000000000001
R08 = c0000000006c6608   R24 = 0000000000000000
R09 = c0000000005b68c8   R25 = 0000000000000000
R10 = c0000000006c6638   R26 = c00000003ef8e1c8
R11 = c00000000006e134   R27 = 0000000000000078
R12 = c00000000073b150   R28 = 0000000000000001
R13 = c0000000005de880   R29 = 0000000000000001
R14 = 0000000000000000   R30 = c000000000690cf8
R15 = 0000000000000000   R31 = c0000000021cfae0
pc  = c00000000006e288 .sysrq_handle_xmon+0x48/0x5c
lr  = c00000000006e288 .sysrq_handle_xmon+0x48/0x5c
msr = 9000000000001032   cr  = 28000088
ctr = c00000000006e0e8   xer = 0000000020000000   trap = 0
2:mon> e
cpu 0x2: Vector: 0 at [c0000000021cfae0]
    pc: c00000000006e288: .sysrq_handle_xmon+0x48/0x5c
    lr: c00000000006e288: .sysrq_handle_xmon+0x48/0x5c
    sp: c0000000021cfc50
   msr: 9000000000001032
  current = 0xc00000000feaa2c0
  paca    = 0xc0000000005de880
    pid   = 258, comm = khvcd
----- Additional Comments From mlui.com 2007-04-29 02:51 EDT -------
To summarize our work today, here are a few things we tried.
1) Installed F7 on cell8 twice. The first time we installed via eth1 so that we could get a successful install, using only half of the disk space. The second time we installed via eth0 using the rest of the disk, and it hung as expected. We booted back into the first F7 install and looked at the filesystem from the second, partial F7 install. Got the dmesg and /var/log/messages as attached, but did not see anything interesting.
2) Did a yum update to install rawhide rpms from an F7 test4 partition. Rebooted to the new kernel (2.6.21-1.3116.fc7). Did some copy tests using an NFS mount via eth0. Got exactly the same errors we got from test3 (comment #31). However, when we ran the same test earlier (comment #40) on the Cell BE kernel (kernel-2.6.20-CBE), we did not see this bug. Note that the set of spidernet patches in the Cell BE kernel should all have been included in this rawhide kernel.
3) Got the xmon output in the last comment at the NFS hang. Note that when installing via FTP, it hangs much earlier, before stage2. That's why we did not get xmon output from the FTP hang.
----- Additional Comments From mlui.com 2007-04-29 03:11 EDT -------
Linas, could you please look at the xmon output and, if you can, check the rawhide source code to see whether all the patches from the Cell BE kernel are in? Thanks :)
----- Additional Comments From mlui.com 2007-04-29 03:21 EDT -------
Tried a new experiment: downloading a file via HTTP over eth0 while running the rawhide kernel (2.6.21-1.3116.fc7). We saw the same "RX RAM full, incoming packets might be discarded!" errors as in the NFS and FTP file-copy tests. Note that we did NOT have a problem installing via HTTP, yet we do have a problem downloading a file via HTTP on the same kernel.
eth0: bad status, cmd_status=x48800203 buf_addr=xaaf0f080 buf_size=x00000980 next_descr_addr=xb5eeb920 result_size=x00000600 valid_size=x000005ec data_status=x7000a400 data_error=x02100000 which=200
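The "bad status" line is easier to compare across runs once the fields are decoded. Below is a minimal sketch that turns such a line into a dictionary of integers. It assumes only what the log itself shows: fields are `name=value` pairs, and a leading `x` marks a hexadecimal value. It makes no claim about what individual status bits mean; `parse_descriptor` is an illustrative name, not part of any driver tooling.

```python
import re

# Line copied from the comment above.
LINE = ("eth0: bad status, cmd_status=x48800203 buf_addr=xaaf0f080 "
        "buf_size=x00000980 next_descr_addr=xb5eeb920 result_size=x00000600 "
        "valid_size=x000005ec data_status=x7000a400 data_error=x02100000 "
        "which=200")

FIELD = re.compile(r"(\w+)=(?:x([0-9a-fA-F]+)|(\d+))")


def parse_descriptor(line):
    """Return {field name: integer value} for one 'bad status' log line."""
    fields = {}
    for name, hex_value, dec_value in FIELD.findall(line):
        # Values written as x.... are hexadecimal; bare digits are decimal.
        fields[name] = int(hex_value, 16) if hex_value else int(dec_value)
    return fields


descriptor = parse_descriptor(LINE)
for name in ("buf_size", "result_size", "valid_size", "which"):
    print(f"{name:12s} = {descriptor[name]}")
```

For the line above this gives a buf_size of 2432, a result_size of 1536, and a valid_size of 1516, with descriptor index 200. The size fields and data_error value are identical in the other "bad status" lines in this report, which differ only in the addresses, status words, and the index.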
----- Additional Comments From mlui.com 2007-04-29 13:12 EDT ------- Reproduced the hang with echo 8 > /proc/sysrq-trigger [C0000000012FBC80] [C0000000000111C4] .__switch_to+0x12c/0x160 [C0000000012FBD10] [C000000000401330] .schedule+0x8e0/0xa3c [C0000000012FBE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/ 0x2ac [jfs] [C0000000012FBEE0] [C000000000096B0C] .kthread+0x124/0x174 [C0000000012FBF90] [C000000000029284] .kernel_thread+0x4c/0x68 jfsCommit S 0000000000000000 14768 665 83 (L-TLB) Call Trace: [C0000000012FFC80] [C0000000000111C4] .__switch_to+0x12c/0x160 [C0000000012FFD10] [C000000000401330] .schedule+0x8e0/0xa3c [C0000000012FFE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/ 0x2ac [jfs] [C0000000012FFEE0] [C000000000096B0C] .kthread+0x124/0x174 [C0000000012FFF90] [C000000000029284] .kernel_thread+0x4c/0x68 jfsCommit S 0000000000000000 14768 666 83 (L-TLB) Call Trace: [C000000001303C80] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001303D10] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001303E10] [D0000000005FFAB4] .jfs_lazycommit+0x254/ 0x2ac [jfs] [C000000001303EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001303F90] [C000000000029284] .kernel_thread+0x4c/0x68 jfsCommit S 0000000000000000 14768 667 83 (L-TLB) Call Trace: [C00000000130BC80] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000000130BD10] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000130BE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/ 0x2ac [jfs] [C00000000130BEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000130BF90] [C000000000029284] .kernel_thread+0x4c/0x68 jfsSync S 0000000000000000 14800 668 83 (L-TLB) Call Trace: [C00000000130FAD0] [C00000000130FB80] 0xc00000000130fb80 (unreliable) [C00000000130FCA0] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000000130FD30] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000130FE30] [D0000000005FF61C] .jfs_sync+0x1c8/0x20c [jfs] [C00000000130FEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000130FF90] [C000000000029284] 
.kernel_thread+0x4c/0x68 xfslogd/0 S 0000000000000000 14752 676 83 (L-TLB) Call Trace: [C000000001313C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001313D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001313E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001313EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001313F90] [C000000000029284] .kernel_thread+0x4c/0x68 xfslogd/1 S 0000000000000000 14752 677 83 (L-TLB) Call Trace: [C00000000131BC70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000000131BD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000131BE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000131BEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000131BF90] [C000000000029284] .kernel_thread+0x4c/0x68 xfslogd/2 S 0000000000000000 14752 678 83 (L-TLB) Call Trace: [C00000000131FC70] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000000131FD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000131FE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000131FEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000131FF90] [C000000000029284] .kernel_thread+0x4c/0x68 xfslogd/3 S 0000000000000000 14752 679 83 (L-TLB) Call Trace: [C000000001323C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001323D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001323E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001323EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001323F90] [C000000000029284] .kernel_thread+0x4c/0x68 xfsdatad/0 S 0000000000000000 14752 680 83 (L-TLB) Call Trace: [C00000000132BC70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000000132BD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000132BE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000132BEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000132BF90] [C000000000029284] .kernel_thread+0x4c/0x68 xfsdatad/1 S 0000000000000000 14752 681 83 (L-TLB) Call Trace: [C00000000132FC70] [C0000000000111C4] 
.__switch_to+0x12c/0x160 [C00000000132FD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000132FE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000132FEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000132FF90] [C000000000029284] .kernel_thread+0x4c/0x68 xfsdatad/2 S 0000000000000000 14016 682 83 (L-TLB) Call Trace: [C000000001333AA0] [00000000FFFF07E6] 0xffff07e6 (unreliable) [C000000001333C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001333D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001333E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001333EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001333F90] [C000000000029284] .kernel_thread+0x4c/0x68 xfsdatad/3 S 0000000000000000 14752 683 83 (L-TLB) Call Trace: [C000000001337C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001337D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001337E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001337EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001337F90] [C000000000029284] .kernel_thread+0x4c/0x68 kmirrord S 0000000000000000 14752 706 83 (L-TLB) Call Trace: [C000000001357C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001357D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001357E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001357EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001357F90] [C000000000029284] .kernel_thread+0x4c/0x68 ksnapd S 0000000000000000 14752 714 83 (L-TLB) Call Trace: [C00000000134BC70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000000134BD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000134BE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000134BEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000134BF90] [C000000000029284] .kernel_thread+0x4c/0x68 kmpathd/0 S 0000000000000000 14752 722 83 (L-TLB) Call Trace: [C00000000134FC70] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000000134FD00] [C000000000401330] 
.schedule+0x8e0/0xa3c [C00000000134FE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000134FEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000134FF90] [C000000000029284] .kernel_thread+0x4c/0x68 kmpathd/1 S 0000000000000000 14752 723 83 (L-TLB) Call Trace: [C00000000135BC70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000000135BD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000135BE00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C00000000135BEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000135BF90] [C000000000029284] .kernel_thread+0x4c/0x68 kmpathd/2 S 0000000000000000 14752 724 83 (L-TLB) Call Trace: [C000000001363C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001363D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001363E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001363EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001363F90] [C000000000029284] .kernel_thread+0x4c/0x68 kmpathd/3 S 0000000000000000 14752 725 83 (L-TLB) Call Trace: [C000000001367C70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000001367D00] [C000000000401330] .schedule+0x8e0/0xa3c [C000000001367E00] [C000000000091DF0] .worker_thread+0x128/ 0x1bc [C000000001367EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000001367F90] [C000000000029284] .kernel_thread+0x4c/0x68 anaconda D 000000000fc03a24 6624 742 318 (NOTLB) Call Trace: [C00000000207B2B0] [C00000000012D458] .unlock_buffer+0x30/ 0x44 (unreliable) [C00000000207B480] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000000207B510] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000207B610] [C000000000401E78] .io_schedule+0x58/0x9c [C00000000207B6A0] [C0000000000C6B6C] .sync_page+0x7c/0x98 [C00000000207B720] [C000000000402094] .__wait_on_bit_lock+0x8c/ 0x110 [C00000000207B7C0] [C0000000000C6AA8] .__lock_page+0x70/0x90 [C00000000207B890] [C0000000000C7754] .do_generic_mapping_read+0x234/0x4e0 [C00000000207B9E0] [C0000000000C9EAC] .generic_file_aio_read+0x170/0x214 
[C00000000207BAB0] [D00000000018B894] .nfs_file_read+0x11c/ 0x14c [nfs] [C00000000207BB60] [C0000000000FFF18] .do_sync_read+0xc4/ 0x124 [C00000000207BCF0] [C000000000100978] .vfs_read+0x120/0x208 [C00000000207BD90] [C000000000101164] .sys_read+0x4c/0x8c [C00000000207BE30] [C0000000000087C8] syscall_exit+0x0/0x40 anaconda S 000000000fc099bc 11088 745 742 (NOTLB) Call Trace: [C00000003A2BF4E0] [C00000001E625950] 0xc00000001e625950 (unreliable) [C00000003A2BF6B0] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000003A2BF740] [C000000000401330] .schedule+0x8e0/0xa3c [C00000003A2BF840] [C000000000401F3C] .schedule_timeout+0x3c/ 0xe8 [C00000003A2BF930] [C000000000112098] .do_sys_poll+0x2c8/ 0x434 [C00000003A2BFD50] [C00000000013E7D8] .compat_sys_ppoll+0x110/0x260 [C00000003A2BFE30] [C0000000000087C8] syscall_exit+0x0/0x40 kauditd S 0000000000000000 13424 746 83 (L-TLB) Call Trace: [C00000000262FAC0] [C0000000006525A0] ioctl_start+0x3138/ 0x5730 (unreliable) [C00000000262FC90] [C0000000000111C4] .__switch_to+0x12c/0x160 [C00000000262FD20] [C000000000401330] .schedule+0x8e0/0xa3c [C00000000262FE20] [C0000000000B3A9C] .kauditd_thread+0x160/ 0x1b0 [C00000000262FEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000000262FF90] [C000000000029284] .kernel_thread+0x4c/0x68 kjournald S 0000000000000000 9952 955 83 (L-TLB) Call Trace: [C000000016623AC0] [C000000000131784] .bio_alloc_bioset+0xcc/ 0x174 (unreliable) [C000000016623C90] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000016623D20] [C000000000401330] .schedule+0x8e0/0xa3c [C000000016623E20] [D000000000131594] .kjournald+0x1c8/0x26c [jbd] [C000000016623EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000016623F90] [C000000000029284] .kernel_thread+0x4c/0x68 kjournald S 0000000000000000 13824 957 83 (L-TLB) Call Trace: [C000000008457AC0] [0000000028000088] 0x28000088 (unreliable) [C000000008457C90] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C000000008457D20] [C000000000401330] .schedule+0x8e0/0xa3c [C000000008457E20] 
[D000000000131594] .kjournald+0x1c8/0x26c [jbd] [C000000008457EE0] [C000000000096B0C] .kthread+0x124/0x174 [C000000008457F90] [C000000000029284] .kernel_thread+0x4c/0x68 syslogd S 000000000fc0cedc 9824 965 742 (NOTLB) Call Trace: [C000000016627440] [000000000000633E] 0x633e (unreliable) [C000000016627610] [C0000000000111C4] .__switch_to+0x12c/0x160 [C0000000166276A0] [C000000000401330] .schedule+0x8e0/0xa3c [C0000000166277A0] [C000000000401F3C] .schedule_timeout+0x3c/ 0xe8 [C000000016627890] [C000000000112678] .do_select+0x410/0x4b0 [C000000016627C10] [C00000000013BFF0] .compat_core_sys_select+0x180/0x244 [C000000016627D00] [C00000000013E490] .compat_sys_select+0xd0/0x190 [C000000016627DC0] [C000000000017654] .ppc32_select+0x14/0x28 [C000000016627E30] [C0000000000087C8] syscall_exit+0x0/0x40 pdflush S 0000000000000000 10704 969 83 (L-TLB) Call Trace: [C0000000392E3AA0] [D00000000058CEAC] .dm_get_table+0x48/ 0x68 [dm_mod] (unrelia) [C0000000392E3C70] [C0000000000111C4] .__switch_to+0x12c/0x160 [C0000000392E3D00] [C000000000401330] .schedule+0x8e0/0xa3c [C0000000392E3E00] [C0000000000D05C4] .pdflush+0xfc/0x26c [C0000000392E3EE0] [C000000000096B0C] .kthread+0x124/0x174 [C0000000392E3F90] [C000000000029284] .kernel_thread+0x4c/0x68 pdflush S 0000000000000000 9504 989 83 (L-TLB) Call Trace: [C00000002C0AFAA0] [C00000002C0AFB50] 0xc00000002c0afb50 (unreliable) [C00000002C0AFC70] [C0000000000111C4] .__switch_to+0x12c/ 0x160 [C00000002C0AFD00] [C000000000401330] .schedule+0x8e0/0xa3c [C00000002C0AFE00] [C0000000000D05C4] .pdflush+0xfc/0x26c [C00000002C0AFEE0] [C000000000096B0C] .kthread+0x124/0x174 [C00000002C0AFF90] [C000000000029284] .kernel_thread+0x4c/0x68 printk: 10 messages suppressed. Spider RX RAM full, incoming packets might be discarded! printk: 7 messages suppressed. Spider RX RAM full, incoming packets might be discarded! printk: 10 messages suppressed.
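The task dump above is long, and the interesting entries are the tasks in state D (uninterruptible sleep). The sketch below pulls those out of a flattened dump together with the functions on their stacks. The sample text is a shortened excerpt of the dump above; the column interpretation (name, state, PC, free stack, pid, parent pid) is an assumption about this kernel's output format, and the regular expressions tolerate the stray space after the "/" that the paste introduced.

```python
import re

# Shortened excerpt of the dump above (three task headers, one frame each).
DUMP = (
    "jfsSync S 0000000000000000 14800 668 83 (L-TLB) Call Trace: "
    "[C00000000130FE30] [D0000000005FF61C] .jfs_sync+0x1c8/0x20c [jfs] "
    "anaconda D 000000000fc03a24 6624 742 318 (NOTLB) Call Trace: "
    "[C00000000207BAB0] [D00000000018B894] .nfs_file_read+0x11c/ 0x14c [nfs] "
    "anaconda S 000000000fc099bc 11088 745 742 (NOTLB) Call Trace: "
    "[C00000003A2BF930] [C000000000112098] .do_sys_poll+0x2c8/ 0x434 "
)

# Task header: name, state letter, PC, free stack, pid, parent pid, flag.
TASK = re.compile(
    r"(\S+)\s+([A-Z])\s+([0-9a-fA-F]{16})\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\((L-TLB|NOTLB)\)"
)
# Stack frame symbol such as '.nfs_file_read+0x11c/0x14c'.
FRAME = re.compile(r"\.(\w+)\+0x[0-9a-f]+/\s*0x[0-9a-f]+")


def blocked_tasks(dump):
    """Return (name, pid, [functions]) for every task in state D."""
    headers = list(TASK.finditer(dump))
    blocked = []
    for index, header in enumerate(headers):
        if header.group(2) != "D":
            continue
        # The task's trace runs until the next task header (or end of dump).
        end = headers[index + 1].start() if index + 1 < len(headers) else len(dump)
        functions = FRAME.findall(dump[header.end():end])
        blocked.append((header.group(1), int(header.group(5)), functions))
    return blocked


for name, pid, functions in blocked_tasks(DUMP):
    print(f"{name} (pid {pid}) blocked in: {', '.join(functions)}")
```

In the full dump the only D-state task is anaconda (pid 742), whose stack passes through nfs_file_read, which fits an installer waiting on NFS data that is not arriving.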
Created attachment 153747 33869-xmon.txt
----- Additional Comments From mlui.com 2007-04-29 16:31 EDT -------
Xmon output from another hang; it seems to have more information than the last one.
----- Additional Comments From mlui.com 2007-04-29 22:54 EDT -------
Did another experiment. Created a 90MB file on lewis7 and made it available for download via HTTP to an installed system. Here are the results:

Installed Cell   eth0   eth1
test4            fail   pass
Rawhide0427      fail   pass
CellBE           pass   pass

The failed tests all stopped copying the file and printed "Spider RX RAM full" errors. Note that rawhide supposedly includes all the spidernet patches that are also in the CellBE kernel.
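The result matrix above is small, but writing it down as data makes the pattern explicit and easy to extend as more kernels are tested. This is only an illustrative sketch: the values are copied from the results in this comment, and the helper names are invented.

```python
# Results copied from the comment above: (installed system, interface) -> outcome.
RESULTS = {
    ("test4", "eth0"): "fail",
    ("test4", "eth1"): "pass",
    ("Rawhide0427", "eth0"): "fail",
    ("Rawhide0427", "eth1"): "pass",
    ("CellBE", "eth0"): "pass",
    ("CellBE", "eth1"): "pass",
}


def failing_interfaces(results):
    """Interfaces that failed on at least one installed system."""
    return sorted({iface for (_, iface), outcome in results.items()
                   if outcome == "fail"})


def clean_systems(results):
    """Installed systems that passed on every interface tested."""
    systems = {system for system, _ in results}
    failed = {system for (system, _), outcome in results.items()
              if outcome == "fail"}
    return sorted(systems - failed)


print("interfaces with failures:", failing_interfaces(RESULTS))
print("systems with no failures:", clean_systems(RESULTS))
```

With the data above, failures are confined to eth0 and the only system that passes everywhere is CellBE, so the difference between the CellBE kernel and the Fedora kernels on eth0 is where to look.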
"Obviously there's not a lot we can do to help with this until our QS2[01] hardware arrives. Closing UPSTREAM for now." What does this mean? There is a QS20 at the Westford, MA site, with a QS21 soon to ship...are you talking about some other Red Hat site?
Ah, I apologise then; I must seek out details of access to it. I thought we only had one older DD2 machine in Westford, which wasn't expected to work with current kernels. Janice? (Sorry, this might not be the first time you're telling me this).
How much use do these machines get in Westford, btw? Is that where the QS21 is going too? The PPC maintainers for RHEL and Fedora are in Cambridge -- having one here might be useful if it's possible.
Hmmm...you may be correct about the DD2 level. I will look at sending a GA blade along with the QS21 (planned to ship late this week/early next). I'm not sure how much use they get in Westford, but we also need RHEL 5.1 support, which is the reason why I believe they are shipped there.
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Severity|ship issue                  |high
------- Additional Comments From robbiew.com 2007-04-30 14:01 EDT -------
Based on IBM bugzilla Comment #126, I'm lowering this bug to "high". By LTC standards, a "high" bug is one that would otherwise be "block" or "ship-issue", but which has a valid workaround that allows testing to continue until the bug is resolved. The install works on eth1, so there's a suitable workaround available.
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Severity|high                        |ship issue
------- Additional Comments From mlui.com 2007-04-30 14:24 EDT -------
Great, Robbie, we are using the severity definitions ;) In that case my opinion is that this is a ship issue. Ship issue is the second-highest severity level: bugs that are preventing progress on some tests, or that only occur on a single architecture or platform, but that are not preventing those systems from being used for other testing; bugs that still must be fixed before an IBM product or distro version is released.
----- Additional Comments From linas.com (prefers email at linas.com) 2007-04-30 17:39 EDT -------
Earlier today, Lucas reproduced the install hang. He dropped into the xmon debugger with a ctrl-o. So:
1) If ctrl-o works, then the kernel is not really hung: it is alive enough to listen to the ctrl-o coming in on the hvc console.
2) Poking around, it quickly became clear that all four CPUs were idle (will attach log).
Exiting from xmon results in the system continuing to print the
Spider RX RAM full, incoming packets might be discarded!
messages. Getting back into xmon a while later still shows all CPUs idle. Conclusion: the kernel is not hung.
Since the spidernet hardware is generating the RX RAM full interrupts, one must conclude that either:
a) the kernel is not calling the spidernet device driver fast enough to drain the RX ring, or
b) the spidernet device driver is failing to drain the RX ring.
I'm preparing a patch right now to distinguish between these two. It's a simple patch: just a printk in the spider_net_poll() routine, to verify that it's being called, how often it's called, and how many packets it's handling on each call.
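Once the instrumented driver reports how often spider_net_poll() runs and how many packets each call handles, telling (a) from (b) is a simple decision. The sketch below shows that reasoning only. The input shape (a list of per-call packet counts for one observation window, plus a count of "RX RAM full" messages in the same window) is a hypothetical format, not what the actual patch prints, and the function name is invented.

```python
def classify_rx_stall(packets_per_poll, rx_ram_full_events):
    """Decide which hypothesis a window of observations supports.

    packets_per_poll: packets handled by each spider_net_poll() call seen in
        the window (hypothetical data taken from the debug printk output).
    rx_ram_full_events: number of "RX RAM full" messages in the same window.
    """
    if rx_ram_full_events == 0:
        return "no stall observed in this window"
    if not packets_per_poll:
        # Interrupts are firing but the poll routine never ran.
        return "a: the kernel is not calling the driver's poll routine"
    if sum(packets_per_poll) == 0:
        # The poll routine runs but never hands a packet up the stack.
        return "b: the driver is called but fails to drain the RX ring"
    return "inconclusive: packets are being drained, compare rates"


# Hypothetical windows, for illustration only.
print(classify_rx_stall([], 12))
print(classify_rx_stall([0, 0, 0, 0], 12))
print(classify_rx_stall([16, 16, 3], 12))
```

The inconclusive case is where the call frequency matters: if packets are drained but the ring still overflows, the question becomes whether poll runs often enough for the incoming rate.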
----- Additional Comments From linas.com (prefers email at linas.com) 2007-05-01 16:59 EDT ------- It seems that bug 34298 offers a much simpler way of reproducing this same network problem, without having to fiddle with doing the install. I can reproduce the problem, there, and so will be focusing attention there.
----- Additional Comments From linas.com (prefers email at linas.com) 2007-05-01 20:04 EDT -------
Patches that should fix this problem have been posted to bug 34298. These are not quite the final patches (they are more verbose than they need to be), but they should resolve the problem. Please test. Assuming this tests well, I'll post final patches on Friday (4 May) or early the next week.
(In reply to comment #82)
> Please test. Assuming this tests well, I'll post final patches
> on Friday (4 May) or early the next week.
DaveJ? How does that timeframe suit you for Fedora 7?
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Status  |OPEN                        |ASSIGNED
------- Additional Comments From lucasgf.com 2007-05-02 08:50 EDT -------
Breno and I will test this today and post the results ASAP.
------- Additional Comments From joseferr.com 2007-05-03 12:59 EDT -------
(In reply to comment #142)
> Breno and I will test this today and post the results asap
We tried the new spidernet module with the patches, but it failed to copy files over NFS via eth0. We got these messages from dmesg:
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
eth0: bad status, cmd_status=x40800009 buf_addr=x9f4a3080 buf_size=x00000980 next_descr_addr=x9f3f4f40 result_size=x00000600 valid_size=x000005ec data_status=x70007b00 data_error=x02100000 which=121
And after a while:
nfs: server 192.168.2.4 not responding, still trying
nfs: server 192.168.2.4 not responding, still trying
----- Additional Comments From mlui.com 2007-05-17 23:49 EDT ------- Breno, since the patches from #34298 have been tested, could you please make sure they are upstream? Thanks.
----- Additional Comments From joseferr.com 2007-06-04 14:20 EDT -------
Fedora 7 GA'd last Thursday. I tested the distro on our Cell QS20 machines and found no problems. The system installed over both eth0 and eth1, using NFS, FTP, and HTTP. I believe we can finally close this bug. Thank you very much to everyone involved in this issue.
changed:
   What    |Removed                     |Added
----------------------------------------------------------------------------
   Status  |ACCEPTED                    |CLOSED
------- Additional Comments From brenohl.com 2007-06-04 14:25 EDT -------
Thanks Joseph. Closing per previous comment.
Based on the date this bug was created, it appears to have been reported against rawhide during the development of a Fedora release that is no longer maintained. In order to refocus our efforts as a project we are flagging all of the open bugs for releases which are no longer maintained. If this bug remains in NEEDINFO thirty (30) days from now, we will automatically close it. If you can reproduce this bug in a maintained Fedora version (7, 8, or rawhide), please change this bug to the respective version and change the status to ASSIGNED. (If you're unable to change the bug's version or status, add a comment to the bug and someone will change it for you.) Thanks for your help, and we apologize again that we haven't handled these issues to this point. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp We will be following the process here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this doesn't happen again.
This bug has been in NEEDINFO for more than 30 days since feedback was first requested. As a result we are closing it. If you can reproduce this bug in the future against a maintained Fedora version please feel free to reopen it against that version. The process we're following is outlined here: http://fedoraproject.org/wiki/BugZappers/F9CleanUp