236298 – Fedora 7 Test 3 Install on QS20 - Install process does not complete

Summary: Fedora 7 Test 3 Install on QS20 - Install process does not complete

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	ppc64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2007-04-12 22:11 UTC by IBM Bug Proxy
Modified:	2008-05-07 01:28 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-07 01:28:38 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
/var/log/messages (245.07 KB, text/plain) 2007-04-12 22:11 UTC, IBM Bug Proxy
2.6.20-CBE.info (7.81 KB, text/plain) 2007-04-16 03:30 UTC, IBM Bug Proxy
2.6.20-1.3023.fc7.info (15.83 KB, text/plain) 2007-04-16 03:31 UTC, IBM Bug Proxy
2.6.20-1.3059.fc7.info (13.28 KB, text/plain) 2007-04-16 03:32 UTC, IBM Bug Proxy
x (6.72 KB, text/plain) 2007-04-23 21:07 UTC, IBM Bug Proxy
error.txt (2.78 KB, text/plain) 2007-04-24 20:46 UTC, IBM Bug Proxy
log.txt (16.08 KB, text/plain) 2007-04-25 20:47 UTC, IBM Bug Proxy
33869-xmon.txt (22.55 KB, text/plain) 2007-04-29 20:35 UTC, IBM Bug Proxy

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	33869	0	None	None	None	Never

Description IBM Bug Proxy 2007-04-12 22:11:41 UTC

While installing packages to the hard drive (after setting up an NFS
install and selecting the default partition scheme and packages), the install
process stops during installation of the gdb package.
If a simpler install is selected (without X, GNOME, or development tools), the
process stops during installation of the gtk2 package.
In either case, the install process does not complete.

Hardware Environment: QS20
Cpu type:  Cell

Special hardware: 
- Spidernet driver as the ethernet module.

Here are the steps to reproduce the error:

Netboot the Cell machine.
Config NFS install.
Config partition scheme.
Select default packages.


I successfully installed Fedora 7 on a QS20 using the FTP method in text mode.
Every other combination (NFS/vnc, NFS/text, FTP/vnc) failed.

This is a Spidernet issue.
While I was trying to install SDK 2.1 from an NFS mount point, Spidernet error
messages kept dumping on the minicom console. The SDK install never completed.
...
buf_size=x00000980
next_descr_addr=x8f3f3f60
result_size=x00000600
valid_size=x000005ec
data_status=x7000fa00
data_error=x02100000
which=250
got descriptor chain end interrupt, restarting DMAC A.
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
........
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
eth0: bad status, cmd_status=x48800203
buf_addr=x987c7080
buf_size=x00000980
next_descr_addr=x99f86140
result_size=x00000600
valid_size=x000005ec
data_status=x70008d00
data_error=x02100000
which=9
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
nfs: server qdcdepot OK
printk: 95 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 14 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 12 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 21 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 15 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 18 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 13 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 14 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 11 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
........
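
For anyone sifting through long console captures like the one above, here is a minimal Python sketch (not part of the original report) that groups each "bad status" descriptor dump into a record. The field names are taken from the log as quoted here; the function names `parse_value` and `parse_bad_status` are made up for illustration, and the meaning of the individual status bits is not decoded.

```python
import re

# Header line as printed in the console output quoted in this bug,
# e.g. "eth0: bad status, cmd_status=x48800203". search() is used so that a
# prefix such as "Bringing up interface eth1:  " does not hide the header.
HEADER_RE = re.compile(r"(eth\d+): bad status, cmd_status=x([0-9a-fA-F]+)")
# Field lines such as "buf_size=x00000980" (hex with an 'x' prefix)
# or "which=9" (plain decimal).
FIELD_RE = re.compile(r"^(\w+)=(x[0-9a-fA-F]+|\d+)$")


def parse_value(text):
    """Convert 'x00000980' style hex or a plain decimal string to int."""
    return int(text[1:], 16) if text.startswith("x") else int(text)


def parse_bad_status(lines):
    """Return one dict per 'bad status' dump found in the given log lines."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        header = HEADER_RE.search(line)
        if header:
            current = {
                "iface": header.group(1),
                "cmd_status": int(header.group(2), 16),
            }
            records.append(current)
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            current[field.group(1)] = parse_value(field.group(2))
        else:
            # Any other line (e.g. "nfs: server ... OK") ends the current dump.
            current = None
    return records
```

Feeding it the lines of a saved console log (for example `parse_bad_status(open("console.log"))`, where the file name is hypothetical) gives a list that can be counted or grouped by `iface` and `data_error`.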


The work-around:
- force install the 2.6.20-be0711.0.20070319 kernel (back level) from Cell SDK 2.1,
which has the spidernet driver built into the kernel (CONFIG_SPIDER_NET=y)
- reboot
- mount the NFS server and install SDK 2.1 without any problem
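
To check how a given kernel was configured with respect to the work-around above, a rough Python sketch follows. It only parses the text of a kernel config file; the helper name `config_value` is hypothetical, and the assumption that the config is available as a plain-text file (commonly `/boot/config-<version>` on Fedora) may not hold in the installer environment.

```python
def config_value(config_text, option):
    """Return 'y', 'm', 'n', or None for a kernel config option.

    'y' means built into the kernel, 'm' means built as a module,
    'n' means explicitly not set, None means the option is not mentioned.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# %s is not set" % option:
            return "n"
    return None
```

For example, `config_value(open("/boot/config-2.6.20-CBE").read(), "CONFIG_SPIDER_NET")` (the path is an assumption) would be expected to return `y` for the SDK kernel described above.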



Linas Vepstas and James are looking to see exactly what version of the Spidernet
driver was picked up by Fedora 7. It is possible a later version exists that
will solve  this issue. This assumes of course that it is really a Spidernet
problem. 


There are two spidernet patches that have been submitted and accepted 
for 2.6.21, that are not in 2.6.20, that seem likely to fix the problem. 
There are also 7 other patches that went into spidernet that would not
affect this bug, but fix things for mii on celleb.

I would advise testing with the spidernet taken from the current 2.6.21 
release candidate tree. You'd need to grab spider_net.c, spider_net.h and
sungem_phy.c (to get the bcm45xx/genmii changes)


While trying to install the latest driver, I observed the following:
If I install with 'rpm -ivv /nfs_mount/my_package.rpm', it crashes with the
error described in #7.
If I copy the packages to a local directory on the blade and install with 'rpm
-ivv /local_dir/my_package.rpm', it succeeds.

The 2.6.20-be0711.0.20070319 Cell BE kernel works just fine; it was built with
these spidernet patches:

spider-fix-eth_zlen.patch
spidernet-add-support-for-celleb.diff
spidernet-autoneg-medium.diff
spidernet-autoneg-support-for-celleb.diff
spidernet-avoid-double-free.diff
spidernet-ipfrag-nfs.diff
spidernet-linas-2-sync-with-mainline.diff
spidernet-linas-3-split-descr.diff
spidernet-linas-4-tx-race.diff
spidernet-linas-5-nother-race.diff
spidernet-linas-6-typo.diff
spidernet-linas.diff
spidernet-load-firmware-when-open.diff
spidernet-queue-drain.diff
spidernet-remove-txram-full-logging.diff
spidernet-sungem-update-4.diff

Comment 1 IBM Bug Proxy 2007-04-12 22:11:41 UTC

Created attachment 152510 [details]
/var/log/messages

Comment 2 IBM Bug Proxy 2007-04-13 01:50:46 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |RETTIG.com




------- Additional Comments From mlui.com  2007-04-12 21:44 EDT -------
Oliver, adding you to cc list.

Comment 3 IBM Bug Proxy 2007-04-13 01:51:18 UTC

----- Additional Comments From mlui.com  2007-04-12 21:48 EDT -------
Note that the two latest Fedora 7 install bugs, this one and bug #33892, are 
linked to NFS.  These two bugs may be related.  

This bug is an install problem on QS20 via NFS, where FTP and HTTP installs 
worked, while #33892 is an NFS boot from a remote hard drive on QS21.

Comment 4 IBM Bug Proxy 2007-04-13 02:55:34 UTC

----- Additional Comments From mlui.com  2007-04-12 22:50 EDT -------
Abe, was the successful install on our PPC box(es) via NFS?  Should be a good 
comparison point.

Comment 5 IBM Bug Proxy 2007-04-13 09:06:04 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rolf_schmidt.com




------- Additional Comments From RETTIG.com  2007-04-13 05:03 EDT -------
This is a QS20 issue AFAIK. I am forwarding this to Rolf Schmidt to handle (LTC).

Comment 6 IBM Bug Proxy 2007-04-13 13:55:35 UTC

------- Additional Comments From abareval.com  2007-04-13 09:53 EDT -------
(In reply to comment #15)
> Abe, was the successful install on our PPC box(es) via NFS?  Should be a good 
> comparison point.

Yep, FC7 installed successfully via NFS on PPC boxes.

Comment 7 IBM Bug Proxy 2007-04-13 15:46:17 UTC

----- Additional Comments From mlui.com  2007-04-13 11:41 EDT -------
Did a little experiment on NFS as suggested by Jim.  NFS-mounted the install 
source on cell8, a QS20 that was successfully installed with F7-test3 via FTP, 
and am repeatedly copying the source to this QS20 to see if anything breaks in 
NFS.  It's been 30 minutes and still going strong.  Will continue to update the 
status of this experiment.

Comment 8 IBM Bug Proxy 2007-04-13 16:15:31 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-04-13 12:11 EDT -------
The primary problem, from what I can tell, is that the kernel currently 
in FC7, which is 2.6.21-rc1, does not include a number of the patches 
for spidernet. The newer kernel 2.6.21-rc2 does include these patches. 

Recall that "rc" stands for "release candidate". Since there are now 
much newer rc's (kernel.org is up to rc6 as of yesterday), it can be 
understood that rc1 has a number of bugs that are fixed in later rc's.
The spidernet issues can be considered to be one of these "bugs". 

I can only hope that FC7, before going golden, will rebase on a stable
kernel, such as 2.6.21, instead of one of its release candidates, right?
Can someone verify that this will occur? If so, then there won't be 
anything to "fix" here, right?

Comment 9 IBM Bug Proxy 2007-04-13 16:21:24 UTC

----- Additional Comments From abareval.com  2007-04-13 12:19 EDT -------
***Correction***
To access cell10 console, you can use "telnet con3.ltc.austin.ibm.com 7006", 
please let me know if you plan on logging on to this telnet session beforehand.

Regards,
Abraham Arevalo

Comment 10 IBM Bug Proxy 2007-04-13 16:41:04 UTC

----- Additional Comments From mlui.com  2007-04-13 12:35 EDT -------
Note that in the last experiment we were using the Cell kernel which 
has the spidernet patches: Linux cell8.ltc.austin.ibm.com 2.6.20-CBE 
#1 SMP Wed Mar 21 10:24:39 CET 2007 ppc64 ppc64 ppc64 GNU/Linux

Next we are trying the test with the test3 kernel.

Comment 11 IBM Bug Proxy 2007-04-13 16:55:27 UTC

----- Additional Comments From jklewis.com  2007-04-13 12:49 EDT -------
Monza, please use caution when mentioning "Spidernet patches". We have not
created any new patches to deal with this problem. We are still trying to
determine where the problem is (Spidernet or NFS). 

To be clear, some NFS tests were run but on the wrong kernel. We are now
switching back to the stock (install) kernel and running them again. This should
help us isolate where the problem really is.

Comment 12 IBM Bug Proxy 2007-04-13 20:37:52 UTC

----- Additional Comments From jklewis.com  2007-04-13 16:34 EDT -------
I wanted to see if NFS works with the stock kernel. If it doesn't then I believe
that means we have found where the problem is. Some tests were run, but it's now
not clear which kernel was used. 

Installs using FTP and HTTP work according to a few people. This uses Spidernet
of course. Fedora Core 6 installs worked fine using NFS. It fails with F7. The
Spidernet driver in F7 is at least as good as the one in FC6, and so NFS should
work as it did before, UNLESS something else changed. I could be wrong, but I
think it's a stretch to believe an NFS fix is now needed in Spidernet on F7 when
it wasn't needed on FC6.

Comment 13 IBM Bug Proxy 2007-04-13 22:00:41 UTC

----- Additional Comments From mlui.com  2007-04-13 17:54 EDT -------
Unfortunately the NFS test we ran the second time was in an updated kernel, 
2.6.20-1.3059.fc7.  Will have to force install the older kernel to try it 
again.  We are planning to do it coming sunday and will post the result.

Comment 14 IBM Bug Proxy 2007-04-14 15:55:53 UTC

----- Additional Comments From mlui.com  2007-04-14 11:49 EDT -------
Ran a few experiments, including copying the gdb and gtk rpms over to cell8 via 
a mount point 100 times.  They were run on the original test3 kernel 
(2.6.20-1.3023.fc7) and also finished without problems.  Note that the network 
goes through sitewide in all these and previous experiments.
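
A repeated-copy experiment of this kind could be scripted roughly as follows. This is a sketch under assumptions, not the script the test team used: the function names are invented, and a checksum comparison is added so that silent corruption is caught as well as hangs.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large RPMs or ISOs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeat_copy(src, dst_dir, rounds):
    """Copy src into dst_dir `rounds` times; return how many copies verified.

    src is expected to live on the NFS mount and dst_dir on local disk.
    A network hang shows up as this function never returning, so run it
    with a console attached to watch for driver messages.
    """
    expected = sha256_of(src)
    verified = 0
    for i in range(rounds):
        target = os.path.join(dst_dir, "copy-%03d" % i)
        shutil.copyfile(src, target)
        if sha256_of(target) == expected:
            verified += 1
        os.remove(target)  # keep local disk usage constant
    return verified
```

Something like `repeat_copy("/mnt/nfs/gdb.rpm", "/tmp", 100)` would mirror the 100-copy run described above; both paths are placeholders. Note that client-side caching may mean later rounds generate less network traffic than the first.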

Comment 15 IBM Bug Proxy 2007-04-16 03:00:55 UTC

----- Additional Comments From mlui.com  2007-04-15 22:56 EDT -------
Jim, 

We've found that the problem only occurs on eth0, not eth1, when using 
lewis7 as the file server.  Could you please comment on how the 
setup of eth0 differs from that of eth1?  Joseph will post more 
information.  Thanks.

Comment 16 IBM Bug Proxy 2007-04-16 03:05:42 UTC

----- Additional Comments From mlui.com  2007-04-15 23:02 EDT -------
Ken and Thinh,

The bug appears only when the install goes through eth0, a local network, which 
I believe is connected to a slower switch (100M).  eth1, on the other hand, is 
connected to sitewide and probably to a faster (1G) switch. Still need 
Jim's confirmation though.  How was your network set up when you saw this 
bug?

Comment 17 IBM Bug Proxy 2007-04-16 03:25:28 UTC

----- Additional Comments From joseferr.com  2007-04-15 23:21 EDT -------
Following the steps of Monza and Jim, I started doing tests using NFS mount points.

I used an NFS mount from a machine (lewis7) using eth1 and its main IP, and
started copying files of different sizes (RPM packages from a Fedora 7 ISO) to
the cell machine (cell8). No problems were found, and the copies ran for
over an hour. I used the 2.6.20-1.3023.fc7, 2.6.20-1.3059.fc7 and 2.6.20-CBE kernels.

Then I tried using eth0 and the same machines, using lewis7's private IP. This
was the same setup used during the install process that generated this bug. With
this setup, the problem appeared with both fc7 kernels. Just a few seconds
after the file copies started, the cp stopped and the following output was shown:

Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
got descriptor chain end interrupt, restarting DMAC A.
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
got descriptor chain end interrupt, restarting DMAC A.
eth0: bad status, cmd_status=x48800203
buf_addr=x9e814080
buf_size=x00000980
next_descr_addr=x9e7502c0
result_size=x00000600
valid_size=x000005ec
data_status=x70001500
data_error=x02100000 
which=21
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
Spider RX RAM full, incoming packets might be discarded!
printk: 2 messages suppressed.

With that in mind, I retried the install of F7-Test3 (6.92), but this time using
eth1 as the network card. The machines used were cell9 and lewis7 (using 9.*
IPs). The NFS/text install completed successfully, installing 1082 packages and
over 1.5 GB of data. To rule out the bug being specific to cell8, I tried the
NFS install again on cell9 over eth0 (private network). The install process failed
as reported earlier.

With cell8 installed and running F7 (a FTP install using eth1), we could see
that both eths use spidernet module and are set up using 1000Mbit connections.

I'm adding as attachments the outputs for lsmod, dmesg and ethtool for each of
the kernels that are being used.

Monza Lui has more info about these tests, using FTP instead of NFS mounts.

One other thing we found is that if the network is restarted (service
network restart), it shows these errors:

[root@cell8 ~]# service network restart
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  [  OK  ]
Bringing up interface eth1:  eth1: bad status, cmd_status=x40800009
buf_addr=xbeec7080
buf_size=x00000980
next_descr_addr=xbe9b6020
result_size=x00000080
valid_size=x00000042
data_status=x4000a500
data_error=x00004000
which=0
eth1: bad status, cmd_status=x40800109
buf_addr=xbe98e080
buf_size=x00000980
next_descr_addr=xbe9b6040
result_size=x00000080
valid_size=x00000042
data_status=x4000a600
data_error=x00004000
which=1
eth1: bad status, cmd_status=x40800109
buf_addr=xa0a7d080
buf_size=x00000980
next_descr_addr=xbe9b6060
result_size=x00000080
valid_size=x00000042
data_status=x4000a700
data_error=x00004000
which=2
eth1: bad status, cmd_status=x40800109
buf_addr=xa0b6c080
buf_size=x00000980
next_descr_addr=xbe9b6080
result_size=x00000080
valid_size=x00000042
data_status=x4000a800
data_error=x00004000
which=3
eth1: bad status, cmd_status=x40800109
buf_addr=xbe8d4080
buf_size=x00000980
next_descr_addr=xbe9b60a0
result_size=x00000080
valid_size=x00000042
data_status=x4000a900
data_error=x00004000
which=4
eth1: bad status, cmd_status=x40800109
buf_addr=xbe990080
buf_size=x00000980
next_descr_addr=xbe9b60c0
result_size=x00000080
valid_size=x00000042
data_status=x4000aa00
data_error=x00004000
which=5
eth1: bad status, cmd_status=x40800109
buf_addr=xbed04080
buf_size=x00000980
next_descr_addr=xbe9b60e0
result_size=x00000080
valid_size=x00000042
data_status=x4000ab00
data_error=x00004000
which=6
eth1: bad status, cmd_status=x40800009
buf_addr=xbea0c080
buf_size=x00000980
next_descr_addr=xbe9b6100
result_size=x00000080
valid_size=x00000042
data_status=x4000ac00
data_error=x00004000
which=7
eth1: bad status, cmd_status=x40800009
buf_addr=xbe944080
buf_size=x00000980
next_descr_addr=xbe9b6120
result_size=x00000080
valid_size=x00000042
data_status=x4000ad00
data_error=x00004000
which=8
[  OK  ]

Comment 18 IBM Bug Proxy 2007-04-16 03:30:49 UTC

Created attachment 152660 [details]
2.6.20-CBE.info

Comment 19 IBM Bug Proxy 2007-04-16 03:30:55 UTC

----- Additional Comments From joseferr.com  2007-04-15 23:24 EDT -------
 
Info using CBE kernel

Comment 20 IBM Bug Proxy 2007-04-16 03:31:27 UTC

Created attachment 152661 [details]
2.6.20-1.3023.fc7.info

Comment 21 IBM Bug Proxy 2007-04-16 03:31:34 UTC

----- Additional Comments From joseferr.com  2007-04-15 23:26 EDT -------
 
Info using 3023.fc7 kernel

Comment 22 IBM Bug Proxy 2007-04-16 03:32:06 UTC

Created attachment 152662 [details]
2.6.20-1.3059.fc7.info

Comment 23 IBM Bug Proxy 2007-04-16 03:32:33 UTC

----- Additional Comments From joseferr.com  2007-04-15 23:27 EDT -------
 
Info using 3059.fc7 kernel

Comment 24 IBM Bug Proxy 2007-04-16 13:21:19 UTC

------- Additional Comments From brenohl.com  2007-04-16 09:15 EDT -------
(In reply to comment #31)
> Spider RX RAM full, incoming packets might be discarded!
> Spider RX RAM full, incoming packets might be discarded!
> got descriptor chain end interrupt, restarting DMAC A.

Searching for this error message in the kernel source code, I found the
following codes:

source code file: drivers/net/spider_net.c
function: spider_net_handle_error_irq(struct spider_net_card *card, u32 status_reg)

This function checks the status registers, and if it finds the interrupt
SPIDER_NET_GRMFLLINT, it prints the error:

        case SPIDER_NET_GRMFLLINT:
                if (netif_msg_intr(card) && net_ratelimit())
                        pr_debug("Spider RX RAM full, incoming packets "
                               "might be discarded!\n");
                spider_net_rx_irq_off(card);
                tasklet_schedule(&card->rxram_full_tl);
                show_error = 0;
                break;

Comment 25 IBM Bug Proxy 2007-04-16 14:00:40 UTC

----- Additional Comments From joseferr.com  2007-04-16 09:53 EDT -------
[root@cell8 VolGroup00]# ethtool -i eth0
driver: spidernet
version: 2.0 A
firmware-version: no information
bus-info: 0001:00:03.0

[root@cell8 VolGroup00]# ethtool -i eth1
driver: spidernet
version: 2.0 A
firmware-version: no information
bus-info: 0002:00:03.0

[root@cell8 VolGroup00]# lspci
0000:00:0a.0 IDE interface: Silicon Image, Inc. PCI0680 Ultra ATA-133 Host
Controller (rev 02)
0001:00:03.0 Ethernet controller: Toshiba America Unknown device 01b3 (rev 02)
0002:00:03.0 Ethernet controller: Toshiba America Unknown device 01b3 (rev 02)

Comment 26 IBM Bug Proxy 2007-04-16 15:30:28 UTC

----- Additional Comments From jklewis.com  2007-04-16 11:23 EDT -------
Excellent work by the test team. I'll try to add some helpful info here:

eth0 is our gigabit interface. The server, lewis7, has an e1000 that is
connected directly to the top switch in our BladeCenter. Obviously this is a
whole lot faster than using the site interface (eth1) which is at 100. Note that
this is the same configuration that worked installing FC6 via NFS.

The error messages are indeed coming from the Spidernet driver. However, this
does not necessarily mean Spidernet is at fault. In many cases something else
has changed in the kernel that has slowed the timing down, and these have caused
Spidernet to have trouble keeping up. 

When FTP and/or HTTP was used to perform a successful install, which interface
was used? If it was eth1 please try again using eth0.

At this point I still suspect something other than Spidernet to be causing this
problem. At some point, in the install tree, we need to replace Spidernet with
the latest version. I do not know the procedures for replacing files in a
distro, or building RPMs. Anyone?

Comment 27 IBM Bug Proxy 2007-04-16 17:25:51 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-04-16 13:21 EDT -------
ARE THOSE MESSAGES FROM THE PATCHED DRIVER, OR THE UNPATCHED DRIVER?

ONE OF THE SPIDERNET PATCHES MISSING FROM FC7 IS A PATCH THAT FIXES
A BUG THAT TRIGGERS RX RAM FULL MESSAGES!

Comment 28 IBM Bug Proxy 2007-04-16 17:41:56 UTC

------- Additional Comments From joseferr.com  2007-04-16 13:37 EDT -------
(In reply to comment #37)
> Excellent work by the test team. I'll try to add some helpful info here:
> 
> eth0 is our gigabit interface. The server, lewis7, has an e1000 that is
> connected directly to the top switch in our BladeCenter. Obviously this is a
> whole lot faster than using the site interface (eth1) which is at 100. Note 
> that this is the same configuration that worked installing FC6 via NFS.

Very interesting, Jim. We were thinking exactly the opposite: that eth0 was at
100 and eth1 at 1000. With small or average size files (< 100 MB), eth0 fails
and eth1 is fine. eth1 only fails when copying large files, like the F7 ISO.
Anyway, shouldn't eth1 then show up on the cell machine as a 100Mbit,
Half-Duplex link? This intrigues me.
[root@cell8 ~]# ethtool eth1
Settings for eth1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseT/Full
        Supports auto-negotiation: No
        Advertised link modes:  1000baseT/Full
        Advertised auto-negotiation: No
        Speed: 1000Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00007fff (32767)

> The error messages are indeed coming from the Spidernet driver. However, this
> does not necessarily mean Spidernet is at fault. In many cases something else
> has changed in the kernel that has slowed the timing down, and these have 
> caused Spidernet to have trouble keeping up. 
> 
> When FTP and/or HTTP was used to perform a successful install, which interface
> was used? If it was eth1 please try again using eth0.

Our FTP install used eth1. But Monza did some tests with FTP and some big files,
and they failed on both eths. I can try to install using eth0, but I'm almost
certain it will fail too, as those tests did.

> At this point I still suspect something other than Spidernet to be causing 
> this problem. At some point, in the install tree, we need to replace Spidernet 
> with the latest version. I do not know the procedures for replacing files in a
> distro, or building RPMs. Anyone?
> 

Breno and I are trying to create a netboot image containing a newer kernel and
the patches, so we can retry the install process and see what happens.

Comment 29 IBM Bug Proxy 2007-04-16 17:46:14 UTC

------- Additional Comments From joseferr.com  2007-04-16 13:41 EDT -------
(In reply to comment #38)
> ARE THOSE MESSAGES FROM THE PATCHED DRIVER, OR THE UNPATCHED DRIVER?
> 
> ONE OF THE SPIDERNET PATCHES MISSING FROM FC7 IS A PATCH THAT FIXES
> A BUG THAT TRIGGERS RX RAM FULL MESSAGES!

The messages come from the F7 kernels: 2.6.20-1.3023, used during the install
(the one that comes with the ISO), and the latest update in the F7 repositories,
2.6.20-1.3059.
The 2.6.20-CBE kernel (that comes with SDK 2.1) does not show these messages, and
completes these NFS and FTP tests without problems on both eths.

Comment 30 IBM Bug Proxy 2007-04-16 18:35:40 UTC

----- Additional Comments From jklewis.com  2007-04-16 14:28 EDT -------
Joseph, at one time we did have it setup with eth0 the site and eth1 the
private. Unfortunately, you can't netboot from eth1 and so netbooting on eth0
was very tricky as there were consistent collisions with the other servers in
the lab. Switching the interfaces to how we have them now solved these problems,
and also allowed for faster installs. 

Not sure why you brought up half-duplex. I would recommend not ever using it for
various reasons. Under no circumstances should you ever mix duplex on the same
network. All modern day interfaces and switches use full-duplex.

If you run a test that copies a file larger than about 2 GB from somewhere to
the Cell blade you are going to hit Bug # 29975. According to what I have seen
this bug is NOT going to be fixed. 

On a Cell blade ethtool will always show a speed of 1000 Mb/s regardless of what
speed is really there. I believe that's because it is showing you the speed of
the switch in the BladeCenter. 

I have given Monza some pointers on how to create a new boot kernel.

Comment 31 IBM Bug Proxy 2007-04-16 18:40:28 UTC

----- Additional Comments From mlui.com  2007-04-16 14:38 EDT -------
To summarize what we found so far:

Installing F7-test3 via eth1 works on both cell8 and cell9 (both 
QS20) via NFS, FTP, and HTTP. Installing the same via eth0, however, 
fails at least via NFS on both cells. Note that eth0 and eth1 are 
built-in spidernet cards on QS20. The difference between eth0 and 
eth1 is that eth0 is connected to a local network with a faster interface 
(1 Gbit/s) while eth1 is connected to sitewide (100 Mbit/s). 

We did some testing after the cell blades were installed. We copied a 
4GB file from two remote servers, lewis7 which is available through 
both eths and ppc64flp1 which is available only through eth0. For NFS, 
we mounted the remote file system on the cell blade before copying.  
For FTP, we connected to the remote server and did an mget.  Here are the 
results so far:
Network_card     Protocol     File_server    Kernel    Result
eth0               NFS           lewis7       F7-test3  Hang
eth1               NFS           lewis7       F7-test3  OK
eth1               NFS           ppc64flp1    F7-test3  OK
eth0               FTP           lewis7       F7-test3  Fail very quickly
eth1               FTP           lewis7       F7-test3  Fail after a while
eth1               FTP           ppc64flp1    F7-test3  OK
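
The table above can be tallied per interface with a short Python sketch like the one below. The rows are transcribed from the table as posted (all on the F7-test3 kernel); the name `tally_by_interface` is made up for this illustration, and any result other than "OK" is counted as a failure.

```python
from collections import defaultdict

# (network card, protocol, file server, result), copied from the table above.
RESULTS = [
    ("eth0", "NFS", "lewis7", "Hang"),
    ("eth1", "NFS", "lewis7", "OK"),
    ("eth1", "NFS", "ppc64flp1", "OK"),
    ("eth0", "FTP", "lewis7", "Fail very quickly"),
    ("eth1", "FTP", "lewis7", "Fail after a while"),
    ("eth1", "FTP", "ppc64flp1", "OK"),
]


def tally_by_interface(results):
    """Count OK and failed runs for each network card."""
    summary = defaultdict(lambda: {"ok": 0, "failed": 0})
    for iface, _protocol, _server, result in results:
        summary[iface]["ok" if result == "OK" else "failed"] += 1
    return dict(summary)
```

Grouped this way, every eth0 run failed, while the only eth1 failure was the large FTP transfer from lewis7.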

We are currently trying to build a new set of boot images that includes 
some spidernet patches to see if they fix the install problem.

Comment 32 IBM Bug Proxy 2007-04-16 19:15:51 UTC

----- Additional Comments From joseferr.com  2007-04-16 15:09 EDT -------
> lewis7 which is available through 
> both eths and ppc64flp1 which is available only through eth0.

correction: ppc64flp1 is available only through *eth1* (as the table shows).

Comment 33 IBM Bug Proxy 2007-04-16 19:40:40 UTC

----- Additional Comments From mlui.com  2007-04-16 15:37 EDT -------
Ken said that his cell blade was also connected to a 1 Gbit/s switch when 
the install bug happened. This does not contradict what we've found.

Comment 34 IBM Bug Proxy 2007-04-17 03:15:39 UTC

----- Additional Comments From mlui.com  2007-04-16 23:09 EDT -------
Linas, could you please provide us a spidernet.ko with the patches 
compiled in?  We took the F7-test3 (2.6.20-1.3023.fc7) kernel, 
applied 13 patches we found from SDK2.1 Cell BE kernel, and 
recompiled.  Got a new spidernet.ko but it didn't work.  

We are currently trying to create a new install kernel with patched 
spidernet.ko to see if the patches fix the problem.  That's why we 
need a new spidernet.ko.  Thanks.

Comment 35 IBM Bug Proxy 2007-04-17 03:16:11 UTC

----- Additional Comments From mlui.com  2007-04-16 23:11 EDT -------
Joseph, did we try installing via eth0 via FTP?

Comment 36 IBM Bug Proxy 2007-04-17 03:30:36 UTC

----- Additional Comments From mlui.com  2007-04-16 23:25 EDT -------
Joseph, never mind, we didn't as lewis7 did not have nfs server running.

Comment 37 IBM Bug Proxy 2007-04-17 10:40:48 UTC

------- Additional Comments From joseferr.com  2007-04-17 06:38 EDT -------
(In reply to comment #47)
> Joseph, never mind, we didn't as lewis7 did not have nfs server running. 

FTP using eth0 doesn't work either. It stops right after providing the IP and
directory, at the message:

    +----------------------------+ Retrieving +----------------------------+
    |                                                                      |
    | Retrieving images/minstg2.img...                                     |
    |                                                                      |
    +----------------------------------------------------------------------+

At this point, it stops.

Using eth1 works.

Comment 38 IBM Bug Proxy 2007-04-17 20:50:41 UTC

----- Additional Comments From brenohl.com  2007-04-17 16:46 EDT -------
I've tested the problem (NFS) with 2.6.21-rc4 and it worked as expected.
I'll mount the netboot right now to confirm that the problem is fixed
in this kernel version.

Comment 39 Robbie Williamson 2007-04-20 14:33:37 UTC

As Fedora is the distro of choice for any development on the Cell architecture,
I cannot stress how important it is to have this fixed in FC7.  Is there
something we can do to help with this bug?

Comment 40 IBM Bug Proxy 2007-04-22 21:20:32 UTC

----- Additional Comments From mlui.com  2007-04-22 17:15 EDT -------
Linas, I remember you told me there was one NFS related patch that is not 
upstream yet.  Is it still the case?

Comment 41 IBM Bug Proxy 2007-04-22 21:25:31 UTC

----- Additional Comments From mlui.com  2007-04-22 17:20 EDT -------
The Fedora team, will these spidernet patches be in test4?

Comment 42 Dave Jones 2007-04-23 19:11:40 UTC

there are no patches attached to this bugzilla, so it's unlikely. (test4 froze
last week).  We can take them on for GA however, but we'll need them diffed
against a current (2.6.21-rc7 at time of writing) tree.

Comment 43 IBM Bug Proxy 2007-04-23 19:46:22 UTC

----- Additional Comments From brenohl.com  2007-04-23 15:39 EDT -------
Hi, 
Kernel 2.6.21-rc7 has all the spidernet patches except one, which is already
in the vanilla tree and will be available in the rc8 kernel release.

Thanks for the response.

Comment 44 Dave Jones 2007-04-23 19:49:25 UTC

Linus has indicated that he's reluctant to do an -rc8, so it's likely that we'll
see a .21 any day now.  If that remaining patch isn't in -rc7-git5, please
attach it, and I'll get it into the Fedora builds.

Comment 45 IBM Bug Proxy 2007-04-23 20:22:34 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-04-23 16:15 EDT -------
The nfs patch is not in linux-2.6.20-rc7 but it is in linux-2.6.20-rc7-git6
commit 33bdeec80649f2eab36039f63d69c65378493cbe

Comment 46 IBM Bug Proxy 2007-04-23 20:47:25 UTC

------- Additional Comments From mlui.com  2007-04-23 16:40 EDT -------
(In reply to comment #69)
> It's been very difficult to work on this bug due to the lack of hardware
> available for us to debug. I understand that the team does not have many
> machines to perform the tests,

Please note that we are very limited in hardware and on a very tight
testing schedule. Continual freeze is this Thursday, and the team worked
last weekend. Even so, we did provide the hardware more than a couple of
times to debug this problem.

> however if this is a ship issue, or even a blocker as it has been
> originally opened, we'd need a machine available for more time. Since
> there is a workaround for this, I believe we can lower this bug's
> severity.

This bug should stay a ship issue as defined by bugzilla's severity
definition. As per our conversation online, I believe the best way to go
about this is for your team to try out the boot image on a ppc64 box.
If it boots, then we try it out on cell. We just want to minimize the
disruption to our testing. Hope you understand.

Comment 47 IBM Bug Proxy 2007-04-23 20:51:39 UTC

----- Additional Comments From mlui.com  2007-04-23 16:44 EDT -------
For the record, the boot image should be created for ppc64, not cell, as there
is no install tree for cell in Fedora 7, only for ppc.

Sorry for the last post, should have been internal only.

Comment 48 Dave Jones 2007-04-23 21:00:53 UTC

I'll be rebasing to linux-2.6.20-rc7-git6 soon, so we should pick up that NFS
change for F7.

Comment 49 IBM Bug Proxy 2007-04-23 21:07:23 UTC

Created attachment 153314 [details]
x

Comment 50 IBM Bug Proxy 2007-04-23 21:07:53 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-04-23 17:03 EDT -------
 
patch that fixes NFS hang; this patch is now in linux-2.6.20-rc7-git6

Comment 51 IBM Bug Proxy 2007-04-24 20:45:39 UTC

----- Additional Comments From brenohl.com  2007-04-24 16:40 EDT -------
Hi.
I am getting two errors. If I boot the image on a cell machine, the anaconda
loader cannot bring the spidernet interface up, as described in comment #60.
If I boot a default ppc64 kernel build on a ppc64 box, I get a 0x700 error,
which is an illegal instruction.
I'll attach the boot log here.

Does anyone have any tips?

Thanks

Comment 52 IBM Bug Proxy 2007-04-24 20:46:11 UTC

Created attachment 153385 [details]
error.txt

Comment 53 IBM Bug Proxy 2007-04-24 20:46:47 UTC

----- Additional Comments From brenohl.com  2007-04-24 16:41 EDT -------
 
the boot log

Comment 54 IBM Bug Proxy 2007-04-25 20:46:30 UTC

----- Additional Comments From brenohl.com  2007-04-25 16:42 EDT -------
Hi people,
I booted the monza image and debugged it.
The dmesg output showed no attempt to load any spidernet module.
I then uncompressed modules.cgz into a directory and loaded spidernet.ko
manually, and got the following error:

<4>ksign: module signed with unknown public key
<4>- signature keyid: 6b3cae08dc6d60bc ver=3
<3>Module signed with unknown public key


I think we have two problems: the image directory layout, which seems wrong
for loading the module, and the module signature error, which prevents the
module from being loaded.

Linas, 
Do you know what is going wrong with the public key?


I'll post the dmesg output above.

Comment 55 IBM Bug Proxy 2007-04-25 20:47:02 UTC

Created attachment 153456 [details]
log.txt

Comment 56 IBM Bug Proxy 2007-04-25 20:47:19 UTC

----- Additional Comments From brenohl.com  2007-04-25 16:43 EDT -------
 
the dmesg output.

Look at the modprobe error too.

Comment 57 IBM Bug Proxy 2007-04-25 20:50:26 UTC

----- Additional Comments From brenohl.com  2007-04-25 16:45 EDT -------
The error message I got when loading it manually:

sh-3.2# insmod spidernet.ko
insmod: cannot insert `spidernet.ko': Required key not available (-1): Required key not available

Comment 58 IBM Bug Proxy 2007-04-25 20:55:39 UTC

----- Additional Comments From brenohl.com  2007-04-25 16:50 EDT -------
Looking deeper into the dmesg output, I could see that the loader did try to
load the spidernet module, and that it failed with the same unknown key error.
So I think the image layout is fine.

Comment 59 IBM Bug Proxy 2007-04-27 02:40:23 UTC

----- Additional Comments From mlui.com  2007-04-26 22:35 EDT -------
Test4 arrived today; did an NFS install on cell14 via eth0.  It got stuck at:
    ││                                68%                                │
    ││                                                                   │
    ││                  997 of 1186 packages completed                   │
    ││                                                                   │
    ││ Installing libgnomeui - 2.18.1-2.fc7.ppc (3 MB)                   │
    ││ GNOME base GUI library                 

tcpdump -i on the server side (lewis7) shows the following.  It looks like
cell14 (192.168.2.15) is still trying to contact lewis7 (192.168.2.4).
21:15:12.810217 arp who-has 192.168.2.4 tell 192.168.2.15
21:15:12.810229 arp reply 192.168.2.4 is-at 00:0e:0c:a0:90:1c (oui Unknown)
21:15:13.810144 arp who-has 192.168.2.4 tell 192.168.2.15
21:15:13.810159 arp reply 192.168.2.4 is-at 00:0e:0c:a0:90:1c (oui Unknown)
21:15:14.147403 802.1d config 8000.00:16:ca:9a:e7:00.8013 root 8000.00:16:ca:9a:e7:00 pathcost 0 age 0 max 20 hello 2 fdelay 15

However cell14 is not pingable from lewis7.  Will try the same but via eth1 
installing from lewis7 by NFS mount and see what happens.

Comment 60 IBM Bug Proxy 2007-04-27 03:05:24 UTC

----- Additional Comments From mlui.com  2007-04-26 22:59 EDT -------
Breno, could you please check if all the patches are in test4?  Thanks.

Comment 61 David Woodhouse 2007-04-27 06:44:09 UTC

(In reply to comment #60)
> ----- Additional Comments From mlui.com  2007-04-26 22:59 EDT -------
> Breno, could you please check if all the patches are in test4?  Thanks. 

That would be unlikely. The test4 kernel froze long ago. Try tomorrow's rawhide,
_if_ it has a kernel-2.6.20-1.3105.fc7 or later.

That seems to be when linux-2.6.20-rc7-git6 was merged, according to
http://cvs.fedora.redhat.com/viewcvs/rpms/kernel/devel/kernel-2.6.spec?view=log

Comment 62 IBM Bug Proxy 2007-04-28 01:20:36 UTC

----- Additional Comments From mlui.com  2007-04-27 21:18 EDT -------
We tested with today's rawhide, which is rebased to 2.6.21
(kernel-2.6.21-1.3116.fc7), unlike test4, which is still based on 2.6.20.
Breno verified that the spidernet patches are already in this rawhide code,
but the install still hangs.  We are trying to collect logs, etc.

Comment 63 IBM Bug Proxy 2007-04-28 15:10:44 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|RH236298- Fedora 7 Test 3   |RH236298- Fedora 7 Test 3
                   |install fails using 1Gbit   |NFS and FTP install fails
                   |network on QS20 machine     |using 1Gbit network on QS20
                   |                            |machine




------- Additional Comments From mlui.com  2007-04-28 11:08 EDT -------
As Ken and Thinh found in F7 test2/3, HTTP install via the 1Gb network
works in the rebased F7 (rawhide from 0427 with
kernel-2.6.21-1.3116.fc7), but FTP and NFS installs do not.  Verified on cell16.

Comment 64 David Woodhouse 2007-04-28 15:22:03 UTC

Obviously there's not a lot we can do to help with this until our QS2[01]
hardware arrives. Closing UPSTREAM for now.

Comment 65 IBM Bug Proxy 2007-04-28 15:35:51 UTC

------- Additional Comments From mlui.com  2007-04-28 11:30 EDT -------
(In reply to comment #109) 
> ------- Additional Comments From dwmw2  2007-04-28 11:22 EST ------
> Obviously there's not a lot we can do to help with this until our QS2[01]
> hardware arrives. Closing UPSTREAM for now.
Robbie, could you please comment on this?  I believe Red Hat has at least one
QS20, right?

Comment 66 IBM Bug Proxy 2007-04-29 05:55:41 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN




------- Additional Comments From mlui.com  2007-04-29 01:51 EDT -------
Dropped to xmon after an NFS install hang and got the following:
mon> t
[c0000000021cfcc0] c0000000002a7c08 .__handle_sysrq+0xe8/0x1c0
[c0000000021cfd70] c0000000002ab608 .hvc_poll+0x1a0/0x2d4
[c0000000021cfe50] c0000000002abd38 .khvcd+0x90/0x16c
[c0000000021cfee0] c000000000096b0c .kthread+0x124/0x174
[c0000000021cff90] c000000000029284 .kernel_thread+0x4c/0x68

2:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = c0000000021cfc50   R17 = 0000000000000000
R02 = c0000000006fb668   R18 = 0000000000000000
R03 = c0000000021cfae0   R19 = 4000000001410000
R04 = c00000000073b148   R20 = c0000000005a6dd8
R05 = c00000000073b178   R21 = 00000000019b7048
R06 = c0000000006c6438   R22 = 0000000000000000
R07 = c0000000006c6618   R23 = 0000000000000001
R08 = c0000000006c6608   R24 = 0000000000000000
R09 = c0000000005b68c8   R25 = 0000000000000000
R10 = c0000000006c6638   R26 = c00000003ef8e1c8
R11 = c00000000006e134   R27 = 0000000000000078
R12 = c00000000073b150   R28 = 0000000000000001
R13 = c0000000005de880   R29 = 0000000000000001
R14 = 0000000000000000   R30 = c000000000690cf8
R15 = 0000000000000000   R31 = c0000000021cfae0
pc  = c00000000006e288 .sysrq_handle_xmon+0x48/0x5c
lr  = c00000000006e288 .sysrq_handle_xmon+0x48/0x5c
msr = 9000000000001032   cr  = 28000088
ctr = c00000000006e0e8   xer = 0000000020000000   trap =    0

2:mon> e
cpu 0x2: Vector: 0  at [c0000000021cfae0]
    pc: c00000000006e288: .sysrq_handle_xmon+0x48/0x5c
    lr: c00000000006e288: .sysrq_handle_xmon+0x48/0x5c
    sp: c0000000021cfc50
   msr: 9000000000001032
  current = 0xc00000000feaa2c0
  paca    = 0xc0000000005de880
    pid   = 258, comm = khvcd

Comment 67 IBM Bug Proxy 2007-04-29 06:55:29 UTC

----- Additional Comments From mlui.com  2007-04-29 02:51 EDT -------
To summarize our work today, here are a few things we tried.
1) Installed F7 on cell8 twice. The first time we installed via eth1 so that
we could get a successful install, using only half of the disk space. The
second time we installed via eth0 using the rest of the disk, and it hung as
expected. We booted back into the first F7 install and looked at the
filesystem from the second, partial F7 install. Got the dmesg and
/var/log/messages as attached, but did not see anything interesting.
2) Did a yum update to install rawhide rpms from a F7test4 partition. Rebooted
to the new kernel (2.6.21-1.3116.fc7). Did some copy tests over an NFS mount
via eth0. Got the exact same errors we got from test3 (comment #31). However,
when we ran the same test earlier (comment #40) on the Cell BE kernel
(kernel-2.6.20-CBE), we did not see this bug. Note that the set of spidernet
patches in the Cell BE kernel should all have been included in this rawhide
kernel.
3) Got the xmon output in the last comment at the NFS hang. Note that when
installing via FTP, it hangs much earlier, before stage2. That's why we did
not get an xmon output from the FTP hang.

Comment 68 IBM Bug Proxy 2007-04-29 07:15:52 UTC

----- Additional Comments From mlui.com  2007-04-29 03:11 EDT -------
Linas,
Could you please look at the xmon output and, if you can, check the rawhide
source code to see whether all the patches in the Cell BE kernel are in?
Thanks :)

Comment 69 IBM Bug Proxy 2007-04-29 07:25:51 UTC

----- Additional Comments From mlui.com  2007-04-29 03:21 EDT -------
Tried a new experiment: downloading a file via HTTP over eth0 while running
the rawhide kernel (2.6.21-1.3116.fc7). Got the same "RX RAM full, incoming
packets might be discarded!" errors as in the other NFS and FTP file copying
tests. Note that we did NOT have a problem installing via HTTP, but here we
do have a problem downloading a file via HTTP while running the same kernel.
eth0: bad status, cmd_status=x48800203
buf_addr=xaaf0f080
buf_size=x00000980
next_descr_addr=xb5eeb920
result_size=x00000600
valid_size=x000005ec
data_status=x7000a400
data_error=x02100000
which=200
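
The descriptor dump above is easier to compare between runs once its fields
are decoded into numbers. A minimal sketch in Python, assuming only the
key=xHEX layout seen in this log; the function name is hypothetical and no
meaning is assigned to the individual status bits:

```python
import re

# Descriptor dump copied from the log above.
DUMP = """\
eth0: bad status, cmd_status=x48800203
buf_addr=xaaf0f080
buf_size=x00000980
next_descr_addr=xb5eeb920
result_size=x00000600
valid_size=x000005ec
data_status=x7000a400
data_error=x02100000
which=200
"""

# A field is "name=value"; values prefixed with "x" are hexadecimal.
FIELD = re.compile(r"(\w+)=(x[0-9a-fA-F]+|\d+)")


def parse_descriptor_dump(text):
    """Return {field name: integer value} for every name=value pair found."""
    fields = {}
    for name, raw in FIELD.findall(text):
        fields[name] = int(raw[1:], 16) if raw.startswith("x") else int(raw)
    return fields


if __name__ == "__main__":
    for name, value in parse_descriptor_dump(DUMP).items():
        print(f"{name:16s} {value:#010x} ({value})")
```

Decoded this way, valid_size is 1516 bytes inside a 2432-byte buffer, which
makes it simple to diff this dump against the one from a later run.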

Comment 70 IBM Bug Proxy 2007-04-29 17:15:42 UTC

----- Additional Comments From mlui.com  2007-04-29 13:12 EDT -------
Reproduced the hang with echo 8 > /proc/sysrq-trigger
[C0000000012FBC80] [C0000000000111C4] .__switch_to+0x12c/0x160
[C0000000012FBD10] [C000000000401330] .schedule+0x8e0/0xa3c
[C0000000012FBE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/0x2ac [jfs]
[C0000000012FBEE0] [C000000000096B0C] .kthread+0x124/0x174
[C0000000012FBF90] [C000000000029284] .kernel_thread+0x4c/0x68
jfsCommit     S 0000000000000000 14768   665     83 (L-TLB)
Call Trace:
[C0000000012FFC80] [C0000000000111C4] .__switch_to+0x12c/0x160
[C0000000012FFD10] [C000000000401330] .schedule+0x8e0/0xa3c
[C0000000012FFE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/0x2ac [jfs]
[C0000000012FFEE0] [C000000000096B0C] .kthread+0x124/0x174
[C0000000012FFF90] [C000000000029284] .kernel_thread+0x4c/0x68
jfsCommit     S 0000000000000000 14768   666     83 (L-TLB)
Call Trace:
[C000000001303C80] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001303D10] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001303E10] [D0000000005FFAB4] .jfs_lazycommit+0x254/0x2ac [jfs]
[C000000001303EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001303F90] [C000000000029284] .kernel_thread+0x4c/0x68
jfsCommit     S 0000000000000000 14768   667     83 (L-TLB)
Call Trace:
[C00000000130BC80] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000130BD10] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000130BE10] [D0000000005FFAB4] .jfs_lazycommit+0x254/0x2ac [jfs]
[C00000000130BEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000130BF90] [C000000000029284] .kernel_thread+0x4c/0x68
jfsSync       S 0000000000000000 14800   668     83 (L-TLB)
Call Trace:
[C00000000130FAD0] [C00000000130FB80] 0xc00000000130fb80 (unreliable)
[C00000000130FCA0] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000130FD30] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000130FE30] [D0000000005FF61C] .jfs_sync+0x1c8/0x20c [jfs]
[C00000000130FEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000130FF90] [C000000000029284] .kernel_thread+0x4c/0x68
xfslogd/0     S 0000000000000000 14752   676     83 (L-TLB)
Call Trace:
[C000000001313C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001313D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001313E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001313EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001313F90] [C000000000029284] .kernel_thread+0x4c/0x68
xfslogd/1     S 0000000000000000 14752   677     83 (L-TLB)
Call Trace:
[C00000000131BC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000131BD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000131BE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000131BEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000131BF90] [C000000000029284] .kernel_thread+0x4c/0x68
xfslogd/2     S 0000000000000000 14752   678     83 (L-TLB)
Call Trace:
[C00000000131FC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000131FD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000131FE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000131FEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000131FF90] [C000000000029284] .kernel_thread+0x4c/0x68
xfslogd/3     S 0000000000000000 14752   679     83 (L-TLB)
Call Trace:
[C000000001323C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001323D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001323E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001323EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001323F90] [C000000000029284] .kernel_thread+0x4c/0x68
xfsdatad/0    S 0000000000000000 14752   680     83 (L-TLB)
Call Trace:
[C00000000132BC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000132BD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000132BE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000132BEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000132BF90] [C000000000029284] .kernel_thread+0x4c/0x68
xfsdatad/1    S 0000000000000000 14752   681     83 (L-TLB)
Call Trace:
[C00000000132FC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000132FD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000132FE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000132FEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000132FF90] [C000000000029284] .kernel_thread+0x4c/0x68
xfsdatad/2    S 0000000000000000 14016   682     83 (L-TLB)
Call Trace:
[C000000001333AA0] [00000000FFFF07E6] 0xffff07e6 (unreliable)
[C000000001333C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001333D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001333E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001333EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001333F90] [C000000000029284] .kernel_thread+0x4c/0x68
xfsdatad/3    S 0000000000000000 14752   683     83 (L-TLB)
Call Trace:
[C000000001337C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001337D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001337E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001337EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001337F90] [C000000000029284] .kernel_thread+0x4c/0x68
kmirrord      S 0000000000000000 14752   706     83 (L-TLB)
Call Trace:
[C000000001357C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001357D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001357E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001357EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001357F90] [C000000000029284] .kernel_thread+0x4c/0x68
ksnapd        S 0000000000000000 14752   714     83 (L-TLB)
Call Trace:
[C00000000134BC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000134BD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000134BE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000134BEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000134BF90] [C000000000029284] .kernel_thread+0x4c/0x68
kmpathd/0     S 0000000000000000 14752   722     83 (L-TLB)
Call Trace:
[C00000000134FC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000134FD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000134FE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000134FEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000134FF90] [C000000000029284] .kernel_thread+0x4c/0x68
kmpathd/1     S 0000000000000000 14752   723     83 (L-TLB)
Call Trace:
[C00000000135BC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000135BD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000135BE00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C00000000135BEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000135BF90] [C000000000029284] .kernel_thread+0x4c/0x68
kmpathd/2     S 0000000000000000 14752   724     83 (L-TLB)
Call Trace:
[C000000001363C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001363D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001363E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001363EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001363F90] [C000000000029284] .kernel_thread+0x4c/0x68
kmpathd/3     S 0000000000000000 14752   725     83 (L-TLB)
Call Trace:
[C000000001367C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000001367D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000001367E00] [C000000000091DF0] .worker_thread+0x128/0x1bc
[C000000001367EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000001367F90] [C000000000029284] .kernel_thread+0x4c/0x68
anaconda      D 000000000fc03a24  6624   742    318 (NOTLB)
Call Trace:
[C00000000207B2B0] [C00000000012D458] .unlock_buffer+0x30/0x44 (unreliable)
[C00000000207B480] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000207B510] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000207B610] [C000000000401E78] .io_schedule+0x58/0x9c
[C00000000207B6A0] [C0000000000C6B6C] .sync_page+0x7c/0x98
[C00000000207B720] [C000000000402094] .__wait_on_bit_lock+0x8c/0x110
[C00000000207B7C0] [C0000000000C6AA8] .__lock_page+0x70/0x90
[C00000000207B890] [C0000000000C7754] .do_generic_mapping_read+0x234/0x4e0
[C00000000207B9E0] [C0000000000C9EAC] .generic_file_aio_read+0x170/0x214
[C00000000207BAB0] [D00000000018B894] .nfs_file_read+0x11c/0x14c [nfs]
[C00000000207BB60] [C0000000000FFF18] .do_sync_read+0xc4/0x124
[C00000000207BCF0] [C000000000100978] .vfs_read+0x120/0x208
[C00000000207BD90] [C000000000101164] .sys_read+0x4c/0x8c
[C00000000207BE30] [C0000000000087C8] syscall_exit+0x0/0x40
anaconda      S 000000000fc099bc 11088   745    742 (NOTLB)
Call Trace:
[C00000003A2BF4E0] [C00000001E625950] 0xc00000001e625950 (unreliable)
[C00000003A2BF6B0] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000003A2BF740] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000003A2BF840] [C000000000401F3C] .schedule_timeout+0x3c/0xe8
[C00000003A2BF930] [C000000000112098] .do_sys_poll+0x2c8/0x434
[C00000003A2BFD50] [C00000000013E7D8] .compat_sys_ppoll+0x110/0x260
[C00000003A2BFE30] [C0000000000087C8] syscall_exit+0x0/0x40
kauditd       S 0000000000000000 13424   746     83 (L-TLB)
Call Trace:
[C00000000262FAC0] [C0000000006525A0] ioctl_start+0x3138/0x5730 (unreliable)
[C00000000262FC90] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000000262FD20] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000000262FE20] [C0000000000B3A9C] .kauditd_thread+0x160/0x1b0
[C00000000262FEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000000262FF90] [C000000000029284] .kernel_thread+0x4c/0x68
kjournald     S 0000000000000000  9952   955     83 (L-TLB)
Call Trace:
[C000000016623AC0] [C000000000131784] .bio_alloc_bioset+0xcc/0x174 (unreliable)
[C000000016623C90] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000016623D20] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000016623E20] [D000000000131594] .kjournald+0x1c8/0x26c [jbd]
[C000000016623EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000016623F90] [C000000000029284] .kernel_thread+0x4c/0x68
kjournald     S 0000000000000000 13824   957     83 (L-TLB)
Call Trace:
[C000000008457AC0] [0000000028000088] 0x28000088 (unreliable)
[C000000008457C90] [C0000000000111C4] .__switch_to+0x12c/0x160
[C000000008457D20] [C000000000401330] .schedule+0x8e0/0xa3c
[C000000008457E20] [D000000000131594] .kjournald+0x1c8/0x26c [jbd]
[C000000008457EE0] [C000000000096B0C] .kthread+0x124/0x174
[C000000008457F90] [C000000000029284] .kernel_thread+0x4c/0x68
syslogd       S 000000000fc0cedc  9824   965    742 (NOTLB)
Call Trace:
[C000000016627440] [000000000000633E] 0x633e (unreliable)
[C000000016627610] [C0000000000111C4] .__switch_to+0x12c/0x160
[C0000000166276A0] [C000000000401330] .schedule+0x8e0/0xa3c
[C0000000166277A0] [C000000000401F3C] .schedule_timeout+0x3c/0xe8
[C000000016627890] [C000000000112678] .do_select+0x410/0x4b0
[C000000016627C10] [C00000000013BFF0] .compat_core_sys_select+0x180/0x244
[C000000016627D00] [C00000000013E490] .compat_sys_select+0xd0/0x190
[C000000016627DC0] [C000000000017654] .ppc32_select+0x14/0x28
[C000000016627E30] [C0000000000087C8] syscall_exit+0x0/0x40
pdflush       S 0000000000000000 10704   969     83 (L-TLB)
Call Trace:
[C0000000392E3AA0] [D00000000058CEAC] .dm_get_table+0x48/0x68 [dm_mod] (unrelia)
[C0000000392E3C70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C0000000392E3D00] [C000000000401330] .schedule+0x8e0/0xa3c
[C0000000392E3E00] [C0000000000D05C4] .pdflush+0xfc/0x26c
[C0000000392E3EE0] [C000000000096B0C] .kthread+0x124/0x174
[C0000000392E3F90] [C000000000029284] .kernel_thread+0x4c/0x68
pdflush       S 0000000000000000  9504   989     83 (L-TLB)
Call Trace:
[C00000002C0AFAA0] [C00000002C0AFB50] 0xc00000002c0afb50 (unreliable)
[C00000002C0AFC70] [C0000000000111C4] .__switch_to+0x12c/0x160
[C00000002C0AFD00] [C000000000401330] .schedule+0x8e0/0xa3c
[C00000002C0AFE00] [C0000000000D05C4] .pdflush+0xfc/0x26c
[C00000002C0AFEE0] [C000000000096B0C] .kthread+0x124/0x174
[C00000002C0AFF90] [C000000000029284] .kernel_thread+0x4c/0x68
printk: 10 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 7 messages suppressed.
Spider RX RAM full, incoming packets might be discarded!
printk: 10 messages suppressed.
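
In a dump this long, the one task in uninterruptible sleep (state D) is easy
to miss. A small Python sketch that picks such tasks out of a sysrq task
dump, assuming the header layout seen above: name, state, PC, then three
numbers taken here to be free stack, PID and parent PID. The function name
is hypothetical, and the sample lines are copied from this dump with wrapped
lines rejoined:

```python
import re

# Task header lines and a few trace lines taken from the dump above.
SAMPLE = """\
jfsSync       S 0000000000000000 14800   668     83 (L-TLB)
Call Trace:
[C00000000130FEE0] [C000000000096B0C] .kthread+0x124/0x174
anaconda      D 000000000fc03a24  6624   742    318 (NOTLB)
Call Trace:
[C00000000207BAB0] [D00000000018B894] .nfs_file_read+0x11c/0x14c [nfs]
anaconda      S 000000000fc099bc 11088   745    742 (NOTLB)
syslogd       S 000000000fc0cedc  9824   965    742 (NOTLB)
pdflush       S 0000000000000000 10704   969     83 (L-TLB)
"""

# Assumed column order: name, state, PC, free stack, PID, parent PID, flags.
HEADER = re.compile(
    r"^(?P<name>\S+)\s+(?P<state>[A-Z])\s+(?P<pc>[0-9a-fA-F]{16})"
    r"\s+(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)\s+\((?P<flags>[^)]+)\)\s*$"
)


def tasks_in_state(text, state="D"):
    """Return [(task name, pid)] for every task header in the given state."""
    found = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match and match.group("state") == state:
            found.append((match.group("name"), int(match.group("pid"))))
    return found


if __name__ == "__main__":
    print(tasks_in_state(SAMPLE))
```

Run against the dump above, the only task reported is anaconda (PID 742),
whose call trace passes through nfs_file_read.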

Comment 71 IBM Bug Proxy 2007-04-29 20:35:51 UTC

Created attachment 153747 [details]
33869-xmon.txt

Comment 72 IBM Bug Proxy 2007-04-29 20:36:08 UTC

----- Additional Comments From mlui.com  2007-04-29 16:31 EDT -------
 
Xmon output from another hang; it seems to have more information than the last
one.

Comment 73 IBM Bug Proxy 2007-04-30 03:00:33 UTC

----- Additional Comments From mlui.com  2007-04-29 22:54 EDT -------
Did another experiment. Created a 90MB file on lewis7 and made it available 
for download via HTTP to an installed system. Here are the results:
Installed Cell  eth0      eth1
test4           fail      pass
Rawhide0427     fail      pass
CellBE          pass      pass
Failed tests all stopped copying the file and printed out "Spider RX RAM full" 
errors.

Note that supposedly rawhide includes all the spidernet patches that are also 
in CellBE kernel.

Comment 74 Robbie Williamson 2007-04-30 15:25:44 UTC

"Obviously there's not a lot we can do to help with this until our QS2[01]
hardware arrives. Closing UPSTREAM for now." What does this mean?  There is a
QS20 at the Westford, MA site, with a QS21 soon to ship...are you talking about
some other Red Hat site?

Comment 75 David Woodhouse 2007-04-30 15:31:24 UTC

Ah, I apologise then; I must seek out details of access to it. I thought we only
had one older DD2 machine in Westford, which wasn't expected to work with
current kernels. Janice? (Sorry, this might not be the first time you're telling
me this).

Comment 76 David Woodhouse 2007-04-30 15:33:09 UTC

How much use do these machines get in Westford, btw? Is that where the QS21 is
going too? The PPC maintainers for RHEL and Fedora are in Cambridge -- having
one here might be useful if it's possible.

Comment 77 Robbie Williamson 2007-04-30 16:02:19 UTC

Hmmm...you may be correct about the DD2 level.  I will look at sending a GA
blade along with the QS21 (planned to ship late this week/early next).  I'm not
sure how much use they get in Westford, but we also need RHEL 5.1 support, which
is the reason why I believe they are shipped there.

Comment 78 IBM Bug Proxy 2007-04-30 18:05:28 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|ship issue                  |high




------- Additional Comments From robbiew.com  2007-04-30 14:01 EDT -------
Based on IBM bugzilla Comment #126, I'm lowering this bug to "high".  By LTC
standards, a "high" bug is one that would otherwise be "block" or "ship-issue",
but which has a valid workaround that allows testing to continue until the bug
is resolved. The install works on eth1, so there's a suitable workaround available.

Comment 79 IBM Bug Proxy 2007-04-30 18:30:41 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|high                        |ship issue




------- Additional Comments From mlui.com  2007-04-30 14:24 EDT -------
Great, Robbie, we are using definitions of bugs ;)  So my opinion is ship.
Ship is the second highest severity level. Bugs that are preventing progress on 
some tests or that only occur on a single architecture or platform, but that 
are not preventing those systems from being used for other testing. Bugs that 
still must be fixed before an IBM product or Distro version is released

Comment 80 IBM Bug Proxy 2007-04-30 21:45:48 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-04-30 17:39 EDT -------
Earlier today, Lucas reproduced the install hang.  He popped it into
the xmon debugger with a ctrl-o. So:

1) If ctrl-o works, then the kernel is not really hung, in that it is
   alive enough to be able to listen to the ctrl-o coming in on the
   hvc console.

2) Poking around, it quickly became clear that all four CPUs were
   idle. (will attach log)

Exiting from xmon results in the system continuing to print the
"Spider RX RAM full, incoming packets might be discarded!"
messages. Getting back into xmon a while later still shows
all CPUs idle.

Conclusion: the kernel is not hung. Since the spidernet hardware
is generating the RX RAM full interrupts, one must conclude that
either:

a) the kernel is not calling the spidernet device driver fast enough
   to drain the RX ring, or
b) the spidernet device driver is failing to drain the RX ring.

I'm preparing a patch right now to distinguish between these two.
It's a simple patch: just a printk in the spider_net_poll()
routine, to verify that it's being called, how often it's
called, and how many packets it's handling on each call.
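
Why counting poll calls and packets per call separates cases (a) and (b) can
be seen in a toy model. The Python sketch below is not the spidernet driver;
the function, its parameters and all the numbers are invented purely for
illustration. It models a fixed-size RX ring that fills at a constant rate
and is drained by a poll routine that runs every `poll_every` ticks and
handles at most `drained_per_poll` packets:

```python
def simulate(ticks, ring_size, arrivals_per_tick, poll_every, drained_per_poll):
    """Toy RX ring model; returns (polls, packets_handled, packets_dropped)."""
    queued = polls = handled = dropped = 0
    for tick in range(1, ticks + 1):
        # Packets arrive; anything that does not fit in the ring is lost.
        accepted = min(arrivals_per_tick, ring_size - queued)
        queued += accepted
        dropped += arrivals_per_tick - accepted
        # The poll routine runs only on every `poll_every`-th tick.
        if tick % poll_every == 0:
            polls += 1
            taken = min(queued, drained_per_poll)
            queued -= taken
            handled += taken
    return polls, handled, dropped


if __name__ == "__main__":
    scenarios = {
        "healthy": (1, 64),
        "(a) poll called too rarely": (100, 64),
        "(b) poll drains too little": (1, 1),
    }
    for label, (poll_every, drained_per_poll) in scenarios.items():
        polls, handled, dropped = simulate(1000, 256, 4, poll_every, drained_per_poll)
        print(f"{label}: polls={polls} handled={handled} dropped={dropped}")
```

In this model both failure cases overflow the ring, so the "RX RAM full"
symptom alone cannot tell them apart; the number of poll calls and the
packets handled per call can.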

Comment 81 IBM Bug Proxy 2007-05-01 21:05:50 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-05-01 16:59 EDT -------
It seems that bug 34298 offers a much simpler way of reproducing this same
network problem, without having to go through an install. I can reproduce
the problem there, and so will be focusing my attention there.

Comment 82 IBM Bug Proxy 2007-05-02 00:10:32 UTC

----- Additional Comments From linas.com (prefers email at linas.com)  2007-05-01 20:04 EDT -------
Patches that should fix this problem have been posted to bug 34298.
These are not quite the final patches (they are more verbose than
they need to be), but they should resolve the problem.

Please test. Assuming this tests well, I'll post final patches
on Friday (4 May) or early the next week.

Comment 83 David Woodhouse 2007-05-02 06:41:22 UTC

(In reply to comment #82)
> Please test. Assuming this tests well, I'll post final patches
> on Friday (4 May) or early the next week.

DaveJ? How does that timeframe suit you for Fedora 7?

Comment 84 IBM Bug Proxy 2007-05-02 12:55:38 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED




------- Additional Comments From lucasgf.com  2007-05-02 08:50 EDT -------
Breno and I will test this today and post the results asap

Comment 85 IBM Bug Proxy 2007-05-03 17:05:50 UTC

------- Additional Comments From joseferr.com  2007-05-03 12:59 EDT -------
(In reply to comment #142)
> Breno and I will test this today and post the results asap

We tried the new spidernet module with the patches, but it failed to copy files
over NFS via eth0. dmesg showed these messages:

got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
eth0: bad status, cmd_status=x40800009
buf_addr=x9f4a3080
buf_size=x00000980
next_descr_addr=x9f3f4f40
result_size=x00000600
valid_size=x000005ec
data_status=x70007b00
data_error=x02100000
which=121

And after a while:
nfs: server 192.168.2.4 not responding, still trying
nfs: server 192.168.2.4 not responding, still trying
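
For anyone comparing dumps like the one above across test runs, here is a
minimal sketch (not part of the driver or of the posted patches) that turns
the "name=xHEX" lines into integers and counts the DMAC restart messages.
The function names parse_descr_dump and count_dmac_restarts are hypothetical,
and the sample text is abbreviated from the dmesg output in this comment. The
sketch only extracts the values; it makes no claim about what each bit means.

```python
import re

# Abbreviated copy of the dmesg output quoted in this comment.
SAMPLE = """\
got invalid descriptor interrupt, restarting DMAC A.
got invalid descriptor interrupt, restarting DMAC A.
eth0: bad status, cmd_status=x40800009
buf_addr=x9f4a3080
buf_size=x00000980
next_descr_addr=x9f3f4f40
result_size=x00000600
valid_size=x000005ec
data_status=x70007b00
data_error=x02100000
which=121
"""

# Matches "name=xHEX" (hex without a leading 0) or "name=DECIMAL".
FIELD_RE = re.compile(r"(\w+)=(x[0-9a-fA-F]+|\d+)")


def parse_descr_dump(text):
    """Return a dict mapping each field name in the dump to an integer."""
    fields = {}
    for name, raw in FIELD_RE.findall(text):
        if raw.startswith("x"):
            fields[name] = int(raw[1:], 16)  # strip the "x", parse as hex
        else:
            fields[name] = int(raw)  # plain decimal, e.g. which=121
    return fields


def count_dmac_restarts(text):
    """Count the 'restarting DMAC A' messages in the log text."""
    return text.count("restarting DMAC A")


if __name__ == "__main__":
    dump = parse_descr_dump(SAMPLE)
    for name, value in dump.items():
        print(f"{name:16s} {value:#010x} ({value})")
    print("DMAC restarts:", count_dmac_restarts(SAMPLE))
```

Feeding the full dmesg text to both functions makes it easier to diff the
descriptor fields and restart counts between a patched and an unpatched module.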

Comment 86 IBM Bug Proxy 2007-05-18 03:55:21 UTC

----- Additional Comments From mlui.com  2007-05-17 23:49 EDT -------
Breno, since the patches from #34298 have been tested, could you please make 
sure they are upstream?  Thanks.

Comment 87 IBM Bug Proxy 2007-06-04 18:25:36 UTC

----- Additional Comments From joseferr.com  2007-06-04 14:20 EDT -------
Fedora 7 GA'd last Thursday. I tested the distro on our Cell QS20 machines and
found no problems. The system installed over both eth0 and eth1, using
NFS, FTP, and HTTP. I believe we can finally close this bug.
Thank you very much to everyone involved in this issue.

Comment 88 IBM Bug Proxy 2007-06-04 18:30:29 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED




------- Additional Comments From brenohl.com  2007-06-04 14:25 EDT -------
Thanks Joseph, 
Closing per previous comment.

Comment 89 Bug Zapper 2008-04-04 00:03:46 UTC

Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 90 Bug Zapper 2008-05-07 01:28:37 UTC

This bug has been in NEEDINFO for more than 30 days since feedback was
first requested. As a result we are closing it.

If you can reproduce this bug in the future against a maintained Fedora
version please feel free to reopen it against that version.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp
