Description of problem:
2.6.17-1.2365.fc6 hangs on a two-node x460. There are a couple of stack traces, and finally this message shows up:
calgary_watchdog: DMA error on bus 0, CSR = 0x2010000
Version-Release number of selected component (if applicable):
2.6.17-1.2365.fc6
How reproducible:
Install the 2.6.17-1.2365.fc6 kernel.
Actual results:
Expected results:
Additional info:
Created attachment 132239
Serial output showing the problem.
Looks like a Calgary bug indeed:
<Jul/11 10:28 am>Starting udev: Unable to handle kernel NULL pointer dereference at 0000000000000030 RIP:
<Jul/11 10:28 am> [<ffffffff80283c2f>] calgary_alloc_coherent+0x3c/0xfb
Can you please attach your .config? As a workaround, you can boot with the command-line argument "iommu=off" until we fix this one. Thanks for reporting!
Created attachment 132310
Config file used to build the 2.6.17-1.2365.fc6 kernel.
If you want to test this kernel (please keep in mind that the FC/RHEL5 kernel also carries some extra patches on top of the 2.6.17+2.6.18-rc1-git3 patch set), you can download the source RPM from: ftp://ftp.tu-chemnitz.de/pub/linux/fedora-core/development/source/SRPMS/ There are also other mirror sites that are much faster; please take a look at: http://fedora.redhat.com/Download/mirrors.html. If the problem is due to some patches being screwed up, please report this here with an explanation of what needs to be done to correct the problem. Thanks!
Thanks Konrad, I was looking for the SRPM earlier today. We'll give it a spin on our test machine; if it's reproducible there, so much the better. We're also trying to get some time on a dual-node x460 and will keep you updated.
Created attachment 132316
Sysreport
I am also attaching a sysreport (a tarball of /proc, /etc, dmidecode output, etc.) so that you can check whether your machine mirrors mine, in case this is a configuration issue. Please note that I have HotPlug Memory enabled in the BIOS.
Ok, I think I know what's going on. This machine violates a rule we counted on: that the PCI bus number determines the Calgary PHB (which was a total hack, but worked on every machine we had tried it on so far). You have PCI buses 0xe and 0xf, for which bus_to_phb() returns the same result. It should return different results, since they are different PHBs. As a result, one of the buses does not get initialized (in init_one()/init_one_nontranslated()), so dev->bus->self is never set, which leads to the NULL pointer dereference. Fixing this is a good chance to redo this logic so that it stops relying on any relationship between PCI bus numbers and Calgary/PHB-per-Calgary. We're working on it.
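To illustrate the failure mode, here is a minimal C sketch. The heuristic `heuristic_bus_to_phb()` (assuming two consecutive bus numbers per PHB) is a hypothetical stand-in, not the actual pci-calgary.c code, but it collides on buses 0xe and 0xf in the same way. The table-based alternative (`bus_map_record()`/`bus_map_lookup()`, also hypothetical names) records the PHB for each bus during detection instead of deriving it from the bus number.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PCI_BUSES 256
#define NO_PHB (-1)

/* Hypothetical arithmetic heuristic (NOT the real driver code): assume two
 * consecutive PCI bus numbers per PHB and derive the PHB from the bus number. */
static int heuristic_bus_to_phb(unsigned int busno)
{
	return (int)(busno / 2);
}

/* True when two distinct buses map to the same PHB, i.e. the situation
 * that leaves one of them uninitialized. */
static bool heuristic_collides(unsigned int a, unsigned int b)
{
	return a != b && heuristic_bus_to_phb(a) == heuristic_bus_to_phb(b);
}

/* Hypothetical explicit table: filled in during hardware detection, so no
 * relationship between bus numbers and PHBs is assumed. */
static int bus_to_phb_map[MAX_PCI_BUSES];

static void bus_map_init(void)
{
	for (int i = 0; i < MAX_PCI_BUSES; i++)
		bus_to_phb_map[i] = NO_PHB;
}

/* Record the PHB actually found behind a bus during detection. */
static void bus_map_record(unsigned int busno, int phb)
{
	assert(busno < MAX_PCI_BUSES);
	bus_to_phb_map[busno] = phb;
}

/* Look up the recorded PHB; NO_PHB if nothing was recorded. */
static int bus_map_lookup(unsigned int busno)
{
	return busno < MAX_PCI_BUSES ? bus_to_phb_map[busno] : NO_PHB;
}
```

With the heuristic, buses 0xe and 0xf both yield PHB 7, so one of them would be treated as already set up; with the table, each bus returns whatever PHB detection recorded for it.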
Created attachment 132472
Mainline patch to fix this bug
I was able to get access to an x460 and reproduce this problem with a mainline kernel. Attached is a patch against mainline that fixes the issue on my system. I am in the process of verifying the fix against the fc6 kernel you linked above.
There were many issues that had to be resolved to fix this problem.
Firstly, when I originally wrote the code to handle NUMA systems, I had a large misunderstanding that was not corrected until now: I thought the "number of nodes online" referred to the number of physical systems connected, so that if NUMA was disabled there would be only 1 node and only that node's PCI bus would be visible. In reality, if NUMA is disabled, the system still sees all of the connected systems; it is only unaware of the non-uniform delays in accessing main memory. Therefore, the references to num_online_nodes() and MAX_NUMNODES are incorrect and need to be replaced with the maximum number of nodes that can be accessed (which is 4). I created a variable, CAL_MAX_NODES, and set it to 4 to fix this.
Secondly, when walking the PCI buses in detect_calgary(), the code only checked the first "slot" when looking to see whether a device is present. This works in most cases, but unfortunately not always: in the NUMA MXE drawers, there are USB devices present in the 3rd slot (with slot 1 being empty). To work around this, all slots (up to 8) are now scanned to see if any devices are present.
Lastly, on large systems the buses are enumerated differently than we originally thought, which throws the ugly logic we had out the window. To handle this more elegantly, I reorganized the kva array to be sparse (which removed the need for any bus-number-to-kva-slot logic in tce.c) and created a secondary sparse array to contain the bus number to PHB mapping. This will enable us to remove the translation_disabled bitmap in the future.
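The second and third fixes can be sketched in C as follows. This is not the code from the attached patch; it assumes a hypothetical `fake_read_vendor()` standing in for a PCI config-space read of the vendor ID (0xffff is what an empty slot reads back), modelling the MXE drawer case with an empty first slot and a device in the third. `kva_by_bus[]` is a hypothetical name for a sparse per-bus table indexed directly by bus number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PCI_BUSES 256
#define SLOTS_TO_SCAN 8          /* "all slots (up to 8)" */
#define PCI_VENDOR_NONE 0xffffu  /* vendor ID read back from an empty slot */

/* Hypothetical reader type: returns the vendor ID of function 0 in a slot. */
typedef unsigned int (*vendor_reader)(unsigned int bus, unsigned int slot);

/* Hypothetical model of the MXE drawer case: slots are numbered from 0,
 * the first slot (0) is empty and a device sits in the third slot (2). */
static unsigned int fake_read_vendor(unsigned int bus, unsigned int slot)
{
	(void)bus;
	return slot == 2 ? 0x1234u : PCI_VENDOR_NONE; /* 0x1234: arbitrary ID */
}

/* Old behaviour: only the first slot is probed. */
static bool bus_has_device_first_slot(vendor_reader read_vendor, unsigned int bus)
{
	return read_vendor(bus, 0) != PCI_VENDOR_NONE;
}

/* Fixed behaviour: every slot is probed before the bus is declared empty. */
static bool bus_has_device_any_slot(vendor_reader read_vendor, unsigned int bus)
{
	for (unsigned int slot = 0; slot < SLOTS_TO_SCAN; slot++) {
		if (read_vendor(bus, slot) != PCI_VENDOR_NONE)
			return true;
	}
	return false;
}

/* Sparse table indexed directly by bus number; NULL means "no TCE table
 * for this bus". Static storage starts out all NULL. */
static void *kva_by_bus[MAX_PCI_BUSES];

static void kva_set(unsigned int bus, void *tce_table)
{
	assert(bus < MAX_PCI_BUSES);
	kva_by_bus[bus] = tce_table;
}

static void *kva_get(unsigned int bus)
{
	return bus < MAX_PCI_BUSES ? kva_by_bus[bus] : NULL;
}
```

Probing only slot 0 reports the drawer's bus as empty, while the full scan finds the device in slot 2. Indexing the table by bus number means an unused entry simply stays NULL, so no bus-number-to-slot arithmetic is required.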
With these changes, Calgary boots on an x460 with 4 nodes, both with and without NUMA enabled.
The patch above applies to the fc6 kernel with minimal fuzz (-2 lines). Testing on this tree shortly. Also, a better workaround for this problem is "iommu=soft", as "iommu=off" will cause problems on systems with more than 4G of RAM.
Jon, the x460 can be an 8-node machine. Perhaps CAL_MAX_NODES should be set to 8 instead of 4?
I verified on the IBM website that there is a maximum of 8 chassis for the x460. I guess that is what happens when you ask someone instead of checking the web. Thanks Konrad, I'll fix that in the next version of the patch (to be posted shortly).
Created attachment 132566
Patch against the fc6 kernel
This patch should fix the issue. Please test and verify it works on your system.
Patch works splendidly.
Patch posted on the internal reflector for inclusion in RHEL5.
Patch included into mainline as of 2.6.18-rc3. Bug can probably be closed now?
Muli, not yet. Dave Jones/Don Zickus need to close this bug when they put the patch in the FC/RHEL kernel.
Sorry, patch was picked up and then dropped b/c rc3 has it. -Don