On systems which have a /proc/device-tree directory, and where fwparam_ppc() thus is supposed to actually do something, it causes memory (probably stack) corruption, see: bug 490515
(In reply to comment #0)
> On systems which have a /proc/device-tree directory, and where fwparam_ppc()
> thus is supposed to actually do something, it causes memory (probably stack)
> corruption, see: bug 490515
In those setups, is iSCSI being used? Is NFS or some other network-based boot used?
(In reply to comment #1)
> (In reply to comment #0)
> > On systems which have a /proc/device-tree directory, and where fwparam_ppc()
> > thus is supposed to actually do something, it causes memory (probably stack)
> > corruption, see: bug 490515
>
> In those setups, is iscsi being used? Is nfs or some other net based boot used?
Adding jlaska to the CC; he should know. James, can you answer this please?
I've not yet tested any Open Firmware-enabled iSCSI systems. I believe these might be the QS22 (ps3 cell) blades, but I have yet to test this hardware. The systems I see this on all have local storage.
Created attachment 336106 [details] dont use strncat for ppc OF boot path setup
It looks like the strncat use in fwparam_ppc is not right. I am not sure this is the problem, but if fwparam_ppc were called multiple times I think it could be:
1. If fwparam_ppc is run multiple times, it looks like we would keep appending to the string left over from the previous call.
2. If filepath is not passed in (it is not by default), then it looks like we pass in the size of the buffer, because fplen is never set in that path.
This patch just converts the code to use snprintf. I also memset the buffer to make sure it is clear before each call.
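To illustrate the first point, here is a minimal sketch of the difference between appending with strncat and overwriting with snprintf on a buffer that outlives a single call. It is not the libiscsi code: path_buf, PATH_BUF_LEN and both function names are hypothetical, and the real buffer and call sites in fwparam_ppc may look different. On the second point, the third argument of strncat is the maximum number of characters to append, not the total size of the destination buffer, so the sketch subtracts the current length and the terminating NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_BUF_LEN 64 /* hypothetical size, for illustration only */

/* Hypothetical stand-in for a buffer that survives across calls. */
static char path_buf[PATH_BUF_LEN];

/* Appends to whatever is already in the buffer, so a second call
 * ends up with the text left behind by the first call in front. */
const char *build_path_strncat(const char *base)
{
    /* n is the remaining space, not the total buffer size. */
    strncat(path_buf, base, sizeof(path_buf) - strlen(path_buf) - 1);
    return path_buf;
}

/* Clears the buffer and formats from its start, so every call
 * produces the same result for the same input. */
const char *build_path_snprintf(const char *base)
{
    memset(path_buf, 0, sizeof(path_buf));
    snprintf(path_buf, sizeof(path_buf), "%s", base);
    return path_buf;
}
```

Calling build_path_strncat("/proc/device-tree") twice leaves the path doubled in the buffer, while build_path_snprintf returns the plain path on every call.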
jlaska, I've made a new updates.img containing a version of libiscsi that reverts my workaround and instead uses mchristie's patch. Can you give this one a try please? http://people.atrpms.net/~hdegoede/updates.img
Testing latest rawhide with updates=http://people.atrpms.net/~hdegoede/updates.img and connecting to VNC shows only a black screen.
(In reply to comment #6)
> Testing latest rawhide with
> updates=http://people.atrpms.net/~hdegoede/updates.img and connecting to VNC
> shows only a black screen.
mchristie: translated: your patch attached here does not fix the problem.
Created attachment 339207 [details] fix double close and unint var Maybe this is it. I found a double close in the ppc code. This fixes that and the uninitialized string.
Mike, I doubt this new patch is going to help. If you really want me to, I can create an updates.img with a new libiscsi with this patch in for jlaska to test, but I don't think that will get us anywhere. Closing an already closed fd will simply make the kernel return EBADF. If this were an fclose() then it might be the cause, but I seriously doubt that repeating a normal close() is causing this.
What you can do is try to create a simple test.c with a main which calls fwparam_ppc(), and run that through valgrind on ppc (do we have valgrind on ppc?).
jlaska, could you give us ssh access to this box with a recent F-11 install on it? Then I can poke around and see if I can reproduce this.
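The claim about close() is easy to check in isolation. The following is a small sketch, assuming a single-threaded process so that the descriptor number cannot be reused between the two calls; errno_of_second_close is a hypothetical helper name and /dev/null is just a convenient file to open.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens a descriptor, closes it twice, and reports what the second
 * close() did: the errno value on failure, 0 if it unexpectedly
 * succeeded, or -1 if the setup itself failed. */
int errno_of_second_close(void)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;
    if (close(fd) != 0)
        return -1;

    errno = 0;
    if (close(fd) == 0)
        return 0;   /* would mean the fd number was reused meanwhile */
    return errno;   /* EBADF: fd is not an open descriptor */
}
```

In a multi-threaded program a double close is less harmless, because another thread may have been handed the same descriptor number in between; that does not appear to be the situation discussed here.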
(In reply to comment #9)
> Mike, I doubt this new patch is going to help. If you really want me too I can
> create an updates.img with a new libiscsi with this patch in for jlaska to
> test, but I don't think that will get us anywhere. closing an already closed fd
> will simply make the kernel return EBADF. Now if this was an fclose() then this
> might be the cause, but repeating a normal close I seriously doubt causing this.
Yeah, not sure what I was thinking. I thought this had happened before.
> What you can do is try to create a simple test.c witha main which calls
> fwparam_ppc(), and run that through valgrind on ppc (do we have valgrind on
> ppc?)
I was actually running valgrind on iscsistart. I was only able to get a ppc box with RHEL 5 (the RHTS ppc Fedora installs kept failing at the time). I then ran the Fedora code. Let me try to get a ppc box with F-11 beta.
> jlaska, could you give us ssh access to this box with a recent F-11 install on
> it, then I can poke around and see if I can reproduce this.
Ok, I've been debugging this today with remote access to jlaska's IBM ppc machine (thanks James) and I've fixed this. The problem was twofold: strncat is used where strncpy (or better, snprintf) should be used, and the code uses global counters for dev_count and nic_count and doesn't reset them when fwparam_ppc() gets called a second time. To make things worse, it also used the dev_count global var as a local var in fwparam_ppc(), so that it got set to 1 even if there were no devices, triggering the bug even on machines without iSCSI. The trick to reproduce was to call fwparam_ppc() twice. The segfault then was a simple null pointer deref inside fwparam_ppc.c due to the wrong dev_count. I'll attach a patch which fixes this, and I'll build a fixed version for F-12; for F-11 it's better to stick with the workaround given how close we are to the release.
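The stale-counter part of this can be shown with a toy model. The sketch below is not the fwparam_ppc.c code: the dev_names table, MAX_DEVS, and the probe_stale and probe_reset functions are hypothetical, and only the name dev_count is borrowed from the description above. It shows how a file-scope counter that is never reset keeps growing across calls, and how resetting it on entry makes repeated calls give the same answer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEVS 8 /* hypothetical table size */

/* File-scope state that persists between calls. */
static const char *dev_names[MAX_DEVS];
static int dev_count;

/* Records `present` devices at the current end of the table. */
static void record_devices(int present)
{
    for (int i = 0; i < present && dev_count < MAX_DEVS; i++)
        dev_names[dev_count++] = "hypothetical-nic";
}

/* Problematic shape: the counter carries over from the previous call,
 * so the reported count no longer matches the hardware. */
int probe_stale(int present)
{
    record_devices(present);
    return dev_count;
}

/* Fixed shape: all global state is reset on entry. */
int probe_reset(int present)
{
    dev_count = 0;
    memset(dev_names, 0, sizeof(dev_names));
    record_devices(present);
    return dev_count;
}
```

With one device present, probe_stale reports 1 on the first call and 2 on the second, whereas probe_reset reports 1 every time. Code that trusts an inflated count and walks a table with fewer valid entries is the kind of path that can end in a null pointer dereference.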
Created attachment 341469 [details] PATCH fixing crash when parsing ppc firmware for a second time.
Mike, can you please make sure this patch gets applied to RHEL-5.4 too? Let me know if you need a bug for that. Given that I plan to use libiscsi in anaconda in 5.4, we really need this fix.
Here is an updates.img with a patched ppc build of libiscsi in it: http://people.atrpms.net/~hdegoede/updates-491363-ppc.img I would be much obliged if you could do a test install of F-11 on that IBM ppc machine. This removes the "do not try to read iscsi config from firmware on ppc" hack and replaces it with a proper fix.
I have tested this updates.img with latest rawhide (anaconda-11.5.0.47), and I am able to connect to the VNC server and proceed through a VNC install.
Closing.