Bug 476257 - iSCSI root installs fine but doesn't boot
Status: CLOSED WORKSFORME
Product: Fedora
Classification: Fedora
Component: mkinitrd
Version: rawhide
Hardware: All Linux
Priority: low
Severity: medium
Assigned To: Peter Jones
QA Contact: Fedora Extras Quality Assurance
Reported: 2008-12-12 13:40 EST by Dennis Jacobfeuerborn
Modified: 2008-12-30 07:54 EST
Doc Type: Bug Fix
Last Closed: 2008-12-30 07:54:49 EST

Attachments
lspci output (1.50 KB, text/plain)
2008-12-29 17:06 EST, Dennis Jacobfeuerborn
contents of "init" file (2.31 KB, text/plain)
2008-12-29 17:08 EST, Dennis Jacobfeuerborn

Description Dennis Jacobfeuerborn 2008-12-12 13:40:51 EST
I reported this first on the fedora-test list to get some feedback so I will copy&paste the relevant information here:
(original thread: https://www.redhat.com/archives/fedora-test-list/2008-December/msg00351.html )

So I'm experimenting with iSCSI and everything went fine until I ran into a problem. I set up a 2 GB file as an iSCSI target on my desktop and tested it locally, which worked perfectly fine. Then I started a network install on my laptop, chose the advanced storage configuration option, specified the target on my desktop machine, and proceeded with the install. Then the system rebooted and now I get:

iscsistart: Logging into 4-04.org.netbsd.iscsi-target:target0 192.168.2.100:3260,1
iscsistart: cannot make connection to 192.168.2.100:3260 (-1,101)
iscsistart: initiator reported error (4 - encountered connection failure)

I ran tcpdump on port 3260 on my desktop machine (192.168.2.100) but don't see any packets from my laptop (192.168.2.101). There also don't seem to be any timeouts or retries. The kernel immediately fails, unable to find a suitable root device.
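
A possible reading of the "(-1,101)" pair, assuming the second number is the errno from the failed connect() call (an assumption about iscsistart's log format that this report does not confirm): on Linux, errno 101 is ENETUNREACH, "Network is unreachable", which the kernel returns when it has no route to the destination. That would be consistent with no packets showing up in tcpdump. The helper names below are hypothetical; this is only a sketch for decoding such a line:

```shell
# Map a few Linux errno values that connect() commonly returns
# to their symbolic names (values from the generic Linux errno table).
errno_name() {
    case "$1" in
        101) echo ENETUNREACH ;;   # Network is unreachable
        110) echo ETIMEDOUT ;;     # Connection timed out
        111) echo ECONNREFUSED ;;  # Connection refused
        113) echo EHOSTUNREACH ;;  # No route to host
        *)   echo "errno-$1" ;;    # anything not listed above
    esac
}

# Pull the second number out of a trailing "(rc,errno)" pair and name it.
# ASSUMPTION: the second number is errno; returns 1 if no pair is found.
decode_iscsi_error() {
    code=$(printf '%s\n' "$1" |
        sed -n 's/.*(-\{0,1\}[0-9]*,\([0-9][0-9]*\))[[:space:]]*$/\1/p')
    [ -n "$code" ] || return 1
    errno_name "$code"
}

decode_iscsi_error 'iscsistart: cannot make connection to 192.168.2.100:3260 (-1,101)'
```

If that reading is right, the failure happens before any packet is sent, which points at interface or route setup in the initrd and away from the target.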

Since I'm new to the iscsi stuff I'm wondering if anyone has an idea what the problem could be or what component I should file a bug for. 

<SNIP>

I also added a "sleep 5" right after the initialisation of eth0 in the initrd in case the interface isn't properly initialized before the iscsi init and added a "-d 5" to the iscsistart call to get more verbose output:

...
echo Bringing up eth0
network --device eth0 --bootproto dhcp
sleep 5
echo Attaching to iSCSI storage
/bin/iscsistart -t iqn.1994-04.org.netbsd.iscsi-target:target0 -i iqn.2005-03.com.max:01.82ebd2 -d 5 -g 1 -a 192.168.2.100
modprobe scsi_wait_scan
rmmod scsi_wait_scan
...

Unfortunately the debug output doesn't really add any information and the connection still simply "fails".

Lastly I booted the rescue cd on the laptop and executed the iscsistart line just as it appears in the initrd. The result of that is a new scsi device "sdb" popping up in dmesg just as it should.

So I think I can definitely rule out any problems on the server side as the install worked fine and iscsistart from the rescue cd works fine too.

What bugs me is that iscsistart doesn't seem to try very hard to get to the server. There seems to be no timeout period and no retries. I wonder if that means iscsistart fails to reach the network interface at all, failing so hard that it simply doesn't think a retry is worth the effort.
Comment 1 Charlie Moschel 2008-12-12 13:56:25 EST
(In reply to comment #0)
> 
> I also added a "sleep 5" right after the initialisation of eth0 in the initrd
> in case the interface isn't properly initialized before the iscsi init and
> added a "-d 5" to the iscsistart call to get more verbose output:
> 
> ...
> echo Bringing up eth0
> network --device eth0 --bootproto dhcp
> sleep 5
> echo Attaching to iSCSI storage
> /bin/iscsistart -t iqn.1994-04.org.netbsd.iscsi-target:target0 -i
> iqn.2005-03.com.max:01.82ebd2 -d 5 -g 1 -a 192.168.2.100
> modprobe scsi_wait_scan
> rmmod scsi_wait_scan
>

I think you are on the right track.  There are several BZ where scsi_wait_scan was not used when it should have been, and a new mkinitrd in updates-testing addresses that.  But here it seems to be used, hmm ..

Can you try moving the modprobe scsi_wait_scan before /bin/iscsistart?  You would have to unpack your initrd and modify init, but I guess you know that ...

Alternatively, you could try adding "scsi_mod.scan=sync" to the kernel command line.
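
For anyone following along, unpacking and repacking the initrd can be sketched roughly as below. This assumes the image is a gzip-compressed cpio archive in "newc" format and that GNU cpio is available; the function names and the paths in the usage comments are placeholders, not taken from this report:

```shell
# Unpack a gzip-compressed cpio initrd image into a directory.
unpack_initrd() {
    img=$1
    dir=$2
    mkdir -p "$dir" || return 1
    # gzip runs in the caller's directory; only cpio runs inside $dir.
    gzip -dc "$img" | (cd "$dir" && cpio -id --quiet)
}

# Repack a directory tree into a gzip-compressed newc cpio image.
# Keep the output image outside the directory being packed.
repack_initrd() {
    dir=$1
    img=$2
    (cd "$dir" && find . | cpio -o -H newc --quiet) | gzip -9 > "$img"
}

# Hypothetical usage (placeholder paths):
#   unpack_initrd /boot/initrd-<version>.img /tmp/initrd
#   ... edit /tmp/initrd/init, e.g. move the modprobe line ...
#   repack_initrd /tmp/initrd /boot/initrd-<version>-test.img
```

Writing the modified image under a new name and pointing a separate boot entry at it keeps the original initrd available as a fallback.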
Comment 2 Dennis Jacobfeuerborn 2008-12-12 16:56:22 EST
I added the scsi_mod.scan=sync to the kernel command line and moved the modprobe scsi_wait_scan before the iscsistart line but I still get the same error.

Given the error message I'm worried that "network --device eth0 --bootproto dhcp" doesn't actually bring up the interface properly. Unfortunately the network command doesn't appear in the nash manpage.

Is there a way to set up the interface with a static address? Is there a way to find out whether the interface is actually working right after the network command brings it up (by pinging a second IP on the network, for example)?
Comment 3 Charlie Moschel 2008-12-12 20:59:46 EST
(In reply to comment #2)
>Unfortunately the network command doesn't appear in the nash manpage.

Use the source, Luke :)  Nash is built from the mkinitrd srpm, and this is what the network command should understand:
struct poptOption netOptions[] = {
        { "bootproto", '\0', POPT_ARG_STRING, &bootProto, 0, NULL, NULL },
        { "device", '\0', POPT_ARG_STRING, &dev, 0, NULL, NULL },
        { "dhcpclass", '\0', POPT_ARG_STRING, &dhcpclass, 0, NULL, NULL },
        { "dns", '\0', POPT_ARG_STRING, &dns, 0, NULL, NULL },
        { "domain", '\0', POPT_ARG_STRING, &domain, 0, NULL, NULL },
        { "gateway", '\0', POPT_ARG_STRING, &gateway, 'g', NULL, NULL },
        { "ip", '\0', POPT_ARG_STRING, &ip, 'i', NULL, NULL },
        { "nameserver", '\0', POPT_ARG_STRING, &nameserver, 'n', NULL, NULL },
        { "netmask", '\0', POPT_ARG_STRING, &netmask, 'm', NULL, NULL },
        { "ethtool", '\0', POPT_ARG_STRING, &ethtool, 0, NULL, NULL },
        { "mtu", '\0', POPT_ARG_INT, &mtu, 0, NULL, NULL },
        { "hostname", '\0', POPT_ARG_STRING, &hostname, 0, NULL, NULL },
        { NULL, 0, 0, NULL, 0, NULL, NULL }
};

(ethtool is disabled later on though)
> 
> Is there a way to setup the interface with a static address? Is there a way to

You should be able to: Set at least ip, netmask & mtu, and be sure to remove the bootproto option.

> find out if that interface is actually working properly right after the network
> command brought it up (by issuing a ping to a second IP on the network for
> example)?

There is some logging within the nash network command, but I'm not sure where it goes or how to set the debug level.

Can you see if there are any DHCP requests coming from your laptop?  It seems that the network command in nash still wants to use pump to get an IP address, but mkinitrd / nash has been changed over to libdhcp?
Comment 4 Dennis Jacobfeuerborn 2008-12-12 23:20:05 EST
So I changed the network line to:
network --device eth0 --bootproto static --ip 192.168.2.101 --gateway 192.168.2.1 --mtu 1500 --hostname test1

...without any effect. I still get the same error.

A tcpdump on my desktop still doesn't show a single packet from the laptop. I double-checked by booting the rescue cd and executing the iscsistart command again: there it works fine, I can see the traffic in tcpdump, and the new sdb device pops up as it should.

So it really looks like the interface isn't brought up properly (it uses the e100 driver btw).

Another thing: when I first changed the network line I included a "--netmask 255.255.255.0" in there. The result was a segfault in libnl.so. The offending bit in the nash code is in network.c, in the function nashSetupInterface():

...
        if (prefix != -1) {
            rtnl_addr_set_prefixlen(addr, prefix);
        }
...

The rtnl_addr_set_prefixlen() call causes the segfault.
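
For context, the prefix passed to rtnl_addr_set_prefixlen() is a CIDR prefix length, so a netmask of 255.255.255.0 corresponds to 24. How nash itself derives prefix from --netmask is not shown in this report; the following shell sketch only illustrates the conversion, and mask2prefix is a hypothetical helper name:

```shell
# Convert a dotted-quad netmask to a CIDR prefix length.
# Prints the prefix and returns 0; returns 1 for a malformed
# or non-contiguous mask.
mask2prefix() {
    prefix=0
    seen_partial=0          # becomes 1 once an octet other than 255 is seen
    old_ifs=$IFS
    IFS=.
    set -- $1               # split the mask into its four octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        case "$octet" in
            255) bits=8 ;; 254) bits=7 ;; 252) bits=6 ;; 248) bits=5 ;;
            240) bits=4 ;; 224) bits=3 ;; 192) bits=2 ;; 128) bits=1 ;;
            0)   bits=0 ;;
            *)   return 1 ;;
        esac
        # After a non-255 octet, every remaining octet must be 0.
        if [ "$seen_partial" -eq 1 ] && [ "$bits" -ne 0 ]; then
            return 1
        fi
        [ "$bits" -eq 8 ] || seen_partial=1
        prefix=$((prefix + bits))
    done
    echo "$prefix"
}

mask2prefix 255.255.255.0   # prints 24
```

With a /24 prefix, 192.168.2.101 and the target at 192.168.2.100 are on the same subnet, so no gateway should be needed to reach it.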
Comment 5 Hans de Goede 2008-12-28 07:07:18 EST
I just tried to reproduce this, but for me it works fine. Can you please attach the output of "lspci" on your machine, as well as the full "init" file from the initrd?

I think we are not loading your network card driver for some reason. Or maybe it needs firmware?
Comment 6 Dennis Jacobfeuerborn 2008-12-29 17:06:55 EST
Created attachment 327933
lspci output
Comment 7 Dennis Jacobfeuerborn 2008-12-29 17:08:14 EST
Created attachment 327934
contents of "init" file
Comment 8 Hans de Goede 2008-12-30 05:36:10 EST
Hmm, you have only one wired network card, the eepro100, and the driver does get loaded. You are not trying to do this over wireless, are you?
Comment 9 Dennis Jacobfeuerborn 2008-12-30 07:28:00 EST
Nope, just the regular wired connection.
Comment 10 Hans de Goede 2008-12-30 07:54:49 EST
Hmm,

Bummer, I have no further ideas on how to fix or investigate this. Since I've failed to reproduce it in my own testing, I'm going to close this.

If you find out anything more about this, please tell!
