Bug 705836

Summary: RHNS-scheduled kickstart generates SIGSEGV
Product: Red Hat Enterprise Linux 5
Reporter: David Barr <dafydd>
Component: anaconda
Assignee: Anaconda Maintenance Team <anaconda-maint-list>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Release Test Team <release-test-team>
Severity: medium
Priority: unspecified
Version: 5.4
CC: mzazrivec
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-20 17:51:36 UTC Type: ---
Attachments:
Description Flags
diff -c ksNormal.txt ksHost.txt none

Description David Barr 2011-05-18 16:33:04 UTC
+++ Description of problem:
When a host checks in and receives an instruction to kickstart itself, it successfully generates a new grub entry, with initrd.img and vmlinuz. When the host reboots and starts its installation, the loader crashes with a segmentation violation (SIGSEGV).


+++ Version-Release number of selected component (if applicable):
spacewalk-koan-0.2.7-7.el5sat, assuming this package is the source of the problem.


+++ How reproducible:
Five out of six attempts, so far.


+++ Steps to Reproduce:
1. Bare-metal kickstart a system.
2. Schedule a kickstart via RHNS for that system using the same ks profile.
3. Wait for the system to check in and receive its request to reinstall.

  
+++ Actual results:
For five out of six attempts, the following output is generated:

[CODE]
Loader received SIGSEGV! Backtrace:
[0x400c83]
[0x504550]
[0x51f990]
[0x417990]
[0x41fcc8]
[0x41fdf0]
[0x42027f]
[0x41f773]
[0x416e01]
[0x417445]
Install exited abnormally [1/1]
Sending termination signals...done
Sending kill signals...done
Disabling swap...
Unmounting filesystems...
       /proc/bus/usb done
       /proc done
       /dev/pts done
       /sys done
       /tmp/ramfs done
You may safely reboot your system
[/CODE]

This stack trace is verified consistent across the last three attempts.

On the sixth attempt, the installation appeared to be successful, but only a small number of packages were actually installed, and the post-install reboot failed. I didn't explore the results in detail. If it happens again, I'll look deeper into it.


+++ Expected results:
Successful software installation and reboot.


+++ Additional info:
I've had problems kickstarting in this environment already, but not like
this. Those problems involved detecting multipath to an IBM SAN, and waiting
long enough for scsi to get those paths set up before multipath combines
them and LVM goes to look for volumes on them. However, in this case, the
collapse appears to be occurring before the boot gets to that point
in its discovery process.

For context the first problem is bug 570460, which I solved by putting
the manual mkinitrd workaround in a %post script:

[CODE]
/sbin/mkinitrd -v \
--with=dm-mod \
--with=dm-multipath \
--with=dm-round-robin \
--with=scsi_dh_rdac \
\${INITRD} \${KERNELVER}
[/CODE]

The second problem is that the scsi device discovery was taking longer
than the timeout allowed for. That resulted in multipath and LVM trying to
work with devices that weren't there yet. That was fixed by modifying
mkinitrd itself to change the timeout when it generated the init file in
initrd-*.img:

[CODE]
cat <<EOT | patch /sbin/mkinitrd
900c900
<         emit "stabilized --hash --interval 1000 /proc/scsi/scsi"
---
>         emit "stabilized --hash --interval 3000 /proc/scsi/scsi"
EOT
[/CODE]
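
A quick way to confirm whether a given copy of mkinitrd carries the longer timeout is to read the interval back out of the emit line. This is only a sketch: the function name stabilized_interval is made up for illustration, and it assumes the emit line has the same form as in the patch above.

```shell
# Print the "stabilized" interval (in the units mkinitrd passes to nash)
# found in the given mkinitrd script. Prints nothing if no such line exists.
# stabilized_interval is a hypothetical helper name, not part of mkinitrd.
stabilized_interval() {
    sed -n 's/.*stabilized --hash --interval \([0-9][0-9]*\) .*/\1/p' "$1"
}

# Example (not run here): stabilized_interval /sbin/mkinitrd
# An unpatched script would be expected to report 1000, a patched one 3000.
```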

I broke open the RHNS generated initrd.img to see if the mkinitrd patch
survived... and found that the nash init script was now a 64-bit LSB
executable. So, no easy help there.

Comment 1 David Barr 2011-05-18 16:34:38 UTC
Is there a way to tell the test host to check in with the RHNS server on demand? I could do a better job of tracking exactly where this SIGSEGV happens if I knew when to watch for the host checking in and receiving the command to reboot and kickstart.

Thanks!

Comment 2 David Barr 2011-05-18 20:49:29 UTC
Okay, I just got lucky, and caught the scheduled install just as the host was rebooting.

It loads the qla2xxx FC HBA driver, and does a DHCP search for eth0. The next thing to happen is the SEGV.

The contents of /tftpboot/pxelinux.cfg/[MACADDR] are

[CODE]
default linux
prompt 0
timeout 1
label linux
        kernel /images/rhel54server:1:[RHNSORG]/vmlinuz
        ipappend 2
        append initrd=/images/rhel54server:1:[RHNSORG]/initrd.img ksdevice=eth0 lang=  nobond mpath text kssendmac  ks=http://[RHNS]/cblr/svc/op/ks/system/[HOSTFQDN]:1
[/CODE]

I'm going to attach a diff -c of the normal and RHNS-generated kickstarts for this test host. The biggest differences I noticed are in how each one determines which MAC address is eth0. The "Network Configuration" I tested was "DHCP using first available interface." I'm going to try "Use DHCP from interface [eth0]" instead. I vaguely recall that this change may have been what produced the different behavior on the one other occasion.

The testing process, so far, is:

-) Delete the host profile from RHNS.
-) Kickstart the host bare-metal against the normal ks file.
-) Schedule an RHNS-based kickstart.
-) Watch the fireworks the next time the host checks in.

Comment 3 David Barr 2011-05-18 20:51:06 UTC
Created attachment 499692 [details]
diff -c ksNormal.txt ksHost.txt

Diff -c of the normal kickstart configuration with the one customized for the RHNS-scheduled kickstart.

Comment 4 Milan Zázrivec 2011-05-19 11:04:30 UTC
RHN Satellite (and spacewalk-koan / koan packages on the client side in
particular) have very little to do with the problem shown.

Their responsibility is just to download and install vmlinuz & initrd.img
from RHN Satellite (note that the kernel and ramdisk are installed exactly as
they were synchronized from Red Hat Network) and reboot the system in question.

The SIGSEGV shown in the initial comment comes from RHEL installer ->
reassigning to anaconda.

Comment 5 David Barr 2011-05-19 23:41:56 UTC
How would I test whether the problem is a failure to find images/stage2.img?

This environment is PXE booted from bare metal at the start. What if the copied over initrd.img/vmlinuz is looking for images/stage2.img on a local device because the installation parameters don't tell it to go out to the RHN Satellite server to get it?

If I were to copy it over locally, where would initrd.img/vmlinuz expect to find it?

Comment 6 David Barr 2011-05-19 23:58:17 UTC
I do have /boot/grub/stage2 (without the .img) at 104988 bytes...

Comment 7 Chris Lumens 2011-05-20 13:55:20 UTC
Can you please try this with a more recent release, like RHEL 5.7?  I'd like to know if this is still a bug there.

Comment 8 David Barr 2011-05-20 15:22:24 UTC
I don't see a 5.7 to download. I'm downloading a .iso of 5.6 with a last-modified of 2011-05-18 05:51:23 PDT. Hopefully, that's recent enough.

However, I'm constrained to 5.4 for the software that will run on these hosts. So, a more recent version may identify that a fix already exists, but I'd still need an errata update for 5.4 to implement it in production.

In the meantime, I figured out an experiment to test my hypothesis of the copied-from-source initrd.img and vmlinuz having a problem with the stage2.img.

Comment 9 Chris Lumens 2011-05-20 16:58:49 UTC
Right - I meant 5.6.  It's hard to keep track of which we are working on sometimes.

However, we don't release updated anaconda versions for past releases, so you'll either need to find a workaround or move the software you're using to a later release. That'll be the case regardless of whether it's fixed in 5.6 or not.

Comment 10 David Barr 2011-05-23 21:33:41 UTC
-) Bare metal kickstart to an RHNS-managed 5.6 distribution.
-) Schedule an RHNS-managed kickstart.
-) Time passes...
-) Kickstart kicks off...

running /sbin/loader

loading qla2xxx driver...
sending request for IP information for eth0 [Probably not exact text...]

loader received SIGSEGV! Backtrace:
[0x400913]
[0x50ac30]
[0x526360]
[0x417f10]
[0x420288]
[0x4203b0]
[0x42083f]
[0x41fd33]
[0x417381]
[0x4179c5]
install exited abnormally [1/1]
sending termination signals...done
sending kill signals...done
disabling swap...
unmounting filesystems...
        /proc/bus/usb done
        /proc done
        /dev/pts done
        /sys done
        /tmp/ramfs done
you may safely reboot your system

+++

The backtrace is different in this case.

-) Boot the rescue DVD to get the grub.conf entry.

default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title kick1306182811
        kernel /vmlinuz ksdevice=link nobond lang= text ks=http://[RHNS]/cblr/svc/op/ks/system/[HOSTFQDN]:1 kssendmac mpath
        initrd /initrd.img

+++

-) Compare fingerprints of copied initrd.img/vmlinuz and source copies.

[root@[HOST] boot]# md5sum initrd.img
ea701b85109eb12fe99a7a4dda6ba578  initrd.img
[root@[HOST] boot]# md5sum vmlinuz
ae4d6908010dcfcc9d3dd7dbbfef39ca  vmlinuz


[root@[RHNS] pxeboot]# pwd
/var/src/rhel-5.6-x86_64/images/pxeboot
[root@[RHNS] pxeboot]# md5sum initrd.img
ea701b85109eb12fe99a7a4dda6ba578  initrd.img
[root@[RHNS] pxeboot]# md5sum vmlinuz
ae4d6908010dcfcc9d3dd7dbbfef39ca  vmlinuz

+++

RHN Satellite (well, spacewalk-koan, or whatever) downloaded the right files.
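
The comparison above was done by eye across two hosts. A small sketch of the same check as a function, for repeating it on each attempt: check_md5 is a made-up helper name, and the expected digest has to be copied over from the md5sum output on the server.

```shell
# Succeed if the MD5 digest of file $1 equals the expected digest $2.
# check_md5 is a hypothetical helper, not part of any RHN tool.
check_md5() {
    actual=$(md5sum < "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example (not run here), using the digest reported on the server side:
# check_md5 /boot/initrd.img ea701b85109eb12fe99a7a4dda6ba578 && echo "initrd.img matches"
```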



Chris, I'm holding this test host static, to simplify experimentation. What would you like me to try next?

Comment 11 Chris Lumens 2011-05-23 21:50:58 UTC
These frames are just the sigsegv printer:

0x400913 - /usr/src/debug/anaconda-11.1.2.224/loader2/loader.c:1378
0x50ac30 - /usr/src/debug/anaconda-11.1.2.224/stubs/unicode-lite.c:28
0x526360 - /usr/src/debug/anaconda-11.1.2.224/stubs/unicode-lite.c:28

So the first line here is where the problem's occurring:

0x417f10 - /usr/src/debug/anaconda-11.1.2.224/loader2/net.c:2553
0x420288 - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:215
0x4203b0 - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:690
0x42083f - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:758
0x41fd33 - /usr/src/debug/anaconda-11.1.2.224/loader2/urls.c:215
0x417381 - /usr/src/debug/anaconda-11.1.2.224/loader2/urlinstall.c:62
0x4179c5 - /usr/src/debug/anaconda-11.1.2.224/loader2/urlinstall.c:114
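
For anyone wanting to reproduce a mapping like this, one possible approach is to pull the addresses out of the saved backtrace and hand them to addr2line from binutils. This is a sketch, not necessarily how the list above was produced: backtrace_addrs is a made-up helper name, and LOADER_DEBUG stands for an assumed path to an unstripped loader binary from the matching anaconda build, which this report does not give.

```shell
# Read loader backtrace text on stdin and print one address per line.
# Only lines consisting of a single bracketed hex address are kept, so
# trailers such as "install exited abnormally [1/1]" are ignored.
# backtrace_addrs is a hypothetical helper name.
backtrace_addrs() {
    sed -n 's/^[[:space:]]*\[\(0x[0-9a-fA-F][0-9a-fA-F]*\)\][[:space:]]*$/\1/p'
}

# Example (not run here; backtrace.txt and LOADER_DEBUG are assumptions):
# backtrace_addrs < backtrace.txt | xargs addr2line -f -e "$LOADER_DEBUG"
```

The addresses are only meaningful against the exact build that crashed, which is why the 5.4 and 5.6 backtraces differ.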

Comment 12 David Barr 2011-05-24 17:20:56 UTC
I can get a copy of anaconda-debuginfo, if that would help. I don't know exactly where or how I would incorporate it.

Comment 13 David Barr 2011-06-06 17:58:21 UTC
For what it's worth, I experimented with making the ks= option point to a local file. It didn't work, but it did prevent the seg fault. So, I think we can safely say the lookup that is generating the seg fault is the one looking for the server housing the ks source file.
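
To check that lookup from a running system before the reboot, one option is to pull the ks= value out of the boot arguments and try fetching it by hand. A minimal sketch, assuming the arguments are space-separated as in the pxelinux and grub entries above; ks_param is a made-up helper name.

```shell
# Print the value of the first ks= argument in the given kernel command line.
# Arguments such as ksdevice= and kssendmac are not matched.
# ks_param is a hypothetical helper name.
ks_param() {
    printf '%s\n' "$1" | tr -s ' \t' '\n\n' | sed -n 's/^ks=//p' | head -n 1
}

# Example (not run here): take the arguments of the current boot and see
# whether the kickstart URL answers at all.
# url=$(ks_param "$(cat /proc/cmdline)")
# curl -sI "$url"
```

This only shows whether the server name resolves and the URL responds from the installed system; it does not exercise the loader's own network code.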