| Summary: | RHNS-scheduled kickstart generates SIGSEGV | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | David Barr <dafydd> |
| Component: | anaconda | Assignee: | Anaconda Maintenance Team <anaconda-maint-list> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Release Test Team <release-test-team> |
| Severity: | medium | Priority: | unspecified |
| Version: | 5.4 | CC: | mzazrivec |
| Hardware: | x86_64 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2013-03-20 17:51:36 UTC |
Is there a way to tell the test host to check in with the RHNS server? I could do a better job of tracking exactly where this SIGSEGV happens if I knew when to watch for the host checking in and receiving the command to reboot and kickstart. Thanks!
Okay, I just got lucky and caught the scheduled install just as the host was rebooting.
It loads the qla2xxx FC HBA driver and sends a DHCP request for eth0. The next thing that happens is the SEGV.
The contents of /tftpboot/pxelinux.cfg/[MACADDR] are:
[CODE]
default linux
prompt 0
timeout 1
label linux
kernel /images/rhel54server:1:[RHNSORG]/vmlinuz
ipappend 2
append initrd=/images/rhel54server:1:[RHNSORG]/initrd.img ksdevice=eth0 lang= nobond mpath text kssendmac ks=http://[RHNS]/cblr/svc/op/ks/system/[HOSTFQDN]:1
[/CODE]
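When comparing boot entries, it can help to pull single parameters out of the append line instead of reading it by eye. A minimal sketch, assuming a POSIX shell; kernel_arg is a hypothetical helper name and not part of any Red Hat tool:

```shell
# Hypothetical helper: print the value of one key=value boot parameter.
# $1 = parameter name (e.g. ksdevice), $2 = the full append/kernel line.
kernel_arg() (
    set -f                      # no globbing: the line may contain [ ] or *
    for word in $2; do
        case $word in
            "$1"=*) printf '%s\n' "${word#*=}"; exit 0 ;;
        esac
    done
    exit 1                      # parameter not present on this line
)
```

Called as kernel_arg ksdevice "$line" with the append line above stored in line, this prints eth0. Bare flags such as nobond or kssendmac carry no value, so the helper reports them as not present.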
I'm going to attach a diff -c of the normal and RHNS-generated kickstarts for this test host. The biggest differences I noticed are in how each one determines which MAC address is eth0. The "Network Configuration" I tested was "DHCP using first available interface." I'm going to try "Use DHCP from interface [eth0]" instead. If I remember correctly, that change may be what produced the different behavior on the one other occasion.
The testing process, so far, is:
-) Delete the host profile from RHNS.
-) Kickstart the host bare-metal against the normal ks file.
-) Schedule an RHNS-based kickstart.
-) Watch the fireworks the next time the host checks in.
Created attachment 499692
diff -c ksNormal.txt ksHost.txt
Diff -c of the normal kickstart configuration with the one customized for the RHNS-scheduled kickstart.
RHN Satellite (and the spacewalk-koan / koan packages on the client side in particular) has very little to do with the problem shown. Their responsibility is just to download and install vmlinuz and initrd.img from RHN Satellite (note that the kernel and ramdisk are installed exactly as they were synchronized from Red Hat Network) and reboot the system in question. The SIGSEGV shown in the initial comment comes from the RHEL installer -> reassigning to anaconda.
How would I test whether the problem is a failure to find images/stage2.img? This environment is PXE booted from bare metal at the start. What if the copied-over initrd.img/vmlinuz is looking for images/stage2.img on a local device because the installation parameters don't tell it to go out to the RHN Satellite server to get it? If I were to copy it over locally, where would initrd.img/vmlinuz expect to find it? I do have /boot/grub/stage2 (without the .img) at 104988 bytes...
Can you please try this with a more recent release, like RHEL 5.7? I'd like to know if this is still a bug there.
I don't see a 5.7 to download. I'm downloading a .iso of 5.6 with a last-modified of 2011-05-18 05:51:23 PDT. Hopefully, that's recent enough. However, I'm constrained to 5.4 for the software that will run on these hosts. So a more recent version may show that a fix already exists, but I'd still need an errata update for 5.4 to implement it in production. In the meantime, I figured out an experiment to test my hypothesis that the copied-from-source initrd.img and vmlinuz have a problem with stage2.img.
Right - I meant 5.6. It's hard to keep track of which we are working on sometimes. However, we don't release updated anaconda versions for past releases, so you'll either need to find a workaround or move the software you're using to a later release. That'll be the case regardless of whether it's fixed in 5.6 or not.
-) Bare metal kickstart to an RHNS-managed 5.6 distribution.
-) Schedule an RHNS-managed kickstart.
-) Time passes...
-) Kickstart kicks off...
running /sbin/loader
loading qla2xxx driver...
sending request for IP information for eth0 [Probably not exact text...]
loader received SIGSEGV! Backtrace:
[0x400913]
[0x50ac30]
[0x526360]
[0x417f10]
[0x420288]
[0x4203b0]
[0x42083f]
[0x41fd33]
[0x417381]
[0x4179c5]
install exited abnormally [1/1]
sending termination signals...done
sending kill signals...done
disabling swap...
unmounting filesystems...
/proc/bus/usb done
/proc done
/dev/pts done
/sys done
/tmp/ramfs done
you may safely reboot your system
+++
The backtrace is different in this case.
-) Boot the rescue DVD to get the grub.conf entry.
default=0
timeout=10
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title kick1306182811
kernel /vmlinuz ksdevice=link nobond lang= text ks=http://[RHNS]/cblr/svc/op/ks/system/[HOSTFQDN]:1 kssendmac mpath
initrd /initrd.img
+++
-) Compare fingerprints of copied initrd.img/vmlinuz and source copies.
[root@[HOST] boot]# md5sum initrd.img
ea701b85109eb12fe99a7a4dda6ba578 initrd.img
[root@[HOST] boot]# md5sum vmlinuz
ae4d6908010dcfcc9d3dd7dbbfef39ca vmlinuz
[root@[RHNS] pxeboot]# pwd
/var/src/rhel-5.6-x86_64/images/pxeboot
[root@[RHNS] pxeboot]# md5sum initrd.img
ea701b85109eb12fe99a7a4dda6ba578 initrd.img
[root@[RHNS] pxeboot]# md5sum vmlinuz
ae4d6908010dcfcc9d3dd7dbbfef39ca vmlinuz
+++
RHN Satellite (well, spacewalk-koan, or whatever) downloaded the right files.
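The fingerprint comparison above can be scripted so it is easy to repeat after each scheduled kickstart. A minimal sketch, assuming md5sum from coreutils is available and that both copies are reachable from one host (for example after copying the source images over); same_md5 is a hypothetical helper name:

```shell
# Hypothetical helper: succeed (exit 0) when two files have the same MD5.
# Exit 1 means the digests differ; exit 2 means a file could not be read.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    sum_a=$(md5sum < "$1" | cut -d' ' -f1)
    sum_b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$sum_a" = "$sum_b" ]
}

# Example use (paths are placeholders, adjust to the local layout):
#   same_md5 /boot/initrd.img /mnt/src/images/pxeboot/initrd.img && echo match
```

Reading the files through stdin keeps the file name out of the md5sum output, so only the digests are compared.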
Chris, I'm holding this test host static, to simplify experimentation. What would you like me to try next?
These frames are just the sigsegv printer:
0x400913 - /usr/src/debug/anaconda-11.1.2.224/loader2/loader.c:1378
0x50ac30 - /usr/src/debug/anaconda-11.1.2.224/stubs/unicode-lite.c:28
0x526360 - /usr/src/debug/anaconda-11.1.2.224/stubs/unicode-lite.c:28
So the first line here is where the problem's occurring:
0x417f10 - /usr/src/debug/anaconda-11.1.2.224/loader2/net.c:2553
0x420288 - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:215
0x4203b0 - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:690
0x42083f - /usr/src/debug/anaconda-11.1.2.224/loader2/ftp.c:758
0x41fd33 - /usr/src/debug/anaconda-11.1.2.224/loader2/urls.c:215
0x417381 - /usr/src/debug/anaconda-11.1.2.224/loader2/urlinstall.c:62
0x4179c5 - /usr/src/debug/anaconda-11.1.2.224/loader2/urlinstall.c:114
I can get a copy of anaconda-debuginfo, if that would help. I don't know exactly where or how I would incorporate it.
For what it's worth, I experimented with making the ks= option point to a local file. It didn't work, but it did prevent the seg fault. So, I think we can safely say the lookup that is generating the seg fault is the one looking for the server housing the ks source file.
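One way to put anaconda-debuginfo to use is to feed the backtrace addresses to addr2line from binutils, which maps addresses to file:line pairs when given an unstripped binary. A minimal sketch under two assumptions: the debuginfo package must match the exact anaconda build inside the initrd, and LOADER_DEBUG below is a hypothetical variable standing in for wherever that package places the loader's debug file. backtrace_addrs is likewise a hypothetical helper name:

```shell
# Hypothetical helper: read console text on stdin and print each bracketed
# hex address from the loader backtrace, one per line, in original order.
backtrace_addrs() {
    grep -Eo '\[0x[0-9a-fA-F]+\]' | tr -d '[]'
}

# Example use (LOADER_DEBUG is a placeholder, not a known path):
#   backtrace_addrs < console.txt | addr2line -f -e "$LOADER_DEBUG"
# addr2line reads addresses from stdin when none are given as arguments.
```

Lines such as "install exited abnormally [1/1]" are skipped because the bracketed text is not a hex address.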
+++ Description of problem:
When a host checks in and receives an instruction to kickstart itself, it successfully generates a new grub entry, with initrd.img and vmlinuz. When the host reboots and starts its installation, something generates a Segmentation Violation.
+++ Version-Release number of selected component (if applicable):
spacewalk-koan-0.2.7-7.el5sat, assuming this package is the source of the problem.
+++ How reproducible:
Five out of six attempts, so far.
+++ Steps to Reproduce:
1. Bare-metal kickstart a system.
2. Schedule a kickstart via RHNS for that system using the same ks profile.
3. Wait for the system to check in and receive its request to reinstall.
+++ Actual results:
For five out of six attempts, the following output is generated:
[CODE]
Loader received SIGSEGV! Backtrace:
[0x400c83]
[0x504550]
[0x51f990]
[0x417990]
[0x41fcc8]
[0x41fdf0]
[0x42027f]
[0x41f773]
[0x416e01]
[0x417445]
Install exited abnormally [1/1]
Sending termination signals...done
Sending kill signals...done
Disabling swap...
Unmounting filesystems...
/proc/bus/usb done
/proc done
/dev/pts done
/sys done
/tmp/ramfs done
You may safely reboot your system
[/CODE]
This stack trace is verified consistent across the last three attempts. On the sixth attempt, the installation appeared to be successful, but only a small number of packages were actually installed, and the post-install reboot failed. I didn't explore the results in detail. If it happens again, I'll look deeper into it.
+++ Expected results:
Successful software installation and reboot.
+++ Additional info:
I've had problems kickstarting in this environment already, but not like this. Those problems involved detecting multipath to an IBM SAN, and waiting long enough for scsi to get those paths set up before multipath combines them and LVM goes to look for volumes on them. However, in this case, the collapse appears to be occurring before the boot gets to that point in its discovery process.
For context, the first problem is bug 570460, which I solved by putting the manual mkinitrd workaround in a %post script:
[CODE]
/sbin/mkinitrd -v \
--with=dm-mod \
--with=dm-multipath \
--with=dm-round-robin \
--with=scsi_dh_rdac \
\${INITRD} \${KERNELVER}
[/CODE]
The second problem is that the scsi device discovery was taking longer than the timeout allowed for. That resulted in multipath and LVM trying to work with devices that weren't there yet. That was fixed by modifying mkinitrd itself to change the timeout when it generated the init file in initrd-*.img:
[CODE]
cat <<EOT | patch /sbin/mkinitrd
900c900
< emit "stabilized --hash --interval 1000 /proc/scsi/scsi"
---
> emit "stabilized --hash --interval 3000 /proc/scsi/scsi"
EOT
[/CODE]
I broke open the RHNS generated initrd.img to see if the mkinitrd patch survived... and found that the nash init script was now a 64-bit LSB executable. So, no easy help there.
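The check on the unpacked init can be made repeatable by looking at the file's first bytes. A minimal sketch, assuming the initrd has already been unpacked into a scratch directory (if it is a gzip-compressed cpio archive, something like zcat initrd.img | cpio -idm run inside that directory); init_kind is a hypothetical helper name:

```shell
# Hypothetical helper: classify a file by its leading magic bytes.
# Prints "script" for a #! file, "elf" for an ELF binary, else "unknown".
init_kind() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case $magic in
        2321*)    echo script ;;   # 0x23 0x21 is "#!"
        7f454c46) echo elf ;;      # 0x7f followed by "ELF"
        *)        echo unknown ;;
    esac
}

# Example use, from the directory holding the unpacked initrd:
#   init_kind ./init
```

A result of script means the init can be read and patched as text, which is what the mkinitrd timeout change relies on; a result of elf means that kind of edit does not apply to this image.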