Bug 446669 - Fedora 9 cannot boot with init segfault
Summary: Fedora 9 cannot boot with init segfault
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: mkinitrd
Version: 9
Hardware: x86_64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ---
Assignee: Peter Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 447724
Depends On:
Blocks:
 
Reported: 2008-05-15 16:25 UTC by Arc C.
Modified: 2009-07-14 18:16 UTC
CC: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-14 18:16:02 UTC
Type: ---
Embargoed:


Attachments
IMAGE: traceback when entering wrong password for encrypted device (23.69 KB, image/png)
2008-07-09 12:27 UTC, Alexander Todorov
screen shot of the segfault / xen guest (3.41 KB, image/png)
2008-08-14 08:33 UTC, Alexander Todorov

Description Arc C. 2008-05-15 16:25:58 UTC
After a fresh install of Fedora 9 on two disks in software RAID, booting fails
with the following:

root (hd0,0)
 Filesystem type is ext2fs, partition type 0xfd
kernel /boot/vmlinuz-2.6.25-14.fc9.x86_64 ro
root=UUID=576c8a30-5637-4b72-93a1-d5fbcb800169 rhdb quiet
   [Linux-bzImage, setup=0x2e00, size=0x1f63b8]
initrd /boot/initrd-2.6.25-14.fc9.x86_64.img
     [Linux-initrd @ 0x37cdf000, 0x31080c bytes]
     
Decompressing Linux... done.
Booting the kernel.

Red Hat nash version 6.0.52 starting
device-mapper: table: 253:0: mirror: Device lookup failure
device-mapper: reload ioctl failed: No such device or address
mdadm: /dev/md0 has been started with 1 drive (out of 2).
mdadm: /dev/md1 has been started with 1 drive (out of 2).
device-mapper: table ioctl failed: No such device or address
device-mapper: deps ioctl failed: No such device or address
init[1]: segfault at 10 ip 7f1d8a3b5004 sp 7fff927c2ef8 error 4 in
libdevmapper.so.1.02[7f1d8a3a7000+15000]
nash received SIGSEGV! Backtrace (16):
/bin/nash[0x40d093]
/lib64/libc.so.6[0x7f1d87cce2a0]
/lib64/libdevmapper.so.1.02(dm_task_get_deps+0x4)[0x7f1d8a3b5004]
/usr/lib64/libnash.so.6.0.52[0x626c1b]
/usr/lib64/libnash.so.6.0.52(nashDmDevGetName+0x3d)[0x627ab0]
/usr/lib64/libnash.so.6.0.52[0x6242e3]
/usr/lib64/libnash.so.6.0.52[0x6243fc]
/usr/lib64/libnash.so.6.0.52(nashBdevIterNext+0x120)[0x624871]
/usr/lib64/libnash.so.6.0.52[0x624abb]
/usr/lib64/libnash.so.6.0.52(nashFindFsByName+0x60)[0x624b86]
/usr/lib64/libnash.so.6.0.52(nashAGetPathBySpec+0x86)[0x624c73]
/bin/nash[0x408b24]
/bin/nash[0x40cf49]
/bin/nash[0x40d576]
/lib64/libc.so.6(__libc_start_main+0xfa)[0x7f1d87cba32a]
/bin/nash[0x404509]

Comment 1 Khamit Ardashev 2008-05-15 19:48:21 UTC
Almost the same problem on Intel RAID (motherboard DP35DP, 3 SATA drives striped):
the install goes fine, but during the first boot:
table: 253:0 striped
segfault at 10 Couldn't parse string destination
error 4 in libdevmapper.so.1.02
nash received SIGSEGV

Comment 2 Arc C. 2008-05-15 20:08:06 UTC
Yes, I forgot to mention I have an internal PCI SATA RAID card (4-port HighPoint
HPT374 chip which is RocketRAID 1640). I installed with it because I had two
disks with data attached to it and it was working fine under FC3. Now even if I
remove it and boot, I still get the segfault. I guess re-installing FC9 without
the card would make it work, but I need the card to access the drives (don't
have enough SATA ports on motherboard).

Comment 3 Zdenek Kabelac 2008-05-16 15:08:05 UTC
This bug is also visible on a simple T61 notebook. The problem might be
incorrect ram disk creation; for now, see bug #443332.


Comment 4 Arc C. 2008-05-19 19:26:20 UTC
Since this is an important server to me, I've used the workaround of
re-installing FC9 without the PCI SATA RAID card and adding the card afterwards.
I will not be able to replicate the problem as I don't want to rebuild this
system anymore.

Comment 5 c.h. 2008-05-21 12:59:48 UTC
I'm having problems that seem very similar:
https://bugzilla.redhat.com/show_bug.cgi?id=447724

In fact much of the backtrace I get is very similar to the data originally
posted for this bug report, though there are some differences to be seen in the
console text and in the particulars of the system configurations.



Comment 6 Zdenek Kabelac 2008-05-21 14:10:30 UTC
*** Bug 447724 has been marked as a duplicate of this bug. ***

Comment 7 Alasdair Kergon 2008-05-21 23:44:16 UTC
initrd/nash debugging first I think

Comment 8 Zdenek Kabelac 2008-05-22 10:42:12 UTC
It looks like one bug is in nash: it doesn't properly check the result after
dm_task_run (i.e. the dm_task_get_info call is missing), so a device without a
table crashes nash. This should be easy to fix. However, another part of the
problem is that this device should not have an empty table in the first place.
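
To illustrate the check being described, here is a minimal, hypothetical sketch
of a defensive deps query against libdevmapper: the return values of
dm_task_run() and dm_task_get_info() are verified before dm_task_get_deps() is
used, so a device that vanished or has no table cannot lead to a crash on a
failed task's results. This is not nash's actual code; the function name
print_deps and the error handling are illustrative only.

/* Hypothetical sketch, not nash's real code: query the dependencies of a
 * device-mapper device defensively, checking dm_task_run() and
 * dm_task_get_info() before touching the task's results. */
#include <stdio.h>
#include <stdint.h>
#include <sys/sysmacros.h>
#include <libdevmapper.h>

static int print_deps(const char *name)
{
    struct dm_task *dmt;
    struct dm_info info;
    struct dm_deps *deps;
    uint32_t i;
    int ret = -1;

    if (!(dmt = dm_task_create(DM_DEVICE_DEPS)))
        return -1;
    if (!dm_task_set_name(dmt, name))
        goto out;

    /* The DEPS ioctl fails if the device disappeared in the meantime. */
    if (!dm_task_run(dmt))
        goto out;

    /* Make sure the device still exists before using the task's results. */
    if (!dm_task_get_info(dmt, &info) || !info.exists)
        goto out;

    if (!(deps = dm_task_get_deps(dmt)))
        goto out;

    for (i = 0; i < deps->count; i++)
        printf("%s depends on %u:%u\n", name,
               (unsigned) major(deps->device[i]),
               (unsigned) minor(deps->device[i]));
    ret = 0;
out:
    dm_task_destroy(dmt);
    return ret;
}

(Link with -ldevmapper.)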

Comment 9 c.h. 2008-05-23 01:04:41 UTC
In the case where I encountered the bug, I had selected an encrypted LVM with
the anaconda check-box.  Shouldn't it be right around the point of having booted
from the clear-text /boot and proceeding to mount the encrypted root that it
asks for a password, before it can really find/access the LVM?  Or is it
crashing just before that code?

It is also noteworthy that I had just installed the system using the F9 media
supplemented by the default internet-based repository, so clearly at install
time it was able to probe / configure / create / access the LVM and other
system storage devices; something that apparently works fine at install time
is not doing so at first-boot time.

In case it is relevant, during install on the same system I did get a dialog box
about some unreadable storage devices *not* used for Fedora 9 install and not
selected to be mounted in the installed system:
https://bugzilla.redhat.com/show_bug.cgi?id=447729

WARNING
The partition table on device mapper/mpath0 was unreadable.  To create new
partitions it must be initialized, causing the loss of ALL DATA on this drive.

This operation will override any previous installation choices about which
drives to ignore.

Would you like to initialize this drive, erasing ALL DATA?


-- I answered NO. So out of 5 physical drives, only 'sde' contained "/" and
"/boot" etc.; the other drives were variously blank, parts of an old RAID set
unrelated and irrelevant to FC9, or had ext3 on them.

AHCI mode was selected for all drives, four SATA, one PATA, split between ICH9R
SATA ports and the motherboard's JMicron PATA/eSATA controller.

In any case the system is still in the broken state and I haven't found a
palatable workaround (not wanting to rebuild the box and unplug the drives
yet), so if there are any helpful diagnostics I could run, I'm glad to try.
Within a day or two I may put the system into service, though, which might
limit debug capability subsequently.



Comment 10 Milan Broz 2008-05-23 13:52:19 UTC
nash segfault reproducer:

1) run in one terminal:
 while :; do dmsetup create xxx --notable ; dmsetup remove xxx ; done

(so device xxx is appearing and disappearing continuously)

2) in second terminal run mkinitrd

# mkinitrd --force-lvm-probe -v /boot/initrd-2.6.26-rc3.img 2.6.26-rc3
Creating initramfs
device-mapper: table ioctl failed: No such device or address
device-mapper: deps ioctl failed: No such device or address
nash received SIGSEGV!  Backtrace (16):
/sbin/nash[0x40d093]
/lib64/libc.so.6[0x7f6ada8412a0]
/lib64/libdevmapper.so.1.02(dm_task_get_deps+0x4)[0x7f6adcf23004]
/usr/lib64/libnash.so.6.0.52[0x7f6add33ec1b]
/usr/lib64/libnash.so.6.0.52(nashDmDevGetName+0x3d)[0x7f6add33fab0]
/usr/lib64/libnash.so.6.0.52[0x7f6add33c2e3]
/usr/lib64/libnash.so.6.0.52[0x7f6add33c3fc]
/usr/lib64/libnash.so.6.0.52(nashBdevIterNext+0x120)[0x7f6add33c871]
/usr/lib64/libnash.so.6.0.52[0x7f6add33cabb]
/usr/lib64/libnash.so.6.0.52(nashFindFsByName+0x60)[0x7f6add33cb86]
/usr/lib64/libnash.so.6.0.52(nashAGetPathBySpec+0x86)[0x7f6add33cc73]
/sbin/nash[0x4088a1]
/sbin/nash[0x40cf49]
/sbin/nash[0x40d576]
/lib64/libc.so.6(__libc_start_main+0xfa)[0x7f6ada82d32a]
/sbin/nash[0x404509]

# rpm -q nash
nash-6.0.52-2.fc9.x86_64

(the nash used here already has the patch for bug #443332 applied:

@@ -984,7 +984,7 @@ nashDmDevGetType(nashContext *nc, dev_t
     while ((obj = dm_iter_next(iter, 1)) && (obj->devno != devno))
         ;

-    if (obj) {
+    if (obj && obj->type) {
         strncpy(buf, obj->type, 31);
         dm_iter_destroy(iter);
         return buf;
)

Comment 11 Kuba Ober 2008-07-02 06:09:39 UTC
Please yum install mkinitrd-debuginfo so that we get to see what line it fails
on. It may be a manifestation of a buffer overrun bug for which I posted a
patch (possibly incorrectly) under bug 443332. For me the suggestion from
Zdenek Kabelac in 443332 doesn't work: obj->type is never zero (for me)
when the memory isn't corrupted in the first place. YMMV...

Comment 12 Alexander Todorov 2008-07-09 12:27:05 UTC
Created attachment 311365 [details]
IMAGE: traceback when entering wrong password for encrypted device

I'm seeing a similar traceback (see image) when I enter a wrong password for
the encrypted device. This is on a Xen PV guest with a stock F9 install. I will
apply the latest updates and try again.

Comment 13 Alexander Todorov 2008-07-09 12:43:01 UTC
Comment #12 happens when entering the wrong password for the encrypted device 3
times (Xen PV guest on i386).

RPM versions:
mkinitrd-6.0.52-2.fc9.i386
initscripts-8.76.2-1.i386
device-mapper-libs-1.02.24-11.fc9.i386

Please advise if I'm seeing the same bug or a different issue.

Thanks!

Comment 14 c.h. 2008-08-13 20:25:38 UTC
Any news / possible workaround / fix on the horizon on this?
Does it still happen with F10-alpha?

IIRC someone was blaming dmraid/device-mapper problems on glibc2 versioning at one point, either in FC8 or F9, in a different bug report or maybe a forum post.  I think they said that reverting glibc2 to a previous version fixed their problem even though the new one was "officially" backward compatible.
Unfortunately I don't recall what their symptoms were, but if this problem remains totally elusive to identify / solve, maybe it is worth a check.

If a fix was added to a package in 'updates', would there be any easy way to perform a proper Fedora-9 install (on a system that would crash otherwise) inclusive of the fix, short of doing an FTP/HTTP install?  i.e. could one install from the F9 DVD with networking enabled and have the system install the non-updated software from DVD but install any available updates from the updates repository over the internet?
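
For what it's worth, here is a hedged sketch of one way this could be done,
assuming a fixed mkinitrd/nash were actually published in the F9 'updates'
repository (which this report does not confirm): install from the DVD as usual,
then boot the DVD in rescue mode before the first real boot, chroot into the
installed system, pull the updates over the network, and regenerate the initrd.
The package names and kernel version below are taken from this report and are
illustrative only.

# boot the F9 install DVD with the "linux rescue" option and let rescue mode
# mount the installed system and bring up networking, then:
chroot /mnt/sysimage
yum -y update mkinitrd nash device-mapper-libs
# regenerate the initrd for the installed kernel (version from the original report):
mkinitrd -f -v /boot/initrd-2.6.25-14.fc9.x86_64.img 2.6.25-14.fc9.x86_64
exit
reboot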

Comment 15 Alexander Todorov 2008-08-14 08:31:36 UTC
(In reply to comment #14)
> Does it still happen with F10-alpha?
> 

I still get a traceback with F10-alpha, even after the 1st entry of a wrong
password, not the 3rd as in comment #13.

Comment 16 Alexander Todorov 2008-08-14 08:33:58 UTC
Created attachment 314294 [details]
screen shot of the segfault / xen guest

Comment 17 Ray Strode [halfline] 2008-08-14 14:16:06 UTC
Hi Alexander,

Comment 15 is an unrelated issue in the F10 Alpha.  See the release notes for the alpha for more information.

Comment 18 Nicolas Troncoso Carrere 2009-01-06 15:37:34 UTC
I'm seeing a similar crashdump as in https://bugzilla.redhat.com/show_bug.cgi?id=446669#c10
I'm testing Fedora RawHide. 

I booted without quiet or rhgb and noticed that the kernel panics, but when
using rhgb, nash seems to be spawned anyway without having a clue that the
kernel has taken a trip to never land.

Comment 19 Nicolas Troncoso Carrere 2009-01-06 15:56:05 UTC
After getting the password right when using rhgb, the distro booted anyway,
even though it reported a kernel panic.

Comment 20 Bug Zapper 2009-06-10 00:51:46 UTC
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 21 Khamit Ardashev 2009-07-06 20:40:44 UTC
It is fixed in FC11.  I upgraded from FC9 straight to FC11.  It works fine now.
Of course, with every upgrade I have to clean up duplicate rpms and unsatisfied
dependencies, but that is a minor nuisance.

Comment 22 Bug Zapper 2009-07-14 18:16:02 UTC
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

