Bug 486997

Summary:

rawhide's init/nash segfaults in libblkid

Product:

[Fedora] Fedora

Reporter:

Jim Meyering <meyering>

Component:

e2fsprogs

Assignee:

Eric Sandeen <esandeen>

Status:

CLOSED NEXTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Priority:

low

Version:

rawhide

CC:

esandeen, kernel-maint, kzak, oliver, quintela

Hardware:

All

OS:

Linux

Fixed In Version:

1.41.4-4.fc10

Doc Type:

Bug Fix

Last Closed:

2009-03-18 19:06:05 UTC

Attachments:

Description	Flags
first 512 bytes	none
serial console log	none

Description Jim Meyering 2009-02-23 16:31:21 UTC

Description of problem:
Hello,

Normally I track rawhide pretty closely, but for the last few weeks
new kernels haven't booted on the system I use for that, so I stuck
with the most recent one that worked,

  2.6.29-0.74.rc3.git3.fc11.x86_64

However, now that there are 4 newer kernel images, none of which boots,
it's getting a little precarious, and I investigated.
Note: this is the first that failed to boot:

  2.6.29-0.99.rc4.git1.fc11.x86_64

The most recent, 2.6.29-0.131.rc5.git2.fc11.x86_64
also fails the same way, with a segfault from init/nash.

The traceback I saw had (from memory, sorry)

  glibc's strlen
  ...
  libblkid's blkid_verify
             blkid_get_dev
  nash...

I found that I could boot into any of the recent kernels only if
I either physically disconnected /dev/hdb or erased its partition
table.  I did save a copy.
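
Saving and restoring a copy like this can be sketched with dd. This is a minimal sketch, not necessarily the commands used here: the function names and the backup file name are hypothetical. Note also that the first 512 bytes hold only the primary partition entries; logical partitions are described further into the disk.

```shell
#!/bin/sh
# save_first_sector SRC DEST
# Copy the first 512 bytes of SRC (boot code plus primary partition table) to DEST.
save_first_sector() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# restore_first_sector BACKUP TARGET
# Write the saved 512 bytes back over the start of TARGET.
# conv=notrunc keeps dd from truncating TARGET when it is a regular file.
restore_first_sector() {
    dd if="$1" of="$2" bs=512 count=1 conv=notrunc 2>/dev/null
}

# Hypothetical usage, as root (file name is an assumption):
#   save_first_sector /dev/sdb sdb-first512.bin
#   restore_first_sector sdb-first512.bin /dev/sdb
```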

The partitions on /dev/hdb were of type ext4 and ext3.
Here's what parted said before I reformatted the xfs partition as ext4,
just to be sure xfs wasn't implicated:

  Model: ATA SAMSUNG HD501LJ (scsi)
  Disk /dev/sdb: 976773168s
  Sector size (logical/physical): 512B/512B
  Partition Table: msdos

  Number  Start       End         Size        Type      File system  Flags
   1      32s         390625279s  390625248s  primary   ext3         boot 
   2      390625280s  488282111s  97656832s   primary   ext3              
   3      488282112s  625000447s  136718336s  extended                    
   5      488282144s  585938943s  97656800s   logical   xfs               
   6      585938976s  625000447s  39061472s   logical                     

If you need more detail, I'll be happy to help,
but it may take me a week or so.


Version-Release number of selected component (if applicable): see above


How reproducible: always


Steps to Reproduce:
1. copy partition table back to /dev/hdb
2. reboot
  
Actual results:
segfault in nash/libblkid

Expected results:
no segfault

Additional info:

Comment 1 Eric Sandeen 2009-02-23 22:49:00 UTC

Jim, can you attach the partition table here?

Thanks,
-Eric

Comment 2 Jim Meyering 2009-02-23 22:54:13 UTC

Created attachment 332980 [details]
first 512 bytes

here you go...

Comment 3 Eric Sandeen 2009-02-23 23:05:48 UTC

Also it'd be interesting if you could try installing some old kernel w/ your present system; if that fails too it might be due to something horked in an e2fsprogs upgrade?

Or, put the partition table back, boot the old working kernel+initrd, then run blkid with "-c /dev/null" so you don't read cached info, and see if you can reproduce.

I'll assume this is e2fsprogs' fault for now and take the bug.  :)
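
The uncached probe suggested above could look roughly like the following. It is only a sketch: the wrapper name probe_uncached is hypothetical, and the device to pass is whichever partition is under suspicion.

```shell
#!/bin/sh
# probe_uncached DEVICE...
# Run blkid on each device with the cache file set to /dev/null,
# so that no previously cached information is consulted.
probe_uncached() {
    for dev in "$@"; do
        # If blkid itself crashes, the shell typically reports
        # exit status 139 (128 + SIGSEGV, signal 11).
        blkid -c /dev/null "$dev" || echo "blkid failed on $dev (exit $?)" >&2
    done
}

# Hypothetical usage, as root:
#   probe_uncached /dev/sdb1 /dev/sdb2
```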

Comment 4 Jim Meyering 2009-02-23 23:48:10 UTC

Hooked up a serial cable.
Full log attached below. Here's the stack trace:

Activating logical volumes
  2 logical volume(s) in volume group "VolGroup00" now active
init[1]: segfault at 0 ip 000000332867dd20 sp 00007fffec7f2a48 error 4 in libc-2]
nash received SIGSEGV!  Backtrace (16):
/bin/nash[0x40ef98]
/lib64/libc.so.6[0x3328633340]
/lib64/libc.so.6(strlen+0x30)[0x332867dd20]
/lib64/libblkid.so.1[0x332ba06b2e]
/lib64/libblkid.so.1[0x332ba06c0c]
/lib64/libblkid.so.1[0x332ba06d94]
/lib64/libblkid.so.1(blkid_verify+0x1cd)[0x332ba070dd]
/lib64/libblkid.so.1(blkid_get_dev+0xab)[0x332ba0415b]
/usr/lib64/libnash.so.6.0.77[0x332960cb97]
/usr/lib64/libnash.so.6.0.77(nashFindFsByName+0x63)[0x332960cd3c]
/usr/lib64/libnash.so.6.0.77(nashAGetPathBySpec+0xa1)[0x332960ce44]
/bin/nash[0x40a2bf]
/bin/nash[0x40ee72]
/bin/nash[0x40f50c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x332861e5ed]
/bin/nash[0x4050b9]
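
For reading the kernel line at the top of this trace: "segfault at 0" is the faulting address, and "error 4" is the x86 page-fault error code. A small sketch of decoding the low three bits of that code follows (the function name is hypothetical). Together with strlen at the top of the backtrace, the result suggests strlen was handed a NULL pointer.

```shell
#!/bin/sh
# decode_pf_error CODE
# Decode the low three bits of an x86 page-fault error code:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
decode_pf_error() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    echo "$access, $cause, $mode"
}

decode_pf_error 4    # prints: read, page not present, user mode
```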

Comment 5 Jim Meyering 2009-02-23 23:50:12 UTC

Created attachment 332992 [details]
serial console log

Comment 6 Jim Meyering 2009-02-24 07:35:32 UTC

Eric,
responding also here to your comment #3: using an older kernel does work (see the actual report for the version numbers).  With a zeroed partition table, even the newer kernels boot.

when booted into the latest kernel, I restored the partition table,
ran partprobe to make the kernel reread it, then ran blkid to show everything.
Worked fine:

for i in $(blkid | perl -nle '/.* UUID="(.*?)".*/ and print $1'); do echo $i; blkid -l -t UUID=$i > /dev/null; done
5ed0e379-eea9-47cc-95ca-f5b86032f887
f9d4b936-c764-4bfa-af27-d9ab7f949e0d
6c27b605-0bd3-48fc-84f9-a9b536e4eae3
f9d4b936-c764-4bfa-af27-d9ab7f949e0d
toYpTt-s6ty-jpaV-K4Re-mluN-rdjl-IBqRq8
5ed0e379-eea9-47cc-95ca-f5b86032f887
47DD-19E3
e73c8549-4182-4976-9ae9-954ca952623c
f1b08685-6a49-4908-b9b0-919b886ba55c
f15321e5-d3fc-4256-acc3-3a93c7c7ae1e

Comment 7 Jim Meyering 2009-02-25 21:35:28 UTC

I've reduced it a little more.
For background, here's my partition table.
Note that hdb2 is a 50GB ext4 partition.

Model: ATA SAMSUNG HD501LJ (scsi)
Disk /dev/sdb: 976773168s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start       End         Size        Type      File system  Flags
 1      32s         390625279s  390625248s  primary   ext3         boot
 2      390625280s  488282111s  97656832s   primary   ext3
 3      488282112s  625000447s  136718336s  extended
 5      488282144s  585938943s  97656800s   logical   ext3
 6      585938976s  625000447s  39061472s   logical
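
As a quick check of the 50GB figure, the sector numbers for partition 2 can be converted to bytes. This is a small sketch that assumes the 512-byte logical sector size reported by parted above; the helper name is hypothetical.

```shell
#!/bin/sh
# sectors_to_bytes COUNT
# Convert a sector count to bytes, assuming 512-byte logical sectors.
sectors_to_bytes() {
    echo $(( $1 * 512 ))
}

# Partition 2 from the table: Start 390625280s, End 488282111s.
start=390625280
end=488282111
size=$(( end - start + 1 ))    # the End sector is inclusive

echo "size in sectors: $size"                          # 97656832, matching the Size column
echo "size in bytes:   $(sectors_to_bytes "$size")"    # 50000297984, i.e. about 50 GB
```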

Trying to determine which partition is provoking the failure,
I'm removing #6 first:

  parted -s /dev/sdb rm 6

Booting the newest kernel still fails.
I booted back into the usable kernel, then repeated this for partition 5, then 3, then finally 2.
It was only after removing partition #2 that the latest kernel managed to boot.

So now I have copied that 50GB partition to a regular file, made it sparse, tar'd and compressed it down to 32MB.  You can get a copy from  http://et.redhat.com/~meyering/sdb2.img.tar.xz (and if your distro doesn't package xz yet, prod your friendly lzma maintainer. xz is the new name for lzma: http://tukaani.org/xz/)

Given all that, you should be able to untar and copy the result to the same sectors as listed above, with the already-attached boot sector, and then reproduce the problem.
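
Copying the image back to the same sectors could be done along these lines. This is a sketch only: the function name is hypothetical, the extracted file name sdb2.img is an assumption about the tarball's contents, and writing to a real disk this way is destructive.

```shell
#!/bin/sh
# write_at_sector IMAGE TARGET START_SECTOR
# Write IMAGE into TARGET beginning at START_SECTOR (512-byte sectors).
# conv=notrunc leaves the rest of TARGET untouched when it is a regular file.
write_at_sector() {
    dd if="$1" of="$2" bs=512 seek="$3" conv=notrunc 2>/dev/null
}

# Hypothetical usage, as root, with partition 2's start sector from the table:
#   write_at_sector sdb2.img /dev/sdb 390625280
```

With bs=512, the seek value is counted in the same 512-byte units as parted's sector numbers, so the Start column can be used directly.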

Comment 8 Jim Meyering 2009-02-26 10:00:33 UTC

FYI, I recompressed it with xz -8 (which took longer), and now it's just 12MiB.
Contrast with the bzip2 -9 result, which is about 3.5 times larger:

$ du -h sdb*
43M     sdb2.img.tar.bz2
12M     sdb2.img.tar.xz

For your convenience, here's the bzip2-compressed file, too:
http://et.redhat.com/~meyering/sdb2.img.tar.bz2

Comment 9 Fedora Update System 2009-02-27 01:07:06 UTC

e2fsprogs-1.41.4-4.fc10 has been submitted as an update for Fedora 10.
http://admin.fedoraproject.org/updates/e2fsprogs-1.41.4-4.fc10

Comment 10 Jim Meyering 2009-02-27 18:18:01 UTC

I've just confirmed that the latest kernel (which now has an initrd
built from the fixed e2fsprogs-1.41.4-5.fc11.x86_64) boots without the segfault.

Thanks, Eric.

Comment 11 Fedora Update System 2009-02-28 03:26:50 UTC

e2fsprogs-1.41.4-4.fc10 has been pushed to the Fedora 10 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
  su -c 'yum --enablerepo=updates-testing update e2fsprogs'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F10/FEDORA-2009-2165

Comment 12 Fedora Update System 2009-03-18 19:05:50 UTC

e2fsprogs-1.41.4-4.fc10 has been pushed to the Fedora 10 stable repository.  If problems still persist, please make note of it in this bug report.