Bug 1003654

Summary: Segfault in lvmetad during boot
Product: Fedora
Component: lvm2
Version: 19
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: rainman3d2002
Assignee: Peter Rajnoha <prajnoha>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, bmarzins, bmr, dwysocha, gansalmon, heinzm, itamar, jonathan, kernel-maint, lvm-team, madhu.chinakonda, marcelo.barbosa, mcsontos, michele, msnitzer, prajnoha, prockai, rainman3d2002, zkabelac
Type: Bug
Fixed In Version: lvm2-2.02.98-13.fc19
Doc Type: Bug Fix
Last Closed: 2013-11-03 04:32:08 UTC
Attachments:
    CPU & Memory Info
    lvmdump file

Description rainman3d2002 2013-09-02 14:51:06 UTC
Created attachment 792888
CPU & Memory Info

Description of problem:


Version-Release number of selected component (if applicable): 3.10.10-200


How reproducible: Reboot


Steps to Reproduce:
1. Upgrade from 3.10.9-200
2. Reboot
3.

Actual results: System fails to boot. The screen flashes as the animated circle reaches about 3/4, and the whole thing comes to a halt. Nothing else displays, the keyboard is inoperative, I can't press Esc to see the boot log, and I can't press Alt-F2 to open a console.


Expected results: System boots


Additional info:
Had to reboot using kernel 3.10.9-200.

Attached is the CPU and memory info, if you need anything else, just ask.

Comment 1 Michele Baldessari 2013-09-03 22:08:57 UTC
Hi,

can you remove "quiet rhgb" from the grub command line and take a picture of the system messages when it hangs with 3.10.10-200?
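
A one-time way to do this (a sketch only, assuming the standard Fedora GRUB2 boot menu):

    # At the GRUB menu, highlight the 3.10.10-200 entry and press 'e';
    # on the line beginning with "linux", delete the words "rhgb quiet";
    # then press Ctrl-X (or F10) to boot once with the edited command line.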

Just to confirm: 3.10.9 works correctly, yes?

thanks,
Michele

Comment 2 rainman3d2002 2013-09-05 04:05:01 UTC
Now none of them work right. I had earlier rebooted 3.10.9 back into itself, and it had worked. Now to get these pictures, I booted 3.10.10, which didn't freeze the system, but it still didn't boot right. When I rebooted back into 3.10.9, it didn't work either. I've tried them both with those two things removed from the grub command line. Doesn't make a difference.

3.10.9 was the first kernel in F19 that actually booted correctly all the way into KDM. All the rest fail to activate most, but not all, of my logical volumes, freak out, drop me into maintenance mode where I have to type "vgchange -ay", then when that's done, I press Ctrl-D, which will finish booting, but it won't start KDM. Instead, it will freeze at the animated F. I then press Alt-F2 to get the console, login and type "startx" to get KDE to start.
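
For reference, the manual recovery sequence described above, as typed at the prompts (a sketch only; vgchange -ay activates every volume group it can find):

    # at the maintenance-mode shell:
    vgchange -ay     # activate all volume groups
    # press Ctrl-D to let the boot finish;
    # when it freezes at the animated F, press Alt-F2, log in, then:
    startx           # start KDE manually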

I had never been able to boot F17 into runlevel 5, so eventually I gave up and switched it to runlevel 3 instead. But it never failed to activate all of my logical volumes the way F19 does every time. Well, you know, not until my hardware blew up. :)

It seems to work when you first install the OS, but the process usually breaks with the very first kernel upgrade. It was like that in F17, too. I don't know about F18, as I never installed that version. My F19 install, by the way, is a fresh one: I simply replaced the OS, leaving my nonsystem logical data volumes intact.

I will post pictures as soon as my cell phone syncs up with Google. It seems to be having issues at the moment.

Comment 3 rainman3d2002 2013-09-05 04:10:58 UTC
Is it important that something called "rngd" has failed? I've never noticed this one before.

[root@System45 ~]# systemctl status rngd.service
rngd.service - Hardware RNG Entropy Gatherer Daemon
   Loaded: loaded (/usr/lib/systemd/system/rngd.service; enabled)
   Active: failed (Result: exit-code) since Thu 2013-09-05 00:40:46 ADT; 28min ago
  Process: 1225 ExecStart=/sbin/rngd -f (code=exited, status=1/FAILURE)

Comment 5 Michele Baldessari 2013-09-07 18:40:50 UTC
Hi,

thanks for the photos. This is not a kernel issue per se. It's an lvm issue. You get:
lvmetad[....]: segfault at ....

in your boot logs, which is the reason systemd is failing to mount your /dev/mapper/LVM3-home LV.
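
One way to confirm this yourself after rebooting into a working kernel (a sketch; journalctl only covers previous boots if persistent journald logging is enabled, otherwise check /var/log/messages):

    # with persistent journald logging, search the previous boot:
    journalctl -b -1 | grep -i segfault
    # or, with rsyslog:
    grep -i 'lvmetad.*segfault' /var/log/messages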

So this is likely either https://bugzilla.redhat.com/show_bug.cgi?id=1000894
or https://bugzilla.redhat.com/show_bug.cgi?id=1003278

I suggest you look at the troubleshooting tips in those BZs, try to work out which one affects you, and then mark this bug as a duplicate of the relevant lvm bug.

hth,
Michele

Comment 6 rainman3d2002 2013-09-09 23:07:03 UTC
Okay, but why do some kernels boot normally without experiencing this issue, while others don't? I'm talking about all the kernels released in F19.

Comment 7 Michele Baldessari 2013-09-10 08:19:07 UTC
Could be timing, could be kernel changes triggering the lvmetad crash (which should not happen anyway), could be random. With the evidence here, we need to fix lvmetad, not the kernel.

Comment 8 Josh Boyer 2013-09-18 20:42:02 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 9 rainman3d2002 2013-09-19 23:31:58 UTC
I installed 3.11.1-200 this morning as part of the normal update, and it booted fine, activating all of my logical volumes AND booting into runlevel 5. But I'm not sure this issue is actually fixed, as I have seen this pattern numerous times before: one kernel will suddenly work while the 2 or 3 following won't, then the next one will, and so on. It started immediately upon installing F19 for the first time. It booted normally, but as soon as I did the first system update, including the kernel update, this pattern began.

Comment 10 Michele Baldessari 2013-09-20 07:41:06 UTC
Again, this case needs to be closed or filed against lvm. The kernel has nothing to do with it. Your issue is that lvmetad is crashing, and it should not, no matter what kernel version you have.

Comment 11 rainman3d2002 2013-09-20 17:16:50 UTC
FYI for whoever at the LVM team is looking at this: having to reboot today to fix a network issue has brought back the original problem, in that my logical volumes (except, it seems, the ones named with default values like "vg_system45_lv??") aren't reactivating on boot, and I'm forced to enter Maintenance Mode, type in "vgchange -ay" and then press Ctrl-D to finish the boot. This doesn't boot into runlevel 5. Instead, I press Alt-F2, log in at the console and use the startx command to get into KDE.

Comment 12 Marian Csontos 2013-09-24 08:03:06 UTC
Hi, please add more details, at least the following:

1. what's the LVM2 version?

    rpm -q lvm2

2. output of:

    systemctl status lvm2-lvmetad.service

3. attach the file produced by 'lvmdump' after successful boot
4. if there is a coredump (e.g. in /var/spool/abrt/), post the stack trace
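
A sketch of pulling a stack trace out of such a coredump (the abrt directory name here is hypothetical; adjust it to whatever actually appears under /var/spool/abrt/):

    # print a backtrace from the core file non-interactively:
    gdb -batch -ex bt /usr/sbin/lvmetad /var/spool/abrt/ccpp-*/coredump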

Comment 13 Marian Csontos 2013-09-24 08:16:42 UTC
(In reply to Michele Baldessari from comment #5)
> Hi,
> 
> thanks for the photos. This is not a kernel issue per se. It's an lvm issue.
> You get:
> lvmetad[....]: segfault at ....
> 
> in your boot logs which is the reason systemd is failing to mount your
> /dev/mapper/LVM3-home LV
> 
> So this is likely either https://bugzilla.redhat.com/show_bug.cgi?id=1000894
> or https://bugzilla.redhat.com/show_bug.cgi?id=1003278
> 
> I suggest you look at the troubleshooting tips in those BZs, try to work
> out which one affects you, and then mark this bug as a duplicate of the
> relevant lvm bug.

Stop trying different kernels and accept that this is not a kernel issue but a timing bug in lvmetad. If there is one kernel where it happens more frequently, boot that one.

Then try to determine whether it is one of the above BZs, as Michele asked. So far you have given us very few clues, and without them there is nothing we can do but close the bug with INSUFFICIENT_DATA as the resolution.


Comment 14 rainman3d2002 2013-09-25 11:40:41 UTC
Created attachment 802739
lvmdump file

Comment 15 rainman3d2002 2013-09-25 11:45:15 UTC
1.
[root@System45 ~]# rpm -q lvm2
lvm2-2.02.98-12.fc19.x86_64


2.
[root@System45 ~]# systemctl status lvm2-lvmetad.service
lvm2-lvmetad.service - LVM2 metadata daemon
   Loaded: loaded (/usr/lib/systemd/system/lvm2-lvmetad.service; disabled)
   Active: active (running) since Sat 2013-09-21 09:23:04 ADT; 3 days ago
     Docs: man:lvmetad(8)
 Main PID: 1172 (lvmetad)
   CGroup: name=systemd:/system/lvm2-lvmetad.service
           `-1172 /usr/sbin/lvmetad

Sep 21 09:23:04 System45.localdomain systemd[1]: Stopped LVM2 metadata daemon.
Sep 21 09:23:04 System45.localdomain systemd[1]: Starting LVM2 metadata daemon...


3. See attachment

4. There is no core dump file

The reason I keep trying different kernels is that some of them seem to work, some don't, and some seem to work for a while before forcing me to reactivate my logical volumes manually with each reboot.

Comment 16 Marian Csontos 2013-09-25 12:38:49 UTC
Thanks for the info.

So to summarize for prajnoha: just 2 local disks with 16 + 24 partitions (GPT) and there are about as many LVs as there are PVs.

It seems you are adding work for yourself by using a partition per LV and then extending LVs by adding partitions to the VG, roughly the pattern sketched below.
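
A rough illustration of that pattern (a sketch only; the partition name is hypothetical, and LVM3 stands in for one of your VGs):

    pvcreate /dev/sdb7                # turn a spare partition into a PV
    vgextend LVM3 /dev/sdb7           # add the new PV to the existing VG
    lvextend -L +20G /dev/LVM3/home   # grow an LV into the newly added space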

Reading metadata areas from the dozens of PVs you have must lag somewhat at boot time, but it still should not stop the system from activating LVs.

But in the messages file attached in the lvmdump I see dozens of fsck processes running at once, which definitely may cause problems: after all the checks start, there is nothing in the log for 14 minutes, which may be hitting a udev timeout.

    Sep 25 08:22:02 System45 systemd[1]: Started File System Check on /dev/LVM3/home.
    Sep 25 08:36:18 System45 dbus-daemon[1259]: dbus[1259]: [system] Activating via systemd: service name='net.reactivated.Fprint' unit='fprintd.service'

What filesystems are on those volumes? Checking some of them is quite an expensive operation.

How much memory does the system have? Please post the output of `free`.

If there are any filesystems with useless data (HoldSpc? Are those just phony filesystems to hold space?), just remove them.

If there are filesystems you do not need at boot, do not automount them (add noauto to the options in their /etc/fstab entries); see the example below.
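
For example, a hypothetical /etc/fstab entry (the mount point is made up; "noauto" skips mounting at boot, and the 0 in the last field also skips the boot-time fsck):

    /dev/LVM3/HoldStuff  /mnt/holdstuff  ext4  defaults,noauto  0 0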

Comment 17 rainman3d2002 2013-09-27 09:58:30 UTC
             total       used       free     shared    buffers     cached
Mem:       7918048    1634544    6283504          0      63672     703244
-/+ buffers/cache:     867628    7050420
Swap:      8388600          0    8388600

Just rebooted, didn't have any issues this time.

My partition layout is a holdover from older days when I had limited disk space. I guess I never really thought about rearranging things. Will need another hard drive for that, though. The Hold* partitions really do hold my data; I just never bothered giving them real names. But you're right, I can probably deactivate at least HoldStuff, or maybe move it to an external hard drive.

It must be pointed out, however, that given that my physical volume/logical volume/partition layout hasn't really changed since F17, I don't see why I should suddenly have these problems under F19 when I never did under F17. F17 had its own share of problems, but never this.

Comment 18 Zdenek Kabelac 2013-10-22 13:08:08 UTC
This bug more or less sounds like a duplicate of bug #1016322.

Comment 19 Fedora Update System 2013-10-31 08:32:14 UTC
lvm2-2.02.98-13.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/lvm2-2.02.98-13.fc19

Comment 20 Fedora Update System 2013-11-01 03:57:34 UTC
Package lvm2-2.02.98-13.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing lvm2-2.02.98-13.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-20436/lvm2-2.02.98-13.fc19
then log in and leave karma (feedback).

Comment 21 Fedora Update System 2013-11-03 04:32:08 UTC
lvm2-2.02.98-13.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 rainman3d2002 2013-11-04 13:58:07 UTC
I now have 2.02.98-13. On first reboot after the update, it failed again and I had to do "vgchange -ay" so it would boot, but as soon as I rebooted again, it worked properly. Will update after next reboot.

Comment 23 rainman3d2002 2013-11-06 00:57:18 UTC
Rebooted this morning for a kernel update; it booted normally without issues. Hopefully this new tradition will continue... :)

Comment 24 rainman3d2002 2013-11-17 12:51:07 UTC
Happened again this morning. Had to reboot for a kernel update and got dumped into maintenance mode, where I did 'vgchange -ay', but when I pressed Ctrl-D, it finished booting normally into runlevel 5.

Comment 25 rainman3d2002 2013-12-07 14:10:48 UTC
It's still doing it, only this time when I did 'vgchange -ay', it didn't boot into rl-5, only into rl-3. This happened immediately after a kernel update to 3.11.10-200.fc19.x86_64.

Comment 26 rainman3d2002 2013-12-26 04:16:50 UTC
As of the kernel 3.12.5-200 update 4 days ago, it has booted correctly.

  LVM version:     2.02.98(2) (2012-10-15)
  Library version: 1.02.77 (2012-10-15)
  Driver version:  4.26.0