Bug 232862

Summary: Anaconda exception trying to change the disk layout
Product: [Fedora] Fedora Reporter: G.Wolfe Woodbury <redwolfe>
Component: anaconda Assignee: Peter Jones <pjones>
Status: CLOSED RAWHIDE QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: rawhide CC: redhat-bugzilla, rmj, wwoods
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-03-27 16:37:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 150226    
Attachments:
Description Flags
Anaconda dump of exception - date of rawhide 2007-03-18
none
Anaconda dump of exception - date of rawhide 2007-03-23
none
Anaconda dump of exception - date of rawhide 2007-04-09 try #1 (mod default)
none
dump from try #2 (No LVM configured in layout)
none
dump from try #3 (accept default)
none
current f7t4 exception dump
none
Anaconda dump of exception - date of rawhide 2007-08-03
none

Description G.Wolfe Woodbury 2007-03-19 01:57:24 UTC
Description of problem:
  Anaconda throws an exception (report attached) when trying to adjust the LVMs
generated by a custom configuration.  Specifically, I'm simply selecting to
format some of the existing LVM pieces, and assign mount points, without making
any size or format changes (well no format changes in the LVMs - changing /boot
from ext2 to ext3.)


Version-Release number of selected component (if applicable):
anaconda-11.2.0.37-1

How reproducible:
three times in a row

Steps to Reproduce:
1. boot rawhide boot.iso
2. use custom config for disk and assign mount points, etc.
3. watch it throw exception before formatting the filesystems
  
Actual results:
see attached exception report.


Expected results:
normal installation


Additional info:
work1.private: Pentium III (Coppermine) @600MHz  512MB RAM  MSI Mobo
   etc. all standard and working with older rawhides.

Comment 1 G.Wolfe Woodbury 2007-03-19 01:57:24 UTC
Created attachment 150349 [details]
Anaconda dump of exception - date of rawhide 2007-03-18

Comment 2 G.Wolfe Woodbury 2007-03-24 01:04:38 UTC
Created attachment 150814 [details]
Anaconda dump of exception - date of rawhide 2007-03-23

Comment 3 G.Wolfe Woodbury 2007-03-24 01:07:08 UTC
See attached exception report for today's rawhide failure.
anaconda-11.2.0.40-1


Comment 4 G.Wolfe Woodbury 2007-03-31 09:16:43 UTC
With anaconda-11.2.0.41 I still get tracebacks.

More relevant, however, is that I tried an install without customizing the
layout - and it works!  The bug is centered somewhere where the code is
attempting to re-do the formatting of logical volumes.

I suspect that my next install with doing a custom layout with a zeroed disk
will work also.  It might also possibly be in doing something with a "weird"
name: I had a LV that was to be mounted on "/data2" and the disklabel for the
volume showed up as only "data2" (no leading slash).

I will experiment with this last possibility a bit in the next several installs.

My problem with debugging this is that I still don't really understand the
internals of the LVM code and utilities.  The anaconda dump may or may not
contain the failing vgchange command and arguments, which are:

     ("vgchange", "-ay", "-v")

and I have no idea what that is supposed to do, but it *feels* like the best
place on which to focus attention.

Let me know if there are specific tests you'd like me to make.

Comment 5 G.Wolfe Woodbury 2007-04-01 23:02:20 UTC
follow-up: wiping the boot track doesn't prevent the bug.
   apparently, any major changes to the default layout (like resizing root and
swap and adding separate /usr and /var LVs) result in the lvchange error
occurring.   I'm going to try a no-LVM layout and see what happens.

Comment 6 G.Wolfe Woodbury 2007-04-10 05:29:18 UTC
Created attachment 152095 [details]
Anaconda dump of exception - date of rawhide 2007-04-09 try #1 (mod default)

Comment 7 G.Wolfe Woodbury 2007-04-10 05:30:24 UTC
Created attachment 152096 [details]
dump from try #2 (No LVM configured in layout)

Comment 8 G.Wolfe Woodbury 2007-04-10 05:31:44 UTC
Created attachment 152097 [details]
dump from try #3 (accept default)

Comment 9 G.Wolfe Woodbury 2007-04-10 05:38:03 UTC
Try #1 - wipe MBR and partition table, modify the LVM and attempt install.
   FAILED

Try #2 - wipe MBR and partition table, generate custom NO-LVM layout and install
   FAILED

Try #3 - wipe MBR and partition table, accept defaults and install
   FAILED

Why is it trying to LVCHANGE a layout with NO LVM partitions?

Will try a few more variations and report.

Comment 10 G.Wolfe Woodbury 2007-04-10 07:16:21 UTC
I dragged out Knoppix and used QTParted to make my desired layout on the disc.

Loaded up the boot.iso and used "custom layout" to assign the mount points.

BOOM!

same exception.

WHY is anaconda calling lvm stuff at all?

Comment 11 G.Wolfe Woodbury 2007-04-10 08:35:34 UTC
OK,
I let anaconda (disk-druid) make its default config (wipe out Linux partitions
and create default layout) and no other changes.

It's now in the process of formatting and installing....

I'm tempted (but not ready to) look at the anaconda code and try to fathom the
python stuff to see what is wrong.  It has been too long since I've had to
really look at code.

Comment 12 G.Wolfe Woodbury 2007-04-17 05:30:18 UTC
Okay, rawhide of 2007-04-16 is still suffering from the bug.

BUT!

I know why the vgchange is being called - to "activate" the volume groups on the
system.

I also know WHY IT FAILS!

reading the man page for vgchange shows the example:

  vgchange -a y

to activate all volumes, however, the args for the anaconda call are:

  "-ay" "-v"

The command should be:

   vgchange -a  y

That is:

   "vgchange","-a","y"

Please fix the code!

Comment 13 Jeremy Katz 2007-04-17 23:17:51 UTC
"-ay" is the same as "-a y" with the lvm tools and has been for a long time

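The equivalence stated in comment 13 follows the usual getopt convention that a short option's argument may be attached to the option or given as the next word. The sketch below shows that convention with Python's standard getopt module. The option spec "a:v" is an assumption made for illustration; it is not LVM's real option table, and the LVM tools use their own parser.

```python
import getopt

# Stand-in option spec: -a takes an argument, -v is a plain flag.
# This is an illustrative assumption, not the option table used by LVM.
SPEC = "a:v"


def parse(argv):
    """Return the parsed (option, value) pairs for a list of arguments."""
    opts, _rest = getopt.getopt(argv, SPEC)
    return opts


attached = parse(["-ay", "-v"])      # the form anaconda passes
separate = parse(["-a", "y", "-v"])  # the form shown in the man page

print(attached == separate)  # True: both parse to the same option pairs
```

Under this convention the two spellings are indistinguishable to the program, which is consistent with the reporter's finding in comment 14 that the command works as written.
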
Comment 14 G.Wolfe Woodbury 2007-04-18 12:19:31 UTC
OK, I did verify that it works as written, but that merely means I don't know
why it is failing.  Are you seeing the problem at all, or am I at risk of a
WORKSFORME disposition?

Comment 15 G.Wolfe Woodbury 2007-04-24 09:02:27 UTC
Okay, as of 2007-04-23 rawhide, there is a major improvement in this situation.

Things mostly work, with a glitch coming in if the names of the VG or LVs are
changed.  Non-standard names lead to a failure to format because the names are
not properly activated (no VG name in /dev)

Leaving the names standard allows the creation of different partitioning schemes
than just the default.  I am getting some problems also with non-LVM layouts.

I suspect the problem lies not in anaconda *per se* but perhaps in the LVM or
Udev interactions.

Comment 16 G.Wolfe Woodbury 2007-04-29 05:30:07 UTC
Created attachment 153728 [details]
current f7t4 exception dump

have to remote copy the bug attachments from target machine by hand as the
"save to remote" doesn't seem to work.

Comment 17 G.Wolfe Woodbury 2007-04-29 05:34:54 UTC
This bug is occurring in F7t4 i386 DVD spin too.

"debugging" the exception and trying the command from the debugger results in a
"cannot allocate memory" error

memtest86+ doesn't find any errors in the 256MB of RAM on the work1.private machine.

Free from the VT2 console shows ~3.4Mb free.

will try activating swap before depsolving to see if OOM can be avoided.

Comment 18 Roderick Johnstone 2007-05-02 14:56:44 UTC
I'm seeing this same problem as in the original report. Mine is a kickstart
installation on x86_64 system with 512MB.

Comment 19 Chris Lumens 2007-05-09 17:27:11 UTC
Can you try a rawhide tree after they start showing back up and let me know if
this is still not working for you?  We've done some recent changes around
activating and deactivating LVM that I think should clear this issue up.  Thanks.

Comment 20 Roderick Johnstone 2007-05-14 10:14:35 UTC
This problem seems to be fixed in the current rawhide, for me.

Comment 21 Roderick Johnstone 2007-05-14 13:34:29 UTC
I think that you should ignore my previous comment (#20). What
happened was that when using the mirror at mirrorservice.org it gets
beyond where I thought it failed before (with the unhandled
exception), but the mirror is clearly incomplete (many messages about
missing packages). During the "starting install" phase it fails not
being able to open glibc-headers-2.5.90-21.x86_64.rpm.

Using the mirror at ftp.funet.fi the install fails with an unhandled exception
and LvmError: vgchange failed in the gui window. I can see right at
the top of the F3 screen that there is a message: "ERROR: Running lvm:
Cannot allocate memory".

I guess I was mistaken about where the failure occurred, and based on
the seemingly complete mirror at ftp.funet.fi, the install still
fails with 512MB.

Sorry for the confusion.

Comment 22 G.Wolfe Woodbury 2007-05-14 17:05:51 UTC
Unfortunately the rawhide of 2007-05-13 doesn't clear the issue.

Graphical mode with 256 MB RAM crashes always
Text Mode crashes if too much is added to the mix (Gnome+KDE e.g.)
Text mode works with default minus a few items

I have a couple of anacdumps here, but I'm not going to bother submitting them
unless requested since they appear very similar to the ones already on file.

Suggestion:
   if there is an existing swap partition, use it,
    then unswap before committing the partition changes
    then mkswap the new swap partition and use it before proceeding

Comment 23 Jeremy Katz 2007-05-21 22:24:42 UTC
(In reply to comment #22)
> Suggestion:
>    if there is an existing swap partition, use it,
>     then unswap before committing the partition changes
>     then mkswap the new swap partition and use it before proceeding

That's not really going to help the problem since we'll still have the data
around.  It just changes things to be more likely to hit kernel bugs ;-)

I've changed our thresholds for doing early partition commits (and thus early
swap on) to be a little bit more aggressive on x86.  x86_64 should already be in
better shape due to making the change there a week or so ago.
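
A threshold check of the kind described above might look roughly like the following sketch. The function names, the threshold value, and the sample input are all hypothetical; none of them are taken from the anaconda source, and the real thresholds are not stated in this report.

```python
def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from text in /proc/meminfo format."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def needs_early_swap(total_kb, threshold_kb):
    """True when installed RAM is below the early-commit threshold."""
    return total_kb < threshold_kb


# Hypothetical threshold for illustration only; not anaconda's value.
EXAMPLE_THRESHOLD_KB = 512 * 1024

# Made-up sample input shaped like /proc/meminfo on a roughly 256 MB machine.
sample = "MemTotal:       255496 kB\nMemFree:          3480 kB\n"

print(needs_early_swap(mem_total_kb(sample), EXAMPLE_THRESHOLD_KB))
```

Raising the threshold makes more machines take the early-commit path, which is what "more aggressive" means here.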

Comment 24 Robert Scheck 2007-05-22 22:04:06 UTC
Well...Jeremy, isn't this bug report a fixed duplicate of mine (> 390 MB for 
x86_64 installation, > 256 MB for x86 installation)? When reading it looks 
like the same problem with the same error messages...

Comment 25 G.Wolfe Woodbury 2007-05-22 23:40:09 UTC
Oh, it's the same bug in root cause.
This one is for the ix86 architecture, and yours is for the _64 architecture.

The upshot is that 256 MB is no longer sufficient for a graphical install on the
x86 plain arch unless you don't select *any* additional packages, and even then
it might bite you due to the uncertainties of the install process. 
(Deterministic? heck no.)

I was beginning to think I was the only one seeing the bug for a while, but it
has turned into a hard one to fix.

Comment 26 G.Wolfe Woodbury 2007-05-25 01:25:10 UTC
Thursday's rawhide (2007-05-24) now has a different major failure mode in low
memory situations.

Anaconda is detecting the low memory condition and activating swap early,
however as it goes to format the filesystems previously committed, it removes
the LVM entries in /dev/mapper

They were there, I looked at them with the VT02 window during software selection
and depsolving!  Then they were gone and the format command failed.

Comment 27 G.Wolfe Woodbury 2007-05-26 21:39:24 UTC
Saturday's rawhide and RC2 are a no-go for me.

The LVM entries are there during software selection and depsolving, but
disappear when anaconda goes to format the filesystems.

This is rawhide date 2007-05-26
anaconda-11.2.0.66-1

Comment 28 G.Wolfe Woodbury 2007-05-27 06:41:21 UTC
in packages.py:turnOnFilesystems()  there is no check to see if the partitioning
has been committed already and/or that swap was activated.

It looks to me as if a check against anaconda.id.fsset.isActive() should be
somewhere in there.  it should/might be similar to the sequence found in
partitioningComplete() in partitions.py

I'm struggling cause I'm not too familiar with python (yet!)

Comment 29 Will Woods 2007-05-29 01:54:23 UTC
Do you have a log / crash dump from the more recent failures?

I'm trying to reproduce your problem but I'm not sure I understand what the
actual error is now. Does the lvm stuff fail because you are out of memory, or
is turnOnFilesystems() failing because the filesystems are already on?

You're right about turnOnFilesystems() lacking a check to see if the filesystems
are active, but I'm not sure if the check is needed.

Comment 30 Will Woods 2007-05-29 02:05:04 UTC
To explain my previous comment a bit - turnOnFilesystems() does the following:

  anaconda.id.partitions.doMetaDeletes(anaconda.id.diskset)
  anaconda.id.diskset.clearDevices()
  anaconda.id.fsset.setActive(anaconda.id.diskset)
  if not anaconda.id.fsset.isActive():
      anaconda.id.diskset.savePartitions ()
  anaconda.id.fsset.checkBadblocks(anaconda.rootPath)
  if not anaconda.id.fsset.volumesCreated:
      anaconda.id.fsset.createLogicalVolumes(anaconda.rootPath)
  anaconda.id.fsset.formatSwap(anaconda.rootPath)
  anaconda.id.fsset.turnOnSwap(anaconda.rootPath)
  # This stuff doesn't happen in partitioningComplete()
  anaconda.id.fsset.makeFilesystems (anaconda.rootPath)
  anaconda.id.fsset.mountFilesystems (anaconda)

formatSwap() and turnOnSwap(), at least, are idempotent (i.e. safe to run
multiple times). I'm not sure about the rest, especially doMetaDeletes() and
clearDevices(), but we're definitely running them twice.
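
For readers unfamiliar with the term, here is a minimal sketch of what "idempotent" means for a step that runs twice. The two step functions are hypothetical stand-ins operating on a plain dictionary; they are not the anaconda methods listed above.

```python
def is_idempotent(step, state):
    """Check that applying step twice gives the same state as applying it once."""
    once = step(dict(state))
    twice = step(step(dict(state)))
    return once == twice


def turn_on_swap(state):
    # Hypothetical stand-in: setting a flag is safe to repeat.
    state["swap_on"] = True
    return state


def count_activation(state):
    # Hypothetical stand-in: a counter changes on every call.
    state["activations"] = state.get("activations", 0) + 1
    return state


print(is_idempotent(turn_on_swap, {}))      # True
print(is_idempotent(count_activation, {}))  # False
```

A step that fails this kind of check is the one to look at when a sequence is known to run more than once.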

Comment 31 G.Wolfe Woodbury 2007-05-29 03:14:30 UTC
clearDevices() is the real culprit here.

What is happening is that with low memory (256MB) swap is being committed and
activating LVM happens early.  The LVM points exist in /dev/mapper  during
software select and depsolving.  Then it goes to turnOnFilesystems and crashes
with no traceback or dump because the devices are no longer in /dev/mapper to be
available for the format commands.

Because of the early commit for swap, the turnOnFilesystems() has to check for
the devices already being in place, or else it wipes the devices and doesn't
re-activate the LVM volume groups again.

To reproduce the error, use a machine with 256 MB of RAM and make a LVM using custom
layout (with swap not in the LV but say for example, as a partition 2) it will
activate the swap and continue with software selection and depsolving.  during
this period, one can switch to VT02 and ls /dev/mapper to see that the LVM is
indeed there.  Then it finishes depsolving and gets the confirminstall, and
*BOOM* the devices are no longer around to be formatted.

Hmmmm.  what happens if I move the swap it's using into the LVM?   Still likely
to crash.  Main problem is that /home is a non-LVM partition on the drive.
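
The failure sequence described in this comment can be modelled with a toy class. This is only a sketch of the reasoning: ToyInstaller, its methods, the `guarded` flag, and the device name are hypothetical, and the model assumes the device clear is not followed by a reactivation after an early commit. It is not the actual anaconda code or the eventual fix.

```python
class ToyInstaller:
    """Toy model of the sequence in comment 31; not anaconda's real classes."""

    def __init__(self):
        self.mapper = set()     # stands in for the entries in /dev/mapper
        self.committed = False  # True once partitioning was committed early

    def activate_lvm(self):
        self.mapper.add("VolumeGroup00-LogVol00")

    def early_commit(self):
        # Low-memory path: commit partitions and activate LVM before depsolving.
        self.activate_lvm()
        self.committed = True

    def clear_devices(self):
        self.mapper.clear()

    def turn_on_filesystems(self, guarded):
        # Unguarded: always clear, and never reactivate after an early commit.
        # Guarded: skip the clear when the early commit already set things up.
        if not (guarded and self.committed):
            self.clear_devices()
            if not self.committed:
                self.activate_lvm()
        if not self.mapper:
            raise RuntimeError("no devices left to format")
        return sorted(self.mapper)


low_mem = ToyInstaller()
low_mem.early_commit()
try:
    low_mem.turn_on_filesystems(guarded=False)
except RuntimeError as err:
    print("unguarded:", err)

checked = ToyInstaller()
checked.early_commit()
print("guarded:", checked.turn_on_filesystems(guarded=True))
```

In the model, a machine that never takes the early-commit path is unaffected either way, which matches the observation that the problem only shows up on low-memory installs.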

Comment 32 G.Wolfe Woodbury 2007-05-29 08:30:21 UTC
ok, the bug only occurs if the swap partition is *not* part of the LVM volume
group.  I move the swap partition into the LVM and the installation proceeded
without a hitch.

Still a bug, now with a workaround.  Note that the default tools generated
partition layouts won't trigger this, but a user/custom/legacy layout could see
it very easily.

Specifically: I moved the swap definition from a real partition (/dev/sda3) into
the LVM definition (/dev/VolumeGroup00/LogVol00) and the installation properly
formatted all the filesystems.

Comment 33 G.Wolfe Woodbury 2007-08-05 19:37:30 UTC
This *old* bug is back!
Anaconda/lvm doesn't have enough memory on my Dell to make the filesystems with
320MB of memory.

Attaching the dump.

Comment 34 G.Wolfe Woodbury 2007-08-05 19:39:14 UTC
Created attachment 160719 [details]
Anaconda dump of exception - date of rawhide 2007-08-03

Comment 35 G.Wolfe Woodbury 2007-11-18 09:58:02 UTC
FC8 series didn't show this bug, something may have fixed it.

But then I haven't tried F8 on the old Dell yet either.

Comment 36 G.Wolfe Woodbury 2008-03-27 16:37:07 UTC
closing as fixed long since in RAWHIDE