Bug 1585207 - oVirt node upgrade fails in %post script
Summary: oVirt node upgrade fails in %post script
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-node
Classification: oVirt
Component: Installation & Update
Version: 4.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ryan Barry
QA Contact: Yaning Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-01 14:00 UTC by Rob Sanders
Modified: 2018-06-04 02:33 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-01 19:59:20 UTC
oVirt Team: Node
Embargoed:



Description Rob Sanders 2018-06-01 14:00:57 UTC
I've upgraded oVirt Node from 4.1.8 to 4.2.3.1 and it failed during the %post scriptlet:

  Installing : ovirt-node-ng-image-update-4.2.3.1-1.el7.noarch                                                                                                                      1/1 
warning: %post(ovirt-node-ng-image-update-4.2.3.1-1.el7.noarch) scriptlet failed, exit status 1
Non-fatal POSTIN scriptlet failure in rpm package ovirt-node-ng-image-update-4.2.3.1-1.el7.noarch
  Verifying  : ovirt-node-ng-image-update-4.2.3.1-1.el7.noarch                                                                                                                      1/1 

Installed:
  ovirt-node-ng-image-update.noarch 0:4.2.3.1-1.el7                                                                                                                                     

Complete!


imgbase log:
https://gist.github.com/sandersr/8ab1a0048ab8ceb94a3c1f1934ab6962


It looks like this base layer has been removed:
ovirt-node-ng-4.1.8-0.20171211.0

Even though it no longer exists, the server still boots fine into the 4.1.8 image (presumably because the writable ovirt-node-ng-4.1.8-0.20171211.0+1 layer is still there)!

lvdisplay:
https://gist.github.com/sandersr/0879b0bfa52051a7b8c955f0254362ec


imgbase w
You are on ovirt-node-ng-4.1.8-0.20171211.0+1


imgbase layout
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/imgbased/__main__.py", line 53, in <module>
    CliApplication()
  File "/usr/lib/python2.7/site-packages/imgbased/__init__.py", line 82, in CliApplication
    app.hooks.emit("post-arg-parse", args)
  File "/usr/lib/python2.7/site-packages/imgbased/hooks.py", line 120, in emit
    cb(self.context, *args)
  File "/usr/lib/python2.7/site-packages/imgbased/plugins/core.py", line 182, in post_argparse
    print(layout.dumps())
  File "/usr/lib/python2.7/site-packages/imgbased/plugins/core.py", line 210, in dumps
    return self.app.imgbase.layout()
  File "/usr/lib/python2.7/site-packages/imgbased/imgbase.py", line 154, in layout
    return self.naming.layout()
  File "/usr/lib/python2.7/site-packages/imgbased/naming.py", line 109, in layout
    tree = self.tree(lvs)
  File "/usr/lib/python2.7/site-packages/imgbased/naming.py", line 224, in tree
    bases[img.base.nvr].layers.append(img)
KeyError: <NVR ovirt-node-ng-4.1.8-0.20171211.0 />
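
For what it's worth, here's a minimal, self-contained sketch of why the tree build falls over -- my own simplified reconstruction, not the actual imgbased code:

# Hypothetical, simplified version of the tree building in imgbased/naming.py.
class Base(object):
    def __init__(self, nvr):
        self.nvr = nvr
        self.layers = []

class Layer(object):
    def __init__(self, nvr):
        self.nvr = nvr
        # "ovirt-node-ng-4.1.8-0.20171211.0+1" -> base NVR without the "+1"
        self.base_nvr = nvr.rsplit("+", 1)[0]

# LV names roughly as lvm reports them here: the 4.1.8 base LV is gone,
# but its +1 layer is still present.
lv_names = [
    "ovirt-node-ng-4.1.8-0.20171211.0+1",
    "ovirt-node-ng-4.1.9-0.20180124.0",
    "ovirt-node-ng-4.1.9-0.20180124.0+1",
]

bases = dict((n, Base(n)) for n in lv_names if "+" not in n)
layers = [Layer(n) for n in lv_names if "+" in n]

for img in layers:
    # The 4.1.8 base was removed, so its NVR is not a key in `bases`,
    # and this lookup raises KeyError -- the same failure as above.
    bases[img.base_nvr].layers.append(img)

So it seems "imgbase layout" simply can't cope with an orphaned +1 layer whose base LV has disappeared.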


I cannot reinstall ovirt-node-ng-image-update-4.2.3.1-1.el7.noarch because imgbase will fail with:
...
2018-06-01 13:28:51,489 [DEBUG] (MainThread) Exception!   Using default stripesize 64.00 KiB.
  Logical Volume "ovirt-node-ng-4.2.3.1-0.20180530.0" already exists in volume group "onn"
...
subprocess.CalledProcessError: Command '['lvcreate', '--thin', '--virtualsize', u'155508015104B', '--name', 'ovirt-node-ng-4.2.3.1-0.20180530.0', u'onn/pool00']' returned non-zero exit status 5


Is there a way to forcefully remove the 4.1.8 layer? (imgbase doesn't provide a --force switch, so maybe there is a manual way.)

Shouldn't the %post script perform a cleanup before trying to create the LV? A simple if-it-exists-remove-it check would fix this part.
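
To illustrate what I mean, a rough sketch of such a cleanup using plain lvm commands wrapped in Python (the helper name and structure are mine, not imgbased's):

import subprocess

# Rough sketch only -- not the actual imgbased %post code.
VG = "onn"
LV = "ovirt-node-ng-4.2.3.1-0.20180530.0"
SIZE = "155508015104B"

def lv_exists(vg, lv):
    # "lvs VG/LV" exits non-zero when the LV does not exist.
    return subprocess.call(["lvs", "%s/%s" % (vg, lv)]) == 0

# Leftover LV from a previously failed %post run? Remove it first.
if lv_exists(VG, LV):
    subprocess.check_call(["lvremove", "-f", "%s/%s" % (VG, LV)])

# Then create it exactly the way the failing command does.
subprocess.check_call(["lvcreate", "--thin", "--virtualsize", SIZE,
                       "--name", LV, "%s/pool00" % VG])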

Since the "Non-fatal POSTIN scriptlet failure" is non-fatal, it's easy to overlook that there is a problem (chances are it was already present when I upgraded to 4.1.9, but I didn't notice at the time). The failed scriptlet also leaves the installation unfinished (no kernel files, no configuration copy, etc.), so booting into the new image is not straightforward.

What's the best way forward to recover this node without complete re-installation?

Comment 1 Rob Sanders 2018-06-01 15:05:21 UTC
I managed to recover my system by analysing the imgbased debug log and replaying its actions to recreate the missing LV. In short: recreate the missing base LV with the name and size imgbased expects, tag it imgbased:base, put an ext4 filesystem on it, set it read-only with activation skip like the other bases, then remove the half-created 4.2.3.1 LVs and reinstall the update package. Pasting the exact commands here for reference in case someone else hits a similar problem. Chances are not all of the steps are needed, but this worked for me:

# lvcreate --thin --virtualsize 155508015104B --name ovirt-node-ng-4.1.8-0.20171211.0 onn/pool00                         
  Using default stripesize 64.00 KiB.
  WARNING: Sum of all thin volume sizes (<1.02 TiB) exceeds the size of thin pool onn/pool00 and the size of whole volume group (220.00 GiB)!
  For thin pool auto extension activation/thin_pool_autoextend_threshold should be below 100.
  Logical volume "ovirt-node-ng-4.1.8-0.20171211.0" created.

# lvchange --addtag imgbased:base onn/ovirt-node-ng-4.1.8-0.20171211.0
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --permission r onn/ovirt-node-ng-4.1.8-0.20171211.0          
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --setactivationskip y onn/ovirt-node-ng-4.1.8-0.20171211.0         
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --activate n onn/ovirt-node-ng-4.1.8-0.20171211.0           

# lvchange --permission rw onn/ovirt-node-ng-4.1.8-0.20171211.0
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --activate y onn/ovirt-node-ng-4.1.8-0.20171211.0 --ignoreactivationskip


# mkfs.ext4 -E discard /dev/onn/ovirt-node-ng-4.1.8-0.20171211.0
mke2fs 1.42.9 (28-Dec-2013)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=16 blocks
9494528 inodes, 37965824 blocks
1898291 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2187329536
1159 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
        4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     

# lvchange --permission r onn/ovirt-node-ng-4.1.8-0.20171211.0                        
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --setactivationskip y onn/ovirt-node-ng-4.1.8-0.20171211.0         
  Logical volume onn/ovirt-node-ng-4.1.8-0.20171211.0 changed.

# lvchange --activate n onn/ovirt-node-ng-4.1.8-0.20171211.0           

# imgbase layout
ovirt-node-ng-4.1.8-0.20171211.0
 +- ovirt-node-ng-4.1.8-0.20171211.0+1
ovirt-node-ng-4.1.9-0.20180124.0
 +- ovirt-node-ng-4.1.9-0.20180124.0+1
ovirt-node-ng-4.2.3.1-0.20180530.0
 +- ovirt-node-ng-4.2.3.1-0.20180530.0+1

# lvremove onn/ovirt-node-ng-4.2.3.1-0.20180530.0+1
Do you really want to remove active logical volume onn/ovirt-node-ng-4.2.3.1-0.20180530.0+1? [y/n]: y
  Logical volume "ovirt-node-ng-4.2.3.1-0.20180530.0+1" successfully removed

# lvremove onn/ovirt-node-ng-4.2.3.1-0.20180530.0
Do you really want to remove active logical volume onn/ovirt-node-ng-4.2.3.1-0.20180530.0? [y/n]: y
  Logical volume "ovirt-node-ng-4.2.3.1-0.20180530.0" successfully removed

# yum reinstall ovirt-node-ng-image-update -y

Comment 2 Ryan Barry 2018-06-01 19:59:20 UTC
I'm glad you were able to resolve it, but any idea how the system got into this state in the first place? This is a totally new report to me; I've never seen anything like it...

It looks like one of the LVs was removed but LVM still had it cached somewhere.

Closing for now, since you worked around it, but still responding to comments...

