Bug 416811 - Handle xendomains restore failure gracefully
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.1
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Michal Novotny
Virtualization Bugs
Depends On:
Blocks: 514498
Reported: 2007-12-08 15:55 EST by Warren Togami
Modified: 2014-02-02 17:36 EST
CC List: 6 users

See Also:
Fixed In Version: xen-3.0.3-109.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 17:15:55 EST


Attachments
Invalid restore image files handling (1.32 KB, patch)
2010-03-17 10:03 EDT, Michal Novotny
Updated version of my patch (5.11 KB, patch)
2010-03-23 06:51 EDT, Michal Novotny

Description Warren Togami 2007-12-08 15:55:30 EST
xen-3.0.3-41.el5

[root@virthost save]# ls -l /var/lib/xen/save
total 412536
-rwxr-xr-x 1 root root       911 Dec  8  2007 test
-rwxr-xr-x 1 root root       911 Dec  8 14:16 togami

Somehow my /var/lib/xen/save ended up with clearly invalid restore images. 
Thereafter service xendomains failed to auto-start these virtual machines in a
completely non-obvious way.  Nothing appears in logs, you only see that it
failed to start your xen guests at bootup.  The only way to see why it is
failing (without console access) is to run it manually:

[root@virthost xen]# service xendomains start
Restoring Xen domains: testError: Restore failed
Usage: xm restore <CheckpointFile>

Restore a domain from a saved state.
! togamiError: Restore failed
Usage: xm restore <CheckpointFile>

Restore a domain from a saved state.
!.

/etc/init.d/xendomains:
start()
{
    if [ -f $LOCKFILE ]; then
        echo -n "xendomains already running (lockfile exists)"
        return;
    fi

    saved_domains=" "
    if [ "$XENDOMAINS_RESTORE" = "true" ] &&
       contains_something "$XENDOMAINS_SAVE"
    then
        mkdir -p $(dirname "$LOCKFILE")
        touch $LOCKFILE
        echo -n "Restoring Xen domains:"
        saved_domains=`ls $XENDOMAINS_SAVE`
        for dom in $XENDOMAINS_SAVE/*; do
            echo -n " ${dom##*/}"
            xm restore $dom
            if [ $? -ne 0 ]; then
                rc_failed $?
                echo -n '!'
            else
                # mv $dom ${dom%/*}/.${dom##*/}
                rm $dom
            fi
        done
        echo .

If xm restore fails, it just exits with error messages but does nothing about
it.  This must be handled in some cleaner way?

Possible Options:
1) Delete the invalid restore image, xm create normally thereafter.
2) Move the invalid restore image, xm create normally thereafter.
3) This is a bad idea.  Don't attempt to boot the guest because it is unsafe.

What should we do?
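
A note on the loop quoted above: by the time `rc_failed $?` runs, `$?` holds the exit status of the `[ ... ]` test (which is 0), not that of `xm restore`. Below is a minimal sketch of capturing the status first and counting failures. `restore_cmd` and `restore_all` are hypothetical names, and `restore_cmd` is only a stand-in for `xm restore` that treats empty files as invalid so the loop can be tried without a Xen host. This is not the shipped script.

```shell
#!/bin/sh
# Hypothetical stand-in for `xm restore`: succeeds only for non-empty files.
restore_cmd() {
    [ -s "$1" ]
}

# Try to restore every file in the given save directory.
# Returns non-zero if at least one restore failed.
restore_all() {
    save_dir=$1
    failed=0
    for dom in "$save_dir"/*; do
        [ -f "$dom" ] || continue          # also covers an empty directory
        rc=0
        restore_cmd "$dom" || rc=$?        # capture the status right away
        if [ "$rc" -ne 0 ]; then
            echo "restore of ${dom##*/} failed (exit status $rc)" >&2
            failed=$((failed + 1))
        else
            rm -f -- "$dom"                # restored images are removed, as in the original loop
        fi
    done
    [ "$failed" -eq 0 ]
}
```

What to do with the failed image (options 1 to 3 above) is left open here; the sketch only makes the failure visible and keeps the real exit status.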
Comment 1 RHEL Product and Program Management 2007-12-08 16:44:14 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 RHEL Product and Program Management 2008-03-11 15:37:17 EDT
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.
Comment 5 Michal Novotny 2009-07-29 09:19:10 EDT
I am not sure what's going on; we should check xend.log. Could you please provide /var/log/xen/xend.log? It should contain the necessary information.

Thanks,
Michal
Comment 6 Michal Novotny 2009-07-29 09:26:56 EDT
Just for clarification: this calls `xm restore <CheckpointFile>`, which internally calls `xc_restore`, a user-space program written in C. Hopefully xend.log has enough logging to tell us what went wrong. One more thing we could do is delete the file automatically when we get the error message, but I am afraid this is not a good solution.
Comment 7 Warren Togami 2009-07-29 09:48:35 EDT
I have nothing further to add here.
Comment 8 Chris Lalancette 2009-07-29 10:12:46 EDT
(In reply to comment #7)
> I have nothing further to add here.  

Yeah, it's been open for 2 years without any movement, I just figured I'd give it a shake.  Just to be clear, I'm assuming you don't have a setup to test this anymore?  If that's the case, it's fine, I just want to know where we are.

Michal, we should probably try to reproduce this ourselves (since it seems pretty easy to reproduce), and if it's now fixed, just close the BZ out.

Chris Lalancette
Comment 9 Michal Novotny 2009-07-29 11:04:26 EDT
(In reply to comment #8)
> (In reply to comment #7)
> > I have nothing further to add here.  
> 
> Yeah, it's been open for 2 years without any movement, I just figured I'd give
> it a shake.  Just to be clear, I'm assuming you don't have a setup to test this
> anymore?  If that's the case, it's fine, I just want to know where we are.
> 
> Michal, we should probably try to reproduce this ourselves (since it seems
> pretty easy to reproduce), and if it's now fixed, just close the BZ out.
> 
> Chris Lalancette  

Ok, right. Anyway, could Warren tell us how the images ended up invalid? Are there steps to reproduce the invalid images?

Also, one more thing about that - you shouldn't use /var/lib/xen/images for saving/restoring domain files. This is for automatic domain save/restore.

Warren, another question is about the version of the Xen user-space tools. I saw nothing in comment #0 about the xend version, and I remember making changes to the xendomains scripts, so the user-space xend version would be appreciated, as well as steps to reproduce this one. So, do we need to try every possible way ourselves, or do you remember some of the steps to reproduce?

Thanks,
Michal
Comment 10 Warren Togami 2009-07-29 11:29:43 EDT
It doesn't matter how the invalid image happened.  It DOES happen and we need to better handle the failure case.

By the date of the filed bug, it seems this was RHEL5 GA.

What is wrong with option "1) Delete the invalid restore image, xm create normally thereafter."?
Comment 11 Michal Novotny 2009-07-29 11:59:01 EDT
What I would like to know is whether you can try with the newest RPMs, or at least RHEL 5.3 GA. In fact, I think this should be fixed by now. I would also like to know how I can reproduce this one. Can I create a bogus checkpoint file (e.g. by using touch or dd), and would that be enough to reproduce it?
Comment 12 Michal Novotny 2009-09-23 05:25:47 EDT
Anything new with this one ?
Comment 14 Michal Novotny 2010-03-17 10:03:58 EDT
Created attachment 400772 [details]
Invalid restore image files handling

The attached patch updates the xendomains script to support deleting or renaming invalid restore image files. The behavior is set by the FAILTYPE variable in the script; the possible values are 'delete' and 'rename'.

Michal
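
For readers without access to the attachment, here is a rough sketch of what FAILTYPE-style handling could look like. Only the FAILTYPE name and its two values come from the comment above. The function name `handle_invalid_image`, the default value, and the "anything else leaves the file alone" fallback are assumptions, and the actual patch may differ.

```shell
#!/bin/sh
# FAILTYPE and its values 'delete'/'rename' are from the comment above;
# the default chosen here is an assumption.
FAILTYPE=${FAILTYPE:-rename}

# Hypothetical helper: deal with an image that failed to restore.
# Expects a path that includes a directory component.
handle_invalid_image() {
    image=$1
    case "$FAILTYPE" in
        delete)
            rm -f -- "$image"
            ;;
        rename)
            # A leading dot hides the file from a plain `dir/*` glob and from `ls`.
            mv -f -- "$image" "${image%/*}/.${image##*/}"
            ;;
        *)
            :   # assumption: any other value leaves the image in place
            ;;
    esac
}
```

With FAILTYPE=rename, calling `handle_invalid_image /var/lib/xen/save/test` would leave `/var/lib/xen/save/.test` behind, so the image is skipped on the next start but is still available for inspection.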
Comment 16 Michal Novotny 2010-03-23 06:51:03 EDT
Created attachment 402025 [details]
Updated version of my patch

Hi,
this is the new version of my patch, with logging to the default logging facility added using the `logger` command.

Michal
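
A small sketch of how logging through `logger` might look, for illustration only. The message wording follows the /var/log/messages lines quoted further down in this report. The function names and the stderr fallback are assumptions and are not taken from the patch.

```shell
#!/bin/sh
# Hypothetical helper: build the log line for a failed restore.
restore_failure_message() {
    printf 'Domain restore failed for domain %s\n' "$1"
}

# Hypothetical helper: send the message to syslog tagged "xendomains".
# If logger is missing or fails, fall back to stderr (assumption).
log_restore_failure() {
    msg=$(restore_failure_message "$1")
    logger -t xendomains "$msg" 2>/dev/null || echo "xendomains: $msg" >&2
}
```

Calling `log_restore_failure test` on a host with a running syslog daemon should produce an entry tagged `xendomains`, which survives after the boot messages have scrolled away.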
Comment 21 Jinxin Zheng 2010-08-30 04:21:28 EDT
I can reproduce this by making an 'empty' image that will cause the restore to fail:

$ dd if=/dev/zero of=/etc/xen/auto/test bs=1M count=100

With the old version of xen, we see an error message at bootup, but it quickly disappears. We found nothing in the log files.

After updating to the new version, we have three options to handle this situation: 1. do nothing; 2. rename the problem image so that the xendomains script does not see it the next time it is started; 3. delete the problem image.

We also get persistent log entries in /var/log/messages:

Aug 30 21:05:05 dhcp-93-222 xendomains: Domain restore failed for domain test
Aug 30 21:05:05 dhcp-93-222 xendomains: Invalid restore image /var/lib/xen/save/test renamed to /var/lib/xen/save/.test

As a result, this bug could be moved to VERIFIED.
Comment 23 errata-xmlrpc 2011-01-13 17:15:55 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html
