Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1142284

Summary: nova sometimes fails to inject. Some problem with augeas.
Product: Red Hat OpenStack
Reporter: Jaroslav Henner <jhenner>
Component: openstack-nova
Assignee: Russell Bryant <rbryant>
Status: CLOSED UPSTREAM
QA Contact: Jaroslav Henner <jhenner>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 5.0 (RHEL 7)
CC: dcleal, jhenner, ndipanov, rjones, sferdjao, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 5.0 (RHEL 7)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1247638 (view as bug list)
Environment:
Last Closed: 2014-10-08 16:01:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1247638
Attachments:
  log (flags: none)
  virt-inspector stderr (flags: none)

Description Jaroslav Henner 2014-09-16 13:37:48 UTC
Description of problem:
see LP, or read below:

Tempest test
tempest.api.compute.images.test_list_image_filters.ListImageFiltersTest
is quite often failing to boot the VM. In the log I can see that nova was trying to inject things using augeas, but it failed. It seems like there is some race.

Version-Release number of selected component (if applicable):
libguestfs-1.22.6-22.el7.x86_64
augeas-libs.x86_64 1.1.0-12.el7 @rhel7-released
openstack-nova-api.noarch 2014.1.2-1.el7ost @puddle
openstack-nova-common.noarch 2014.1.2-1.el7ost @puddle
openstack-nova-compute.noarch 2014.1.2-1.el7ost @puddle
python-nova.noarch 2014.1.2-1.el7ost @puddle

How reproducible:
5/50 test runs, using tempest.

So far no success reproducing it by booting 10 VMs with a file and key injected, waiting for ERROR or ACTIVE, then deleting them.


Steps to Reproduce:
I don't have any good reproducer yet, but I think something like

1. nova boot --key foo --file /foo=/etc/fstab --flavor m1.tiny --image foo.img foo

running it in a loop should trigger it.
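The loop above can be sketched as a small script. The instance names, the `repro-` prefix, and the ACTIVE/ERROR polling are assumptions for illustration, not part of the original report (note also that the nova CLI spells the key flag `--key-name`):

```shell
# Hypothetical stress loop: boot N instances with key and file injection,
# wait until none is still in BUILD, then report any that landed in ERROR.
stress_boot() {
    n=${1:-10}
    i=1
    while [ "$i" -le "$n" ]; do
        nova boot --key-name foo --file /foo=/etc/fstab \
             --flavor m1.tiny --image foo.img "repro-$i"
        i=$((i + 1))
    done
    while nova list | grep -q BUILD; do sleep 5; done   # poll until settled
    nova list | grep ERROR || echo "no failures this round"
}
# stress_boot 10
```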


Actual results:
machine in ERROR

Expected results:
machine ACTIVE

Additional info:

Comment 1 Jaroslav Henner 2014-09-16 13:45:33 UTC
At a glance, it seems this was fixed in RHBZ#894241, so we should make sure our nova depends on libguestfs-1.20.1-6.el7 or later.

Comment 2 Richard W.M. Jones 2014-09-16 13:54:38 UTC
Please describe the actual problem you are having in this bug, giving
full details here.

Comment 3 Jaroslav Henner 2014-09-16 17:10:39 UTC
Since this is probably RHEL-specific, I will copy and paste the info from LP; otherwise, everything that is fixable upstream should be discussed upstream.

Comment 4 Jaroslav Henner 2014-09-16 17:19:30 UTC
I have changed the original bug report.

Comment 5 Jaroslav Henner 2014-09-16 17:19:57 UTC
Created attachment 938160 [details]
log

Comment 6 Richard W.M. Jones 2014-09-16 17:28:34 UTC
It seems as if the actual error is:

Error mounting /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is write-protected, mounting read-only
mount: unknown filesystem type '(null)')

What is the disk image?  What happens if you do (on the nova server):

virt-inspector -a /path/to/the/disk.img -v -x

Comment 7 Jaroslav Henner 2014-09-16 19:54:14 UTC
(In reply to Richard W.M. Jones from comment #6)
> It seems as if the actual error is:
> 
> Error mounting
> /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with
> libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is
> write-protected, mounting read-only
> mount: unknown filesystem type '(null)')
> 
> What is the disk image?  What happens if you do (on the nova server):
> 
> virt-inspector -a /path/to/the/disk.img -v -x

It is cirros, but I believe that is irrelevant because (and this is quite new information) I had been running my would-be reproducer for hours, spawning and deleting instances without a single reproduction, yet when I ran Tempest in parallel with it, I got a reproduction instantly. So I will check separately which test, run alongside my would-be reproducer, gives a good reproduction ratio. Then I could confirm whether the new augeas works.

I also can try that virt-inspector.

Comment 8 Jaroslav Henner 2014-09-16 19:56:39 UTC
(In reply to Richard W.M. Jones from comment #6)
> It seems as if the actual error is:
> 
> Error mounting
> /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with
> libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is
> write-protected, mounting read-only
> mount: unknown filesystem type '(null)')

I think this is just a consequence of

2014-09-15 10:24:26.613 20740 WARNING nova.virt.disk.vfs.guestfs [req-159279a5-7fbb-43c3-8998-3c421cd4ef73 ServersTestXML-1973446531 ServersTestXML-2024486452] Failed to close augeas aug_close: do_aug_close: you must call 'aug-init' first to initialize Augeas

> 
> What is the disk image?  What happens if you do (on the nova server):
> 
> virt-inspector -a /path/to/the/disk.img -v -x

Comment 9 Richard W.M. Jones 2014-09-16 20:37:46 UTC
(In reply to Jaroslav Henner from comment #8)
> (In reply to Richard W.M. Jones from comment #6)
> > It seems as if the actual error is:
> > 
> > Error mounting
> > /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with
> > libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is
> > write-protected, mounting read-only
> > mount: unknown filesystem type '(null)')
> 
> I think this is just a consequence of
> 
> 2014-09-15 10:24:26.613 20740 WARNING nova.virt.disk.vfs.guestfs
> [req-159279a5-7fbb-43c3-8998-3c421cd4ef73 ServersTestXML-1973446531
> ServersTestXML-2024486452] Failed to close augeas aug_close: do_aug_close:
> you must call 'aug-init' first to initialize Augeas

I don't think that is true.  In the current version of nova, it
catches that exception and prints it as a warning, but otherwise
continues.  Libguestfs emits this error only when the augeas API
is used wrongly, but would be otherwise unaffected.

Anyway, please run libguestfs-test-tool and the virt-inspector
command outlined above, and provide the *complete* output from them.
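The full-output request above can be scripted in one go. The wrapper name and paths below are placeholders, but `LIBGUESTFS_DEBUG` and `LIBGUESTFS_TRACE` are the documented environment variables for complete appliance logging (see the FAQ link in this bug):

```shell
# Sketch: capture complete libguestfs diagnostics into one file.
collect_guestfs_debug() {
    img=$1; out=$2
    {
        libguestfs-test-tool                 # does the appliance launch at all?
        LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 \
            virt-inspector -a "$img" -v -x   # full inspection trace
    } > "$out" 2>&1
}
# collect_guestfs_debug /var/lib/nova/instances/<uuid>/disk /tmp/guestfs-debug.log
```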

Comment 10 Jaroslav Henner 2014-09-17 11:40:18 UTC
I have updated the original message with the libguestfs version.

Comment 11 Jaroslav Henner 2014-09-17 19:59:59 UTC
I have a reproducer. It is probably related to running low on memory. When I kill a process like this:

/usr/libexec/qemu-kvm -name guestfs-4hc12m53ecr1l3a9

I get the exception. Note that the real booted VM's process looks like:

/usr/libexec/qemu-kvm -name instance-00001b7a 


The traceback:
ipdb> bt
  /usr/lib/python2.7/site-packages/eventlet/greenthread.py(194)main()
    192     def main(self, function, args, kwargs):
    193         try:
--> 194             result = function(*args, **kwargs)
    195         except:
    196             self._exit_event.send_exception(*sys.exc_info())

  /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(128)<lambda>()
    126     def __call__(self, incoming):
    127         incoming.acknowledge()
--> 128         yield lambda: self._dispatch_and_reply(incoming)
    129 
    130     def _dispatch_and_reply(self, incoming):

  /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(133)_dispatch_and_reply()
    131         try:
    132             incoming.reply(self._dispatch(incoming.ctxt,
--> 133                                           incoming.message))
    134         except ExpectedException as e:
    135             LOG.debug('Expected exception during message handling (%s)' %

  /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(176)_dispatch()
    174                 localcontext.set_local_context(ctxt)
    175                 try:
--> 176                     return self._do_dispatch(endpoint, method, ctxt, args)
    177                 finally:
    178                     localcontext.clear_local_context()

  /usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(122)_do_dispatch()
    120         for argname, arg in six.iteritems(args):
    121             new_args[argname] = self.serializer.deserialize_entity(ctxt, arg)
--> 122         result = getattr(endpoint, method)(ctxt, **new_args)
    123         return self.serializer.serialize_entity(ctxt, result)
    124 

  /usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py(139)inner()
    137         def inner(*args, **kwargs):
    138             try:
--> 139                 return func(*args, **kwargs)
    140             # Take advantage of the fact that we can catch
    141             # multiple exception types using a tuple of

  /usr/lib/python2.7/site-packages/nova/exception.py(71)wrapped()
     69             # contain confidential information.
     70             try:
---> 71                 return f(self, context, *args, **kw)
     72             except Exception as e:
     73                 with excutils.save_and_reraise_exception():

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(260)decorated_function()
    258     def decorated_function(self, context, *args, **kwargs):
    259         try:
--> 260             return function(self, context, *args, **kwargs)
    261         except exception.UnexpectedTaskStateError as e:
    262             # Note(maoy): unexpected task state means the current

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(327)decorated_function()
    325                                          event_name, instance_uuid):
    326 
--> 327             function(self, context, *args, **kwargs)
    328 
    329     return decorated_function

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(290)decorated_function()
    288     def decorated_function(self, context, *args, **kwargs):
    289         try:
--> 290             return function(self, context, *args, **kwargs)
    291         except exception.InstanceNotFound:
    292             raise

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(2102)run_instance()
   2100                     admin_password, is_first_time, node, instance,
   2101                     legacy_bdm_in_spec)
-> 2102         do_run_instance()
   2103 
   2104     def _try_deallocate_network(self, context, instance,

  /usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py(249)inner()
    247                     LOG.debug(_('Got semaphore / lock "%(function)s"'),
    248                               {'function': f.__name__})
--> 249                     return f(*args, **kwargs)
    250             finally:
    251                 LOG.debug(_('Semaphore / lock released "%(function)s"'),

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(2101)do_run_instance()
   2099                     filter_properties, requested_networks, injected_files,
   2100                     admin_password, is_first_time, node, instance,
-> 2101                     legacy_bdm_in_spec)
   2102         do_run_instance()
   2103 

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(1209)_run_instance()
   1207                     request_spec, filter_properties, requested_networks,
   1208                     injected_files, admin_password, is_first_time, node,
-> 1209                     instance, image_meta, legacy_bdm_in_spec)
   1210             notify("end", msg=_("Success"), network_info=network_info)
   1211 

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(1329)_build_instance()
   1327                                        network_info, block_device_info,
   1328                                        injected_files, admin_password,
-> 1329                                        set_access_ip=set_access_ip)
   1330         except (exception.InstanceNotFound,
   1331                 exception.UnexpectedDeletingTaskStateError):

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(393)decorated_function()
    391             kwargs['migration'] = migration
    392 
--> 393         return function(self, context, *args, **kwargs)
    394 
    395     return decorated_function

  /usr/lib/python2.7/site-packages/nova/compute/manager.py(1738)_spawn()
   1736                               injected_files, admin_password,
   1737                               network_info,
-> 1738                               block_device_info)
   1739         except Exception:
   1740             with excutils.save_and_reraise_exception():

  /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2282)spawn()
   2280                            block_device_info=block_device_info,
   2281                            files=injected_files,
-> 2282                            admin_pass=admin_password)
   2283         xml = self.to_xml(context, instance, network_info,
   2284                           disk_info, image_meta,

  /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2733)_create_image()
   2731                            'instance is not supported'), instance=instance)
   2732             self._inject_data(
-> 2733                 instance, network_info, admin_pass, files, suffix)
   2734 
   2735         if CONF.libvirt.virt_type == 'uml':

  /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2569)_inject_data()
   2567                                  partition=target_partition,
   2568                                  use_cow=CONF.use_cow_images,
-> 2569                                  mandatory=('files',))
   2570             except Exception as e:
   2571                 with excutils.save_and_reraise_exception():

> /usr/lib/python2.7/site-packages/nova/virt/disk/api.py(360)inject_data()
    358             inject_val = locals()[inject]
    359             import ipdb; ipdb.set_trace()
--> 360             if inject_val:
    361                 raise
    362         LOG.warn(_('Ignoring error injecting data into image '

ipdb> e
NovaException(u"Error mounting /var/lib/nova/instances/6fa0e6d0-8c13-4b2f-b2dd-9558e7490f40/disk with libguestfs (guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nSee http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nor run 'libguestfs-test-tool' and post the *complete* output into a\nbug report or message to the libguestfs mailing list.)",)


My guess is that this problem happens when there are several concurrent attempts to boot VMs. They all pass the scheduler because, IMHO, the scheduler doesn't decrease the available memory by the memory each pending VM requires; it schedules a VM to whichever host appears to have enough free memory at that moment. But VMs that are still booting may not have consumed much of their RAM yet, so the host ends up overfilled.
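The check-then-claim race described above can be shown with a toy sketch (an illustration of the hypothesis only, not nova's scheduler code): two 768 MB boot requests both pass an admission check against the same snapshot of free RAM, because neither has claimed its memory yet.

```shell
free_mb=1024
fits() { [ "$free_mb" -ge "$1" ] && echo yes || echo no; }   # scheduler's check
claim() { free_mb=$((free_mb - $1)); }                       # RAM consumed only later

a=$(fits 768)         # first request admitted
b=$(fits 768)         # second request checked before the first claim lands
claim 768; claim 768
echo "$a $b free=${free_mb}mb"   # both admitted; host is now oversubscribed
```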

When nova tries to inject the metadata to VM, it calls libguestfs in the inject_data()

    try:
        # Note(mrda): Test if the image exists first to short circuit errors
        os.stat(image)
        fs = vfs.VFS.instance_for_image(image, fmt, partition)
        fs.setup()
    except Exception as e:
      ...


and that somehow fails while allocating memory or using the RAM-backed tmpfs. The real reason gets thrown away somewhere (inside libguestfs?) and nova only sees that libguestfs failed. There is no OOM-killer message because the OOM killer didn't take action; it seems the kernel simply didn't allow libguestfs, or one of its children, to allocate the memory.

Comment 12 Richard W.M. Jones 2014-09-17 20:10:25 UTC
(In reply to Jaroslav Henner from comment #11)
> The 
> The real reason gets thrown away somewhere (inside of libguestfs ?)

libguestfs extensively logs everything, but currently Nova throws
it all away.  There is a bug open to fix this:

https://bugs.launchpad.net/nova/+bug/1279857

which I believe Sahid is working on.

Comment 13 Jaroslav Henner 2014-09-19 11:32:58 UTC
Created attachment 939233 [details]
virt-inspector stderr

attaching the virt-inspector output. Stdout follows:

<?xml version="1.0"?>
<operatingsystems/>

Comment 14 Richard W.M. Jones 2014-09-19 12:06:39 UTC
(In reply to Jaroslav Henner from comment #13)
> Created attachment 939233 [details]
> virt-inspector stderr
> 
> attaching the virt-inspector output. Stdout follows:
> 
> <?xml version="1.0"?>
> <operatingsystems/>

There's no problem in the virt-inspector debug output.  The fact
that CirrOS is not recognized is bug 1045450, which is fixed
in >= 1.26 (I have checked and libguestfs 1.27.49 still recognizes
CirrOS 0.3.2 & 0.3.3 so we have not had any upstream regression).

As to the original Nova problem, you're going to need to
get the full debug output when the failure happens.  I do not
think there is any other way to diagnose this.

http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

Comment 15 Sahid Ferdjaoui 2014-10-02 09:20:19 UTC
(In reply to Richard W.M. Jones from comment #12)
> (In reply to Jaroslav Henner from comment #11)
> > The 
> > The real reason gets thrown away somewhere (inside of libguestfs ?)
> 
> libguestfs extensively logs everything, but currently Nova throws
> it all away.  There is a bug open to fix this:
> 
> https://bugs.launchpad.net/nova/+bug/1279857
> 
> which I believe Sahid is working on.

The patch is waiting for reviews here:
   https://review.openstack.org/#/c/124700/

Comment 16 Russell Bryant 2014-10-08 16:01:12 UTC
(In reply to Sahid Ferdjaoui from comment #15)
> (In reply to Richard W.M. Jones from comment #12)
> > (In reply to Jaroslav Henner from comment #11)
> > > The 
> > > The real reason gets thrown away somewhere (inside of libguestfs ?)
> > 
> > libguestfs extensively logs everything, but currently Nova throws
> > it all away.  There is a bug open to fix this:
> > 
> > https://bugs.launchpad.net/nova/+bug/1279857
> > 
> > which I believe Sahid is working on.
> 
> The patch is waiting for reviews here:
>    https://review.openstack.org/#/c/124700/

It looks like this problem should just be tracked upstream, so I'm going to just close the bz.