Bug 1142284
| Summary: | nova sometimes fails to inject. Some problem with augeas. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jaroslav Henner <jhenner> |
| Component: | openstack-nova | Assignee: | Russell Bryant <rbryant> |
| Status: | CLOSED UPSTREAM | QA Contact: | Jaroslav Henner <jhenner> |
| Severity: | medium | Priority: | unspecified |
| Version: | 5.0 (RHEL 7) | CC: | dcleal, jhenner, ndipanov, rjones, sferdjao, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 5.0 (RHEL 7) | Hardware: | x86_64 |
| OS: | Linux | Doc Type: | Bug Fix |
| Last Closed: | 2014-10-08 16:01:12 UTC | Type: | Bug |
| : | 1247638 (view as bug list) | Bug Blocks: | 1247638 |
| Attachments: | log, virt-inspector stderr | | |
Description
Jaroslav Henner
2014-09-16 13:37:48 UTC
At a short glance, it seems to me that it has been fixed in RHBZ#894241, so we should make sure to make libguestfs-1.20.1-6.el7 a dependency for our nova.

Please describe the actual problem you are having in this bug, giving full details here.

Since this is probably RHEL-specific, I will copy'n'paste the info from LP; but otherwise, everything which is fixable upstream shall be discussed upstream. I have changed the original bug report.

Created attachment 938160 [details]
log
Richard W.M. Jones (comment #6):

It seems as if the actual error is:

Error mounting /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is write-protected, mounting read-only
mount: unknown filesystem type '(null)')

What is the disk image? What happens if you do (on the nova server):

virt-inspector -a /path/to/the/disk.img -v -x

Jaroslav Henner:

(In reply to Richard W.M. Jones from comment #6)
> What is the disk image? What happens if you do (on the nova server):
> virt-inspector -a /path/to/the/disk.img -v -x

It is CirrOS, but I believe that is irrelevant, because (and this is quite new information) I was running my wannabe reproducer for hours, spawning and deleting the images, without reproducing the problem; but when I ran Tempest in parallel to that, it reproduced instantly. So I will proceed by separately checking which test, combined with my wannabe reproducer, gives a good reproduction ratio. Then I could confirm that the new augeas works. I can also try that virt-inspector.

(In reply to Richard W.M. Jones from comment #6)
> It seems as if the actual error is:
>
> Error mounting /var/lib/nova/instances/f3f5e5fd-dacd-40a6-bf93-b206a7773a38/disk with libguestfs (mount_options: /dev/sda on / (options: ''): mount: /dev/sda is write-protected, mounting read-only
> mount: unknown filesystem type '(null)')

I think this is just a consequence of

2014-09-15 10:24:26.613 20740 WARNING nova.virt.disk.vfs.guestfs [req-159279a5-7fbb-43c3-8998-3c421cd4ef73 ServersTestXML-1973446531 ServersTestXML-2024486452] Failed to close augeas aug_close: do_aug_close: you must call 'aug-init' first to initialize Augeas

Richard W.M. Jones:

(In reply to Jaroslav Henner from comment #8)
> I think this is just a consequence of
>
> 2014-09-15 10:24:26.613 20740 WARNING nova.virt.disk.vfs.guestfs [...] Failed to close augeas aug_close: do_aug_close: you must call 'aug-init' first to initialize Augeas

I don't think that is true. In the current version of nova, it catches that exception and prints it as a warning, but otherwise continues. Libguestfs emits this error only when the augeas API is used wrongly, but would be otherwise unaffected. Anyway, please run libguestfs-test-tool and the virt-inspector command outlined above, and provide the *complete* output from them.

Jaroslav Henner:

I have updated the original message with the libguestfs version. I have a reproducer. It is probably related to too little memory. When I kill a process like this:
/usr/libexec/qemu-kvm -name guestfs-4hc12m53ecr1l3a9
I get the exception. Note that the real booted VM has a process like:
/usr/libexec/qemu-kvm -name instance-00001b7a
The traceback:
ipdb> bt
/usr/lib/python2.7/site-packages/eventlet/greenthread.py(194)main()
192 def main(self, function, args, kwargs):
193 try:
--> 194 result = function(*args, **kwargs)
195 except:
196 self._exit_event.send_exception(*sys.exc_info())
/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(128)<lambda>()
126 def __call__(self, incoming):
127 incoming.acknowledge()
--> 128 yield lambda: self._dispatch_and_reply(incoming)
129
130 def _dispatch_and_reply(self, incoming):
/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(133)_dispatch_and_reply()
131 try:
132 incoming.reply(self._dispatch(incoming.ctxt,
--> 133 incoming.message))
134 except ExpectedException as e:
135 LOG.debug('Expected exception during message handling (%s)' %
/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(176)_dispatch()
174 localcontext.set_local_context(ctxt)
175 try:
--> 176 return self._do_dispatch(endpoint, method, ctxt, args)
177 finally:
178 localcontext.clear_local_context()
/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py(122)_do_dispatch()
120 for argname, arg in six.iteritems(args):
121 new_args[argname] = self.serializer.deserialize_entity(ctxt, arg)
--> 122 result = getattr(endpoint, method)(ctxt, **new_args)
123 return self.serializer.serialize_entity(ctxt, result)
124
/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py(139)inner()
137 def inner(*args, **kwargs):
138 try:
--> 139 return func(*args, **kwargs)
140 # Take advantage of the fact that we can catch
141 # multiple exception types using a tuple of
/usr/lib/python2.7/site-packages/nova/exception.py(71)wrapped()
69 # contain confidential information.
70 try:
---> 71 return f(self, context, *args, **kw)
72 except Exception as e:
73 with excutils.save_and_reraise_exception():
/usr/lib/python2.7/site-packages/nova/compute/manager.py(260)decorated_function()
258 def decorated_function(self, context, *args, **kwargs):
259 try:
--> 260 return function(self, context, *args, **kwargs)
261 except exception.UnexpectedTaskStateError as e:
262 # Note(maoy): unexpected task state means the current
/usr/lib/python2.7/site-packages/nova/compute/manager.py(327)decorated_function()
325 event_name, instance_uuid):
326
--> 327 function(self, context, *args, **kwargs)
328
329 return decorated_function
/usr/lib/python2.7/site-packages/nova/compute/manager.py(290)decorated_function()
288 def decorated_function(self, context, *args, **kwargs):
289 try:
--> 290 return function(self, context, *args, **kwargs)
291 except exception.InstanceNotFound:
292 raise
/usr/lib/python2.7/site-packages/nova/compute/manager.py(2102)run_instance()
2100 admin_password, is_first_time, node, instance,
2101 legacy_bdm_in_spec)
-> 2102 do_run_instance()
2103
2104 def _try_deallocate_network(self, context, instance,
/usr/lib/python2.7/site-packages/nova/openstack/common/lockutils.py(249)inner()
247 LOG.debug(_('Got semaphore / lock "%(function)s"'),
248 {'function': f.__name__})
--> 249 return f(*args, **kwargs)
250 finally:
251 LOG.debug(_('Semaphore / lock released "%(function)s"'),
/usr/lib/python2.7/site-packages/nova/compute/manager.py(2101)do_run_instance()
2099 filter_properties, requested_networks, injected_files,
2100 admin_password, is_first_time, node, instance,
-> 2101 legacy_bdm_in_spec)
2102 do_run_instance()
2103
/usr/lib/python2.7/site-packages/nova/compute/manager.py(1209)_run_instance()
1207 request_spec, filter_properties, requested_networks,
1208 injected_files, admin_password, is_first_time, node,
-> 1209 instance, image_meta, legacy_bdm_in_spec)
1210 notify("end", msg=_("Success"), network_info=network_info)
1211
/usr/lib/python2.7/site-packages/nova/compute/manager.py(1329)_build_instance()
1327 network_info, block_device_info,
1328 injected_files, admin_password,
-> 1329 set_access_ip=set_access_ip)
1330 except (exception.InstanceNotFound,
1331 exception.UnexpectedDeletingTaskStateError):
/usr/lib/python2.7/site-packages/nova/compute/manager.py(393)decorated_function()
391 kwargs['migration'] = migration
392
--> 393 return function(self, context, *args, **kwargs)
394
395 return decorated_function
/usr/lib/python2.7/site-packages/nova/compute/manager.py(1738)_spawn()
1736 injected_files, admin_password,
1737 network_info,
-> 1738 block_device_info)
1739 except Exception:
1740 with excutils.save_and_reraise_exception():
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2282)spawn()
2280 block_device_info=block_device_info,
2281 files=injected_files,
-> 2282 admin_pass=admin_password)
2283 xml = self.to_xml(context, instance, network_info,
2284 disk_info, image_meta,
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2733)_create_image()
2731 'instance is not supported'), instance=instance)
2732 self._inject_data(
-> 2733 instance, network_info, admin_pass, files, suffix)
2734
2735 if CONF.libvirt.virt_type == 'uml':
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py(2569)_inject_data()
2567 partition=target_partition,
2568 use_cow=CONF.use_cow_images,
-> 2569 mandatory=('files',))
2570 except Exception as e:
2571 with excutils.save_and_reraise_exception():
> /usr/lib/python2.7/site-packages/nova/virt/disk/api.py(360)inject_data()
358 inject_val = locals()[inject]
359 import ipdb; ipdb.set_trace()
--> 360 if inject_val:
361 raise
362 LOG.warn(_('Ignoring error injecting data into image '
ipdb> e
NovaException(u"Error mounting /var/lib/nova/instances/6fa0e6d0-8c13-4b2f-b2dd-9558e7490f40/disk with libguestfs (guestfs_launch failed.\nThis usually means the libguestfs appliance failed to start or crashed.\nSee http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs\nor run 'libguestfs-test-tool' and post the *complete* output into a\nbug report or message to the libguestfs mailing list.)",)
My guess is that this problem happens when there are several attempts to boot VMs at once. They pass through the scheduler because the scheduler, IMHO, doesn't decrease the available memory by the memory it has already promised. It probably schedules the VM to a host which, at that moment, appears to have enough free memory. But when some VMs are already booting, they may not have consumed much of their RAM yet, resulting in over-filling the host.
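The race described above can be sketched as follows. This is a hypothetical illustration, not nova's scheduler code; the host size, flavor size, and the two admission policies are all assumptions, chosen only to show how admitting VMs on *instantaneous* free memory overcommits a host, while tracking claims does not:

```python
# Hypothetical sketch of the suspected scheduler race: placement decisions
# use the host's *current* free RAM, ignoring RAM promised to booting VMs.
HOST_RAM_MB = 4096   # assumed host capacity
VM_RAM_MB = 2048     # assumed flavor size

class Host:
    def __init__(self, ram_mb):
        self.ram_mb = ram_mb
        self.used_mb = 0  # RAM actually consumed by running guests

    def free_mb(self):
        return self.ram_mb - self.used_mb

def schedule_naive(host, requests):
    """Admit a VM whenever the host *currently* looks free enough.

    Booting VMs have not faulted in their memory yet (used_mb is only
    bumped later), so every request passes the check."""
    admitted = []
    for req in requests:
        if host.free_mb() >= req:
            admitted.append(req)  # admitted, but RAM not yet consumed
    return admitted

def schedule_with_claims(host, requests):
    """Admit a VM only if capacity minus already-claimed RAM suffices."""
    claimed = 0
    admitted = []
    for req in requests:
        if host.ram_mb - host.used_mb - claimed >= req:
            claimed += req
            admitted.append(req)
    return admitted

host = Host(HOST_RAM_MB)
requests = [VM_RAM_MB] * 3  # three boots racing through the scheduler

naive = schedule_naive(host, requests)
claimed = schedule_with_claims(host, requests)
print(len(naive), sum(naive))      # 3 6144 -> 6144 MB admitted on a 4096 MB host
print(len(claimed), sum(claimed))  # 2 4096 -> third boot correctly rejected
```

Under the naive policy all three boots land on the 4096 MB host, and whichever qemu process allocates last (here, the libguestfs appliance) is the one that fails.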
When nova tries to inject the metadata into the VM, it calls libguestfs in inject_data():
try:
    # Note(mrda): Test if the image exists first to short circuit errors
    os.stat(image)
    fs = vfs.VFS.instance_for_image(image, fmt, partition)
    fs.setup()
except Exception as e:
    ...
and that somehow fails while allocating memory or using the RAM-backed tmpfs. The real reason gets thrown away somewhere (inside libguestfs?) and nova just sees that libguestfs failed. There is no OOM-killer message because the OOM killer just didn't take action. It seems the kernel simply didn't allow libguestfs, or some child of it, to allocate the memory.
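The traceback above stops in inject_data(), where a failure is fatal only for items listed in `mandatory` (here `mandatory=('files',)`); everything else is logged with "Ignoring error injecting data into image" and skipped. A simplified sketch of that pattern (not nova's actual code; `image_exists` and the IOError stand in for the os.stat()/fs.setup() failure):

```python
# Simplified sketch (not nova's actual code) of the error handling visible
# in the traceback: injection failures are fatal only for mandatory items.
def inject_data(image_exists, items, mandatory=()):
    injected = []
    for name, value in items.items():
        try:
            if not image_exists:
                # stand-in for os.stat(image) / fs.setup() failing
                raise IOError("image not found")
            injected.append(name)
        except Exception:
            if name in mandatory:
                raise  # e.g. mandatory=('files',) in the traceback above
            # non-mandatory: "Ignoring error injecting data into image"
    return injected

# A failing optional item is silently skipped...
print(inject_data(False, {"net": "cfg"}))  # []
# ...but a failing mandatory item propagates, as in the traceback.
try:
    inject_data(False, {"files": ["/etc/hosts"]}, mandatory=("files",))
except IOError as e:
    print("raised:", e)
```

This is why the NovaException surfaces here at all: 'files' was mandatory, so the libguestfs launch failure could not be downgraded to a warning.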
Richard W.M. Jones:

(In reply to Jaroslav Henner from comment #11)
> The real reason gets thrown away somewhere (inside of libguestfs ?)

libguestfs extensively logs everything, but currently Nova throws it all away. There is a bug open to fix this:

https://bugs.launchpad.net/nova/+bug/1279857

which I believe Sahid is working on.

Jaroslav Henner:

Created attachment 939233 [details]
virt-inspector stderr
attaching the virt-inspector output. Stdout follows:
<?xml version="1.0"?>
<operatingsystems/>
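The empty `<operatingsystems/>` element above means virt-inspector found no operating system it recognizes in the image (which, per the next comment, is the CirrOS-detection gap of bug 1045450, not a corrupt disk). A quick way to check such output programmatically, assuming only that stdout is the XML shown above:

```python
# Parse virt-inspector's stdout: an empty <operatingsystems/> root element
# means no OS was detected in the inspected disk image.
import xml.etree.ElementTree as ET

stdout = '<?xml version="1.0"?>\n<operatingsystems/>'
root = ET.fromstring(stdout)
oses = root.findall("operatingsystem")
print(root.tag, len(oses))  # operatingsystems 0 -> nothing detected
```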
Richard W.M. Jones:

(In reply to Jaroslav Henner from comment #13)
> Created attachment 939233 [details]
> virt-inspector stderr
>
> attaching the virt-inspector output. Stdout follows:
>
> <?xml version="1.0"?>
> <operatingsystems/>

There's no problem in the virt-inspector debug output. The fact that CirrOS is not recognized is bug 1045450, which is fixed in >= 1.26 (I have checked, and libguestfs 1.27.49 still recognizes CirrOS 0.3.2 & 0.3.3, so we have not had any upstream regression).

As to the original Nova problem, you're going to need to get the full debug output when the failure happens. I do not think there is any other way to diagnose this.

http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

Sahid Ferdjaoui:

(In reply to Richard W.M. Jones from comment #12)
> libguestfs extensively logs everything, but currently Nova throws
> it all away. There is a bug open to fix this:
>
> https://bugs.launchpad.net/nova/+bug/1279857
>
> which I believe Sahid is working on.

The patch is waiting for reviews here:
https://review.openstack.org/#/c/124700/

Russell Bryant:

(In reply to Sahid Ferdjaoui from comment #15)
> The patch is waiting for reviews here:
> https://review.openstack.org/#/c/124700/

It looks like this problem should just be tracked upstream, so I'm going to just close the bz.