Created attachment 1397038 [details]
a list of nova servers with bad metadata

Description of problem:

I am working with a Red Hat academic partner. They have deployed OSP12 manually (no director). Some of the compute instances have an empty "<nova:owner>" element in the metadata embedded in the libvirt XML. This in turn causes ceilometer to blow up. For example:

    <metadata>
      <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
        <nova:package version="14.0.3-9.el7ost"/>
        <nova:name>wm-solr</nova:name>
        <nova:creationTime>2017-05-24 15:29:24</nova:creationTime>
        <nova:flavor name="m1.large">
          <nova:memory>8192</nova:memory>
          <nova:disk>80</nova:disk>
          <nova:swap>0</nova:swap>
          <nova:ephemeral>0</nova:ephemeral>
          <nova:vcpus>4</nova:vcpus>
        </nova:flavor>
        <nova:owner/>
        <nova:root type="image" uuid="a332ad63-8b38-4700-b818-b39fa69233a9"/>
      </nova:instance>
    </metadata>

Ceilometer fails on this because it tries to retrieve the owner/user and owner/project elements, which are absent.

Out of 144 servers, about 14 instances exhibit this problem. The servers still exist in the Nova inventory, as do the owning project and user. These instances were all created with previous versions of OSP. I've attached the result of a simple validation that I ran across the compute nodes; for each server exhibiting this problem, it shows which version of nova was used to create it.

Version-Release number of selected component (if applicable):
openstack-nova-compute-16.0.2-9.el7ost.noarch

lyarwood points out that https://review.openstack.org/#/c/399679/ landed upstream recently and may be relevant to this issue.
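For anyone wanting to run a similar check, here is a minimal sketch of how the empty-owner condition could be detected. It is not the script used to produce the attachment; the function name `owner_is_incomplete` is hypothetical, and it assumes the domain XML has already been captured as a string (for example from `virsh dumpxml`).

```python
import xml.etree.ElementTree as ET

# Namespace used by nova for its libvirt metadata (taken from the XML above).
NOVA_NS = {"nova": "http://openstack.org/xmlns/libvirt/nova/1.0"}


def owner_is_incomplete(domain_xml: str) -> bool:
    """Return True if <nova:owner> is missing or lacks a user or project child.

    Hypothetical helper for illustration; not part of nova or ceilometer.
    """
    root = ET.fromstring(domain_xml)
    owner = root.find(".//nova:owner", NOVA_NS)
    if owner is None:
        return True
    has_user = owner.find("nova:user", NOVA_NS) is not None
    has_project = owner.find("nova:project", NOVA_NS) is not None
    return not (has_user and has_project)


if __name__ == "__main__":
    # Trimmed-down version of the metadata shown in this report.
    sample = (
        '<domain><metadata>'
        '<nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">'
        '<nova:name>wm-solr</nova:name><nova:owner/>'
        '</nova:instance></metadata></domain>'
    )
    print("incomplete owner:", owner_is_incomplete(sample))
```

The check treats a missing user or a missing project as a problem, since ceilometer looks up both.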
Created attachment 1397039 [details]
output of openstack server event list

lyarwood suggests that the result of 'openstack server event list' for one of the servers might be of interest.
(In reply to Lars Kellogg-Stedman from comment #1)
> Created attachment 1397039 [details]
> output of openstack server event list
>
> lyarwood suggests that the result of 'openstack server event list' for one
> of the servers might be of interest.

I think this could be linked to the evacuation of the instance, but I've not attempted to reproduce it. It appears we tried to use a request context with user_name, project_id and project_name all set to None, which results in LibvirtConfigGuestMetaNovaOwner.format_dom() returning <nova:owner/>:

nova/virt/libvirt/driver.py

    def _get_guest_config_meta(self, context, instance):
        """Get metadata config for guest."""
    [..]
        if context is not None:
            ometa = vconfig.LibvirtConfigGuestMetaNovaOwner()
            ometa.userid = context.user_id
            ometa.username = context.user_name
            ometa.projectid = context.project_id
            ometa.projectname = context.project_name
            meta.owner = ometa

nova/virt/libvirt/config.py

    class LibvirtConfigGuestMetaNovaOwner(LibvirtConfigObject):
    [..]
        def format_dom(self):
            meta = super(LibvirtConfigGuestMetaNovaOwner, self).format_dom()
            if self.userid is not None and self.username is not None:
                user = self._text_node("user", self.username)
                user.set("uuid", self.userid)
                meta.append(user)
            if self.projectid is not None and self.projectname is not None:
                project = self._text_node("project", self.projectname)
                project.set("uuid", self.projectid)
                meta.append(project)
            return meta

https://review.openstack.org/#/c/399679/ actually landed in Pike and switched to using the instance object to populate these fields. I'll try to backport this to Ocata and Newton for OSP to help avoid this going forward.

For this customer, the best way to work around this now is to stop and start the instances, forcing the domain XML to be recreated with the correct owner details on Pike. Can you confirm that this resolves the issue with Ceilometer?
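To illustrate the mechanism, here is a small standalone sketch that mimics the logic of format_dom() quoted above. It is not nova code: `OwnerMeta` is a hypothetical stand-in class, it uses the standard library's ElementTree rather than nova's config classes, and it omits the nova XML namespace for brevity.

```python
import xml.etree.ElementTree as ET


class OwnerMeta:
    """Simplified stand-in for LibvirtConfigGuestMetaNovaOwner (illustration only)."""

    def __init__(self, userid=None, username=None,
                 projectid=None, projectname=None):
        self.userid = userid
        self.username = username
        self.projectid = projectid
        self.projectname = projectname

    def format_dom(self):
        meta = ET.Element("owner")
        # A child is emitted only when BOTH the id and the name are set,
        # mirroring the two conditions in the quoted nova code.
        if self.userid is not None and self.username is not None:
            user = ET.SubElement(meta, "user", uuid=self.userid)
            user.text = self.username
        if self.projectid is not None and self.projectname is not None:
            project = ET.SubElement(meta, "project", uuid=self.projectid)
            project.text = self.projectname
        return meta


if __name__ == "__main__":
    # Context with only user_id populated: neither child is emitted,
    # so the owner element comes out empty.
    empty = OwnerMeta(userid="u1").format_dom()
    print(ET.tostring(empty, encoding="unicode"))

    # Fully populated context: both children are emitted.
    full = OwnerMeta("u1", "alice", "p1", "demo").format_dom()
    print(ET.tostring(full, encoding="unicode"))
```

Note that a user_id on its own is not enough: with user_name unset the user child is dropped too, which matches the completely empty owner element seen in the report.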
FWIW, I'd also suggest following up with the Ceilometer team to handle this situation.
I will ask the customer about stopping/starting these servers. That may not be possible at this time. On the ceilometer side, I have opened https://bugs.launchpad.net/ceilometer/+bug/1749960 upstream and submitted a fix that would make ceilometer less sensitive to this sort of issue.
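As a rough idea of what "less sensitive" could look like on the consumer side, here is a sketch of a tolerant lookup that returns None for missing owner details instead of raising. This is not the patch submitted upstream; `read_owner` and the returned key names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

NOVA_NS = {"nova": "http://openstack.org/xmlns/libvirt/nova/1.0"}


def read_owner(metadata_xml: str) -> Dict[str, Optional[str]]:
    """Return the owner's user and project UUIDs, or None where absent.

    Hypothetical helper for illustration; not ceilometer code.
    """
    root = ET.fromstring(metadata_xml)

    def uuid_of(path: str) -> Optional[str]:
        node = root.find(path, NOVA_NS)
        # find() returns None for a missing element; guard before .get().
        return node.get("uuid") if node is not None else None

    return {
        "user_id": uuid_of(".//nova:owner/nova:user"),
        "project_id": uuid_of(".//nova:owner/nova:project"),
    }
```

A caller can then decide whether to skip the sample, fall back to another source for the owner, or log a warning, rather than failing outright.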
https://bugzilla.redhat.com/show_bug.cgi?id=1546176 is the bugzilla version of the upstream bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1624