Bug 2223410

Summary: Ceilometer gets kicked out of libvirt socket after nova changes permissions
Product: Red Hat OpenStack
Component: openstack-ceilometer
Version: 18.0 (Zed)
Reporter: Juan Larriba <jlarriba>
Assignee: Jaromír Wysoglad <jwysogla>
QA Contact: Leonid Natapov <lnatapov>
Docs Contact: mgeary <mgeary>
CC: apevec, mrunge
Status: ON_DEV
Severity: high
Priority: high
Keywords: Triaged
Target Milestone: beta
Target Release: 18.0
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Juan Larriba 2023-07-17 17:23:52 UTC
In RHOSP18 next-gen, when the libvirt_* containers of Nova start, they reset the permissions on everything inside /run/libvirt. Ceilometer needs access to /run/libvirt/virtqemud-sock-ro to get libvirt metrics.

When a libvirt_* container restarts, Ceilometer starts failing with:

libvirt: XML-RPC error : Failed to connect socket to '/var/run/libvirt/virtqemud-sock-ro': Permission denied

because the permissions have changed on disk. This can be fixed by reconnecting to the socket.

Ceilometer must detect when this happens, handle the error, and then do one of two things:

- Attempt to reconnect
- Fail and exit

In the second option, the container that is running Ceilometer will automatically restart, triggering a reconnection.
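
For illustration only, here is a minimal sketch of the second option (a bounded number of reconnection attempts, then exit). It is not ceilometer's actual code and only assumes the libvirt-python bindings; the URI, retry limits and helper names are placeholders.

import sys
import time

import libvirt

LIBVIRT_URI = 'qemu:///system'  # the exact URI/socket is deployment-specific
MAX_RETRIES = 3                 # bounded number of reconnection attempts
RETRY_DELAY = 5                 # seconds to wait between attempts


def get_connection():
    """Open a read-only connection to the libvirt daemon."""
    return libvirt.openReadOnly(LIBVIRT_URI)


def connect_or_exit():
    """Try to (re)connect a few times, then exit so the container restarts."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            conn = get_connection()
            if conn is not None and conn.isAlive():
                return conn
        except libvirt.libvirtError as exc:
            # Covers errors such as "Failed to connect socket to
            # '/var/run/libvirt/virtqemud-sock-ro': Permission denied".
            print('libvirt connection attempt %d/%d failed: %s'
                  % (attempt, MAX_RETRIES, exc), file=sys.stderr)
        time.sleep(RETRY_DELAY)
    # Second option from above: give up and exit; the container is restarted
    # automatically and remounts the freshly recreated /run/libvirt.
    sys.exit(1)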

Comment 1 Matthias Runge 2023-07-19 11:59:35 UTC
This is a permission issue in libvirt, not in ceilometer.

Comment 2 Juan Larriba 2023-07-19 13:05:43 UTC
Yes, it is an issue in libvirt, but Ceilometer should be robust enough to survive it by trying to reconnect a number of times, or by dying gracefully if reconnection is impossible. Sitting there forever, doing nothing except logging hundreds of lines, is not the best reaction to a change of permissions on the libvirt socket (even if it is caused by an external entity).

Comment 3 Jaromír Wysoglad 2023-07-26 06:49:01 UTC
I looked into the issue. I found the place in the ceilometer code where the error gets logged [0]. As you can see, ceilometer is actually already trying to reconnect, but it doesn't succeed. I opened a shell inside the running ceilometer container and found that I can't even ls the /run/libvirt folder (I get permission denied).

I think that when the nova and libvirt* containers restart, they also delete and recreate the whole /run/libvirt folder on the host, while inside the ceilometer container we probably still have the old, deleted folder mounted.

One solution might be to mount a folder one level higher (/run in this case), but when I tried that, I got an error saying: "SELinux relabeling of /run is not allowed". Even if it worked, I'm not sure sharing /run between the host and the container is a good idea.

Another possibility is to just exit ceilometer and stop the container, as Juan suggested. It would get restarted automatically, the right folder would get mounted inside it, and everything would work until the libvirt* containers restart again. The issue I have with this is that, as I understand it, libvirt isn't the only thing the compute agent watches. From just browsing the code, it looks to me like ceilometer should be able to get some disk or network metrics even without libvirt, for example. I'm not sure we would be able to control how often we restart ceilometer. If I just put exit() after a failed attempt to reconnect to the libvirt socket, we could end up in a situation where there is some real issue with libvirt and we just keep restarting in a loop, and we wouldn't get any metrics at all (while without restarting we might at least get the non-libvirt ones).

Juan, Matthias, any suggestions about how to proceed?


A note about what's happening with the libvirt containers and about reproducing the issue: it recurs after some amount of time, but I haven't found a pattern in how long it takes. Sometimes it was minutes, sometimes multiple hours. Each time, the libvirt* and nova containers on the compute node get restarted, and this set of events can be seen in OCP at the same time [1].

[0] https://opendev.org/openstack/ceilometer/src/commit/091d2dac126cb7abed0e4961ea802967102d0db8/ceilometer/compute/virt/libvirt/utils.py#L94
[1] https://paste.opendev.org/show/b2qHCHwWwdYSQLNvuDXw/

Comment 4 Juan Larriba 2023-07-26 09:02:54 UTC
For me, restarting in a loop when Ceilometer is unable to connect to libvirt is perfectly acceptable and even desired behavior for a container-based deployment.

Comment 5 Juan Larriba 2023-07-26 09:07:08 UTC
It is true that nova is doing something funny, since it is the operator that triggers the restart of the containers and the re-creation of the directory or its permissions. That might get fixed in the future. However, I insist that the current Ceilometer reaction to such an event is not acceptable.

When Ceilometer enters this state, it does not send anything. The non-libvirt metrics you highlighted, such as disk or network metrics, are not sent either. So if the Ceilometer agent cannot connect to libvirt, it is simply useless.

Comment 6 Matthias Runge 2023-07-26 19:09:51 UTC
Thank you Jaromir for looking into this.

Busy waiting until a permission-denied error gets fixed is unacceptable IMHO, and restarting in a loop is busy waiting in my view. What's more, it won't clearly show why metrics are missing, but you will see containers creating load.

Comment 7 Jaromír Wysoglad 2023-07-27 14:42:43 UTC
After a discussion today we decided to implement a compromise. I'll add a new configuration parameter and leave the current behavior (ceilometer tries to reconnect to the socket indefinitely) as the default. If the new parameter is set, ceilometer will try to reconnect a few times; if that doesn't succeed, it will exit and the container will get restarted, which resolves this issue.
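
For illustration only, a rough sketch of how such a parameter could be wired up with oslo.config. The option name (libvirt_max_reconnect_attempts) and the integration point are assumptions made for this sketch, not taken from the actual patch.

import sys

from oslo_config import cfg
from oslo_log import log

LOG = log.getLogger(__name__)

OPTS = [
    cfg.IntOpt('libvirt_max_reconnect_attempts',  # hypothetical option name
               default=0,
               help='If set to a positive number, exit the compute agent '
                    'after this many failed reconnection attempts to the '
                    'libvirt socket so that its container gets restarted. '
                    'The default of 0 keeps the current behaviour of '
                    'retrying indefinitely.'),
]


def register_opts(conf):
    conf.register_opts(OPTS)


def handle_reconnect_failure(conf, failed_attempts):
    """Decide whether to keep retrying or give up after a failed reconnect."""
    limit = conf.libvirt_max_reconnect_attempts
    if limit and failed_attempts >= limit:
        LOG.error('Could not reconnect to the libvirt socket after %d '
                  'attempts, exiting so the container gets restarted.',
                  failed_attempts)
        sys.exit(1)
    # limit == 0 (default): fall through and keep retrying indefinitely,
    # which matches the current behaviour.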