Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1868990

Summary: Restarting nova_libvirt container does not clean up pid files
Product: Red Hat OpenStack
Reporter: Andrew Mercer <amercer>
Component: openstack-nova
Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: CLOSED NOTABUG
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Docs Contact:
Priority: high
Version: 16.1 (Train)
CC: anbs, bdobreli, dasmith, dvd, eglynn, jhakimra, jhardee, kchamart, ldenny, lseki, lyarwood, mprivozn, mschuppe, rosingh, satmakur, sbauza, sgordon, vromanso
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-16 14:44:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andrew Mercer 2020-08-14 22:53:46 UTC
Description of problem:

A customer restarted the nova_libvirt container on all computes, after which the container would not start. Rebooting a compute resolved this, and nova_libvirt then started successfully. On computes that were not rebooted, the following error appeared in the logs:

 error : virPidFileAcquirePath:369 : Failed to acquire pid file '/run/libvirtd.pid': Resource temporarily unavailable

On investigation, the /run/libvirtd.pid file still existed even though the container was stopped. We were able to do some testing in this environment and found that moving this pid file and /run/libvirt/* out of the way resulted in a successful start of the nova_libvirt container.


Version-Release number of selected component (if applicable): 16.1


How reproducible:


Steps to Reproduce:
1. Restart (or start) the nova_libvirt container; it fails to start.
2. Reboot the compute; the container starts.
3. Alternatively, remove the pid files; the container starts.

Actual results:
The nova_libvirt container does not start


Expected results:
nova_libvirt restarts successfully


Additional info:

Comment 3 Kashyap Chamarthy 2020-08-21 14:13:39 UTC
(In reply to Andrew Mercer from comment #0)
> Description of problem:
> 
> A customer restarted the nova_libvirt container on all computes after which
> the nova_libvirt container would not start. A reboot of a compute would
> resolve this and nova_libvirt would start successfully. On computes not
> rebooted, the following error was noticed in the logs:
> 
>  error : virPidFileAcquirePath:369 : Failed to acquire pid file
> '/run/libvirtd.pid': Resource temporarily unavailable

The above "failed to acquire pid file" error means that there is another 'libvirtd' process already running in the 'nova_libvirt' container.

One way to check for that is by running this command (and get the output) in the 'nova_libvirt' container: `lsof | grep /run/libvirtd.pid`.

That said, Podman seems to be killing the libvirtd process before it has shut down gracefully.  (The default timeout for Podman, I learn, is 10 seconds; after that it starts killing processes.)
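
To illustrate why a leftover process makes the pid file unacquirable, here is a minimal sketch using flock(1). It is an analogy only: libvirt's own locking call may differ from flock(1), and the temporary file stands in for /run/libvirtd.pid. A child process inherits the locked file descriptor and keeps the lock alive after its parent lets go, so a second non-blocking attempt fails until the child is killed.

```shell
#!/usr/bin/env bash
# Illustration only: flock(1) stands in for the lock libvirtd takes on its pid file.
pidfile="$(mktemp)"

# Take an exclusive lock on fd 9, then start a child that inherits that fd.
exec 9>"$pidfile"
flock --exclusive --nonblock 9
sleep 30 &
holder=$!

# Close our own copy. The child still has the file open, so the lock stays
# held, much like a libvirtd process that outlives its container.
exec 9>&-

try_lock() {
    # A fresh open of the file makes a new lock attempt; --nonblock fails fast.
    if flock --exclusive --nonblock "$pidfile" true; then
        echo "acquired"
    else
        echo "busy"
    fi
}

while_held="$(try_lock)"            # leftover holder is still alive
kill "$holder"
wait "$holder" 2>/dev/null || true
after_kill="$(try_lock)"            # holder is gone, lock is released

echo "with leftover process: $while_held"
echo "after killing it:      $after_kill"
rm -f -- "$pidfile"
```

The first attempt should report "busy" and the second "acquired", which mirrors the observation in this bug that the container starts again once the old libvirtd is gone.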


Workaround [to be addressed in TripleO]
---------------------------------------

My colleague Lee, from the Nova team, suggests that we could potentially modify the 'run' script (the nova-libvirt.json.j2 file from 'kolla-ansible' [1]) of the 'nova_libvirt' container to always remove the /run/libvirtd.pid file.


[1] https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/nova-cell/templates/nova-libvirt.json.j2
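
As a rough sketch of what such a pre-start cleanup could look like, the helper below is a more cautious variant of "always remove": it deletes the pid file only when the PID recorded in it is no longer alive. The function name `remove_stale_pidfile` is hypothetical and this is not the actual kolla-ansible or TripleO code. It also assumes it runs as root, because `kill -0` reports failure for processes owned by another user.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of kolla-ansible: remove a pid file only if
# the process it names is gone.
remove_stale_pidfile() {
    local pidfile="$1" pid=""
    [ -e "$pidfile" ] || return 0

    # read trims surrounding whitespace; tolerate a missing trailing newline.
    read -r pid < "$pidfile" || true

    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "pid $pid from $pidfile is still alive; leaving the file in place" >&2
        return 1
    fi
    rm -f -- "$pidfile"
}

# Example call (commented out so the sketch has no side effects):
# remove_stale_pidfile /run/libvirtd.pid
```

Note that removing the file does nothing about a libvirtd process that is still running, so this kind of cleanup only helps when the file is genuinely stale.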


A few related notes
-------------------

- The 'nova_libvirt' container does not run 'systemd', so we cannot use `systemctl stop libvirtd` to gracefully shut down the container.
- The 'nova_libvirt' container is a privileged container, launched with `--pid=host`.
- In containers without `--pid=host`, the Linux kernel handles clean shutdown for us: the entire PID namespace is torn down when PID 1 in the container dies.
- In containers with `--pid=host`, Podman uses cgroups to identify and stop the other processes, which is racier.
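
A small sketch for checking the `--pid=host` point on a live system: two processes are in the same PID namespace when their /proc/<pid>/ns/pid links match. The function names `pidns_of` and `shares_pidns` are made up for this example, and reading the namespace link of another user's process generally requires root.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare the PID namespaces of two processes.
pidns_of() {
    # Prints something like "pid:[4026531836]" for the given PID.
    readlink "/proc/$1/ns/pid"
}

shares_pidns() {
    local a b
    a="$(pidns_of "$1" 2>/dev/null)" || return 2
    b="$(pidns_of "$2" 2>/dev/null)" || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (as root on the compute host, with a single libvirtd PID):
# shares_pidns 1 "$(pidof libvirtd)" && echo "libvirtd is in the host PID namespace"
```

If libvirtd shares the namespace of the host's PID 1, the kernel will not tear it down when the container's main process exits, which is the situation described in the notes above.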


[...]

Comment 4 Kashyap Chamarthy 2020-08-21 14:24:27 UTC
However, before going ahead with the above workaround for TripleO, we should first understand what exactly is holding the /run/libvirtd.pid file.

Comment 5 Kashyap Chamarthy 2020-08-25 13:12:02 UTC
(In reply to Kashyap Chamarthy from comment #3)
> (In reply to Andrew Mercer from comment #0)

[...]

> >  error : virPidFileAcquirePath:369 : Failed to acquire pid file
> > '/run/libvirtd.pid': Resource temporarily unavailable
> 
> The above "failed to acquire pid file" means that there is another
> 'libvirtd' processes already running in the 'nova_libvirt' container.
> 
> One way to check for that is by running this command (and get the output) in
> the 'nova_libvirt' container: `lsof | grep /run/libvirtd.pid`.

@Andrew: To be clear, the NEEDINFO was to get the above `lsof` output (from the container) when you hit the error, to see exactly what is holding the PID file.  With that, we can further investigate why exactly Podman isn't killing the process on restart.
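
If `lsof` is not available in the container image, a rough equivalent can be sketched by walking /proc. The function name `holders_of` is hypothetical; it prints the PIDs of processes that have the given file open, and it needs root to see file descriptors of other users' processes.

```shell
#!/usr/bin/env bash
# Hypothetical lsof substitute: list PIDs that have a given file open.
holders_of() {
    local target link fd pid
    target="$(readlink -f -- "$1")" || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Processes may exit while we scan; skip entries that vanish.
        link="$(readlink -- "$fd" 2>/dev/null)" || continue
        if [ "$link" = "$target" ]; then
            pid="${fd#/proc/}"          # "<pid>/fd/<n>"
            printf '%s\n' "${pid%%/*}"  # "<pid>"
        fi
    done | sort -un
}

# Example call:
# holders_of /run/libvirtd.pid
```

Because the container runs with `--pid=host`, the same check can be made from the host, provided the path given is the one the host sees for the pid file.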


> That said, Podman seems to be killing the libvirtd process before it is
> gracefully shutdown.  (The default timeout for Podman, I learn, is 10
> seconds, after that it starts killing processes.)
> 

[...]

Comment 20 Kashyap Chamarthy 2021-06-16 14:44:54 UTC
I'm closing this based on the rationale that using `systemctl` to restart the "tripleo_nova_libvirt" service should solve the issue:

    $ systemctl restart tripleo_nova_libvirt.service

Comment 21 ldenny 2021-11-18 00:30:06 UTC
Hi Team,

We had the same issue after the customer restarted the nova_libvirt container with podman rather than systemd. The following commands sorted it out without the need for a restart:

```
[heat-admin@compute19 ~]$ sudo systemctl stop tripleo_nova_libvirt
[heat-admin@compute19 ~]$ pidof libvirtd
862190
[heat-admin@compute19 ~]$ sudo pkill libvirtd
[heat-admin@compute19 ~]$ sudo systemctl start tripleo_nova_libvirt
[heat-admin@compute19 ~]$ sudo systemctl status tripleo_nova_libvirt
● tripleo_nova_libvirt.service - nova_libvirt container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_libvirt.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-11-18 00:21:11 UTC; 9s ago
```

Just leaving this here for the next time we hit it.
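
For convenience, the same sequence can be wrapped in a small script. This is only a sketch: the `run` helper, the `DRY_RUN` variable, and the function name `recover_nova_libvirt` are invented for this example, and it is meant to be run as root on the affected compute. With `DRY_RUN=1` it prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual recovery steps from this comment.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

recover_nova_libvirt() {
    run systemctl stop tripleo_nova_libvirt
    # pkill exits non-zero when nothing matched; that is fine here.
    run pkill libvirtd || true
    run systemctl start tripleo_nova_libvirt
    run systemctl status tripleo_nova_libvirt
}

# Preview the steps without touching the system:
# DRY_RUN=1 recover_nova_libvirt
```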

Comment 22 Lucio Seki 2021-11-19 13:13:13 UTC
I'm the customer who restarted the nova_libvirt container :-)

It happened again, and I just realized that the following is sufficient to fix the issue:

```
[heat-admin@compute19 ~]$ sudo pkill libvirtd

```

... without stopping & starting the tripleo_nova_libvirt service.

systemd will take care of everything and spawn a new libvirtd process.

Comment 23 Michal Privoznik 2021-11-22 13:52:41 UTC
(In reply to ldenny from comment #21)
> Hi Team,
> 
> We had the same issue after the customer restarted the nova_libvirt
> container with podman rather then systemd, the following commands sorted it
> out without the need for a restart:
> 
> ```
> [heat-admin@compute19 ~]$ sudo systemctl stop tripleo_nova_libvirt
> [heat-admin@compute19 ~]$ pidof libvirtd
> 862190
> [heat-admin@compute19 ~]$ sudo pkill libvirtd
> [heat-admin@compute19 ~]$ sudo systemctl start tripleo_nova_libvirt
> [heat-admin@compute19 ~]$ sudo systemctl status tripleo_nova_libvirt
> ● tripleo_nova_libvirt.service - nova_libvirt container
>    Loaded: loaded (/etc/systemd/system/tripleo_nova_libvirt.service;
> enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2021-11-18 00:21:11 UTC; 9s ago
> ```
> 
> Just leaving this here for the next time we hit it.

I have no idea what the tripleo_nova_libvirt service does, but if it requires restarting the libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.

Comment 24 Lucio Seki 2021-11-22 14:41:17 UTC
> I have no idea what tripleo_nova_libvirt service does, but if it requires restarting libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.

FWIW this is the tripleo_nova_libvirt unit file that RHOSP installation creates for me:
```
$ cat /etc/systemd/system/tripleo_nova_libvirt.service
[Unit]
Description=nova_libvirt container
After=paunch-container-shutdown.service
Wants=tripleo_nova_virtlogd.service
[Service]
Restart=always
ExecStart=/usr/libexec/paunch-start-podman-container nova_libvirt
ExecReload=/usr/bin/podman kill --signal HUP nova_libvirt
ExecStop=/usr/bin/podman stop -t 10 nova_libvirt
ExecStopPost=/usr/bin/podman stop -t 10 nova_libvirt
SuccessExitStatus=137 142 143
KillMode=none
Type=forking
PIDFile=/var/run/nova_libvirt.pid

[Install]
WantedBy=multi-user.target
```

Maybe the /var/run/nova_libvirt.pid file becomes inconsistent when I manually restart the nova_libvirt container? I'm not sure how this could be improved.

Comment 25 David Vallee Delisle 2022-01-25 16:53:29 UTC
(In reply to Lucio Seki from comment #24)
> > I have no idea what tripleo_nova_libvirt service does, but if it requires restarting libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.
> 
> FWIW this is the tripleo_nova_libvirt unit file that RHOSP installation
> creates for me:
> ```
> $ cat /etc/systemd/system/tripleo_nova_libvirt.service
> [Unit]
> Description=nova_libvirt container
> After=paunch-container-shutdown.service
> Wants=tripleo_nova_virtlogd.service
> [Service]
> Restart=always
> ExecStart=/usr/libexec/paunch-start-podman-container nova_libvirt
> ExecReload=/usr/bin/podman kill --signal HUP nova_libvirt
> ExecStop=/usr/bin/podman stop -t 10 nova_libvirt
> ExecStopPost=/usr/bin/podman stop -t 10 nova_libvirt
> SuccessExitStatus=137 142 143
> KillMode=none
> Type=forking
> PIDFile=/var/run/nova_libvirt.pid
> 
> [Install]
> WantedBy=multi-user.target
> ```
> 
> Maybe /var/run/nova_libvirt.pid file gets inconsistent when I manually
> restart the nova_libvirtd container? Not sure how this could be improved.

As mentioned earlier, the only supported way of restarting any kind of RHOSP containerized services is by using systemd via systemctl start/stop/restart tripleo_container_name.service. Restarting containers with podman is not supported and can unfortunately lead to issues like this.