Bug 1868990
| Summary: | Restarting nova_libvirt container does not clean up pid files | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Andrew Mercer <amercer> |
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> |
| Status: | CLOSED NOTABUG | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.1 (Train) | CC: | anbs, bdobreli, dasmith, dvd, eglynn, jhakimra, jhardee, kchamart, ldenny, lseki, lyarwood, mprivozn, mschuppe, rosingh, satmakur, sbauza, sgordon, vromanso |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-06-16 14:44:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Andrew Mercer
2020-08-14 22:53:46 UTC
(In reply to Andrew Mercer from comment #0)
> Description of problem:
>
> A customer restarted the nova_libvirt container on all computes after which
> the nova_libvirt container would not start. A reboot of a compute would
> resolve this and nova_libvirt would start successfully. On computes not
> rebooted, the following error was noticed in the logs:
>
> error : virPidFileAcquirePath:369 : Failed to acquire pid file
> '/run/libvirtd.pid': Resource temporarily unavailable

The above "failed to acquire pid file" error means that another 'libvirtd' process is already running in the 'nova_libvirt' container.

One way to check for that is to run this command in the 'nova_libvirt' container and capture its output: `lsof | grep /run/libvirtd.pid`.

That said, Podman seems to be killing the libvirtd process before it has shut down gracefully. (I learned that Podman's default stop timeout is 10 seconds; after that it starts killing processes.)

Workaround [to be addressed in TripleO]
---------------------------------------

My colleague Lee, from the Nova team, suggests that we could potentially modify the 'run' script (the nova-libvirt.json.j2 file from 'kolla-ansible'[1]) of the 'nova_libvirt' container to always remove /run/libvirtd.pid.

[1] https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/nova-cell/templates/nova-libvirt.json.j2

A few related notes
-------------------

- The 'nova_libvirt' container does not run 'systemd', so we cannot use `systemctl stop libvirtd` to gracefully shut down libvirtd inside it.
- The 'nova_libvirt' container is a privileged container, launched with `--pid=host`.
- In containers without `--pid=host`, the Linux kernel handles clean shutdown for us: the entire PID namespace is torn down when PID 1 in the container dies.
- In containers with `--pid=host`, Podman uses cgroups to identify and stop the other processes, which is racier.

[...]
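To help tell a leftover PID file from a live process, a minimal shell sketch along the following lines might be useful. It assumes a Linux /proc and a PID file that contains a plain numeric PID; `pidfile_status` is a hypothetical helper name, not something shipped by libvirt, Podman or TripleO. It only inspects the PID recorded in the file. It does not show which process holds the lock on the file, which is what the `lsof` output would tell us.

```shell
# pidfile_status FILE
# Hypothetical helper: report whether the PID recorded in FILE refers to a
# process that currently exists. Prints one of:
#   missing | invalid | alive <pid> | stale <pid>
pidfile_status() {
    ps_file=$1

    # No readable file at all: nothing is recorded.
    if [ ! -r "$ps_file" ]; then
        echo "missing"
        return 0
    fi

    # Strip whitespace and make sure what is left is a number.
    ps_pid=$(tr -d '[:space:]' < "$ps_file")
    case $ps_pid in
        ''|*[!0-9]*)
            echo "invalid"
            return 0
            ;;
    esac

    # Assumption: Linux, where every existing process has a /proc/<pid> entry.
    if [ -d "/proc/$ps_pid" ]; then
        echo "alive $ps_pid"
    else
        echo "stale $ps_pid"
    fi
}
```

For example, `pidfile_status /run/libvirtd.pid` could be run wherever that path is visible. Because the container is launched with `--pid=host`, the recorded PID should be meaningful on the host as well.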
Although, before going ahead with the above workaround for TripleO, we should first understand what exactly is holding the /run/libvirtd.pid file.

(In reply to Kashyap Chamarthy from comment #3)
> (In reply to Andrew Mercer from comment #0)
[...]
> > error : virPidFileAcquirePath:369 : Failed to acquire pid file
> > '/run/libvirtd.pid': Resource temporarily unavailable
>
> The above "failed to acquire pid file" means that there is another
> 'libvirtd' process already running in the 'nova_libvirt' container.
>
> One way to check for that is by running this command (and get the output) in
> the 'nova_libvirt' container: `lsof | grep /run/libvirtd.pid`.

@Andrew: To be clear, the NEEDINFO was to get the above `lsof` output (from the container) when you hit the error, to see exactly what is holding the PID file, so that we can further investigate why exactly Podman isn't killing the process on restart.

> That said, Podman seems to be killing the libvirtd process before it is
> gracefully shutdown. (The default timeout for Podman, I learn, is 10
> seconds, after that it starts killing processes.)
>
[...]

I'm closing this based on the rationale that using `systemctl` to restart the "tripleo_nova_libvirt" service should solve the issue:
$ systemctl restart tripleo_nova_libvirt.service
Hi Team,

We had the same issue after the customer restarted the nova_libvirt container with podman rather than systemd. The following commands sorted it out without the need for a reboot:

```
[heat-admin@compute19 ~]$ sudo systemctl stop tripleo_nova_libvirt
[heat-admin@compute19 ~]$ pidof libvirtd
862190
[heat-admin@compute19 ~]$ sudo pkill libvirtd
[heat-admin@compute19 ~]$ sudo systemctl start tripleo_nova_libvirt
[heat-admin@compute19 ~]$ sudo systemctl status tripleo_nova_libvirt
● tripleo_nova_libvirt.service - nova_libvirt container
   Loaded: loaded (/etc/systemd/system/tripleo_nova_libvirt.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-11-18 00:21:11 UTC; 9s ago
```

Just leaving this here for the next time we hit it.

I'm the customer who restarted the nova_libvirt container :-)

It happened again, and I just realized that the following is sufficient to fix the issue:

```
[heat-admin@compute19 ~]$ sudo pkill libvirtd
```

... without stopping and starting the tripleo_nova_libvirt service. systemd will take care of everything and spawn a new libvirtd process.

(In reply to ldenny from comment #21)
> Hi Team,
>
> We had the same issue after the customer restarted the nova_libvirt
> container with podman rather than systemd, the following commands sorted it
> out without the need for a restart:
>
> ```
> [heat-admin@compute19 ~]$ sudo systemctl stop tripleo_nova_libvirt
> [heat-admin@compute19 ~]$ pidof libvirtd
> 862190
> [heat-admin@compute19 ~]$ sudo pkill libvirtd
> [heat-admin@compute19 ~]$ sudo systemctl start tripleo_nova_libvirt
> [heat-admin@compute19 ~]$ sudo systemctl status tripleo_nova_libvirt
> ● tripleo_nova_libvirt.service - nova_libvirt container
>    Loaded: loaded (/etc/systemd/system/tripleo_nova_libvirt.service;
> enabled; vendor preset: disabled)
>    Active: active (running) since Thu 2021-11-18 00:21:11 UTC; 9s ago
> ```
>
> Just leaving this here for the next time we hit it.
I have no idea what the tripleo_nova_libvirt service does, but if it requires restarting the libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.

> I have no idea what tripleo_nova_libvirt service does, but if it requires restarting libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.
FWIW this is the tripleo_nova_libvirt unit file that RHOSP installation creates for me:
```
$ cat /etc/systemd/system/tripleo_nova_libvirt.service
[Unit]
Description=nova_libvirt container
After=paunch-container-shutdown.service
Wants=tripleo_nova_virtlogd.service
[Service]
Restart=always
ExecStart=/usr/libexec/paunch-start-podman-container nova_libvirt
ExecReload=/usr/bin/podman kill --signal HUP nova_libvirt
ExecStop=/usr/bin/podman stop -t 10 nova_libvirt
ExecStopPost=/usr/bin/podman stop -t 10 nova_libvirt
SuccessExitStatus=137 142 143
KillMode=none
Type=forking
PIDFile=/var/run/nova_libvirt.pid
[Install]
WantedBy=multi-user.target
```
Maybe the /var/run/nova_libvirt.pid file becomes inconsistent when I manually restart the nova_libvirt container? Not sure how this could be improved.
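As a side note on the `SuccessExitStatus=137 142 143` line in the unit above: it tells systemd to treat those exit statuses as successful. By the usual shell convention, an exit status above 128 means "terminated by signal (status - 128)". A small sketch to decode them, where `describe_exit_status` is a hypothetical helper name and the output assumes a shell whose `kill -l N` prints the bare signal name (as bash and dash do):

```shell
# describe_exit_status STATUS
# Hypothetical helper: translate a numeric exit status into a short
# description, using the 128 + signal-number convention.
describe_exit_status() {
    des_status=$1
    if [ "$des_status" -gt 128 ]; then
        # Statuses above 128 conventionally mean "killed by signal N".
        des_signum=$((des_status - 128))
        echo "killed by SIG$(kill -l "$des_signum")"
    else
        echo "exited with status $des_status"
    fi
}
```

Calling it with 137 and 143 should report SIGKILL and SIGTERM respectively, which lines up with `podman stop -t 10` sending SIGTERM first and SIGKILL once the 10 second timeout expires.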
(In reply to Lucio Seki from comment #24)
> > I have no idea what tripleo_nova_libvirt service does, but if it requires restarting libvirtd service then I'd say the tripleo_nova_libvirt unit file needs to reflect that.
>
> FWIW this is the tripleo_nova_libvirt unit file that RHOSP installation
> creates for me:
> ```
> $ cat /etc/systemd/system/tripleo_nova_libvirt.service
> [Unit]
> Description=nova_libvirt container
> After=paunch-container-shutdown.service
> Wants=tripleo_nova_virtlogd.service
> [Service]
> Restart=always
> ExecStart=/usr/libexec/paunch-start-podman-container nova_libvirt
> ExecReload=/usr/bin/podman kill --signal HUP nova_libvirt
> ExecStop=/usr/bin/podman stop -t 10 nova_libvirt
> ExecStopPost=/usr/bin/podman stop -t 10 nova_libvirt
> SuccessExitStatus=137 142 143
> KillMode=none
> Type=forking
> PIDFile=/var/run/nova_libvirt.pid
>
> [Install]
> WantedBy=multi-user.target
> ```
>
> Maybe /var/run/nova_libvirt.pid file gets inconsistent when I manually
> restart the nova_libvirtd container? Not sure how this could be improved.

As mentioned earlier, the only supported way of restarting any RHOSP containerized service is through systemd, via systemctl start/stop/restart tripleo_container_name.service. Restarting containers with podman is not supported and can unfortunately lead to issues like this.