Bug 1797889 - compute node reboot takes too long because nova_libvirt times out during OSP16 update
Summary: compute node reboot takes too long because nova_libvirt times out during OSP16 update
Keywords:
Status: CLOSED DUPLICATE of bug 1802573
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Piotr Kopec
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-04 07:19 UTC by Eduardo Olivares
Modified: 2023-03-21 19:27 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-25 14:00:14 UTC
Target Upstream Version:
Embargoed:


Attachments
ir log (reboot fails) (1.21 MB, application/x-shellscript), 2020-02-04 07:19 UTC, Eduardo Olivares
compute-1 artifacts (8.14 MB, application/gzip), 2020-02-04 07:20 UTC, Eduardo Olivares

Description Eduardo Olivares 2020-02-04 07:19:24 UTC
Created attachment 1657482 [details]
ir log (reboot fails)

Description of problem:
The following job fails while rebooting overcloud nodes:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54/


It fails because the compute node reboot takes much longer than it did before. It is related to the nova_libvirt container.

PS: Comparing logs from executions 50 and 54 from our job, this is the main difference I'm seeing: at #50, nova_libvirt stopped after 19 seconds; at #54, it timed out after more than 3 minutes.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/50
(find logs at panther23.lab.eng.tlv2.redhat.com:/tmp/bug1790467/exec50/compute-1/var/log/messages, reboot starts after 19 seconds)
Jan 29 22:12:06 compute-1 ansible-command[176257]: Invoked with _uses_shell=True _raw_params=sleep 5 && shutdown -r now warn=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jan 29 22:12:11 compute-1 systemd[1]: Stopped libcontainer container f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0.
Jan 29 22:12:11 compute-1 podman[176269]: 2020-01-29 22:12:11.675020934 +0000 UTC m=+0.430666358 container died f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 podman[176269]: 2020-01-29 22:12:11.73445307 +0000 UTC m=+0.490098604 container stop f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 systemd[1]: Unmounted /var/lib/containers/storage/overlay-containers/f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0/userdata/shm.
Jan 29 22:12:11 compute-1 podman[176538]: 2020-01-29 22:12:11.884489319 +0000 UTC m=+0.324819049 container cleanup f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 systemd[1]: Stopped libpod-conmon-f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0.scope.
Jan 29 22:12:25 compute-1 systemd[1]: Starting Reboot...
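
For reference, these lines can be pulled from the compute node's journal with a filter along these lines (the container ID prefix comes from the excerpt above; the pattern is only illustrative):

    sudo grep -E 'nova_libvirt|libpod-f5da881' /var/log/messages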




https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54
(find logs at panther23.lab.eng.tlv2.redhat.com:/tmp/bug1790467/exec54/compute-1/var/log/messages, reboot starts after more than three minutes)
Feb  3 16:37:13 compute-1 ansible-command[161557]: Invoked with _uses_shell=True _raw_params=sleep 5 && shutdown -r now warn=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb  3 16:37:28 compute-1 podman[161581]: 2020-02-03 16:37:28.898956956 +0000 UTC m=+10.382196818 container died f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1, name=nova_libvirt)
Feb  3 16:37:28 compute-1 podman[161581]: 2020-02-03 16:37:28.900302283 +0000 UTC m=+10.383542084 container stop f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1, name=nova_libvirt)
Feb  3 16:37:28 compute-1 podman[161581]: f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db
Feb  3 16:37:28 compute-1 systemd[1]: Unmounted /var/lib/containers/storage/overlay-containers/f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db/userdata/shm.
Feb  3 16:37:29 compute-1 systemd[1]: Stopping libcontainer container f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.
Feb  3 16:37:29 compute-1 systemd[1]: Stopped libpod-conmon-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope.
Feb  3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Stopping timed out. Killing.
Feb  3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Killing process 25093 (libvirtd) with signal SIGKILL.
Feb  3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Killing process 161823 (libvirtd) with signal SIGKILL.
Feb  3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Still around after SIGKILL. Ignoring.
Feb  3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Failed with result 'timeout'.
Feb  3 16:40:29 compute-1 systemd[1]: Stopped libcontainer container f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.
Feb  3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Consumed 32.723s CPU time
Feb  3 16:40:29 compute-1 systemd[1]: Starting Reboot...
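
For anyone triaging this locally, the stop timeouts involved can be inspected on the compute node. These commands are a debugging suggestion, not something the job runs; the scope unit name is taken from the log above, and StopTimeout is standard podman container metadata:

    # systemd stop timeout on the container's scope (unit name from the log above)
    systemctl show -p TimeoutStopSec libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope
    # podman's own stop timeout (seconds before it escalates to SIGKILL)
    sudo podman inspect nova_libvirt --format '{{.Config.StopTimeout}}'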





Version-Release number of selected component (if applicable):
update from RHOS_TRUNK-16.0-RHEL-8-20200130.n.0
to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0

(also from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200130.n.0)



How reproducible:
100%

Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/ (alternatively, a rough manual check on the compute node is sketched below)
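
A rough manual check, timing how long the container takes to stop (a sketch only, not part of the CI job; it assumes SSH access to compute-1 after the update):

    # A healthy run returns within seconds; the bad runs hang until systemd's
    # timeout/SIGKILL path kicks in (see the journal excerpts above).
    time sudo podman stop nova_libvirt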

Actual results:
The job fails because the compute-1 reboot takes too long.


Expected results:
compute-1 should reboot quickly; in earlier executions nova_libvirt stopped within about 19 seconds.


Additional info:

Comment 1 Eduardo Olivares 2020-02-04 07:20:49 UTC
Created attachment 1657483 [details]
compute-1 artifacts

Comment 2 Eduardo Olivares 2020-02-04 10:49:35 UTC
This problem was reproduced twice with our update job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/52/ - update from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200130.n.0
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54/ - update from RHOS_TRUNK-16.0-RHEL-8-20200130.n.0 to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0

But in the latest execution, the overcloud reboot worked correctly:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/56/ - update from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0

So, with this data, what we actually have is "How reproducible: 66.6%".

Comment 6 Sofer Athlan-Guyot 2020-02-25 14:00:14 UTC
Hi Piotr,

so the command "podman images" only reflects what has been pulled onto
the local node, while "tripleo-container-image prepare" updates the
repository.

We currently have no user interface to see what's in the local repo,
other than looking directly into /var/lib/image-serve.
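
For instance, something along these lines gives a rough view of what the local registry holds (the layout under /var/lib/image-serve is an implementation detail, so treat this as a sketch):

    # List the image repositories served by the undercloud's local registry
    sudo find /var/lib/image-serve/v2 -mindepth 2 -maxdepth 2 -type d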

The real issue shows up in
undercloud-0/home/stack/overcloud_update_run_Controller.log:


    2020-02-20 22:05:46 |         "$ podman stop nova_scheduler",
    2020-02-20 22:05:46 |         "b'fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906\\n'",
    2020-02-20 22:05:46 |         "$ podman rm nova_scheduler",
    2020-02-20 22:05:46 |         "b'Error: cannot remove container fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906 as it has active exec sessions: container state improper\\n'",
    2020-02-20 22:05:46 |         "Error removing container: nova_scheduler",
    2020-02-20 22:05:46 |         "Error: cannot remove container fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906 as it has active exec sessions: container state improper",

and later:

    2020-02-20 21:40:05 |         "Running container: nova_scheduler",
    2020-02-20 21:40:05 |         "Skipping existing container: nova_scheduler",

so nova_scheduler is silently not updated.

We have two issues here:
 1. paunch is unable to remove the container because it has some process still running
 2. paunch doesn't raise an error when this happens.

Issue 1 is solved in bz#1802573. As for issue 2, since paunch is going
to be replaced soon by an Ansible module, I am not sure whether it will
be solved.
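
In the meantime, a manual way to spot and clear a container stuck in that state (names taken from the log above; the force flag tears down any leftover exec sessions):

    # Check for leftover exec sessions that block a plain "podman rm"
    sudo podman inspect nova_scheduler --format '{{.ExecIDs}}'
    # Force removal as a manual workaround
    sudo podman rm -f nova_scheduler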

Closing this one as a duplicate, to keep everything nice and tidy.

Thanks.

*** This bug has been marked as a duplicate of bug 1802573 ***

