Created attachment 1657482 [details]
ir log (reboot fails)

Description of problem:

The following job fails while rebooting overcloud nodes:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54/

It fails because the compute node reboot takes much longer than it did before. It is related to the nova_libvirt container.

PS: Comparing logs from executions 50 and 54 of our job, this is the main difference I'm seeing: at #50, nova_libvirt stopped after 19 seconds; at #54, it timed out after more than 3 minutes.

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/50
(find logs at panther23.lab.eng.tlv2.redhat.com:/tmp/bug1790467/exec50/compute-1/var/log/messages; reboot starts after 19 seconds)

Jan 29 22:12:06 compute-1 ansible-command[176257]: Invoked with _uses_shell=True _raw_params=sleep 5 && shutdown -r now warn=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Jan 29 22:12:11 compute-1 systemd[1]: Stopped libcontainer container f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0.
Jan 29 22:12:11 compute-1 podman[176269]: 2020-01-29 22:12:11.675020934 +0000 UTC m=+0.430666358 container died f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 podman[176269]: 2020-01-29 22:12:11.73445307 +0000 UTC m=+0.490098604 container stop f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 systemd[1]: Unmounted /var/lib/containers/storage/overlay-containers/f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0/userdata/shm.
Jan 29 22:12:11 compute-1 podman[176538]: 2020-01-29 22:12:11.884489319 +0000 UTC m=+0.324819049 container cleanup f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200124.1, name=nova_libvirt)
Jan 29 22:12:11 compute-1 systemd[1]: Stopped libpod-conmon-f5da881cfda858a45729bb3594793525baeb274341cc7a26e0390c537fcdd2f0.scope.
Jan 29 22:12:25 compute-1 systemd[1]: Starting Reboot...
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54
(find logs at panther23.lab.eng.tlv2.redhat.com:/tmp/bug1790467/exec54/compute-1/var/log/messages; reboot starts after more than three minutes)

Feb 3 16:37:13 compute-1 ansible-command[161557]: Invoked with _uses_shell=True _raw_params=sleep 5 && shutdown -r now warn=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Feb 3 16:37:28 compute-1 podman[161581]: 2020-02-03 16:37:28.898956956 +0000 UTC m=+10.382196818 container died f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1, name=nova_libvirt)
Feb 3 16:37:28 compute-1 podman[161581]: 2020-02-03 16:37:28.900302283 +0000 UTC m=+10.383542084 container stop f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-nova-libvirt:20200130.1, name=nova_libvirt)
Feb 3 16:37:28 compute-1 podman[161581]: f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db
Feb 3 16:37:28 compute-1 systemd[1]: Unmounted /var/lib/containers/storage/overlay-containers/f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db/userdata/shm.
Feb 3 16:37:29 compute-1 systemd[1]: Stopping libcontainer container f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.
Feb 3 16:37:29 compute-1 systemd[1]: Stopped libpod-conmon-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope.
Feb 3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Stopping timed out. Killing.
Feb 3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Killing process 25093 (libvirtd) with signal SIGKILL.
Feb 3 16:38:59 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Killing process 161823 (libvirtd) with signal SIGKILL.
Feb 3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Still around after SIGKILL. Ignoring.
Feb 3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Failed with result 'timeout'.
Feb 3 16:40:29 compute-1 systemd[1]: Stopped libcontainer container f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.
Feb 3 16:40:29 compute-1 systemd[1]: libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope: Consumed 32.723s CPU time
Feb 3 16:40:29 compute-1 systemd[1]: Starting Reboot...

Version-Release number of selected component (if applicable):
update from RHOS_TRUNK-16.0-RHEL-8-20200130.n.0 to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0
(also from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200130.n.0)

How reproducible:
100%

Steps to Reproduce:
1. Run CI job https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/
2.
3.

Actual results:
job fails because compute-1 reboot takes too long

Expected results:
compute-1 reboot should be faster

Additional info:
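As a rough diagnostic sketch (generic systemd/podman/procps commands, not part of the CI job; the scope name and PIDs are taken from the #54 log above), one way to see what keeps the nova_libvirt scope alive on compute-1 is:

# Show which processes are still inside the stuck libpod scope
systemd-cgls --unit libpod-f04bfd1bc9aa9a27382410c7709b61b565750775d709292d2f79bc47e139e9db.scope

# Check the state of the leftover libvirtd processes; a 'D' (uninterruptible sleep)
# state would explain why they survive SIGKILL
ps -o pid,stat,wchan,cmd -p 25093,161823

# Time a manual stop of the container with an explicit grace period, to compare
# against the ~19 seconds seen in execution #50
time podman stop --time 10 nova_libvirt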
Created attachment 1657483 [details] compute-1 artifacts
This problem was reproduced twice with our update job:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/52/ - update from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200130.n.0

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/54/ - update from RHOS_TRUNK-16.0-RHEL-8-20200130.n.0 to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0

But in the latest execution, the overcloud reboot worked correctly:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/56/ - update from RHOS_TRUNK-16.0-RHEL-8-20200122.n.1 to RHOS_TRUNK-16.0-RHEL-8-20200131.n.0

So, with this data, what we have is "How reproducible: 66.6%".
Hi Piotr,

The command "podman images" only reflects what has been pulled on the local node, while tripleo-container-image prepare updates the repository. We currently have no user interface to see what's in the local repo, besides looking directly into /var/lib/image-serve.

The real issue happens in undercloud-0/home/stack/overcloud_update_run_Controller.log:

2020-02-20 22:05:46 | "$ podman stop nova_scheduler",
2020-02-20 22:05:46 | "b'fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906\\n'",
2020-02-20 22:05:46 | "$ podman rm nova_scheduler",
2020-02-20 22:05:46 | "b'Error: cannot remove container fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906 as it has active exec sessions: container state improper\\n'",
2020-02-20 22:05:46 | "Error removing container: nova_scheduler",
2020-02-20 22:05:46 | "Error: cannot remove container fd2cf249eb4484ea424b3686df97e4b2603629aa039a3e5125fbe45036e83906 as it has active exec sessions: container state improper",

and later:

2020-02-20 21:40:05 | "Running container: nova_scheduler",
2020-02-20 21:40:05 | "Skipping existing container: nova_scheduler",

so nova_scheduler is silently not updated. We have two issues here:
1. paunch is unable to remove the container because it has some process still running
2. paunch doesn't raise an error when this happens

1. is solved in bz#1802573; as for 2., since paunch is going to be replaced soon with an ansible module, I'm not sure whether it will be solved.

Closing this one as a duplicate, to keep everything nice and tidy. Thanks.

*** This bug has been marked as a duplicate of bug 1802573 ***
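Follow-up note for anyone hitting the same symptom (a minimal sketch using generic podman commands, not the exact update workflow; the nova_scheduler container name is taken from the log above, the grep pattern is an assumption):

# Check which image the existing container was created from, to confirm whether
# it was actually refreshed or silently skipped
podman inspect --format '{{.ImageName}}' nova_scheduler

# List the nova-scheduler images known to the local node, for comparison
podman images --format '{{.Repository}}:{{.Tag}}' | grep nova-scheduler

# Force-remove a container that plain 'podman rm' refuses to delete because of
# active exec sessions, so the next deploy run can recreate it
# (use with care on a live overcloud node)
podman rm --force nova_scheduler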