I'm able to reproduce this issue easily:

root      82445  0.0  0.0 339200  1880 ?  Sl  13:26  0:00  \_ /usr/bin/docker-containerd-shim-current f8bd73468edc5331b94dcfcb67a14803fff815e56dd761147be08376206e143e /var/run/docker/libcontainerd/f
42435     82460  0.2  2.6 282076 76408 ?  Ss  13:26  0:10  |   \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/n
42435     82685  0.0  2.3 278352 69868 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
42435     82686  0.0  2.3 278352 69868 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
root      82703  0.0  1.1 185684 32452 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/networking-ovn/n
42435     99990  0.0  0.0      0     0 ?  Zs  14:27  0:00  |       \_ [haproxy] <defunct>

Steps:
1) Deploy the overcloud with OVN+DVR+HA
2) Create a VM on computeA
3) Migrate it away

Result: haproxy is now a zombie (the [haproxy] <defunct> entry above).
This appears to happen only when the last VM is migrated off the host: if I create two VMs and live migrate one of them, haproxy keeps running, but as soon as I migrate the remaining VM off that host, haproxy becomes a zombie.
If you try to create a new VM after that, a new haproxy service is spawned, but we're still stuck with the zombie haproxy until we restart the docker container.
2018-10-03 14:52:57.524 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b bound to our chassis
2018-10-03 14:53:23.810 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 2.6696141
2018-10-03 14:53:23.856 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 297 time: 0.0136170
2018-10-03 14:53:23.931 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 0.0357289
2018-10-03 14:53:23.968 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 0.0220120
2018-10-03 14:53:24.004 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 143 time: 0.0178549
2018-10-03 14:53:24.044 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 0.0263171
2018-10-03 14:53:24.090 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 135 time: 0.0326929
2018-10-03 14:53:24.124 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200 len: 144 time: 0.0196948
2018-10-03 14:53:24.159 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 144 time: 0.0164320
2018-10-03 14:53:24.194 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 0.0225489
2018-10-03 14:53:24.253 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 0.0239232
2018-10-03 14:53:24.285 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 0.0190861
2018-10-03 14:53:24.327 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 0.0323339
2018-10-03 14:53:24.372 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200 len: 144 time: 0.0328629
2018-10-03 14:53:24.421 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.0364070
2018-10-03 14:55:03.852 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b unbound from our chassis
2018-10-03 14:55:03.892 1 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-76a5cdba-0212-492c-99cd-ba77d726a14b namespace which is not needed anymore
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0
2018-10-03 14:55:06.224 1 WARNING oslo_service.service [-] pid 1852 not in child list
This also happens when you delete the last VM on that compute.
It looks like we delete the namespace because it's no longer needed, but we don't stop haproxy before deleting the namespace. Does this make sense?
It should kill haproxy as per this code: https://github.com/openstack/networking-ovn/blob/master/networking_ovn/agent/metadata/agent.py#L233

I see you have this trace:

2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0

So when you reproduce it, could you dump "ps -ef" to a file and check whether the "Child XXX" pid corresponds to the haproxy instance?
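Roughly, that teardown path should read like the sketch below. This is a paraphrase, not the exact networking-ovn code; destroy_monitored_metadata_proxy is neutron's metadata-driver helper, but the signature and order shown here are illustrative. The point is that haproxy has to be stopped before the ovnmeta namespace disappears:

from neutron.agent.linux import ip_lib
from neutron.agent.metadata import driver as metadata_driver

# Paraphrased sketch of the datapath teardown, not the exact agent code.
def teardown_datapath(self, datapath):
    namespace = 'ovnmeta-%s' % datapath
    # Stop the per-datapath haproxy before the namespace goes away
    # (hedged: the real helper call/signature may differ).
    metadata_driver.MetadataDriver.destroy_monitored_metadata_proxy(
        self._process_monitor, datapath, self.conf, namespace)
    # Only then remove the now-unneeded namespace.
    ip_lib.IPWrapper(namespace=namespace).garbage_collect_namespace()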
After our IRC conversation, it looks like haproxy is actually being killed through kill -9. I think the zombie processes are caused by the lack of an init process inside the docker container to reap them: https://medium.com/@nagarwal/an-init-system-inside-the-docker-container-3821ee233f4b
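To illustrate the mechanism with a generic sketch (plain Python, not agent code): an exited child stays around as a <defunct> entry until its parent reaps it with waitpid(); if the parent never does (here, PID 1 inside the container), the zombie persists until the container is restarted.

import os
import time

# Generic demonstration of how a zombie arises; not agent code.
pid = os.fork()
if pid == 0:
    os._exit(0)         # child exits immediately
else:
    # The parent has not reaped yet: for these 30 seconds the child
    # shows up in "ps" as "<defunct>" (state Z).
    time.sleep(30)
    os.waitpid(pid, 0)  # reaping removes the zombie entry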
In stable/queens we changed the way we execute haproxy: instead of spawning the haproxy process inside the OVN metadata agent container, the agent now spawns a separate docker container for it: https://review.openstack.org/#/c/591298/ This should land in the next z-stream release.
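Conceptually, the new model looks something like the sketch below. This is illustrative only (the actual change is wired through the deployment's wrapper scripts, and the image name, flags, and helper names here are assumptions); the key idea is that the docker daemon, not the agent, becomes haproxy's parent and reaps it when it exits.

import subprocess

# Illustrative sketch of the sidecar lifecycle, not the code from the
# review: one haproxy container per datapath, started on VM boot and
# force-removed on teardown. The image name is just an example.
IMAGE = 'openstack-neutron-metadata-agent-ovn:latest'

def start_haproxy_sidecar(netns, cfg_path):
    name = 'neutron-haproxy-%s' % netns
    subprocess.check_call([
        'docker', 'run', '--detach', '--name', name,
        '--network', 'host', '--privileged',
        '-v', '/run/netns:/run/netns:shared',   # so "ip netns exec" works
        IMAGE,
        'ip', 'netns', 'exec', netns,
        'haproxy', '-f', cfg_path, '-db'])      # -db keeps haproxy in the foreground
    return name

def stop_haproxy_sidecar(netns):
    # Removing the container also terminates and reaps haproxy.
    subprocess.check_call(['docker', 'rm', '-f', 'neutron-haproxy-%s' % netns])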
This should fix the zombie issue: https://bugzilla.redhat.com/show_bug.cgi?id=1589849
In the next OSP13.z (and also OSP14), this is the behavior when a VM is hosted on a compute node (i.e. haproxy no longer runs inside the OVN metadata agent container; instead, a sidecar container is spawned/torn down when a VM boots/stops):

[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
3d4b63f13a8e   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "ip netns exec ovn..."   19 minutes ago   Up 19 minutes            neutron-haproxy-ovnmeta-2f7d8747-bce6-4afd-8f20-cff29e531ff4
aea63c8cb334   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"            15 hours ago     Up 15 hours (healthy)    ovn_metadata_agent

That is, one docker container running the OVN metadata agent plus a sidecar container running the haproxy instance. If we switch the VM off, that sidecar container is expected to go away:

(overcloud) [stack@undercloud-0 ~]$ openstack server stop cirros1

[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
aea63c8cb334   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"            16 hours ago     Up 16 hours (healthy)    ovn_metadata_agent

[heat-admin@compute-1 ~]$ ps -aef | grep haproxy
[heat-admin@compute-1 ~]$

This effort is being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1589849
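If you want to double-check a host for leftovers, here is a small sketch (generic Python, nothing OVN-specific) that scans /proc for processes in state Z, i.e. what ps shows as <defunct>:

import os

# Quick check for leftover zombies, e.g. a defunct haproxy.
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/stat' % pid) as f:
            fields = f.read().split()
    except IOError:
        continue  # process exited between listdir() and open()
    # /proc/<pid>/stat layout: pid (comm) state ppid ...
    # (naive split: assumes no spaces in the process name)
    if fields[2] == 'Z':
        print('zombie pid=%s comm=%s ppid=%s' % (pid, fields[1], fields[3]))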
*** This bug has been marked as a duplicate of bug 1589849 ***