Bug 1633594 - haproxy service running inside ovn_metadata_agent container goes in defunct state after instance migrated/deleted from the compute node
Keywords:
Status: CLOSED DUPLICATE of bug 1589849
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Assaf Muller
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-27 10:59 UTC by Rahul Chincholkar
Modified: 2021-12-10 17:51 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-04 08:55:57 UTC
Target Upstream Version:
Embargoed:




Links
System ID Last Updated
Red Hat Issue Tracker OSP-11673 2021-12-10 17:51:56 UTC
Red Hat Knowledge Base (Solution) 3644782 2018-10-09 12:49:20 UTC

Comment 2 David Hill 2018-10-03 14:41:00 UTC
I'm able to reproduce this issue easily:

root       82445  0.0  0.0 339200  1880 ?        Sl   13:26   0:00      \_ /usr/bin/docker-containerd-shim-current f8bd73468edc5331b94dcfcb67a14803fff815e56dd761147be08376206e143e /var/run/docker/libcontainerd/f
42435      82460  0.2  2.6 282076 76408 ?        Ss   13:26   0:10      |   \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/n
42435      82685  0.0  2.3 278352 69868 ?        S    13:26   0:00      |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
42435      82686  0.0  2.3 278352 69868 ?        S    13:26   0:00      |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
root       82703  0.0  1.1 185684 32452 ?        S    13:26   0:00      |       \_ /usr/bin/python2 /bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/networking-ovn/n
42435      99990  0.0  0.0      0     0 ?        Zs   14:27   0:00      |       \_ [haproxy] <defunct>

Steps:

1) Deploy overcloud with ovn+dvr+ha
2) Create a VM on computeA
3) Migrate it away

Result:
haproxy is now a zombie (defunct) process

Comment 3 David Hill 2018-10-03 14:50:15 UTC
This appears to happen only when the last VM is migrated off the host: if I create 2 VMs and live migrate one of them, haproxy keeps running, but as soon as I migrate the remaining VM off that host, haproxy becomes a zombie.

Comment 4 David Hill 2018-10-03 14:54:38 UTC
If you try to create a new VM after that, a new haproxy service is spawned, but we're still stuck with the zombie haproxy until we restart the docker container.

Comment 5 David Hill 2018-10-03 14:56:08 UTC
2018-10-03 14:52:57.524 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b bound to our chassis
2018-10-03 14:53:23.810 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200  len: 146 time: 2.6696141
2018-10-03 14:53:23.856 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404  len: 297 time: 0.0136170
2018-10-03 14:53:23.931 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200  len: 146 time: 0.0357289
2018-10-03 14:53:23.968 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200  len: 136 time: 0.0220120
2018-10-03 14:53:24.004 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200  len: 143 time: 0.0178549
2018-10-03 14:53:24.044 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200  len: 146 time: 0.0263171
2018-10-03 14:53:24.090 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200  len: 135 time: 0.0326929
2018-10-03 14:53:24.124 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200  len: 144 time: 0.0196948
2018-10-03 14:53:24.159 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200  len: 144 time: 0.0164320
2018-10-03 14:53:24.194 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404  len: 297 time: 0.0225489
2018-10-03 14:53:24.253 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200  len: 143 time: 0.0239232
2018-10-03 14:53:24.285 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200  len: 138 time: 0.0190861
2018-10-03 14:53:24.327 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200  len: 143 time: 0.0323339
2018-10-03 14:53:24.372 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200  len: 144 time: 0.0328629
2018-10-03 14:53:24.421 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200  len: 139 time: 0.0364070
2018-10-03 14:55:03.852 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b unbound from our chassis
2018-10-03 14:55:03.892 1 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-76a5cdba-0212-492c-99cd-ba77d726a14b namespace which is not needed anymore
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0
2018-10-03 14:55:06.224 1 WARNING oslo_service.service [-] pid 1852 not in child list

Comment 6 David Hill 2018-10-03 14:58:50 UTC
This also happens when you delete the last VM on that compute.

Comment 7 David Hill 2018-10-03 15:08:20 UTC
It looks like we're deleting the namespace because it's no longer needed, but we don't stop haproxy before deleting it. Does this make sense?
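
For illustration only, a minimal sketch of the teardown order being suggested here; the function and variable names are invented, and this is not the actual networking-ovn code (the real cleanup path is linked in the next comment):

# Hypothetical sketch of "stop haproxy, reap it, then delete the namespace".
# Names are made up; the real teardown lives in networking-ovn's agent.py.
import os
import signal
import subprocess

def teardown_datapath(namespace, haproxy_pid):
    try:
        os.kill(haproxy_pid, signal.SIGTERM)   # ask haproxy to exit first
        os.waitpid(haproxy_pid, 0)             # reap it so it never shows up as <defunct>
    except OSError:
        pass                                   # already gone, or not our direct child
    subprocess.call(['ip', 'netns', 'delete', namespace])  # only now drop the namespace

The waitpid() call only works because haproxy is a direct child of the agent process, which is exactly what the ps output in comment 2 shows.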

Comment 8 Daniel Alvarez Sanchez 2018-10-03 15:14:19 UTC
It should kill haproxy as per this code:

https://github.com/openstack/networking-ovn/blob/master/networking_ovn/agent/metadata/agent.py#L233

I see you have this trace:
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0

So could you reproduce it, dump a "ps -ef" to a file, and check whether the Child XXX pid corresponds to the haproxy instance?
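
For anyone reproducing this, a small hypothetical helper (not part of the agent) that lists zombie processes and their parent pids, which can then be matched against the "Child XXXX exited" pid in the agent log:

# Hypothetical helper: list zombie (defunct) processes and their parents by
# reading /proc/<pid>/stat. Assumes process names contain no spaces.
import os

def list_zombies():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/stat' % pid) as f:
                fields = f.read().split()
        except IOError:                    # process exited while we were looking
            continue
        comm, state, ppid = fields[1], fields[2], fields[3]
        if state == 'Z':
            print('zombie pid=%s ppid=%s comm=%s' % (pid, ppid, comm))

if __name__ == '__main__':
    list_zombies()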

Comment 9 Daniel Alvarez Sanchez 2018-10-03 15:25:29 UTC
After our IRC conversation, it looks like haproxy is actually being killed with kill -9.

I think this is the reason for the zombie processes:

https://medium.com/@nagarwal/an-init-system-inside-the-docker-container-3821ee233f4b
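
A minimal standalone demo of the mechanism described in that article (none of this is networking-ovn code): a child killed with SIGKILL stays around as <defunct> until its parent, or an init process, reaps it with waitpid(). Inside a container whose PID 1 does not reap children, nobody ever does.

# Demo: a killed child remains a zombie until someone calls waitpid() on it.
import os
import signal
import time

pid = os.fork()
if pid == 0:
    time.sleep(3600)                  # child: just sits there, like haproxy
else:
    os.kill(pid, signal.SIGKILL)      # parent: kill -9 the child...
    time.sleep(10)                    # ...but do not reap it yet
    # during this sleep, 'ps -o pid,stat,comm -p <child pid>' shows state Z (<defunct>)
    os.waitpid(pid, 0)                # reaping is what finally removes the zombie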

Comment 10 Daniel Alvarez Sanchez 2018-10-03 15:30:52 UTC
In stable/queens we changed the way we execute haproxy: instead of spawning the haproxy process inside the ovn_metadata_agent container, the agent will spawn a separate docker container for it.

https://review.openstack.org/#/c/591298/

This should land in the next z-stream release.
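
For context only, a very rough sketch of the sidecar idea using the docker Python SDK; the image name, container naming, flags and haproxy config path below are assumptions, and the actual change linked above is implemented inside networking-ovn/TripleO rather than by calling docker like this:

# Rough, hypothetical illustration of "haproxy as a sidecar container".
# Image, names, paths and flags are made up for illustration.
import docker

client = docker.from_env()

def spawn_haproxy_sidecar(network_id, netns, cfg_path):
    return client.containers.run(
        'neutron-metadata-agent-ovn:latest',                    # assumed image
        name='neutron-haproxy-ovnmeta-%s' % network_id,
        command='ip netns exec %s haproxy -f %s' % (netns, cfg_path),
        privileged=True,                                        # to enter the ovnmeta-* namespace
        network_mode='host',
        volumes={'/run/netns': {'bind': '/run/netns', 'mode': 'rw'}},
        detach=True,
    )

def teardown_haproxy_sidecar(network_id):
    container = client.containers.get('neutron-haproxy-ovnmeta-%s' % network_id)
    container.stop()
    container.remove()

Killing haproxy then becomes stopping its own container, so there is no child process left for the agent to reap.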

Comment 11 Daniel Alvarez Sanchez 2018-10-03 15:53:46 UTC
This should fix the zombie issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1589849

Comment 12 Daniel Alvarez Sanchez 2018-10-04 08:55:09 UTC
In the next OSP 13.z (and also in OSP 14), this is the behavior when a VM is hosted on a compute node (i.e. haproxy no longer runs inside the OVN metadata agent container; instead, a sidecar container is spawned/torn down when a VM boots/stops):


[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
3d4b63f13a8e        192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "ip netns exec ovn..."   19 minutes ago      Up 19 minutes                               neutron-haproxy-ovnmeta-2f7d8747-bce6-4afd-8f20-cff29e531ff4
aea63c8cb334        192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"            15 hours ago        Up 15 hours (healthy)                       ovn_metadata_agent


Above, the docker container running the OVN metadata agent plus a sidecar container running the haproxy instance.

If we switch the VM off, that sidecar container is expected to go away:

(overcloud) [stack@undercloud-0 ~]$ openstack server stop cirros1

[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
aea63c8cb334        192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"       16 hours ago        Up 16 hours (healthy)                       ovn_metadata_agent
 

[heat-admin@compute-1 ~]$ ps -aef | grep haproxy
[heat-admin@compute-1 ~]$


This effort is being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1589849

Comment 13 Daniel Alvarez Sanchez 2018-10-04 08:55:57 UTC

*** This bug has been marked as a duplicate of bug 1589849 ***

