I'm able to reproduce this issue easily:

root      82445  0.0  0.0 339200  1880 ?  Sl  13:26  0:00  \_ /usr/bin/docker-containerd-shim-current f8bd73468edc5331b94dcfcb67a14803fff815e56dd761147be08376206e143e /var/run/docker/libcontainerd/f
42435     82460  0.2  2.6 282076 76408 ?  Ss  13:26  0:10  |   \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/n
42435     82685  0.0  2.3 278352 69868 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
42435     82686  0.0  2.3 278352 69868 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /usr/bin/networking-ovn-metadata-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugi
root      82703  0.0  1.1 185684 32452 ?  S   13:26  0:00  |       \_ /usr/bin/python2 /bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/networking-ovn/n
42435     99990  0.0  0.0      0     0 ?  Zs  14:27  0:00  |       \_ [haproxy] <defunct>

Steps:
1) Deploy the overcloud with OVN+DVR+HA
2) Create a VM on computeA
3) Migrate it away

Result: haproxy is now a zombie (the [haproxy] <defunct> entry above).
This appears to happen only when the last VM is migrated off the host: if I create two VMs and live migrate one of them, haproxy keeps running, but as soon as I migrate the remaining VM off that host, haproxy becomes a zombie.
If you try to create a new VM after that, a new haproxy service is spawned, but we're still stuck with the zombie haproxy until we restart the docker container.
2018-10-03 14:52:57.524 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b bound to our chassis
2018-10-03 14:53:23.810 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 2.6696141
2018-10-03 14:53:23.856 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 297 time: 0.0136170
2018-10-03 14:53:23.931 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 0.0357289
2018-10-03 14:53:23.968 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 0.0220120
2018-10-03 14:53:24.004 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 143 time: 0.0178549
2018-10-03 14:53:24.044 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 0.0263171
2018-10-03 14:53:24.090 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 135 time: 0.0326929
2018-10-03 14:53:24.124 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200 len: 144 time: 0.0196948
2018-10-03 14:53:24.159 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 144 time: 0.0164320
2018-10-03 14:53:24.194 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 0.0225489
2018-10-03 14:53:24.253 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 0.0239232
2018-10-03 14:53:24.285 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 0.0190861
2018-10-03 14:53:24.327 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 0.0323339
2018-10-03 14:53:24.372 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200 len: 144 time: 0.0328629
2018-10-03 14:53:24.421 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.0364070
2018-10-03 14:55:03.852 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b unbound from our chassis
2018-10-03 14:55:03.892 1 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-76a5cdba-0212-492c-99cd-ba77d726a14b namespace which is not needed anymore
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0
2018-10-03 14:55:06.224 1 WARNING oslo_service.service [-] pid 1852 not in child list
This also happens when you delete the last VM on that compute.
It looks like we delete the namespace because it's no longer needed, but we don't stop haproxy before deleting the namespace. Does this make sense?
It should kill haproxy as per this code: https://github.com/openstack/networking-ovn/blob/master/networking_ovn/agent/metadata/agent.py#L233

I see you have this trace:

2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0

So when you reproduce it, could you dump "ps -ef" to a file and check whether the "Child XXX" pid corresponds to the haproxy instance?
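Roughly, that teardown path should read like the sketch below. This is a paraphrase, not the exact networking-ovn code; destroy_monitored_metadata_proxy is neutron's metadata-driver helper, but the signature and order shown here are illustrative. The point is that haproxy has to be stopped before the ovnmeta namespace disappears:

from neutron.agent.linux import ip_lib
from neutron.agent.metadata import driver as metadata_driver

# Paraphrased sketch of the datapath teardown, not the exact agent code.
def teardown_datapath(self, datapath):
    namespace = 'ovnmeta-%s' % datapath
    # Stop the per-datapath haproxy before the namespace goes away
    # (hedged: the real helper call/signature may differ).
    metadata_driver.MetadataDriver.destroy_monitored_metadata_proxy(
        self._process_monitor, datapath, self.conf, namespace)
    # Only then remove the now-unneeded namespace.
    ip_lib.IPWrapper(namespace=namespace).garbage_collect_namespace()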
After our IRC conversation, it looks like haproxy is actually being killed through kill -9. I think the zombie processes are caused by the lack of an init process inside the docker container to reap them: https://medium.com/@nagarwal/an-init-system-inside-the-docker-container-3821ee233f4b
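To illustrate the mechanism with a generic sketch (plain Python, not agent code): an exited child stays around as a <defunct> entry until its parent reaps it with waitpid(); if the parent never does (here, PID 1 inside the container), the zombie persists until the container is restarted.

import os
import time

# Generic demonstration of how a zombie arises; not agent code.
pid = os.fork()
if pid == 0:
    os._exit(0)         # child exits immediately
else:
    # The parent has not reaped yet: for these 30 seconds the child
    # shows up in "ps" as "<defunct>" (state Z).
    time.sleep(30)
    os.waitpid(pid, 0)  # reaping removes the zombie entry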
In stable/queens we changed the way we execute haproxy: instead of spawning the haproxy process inside the OVN metadata agent container, the agent now spawns a separate docker container for it: https://review.openstack.org/#/c/591298/ This should land in the next z-stream release.
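Conceptually, the new model looks something like the sketch below. This is illustrative only (the actual change is wired through the deployment's wrapper scripts, and the image name, flags, and helper names here are assumptions); the key idea is that the docker daemon, not the agent, becomes haproxy's parent and reaps it when it exits.

import subprocess

# Illustrative sketch of the sidecar lifecycle, not the code from the
# review: one haproxy container per datapath, started on VM boot and
# force-removed on teardown. The image name is just an example.
IMAGE = 'openstack-neutron-metadata-agent-ovn:latest'

def start_haproxy_sidecar(netns, cfg_path):
    name = 'neutron-haproxy-%s' % netns
    subprocess.check_call([
        'docker', 'run', '--detach', '--name', name,
        '--network', 'host', '--privileged',
        '-v', '/run/netns:/run/netns:shared',   # so "ip netns exec" works
        IMAGE,
        'ip', 'netns', 'exec', netns,
        'haproxy', '-f', cfg_path, '-db'])      # -db keeps haproxy in the foreground
    return name

def stop_haproxy_sidecar(netns):
    # Removing the container also terminates and reaps haproxy.
    subprocess.check_call(['docker', 'rm', '-f', 'neutron-haproxy-%s' % netns])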
This should fix the zombie issue: https://bugzilla.redhat.com/show_bug.cgi?id=1589849
In the next OSP13.z (and also OSP14), this is the behavior when a VM is hosted on a compute node (i.e. haproxy no longer runs inside the OVN metadata agent container; instead, a sidecar container is spawned/torn down when a VM boots/stops):

[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
3d4b63f13a8e   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "ip netns exec ovn..."   19 minutes ago   Up 19 minutes            neutron-haproxy-ovnmeta-2f7d8747-bce6-4afd-8f20-cff29e531ff4
aea63c8cb334   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"            15 hours ago     Up 15 hours (healthy)    ovn_metadata_agent

That is, one docker container running the OVN metadata agent plus a sidecar container running the haproxy instance. If we switch the VM off, that sidecar container is expected to go away:

(overcloud) [stack@undercloud-0 ~]$ openstack server stop cirros1

[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
aea63c8cb334   192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1   "kolla_start"            16 hours ago     Up 16 hours (healthy)    ovn_metadata_agent

[heat-admin@compute-1 ~]$ ps -aef | grep haproxy
[heat-admin@compute-1 ~]$

This effort is being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1589849
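If you want to double-check a host for leftovers, here is a small sketch (generic Python, nothing OVN-specific) that scans /proc for processes in state Z, i.e. what ps shows as <defunct>:

import os

# Quick check for leftover zombies, e.g. a defunct haproxy.
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/stat' % pid) as f:
            fields = f.read().split()
    except IOError:
        continue  # process exited between listdir() and open()
    # /proc/<pid>/stat layout: pid (comm) state ppid ...
    # (naive split: assumes no spaces in the process name)
    if fields[2] == 'Z':
        print('zombie pid=%s comm=%s ppid=%s' % (pid, fields[1], fields[3]))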
*** This bug has been marked as a duplicate of bug 1589849 ***