Bug 1633594
| Summary: | haproxy service running inside ovn_metadata_agent container goes in defunct state after instance migrated/deleted from the compute node | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Rahul Chincholkar <rchincho> |
| Component: | python-networking-ovn | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED DUPLICATE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 13.0 (Queens) | CC: | apevec, bperkins, dalvarez, dhill, lhh, majopela, nyechiel, sandyada |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-04 08:55:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 2
David Hill
2018-10-03 14:41:00 UTC
This appears to happen only if the last VM was migrated off this host: if I create 2 VMs and live migrate one, haproxy keeps on running, but as soon as I migrate the remaining VM off that host, haproxy is zombified. If you then try to create a new VM, a new haproxy service is spawned, but we're still stuck with the zombie haproxy until we restart the docker container.

```
2018-10-03 14:52:57.524 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b bound to our chassis
2018-10-03 14:53:23.810 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 2.6696141
2018-10-03 14:53:23.856 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 297 time: 0.0136170
2018-10-03 14:53:23.931 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 0.0357289
2018-10-03 14:53:23.968 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 0.0220120
2018-10-03 14:53:24.004 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 143 time: 0.0178549
2018-10-03 14:53:24.044 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 0.0263171
2018-10-03 14:53:24.090 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 135 time: 0.0326929
2018-10-03 14:53:24.124 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200 len: 144 time: 0.0196948
2018-10-03 14:53:24.159 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 144 time: 0.0164320
2018-10-03 14:53:24.194 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 0.0225489
2018-10-03 14:53:24.253 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 0.0239232
2018-10-03 14:53:24.285 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 0.0190861
2018-10-03 14:53:24.327 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 0.0323339
2018-10-03 14:53:24.372 24 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200 len: 144 time: 0.0328629
2018-10-03 14:53:24.421 23 INFO eventlet.wsgi.server [-] 10.254.0.5,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 0.0364070
2018-10-03 14:55:03.852 1 INFO networking_ovn.agent.metadata.agent [-] Port 4d9dd315-7fb3-4720-80e5-fdcb55387c3f in datapath 76a5cdba-0212-492c-99cd-ba77d726a14b unbound from our chassis
2018-10-03 14:55:03.892 1 INFO networking_ovn.agent.metadata.agent [-] Cleaning up ovnmeta-76a5cdba-0212-492c-99cd-ba77d726a14b namespace which is not needed anymore
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0
2018-10-03 14:55:06.224 1 WARNING oslo_service.service [-] pid 1852 not in child list
```

This also happens when you delete the last VM on that compute node. It looks like we're deleting the namespace because it's no longer needed, but we don't stop haproxy before deleting the namespace. Does this make sense?
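The ordering the comment argues for — terminate and reap the haproxy child before tearing down its namespace — can be sketched minimally in Python. This is an illustration only, not the agent's actual code: `stop_and_reap` is a hypothetical helper and a `sleep` child stands in for haproxy.

```python
import subprocess


def stop_and_reap(proc):
    """Hypothetical helper: terminate a child we are the parent of,
    then wait() on it so its process-table entry is removed before
    any surrounding resources (e.g. a network namespace) are torn
    down. Skipping the wait() is what leaves a <defunct> entry."""
    proc.terminate()        # SIGTERM: ask the child to exit
    proc.wait(timeout=10)   # reap it: no zombie is left behind


# Stand-in for the long-running haproxy process.
child = subprocess.Popen(["sleep", "60"])
stop_and_reap(child)
print(child.returncode)  # -15, i.e. terminated by SIGTERM
```

After `stop_and_reap` returns, `ps` shows no `<defunct>` entry for the child, which is the property the namespace cleanup path would want before deleting `ovnmeta-*`.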
It should kill haproxy as per this code: https://github.com/openstack/networking-ovn/blob/master/networking_ovn/agent/metadata/agent.py#L233

I see you have this trace:

```
2018-10-03 14:55:06.223 1 INFO oslo_service.service [-] Child 1852 exited with status 0
```

So could you reproduce, dump a "ps -ef" to a file, and check if the Child XXX corresponds to the haproxy instance?

After our IRC conversation it looks like it's actually being killed through kill -9. I think this is the reason for the zombie processes: https://medium.com/@nagarwal/an-init-system-inside-the-docker-container-3821ee233f4b

We changed the way we execute haproxy in stable/queens, i.e. instead of spawning the haproxy process inside the ovn_metadata_agent container, it'll spawn a separate docker container for it: https://review.openstack.org/#/c/591298/ This should land in the next z-stream release and should fix the zombies issue: https://bugzilla.redhat.com/show_bug.cgi?id=1589849

In the next OSP13.z (and also OSP14), this is the behavior when a VM is hosted on a compute node (i.e. haproxy won't run inside the ovn_metadata_agent container anymore; instead, a sidecar container is spawned/torn down when a VM boots/stops):

```
[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
3d4b63f13a8e 192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1 "ip netns exec ovn..." 19 minutes ago Up 19 minutes neutron-haproxy-ovnmeta-2f7d8747-bce6-4afd-8f20-cff29e531ff4
aea63c8cb334 192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1 "kolla_start" 15 hours ago Up 15 hours (healthy) ovn_metadata_agent
```

That is: the docker container running the OVN metadata agent, plus a sidecar container running the haproxy instance.
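As background on why the zombies appear at all (the point of the init-system article linked above): a child whose parent never wait()s on it stays in the process table as `<defunct>` after it exits. A minimal reproduction, unrelated to the agent's code:

```python
import os
import subprocess
import time

# Spawn a short-lived child and deliberately do NOT wait() on it.
child = subprocess.Popen(["true"])
time.sleep(0.5)  # give the child time to exit

# The exited child is still in the process table, in state 'Z'
# (zombie/<defunct>), because nobody has reaped it yet.
stat = subprocess.run(
    ["ps", "-o", "stat=", "-p", str(child.pid)],
    capture_output=True, text=True,
).stdout.strip()
print(stat)  # begins with 'Z' until the parent reaps the child

# Reaping with waitpid() removes the zombie entry.
os.waitpid(child.pid, 0)
```

Inside a container whose PID 1 is not a real init, orphaned children are never reaped this way, which is why moving haproxy into its own container (where docker manages its lifecycle) avoids the accumulation of `<defunct>` entries.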
If we switch the VM off, that sidecar container is expected to go away:

```
(overcloud) [stack@undercloud-0 ~]$ openstack server stop cirros1
[heat-admin@compute-1 ~]$ sudo docker ps | grep metadata-agent
aea63c8cb334 192.168.24.1:8787/rhosp14/openstack-neutron-metadata-agent-ovn:2018-10-01.1 "kolla_start" 16 hours ago Up 16 hours (healthy) ovn_metadata_agent
[heat-admin@compute-1 ~]$ ps -aef | grep haproxy
[heat-admin@compute-1 ~]$
```

This effort is being tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1589849

*** This bug has been marked as a duplicate of bug 1589849 ***