Bug 1673412

Summary: After pcs resource restart ovn-dbs-bundle, all Neutron agents are in Flapping Dead state
Product: Red Hat OpenStack
Reporter: pkomarov
Component: openstack-tripleo-common
Assignee: Kamil Sambor <ksambor>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: medium
Docs Contact:
Priority: medium
Version: 14.0 (Rocky)
CC: apevec, aschultz, bhaley, dalvarez, jlibosva, lhh, lmartins, majopela, mburns, michele, slinaber, twilson
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-15 14:37:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description pkomarov 2019-02-07 13:18:11 UTC
Description of problem:
After pcs resource restart ovn-dbs-bundle, all Neutron agents are in Dead state

Version-Release number of selected component (if applicable):
osp14 with ovs setup 

How reproducible:
always

Steps to Reproduce:
1. Deploy osp14 with ovs
2. From any controller, run: pcs resource restart ovn-dbs-bundle
3. Notice that:

 (undercloud) [stack@undercloud-0 ~]$ openstack network agent list
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host                                 | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| 0ee0aa9a-3577-48e2-916e-418a802cf873 | DHCP agent         | undercloud-0.localdomain             | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 15be8f7e-5d33-4db5-9daa-0723f1a169c8 | Baremetal Node     | 155163a6-c299-4fea-a264-34391aa8b31e | None              | :-)   | UP    | ironic-neutron-agent      |
| 3dae012c-5029-428d-84a8-fba517684f07 | Baremetal Node     | 59315382-0d95-4014-93b7-76b96e1d29c1 | None              | :-)   | UP    | ironic-neutron-agent      |
| 4662b6fd-6b6d-4858-8a90-0f5df539f08d | Baremetal Node     | 0d135f21-7b08-4566-8e98-b79d8a12a5ff | None              | :-)   | UP    | ironic-neutron-agent      |
| 8b88e238-51ed-413a-ac03-b4825d68f512 | L3 agent           | undercloud-0.localdomain             | nova              | :-)   | UP    | neutron-l3-agent          |
| 8ee6f1be-67df-47df-adbc-40231fc5f59e | Baremetal Node     | fddc0164-3bb7-47fa-9707-c53721709538 | None              | :-)   | UP    | ironic-neutron-agent      |
| a6b6eab5-bf4c-4362-b552-e161dcdc9181 | Baremetal Node     | d701304a-ecc6-48e3-b196-1a0966d71735 | None              | :-)   | UP    | ironic-neutron-agent      |
| ed2b41a9-8716-4b9c-810c-a74d01f3625f | Open vSwitch agent | undercloud-0.localdomain             | None              | :-)   | UP    | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
(undercloud) [stack@undercloud-0 ~]$ while true ;do sleep 5s;openstack network agent l'ist
> ^C
(undercloud) [stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ while true ;do date;sleep 5s;openstack network agent list;date;done|& agent_list.log
-bash: agent_list.log: command not found
^C
(overcloud) [stack@undercloud-0 ~]$ while true ;do date;sleep 5s;openstack network agent list;date;done|& tee agent_list.log
Thu Feb  7 08:16:25 EST 2019
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type                   | Host                     | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| 8204aa87-ba43-48ff-abea-73ab36bcbd58 | OVN Controller Gateway agent | controller-0.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| abc299b6-208a-4fb9-be1f-0c5854c0d91e | OVN Metadata agent           | compute-1.localdomain    | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 50fa7202-fb94-4e83-b4b0-c6eca05a232b | OVN Controller agent         | compute-1.localdomain    | n/a               | XXX   | UP    | ovn-controller                |
| 41540df6-1146-41fb-b971-3886e1bb4622 | OVN Controller Gateway agent | controller-1.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| c7400438-36b0-480e-a84c-dd5e1630d007 | OVN Controller agent         | compute-0.localdomain    | n/a               | XXX   | UP    | ovn-controller                |
| df8de298-756e-4eec-8f18-ff6f30b25862 | OVN Metadata agent           | compute-0.localdomain    | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 75094519-5762-4b44-a8cb-34dcb24374c8 | OVN Controller Gateway agent | controller-2.localdomain | n/a               | XXX   | UP    | ovn-controller                |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
Thu Feb  7 08:16:32 EST 2019
Thu Feb  7 08:16:32 EST 2019
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type                   | Host                     | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| 8204aa87-ba43-48ff-abea-73ab36bcbd58 | OVN Controller Gateway agent | controller-0.localdomain | n/a               | :-)   | UP    | ovn-controller                |
| abc299b6-208a-4fb9-be1f-0c5854c0d91e | OVN Metadata agent           | compute-1.localdomain    | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 50fa7202-fb94-4e83-b4b0-c6eca05a232b | OVN Controller agent         | compute-1.localdomain    | n/a               | :-)   | UP    | ovn-controller                |
| 41540df6-1146-41fb-b971-3886e1bb4622 | OVN Controller Gateway agent | controller-1.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| c7400438-36b0-480e-a84c-dd5e1630d007 | OVN Controller agent         | compute-0.localdomain    | n/a               | :-)   | UP    | ovn-controller                |
| df8de298-756e-4eec-8f18-ff6f30b25862 | OVN Metadata agent           | compute-0.localdomain    | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 75094519-5762-4b44-a8cb-34dcb24374c8 | OVN Controller Gateway agent | controller-2.localdomain | n/a               | :-)   | UP    | ovn-controller                |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
Thu Feb  7 08:16:41 EST 2019
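
To make the flapping easier to track than by reading whole tables, the polling loop above could pipe each listing through a small filter. The following is only a sketch: the function name dead_agents is hypothetical, and it assumes the default table layout shown above, where the Alive column is the fifth column and dead agents are marked XXX.

```shell
#!/bin/sh
# Print "host binary" for every agent row whose Alive column is XXX.
# Reads the default table output of `openstack network agent list` on stdin.
# Border lines and timestamps contain no "|" and are skipped by the NF test.
dead_agents() {
    awk -F'|' '
        NF >= 8 {
            host = $4; alive = $6; binary = $8
            gsub(/^ +| +$/, "", host)
            gsub(/^ +| +$/, "", alive)
            gsub(/^ +| +$/, "", binary)
            if (alive == "XXX") print host, binary
        }'
}
```

Possible usage inside the loop: openstack network agent list | dead_agents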


Actual results:


Expected results:


Additional info:

Comment 1 pkomarov 2019-02-07 13:22:38 UTC
OC sos reports and stack home are at :
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1673412/

Comment 2 pkomarov 2019-02-07 13:29:19 UTC
Correction:
Version-Release number of selected component (if applicable):
osp14 with ovs setup -> osp14 with OVN setup

Comment 3 pkomarov 2019-02-19 12:11:44 UTC
Adding additional info:
Network agents are reported in an unhealthy state => docker healthchecks are failing => there is a listener on port 6642,
but the healthcheck executable itself is not found...
Adding DFG:DF as main, since this may be a kolla configuration issue...


(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"docker ps|grep ovn_controller"
 [WARNING]: Found both group and host with same name: undercloud

controller-0 | SUCCESS | rc=0 >>
5a4fef8d0533        192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1            "kolla_start"            16 hours ago        Up 19 minutes (unhealthy)                       ovn_controller

compute-0 | SUCCESS | rc=0 >>
ccda5101a6f6        192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1               "kolla_start"       16 hours ago        Up 19 minutes (unhealthy)                       ovn_controller

compute-1 | SUCCESS | rc=0 >>
5eefa78edb5b        192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1               "kolla_start"       16 hours ago        Up 19 minutes (unhealthy)                       ovn_controller

controller-2 | SUCCESS | rc=0 >>
751944831ac9        192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1            "kolla_start"            16 hours ago        Up 19 minutes (unhealthy)                       ovn_controller

controller-1 | SUCCESS | rc=0 >>
bb7e14729af5        192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1            "kolla_start"            16 hours ago        Up 19 minutes (unhealthy)                       ovn_controller

(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"docker inspect ovn_controller|grep healthcheck"
 [WARNING]: Found both group and host with same name: undercloud

compute-0 | SUCCESS | rc=0 >>
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                    "/openstack/healthcheck 6642"
                "config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",

controller-0 | SUCCESS | rc=0 >>
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                    "/openstack/healthcheck 6642"
                "config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",

compute-1 | SUCCESS | rc=0 >>
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                    "/openstack/healthcheck 6642"
                "config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",

controller-1 | SUCCESS | rc=0 >>
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                    "/openstack/healthcheck 6642"
                "config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",

controller-2 | SUCCESS | rc=0 >>
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                        "Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
                    "/openstack/healthcheck 6642"
                "config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",

(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'docker exec `docker ps -f name=ovn_controller -q`  sh -c "grep -r 6642 /etc"'
 [WARNING]: Found both group and host with same name: undercloud

compute-0 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0

controller-1 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0

controller-2 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0

controller-0 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0

compute-1 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0


(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'docker exec `docker ps -f name=ovn_controller -q`  sh -c "ss -atlp|grep 6642"'
 [WARNING]: Found both group and host with same name: undercloud

compute-0 | FAILED | rc=1 >>
non-zero return code

compute-1 | FAILED | rc=1 >>
non-zero return code

controller-1 | SUCCESS | rc=0 >>
LISTEN     0      10     172.17.1.11:6642                     *:*                    

controller-0 | SUCCESS | rc=0 >>
LISTEN     0      10     172.17.1.11:6642                     *:*                    

controller-2 | SUCCESS | rc=0 >>
LISTEN     0      10     172.17.1.11:6642                     *:*
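
The two container-side observations above (health state recorded by docker, and whether the script exists) can be checked per node with something like the sketch below. The function names are hypothetical; the container name ovn_controller and the path /openstack/healthcheck are taken from the output above, and the sketch assumes the container ships a POSIX sh.

```shell
#!/bin/sh
# Report whether the healthcheck script is present and executable
# inside the container. Prints "present" or "missing".
healthcheck_status() {
    local container="${1:-ovn_controller}"
    local script="${2:-/openstack/healthcheck}"
    if docker exec "$container" sh -c 'test -x "$1"' sh "$script"; then
        echo "present"
    else
        echo "missing"
    fi
}

# Print the health status docker has recorded for the container
# (for example "healthy" or "unhealthy").
health_state() {
    docker inspect --format '{{.State.Health.Status}}' "${1:-ovn_controller}"
}
```

On a node in the state shown above, healthcheck_status would be expected to print "missing" while health_state reports the container as unhealthy.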

Comment 4 Daniel Alvarez Sanchez 2019-02-20 13:53:11 UTC
Looks like ovn-controller is not running/restarting on compute nodes, as per "ss -atlp|grep 6642" rc=1 there.
We need to check in the sosreports what the ovn-controller logs look like (/var/log/containers/openvswitch/ovn-controller.log*).
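
The listener check referred to here can be made a bit stricter than a plain grep for 6642, which would also match the number elsewhere in the line. A minimal sketch, assuming the column layout of the ss output in comment 3 (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port); the function name listens_on is hypothetical:

```shell
#!/bin/sh
# Succeed (exit 0) if the ss output on stdin has a LISTEN socket whose
# local address ends in ":<port>"; exit 1 otherwise.
listens_on() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }'
}
```

Possible usage on a node: ss -tln | listens_on 6642 && echo "listener present"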

Comment 5 Lucas Alvares Gomes 2019-02-20 14:11:49 UTC
(In reply to Daniel Alvarez Sanchez from comment #4)
> Looks like ovn-controller is not running/restarting on compute nodes as per
> "ss -atlp|grep 6642" rc=1 there
> Need to check in sosreports how ovn-controller logs look like
> (/var/log/containers/openvswitch/ovn-controller.log*)

Yeah that would be good. Also, apparently this /openstack/healthcheck script should have been added to the image by TripleO [0][1]. I do not know why it's missing there.

[0] https://github.com/openstack/tripleo-common/blob/fe8dd5c9076ba7ada444da361b4e5533ace90435/container-images/tripleo_kolla_template_overrides.j2#L722-L726
[1] https://github.com/openstack/tripleo-common/blob/fe8dd5c9076ba7ada444da361b4e5533ace90435/healthcheck/ovn-controller

Comment 6 pkomarov 2019-02-21 06:26:38 UTC
(In reply to Lucas Alvares Gomes from comment #5)
> (In reply to Daniel Alvarez Sanchez from comment #4)
> > Looks like ovn-controller is not running/restarting on compute nodes as per
> > "ss -atlp|grep 6642" rc=1 there
> > Need to check in sosreports how ovn-controller logs look like
> > (/var/log/containers/openvswitch/ovn-controller.log*)
> 
> Yeah that would be good. Also, apparently this /openstack/healthcheck script
> should have been added to the image by TripleO [0][1]. I do not know why
> it's missing there.
> 
> [0]
> https://github.com/openstack/tripleo-common/blob/
> fe8dd5c9076ba7ada444da361b4e5533ace90435/container-images/
> tripleo_kolla_template_overrides.j2#L722-L726
> [1]
> https://github.com/openstack/tripleo-common/blob/
> fe8dd5c9076ba7ada444da361b4e5533ace90435/healthcheck/ovn-controller

Also, please note that the added healthcheck script is not always missing: at first it was there, then we restarted the container and it was gone.
If we do some more restarts, the ovn_controller container may load that mount/script like it should, so I'm just pointing out that this breaks only "sometimes" :)
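
Since the failure is intermittent, one way to get a rough idea of how often it happens would be to restart the container several times and check for the script after each restart. This is only a sketch under assumptions: the function name restart_and_check and the default of five rounds are arbitrary, and it restarts the container directly with docker rather than going through pcs as in the original reproduction steps.

```shell
#!/bin/sh
# Restart a container N times and count how often the healthcheck
# script is missing afterwards. Prints one line per failing round
# and a summary line at the end.
restart_and_check() {
    local container="${1:-ovn_controller}"
    local rounds="${2:-5}"
    local missing=0
    local i=1
    while [ "$i" -le "$rounds" ]; do
        docker restart "$container" >/dev/null
        if ! docker exec "$container" sh -c 'test -x /openstack/healthcheck'; then
            missing=$((missing + 1))
            echo "round $i: healthcheck missing"
        fi
        i=$((i + 1))
    done
    echo "missing in $missing of $rounds restarts"
}
```

Possible usage on an affected node: restart_and_check ovn_controller 10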

Comment 7 Lucas Alvares Gomes 2019-02-26 10:01:08 UTC
(In reply to pkomarov from comment #6)
> (In reply to Lucas Alvares Gomes from comment #5)
> > (In reply to Daniel Alvarez Sanchez from comment #4)
> > > Looks like ovn-controller is not running/restarting on compute nodes as per
> > > "ss -atlp|grep 6642" rc=1 there
> > > Need to check in sosreports how ovn-controller logs look like
> > > (/var/log/containers/openvswitch/ovn-controller.log*)
> > 
> > Yeah that would be good. Also, apparently this /openstack/healthcheck script
> > should have been added to the image by TripleO [0][1]. I do not know why
> > it's missing there.
> > 
> > [0]
> > https://github.com/openstack/tripleo-common/blob/
> > fe8dd5c9076ba7ada444da361b4e5533ace90435/container-images/
> > tripleo_kolla_template_overrides.j2#L722-L726
> > [1]
> > https://github.com/openstack/tripleo-common/blob/
> > fe8dd5c9076ba7ada444da361b4e5533ace90435/healthcheck/ovn-controller
> 
> also please note that the added healthcheck script is not always missing ,
> first it was there . then we restarted the container and it was gone.
> If we do some more restarts the ovn_controller container may load that
> mount/script like it should , so I'm just reminding that this breaks
> "sometimes" :)

Interesting, thanks for that pointer.

Btw, I'm changing the component of this bug to python-tripleo-common because that's where the healthcheck script is injected into the image.

Comment 10 Jakub Libosvar 2020-05-15 14:37:25 UTC
This has been fixed by https://review.opendev.org/#/c/568265/5. I checked on OSP16 and the healthchecks are present in ovn_controller. I'm closing this BZ, but feel free to reopen it in case there is still an issue.