Bug 1673412
| Summary: | After pcs resource restart ovn-dbs-bundle, all Neutron agents are in Flapping Dead state | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | pkomarov |
| Component: | openstack-tripleo-common | Assignee: | Kamil Sambor <ksambor> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 14.0 (Rocky) | CC: | apevec, aschultz, bhaley, dalvarez, jlibosva, lhh, lmartins, majopela, mburns, michele, slinaber, twilson |
| Target Milestone: | --- | Keywords: | Triaged, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-05-15 14:37:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
OC sos reports and stack home are at: http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1673412/

Correction: Version-Release number of selected component (if applicable): osp14 with ovs setup -> osp14 with OVN setup

Adding additional info:
Network agents are reported in an unhealthy state because the docker healthchecks are failing. There is a listener for port 6642,
but the healthcheck executable itself is not found.
Adding DFG:DF as main, since this may be a kolla configuration issue.
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"docker ps|grep ovn_controller"
[WARNING]: Found both group and host with same name: undercloud
controller-0 | SUCCESS | rc=0 >>
5a4fef8d0533 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1 "kolla_start" 16 hours ago Up 19 minutes (unhealthy) ovn_controller
compute-0 | SUCCESS | rc=0 >>
ccda5101a6f6 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1 "kolla_start" 16 hours ago Up 19 minutes (unhealthy) ovn_controller
compute-1 | SUCCESS | rc=0 >>
5eefa78edb5b 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1 "kolla_start" 16 hours ago Up 19 minutes (unhealthy) ovn_controller
controller-2 | SUCCESS | rc=0 >>
751944831ac9 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1 "kolla_start" 16 hours ago Up 19 minutes (unhealthy) ovn_controller
controller-1 | SUCCESS | rc=0 >>
bb7e14729af5 192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1 "kolla_start" 16 hours ago Up 19 minutes (unhealthy) ovn_controller
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"docker inspect ovn_controller|grep healthcheck"
[WARNING]: Found both group and host with same name: undercloud
compute-0 | SUCCESS | rc=0 >>
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"/openstack/healthcheck 6642"
"config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",
controller-0 | SUCCESS | rc=0 >>
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"/openstack/healthcheck 6642"
"config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",
compute-1 | SUCCESS | rc=0 >>
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"/openstack/healthcheck 6642"
"config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",
controller-1 | SUCCESS | rc=0 >>
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"/openstack/healthcheck 6642"
"config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",
controller-2 | SUCCESS | rc=0 >>
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"Output": "/bin/sh: /openstack/healthcheck: No such file or directory\n"
"/openstack/healthcheck 6642"
"config_data": "{\"start_order\": 1, \"healthcheck\": {\"test\": \"/openstack/healthcheck 6642\"}, \"image\": \"192.168.24.1:8787/rhosp14/openstack-ovn-controller:2019-02-05.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\"], \"user\": \"root\", \"volumes\": [\"/var/lib/kolla/config_files/ovn_controller.json:/var/lib/kolla/config_files/config.json:ro\", \"/lib/modules:/lib/modules:ro\", \"/run:/run\", \"/var/log/containers/openvswitch:/var/log/openvswitch\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}",
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'docker exec `docker ps -f name=ovn_controller -q` sh -c "grep -r 6642 /etc"'
[WARNING]: Found both group and host with same name: undercloud
compute-0 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0
controller-1 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0
controller-2 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0
controller-0 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0
compute-1 | SUCCESS | rc=0 >>
/etc/selinux/targeted/active/ports.local:portcon tcp 6642 system_u:object_r:ovsdb_port_t:s0
(overcloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'docker exec `docker ps -f name=ovn_controller -q` sh -c "ss -atlp|grep 6642"'
[WARNING]: Found both group and host with same name: undercloud
compute-0 | FAILED | rc=1 >>
non-zero return code
compute-1 | FAILED | rc=1 >>
non-zero return code
controller-1 | SUCCESS | rc=0 >>
LISTEN 0 10 172.17.1.11:6642 *:*
controller-0 | SUCCESS | rc=0 >>
LISTEN 0 10 172.17.1.11:6642 *:*
controller-2 | SUCCESS | rc=0 >>
LISTEN 0 10 172.17.1.11:6642 *:*
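The inspect output above shows one specific failure mode: the healthcheck log says the script itself is absent, which is different from the script running and failing. A minimal sketch of telling those cases apart from a single "Output" string follows; the function name and the three category labels are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: classify one healthcheck "Output" string as reported
# by `docker inspect`. The category names are assumptions for this sketch.
classify_healthcheck_output() {
    case "$1" in
        *"No such file or directory"*) echo "script-missing" ;;
        "")                            echo "no-output" ;;
        *)                             echo "check-failed" ;;
    esac
}

# Sample string copied from the inspect output in this comment.
classify_healthcheck_output "/bin/sh: /openstack/healthcheck: No such file or directory"
```

The point of separating "script-missing" from "check-failed" is that the former points at the image or mount, not at the service behind port 6642.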
Looks like ovn-controller is not running/restarting on compute nodes, as per "ss -atlp|grep 6642" rc=1 there.
Need to check in the sosreports what the ovn-controller logs look like (/var/log/containers/openvswitch/ovn-controller.log*).

(In reply to Daniel Alvarez Sanchez from comment #4)
Yeah, that would be good. Also, apparently this /openstack/healthcheck script should have been added to the image by TripleO [0][1]. I do not know why it's missing there.
[0] https://github.com/openstack/tripleo-common/blob/fe8dd5c9076ba7ada444da361b4e5533ace90435/container-images/tripleo_kolla_template_overrides.j2#L722-L726
[1] https://github.com/openstack/tripleo-common/blob/fe8dd5c9076ba7ada444da361b4e5533ace90435/healthcheck/ovn-controller

(In reply to Lucas Alvares Gomes from comment #5)
Also please note that the added healthcheck script is not always missing: first it was there, then we restarted the container and it was gone.
If we do some more restarts, the ovn_controller container may load that mount/script like it should, so I'm just reminding that this breaks "sometimes" :)

(In reply to pkomarov from comment #6)
Interesting, thanks for that pointer. Btw, I'm changing the component of this bug to python-tripleo-common because that's where the healthcheck script is injected into the image.

This has been fixed by https://review.opendev.org/#/c/568265/5; checked on OSP16, we have the healthchecks in ovn_controller. I'm closing this BZ, but feel free to reopen in case there is still an issue.
Description of problem:
After pcs resource restart ovn-dbs-bundle, all Neutron agents are in Dead state

Version-Release number of selected component (if applicable):
osp14 with ovs setup

How reproducible:
always

Steps to Reproduce:
1. deploy osp14 with ovs
2. from any controller do: pcs resource restart ovn-dbs-bundle
3. notice that:

(undercloud) [stack@undercloud-0 ~]$ openstack network agent list
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host                                 | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
| 0ee0aa9a-3577-48e2-916e-418a802cf873 | DHCP agent         | undercloud-0.localdomain             | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 15be8f7e-5d33-4db5-9daa-0723f1a169c8 | Baremetal Node     | 155163a6-c299-4fea-a264-34391aa8b31e | None              | :-)   | UP    | ironic-neutron-agent      |
| 3dae012c-5029-428d-84a8-fba517684f07 | Baremetal Node     | 59315382-0d95-4014-93b7-76b96e1d29c1 | None              | :-)   | UP    | ironic-neutron-agent      |
| 4662b6fd-6b6d-4858-8a90-0f5df539f08d | Baremetal Node     | 0d135f21-7b08-4566-8e98-b79d8a12a5ff | None              | :-)   | UP    | ironic-neutron-agent      |
| 8b88e238-51ed-413a-ac03-b4825d68f512 | L3 agent           | undercloud-0.localdomain             | nova              | :-)   | UP    | neutron-l3-agent          |
| 8ee6f1be-67df-47df-adbc-40231fc5f59e | Baremetal Node     | fddc0164-3bb7-47fa-9707-c53721709538 | None              | :-)   | UP    | ironic-neutron-agent      |
| a6b6eab5-bf4c-4362-b552-e161dcdc9181 | Baremetal Node     | d701304a-ecc6-48e3-b196-1a0966d71735 | None              | :-)   | UP    | ironic-neutron-agent      |
| ed2b41a9-8716-4b9c-810c-a74d01f3625f | Open vSwitch agent | undercloud-0.localdomain             | None              | :-)   | UP    | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------------------------+-------------------+-------+-------+---------------------------+
(undercloud) [stack@undercloud-0 ~]$ while true ;do sleep 5s;openstack network agent l'ist
> ^C
(undercloud) [stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ while true ;do date;sleep 5s;openstack network agent list;date;done|& agent_list.log
-bash: agent_list.log: command not found
^C
(overcloud) [stack@undercloud-0 ~]$ while true ;do date;sleep 5s;openstack network agent list;date;done|& tee agent_list.log
Thu Feb 7 08:16:25 EST 2019
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type                   | Host                     | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| 8204aa87-ba43-48ff-abea-73ab36bcbd58 | OVN Controller Gateway agent | controller-0.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| abc299b6-208a-4fb9-be1f-0c5854c0d91e | OVN Metadata agent           | compute-1.localdomain    | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 50fa7202-fb94-4e83-b4b0-c6eca05a232b | OVN Controller agent         | compute-1.localdomain    | n/a               | XXX   | UP    | ovn-controller                |
| 41540df6-1146-41fb-b971-3886e1bb4622 | OVN Controller Gateway agent | controller-1.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| c7400438-36b0-480e-a84c-dd5e1630d007 | OVN Controller agent         | compute-0.localdomain    | n/a               | XXX   | UP    | ovn-controller                |
| df8de298-756e-4eec-8f18-ff6f30b25862 | OVN Metadata agent           | compute-0.localdomain    | n/a               | XXX   | UP    | networking-ovn-metadata-agent |
| 75094519-5762-4b44-a8cb-34dcb24374c8 | OVN Controller Gateway agent | controller-2.localdomain | n/a               | XXX   | UP    | ovn-controller                |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
Thu Feb 7 08:16:32 EST 2019
Thu Feb 7 08:16:32 EST 2019
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| ID                                   | Agent Type                   | Host                     | Availability Zone | Alive | State | Binary                        |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
| 8204aa87-ba43-48ff-abea-73ab36bcbd58 | OVN Controller Gateway agent | controller-0.localdomain | n/a               | :-)   | UP    | ovn-controller                |
| abc299b6-208a-4fb9-be1f-0c5854c0d91e | OVN Metadata agent           | compute-1.localdomain    | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 50fa7202-fb94-4e83-b4b0-c6eca05a232b | OVN Controller agent         | compute-1.localdomain    | n/a               | :-)   | UP    | ovn-controller                |
| 41540df6-1146-41fb-b971-3886e1bb4622 | OVN Controller Gateway agent | controller-1.localdomain | n/a               | XXX   | UP    | ovn-controller                |
| c7400438-36b0-480e-a84c-dd5e1630d007 | OVN Controller agent         | compute-0.localdomain    | n/a               | :-)   | UP    | ovn-controller                |
| df8de298-756e-4eec-8f18-ff6f30b25862 | OVN Metadata agent           | compute-0.localdomain    | n/a               | :-)   | UP    | networking-ovn-metadata-agent |
| 75094519-5762-4b44-a8cb-34dcb24374c8 | OVN Controller Gateway agent | controller-2.localdomain | n/a               | :-)   | UP    | ovn-controller                |
+--------------------------------------+------------------------------+--------------------------+-------------------+-------+-------+-------------------------------+
Thu Feb 7 08:16:41 EST 2019

Actual results:

Expected results:

Additional info:
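When watching the agents flap as in the loop above, it can be easier to track a single number than to read the whole table each time. A minimal sketch follows, assuming the default table layout of `openstack network agent list` as pasted in this report; `count_dead_agents` is a hypothetical helper name, not part of any OpenStack tooling.

```shell
#!/bin/sh
# Hypothetical helper: count table rows whose "Alive" column is XXX.
# Assumes the default table layout: splitting on "|" puts Alive in field 6
# (field 1 is the empty text before the leading "|").
count_dead_agents() {
    awk -F'|' 'NF > 6 { alive = $6; gsub(/ /, "", alive); if (alive == "XXX") n++ }
               END { print n + 0 }'
}

# Demo on two rows copied from the report; on a live system one would pipe
# `openstack network agent list` into the function instead.
count_dead_agents <<'EOF'
| 41540df6-1146-41fb-b971-3886e1bb4622 | OVN Controller Gateway agent | controller-1.localdomain | n/a | XXX | UP | ovn-controller |
| c7400438-36b0-480e-a84c-dd5e1630d007 | OVN Controller agent | compute-0.localdomain | n/a | :-) | UP | ovn-controller |
EOF
```

Border lines and the header row are ignored because they either contain no "|" separators or do not carry XXX in the Alive field.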