Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1602350

Summary: [OCP on OSP via kuryr] Got invalid status code from CNI daemon
Product: Red Hat OpenStack
Reporter: Yadan Pei <yapei>
Component: openstack-octavia
Assignee: Carlos Goncalves <cgoncalves>
Status: CLOSED WORKSFORME
QA Contact: Alexander Stafeyev <astafeye>
Severity: high
Priority: high
Docs Contact:
Version: 14.0 (Rocky)
CC: aos-bugs, asegurap, bbennett, gcheresh, hongli, ihrachys, juriarte, lpeer, majopela, nyechiel, oblaut, tsedovic, yapei
Target Milestone: ---
Keywords: AutomationBlocker, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-25 14:02:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- Pod-kuryr-cni-ds.Container-install-cni-plugins
- Pod-kuryr-cni-ds.Container-sky-dns
- Pod-kuryr-cni-ds.Container-kuryr-cni
- Pod-kuryr-controller
- Octavia - api log
- Octavia - health-manager log
- Octavia - housekeeping log
- Octavia - worker log

Description Yadan Pei 2018-07-18 09:12:29 UTC
Description of problem:
The following error appears after running some tests:
NetworkPlugin cni failed to set up pod "<pod>" network: Got invalid status code from CNI daemon

Version-Release number of selected component (if applicable):
OSP13
openshift v3.10.18
python2-kuryr-kubernetes-0.4.3-1.el7ost.noarch
openstack-kuryr-kubernetes-controller-0.4.3-1.el7ost.noarch
python2-kuryr-lib-0.7.0-1.el7ost.noarch
openstack-kuryr-kubernetes-common-0.4.3-1.el7ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Run some web console cases and other cases

Actual results:
The following error appears:
  Warning  FailedCreatePodSandBox  1m  kubelet, master-2.openshift.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "c8f62e69133396592cb70af977612709edc5d53002726f6a8a551c1ac0092963" network for pod "webconsole-785689b664-8lpgk": NetworkPlugin cni failed to set up pod "webconsole-785689b664-8lpgk_openshift-web-console" network: Got invalid status code from CNI daemon.; Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 89, in run
    vif = self._add(params)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 147, in _add
    resp = self._make_request('addNetwork', params, httplib.ACCEPTED)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 181, in _make_request
    raise k_exc.CNIError('Got invalid status code from CNI daemon.')
CNIError: Got invalid status code from CNI daemon.
  Normal  SandboxChanged  1m (x4 over 11m)  kubelet, master-2.openshift.example.com  Pod sandbox changed, it will be killed and re-created.
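For context, the check that produces this error can be sketched as follows. This is a simplified illustration of the logic in kuryr_kubernetes/cni/api.py shown in the traceback, not the actual code; the `CNIError` class and helper name here are stand-ins, and Python 3's `http.client` replaces the Python 2 `httplib` used in the traceback:

```python
import http.client as httplib  # py3 equivalent of the py2 httplib in the traceback


class CNIError(Exception):
    """Simplified stand-in for kuryr's CNIError exception."""


def check_daemon_response(status_code, expected=httplib.ACCEPTED):
    # Sketch of the check in _make_request: any status other than the
    # expected one (202 ACCEPTED for addNetwork) raises CNIError, which
    # kubelet then reports as "Got invalid status code from CNI daemon."
    if status_code != expected:
        raise CNIError('Got invalid status code from CNI daemon.')
```

So the message does not identify the underlying failure; it only means the CNI daemon replied with an unexpected HTTP status, and the real cause has to be found in the daemon's own logs.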


Expected results:
Tests should run without errors.

Additional info:

Comment 1 Yadan Pei 2018-07-18 09:27:58 UTC
This issue blocks new pod creation and thus most testing is blocked

Comment 2 Casey Callendrello 2018-07-18 13:43:10 UTC
Can you get the logs from the kuryr daemon itself?

Comment 3 Ben Bennett 2018-07-18 17:59:17 UTC
Toni, FYI.

Comment 4 Yadan Pei 2018-07-19 02:23:43 UTC
Created attachment 1459840 [details]
Pod-kuryr-cni-ds.Container-install-cni-plugins

Comment 5 Yadan Pei 2018-07-19 02:24:16 UTC
Created attachment 1459841 [details]
Pod-kuryr-cni-ds.Container-sky-dns

Comment 6 Yadan Pei 2018-07-19 02:24:44 UTC
Created attachment 1459842 [details]
Pod-kuryr-cni-ds.Container-kuryr-cni

Comment 7 Yadan Pei 2018-07-19 02:25:14 UTC
Created attachment 1459843 [details]
Pod-kuryr-controller

Comment 8 Jon Uriarte 2018-07-19 14:18:44 UTC
Created attachment 1460207 [details]
Octavia - api log

Comment 9 Jon Uriarte 2018-07-19 14:19:28 UTC
Created attachment 1460215 [details]
Octavia - health-manager log

Comment 10 Jon Uriarte 2018-07-19 14:19:54 UTC
Created attachment 1460221 [details]
Octavia - housekeeping log

Comment 11 Jon Uriarte 2018-07-19 14:20:23 UTC
Created attachment 1460228 [details]
Octavia - worker log

Comment 12 Jon Uriarte 2018-07-19 14:23:58 UTC
Added more info:

The issue is that the kuryr-controller, running in a container in OpenShift, has lost connectivity to the Kubernetes API, which sits behind a load balancer, so new pods cannot be created.

Logs from kuryr-controller:
2018-07-18 11:53:04.490 1 ERROR health-daemon ConnectionError: HTTPSConnectionPool(host='172.30.0.1', port=8443): Max retries exceeded with url: /healthz (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f6ace9e7850>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))
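A quick way to confirm whether the controller can reach the API VIP at all can be sketched like this. This is a hypothetical helper, not part of kuryr; the real health check issues an HTTPS request to /healthz, as the log above shows, but the EHOSTUNREACH in that log fails already at the TCP layer:

```python
import socket


def api_reachable(vip, port, timeout=5.0):
    """Return True if a TCP connection to the Kubernetes service VIP
    succeeds. EHOSTUNREACH (errno 113, seen in the log above) and any
    other socket failure surface as OSError and yield False."""
    try:
        with socket.create_connection((vip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running something like `api_reachable("172.30.0.1", 8443)` from inside the kuryr-controller container would distinguish a dead load balancer VIP from a problem higher up the stack.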


The service (in Kubernetes) and the load balancer (in OpenStack) do exist:

[openshift@master-0 ~]$ oc get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                 AGE
kubernetes   ClusterIP   172.30.0.1   <none>        443/TCP,53/UDP,53/TCP   1d


(openstack) (overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id                                   | name                                           | project_id                       | vip_address    | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| f477721b-6dc4-4e13-a8fa-9b5198afaa21 | default/docker-registry                        | 293061856e1a4cffbae3ed099cf4f619 | 172.30.217.220 | ACTIVE              | octavia  |
| 8d50f5b0-7cad-4377-9909-83504b7616e7 | openshift-ansible-openshift.example.com-api-lb | 293061856e1a4cffbae3ed099cf4f619 | 172.30.0.1     | ERROR               | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+

(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+
| id                                   | loadbalancer_id                      | status    | role       | lb_network_ip | ha_ip          |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+
| 0aa81b91-7623-46e2-a249-618fe3333586 | 8d50f5b0-7cad-4377-9909-83504b7616e7 | ERROR     | STANDALONE | 172.24.0.4    | 172.30.0.1     |
| df3492d1-31f0-41e6-83c1-856138be1100 | 7c7b43cb-b02b-4c11-8c7c-251fef99a399 | ALLOCATED | STANDALONE | 172.24.0.3    | 172.30.167.165 |
| fb858a05-bb54-494f-9cff-6eefff49b29e | cbc9c320-2d63-432e-835c-a60afddeff22 | ALLOCATED | STANDALONE | 172.24.0.30   | 172.30.198.62  |
| fc8df28b-ae56-461b-924c-82e733ce88d2 | 855f438d-5937-4d6a-a3ec-0acc1d197e96 | ERROR     | STANDALONE | 172.24.0.7    | 172.30.136.198 |
| feccbb36-1409-4a39-a792-aa36de4f1e90 | d0e9f18e-b809-4c55-ae0e-29914bbe88d4 | ALLOCATED | STANDALONE | 172.24.0.23   | 172.30.213.124 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+


But the amphora VM does not exist:

(overcloud) [heat-admin@compute-1 ~]$ openstack server list --all
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+
| ID                                   | Name                                         | Status | Networks                                                                                   | Image                           | Flavor    |
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+
| 912ce712-8215-40de-a90f-ade97a5a20fd | amphora-df3492d1-31f0-41e6-83c1-856138be1100 | ACTIVE | lb-mgmt-net=172.24.0.3; openshift-ansible-openshift.example.com-service-net=172.30.128.44  | octavia-amphora-13.0-20180621.1 |           |
| 0114f7ee-04ff-464e-9694-c0b8c0b10694 | amphora-fb858a05-bb54-494f-9cff-6eefff49b29e | ACTIVE | lb-mgmt-net=172.24.0.30; openshift-ansible-openshift.example.com-service-net=172.30.128.29 | octavia-amphora-13.0-20180621.1 |           |
| 93fbcef4-f074-4d82-870d-fea4867d1f63 | amphora-feccbb36-1409-4a39-a792-aa36de4f1e90 | ACTIVE | lb-mgmt-net=172.24.0.23; openshift-ansible-openshift.example.com-service-net=172.30.128.22 | octavia-amphora-13.0-20180621.1 |           |
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+


The amphora VM seems to have been deleted for some reason.

The Octavia logs are attached.

Comment 13 Carlos Goncalves 2018-07-19 15:49:21 UTC
Timeline for load balancer 8d50f5b0-7cad-4377-9909-83504b7616e7:

[2018-07-17 11:57:37] Load balancer 8d50f5b0-7cad-4377-9909-83504b7616e7 was created with Amphora ID 0aa81b91-7623-46e2-a249-618fe3333586, and transitioned to ACTIVE.

(No events logged related to this LB/amphora between this time and next)

[2018-07-18 08:26:05] First warning message for health of amphora 0aa81b91-7623-46e2-a249-618fe3333586: "Amphora 0aa81b91-7623-46e2-a249-618fe3333586 health message was processed too slowly: 11.3263750076s! The system may be overloaded or otherwise malfunctioning. This heartbeat has been ignored and no update was made to the amphora health entry. THIS IS NOT GOOD." This message appears 3 more times until the Octavia Health Manager considers the amphora stalled.

(Bunch of similar warnings for other amphorae (from other load balancers) are also logged)

[2018-07-18 08:26:51] Amphora 0aa81b91-7623-46e2-a249-618fe3333586 is stalled and the Health Manager triggers an amphora failover.

2018-07-18 08:26:51.260 24 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 0aa81b91-7623-46e2-a249-618fe3333586
2018-07-18 08:26:51.267 24 INFO octavia.controller.healthmanager.health_manager [-] Waiting for 1 failovers to finish


Many or all (>= 10) amphorae are also stalled and trigger failovers. All failovers then fail due to an exception: "AddrFormatError: failed to detect a valid IP address from None". Nevertheless, the failover issues are not the root cause.

What happened to the system a few minutes before 2018-07-18 08:26:05?

The Octavia Health Manager started to lag on processing heartbeats sent by the amphora agent within a window of 10 seconds, so much that it invalidated the heartbeat. Further heartbeats were also dropped for the same reason. The HM eventually gave up (heartbeat_timeout=60 seconds) and considered the amphora stalled.
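The two decisions described above can be sketched as follows. These are hypothetical helper names for illustration, not Octavia's actual code, which lives in the health manager and its database queries:

```python
def heartbeat_ignored(processing_seconds, window=10.0):
    """A heartbeat that takes longer than the processing window to
    handle is dropped with the 'processed too slowly' warning quoted
    above, so no update is made to the amphora health entry."""
    return processing_seconds > window


def amphora_stale(last_heartbeat, now, heartbeat_timeout=60.0):
    """Once no valid heartbeat has been recorded for heartbeat_timeout
    seconds, the Health Manager considers the amphora stalled and
    triggers a failover."""
    return now - last_heartbeat > heartbeat_timeout
```

The interaction of the two is the failure mode here: an overloaded HM keeps dropping otherwise healthy heartbeats, the health entries go stale, and perfectly healthy amphorae get failed over.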

Patch [1] (released in Queens) tried to help in scenarios like this where the HM becomes overloaded, but it can only help/short-circuit to some extent.

How reproducible is this? If you can reproduce, please monitor the load of the Octavia HM service.


[1] https://review.openstack.org/#/c/531006/

Comment 14 Yadan Pei 2018-07-20 02:30:42 UTC
Hi Carlos, our testing environment is already in a bad state, so I need a good/fresh one to try to reproduce.


Hi Jon, are we able to recover the environment? Or re-install another one so I can try to reproduce the error.

Comment 15 Jon Uriarte 2018-07-20 10:10:00 UTC
I've re-deployed the OpenShift cluster, as we do not have more environments.

Yapei, if you manage to reproduce the issue, let us know so we can collect all the information from the OpenStack components.

I left a script monitoring the Octavia health manager processes load.

Thanks Carlos and Yapei.

Comment 16 Jon Uriarte 2018-07-20 12:55:27 UTC
Adding more info on Octavia, gathered just after the fresh deployment and some basic tests:

[openshift@master-0 ~]$ oc get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                 AGE
kubernetes   ClusterIP   172.30.0.1   <none>        443/TCP,53/UDP,53/TCP   43m


(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer list -c id -c name -c vip_address -c provisioning_status
+--------------------------------------+------------------------------------------------+---------------+---------------------+
| id                                   | name                                           | vip_address   | provisioning_status |
+--------------------------------------+------------------------------------------------+---------------+---------------------+
| b944b377-366c-427c-bbb0-28f5ad3d2f01 | openshift-ansible-openshift.example.com-api-lb | 172.30.0.1    | ACTIVE              |  <----
| c66322c3-9968-4a72-9836-e4a16cc81d36 | openshift-web-console/webconsole               | 172.30.38.21  | ACTIVE              |
| c29a0981-599a-4670-9537-454c7ddfaf80 | default/docker-registry                        | 172.30.39.225 | ACTIVE              |
| d73beda9-cab2-4d0c-93f3-3b74e0f6a134 | default/router                                 | 172.30.54.12  | ACTIVE              |
| 485ca21c-ecff-4827-8e48-9e48cdfe66c8 | default/registry-console                       | 172.30.234.4  | ACTIVE              |
+--------------------------------------+------------------------------------------------+---------------+---------------------+


(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer amphora list -c id -c loadbalancer_id -c status -c ha_ip
+--------------------------------------+--------------------------------------+-----------+---------------+
| id                                   | loadbalancer_id                      | status    | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+---------------+
| 07ff7cb7-6da6-49db-b78b-8e040c99931d | b944b377-366c-427c-bbb0-28f5ad3d2f01 | ALLOCATED | 172.30.0.1    |  <----
| 0da35c8c-a184-41fa-933b-f76520d17f81 | c66322c3-9968-4a72-9836-e4a16cc81d36 | ALLOCATED | 172.30.38.21  |
| 4ba72210-6451-4c1d-8afb-96a34ca92e7e | d73beda9-cab2-4d0c-93f3-3b74e0f6a134 | ALLOCATED | 172.30.54.12  |
| 90047c37-b752-4c6b-aa3e-271a1f5bc618 | c29a0981-599a-4670-9537-454c7ddfaf80 | ALLOCATED | 172.30.39.225 |
| 97ec8f82-e404-42cd-a2dd-f6ae6ee12944 | 485ca21c-ecff-4827-8e48-9e48cdfe66c8 | ALLOCATED | 172.30.234.4  |
+--------------------------------------+--------------------------------------+-----------+---------------+


(overcloud) [heat-admin@compute-1 ~]$ openstack server list --all -c ID -c Name -c Status
+--------------------------------------+----------------------------------------------+--------+
| ID                                   | Name                                         | Status |
+--------------------------------------+----------------------------------------------+--------+
| 8bb7e6a1-5c75-4ad4-b558-e1af9088e233 | amphora-97ec8f82-e404-42cd-a2dd-f6ae6ee12944 | ACTIVE |
| 93db8508-a996-4518-a9d3-380747716582 | amphora-4ba72210-6451-4c1d-8afb-96a34ca92e7e | ACTIVE |
| d1edfb0a-d278-4cee-bfce-99957630238c | amphora-90047c37-b752-4c6b-aa3e-271a1f5bc618 | ACTIVE |
| 2b8c3ff9-6a8a-4b78-98b8-d0a8540c1472 | amphora-0da35c8c-a184-41fa-933b-f76520d17f81 | ACTIVE |
| a50cc916-bd44-412e-98a4-3d77d7ea4d6e | master-2.openshift.example.com               | ACTIVE |
| b7b4436a-93ee-45ca-9a94-3a1a228021b7 | master-1.openshift.example.com               | ACTIVE |
| 304cedcb-99fb-49b7-95a0-0573b58dd6f4 | master-0.openshift.example.com               | ACTIVE |
| f6697a46-44ac-4382-8938-a332267d1e91 | app-node-1.openshift.example.com             | ACTIVE |
| b46dfd78-c1c0-4ffd-ba72-1abeb3474091 | infra-node-0.openshift.example.com           | ACTIVE |
| ea552074-4519-4e7e-a35b-a171e5acef5c | infra-node-1.openshift.example.com           | ACTIVE |
| 415b0e5e-ca2a-4280-b687-89f4a82a8a13 | app-node-0.openshift.example.com             | ACTIVE |
| c7417e23-722b-4f6d-be18-95e8804280ae | amphora-07ff7cb7-6da6-49db-b78b-8e040c99931d | ACTIVE |  <----
| 36c22a99-c0b0-4a8f-9345-f10202e1fef3 | openshift-dns                                | ACTIVE |
| bfea6809-37c9-4b3b-b631-c28de9f2ba62 | bastion                                      | ACTIVE |
+--------------------------------------+----------------------------------------------+--------+

Octavia logs related to Kubernetes API load balancer:
[heat-admin@controller-0 ~]$ grep -r 'b944b377\|07ff7cb7\|c7417e23' /var/log/containers/octavia/

/var/log/containers/octavia/worker.log:
2018-07-20 09:26:07.234 23 INFO octavia.controller.queue.endpoint [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Creating load balancer 'b944b377-366c-427c-bbb0-28f5ad3d2f01'...

/var/log/containers/octavia/worker.log:
2018-07-20 09:26:07.358 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Created Amphora in DB with id 07ff7cb7-6da6-49db-b78b-8e040c99931d

/var/log/containers/octavia/worker.log:
2018-07-20 09:26:21.567 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Mark ALLOCATED in DB for amphora: 07ff7cb7-6da6-49db-b78b-8e040c99931d with compute id c7417e23-722b-4f6d-be18-95e8804280ae for load balancer: b944b377-366c-427c-bbb0-28f5ad3d2f01

/var/log/containers/octavia/worker.log:
2018-07-20 09:26:36.561 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Mark ACTIVE in DB for load balancer id: b944b377-366c-427c-bbb0-28f5ad3d2f01

/var/log/containers/octavia/api.log:
2018-07-20 09:26:07.204 1 INFO octavia.api.v2.controllers.load_balancer [req-f58487df-09aa-451f-a0f7-f0420766ea30 - 293061856e1a4cffbae3ed099cf4f619 - default default] Sending created Load Balancer b944b377-366c-427c-bbb0-28f5ad3d2f01 to the handler

/var/log/containers/octavia/api.log:
2018-07-20 09:26:38.902 1 INFO octavia.api.v2.controllers.listener [req-fb5d47b5-4b1f-4759-aef7-e0c05ae02f14 - 293061856e1a4cffbae3ed099cf4f619 - default default] Load Balancer b944b377-366c-427c-bbb0-28f5ad3d2f01 is immutable.

Everything seems to be ok for the moment.

I noticed that the Octavia health manager process sometimes goes above 100% CPU consumption, but I guess that is OK as the controller node has 48 cores.

Comment 17 Carlos Goncalves 2018-07-25 14:02:25 UTC
Please reopen if you can reproduce or Octavia fails again for similar reasons.