Bug 1602350
| Summary: | [OCP on OSP via kuryr] Got invalid status code from CNI daemon | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Yadan Pei <yapei> |
| Component: | openstack-octavia | Assignee: | Carlos Goncalves <cgoncalves> |
| Status: | CLOSED WORKSFORME | QA Contact: | Alexander Stafeyev <astafeye> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 14.0 (Rocky) | CC: | aos-bugs, asegurap, bbennett, gcheresh, hongli, ihrachys, juriarte, lpeer, majopela, nyechiel, oblaut, tsedovic, yapei |
| Target Milestone: | --- | Keywords: | AutomationBlocker, TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-25 14:02:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
This issue blocks new pod creation, and thus most testing is blocked.

Can you get the logs from the kuryr daemon itself?

Toni, FYI.

Created attachment 1459840 [details]
Pod-kuryr-cni-ds.Container-install-cni-plugins
Created attachment 1459841 [details]
Pod-kuryr-cni-ds.Container-sky-dns
Created attachment 1459842 [details]
Pod-kuryr-cni-ds.Container-kuryr-cni
Created attachment 1459843 [details]
Pod-kuryr-controller
Created attachment 1460207 [details]
Octavia - api log
Created attachment 1460215 [details]
Octavia - health-manager log
Created attachment 1460221 [details]
Octavia - housekeeping log
Created attachment 1460228 [details]
Octavia - worker log
Added more info:
The issue is that the kuryr controller, running in a container in Openshift, has lost connectivity with the kubernetes API, which is behind a load balancer, so new pods cannot be created.
Logs from kuryr-controller:
2018-07-18 11:53:04.490 1 ERROR health-daemon ConnectionError: HTTPSConnectionPool(host='172.30.0.1', port=8443): Max retries exceeded with url: /healthz (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f6ace9e7850>: Failed to establish a new connection: [Errno 113] EHOSTUNREACH',))
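Errno 113 in the log above is EHOSTUNREACH: the kernel could not route traffic to the VIP, which is consistent with a load balancer whose backing amphora is gone. As a minimal, illustrative sketch (the function name and categories are hypothetical, not part of kuryr), such low-level connection errors can be mapped to a coarse diagnosis like this:

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a low-level connection error to a coarse diagnosis (illustrative).

    EHOSTUNREACH (errno 113 on Linux) is what the kuryr health check hit:
    the VIP address exists but nothing backs it anymore.
    """
    if isinstance(exc, socket.timeout):
        return 'timeout'
    if isinstance(exc, OSError) and exc.errno == errno.EHOSTUNREACH:
        return 'host-unreachable'    # e.g. an LB VIP with no backing amphora
    if isinstance(exc, OSError) and exc.errno == errno.ECONNREFUSED:
        return 'connection-refused'  # host is up but nothing listens on the port
    return 'other'
```

This distinction matters when triaging: a refused connection points at a stopped service, while host-unreachable (as seen here) points at a missing or dead backend.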
The service (in kubernetes) and the load balancer (in Openstack) do exist:
[openshift@master-0 ~]$ oc get svc kubernetes
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 172.30.0.1 <none> 443/TCP,53/UDP,53/TCP 1d
(openstack) (overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer list
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| id | name | project_id | vip_address | provisioning_status | provider |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
| f477721b-6dc4-4e13-a8fa-9b5198afaa21 | default/docker-registry | 293061856e1a4cffbae3ed099cf4f619 | 172.30.217.220 | ACTIVE | octavia |
| 8d50f5b0-7cad-4377-9909-83504b7616e7 | openshift-ansible-openshift.example.com-api-lb | 293061856e1a4cffbae3ed099cf4f619 | 172.30.0.1     | ERROR               | octavia  |
+--------------------------------------+------------------------------------------------+----------------------------------+----------------+---------------------+----------+
(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer amphora list
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+
| id | loadbalancer_id | status | role | lb_network_ip | ha_ip |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+
| 0aa81b91-7623-46e2-a249-618fe3333586 | 8d50f5b0-7cad-4377-9909-83504b7616e7 | ERROR | STANDALONE | 172.24.0.4 | 172.30.0.1 |
| df3492d1-31f0-41e6-83c1-856138be1100 | 7c7b43cb-b02b-4c11-8c7c-251fef99a399 | ALLOCATED | STANDALONE | 172.24.0.3 | 172.30.167.165 |
| fb858a05-bb54-494f-9cff-6eefff49b29e | cbc9c320-2d63-432e-835c-a60afddeff22 | ALLOCATED | STANDALONE | 172.24.0.30 | 172.30.198.62 |
| fc8df28b-ae56-461b-924c-82e733ce88d2 | 855f438d-5937-4d6a-a3ec-0acc1d197e96 | ERROR | STANDALONE | 172.24.0.7 | 172.30.136.198 |
| feccbb36-1409-4a39-a792-aa36de4f1e90 | d0e9f18e-b809-4c55-ae0e-29914bbe88d4 | ALLOCATED | STANDALONE | 172.24.0.23 | 172.30.213.124 |
+--------------------------------------+--------------------------------------+-----------+------------+---------------+----------------+
But the amphora VM does not exist:
(overcloud) [heat-admin@compute-1 ~]$ openstack server list --all
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+
| 912ce712-8215-40de-a90f-ade97a5a20fd | amphora-df3492d1-31f0-41e6-83c1-856138be1100 | ACTIVE | lb-mgmt-net=172.24.0.3; openshift-ansible-openshift.example.com-service-net=172.30.128.44 | octavia-amphora-13.0-20180621.1 | |
| 0114f7ee-04ff-464e-9694-c0b8c0b10694 | amphora-fb858a05-bb54-494f-9cff-6eefff49b29e | ACTIVE | lb-mgmt-net=172.24.0.30; openshift-ansible-openshift.example.com-service-net=172.30.128.29 | octavia-amphora-13.0-20180621.1 | |
| 93fbcef4-f074-4d82-870d-fea4867d1f63 | amphora-feccbb36-1409-4a39-a792-aa36de4f1e90 | ACTIVE | lb-mgmt-net=172.24.0.23; openshift-ansible-openshift.example.com-service-net=172.30.128.22 | octavia-amphora-13.0-20180621.1 | |
+--------------------------------------+----------------------------------------------+--------+--------------------------------------------------------------------------------------------+---------------------------------+-----------+
The amphora VM seems to have been deleted for some reason.
Find attached Octavia logs.
Timeline for load balancer 8d50f5b0-7cad-4377-9909-83504b7616e7:

[2018-07-17 11:57:37] Load balancer 8d50f5b0-7cad-4377-9909-83504b7616e7 was created with Amphora ID 0aa81b91-7623-46e2-a249-618fe3333586, and transitioned to ACTIVE. (No events related to this LB/amphora were logged between this time and the next.)

[2018-07-18 08:26:05] First warning message for the health of amphora 0aa81b91-7623-46e2-a249-618fe3333586: "Amphora 0aa81b91-7623-46e2-a249-618fe3333586 health message was processed too slowly: 11.3263750076s! The system may be overloaded or otherwise malfunctioning. This heartbeat has been ignored and no update was made to the amphora health entry. THIS IS NOT GOOD." This message appears 3 more times until the Octavia Health Manager considers the amphora stalled. (A bunch of similar warnings for other amphorae, from other load balancers, are also logged.)

[2018-07-18 08:26:51] Amphora 0aa81b91-7623-46e2-a249-618fe3333586 is stalled and the Health Manager triggers an amphora failover.

2018-07-18 08:26:51.260 24 INFO octavia.controller.healthmanager.health_manager [-] Stale amphora's id is: 0aa81b91-7623-46e2-a249-618fe3333586
2018-07-18 08:26:51.267 24 INFO octavia.controller.healthmanager.health_manager [-] Waiting for 1 failovers to finish

Many/all (>= 10) amphorae are also stalled and trigger failovers. All failovers then fail due to an exception: "AddrFormatError: failed to detect a valid IP address from None". Nevertheless, the failover issues are not the root cause.

What happened to the system a few minutes before 2018-07-17 11:57:37? The Octavia Health Manager started to lag on processing heartbeats sent by the amphora agent within its 10-second window, so much that it invalidated the heartbeat. Further heartbeats were also dropped for the same reason. The HM eventually gave up (heartbeat_timeout=60 seconds) and considered the amphora stalled.
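The stall detection described in the timeline can be sketched roughly as follows. This is an illustrative simplification, not Octavia's actual code: the function names and the 10-second processing window are assumptions; only heartbeat_timeout=60 corresponds to the Octavia option mentioned in the analysis.

```python
# Rough sketch of the Health Manager behavior described above.
# Names and PROCESSING_WINDOW are illustrative; HEARTBEAT_TIMEOUT
# mirrors Octavia's heartbeat_timeout option cited in this bug.

HEARTBEAT_TIMEOUT = 60.0   # seconds before an amphora is considered stalled
PROCESSING_WINDOW = 10.0   # hypothetical "processed too slowly" threshold


def heartbeat_ignored(received_at, processed_at, window=PROCESSING_WINDOW):
    """An overloaded HM discards heartbeats it could not process in time,
    making no update to the amphora health entry."""
    return (processed_at - received_at) > window


def amphora_stalled(last_valid_update, now, timeout=HEARTBEAT_TIMEOUT):
    """With every heartbeat discarded, the last valid update ages past the
    timeout and the amphora is declared stalled, triggering a failover."""
    return (now - last_valid_update) > timeout
```

In the timeline above, heartbeats lagged ~11.3 s and were ignored, so the last valid health update aged past the 60-second timeout and the failover fired even though the amphora itself was still sending heartbeats.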
Patch [1] (released in Queens) tried to help in scenarios like this where the HM becomes overloaded, but it can only help/short-circuit to some extent. How reproducible is this? If you can reproduce it, please monitor the load of the Octavia HM service.

[1] https://review.openstack.org/#/c/531006/

Hi Carlos, our testing environment is already in a bad state, so I need a good/fresh one to try to reproduce.

Hi Jon, are we able to recover the environment? Or re-install another one so I could try to reproduce the error.

I've re-deployed the Openshift cluster, as we do not have more environments. Yapei, if you manage to reproduce the issue, let us know so we can take all the information from the Openstack components. I left a script monitoring the load of the Octavia health manager processes.

Thanks Carlos and Yapei.

Adding more info related to Octavia just after the new fresh deployment and some basic tests:

[openshift@master-0 ~]$ oc get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                 AGE
kubernetes   ClusterIP   172.30.0.1   <none>        443/TCP,53/UDP,53/TCP   43m

(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer list -c id -c name -c vip_address -c provisioning_status
+--------------------------------------+------------------------------------------------+---------------+---------------------+
| id                                   | name                                           | vip_address   | provisioning_status |
+--------------------------------------+------------------------------------------------+---------------+---------------------+
| b944b377-366c-427c-bbb0-28f5ad3d2f01 | openshift-ansible-openshift.example.com-api-lb | 172.30.0.1    | ACTIVE              | <----
| c66322c3-9968-4a72-9836-e4a16cc81d36 | openshift-web-console/webconsole               | 172.30.38.21  | ACTIVE              |
| c29a0981-599a-4670-9537-454c7ddfaf80 | default/docker-registry                        | 172.30.39.225 | ACTIVE              |
| d73beda9-cab2-4d0c-93f3-3b74e0f6a134 | default/router                                 | 172.30.54.12  | ACTIVE              |
| 485ca21c-ecff-4827-8e48-9e48cdfe66c8 | default/registry-console                       | 172.30.234.4  | ACTIVE              |
+--------------------------------------+------------------------------------------------+---------------+---------------------+

(overcloud) [heat-admin@compute-1 ~]$ openstack loadbalancer amphora list -c id -c loadbalancer_id -c status -c ha_ip
+--------------------------------------+--------------------------------------+-----------+---------------+
| id                                   | loadbalancer_id                      | status    | ha_ip         |
+--------------------------------------+--------------------------------------+-----------+---------------+
| 07ff7cb7-6da6-49db-b78b-8e040c99931d | b944b377-366c-427c-bbb0-28f5ad3d2f01 | ALLOCATED | 172.30.0.1    | <----
| 0da35c8c-a184-41fa-933b-f76520d17f81 | c66322c3-9968-4a72-9836-e4a16cc81d36 | ALLOCATED | 172.30.38.21  |
| 4ba72210-6451-4c1d-8afb-96a34ca92e7e | d73beda9-cab2-4d0c-93f3-3b74e0f6a134 | ALLOCATED | 172.30.54.12  |
| 90047c37-b752-4c6b-aa3e-271a1f5bc618 | c29a0981-599a-4670-9537-454c7ddfaf80 | ALLOCATED | 172.30.39.225 |
| 97ec8f82-e404-42cd-a2dd-f6ae6ee12944 | 485ca21c-ecff-4827-8e48-9e48cdfe66c8 | ALLOCATED | 172.30.234.4  |
+--------------------------------------+--------------------------------------+-----------+---------------+

(overcloud) [heat-admin@compute-1 ~]$ openstack server list --all -c ID -c Name -c Status
+--------------------------------------+----------------------------------------------+--------+
| ID                                   | Name                                         | Status |
+--------------------------------------+----------------------------------------------+--------+
| 8bb7e6a1-5c75-4ad4-b558-e1af9088e233 | amphora-97ec8f82-e404-42cd-a2dd-f6ae6ee12944 | ACTIVE |
| 93db8508-a996-4518-a9d3-380747716582 | amphora-4ba72210-6451-4c1d-8afb-96a34ca92e7e | ACTIVE |
| d1edfb0a-d278-4cee-bfce-99957630238c | amphora-90047c37-b752-4c6b-aa3e-271a1f5bc618 | ACTIVE |
| 2b8c3ff9-6a8a-4b78-98b8-d0a8540c1472 | amphora-0da35c8c-a184-41fa-933b-f76520d17f81 | ACTIVE |
| a50cc916-bd44-412e-98a4-3d77d7ea4d6e | master-2.openshift.example.com               | ACTIVE |
| b7b4436a-93ee-45ca-9a94-3a1a228021b7 | master-1.openshift.example.com               | ACTIVE |
| 304cedcb-99fb-49b7-95a0-0573b58dd6f4 | master-0.openshift.example.com               | ACTIVE |
| f6697a46-44ac-4382-8938-a332267d1e91 | app-node-1.openshift.example.com             | ACTIVE |
| b46dfd78-c1c0-4ffd-ba72-1abeb3474091 | infra-node-0.openshift.example.com           | ACTIVE |
| ea552074-4519-4e7e-a35b-a171e5acef5c | infra-node-1.openshift.example.com           | ACTIVE |
| 415b0e5e-ca2a-4280-b687-89f4a82a8a13 | app-node-0.openshift.example.com             | ACTIVE |
| c7417e23-722b-4f6d-be18-95e8804280ae | amphora-07ff7cb7-6da6-49db-b78b-8e040c99931d | ACTIVE | <----
| 36c22a99-c0b0-4a8f-9345-f10202e1fef3 | openshift-dns                                | ACTIVE |
| bfea6809-37c9-4b3b-b631-c28de9f2ba62 | bastion                                      | ACTIVE |
+--------------------------------------+----------------------------------------------+--------+

Octavia logs related to the Kubernetes API load balancer:

[heat-admin@controller-0 ~]$ grep -r 'b944b377\|07ff7cb7\|c7417e23' /var/log/containers/octavia/
/var/log/containers/octavia/worker.log: 2018-07-20 09:26:07.234 23 INFO octavia.controller.queue.endpoint [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Creating load balancer 'b944b377-366c-427c-bbb0-28f5ad3d2f01'...
/var/log/containers/octavia/worker.log: 2018-07-20 09:26:07.358 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Created Amphora in DB with id 07ff7cb7-6da6-49db-b78b-8e040c99931d
/var/log/containers/octavia/worker.log: 2018-07-20 09:26:21.567 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Mark ALLOCATED in DB for amphora: 07ff7cb7-6da6-49db-b78b-8e040c99931d with compute id c7417e23-722b-4f6d-be18-95e8804280ae for load balancer: b944b377-366c-427c-bbb0-28f5ad3d2f01
/var/log/containers/octavia/worker.log: 2018-07-20 09:26:36.561 23 INFO octavia.controller.worker.tasks.database_tasks [req-ca7d0764-78ed-4455-899c-5f5fddb28aca - 293061856e1a4cffbae3ed099cf4f619 - - -] Mark ACTIVE in DB for load balancer id: b944b377-366c-427c-bbb0-28f5ad3d2f01
/var/log/containers/octavia/api.log: 2018-07-20 09:26:07.204 1 INFO octavia.api.v2.controllers.load_balancer [req-f58487df-09aa-451f-a0f7-f0420766ea30 - 293061856e1a4cffbae3ed099cf4f619 - default default] Sending created Load Balancer b944b377-366c-427c-bbb0-28f5ad3d2f01 to the handler
/var/log/containers/octavia/api.log: 2018-07-20 09:26:38.902 1 INFO octavia.api.v2.controllers.listener [req-fb5d47b5-4b1f-4759-aef7-e0c05ae02f14 - 293061856e1a4cffbae3ed099cf4f619 - default default] Load Balancer b944b377-366c-427c-bbb0-28f5ad3d2f01 is immutable.

Everything seems to be OK for the moment. I noticed that the Octavia health manager process sometimes goes to >100% CPU consumption, but I guess that is OK as the controller node has 48 cores.

Please reopen if you can reproduce this or if Octavia fails again for similar reasons.
Description of problem:
After running some tests, pod creation fails with the following error:
NetworkPlugin cni failed to set up pod "<pod>" network: Got invalid status code from CNI daemon

Version-Release number of selected component (if applicable):
OSP13
openshift v3.10.18
python2-kuryr-kubernetes-0.4.3-1.el7ost.noarch
openstack-kuryr-kubernetes-controller-0.4.3-1.el7ost.noarch
python2-kuryr-lib-0.7.0-1.el7ost.noarch
openstack-kuryr-kubernetes-common-0.4.3-1.el7ost.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. Run some web console cases and other cases

Actual results:
Pod creation fails with these events:

Warning  FailedCreatePodSandBox  1m  kubelet, master-2.openshift.example.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "c8f62e69133396592cb70af977612709edc5d53002726f6a8a551c1ac0092963" network for pod "webconsole-785689b664-8lpgk": NetworkPlugin cni failed to set up pod "webconsole-785689b664-8lpgk_openshift-web-console" network: Got invalid status code from CNI daemon.

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 89, in run
    vif = self._add(params)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 147, in _add
    resp = self._make_request('addNetwork', params, httplib.ACCEPTED)
  File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/cni/api.py", line 181, in _make_request
    raise k_exc.CNIError('Got invalid status code from CNI daemon.')
CNIError: Got invalid status code from CNI daemon.

Normal  SandboxChanged  1m (x4 over 11m)  kubelet, master-2.openshift.example.com  Pod sandbox changed, it will be killed and re-created.

Expected results:
Pods should be created without errors

Additional info:
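For context on why the error message is so generic: the traceback shows that kuryr's CNI client expects a specific HTTP status from the daemon (httplib.ACCEPTED, i.e. 202, for addNetwork) and raises CNIError on anything else, so any daemon-side failure surfaces as the same message. A minimal Python 3 sketch of that pattern (the class and function here are simplified stand-ins, not kuryr's actual code):

```python
import http.client as httplib  # py3 equivalent of the py2 httplib in the traceback


class CNIError(Exception):
    """Simplified stand-in for kuryr_kubernetes.exceptions.CNIError."""


def check_cni_response(status, expected=httplib.ACCEPTED):
    """Mimic the status check in _make_request: any reply other than the
    expected code (202 ACCEPTED for addNetwork) raises the same generic
    'Got invalid status code from CNI daemon.' error."""
    if status != expected:
        raise CNIError('Got invalid status code from CNI daemon.')
    return status
```

Because the actual status code and daemon-side cause are not included in the exception, diagnosing this error requires the kuryr-daemon logs, which is why they were requested early in this bug.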