Bug 1596696 - Kuryr-controller crashes if Octavia amphoras take too long or fail to be created
Summary: Kuryr-controller crashes if Octavia amphoras take too long or fail to be created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-octavia
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: Luis Tomas Bolivar
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-29 13:22 UTC by Luis Tomas Bolivar
Modified: 2019-09-10 14:09 UTC
CC: 9 users

Fixed In Version: openstack-octavia-2.0.1-6.ddaf235git.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, filtering load balancers by VIP data raised an internal server error (500 HTTP) in octavia-api. As a result, the Kuryr-controller did not locate previously created load balancers upon restart and crashed. In this update, the fields translation on filtering has been corrected and pool filtering by healthmonitor_id has been added. Now filtering load balancers by VIP data is possible and the Kuryr-controller can find previously created load balancers.
Clone Of:
Environment:
Last Closed: 2018-08-29 16:20:28 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 579144 0 None master: ABANDONED kuryr-kubernetes: Ensure Octavia created load balancer can be found (I10494b142803d8c32e09f970ecc4e87e68f2ebca) 2018-07-19 02:10:18 UTC
OpenStack gerrit 579559 0 None master: MERGED octavia: Fix fields translation on filtering (Iaa4877f71e98689b7e92b3e2abb9d7da2c5ca521) 2018-07-19 02:10:12 UTC
OpenStack gerrit 579846 0 None stable/queens: MERGED octavia: Fix fields translation on filtering (Iaa4877f71e98689b7e92b3e2abb9d7da2c5ca521) 2018-07-19 02:10:06 UTC
Red Hat Product Errata RHBA-2018:2596 0 None None None 2018-08-29 16:20:51 UTC

Description Luis Tomas Bolivar 2018-06-29 13:22:46 UTC
When using Octavia as LBaaS, the Amphora VMs may take a long time to boot up, or may even fail to be created (for instance, if there is not enough space for them).

When this happens, the kuryr-kubernetes loadbalancer handler will hit a wait-for-provision timeout and throw a ResourceNotReady exception, which in turn will trigger a retry. However, due to the following bug in Octavia (https://storyboard.openstack.org/#!/story/2001944), listing the existing load balancers with filtering will fail and raise an InternalServerError exception:


2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy LoadBalancerHandler: InternalServerError: 500-{u'debuginfo': u'Traceback (most recent call last):\n\n  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction\n    result = f(self, *args, **kwargs)\n\n  File "/opt/stack/octavia/octavia/api/v2/controllers/load_b
alancer.py", line 83, in get_all\n    **query_filter)\n\n  File "/opt/stack/octavia/octavia/db/repositories.py", line 145, in get_all\n    query, self.model_class)\n\n  File "/opt/stack/octavia/octavia/api/common/pagination.py", line 232, in apply\n    query = model.apply_filter(query, model, self.filters)\n\n  File "/opt/stack/octavia/octavia/db/base_models.py", line 123, in ap
ply_filter\n    query = query.filter_by(**translated_filters)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 1632, in filter_by\n    for key, value in kwargs.items()]\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/operators.py", line 344, in __eq__\n    return self.operate(eq, other)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalc
hemy/orm/attributes.py", line 180, in operate\n    return op(self.comparator, *other, **kwargs)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/relationships.py", line 1039, in __eq__\n    other, adapt_source=self.adapter))\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/relationships.py", line 1372, in _optimized_compare\n    state = attributes.instanc
e_state(state)\n\nAttributeError: \'dict\' object has no attribute \'_sa_instance_state\'\n', u'faultcode': u'Server', u'faultstring': u"'dict' object has no attribute '_sa_instance_state'"}
Neutron server returns request_ids: ['req-8b7c13c4-b52b-4025-8a65-f82120881f71']
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 55, in __call__
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     self._handler(event)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     self.on_present(obj)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 243, in on_present
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_members(endpoints, lbaas_state, lbaas_spec):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 318, in _sync_lbaas_members
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_pools(endpoints, lbaas_state, lbaas_spec):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 428, in _sync_lbaas_pools
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_listeners(endpoints, lbaas_state, lbaas_spec):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 483, in _sync_lbaas_listeners
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     if self._sync_lbaas_loadbalancer(endpoints, lbaas_state, lbaas_spec):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/lbaas.py", line 573, in _sync_lbaas_loadbalancer
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     service_type=lbaas_spec.type)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 60, in ensure_loadbalancer
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     self._find_loadbalancer)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 404, in _ensure
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     result = find(obj)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/lbaasv2.py", line 271, in _find_loadbalancer
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     vip_subnet_id=loadbalancer.subnet_id)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 1124, in list_loadbalancers
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     retrieve_all, **_params)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 369, in list
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     for r in self._pagination(collection, path, **params):
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 384, in _pagination
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     res = self.get(path, params=params)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 354, in get
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     headers=headers, params=params)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 331, in retry_request
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     headers=headers, params=params)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 294, in do_request
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     self._handle_fault_response(status_code, replybody, resp)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 269, in _handle_fault_response
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     exception_handler_v20(status_code, error_body)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 93, in exception_handler_v20
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry     request_ids=request_ids)
2018-06-28 11:34:46.005 1 ERROR kuryr_kubernetes.handlers.retry InternalServerError: 500-{u'debuginfo': u'Traceback (most recent call last):\n\n  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction\n    result = f(self, *args, **kwargs)\n\n  File "/opt/stack/octavia/octavia/api/v2/controllers/load_balancer.py", line 83, in get_all\n    **query_filt
er)\n\n  File "/opt/stack/octavia/octavia/db/repositories.py", line 145, in get_all\n    query, self.model_class)\n\n  File "/opt/stack/octavia/octavia/api/common/pagination.py", line 232, in apply\n    query = model.apply_filter(query, model, self.filters)\n\n  File "/opt/stack/octavia/octavia/db/base_models.py", line 123, in apply_filter\n    query = query.filter_by(**translat
ed_filters)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 1632, in filter_by\n    for key, value in kwargs.items()]\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/operators.py", line 344, in __eq__\n    return self.operate(eq, other)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/attributes.py", line 180, in operate\n   
 return op(self.comparator, *other, **kwargs)\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/relationships.py", line 1039, in __eq__\n    other, adapt_source=self.adapter))\n\n  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/relationships.py", line 1372, in _optimized_compare\n    state = attributes.instance_state(state)\n\nAttributeError: \'dict\' object 
has no attribute \'_sa_instance_state\'\n', u'faultcode': u'Server', u'faultstring': u"'dict' object has no attribute '_sa_instance_state'"}


Because of this, after a few retries the whole kuryr-controller pod will crash and be restarted. However, upon restart the controller will still not be able to recover the information about the previously created load balancer, and it will crash again after a few more retries.
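
For reference, the lookup that triggers the failure can be sketched as follows, based on the call chain in the traceback above (only vip_subnet_id is taken verbatim from the traceback; the other keyword arguments and the loadbalancer object are assumptions for illustration). The controller lists existing load balancers through the neutron LBaaS v2 client with VIP-based filters, and it is this filtered query that Octavia's API fails to translate:

    # Illustrative sketch of the kuryr-kubernetes _find_loadbalancer lookup
    # (not copied from the source). 'neutron' is a neutronclient.v2_0 Client.
    # Before the Octavia fix, this filtered listing returned HTTP 500 instead
    # of the previously created load balancer.
    response = neutron.list_loadbalancers(
        name=loadbalancer.name,
        project_id=loadbalancer.project_id,
        vip_address=str(loadbalancer.ip),
        vip_subnet_id=loadbalancer.subnet_id)
    for lb in response['loadbalancers']:
        ...  # match against the expected load balancer, if any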

Comment 3 Luis Tomas Bolivar 2018-07-06 12:10:14 UTC
https://review.openstack.org/#/c/579144/ is no longer needed, as the problem was fixed in Octavia (and backported):
https://review.openstack.org/#/c/579559/
https://review.openstack.org/#/c/579846/
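
For context, "fix fields translation on filtering" can be illustrated roughly as follows (a simplified sketch, not the actual Octavia patch; the mapping and helper names are hypothetical). Filter keys such as vip_address are exposed by the API on the load balancer resource but live on the related Vip model in the database, so they must be translated into filters on that relationship instead of being passed to query.filter_by() untranslated, which is what produced the "'dict' object has no attribute '_sa_instance_state'" error above:

    # Simplified, hypothetical sketch of the idea behind the Octavia fix.
    API_TO_VIP_COLUMN = {
        'vip_address': 'ip_address',
        'vip_subnet_id': 'subnet_id',
        'vip_port_id': 'port_id',
    }

    def split_lb_filters(filters):
        """Separate LoadBalancer-column filters from Vip-column filters."""
        lb_filters, vip_filters = {}, {}
        for key, value in filters.items():
            column = API_TO_VIP_COLUMN.get(key)
            if column:
                vip_filters[column] = value
            else:
                lb_filters[key] = value
        return lb_filters, vip_filters

As noted in the Doc Text above, the update also adds pool filtering by healthmonitor_id.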

Comment 8 Joanne O'Flynn 2018-08-13 14:07:22 UTC
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory, please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 9 Jon Uriarte 2018-08-27 15:18:18 UTC
Verified on the OSP 13 2018-08-16.1 puddle with openshift-ansible-3.10.34-1 and:
 openstack-kuryr-kubernetes-cni-0.4.3-2.el7ost.noarch
 openstack-kuryr-kubernetes-controller-0.4.3-2.el7ost.noarch

Verification steps:
1. Deploy OCP on OSP with kuryr
2. From the master node:
   $ oc new-project test
   $ oc run --image=kuryr/demo demo -n test
   $ oc scale -n test dc/demo --replicas=2
   $ oc get pods
   NAME           READY     STATUS    RESTARTS   AGE
   demo-1-t24lc   1/1       Running   0          2m
   demo-1-xx2mx   1/1       Running   0          2m

   $ oc expose -n test dc/demo --port 80 --target-port 8080

   $ oc get pods -n openshift-infra
   NAME                                READY     STATUS    RESTARTS   AGE
   bootstrap-autoapprover-0            1/1       Running   0          2h
   kuryr-cni-ds-85kmz                  2/2       Running   0          2h
   kuryr-cni-ds-fhfcz                  2/2       Running   0          2h
   kuryr-cni-ds-qq8pq                  2/2       Running   0          2h
   kuryr-cni-ds-xzrd4                  2/2       Running   0          2h
   kuryr-controller-59fc7f478b-swpsm   1/1       Running   3          58m

3. Delete the kuryr-controller pod from the master node:
   $ oc -n openshift-infra delete pod kuryr-controller-59fc7f478b-swpsm

4. Check that a new kuryr-controller pod is deployed with no error messages, from the master node:
   $ oc get pods -n openshift-infra
   NAME                                READY     STATUS    RESTARTS   AGE
   bootstrap-autoapprover-0            1/1       Running   0          2h
   kuryr-cni-ds-85kmz                  2/2       Running   0          2h
   kuryr-cni-ds-fhfcz                  2/2       Running   0          2h
   kuryr-cni-ds-qq8pq                  2/2       Running   0          2h
   kuryr-cni-ds-xzrd4                  2/2       Running   0          2h
   kuryr-controller-59fc7f478b-jfz2b   1/1       Running   0          13s

   $ oc -n openshift-infra logs -f kuryr-controller-59fc7f478b-2gdrr
2018-08-27 14:26:53.632 1 INFO kuryr_kubernetes.config [-] Logging enabled!
2018-08-27 14:26:53.633 1 INFO kuryr_kubernetes.config [-] /usr/bin/kuryr-k8s-controller version 0.4.3
2018-08-27 14:26:55.154 1 INFO os_vif [-] Loaded VIF plugins: noop, ovs, linux_bridge
2018-08-27 14:26:55.241 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' stopped
2018-08-27 14:26:55.250 1 INFO kuryr_kubernetes.controller.service [-] Service 'KuryrK8sService' starting
2018-08-27 14:26:55.250 1 INFO kuryr_kubernetes.controller.managers.health [-] Starting health check server.
2018-08-27 14:26:55.251 1 INFO werkzeug [-]  * Running on http://localhost:8082/
2018-08-27 14:26:55.257 1 INFO kuryr_kubernetes.watcher [-] Started watching '/api/v1/pods'
2018-08-27 14:26:55.259 1 INFO kuryr_kubernetes.watcher [-] Started watching '/api/v1/services'
2018-08-27 14:26:55.262 1 INFO kuryr_kubernetes.watcher [-] Started watching '/api/v1/endpoints'
2018-08-27 14:26:56.914 1 INFO kuryr_kubernetes.controller.drivers.vif_pool [-] PORTS POOL: pools updated with pre-created ports

5. Curl the service (load balancer) IP from the master node, and check that requests are routed to each pod in a round-robin fashion:
   $ oc get svc
   NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
   demo      ClusterIP   172.30.1.140   <none>        80/TCP    1m

   $ curl 172.30.1.140
   demo-1-t24lc: HELLO! I AM ALIVE!!!
   $ curl 172.30.1.140
   demo-1-xx2mx: HELLO! I AM ALIVE!!!

6. Check the kuryr-controller logs; no exception should be raised (an optional API-level check is sketched after these steps).

7. Delete the project:
   $ oc delete project test
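
An optional API-level sanity check can be run before deleting the project (a sketch only, assuming a neutronclient LBaaS v2 client with an authenticated keystoneauth1 session, as used in the traceback above; 'sess' and 'subnet_id' are placeholders): the same VIP-filtered listing that previously returned HTTP 500 should now return the load balancer.

    # Sketch: list load balancers filtered by VIP subnet. Before the fix this
    # call raised InternalServerError (HTTP 500); after it, the filtered
    # results are returned normally.
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(session=sess)  # sess: keystoneauth1 session (assumed)
    lbs = neutron.list_loadbalancers(vip_subnet_id=subnet_id)
    print(lbs['loadbalancers'])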

Comment 10 Jon Uriarte 2018-08-28 08:01:15 UTC
I forgot to add a step in the verification:
After creating the service (oc expose -n test dc/demo --port 80 --target-port 8080), a new scale-out must be performed:
  $ oc scale -n test dc/demo --replicas=3
The remaining steps stay the same.

Comment 12 errata-xmlrpc 2018-08-29 16:20:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2596

