Bug 1759095 - Kuryr-controller restarts due to race between namespace and pod creation
Summary: Kuryr-controller restarts due to race between namespace and pod creation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Luis Tomas Bolivar
QA Contact: Jon Uriarte
URL:
Whiteboard:
Duplicates: 1757876 (view as bug list)
Depends On:
Blocks: 1757916 1759097
TreeView+ depends on / blocked
 
Reported: 2019-10-07 12:08 UTC by Luis Tomas Bolivar
Modified: 2020-05-13 21:27 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1759097 (view as bug list)
Environment:
Last Closed: 2020-05-13 21:27:01 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kuryr-kubernetes pull 53 0 'None' 'closed' 'Bug 1759095: Prevent Kuryr from restarting due to corner-cases' 2019-12-09 12:24:56 UTC
Launchpad 1847085 0 None None None 2019-10-07 12:08:30 UTC
OpenStack gerrit 686436 0 'None' 'MERGED' 'Ensure no KuryrNet addition is tried when namespace is not present' 2019-12-09 12:24:55 UTC
OpenStack gerrit 687008 0 'None' 'MERGED' 'Ensure vif handler retry in case of missing ns resources' 2019-12-09 12:24:55 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-05-13 21:27:06 UTC

Description Luis Tomas Bolivar 2019-10-07 12:08:31 UTC
When a namespace is created and a new pod is created in it right away, the resources the pod needs (Neutron resources plus the Kuryr CRD objects) may not yet exist by the time the VIF handler runs to obtain a port for the pod. This leads to the following error, which makes the kuryr-controller restart when it should only retry:
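The linked Gerrit changes follow the pattern of converting the missing-KuryrNet error into a retriable condition instead of letting it propagate and crash the controller. A minimal, illustrative Python sketch of that idea (the class and function names here are stand-ins, not the actual kuryr-kubernetes code):

```python
class K8sResourceNotFound(Exception):
    """Stand-in for kuryr_kubernetes.exceptions.K8sResourceNotFound."""


class ResourceNotReady(Exception):
    """Stand-in for a retriable exception the Retry handler understands."""


def get_namespace_subnet_id(k8s_get, net_crd_path):
    """Return the subnet id stored in a namespace's KuryrNet CRD.

    k8s_get: callable that fetches a k8s object or raises
    K8sResourceNotFound (hypothetical stand-in for the k8s client).
    """
    try:
        net_crd = k8s_get(net_crd_path)
    except K8sResourceNotFound:
        # The namespace handler has not created the CRD yet: surface a
        # retriable condition so the event is re-queued instead of the
        # exception bubbling up and restarting the controller.
        raise ResourceNotReady(net_crd_path)
    return net_crd['spec']['subnetId']
```

With this shape, a pod event that races namespace creation is simply retried until the CRD appears.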

2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet [-] Kubernetes Client Exception.: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet Traceback (most recent call last):
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 73, in _get_namespace_subnet_id
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet constants.K8S_API_CRD, net_crd_name))
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/k8s_client.py", line 83, in get
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet raise exc.K8sResourceNotFound(response.text)
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy VIFHandler: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry self._handler(event)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry self.on_present(obj)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry subnets = self._drv_subnets.get_subnets(pod, project_id)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry return self.get_namespace_subnet(pod_namespace)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry subnet_id = self._get_namespace_subnet_id(namespace)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 73, in _get_namespace_subnet_id
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry constants.K8S_API_CRD, net_crd_name))
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/k8s_client.py", line 83, in get
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry raise exc.K8sResourceNotFound(response.text)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {u'object': {u'status': {u'qosClass': u'Burstable', u'containerStatuses': [{u'restartCount': 0, u'name': u'openshift-apiserver-operator', u'image': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89', u'imageID': u'', u'state': {u'waiting': {u'reason': u'ContainerCreating'}}, u'ready': False, u'lastState': {}}], u'startTime': u'2019-10-07T09:42:20Z', u'hostIP': u'10.196.0.13', u'phase': u'Pending', u'conditions': [{u'status': u'True', u'lastProbeTime': None, u'type': u'Initialized', u'lastTransitionTime': u'2019-10-07T09:42:20Z'}, {u'status': u'False', u'lastTransitionTime': u'2019-10-07T09:42:20Z', u'reason': u'ContainersNotReady', u'lastProbeTime': None, u'message': u'containers with unready status: [openshift-apiserver-operator]', u'type': u'Ready'}, {u'status': u'False', u'lastTransitionTime': u'2019-10-07T09:42:20Z', u'reason': u'ContainersNotReady', u'lastProbeTime': None, u'message': u'containers with unready status: [openshift-apiserver-operator]', u'type': u'ContainersReady'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'PodScheduled', u'lastTransitionTime': u'2019-10-07T09:42:20Z'}]}, u'kind': u'Pod', u'spec': {u'priorityClassName': u'system-cluster-critical', u'dnsPolicy': u'ClusterFirst', u'securityContext': {}, u'serviceAccountName': u'openshift-apiserver-operator', u'schedulerName': u'default-scheduler', u'enableServiceLinks': True, u'serviceAccount': u'openshift-apiserver-operator', u'nodeSelector': {u'node-role.kubernetes.io/master': u''}, u'priority': 2000000000, u'terminationGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'optional': True, u'defaultMode': 420, u'secretName': u'openshift-apiserver-operator-serving-cert'}, u'name': u'serving-cert'}, {u'configMap': {u'name': u'openshift-apiserver-operator-config', u'defaultMode': 420}, u'name': u'config'}, {u'secret': {u'defaultMode': 420, u'secretName': u'openshift-apiserver-operator-token-pz2hb'}, u'name': u'openshift-apiserver-operator-token-pz2hb'}], u'tolerations': [{u'operator': u'Exists', u'effect': u'NoSchedule', u'key': u'node.kubernetes.io/memory-pressure'}, {u'operator': u'Exists', u'effect': u'NoSchedule', u'key': u'node-role.kubernetes.io/master'}, {u'operator': u'Exists', u'tolerationSeconds': 120, u'effect': u'NoExecute', u'key': u'node.kubernetes.io/unreachable'}, {u'operator': u'Exists', u'tolerationSeconds': 120, u'effect': u'NoExecute', u'key': u'node.kubernetes.io/not-ready'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'openshift-apiserver-operator', u'image': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89', u'args': [u'--config=/var/run/configmaps/config/config.yaml'], u'volumeMounts': [{u'mountPath': u'/var/run/configmaps/config', u'name': u'config'}, {u'mountPath': u'/var/run/secrets/serving-cert', u'name': u'serving-cert'}, {u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u'name': u'openshift-apiserver-operator-token-pz2hb'}], u'terminationMessagePolicy': u'FallbackToLogsOnError', u'command': [u'cluster-openshift-apiserver-operator', u'operator'], u'env': [{u'name': u'IMAGE', u'value': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:43961dab5d4079061db12a0da58d301056a068c305ada0f4f29537c8a6e8120e'}, {u'name': u'OPERATOR_IMAGE', u'value': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89'}, {u'name': u'OPERATOR_IMAGE_VERSION', u'value': u'4.2.0-0.nightly-2019-10-07-011045'}, {u'name': u'OPERAND_IMAGE_VERSION', u'value': u'4.2.0-0.nightly-2019-10-07-011045'}], u'imagePullPolicy': u'IfNotPresent', u'ports': [{u'protocol': u'TCP', u'name': u'metrics', u'containerPort': 8443}], u'resources': {u'requests': {u'cpu': u'10m', u'memory': u'50Mi'}}}], u'nodeName': u'ostest-6qg24-master-0'}, u'apiVersion': u'v1', u'metadata': {u'name': u'openshift-apiserver-operator-754dc6f9f5-cvdsz', u'labels': {u'pod-template-hash': u'754dc6f9f5', u'app': u'openshift-apiserver-operator'}, u'namespace': u'openshift-apiserver-operator', u'ownerReferences': [{u'kind': u'ReplicaSet', u'name': u'openshift-apiserver-operator-754dc6f9f5', u'apiVersion': u'apps/v1', u'controller': True, u'blockOwnerDeletion': True, u'uid': u'808e40ef-e8e6-11e9-8c73-fa163e70b8e3'}], u'resourceVersion': u'5152', u'generateName': u'openshift-apiserver-operator-754dc6f9f5-', u'creationTimestamp': u'2019-10-07T09:40:31Z', u'annotations': {u'k8s.v1.cni.cncf.io/networks-status': u''}, u'selfLink': u'/api/v1/namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-754dc6f9f5-cvdsz', u'uid': u'80b3219a-e8e6-11e9-8c73-fa163e70b8e3'}}, u'type': u'ADDED'}: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'

Comment 2 Luis Tomas Bolivar 2019-10-17 15:40:43 UTC
*** Bug 1757876 has been marked as a duplicate of this bug. ***

Comment 3 Jon Uriarte 2019-10-18 10:12:28 UTC
Verified on OCP 4.3.0-0.nightly-2019-10-17-061631 build on top of OSP 13 2019-10-01.1 puddle.

release image: registry.svc.ci.openshift.org/ocp/release@sha256:2cafe25ec1ed2dfdec361cde13b4461d2a30194d0b41fbd1c6d3fad5ab34ca05

The scenario has been reproduced by deleting the kuryrnet CRD before creating a new pod, showing the warning message and retrying.

Verification steps:
$ oc new-project test
$ oc get kuryrnets | grep test
ns-test         10.128.104.0/23   34s

$ oc delete kuryrnet ns-test
kuryrnet.openstack.org "ns-test" deleted

$ oc run --image kuryr/demo demo

Logs in kuryr-controller show the retries, as it detects the kuryrnet resource has not been created yet, and the liveness probe reports OK:

WARNING kuryr_kubernetes.controller.drivers.namespace_subnet [-] Kuryrnet resource not yet created, retrying...: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-test\\" not found","reason":"NotFound","details":{"name":"ns-test","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
INFO werkzeug [-] 192.168.0.12 - - [18/Oct/2019 08:57:40] "GET /alive HTTP/1.1" 200 -
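The behaviour verified above (warn, retry, liveness probe still answering 200) can be sketched as a retry wrapper that backs off on a retriable condition instead of reporting the handler unhealthy. This is an illustrative sketch only; handle_with_retry and ResourceNotReady are hypothetical names, not the exact kuryr_kubernetes.handlers.retry API:

```python
import time


class ResourceNotReady(Exception):
    """Illustrative retriable exception (stand-in for kuryr's own)."""


def handle_with_retry(handler, event, attempts=5, backoff=0.01):
    """Call handler(event); retry on ResourceNotReady, re-raise anything else.

    Only after the retry budget is exhausted does the exception escape,
    at which point the handler would be reported unhealthy.
    """
    for attempt in range(attempts):
        try:
            return handler(event)
        except ResourceNotReady:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: now it is a real failure
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
```

Because transient misses never escape the wrapper, the controller process stays up and the liveness endpoint keeps returning 200 while the missing resource is being created.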

Comment 5 errata-xmlrpc 2020-05-13 21:27:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

