Bug 1759097

Summary: Kuryr-controller restarts due to race between namespace and pod creation
Product: OpenShift Container Platform Reporter: Luis Tomas Bolivar <ltomasbo>
Component: NetworkingAssignee: Luis Tomas Bolivar <ltomasbo>
Networking sub component: kuryr QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: bbennett, juriarte
Version: 4.2.0Keywords: Triaged
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1759095 Environment:
Last Closed: 2019-10-30 04:44:56 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1759095    
Bug Blocks:    

Description Luis Tomas Bolivar 2019-10-07 12:10:22 UTC
+++ This bug was initially created as a clone of Bug #1759095 +++

When a namespace is created, and a new pod in that namespace is created right away, it may happen that all needed resources (neutron resources + kuryr CRD objects) are not yet created when the vif handler is executed to obtain a port for the pod. This leads to the following error that makes the kuryr-controller to restart (when it should only retry):

2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet [-] Kubernetes Client Exception.: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},
"status":"Failure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kury
rnets"},"code":404}\n'
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet Traceback (most recent call last):
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 73, in _get_namespace_
subnet_id
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet constants.K8S_API_CRD, net_crd_name))
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/k8s_client.py", line 83, in get
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet raise exc.K8sResourceNotFound(response.text)
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kury
rnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.043 1 ERROR kuryr_kubernetes.controller.drivers.namespace_subnet
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry [-] Report handler unhealthy VIFHandler: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Fail
ure","message":"kuryrnets.openstack.org \\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code"
:404}\n'
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry Traceback (most recent call last):
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/retry.py", line 56, in __call__
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry self._handler(event)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/handlers/k8s_base.py", line 75, in __call__
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry self.on_present(obj)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/handlers/vif.py", line 94, in on_present
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry subnets = self._drv_subnets.get_subnets(pod, project_id)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 47, in get_subnets
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry return self.get_namespace_subnet(pod_namespace)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 51, in get_namespace_subnet
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry subnet_id = self._get_namespace_subnet_id(namespace)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/controller/drivers/namespace_subnet.py", line 73, in _get_namespace_subnet_id
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry constants.K8S_API_CRD, net_crd_name))
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry File "/usr/lib/python2.7/site-packages/kuryr_kubernetes/k8s_client.py", line 83, in get
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry raise exc.K8sResourceNotFound(response.text)
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \
\"ns-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.retry
2019-10-07 10:01:14.044 1 ERROR kuryr_kubernetes.handlers.logging [-] Failed to handle event {u'object': {u'status': {u'qosClass': u'Burstable', u'containerStatuses': [{u'restartCount': 0, u'name': u'openshift-a
piserver-operator', u'image': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89', u'imageID': u'', u'state': {u'waiting': {u'reason': u'Cont
ainerCreating'}}, u'ready': False, u'lastState': {}}], u'startTime': u'2019-10-07T09:42:20Z', u'hostIP': u'10.196.0.13', u'phase': u'Pending', u'conditions': [{u'status': u'True', u'lastProbeTime': None, u'type'
: u'Initialized', u'lastTransitionTime': u'2019-10-07T09:42:20Z'}, {u'status': u'False', u'lastTransitionTime': u'2019-10-07T09:42:20Z', u'reason': u'ContainersNotReady', u'lastProbeTime': None, u'message': u'co
ntainers with unready status: [openshift-apiserver-operator]', u'type': u'Ready'}, {u'status': u'False', u'lastTransitionTime': u'2019-10-07T09:42:20Z', u'reason': u'ContainersNotReady', u'lastProbeTime': None,
u'message': u'containers with unready status: [openshift-apiserver-operator]', u'type': u'ContainersReady'}, {u'status': u'True', u'lastProbeTime': None, u'type': u'PodScheduled', u'lastTransitionTime': u'2019-1
0-07T09:42:20Z'}]}, u'kind': u'Pod', u'spec': {u'priorityClassName': u'system-cluster-critical', u'dnsPolicy': u'ClusterFirst', u'securityContext': {}, u'serviceAccountName': u'openshift-apiserver-operator', u's
chedulerName': u'default-scheduler', u'enableServiceLinks': True, u'serviceAccount': u'openshift-apiserver-operator', u'nodeSelector': {u'node-role.kubernetes.io/master': u''}, u'priority': 2000000000, u'termina
tionGracePeriodSeconds': 30, u'restartPolicy': u'Always', u'volumes': [{u'secret': {u'optional': True, u'defaultMode': 420, u'secretName': u'openshift-apiserver-operator-serving-cert'}, u'name': u'serving-cert'}
, {u'configMap': {u'name': u'openshift-apiserver-operator-config', u'defaultMode': 420}, u'name': u'config'}, {u'secret': {u'defaultMode': 420, u'secretName': u'openshift-apiserver-operator-token-pz2hb'}, u'name
': u'openshift-apiserver-operator-token-pz2hb'}], u'tolerations': [{u'operator': u'Exists', u'effect': u'NoSchedule', u'key': u'node.kubernetes.io/memory-pressure'}, {u'operator': u'Exists', u'effect': u'NoSched
ule', u'key': u'node-role.kubernetes.io/master'}, {u'operator': u'Exists', u'tolerationSeconds': 120, u'effect': u'NoExecute', u'key': u'node.kubernetes.io/unreachable'}, {u'operator': u'Exists', u'tolerationSec
onds': 120, u'effect': u'NoExecute', u'key': u'node.kubernetes.io/not-ready'}], u'containers': [{u'terminationMessagePath': u'/dev/termination-log', u'name': u'openshift-apiserver-operator', u'image': u'quay.io/
openshift-release-dev/ocp-v4.0-art-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89', u'args': [u'--config=/var/run/configmaps/config/config.yaml'], u'volumeMounts': [{u'mountPath': u'
/var/run/configmaps/config', u'name': u'config'}, {u'mountPath': u'/var/run/secrets/serving-cert', u'name': u'serving-cert'}, {u'readOnly': True, u'mountPath': u'/var/run/secrets/kubernetes.io/serviceaccount', u
'name': u'openshift-apiserver-operator-token-pz2hb'}], u'terminationMessagePolicy': u'FallbackToLogsOnError', u'command': [u'cluster-openshift-apiserver-operator', u'operator'], u'env': [{u'name': u'IMAGE', u'va
lue': u'quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:43961dab5d4079061db12a0da58d301056a068c305ada0f4f29537c8a6e8120e'}, {u'name': u'OPERATOR_IMAGE', u'value': u'quay.io/openshift-release-dev/ocp-v4.0-a
rt-dev@sha256:5687e124eb0cc6989f2bf188313ffb96fcd4cce32596eeaba92c65186b9c1e89'}, {u'name': u'OPERATOR_IMAGE_VERSION', u'value': u'4.2.0-0.nightly-2019-10-07-011045'}, {u'name': u'OPERAND_IMAGE_VERSION', u'value
': u'4.2.0-0.nightly-2019-10-07-011045'}], u'imagePullPolicy': u'IfNotPresent', u'ports': [{u'protocol': u'TCP', u'name': u'metrics', u'containerPort': 8443}], u'resources': {u'requests': {u'cpu': u'10m', u'memo
ry': u'50Mi'}}}], u'nodeName': u'ostest-6qg24-master-0'}, u'apiVersion': u'v1', u'metadata': {u'name': u'openshift-apiserver-operator-754dc6f9f5-cvdsz', u'labels': {u'pod-template-hash': u'754dc6f9f5', u'app': u
'openshift-apiserver-operator'}, u'namespace': u'openshift-apiserver-operator', u'ownerReferences': [{u'kind': u'ReplicaSet', u'name': u'openshift-apiserver-operator-754dc6f9f5', u'apiVersion': u'apps/v1', u'con
troller': True, u'blockOwnerDeletion': True, u'uid': u'808e40ef-e8e6-11e9-8c73-fa163e70b8e3'}], u'resourceVersion': u'5152', u'generateName': u'openshift-apiserver-operator-754dc6f9f5-', u'creationTimestamp': u'
2019-10-07T09:40:31Z', u'annotations': {u'k8s.v1.cni.cncf.io/networks-status': u''}, u'selfLink': u'/api/v1/namespaces/openshift-apiserver-operator/pods/openshift-apiserver-operator-754dc6f9f5-cvdsz', u'uid': u'
80b3219a-e8e6-11e9-8c73-fa163e70b8e3'}}, u'type': u'ADDED'}: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"n
s-openshift-apiserver-operator\\" not found","reason":"NotFound","details":{"name":"ns-openshift-apiserver-operator","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'

Comment 1 Luis Tomas Bolivar 2019-10-17 15:41:19 UTC
*** Bug 1757916 has been marked as a duplicate of this bug. ***

Comment 3 Jon Uriarte 2019-10-24 14:36:56 UTC
Verified on OCP 4.2.0-0.nightly-2019-10-23-033305 build on top of OSP 13 2019-10-16.2 puddle.

release image: registry.svc.ci.openshift.org/ocp/release@sha256:9408af4d970a3839c68b9429a7aa1eccc7a4bb1f522c3693062d747cb72c378d

The scenario has been reproduced by deleting the kuryrnet CRD before creating a new pod, showing the warning message and retrying.

Verification steps:
$ oc new-project test
$ oc get kuryrnets | grep test
ns-test         10.128.102.0/23    4s

$ oc delete kuryrnet ns-test
kuryrnet.openstack.org "ns-test" deleted

$ oc run --image kuryr/demo demo

Logs in kuryr-controller show the retries, as it detects kuryrnet resource is not created, and the aliveness probe reports ok:

WARNING kuryr_kubernetes.controller.drivers.namespace_subnet [-] Kuryrnet resource not yet created, retrying...: K8sResourceNotFound: Resource not found: u'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"kuryrnets.openstack.org \\"ns-test\\" not found","reason":"NotFound","details":{"name":"ns-test","group":"openstack.org","kind":"kuryrnets"},"code":404}\n'

INFO werkzeug [-] 10.196.0.18 - - [24/Oct/2019 14:35:16] "GET /alive HTTP/1.1" 200 -

Comment 5 errata-xmlrpc 2019-10-30 04:44:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3151