Description of problem:
Many CI jobs failed with the error: pods "xxx" is forbidden: unable to validate against any security context

Version-Release number of selected component (if applicable):
ocp4.7

How reproducible:
Always

Steps to Reproduce:
1. The CI job creates a new project;
2. The CI job creates a new app with the command: `oc new-app openshift/ruby~https://github.com/openshift/ruby-hello-world.git`

Actual results:
2. The pod fails to be created; checking the project shows it is missing the SCC annotations:

15:36:45 INFO> Shell Commands: oc new-project wm4wu --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-28.kubeconfig
Now using project "wm4wu" on server "https://api.qeci-18869.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

15:36:46 INFO> Exit Status: 0
15:36:47 INFO> oc get projects wm4wu --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-28.kubeconfig
15:36:47 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-28
  creationTimestamp: "2021-04-11T15:36:45Z"
  labels:
    kubernetes.io/metadata.name: wm4wu
  name: wm4wu
  resourceVersion: "101308"
  uid: e53eaac5-f039-4fd5-8d19-470873d3b522
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

Expected results:
2. The project is created with the correct annotations and pod creation succeeds.

Additional info:
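For quick triage, affected projects can be spotted by the absence of the `openshift.io/sa.scc.*` annotations that cluster-policy-controller normally writes (compare the positive example later in this bug). A minimal sketch — the helper name is ours and the `oc` usage is illustrative; only the annotation key is taken from the bug's own output:

```shell
# has_scc_annotations: reads project/namespace YAML on stdin and reports
# whether the SCC annotations written by cluster-policy-controller exist.
# We grep only for the uid-range key; mcs and supplemental-groups are set
# by the same controller sync, so one key is enough for a quick check.
has_scc_annotations() {
  if grep -q 'openshift.io/sa.scc.uid-range'; then
    echo "SCC annotations present"
  else
    echo "SCC annotations MISSING"
  fi
}

# Live usage against a cluster (illustrative):
#   oc get project wm4wu -o yaml | has_scc_annotations
```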
I have looked at the must-gather at [1] and below is what I see from the cpc logs:

2021-05-29T03:38:34.034240493Z E0529 03:38:34.034179 1 namespace_scc_allocation_controller.go:404] error syncing namespace, it will be retried: Put "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/apis/security.internal.openshift.io/v1/rangeallocations/scc-uid": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.034379593Z E0529 03:38:34.034181 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.034379593Z E0529 03:38:34.034181 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.069933790Z I0529 03:38:34.069625 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"zm4f8", Name:"zm4f8", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.086751589Z I0529 03:38:34.086704 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"7c7wq", Name:"7c7wq", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.104744987Z I0529 03:38:34.104691 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"fjxv2", Name:"fjxv2", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.127283885Z I0529 03:38:34.127243 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"fib1f", Name:"fib1f", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.149596683Z I0529 03:38:34.149554 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"ihslt", Name:"ihslt", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.164123882Z I0529 03:38:34.164059 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"dmaho", Name:"dmaho", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.184666680Z I0529 03:38:34.184485 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"blunn", Name:"blunn", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.206424779Z I0529 03:38:34.206293 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"mizgw", Name:"mizgw", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.222118077Z I0529 03:38:34.222007 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"2zgm9", Name:"2zgm9", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.242955076Z I0529 03:38:34.242899 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"dji27", Name:"dji27", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:39:05.098170411Z I0529 03:39:05.098111 1 request.go:645] Throttling request took 1.059358448s, request: GET:https://api-int.qeci-21249.qe.azure.devcluste

Attaching must-gather as well here.
Created attachment 1789213 [details] cpc log file
Link to the must-gather log file: http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/must-gather-21918-740750081/
Negative example for the issue:

03:29:02 INFO> Shell Commands: oc new-project rvowr --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-36.kubeconfig
Now using project "rvowr" on server "https://api.qeci-21249.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

03:29:03 INFO> Exit Status: 0
03:29:04 INFO> oc get projects rvowr --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-36.kubeconfig
03:29:04 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-36
  creationTimestamp: "2021-05-29T03:29:03Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/description: {}
          f:openshift.io/display-name: {}
          f:openshift.io/requester: {}
      f:status:
        f:phase: {}
    manager: openshift-apiserver
    operation: Update
    time: "2021-05-29T03:29:03Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:finalizers: {}
    manager: openshift-controller-manager
    operation: Update
    time: "2021-05-29T03:29:03Z"
  name: rvowr
  resourceVersion: "70989"
  selfLink: /api/v1/namespaces/rvowr
  uid: 42b4d7a5-465a-4f2f-8b05-ffe820abaea0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
Positive example for the project:

04:45:01 INFO> Shell Commands: oc new-project 5le4w --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-8.kubeconfig
Now using project "5le4w" on server "https://api.qeci-21249.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

04:45:02 INFO> Exit Status: 0
04:45:03 INFO> oc get projects 5le4w --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-8.kubeconfig
04:45:03 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-8
    openshift.io/sa.scc.mcs: s0:c84,c39
    openshift.io/sa.scc.supplemental-groups: 1007050000/10000
    openshift.io/sa.scc.uid-range: 1007050000/10000
  creationTimestamp: "2021-05-29T04:45:01Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:openshift.io/sa.scc.mcs: {}
          f:openshift.io/sa.scc.supplemental-groups: {}
          f:openshift.io/sa.scc.uid-range: {}
    manager: cluster-policy-controller
    operation: Update
    time: "2021-05-29T04:45:01Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/description: {}
          f:openshift.io/display-name: {}
          f:openshift.io/requester: {}
      f:status:
        f:phase: {}
    manager: openshift-apiserver
    operation: Update
    time: "2021-05-29T04:45:01Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:finalizers: {}
    manager: openshift-controller-manager
    operation: Update
    time: "2021-05-29T04:45:01Z"
  name: 5le4w
  resourceVersion: "159916"
  selfLink: /api/v1/namespaces/5le4w
  uid: 1afbe4c2-d576-4db5-b2b0-ba5ce906d1f8
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
As RamaKasturi said, when the issue occurs we can see error logs from cluster-policy-controller around that time:

2021-05-29T03:27:32.894628305Z E0529 03:27:32.894559 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:28:31.737376431Z E0529 03:28:31.737217 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:28:34.895234981Z E0529 03:28:34.895101 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:29:33.737873174Z E0529 03:29:33.737726 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I am having trouble reproducing this and did not get far by just looking at the logs. I tested on both the latest 4.7 and 4.8 CI clusters. Can you please specify how to reproduce this, or point me to the failed CI jobs?

The only failure I found is this job, which is a known flake:

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

e.g. in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1404355655778701312

But I couldn't find these failures on a scale similar to what is posted in this bug report.
I gave this another try.

First, a little background about the origin of the error:
1. A namespace is created.
2. cluster-policy-controller processes/syncs the namespace, and meanwhile the connection to the apiserver fails.
3. Since cluster-policy-controller couldn't allocate SCCs (security context constraints), all pods will fail on creation/admission in that namespace.

The resolution: cluster-policy-controller will eventually resolve the SCCs of the namespace, either by regaining its connection to the apiserver or, if the node was lost, through election of a new controller on another node; that new controller will repair the unresolved namespaces.

As for the failed pods that were unable to validate against any security context, it should be the responsibility of their controller (DaemonSet, etc.) to retry the creation. Suggestion: if it is a naked pod inside the test and the apiserver failure is expected, then the test should retry the creation once the namespace is fixed.

To test the cluster-policy-controller, I tried causing network failures, with either partial or complete connection loss to the node running the elected cluster-policy-controller. With a short connection loss of around 15-25s, namespace SCC resolution stops working, but after that period the controller recovers and resolves the SCC annotations. With a longer connection loss, the cluster-policy-controller either restarts (in case we disable the cluster-policy-controllers on the other nodes) or a new cluster-policy-controller is elected on another node. Either way, the SCC annotations are resolved properly after such an outage.
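The suggested retry can be sketched as a small polling loop that waits for the namespace to be repaired before (re)creating the pod. This is a sketch under assumptions, not the test framework's actual code: the helper name and timeout are ours, and the `oc` invocations are illustrative; only the annotation key comes from the bug's positive example.

```shell
# wait_for_scc: poll until a namespace carries the SCC uid-range annotation
# (written by cluster-policy-controller), or give up after $2 attempts.
wait_for_scc() {
  ns="$1"
  tries="${2:-60}"   # illustrative default: ~60s at 1s per attempt
  i=0
  while [ "$i" -lt "$tries" ]; do
    if oc get namespace "$ns" -o yaml | grep -q 'openshift.io/sa.scc.uid-range'; then
      return 0       # namespace repaired, safe to create pods
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for SCC annotations on namespace $ns" >&2
  return 1
}

# Illustrative usage in a test, retrying the naked pod after the fix:
#   wait_for_scc rvowr && oc create -f pod.yaml
```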
I have also managed to reproduce an error very similar to the one described in this issue:

E0616 18:54:50.205144 1 namespace_scc_allocation_controller.go:404] error syncing namespace, it will be retried: rpc error: code = Unavailable desc = transport is closing

This particular namespace got synced properly and annotated with the SCC annotations after the connection to the apiserver was repaired.

With this conclusion I am closing the issue.