Description of problem:
Many CI jobs failed with the error: pods "xxx" is forbidden: unable to validate against any security context

Version-Release number of selected component (if applicable):
ocp4.7

How reproducible:
Always

Steps to Reproduce:
1. The CI job creates a new project;
2. The CI job creates a new app with the command: `oc new-app openshift/ruby~https://github.com/openshift/ruby-hello-world.git`

Actual results:
2. The pod fails to be created; checking the project shows it is missing the SCC annotations:

15:36:45 INFO> Shell Commands: oc new-project wm4wu --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-28.kubeconfig
Now using project "wm4wu" on server "https://api.qeci-18869.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

15:36:46 INFO> Exit Status: 0
15:36:47 INFO> oc get projects wm4wu --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-28.kubeconfig
15:36:47 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-28
  creationTimestamp: "2021-04-11T15:36:45Z"
  labels:
    kubernetes.io/metadata.name: wm4wu
  name: wm4wu
  resourceVersion: "101308"
  uid: e53eaac5-f039-4fd5-8d19-470873d3b522
spec:
  finalizers:
  - kubernetes
status:
  phase: Active

Expected results:
2. The project is created with the correct annotations and pod creation succeeds.

Additional info:
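For quick triage, affected projects can be spotted by the absence of the `openshift.io/sa.scc.*` annotations that cluster-policy-controller normally writes (compare the positive example later in this bug). A minimal sketch — the helper name is ours and the `oc` usage is illustrative; only the annotation key is taken from the bug's own output:

```shell
# has_scc_annotations: reads project/namespace YAML on stdin and reports
# whether the SCC annotations written by cluster-policy-controller exist.
# We grep only for the uid-range key; mcs and supplemental-groups are set
# by the same controller sync, so one key is enough for a quick check.
has_scc_annotations() {
  if grep -q 'openshift.io/sa.scc.uid-range'; then
    echo "SCC annotations present"
  else
    echo "SCC annotations MISSING"
  fi
}

# Live usage against a cluster (illustrative):
#   oc get project wm4wu -o yaml | has_scc_annotations
```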
I have looked at the must-gather at [1] and below is what I see from the cpc logs:

2021-05-29T03:38:34.034240493Z E0529 03:38:34.034179 1 namespace_scc_allocation_controller.go:404] error syncing namespace, it will be retried: Put "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/apis/security.internal.openshift.io/v1/rangeallocations/scc-uid": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.034379593Z E0529 03:38:34.034181 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.034379593Z E0529 03:38:34.034181 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": read tcp 10.0.0.6:36846->10.0.0.4:6443: read: connection timed out
2021-05-29T03:38:34.069933790Z I0529 03:38:34.069625 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"zm4f8", Name:"zm4f8", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.086751589Z I0529 03:38:34.086704 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"7c7wq", Name:"7c7wq", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.104744987Z I0529 03:38:34.104691 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"fjxv2", Name:"fjxv2", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.127283885Z I0529 03:38:34.127243 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"fib1f", Name:"fib1f", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.149596683Z I0529 03:38:34.149554 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"ihslt", Name:"ihslt", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.164123882Z I0529 03:38:34.164059 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"dmaho", Name:"dmaho", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.184666680Z I0529 03:38:34.184485 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"blunn", Name:"blunn", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.206424779Z I0529 03:38:34.206293 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"mizgw", Name:"mizgw", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.222118077Z I0529 03:38:34.222007 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"2zgm9", Name:"2zgm9", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:38:34.242955076Z I0529 03:38:34.242899 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"dji27", Name:"dji27", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CreatedSCCRanges' created SCC ranges
2021-05-29T03:39:05.098170411Z I0529 03:39:05.098111 1 request.go:645] Throttling request took 1.059358448s, request: GET:https://api-int.qeci-21249.qe.azure.devcluste

Attaching must-gather as well here.
Created attachment 1789213 [details] cpc log file
Link to the must-gather log file: http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/must-gather-21918-740750081/
Negative example for the issue:

03:29:02 INFO> Shell Commands: oc new-project rvowr --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-36.kubeconfig
Now using project "rvowr" on server "https://api.qeci-21249.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

03:29:03 INFO> Exit Status: 0
03:29:04 INFO> oc get projects rvowr --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-36.kubeconfig
03:29:04 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-36
  creationTimestamp: "2021-05-29T03:29:03Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/description: {}
          f:openshift.io/display-name: {}
          f:openshift.io/requester: {}
      f:status:
        f:phase: {}
    manager: openshift-apiserver
    operation: Update
    time: "2021-05-29T03:29:03Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:finalizers: {}
    manager: openshift-controller-manager
    operation: Update
    time: "2021-05-29T03:29:03Z"
  name: rvowr
  resourceVersion: "70989"
  selfLink: /api/v1/namespaces/rvowr
  uid: 42b4d7a5-465a-4f2f-8b05-ffe820abaea0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
Positive example for the project:

04:45:01 INFO> Shell Commands: oc new-project 5le4w --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-8.kubeconfig
Now using project "5le4w" on server "https://api.qeci-21249.qe.azure.devcluster.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

04:45:02 INFO> Exit Status: 0
04:45:03 INFO> oc get projects 5le4w --output=yaml --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_testuser-8.kubeconfig
04:45:03 INFO> After 1 iterations and 1 seconds:
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: testuser-8
    openshift.io/sa.scc.mcs: s0:c84,c39
    openshift.io/sa.scc.supplemental-groups: 1007050000/10000
    openshift.io/sa.scc.uid-range: 1007050000/10000
  creationTimestamp: "2021-05-29T04:45:01Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:openshift.io/sa.scc.mcs: {}
          f:openshift.io/sa.scc.supplemental-groups: {}
          f:openshift.io/sa.scc.uid-range: {}
    manager: cluster-policy-controller
    operation: Update
    time: "2021-05-29T04:45:01Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/description: {}
          f:openshift.io/display-name: {}
          f:openshift.io/requester: {}
      f:status:
        f:phase: {}
    manager: openshift-apiserver
    operation: Update
    time: "2021-05-29T04:45:01Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:finalizers: {}
    manager: openshift-controller-manager
    operation: Update
    time: "2021-05-29T04:45:01Z"
  name: 5le4w
  resourceVersion: "159916"
  selfLink: /api/v1/namespaces/5le4w
  uid: 1afbe4c2-d576-4db5-b2b0-ba5ce906d1f8
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
As RamaKasturi said, when the issue occurs we can see error logs from cluster-policy-controller around that time:

2021-05-29T03:27:32.894628305Z E0529 03:27:32.894559 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:28:31.737376431Z E0529 03:28:31.737217 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:28:34.895234981Z E0529 03:28:34.895101 1 reconciliation_controller.go:166] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2021-05-29T03:29:33.737873174Z E0529 03:29:33.737726 1 resource_quota_controller.go:408] failed to discover resources: Get "https://api-int.qeci-21249.qe.azure.devcluster.openshift.com:6443/api?timeout=32s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I am having trouble reproducing this and did not get far by just looking at the logs. I tested on both the latest 4.7 and 4.8 CI clusters. Can you please specify how to reproduce this, or point me to the failed CI jobs?

The only failure I found is this job, which is a known flake:

[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]

e.g. in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-compact-4.6/1404355655778701312

But I couldn't find these failures on a scale similar to what is posted in this bug report.
I gave this another try.

First, a little background about the origin of the error:
1. A namespace is created.
2. cluster-policy-controller processes/syncs the namespace, and meanwhile the connection to the apiserver fails.
3. Since cluster-policy-controller couldn't allocate SCCs (security context constraints), all pods will fail on creation/admission in that namespace.

The resolution: cluster-policy-controller will eventually resolve the SCCs of the namespace, either by regaining its connection to the apiserver or, if the node was lost, through election of a new controller on another node; that new controller will repair the unresolved namespaces.

As for the failed pods that were unable to validate against any security context, it should be the responsibility of their controller (DaemonSet, etc.) to retry the creation. Suggestion: if it is a naked pod inside the test and the apiserver failure is expected, then the test should retry the creation once the namespace is fixed.

To test the cluster-policy-controller, I tried causing network failures, with either partial or complete connection loss to the node running the elected cluster-policy-controller. With a short connection loss of around 15-25s, namespace SCC resolution stops working, but after that period the controller recovers and resolves the SCC annotations. With a longer connection loss, the cluster-policy-controller either restarts (in case we disable the cluster-policy-controllers on the other nodes) or a new cluster-policy-controller is elected on another node. Either way, the SCC annotations are resolved properly after such an outage.
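The suggested retry can be sketched as a small polling loop that waits for the namespace to be repaired before (re)creating the pod. This is a sketch under assumptions, not the test framework's actual code: the helper name and timeout are ours, and the `oc` invocations are illustrative; only the annotation key comes from the bug's positive example.

```shell
# wait_for_scc: poll until a namespace carries the SCC uid-range annotation
# (written by cluster-policy-controller), or give up after $2 attempts.
wait_for_scc() {
  ns="$1"
  tries="${2:-60}"   # illustrative default: ~60s at 1s per attempt
  i=0
  while [ "$i" -lt "$tries" ]; do
    if oc get namespace "$ns" -o yaml | grep -q 'openshift.io/sa.scc.uid-range'; then
      return 0       # namespace repaired, safe to create pods
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for SCC annotations on namespace $ns" >&2
  return 1
}

# Illustrative usage in a test, retrying the naked pod after the fix:
#   wait_for_scc rvowr && oc create -f pod.yaml
```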
I have also managed to reproduce an error very similar to the one described in this issue:

E0616 18:54:50.205144 1 namespace_scc_allocation_controller.go:404] error syncing namespace, it will be retried: rpc error: code = Unavailable desc = transport is closing

This particular namespace got synced properly and annotated with the SCC annotations after the connection to the apiserver was repaired.

With this conclusion I am closing the issue.