Description of problem:

The authentication operator went degraded during a 4.7.13 to 4.7.16 update on the fast channel, with the error "cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied" on the authentication pod.

Version-Release number of selected component (if applicable): 4.7.16

How reproducible:

Customer has two clusters and hit the same issue on both.

Steps to Reproduce:
1. Set up a 4.7.13 cluster on Azure via IPI.
2. Install datadog via helm chart.
3. Upgrade the cluster to 4.7.16 via the fast 4.7 channel.
4. During the upgrade the authentication operator goes into a degraded state.
5. Checking the authentication operator pods shows that the annotation changed to openshift.io/scc: datadog-cluster-agent.
6. This produces the error "cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied".

Actual results:

The SCC is changed automatically from anyuid to datadog-cluster-agent.

Expected results:

The SCC should not be changed from anyuid to datadog-cluster-agent.

Additional info:

After changing the SCC back to anyuid manually, the cluster upgrade succeeded. The customer never changed the SCC of the authentication operator.
Please provide a must-gather so we can investigate.

From your description it appears that the datadog helm chart installs an SCC named "datadog-cluster-agent" with higher priority than anyuid.
Hello Team,

The must-gather is larger than the allowed attachment size (109 MB; 19 MB is allowed). "02965667" is the case number if you can pull it directly.

Yes, I also suspect the same regarding the priority, but that SCC was installed by default.
Another case with the same issue: "02963656".
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
>
> From your description it appears that the datadog helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Just wanted to check, is there any progress on this?

BR,
Pawan Kumar
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
>
> From your description it appears that the datadog helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

May I seek an update on this?

Regards,
Pawan Kumar
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
>
> From your description it appears that the datadog helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

Is there any progress on this?

BR,
Pawan
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
>
> From your description it appears that the datadog helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

Any further update on this?

BR,
Pawan
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
>
> From your description it appears that the datadog helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

The customer has been waiting on this for around 7 weeks; can we have an update, please? Either positive or negative, at least give us an update.

Regards,
Pawan
Hello, I just linked another customer case. They are seeing this issue when upgrading from 4.6.36 to 4.6.39.
We are still investigating the issue but do not yet have a hypothesis.

In any case, did the customer modify existing SCCs or create new ones?
One observation, and this needs a fix on its own: we are running authentication-operator as PID 1 (root) inside the container. In environments where this is not the case, the failure is obvious, as /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem is owned by root:

```
$ oc -n openshift-authentication-operator rsh authentication-operator-7748bbb7f-4vvc8
sh-4.4# ls -l /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
-r--r--r--. 1 root root 216090 Aug  6 09:03 /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
sh-4.4# ps auxf
USER   PID %CPU %MEM     VSZ    RSS TTY    STAT START  TIME COMMAND
root    17  0.0  0.0   12052   3384 pts/0  Ss   11:15  0:00 /bin/sh
root    32  0.0  0.0   44668   3332 pts/0  R+   11:16  0:00  \_ ps auxf
root     1  2.1  1.0 1585188 163072 ?      Ssl  09:03  2:53 authentication-operator operator --config=/var/run/configmaps/config/operator-config.yaml --v=2 --terminate-on-files=/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt
```
(In reply to Sergiusz Urbaniak from comment #10)
> We are still investigating the issue but do not yet have a hypothesis.
>
> In any case, did the customer modify existing SCCs or create new ones?

Hello Sergiusz,

Thanks for the update.

> In any case, did the customer modify existing SCCs or create new ones?

No, they haven't made any changes to the default SCCs. But they noticed that when they installed Datadog, the SCC changed automatically, and when they removed Datadog it changed back to the default one. There were no manual changes.

Regards,
Pawan Kumar
> But they noticed that when they installed Datadog, the SCC changed
> automatically, and when they removed Datadog it changed back to the default
> one. There were no manual changes.

Can you elaborate on this? What do you mean by "changed back to the default one"?
Also, given the example above with "oc rsh" and "ps auxf", can you check what user authentication-operator is running as when the failure occurs?
(In reply to Sergiusz Urbaniak from comment #14)
> Also, given the example above with "oc rsh" and "ps auxf", can you check
> what user authentication-operator is running as when the failure occurs?

In the client's words:

We noticed that the pods for openshift-authentication-operator and authentication-operator had the annotation openshift.io/scc: datadog-cluster-agent. (This was completely by chance, as we were glancing at the YAML and the annotation stuck out at us as odd.)

This leads us to the first investigation area: what is adding this annotation? Our only conclusion is that something in OpenShift authentication is trying to populate the SCC with anyuid but picking up the one from Datadog. We have no idea about this; we have lots of other SCCs on the cluster. (For background, Datadog is installed via helm chart.)

We then uninstalled Datadog from the cluster, which also removed its SCC (confirmed with `oc get scc`). We then deleted all the crashlooping pods for openshift-authentication and noticed that when they were re-created, they had the correct anyuid annotation. The cluster upgrade now seems to be progressing.

Let me know if this helps.

Regards,
Pawan Kumar
Please read about security context constraints as per https://docs.openshift.com/container-platform/4.8/authentication/managing-security-context-constraints.html.

If the user deploys SCCs with higher priority than the default out-of-the-box SCCs, those will be favored. We are thinking of refactoring SCCs so users don't have the chance to interleave custom ones, but this is not possible as of today.

I am leaving this open so we can pin openshift-authentication-operator to a concrete SCC, but this is formally not a bug.
(In reply to Sergiusz Urbaniak from comment #16)
> If the user deploys SCCs with higher priority than the default out-of-the-box
> SCCs, those will be favored. We are thinking of refactoring SCCs so users
> don't have the chance to interleave custom ones, but this is not possible as
> of today.
>
> I am leaving this open so we can pin openshift-authentication-operator to a
> concrete SCC, but this is formally not a bug.

Hello Sergiusz,

Thanks for the update, but the thing is: when Datadog is installed, why does it change the default SCC of openshift-authentication-operator from anyuid to the Datadog one? The default SCC shouldn't change unless the customer tweaks something manually, and the customer didn't change the priority of the Datadog SCC. So every time Datadog is installed, openshift-authentication-operator will fail because its SCC is changed automatically.

Let me know your thoughts on this.

Regards,
Pawan Kumar
Most likely the datadog SCC declares itself with a higher priority than the out-of-the-box ones. Can you send us the concrete datadog SCC definition?
(In reply to Sergiusz Urbaniak from comment #18)
> Most likely the datadog SCC declares itself with a higher priority than the
> out-of-the-box ones. Can you send us the concrete datadog SCC definition?

Hello Sergiusz,

I don't have any other data apart from the must-gather. The client has also removed Datadog to solve this issue, but they attached the must-gather before removing Datadog. You can fetch the must-gather from comment 5 of case 02965667.

Let me know if that works for you, and please share further thoughts on this.

Regards,
Pawan Kumar
sprint review: this issue is being worked on
Hello Sergiusz,

Any further update on this? I got two more cases, 03012297 and 03012287, for the same issue.

Regards,
Pawan Kumar
I debugged this further today, and indeed the given datadog-cluster-agent SCC has the same priority as anyuid (priority 10) but "wins" because it is more restrictive than anyuid. The audit log reveals:

```
"securitycontextconstraints.admission.openshift.io/reason": "\"datadog-cluster-agent\" is most restrictive, not denied, and chosen over \"anyuid\" because \"datadog-cluster-agent\" forbids host volume mounts",
```

Generally speaking, the way to fix this is to lower the priority of the datadog-cluster-agent SCC; tightening out-of-the-box constraints will lead to unexpected results.

I am still investigating whether we can lower the SCC surface area for authentication-operator.
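The ordering described above can be sketched roughly in Python. This is only an illustration under simplified assumptions — the `SCC` class, `restriction_score`, and `pick_scc` below are hypothetical stand-ins, not the real admission plugin code:

```python
# Rough illustration (NOT the actual OpenShift admission code) of SCC
# ordering: candidates are sorted by priority first, and a priority tie
# is broken in favor of the more restrictive SCC. restriction_score is a
# simplified heuristic invented for this example.
from dataclasses import dataclass

@dataclass
class SCC:
    name: str
    priority: int
    allows_host_volumes: bool = False
    allows_any_uid: bool = False

def restriction_score(scc: SCC) -> int:
    # Higher score = forbids more = more restrictive.
    return (not scc.allows_host_volumes) + (not scc.allows_any_uid)

def pick_scc(candidates: list[SCC]) -> SCC:
    # Priority descending, then restrictiveness descending, then name.
    return min(candidates,
               key=lambda s: (-s.priority, -restriction_score(s), s.name))

anyuid = SCC("anyuid", priority=10, allows_any_uid=True)
datadog = SCC("datadog-cluster-agent", priority=10)

# Equal priority, but datadog-cluster-agent forbids more, so it wins --
# which is why the operator pod lost its anyuid assignment.
print(pick_scc([anyuid, datadog]).name)   # datadog-cluster-agent

# Lowering the custom SCC's priority below anyuid's flips the choice.
datadog.priority = 9
print(pick_scc([anyuid, datadog]).name)   # anyuid
```

The real admission code scores restrictiveness across capabilities, host access, volumes, and user/SELinux strategies; the heuristic here only captures the principle that "the SCC forbidding more wins a priority tie".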
Hello Sergiusz,

Thank you very much for the update. Is there any workaround we can follow in this situation? I have a few clients facing this issue on the current version, and some of them cannot remove Datadog. Is there a way to reduce the priority of the Datadog SCC, and will doing so affect its functionality?

Regards,
Pawan
A PR is out which fixes the issue. We have to set runAsUser explicitly so anyuid continues to be picked up by the SCC algorithm.

QE: steps for verification:

1. Create the following "datadog-cluster-agent" SCC:

```
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: false
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
  annotations:
    meta.helm.sh/release-name: datadog
    meta.helm.sh/release-namespace: apmt-monitoring
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-2.16.5
  name: datadog-cluster-agent
priority: 10
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
```

2. Delete the authentication-operator pod:

```
$ oc -n openshift-authentication-operator delete pod authentication-operator-5c746c5d85-6hs8x
```

3. Check that the new authentication-operator pod started successfully:

```
$ oc -n openshift-authentication-operator get pod
NAME                                       READY   STATUS    RESTARTS   AGE
authentication-operator-5c746c5d85-g6257   1/1     Running   0          2m12s
```
@pawankum Yes, if you lower the datadog-cluster-agent priority to 9 (less than anyuid's priority of 10), then the authentication-operator will start. That is definitely a viable workaround.
Thank you, Sergiusz. Appreciate your help with this case.

Regards,
Pawan Kumar
Followed the steps in Comment 25 for verification:

1. When tested in a fresh cluster on 4.9.0-0.nightly-2021-08-22-070405, which includes the fix, the new authentication-operator pod started successfully, as expected.

2. When tested in a fresh cluster on 4.9.0-0.nightly-2021-08-18-033031, which does NOT include the fix, at step 3 the new authentication-operator pod goes into CrashLoopBackOff:

```
$ oc get pods -n openshift-authentication-operator
NAME                                       READY   STATUS             RESTARTS         AGE
authentication-operator-66c9c7cc95-bqpx4   0/1     CrashLoopBackOff   11 (3m56s ago)   35m
```

The authentication operator pod logs show the same error as in Comment 1:

```
$ oc logs -f authentication-operator-66c9c7cc95-bqpx4 -n openshift-authentication-operator
Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759