Bug 1973005 - authentication operator degraded during 4.7.16 update
Summary: authentication operator degraded during 4.7.16 update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Sergiusz Urbaniak
QA Contact: liyao
URL:
Whiteboard:
Depends On:
Blocks: 2003632
 
Reported: 2021-06-17 05:22 UTC by pawankum
Modified: 2021-11-17 05:38 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:34:57 UTC
Target Upstream Version:
Embargoed:




Links
Github: openshift cluster-authentication-operator pull 472 (last updated 2021-08-18 09:42:36 UTC)
Red Hat Product Errata: RHSA-2021:3759 (last updated 2021-10-18 17:35:22 UTC)

Description pawankum 2021-06-17 05:22:03 UTC
Description of problem:
The authentication operator degraded during a 4.7.13 to 4.7.16 update on the fast channel, with the error "cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied" on the authentication pod.

Version-Release number of selected component (if applicable):
4.7.16

How reproducible:
The customer has two clusters and hit the same issue on both.


Steps to Reproduce:
1. Set up a 4.7.13 cluster on Azure via IPI.
2. Install Datadog via its Helm chart.
3. Try to upgrade the cluster to 4.7.16 via the 4.7 fast channel.
4. During the upgrade, the authentication operator goes into a degraded state.
5. Checking the authentication operator pods further, the openshift.io/scc annotation had changed to datadog-cluster-agent.
6. The pods then fail with the error "cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied".

Actual results:
The pod's SCC changes automatically from anyuid to datadog-cluster-agent.


Expected results:
The SCC should not change from anyuid to datadog-cluster-agent.


Additional info:
After changing the SCC back to anyuid manually, the cluster upgrade completed successfully. The customer never changed the SCC of the authentication operator.

Comment 1 Sergiusz Urbaniak 2021-06-17 07:03:26 UTC
Please provide a must-gather so we can investigate.

From your description it appears that the Datadog Helm chart installs an SCC named "datadog-cluster-agent" with higher priority than anyuid.

Comment 2 pawankum 2021-06-17 07:29:37 UTC
Hello Team,

The must-gather is larger than the allowed attachment size (109 MB; the limit is 19 MB). The case number is 02965667 if you can pull it directly.

Yes, I also suspect the same regarding the priority, but that SCC was installed by default.

Comment 3 pawankum 2021-06-17 09:58:37 UTC
Another case with the same issue: 02963656.

Comment 4 pawankum 2021-06-28 04:37:45 UTC
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
> 
> From your description it appears that the Datadog Helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Just wanted to check, is there any progress on this?


BR
Pawan Kumar

Comment 5 pawankum 2021-07-07 10:39:14 UTC
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
> 
> From your description it appears that the Datadog Helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

May I seek an update on this?



Regards,
Pawan Kumar

Comment 6 pawankum 2021-07-13 08:23:28 UTC
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
> 
> From your description it appears that the Datadog Helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

Is there any progress on this?



BR
Pawan

Comment 7 pawankum 2021-07-23 11:45:02 UTC
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
> 
> From your description it appears that the Datadog Helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

Any further update on this?



BR,
Pawan

Comment 8 pawankum 2021-08-03 11:23:03 UTC
(In reply to Sergiusz Urbaniak from comment #1)
> Please provide a must-gather so we can investigate.
> 
> From your description it appears that the Datadog Helm chart installs an SCC
> named "datadog-cluster-agent" with higher priority than anyuid.

Hello Team,

The customer has been waiting on this for around 7 weeks; can we please have an update? Either positive or negative, at least give us an update.


Regards
Pawan

Comment 9 Tony Garcia 2021-08-03 22:23:06 UTC
Hello,

I just linked another customer case. They are seeing this issue when upgrading from 4.6.36 to 4.6.39.

Comment 10 Sergiusz Urbaniak 2021-08-06 11:12:34 UTC
We are still investigating the issue but do not yet have a hypothesis.

In any case, did the customer modify existing SCCs or create new ones?

Comment 11 Sergiusz Urbaniak 2021-08-06 11:18:42 UTC
One observation, and this needs a fix on its own: we run authentication-operator as PID 1 (root) inside the container. In environments where that is not the case, the failure is obvious, as /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem is owned by root.

```
$ oc -n openshift-authentication-operator rsh authentication-operator-7748bbb7f-4vvc8
sh-4.4# ls -l /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem 
-r--r--r--. 1 root root 216090 Aug  6 09:03 /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem

sh-4.4# ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          17  0.0  0.0  12052  3384 pts/0    Ss   11:15   0:00 /bin/sh
root          32  0.0  0.0  44668  3332 pts/0    R+   11:16   0:00  \_ ps auxf
root           1  2.1  1.0 1585188 163072 ?      Ssl  09:03   2:53 authentication-operator operator --config=/var/run/configmaps/config/operator-config.yaml --v=2 --terminate-on-files=/var/run/configmaps/trusted-ca-bundle/ca-bundle.crt
```

Comment 12 pawankum 2021-08-06 11:27:10 UTC
(In reply to Sergiusz Urbaniak from comment #10)
> We are still investigating the issue but are not having yet a hypothesis.
> 
> In any case, did the customer modify existing SCCs or create new ones?

Hello Sergiusz,

Thanks for the update.

> In any case, did the customer modify existing SCCs or create new ones?

No, they haven't made any changes to the default SCCs. But they noticed that when they installed Datadog, it got changed automatically, and when they removed Datadog, it changed back to the default one. There were no manual changes.



Regards,
Pawan Kumar

Comment 13 Sergiusz Urbaniak 2021-08-06 11:30:43 UTC
> But they noticed that when they installed Datadog, it got changed automatically, and when they removed Datadog, it changed back to the default one. There were no manual changes.

Can you elaborate on this? What do you mean by "changed back to the default one"?

Comment 14 Sergiusz Urbaniak 2021-08-06 11:31:26 UTC
Also, given the example above with "oc rsh" and "ps auxf", can you check what user authentication-operator is running as when the failure occurs?
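
For example, something along these lines should show both (a sketch; the pod name is a placeholder for whatever the current replica is called):

```
# Which SCC was used to admit the pod (annotation set at admission time):
oc -n openshift-authentication-operator get pod <authentication-operator-pod> -o yaml | grep 'openshift.io/scc'

# Which user the operator process is running as inside the container:
oc -n openshift-authentication-operator rsh <authentication-operator-pod> id -u
```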

Comment 15 pawankum 2021-08-06 11:35:16 UTC
(In reply to Sergiusz Urbaniak from comment #14)
> also, given the example above with "oc rsh" and "ps auxf", can you check
> what user authentication-operator is running as when the failure occurs?

In the client's words:

We noticed that the openshift-authentication-operator and authentication-operator pods had the annotation openshift.io/scc: datadog-cluster-agent. (We found this completely by chance, as we were glancing at the YAML and the annotation stuck out as odd.)

This leads us to investigation area #1: what is adding this annotation? Our only conclusion was that something in OpenShift authentication is trying to populate the SCC with anyuid but is picking up the one from Datadog. We have no idea why. We have lots of other SCCs on the cluster.
(For background, Datadog is installed via a Helm chart.)

We then uninstalled Datadog from the cluster, which also removed its SCC (confirmed with "oc get scc").

We then deleted all the crashlooping pods for OpenShift authentication and noticed that when they were re-created, they had the correct anyuid annotation.
The cluster upgrade now seems to be progressing.


Let me know if this helps.



Regards,
Pawan Kumar

Comment 16 Sergiusz Urbaniak 2021-08-06 12:54:50 UTC
Please read about security context constraints as per https://docs.openshift.com/container-platform/4.8/authentication/managing-security-context-constraints.html.

If a user deploys SCCs with a higher priority than the default out-of-the-box SCCs, those will be favored. We are thinking of refactoring SCCs so users don't have the chance to interleave custom ones, but this is not possible as of today.

I am leaving this open so we can pin openshift-authentication-operator to a concrete SCC, but formally this is not a bug.
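
For reference, something along these lines lists the priorities of all SCCs on the cluster, which should make it obvious whether a custom SCC outranks or ties with the defaults (a sketch; the custom-columns expressions assume the standard SCC fields):

```
oc get scc -o custom-columns=NAME:.metadata.name,PRIORITY:.priority,RUNASUSER:.runAsUser.type
```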

Comment 17 pawankum 2021-08-07 05:09:04 UTC
(In reply to Sergiusz Urbaniak from comment #16)
> Please read about security context constraints as per
> https://docs.openshift.com/container-platform/4.8/authentication/managing-
> security-context-constraints.html.
> 
> If the user deploys SCCs having higher priority than default out of the box
> SCCs then these will be favored. We are thinking of refactoring SCCs so
> users don't have the chance to interleave custom ones but this is not
> possible as of today.
> 
> I am leaving this open so we can pin openshift-authentication-operator to a
> concrete SCC but this is formally not a bug.

Hello Sergiusz,

Thanks for the update, but the question is: when Datadog is installed, why does the SCC of openshift-authentication-operator change from anyuid to the Datadog one? It should not change unless the customer tweaks something manually, and the customer did not change the priority of the Datadog SCC. So every time Datadog is installed, openshift-authentication-operator will fail because its SCC is changed automatically.

Let me know your thoughts on this. 



Regards,
Pawan Kumar

Comment 18 Sergiusz Urbaniak 2021-08-09 14:37:13 UTC
Most likely the Datadog SCC declares itself with a higher priority than the out-of-the-box ones. Can you send us the concrete Datadog SCC definition?

Comment 19 pawankum 2021-08-12 10:36:26 UTC
(In reply to Sergiusz Urbaniak from comment #18)
> most likely datadog SCC declares itself with higher priority than the
> out-of-the box ones. Can you actually send us the concrete datadog SCC
> definition?

Hello Sergiusz,

I don't have any other data apart from the must-gather. The client has also removed Datadog to resolve the issue, but they attached a must-gather taken before removing Datadog. You can fetch the must-gather from comment 5 of case 02965667.

Let me know if that works for you, and please share further thoughts.



Regards,
Pawan Kumar

Comment 20 Sergiusz Urbaniak 2021-08-16 12:12:51 UTC
Sprint review: this issue is being worked on.

Comment 21 pawankum 2021-08-17 05:08:43 UTC
Hello Sergiusz,

Any further update on this? I have two more cases, 03012297 and 03012287, with the same issue.



Regards,
Pawan Kumar

Comment 23 Sergiusz Urbaniak 2021-08-17 14:38:46 UTC
I debugged this further today, and indeed the given datadog-cluster-agent SCC has the same priority as anyuid (priority 10) but "wins" because it is more restrictive than anyuid.

The audit log reveals:

```
"securitycontextconstraints.admission.openshift.io/reason": "\"datadog-cluster-agent\" is most restrictive, not denied, and chosen over \"anyuid\" because \"datadog-cluster-agent\" forbids host volume mounts",
```

Generally speaking, the way to fix this is to lower the priority of the datadog-cluster-agent SCC. Tightening the out-of-the-box constraints will lead to unexpected results.

I am still investigating whether we can reduce the SCC surface area for authentication-operator.
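
For anyone debugging the same situation, the admission reason above can be pulled from the kube-apiserver audit logs roughly like this (a sketch; <master-node> is a placeholder and the log path follows the OpenShift audit-log documentation):

```
oc adm node-logs <master-node> --path=kube-apiserver/audit.log \
  | grep 'securitycontextconstraints.admission.openshift.io/reason' \
  | grep authentication-operator
```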

Comment 24 pawankum 2021-08-18 10:49:48 UTC
Hello Sergiusz,

Thank you very much for providing the update. 

I just want to know if there is any workaround we can follow for this situation. I have a few clients facing this issue on the current version, and some of them cannot remove Datadog. Is there any way we can reduce the priority of the Datadog SCC? Will it affect functionality?


Regards
Pawan

Comment 25 Sergiusz Urbaniak 2021-08-18 10:59:09 UTC
A PR is out which fixes the issue. We have to set runAsUser explicitly so that anyuid continues to be picked up by the SCC selection algorithm.
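
For illustration only (the exact change is in the linked PR), once the fix lands the operator deployment should carry an explicit runAsUser in its container securityContext; something like this can be used to confirm it on a fixed cluster (field path assumed):

```
# Sketch: print the container securityContext of the operator deployment
oc -n openshift-authentication-operator get deployment authentication-operator \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}{"\n"}'
```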

QE: steps for verification:

1. Create the following "datadog-cluster-agent" SCC:

allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: false
allowPrivilegedContainer: false
allowedCapabilities: null
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups: []
kind: SecurityContextConstraints
metadata:
  annotations:
    meta.helm.sh/release-name: datadog
    meta.helm.sh/release-namespace: apmt-monitoring
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-2.16.5
  name: datadog-cluster-agent
priority: 10
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret

2. Delete the authentication-operator pod:

$ oc -n openshift-authentication-operator delete pod authentication-operator-5c746c5d85-6hs8x

3. Check that the new authentication-operator pod started successfully:

$ oc -n openshift-authentication-operator get pod 
NAME                                       READY   STATUS    RESTARTS   AGE
authentication-operator-5c746c5d85-g6257   1/1     Running   0          2m12s

Comment 26 Sergiusz Urbaniak 2021-08-18 11:01:47 UTC
@pawankum yes, if you reset the datadog-cluster-agent priority to 9 (less than anyuid), authentication-operator will start. That is definitely a viable workaround.
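
If lowering the priority works for the customer, a one-liner along these lines should do it (illustrative sketch; note that the Datadog Helm chart may set the value back on its next upgrade):

```
# Lower the custom SCC below anyuid, which has priority 10 (see comment 23)
oc patch scc datadog-cluster-agent --type=merge -p '{"priority": 9}'
```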

Comment 27 pawankum 2021-08-18 11:14:23 UTC
Thank you Sergiusz,

Appreciated your help in this case.



Regards,
Pawan Kumar

Comment 29 liyao 2021-08-23 07:11:20 UTC
Followed the steps in Comment 25 for verification.

1. When tested on a fresh cluster 4.9.0-0.nightly-2021-08-22-070405 that includes the fix, the new authentication-operator pod started successfully, as expected.

2. When tested on a fresh cluster 4.9.0-0.nightly-2021-08-18-033031 that does NOT include the fix:
In step 3, the new authentication-operator pod is in CrashLoopBackOff status:
$ oc get pods -n openshift-authentication-operator
NAME                                       READY   STATUS             RESTARTS         AGE
authentication-operator-66c9c7cc95-bqpx4   0/1     CrashLoopBackOff   11 (3m56s ago)   35m

Checking the authentication operator pod logs shows the same error reported in the original description:
$ oc logs -f authentication-operator-66c9c7cc95-bqpx4 -n openshift-authentication-operator
Copying system trust bundle
cp: cannot remove '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem': Permission denied

Comment 34 errata-xmlrpc 2021-10-18 17:34:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

