Description of problem:
OSD 4.4.11 customer cluster had the monitoring stack fail when Prometheus could not open the query log file "/prometheus/queries.active". After reviewing the file and directory permissions, SRE decided to start fresh and deleted the Prometheus PVCs, PVs, and EBS volumes. The cluster-monitoring-operator was scaled down to 0 and back up to 1 to recreate the prometheus-k8s-{0,1} pods. The exact same problem persists with a fresh volume.

Version-Release number of selected component (if applicable):
OCP 4.4.11

How reproducible:
100% on this one cluster.

Steps to Reproduce:
1. Configure Prometheus to use persistent storage: https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L16-L22

Actual results:
The "prometheus" container cannot start; it fails to open the query log file.

Expected results:
The "prometheus" container starts.

Additional info:
A must-gather will be attached in a private comment. We have the cluster running without persistent storage for the weekend and expect the issue can be reproduced by simply reverting to the standard configuration. This will be done Monday to see if the problem persists.
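For context, the linked managed-cluster-config file configures persistent storage for Prometheus roughly along these lines (a minimal sketch only; the storage class name and size below are illustrative placeholders, not the exact values from that repository):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2   # illustrative; the real class depends on the cluster
          resources:
            requests:
              storage: 100Gi      # illustrative size
EOF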
Hello, just following up on this. Has someone looked into this yet? If so, can you let us know whether you think it's a problem with the customer's configuration or an issue with the prometheus operator itself?
I've just looked at the must-gather and I couldn't see anything suspicious. The initialization of the query tracker is one of the first things that Prometheus does when starting so it might be that the whole volume isn't writable by the Prometheus process. Could you get access to the volume and check its permissions?
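If it helps, one way to check the volume from the pod side might be something like the following (a sketch only; since the prometheus container is crash-looping, exec won't work, so this spins up a debug copy of the pod, which may fail to attach the RWO volume if the original pod still holds it):

oc -n openshift-monitoring debug pod/prometheus-k8s-0 --container=prometheus -- ls -ldn /prometheus /prometheus/queries.active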
Closing since we can't reproduce the issue ourselves. As I wrote earlier, neither CMO nor prometheus-operator nor prometheus modifies the permissions/ownership of the volume. If that happened, it is very likely that the volume was mounted at some point by another pod which changed the ownership.
@Simon I deleted the PV and PVC for prometheus in an attempt to reset and get this working. It didn't help. This is a production customer cluster, so I hesitate to monkey around with the configuration, but at this time we have no persistent storage for prometheus on the cluster so that we can still get alerts. If you have specific things you'd like us to collect about the pods and storage while it's in the failed state, please let me know and we can do that. I assume it'll be broken again once we turn persistent storage back on.
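In the meantime, here is the kind of state we could capture the next time persistent storage is re-enabled and the pod fails (a sketch of collection commands; names may need adjusting):

oc -n openshift-monitoring get pod prometheus-k8s-0 -o yaml
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus --previous
oc -n openshift-monitoring get pvc
oc get pv
oc -n openshift-monitoring get statefulset prometheus-k8s -o yaml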
In both cases where we have seen this, the customer granted the `system:authenticated` group access to an SCC with the `RunAsAny` capability. In one case this was done by granting access to the default anyuid SCC via RBAC; in the other, a new SCC was created and bound to system:authenticated.

We need:
- a short-term fix so that PVs are created with the right permissions for Prometheus
- a long-term fix so that even if Prometheus has access to more permissive SCCs, it does not get into a broken state

This should be reproducible with the following:

oc adm policy add-scc-to-group anyuid system:authenticated
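For anyone reproducing this: the SCC a pod was actually admitted under shows up in its annotations, and the grant can be reverted the same way it was added (sketch only):

oc -n openshift-monitoring get pod prometheus-k8s-0 -o yaml | grep 'openshift.io/scc'
oc adm policy remove-scc-from-group anyuid system:authenticated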
Thank you for the thorough investigation! The way I understand it, a fix here would involve setting a proper SCC for prometheus (and potentially all other statefulsets), and as far as I can see that would be `openshift.io/scc: restricted`?
I think this has broader implications that have not been considered yet. Granting an additional permission to a class of workloads and thereby breaking them is unexpected; in almost every user security model, adding permissions has no negative effects. Unfortunately, the difference in defaulting between anyuid and restricted means that granting anyuid has a side effect on running workloads.

The underlying bug is that granting a broader SCC changes the defaulting applied to existing workloads. An admin doing this may break infrastructure workloads or her own workloads. It may be that we need to split the defaulting and restriction capabilities so that only changing the defaulting rules has an impact on workloads. Also, the underlying assumption of SCCs was that defaulting was fine as long as the user didn't specify a UID, but that assumption does not hold in the presence of PVs or other on-disk entities.

The right outcome would be that admins can safely grant anyuid to workloads without breaking things.
I've done some further investigation into the root cause.

The `restricted` SCC has .fsGroup.type set to MustRunAs. This causes the .spec.securityContext.fsGroup of pods to be set from the supplemental groups range of the namespace (e.g. 1000260000), and when the container starts, the /prometheus volume is rewritten with that uid/gid. When the SCC is swapped to `anyuid`, the fsGroup setting is removed and the uid/gid rewrite never happens. The container then runs as its default "nobody" user (uid/gid 99) while the volume permissions are still set to the old dynamic uid/gid (e.g. 1000260000). This is what causes the permissions issue. If the permissions of the volume are rewritten to uid/gid 99 (the nobody user), the volume works as before.

As such, the following semi-permanent workaround will restore working Prometheus volume permissions while still using the anyuid SCC:

1. Ensure you have cluster-admin permissions.

2. Scale down the CVO/CMO/prometheus operators (this ensures that changes to the prometheus statefulset are not overwritten):

oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=0
oc -n openshift-monitoring scale deploy/cluster-monitoring-operator --replicas=0
oc -n openshift-monitoring scale deploy/prometheus-operator --replicas=0

3. Wait for all operator pods to terminate.

4. Patch fsGroup on the prometheus statefulset (this will rewrite the permissions to use uid/gid 99):

oc -n openshift-monitoring patch statefulset/prometheus-k8s -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup": 99}}}}}'

5. Wait for the prometheus pods to restart and come up.

6. Restore the CVO (which will bring the other operators back up):

oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=1

7. Wait for the prometheus pods to restart one final time.
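To confirm the patch took effect, something along these lines should show the new fsGroup on the statefulset and the rewritten ownership of the data directory once the pod is running again (sketch only):

oc -n openshift-monitoring get statefulset prometheus-k8s -o jsonpath='{.spec.template.spec.securityContext.fsGroup}{"\n"}'
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- ls -ldn /prometheus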
Would it make sense to grant the nonroot SCC to the prometheus service account and run the prometheus pods with an arbitrary user id (e.g. "securityContext: {runAsUser: 1000, fsGroup: 2000}")?
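For illustration, that suggestion would amount to something like the following (a sketch only; the uid/gid values 1000/2000 are just the arbitrary examples from the comment above, and the real change would have to go through CMO/prometheus-operator rather than a manual patch):

oc adm policy add-scc-to-user nonroot -z prometheus-k8s -n openshift-monitoring
oc -n openshift-monitoring patch statefulset/prometheus-k8s -p '{"spec":{"template":{"spec":{"securityContext":{"runAsUser": 1000, "fsGroup": 2000}}}}}'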
This needs a broader architectural discussion on how SCC settings are handled in conjunction with statefulsets.
Reassigning to me to clarify with architects how to proceed.
UpcomingSprint: This issue is planned for the upcoming sprint (193).
Tested with payload 4.7.0-0.nightly-2020-11-23-195308:

[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633