Bug 1868976 - Prometheus error opening query log file on EBS backed PVC
Summary: Prometheus error opening query log file on EBS backed PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: hongyan li
URL:
Whiteboard:
Depends On:
Blocks: 1900988
 
Reported: 2020-08-14 18:53 UTC by Naveen Malik
Modified: 2021-05-25 17:05 UTC (History)
CC: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: In certain scenarios, elevating SCCs (security context constraints) caused Prometheus statefulset deployments to fail. If SCCs were elevated globally, the user and group IDs for filesystem mounts would change, which caused "permission denied" errors for Prometheus at startup.
Consequence: Prometheus cannot start and crash-loops.
Fix: The `nonroot` SCC is now used for monitoring statefulset deployments. This requires configuring the following Kubernetes security context settings for all monitoring statefulset deployments:
  securityContext:
    fsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
Result:
a) All monitoring statefulset deployments (Alertmanager, Prometheus, Thanos Ruler) run as the "nobody" user (ID 65534).
b) The filesystem group ID is set to the "nobody" ID 65534 as well. The kubelet recursively sets the group ID at pod startup, as documented at https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods: "By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod's securityContext when that volume is mounted."
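For reference, the settings above would appear in a statefulset pod template roughly as follows (an illustrative sketch of standard Kubernetes layout, not copied from the actual cluster-monitoring-operator manifests):

```yaml
# Illustrative pod-template fragment for a monitoring statefulset (sketch).
spec:
  template:
    spec:
      securityContext:
        fsGroup: 65534       # kubelet chowns volume contents to this GID at mount
        runAsNonRoot: true   # refuse to start if the container would run as root
        runAsUser: 65534     # "nobody" user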
Clone Of:
: 1900988 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:15:36 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 981 0 None closed Bug 1868976: jsonnet: configure SCCs 2021-02-16 20:42:28 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:16:17 UTC

Description Naveen Malik 2020-08-14 18:53:17 UTC
Description of problem:
OSD 4.4.11 customer cluster had monitoring stack fail when prometheus could not open the query log file "/prometheus/queries.active".


After reviewing the file and directory permissions, SRE decided to start fresh and deleted the Prometheus PVCs, PVs, and EBS volumes.  The cluster-monitoring-operator was scaled down to 0, then up to 1, to recreate the prometheus-k8s-{0,1} pods.  The exact same problem persisted with a fresh volume.


Version-Release number of selected component (if applicable):
OCP 4.4.11


How reproducible:
100% on this one cluster.


Steps to Reproduce:
1. Configure prometheus to use persistent storage: https://github.com/openshift/managed-cluster-config/blob/master/deploy/cluster-monitoring-config/cluster-monitoring-config.yaml#L16-L22

Actual results:
The "prometheus" container cannot start; it fails to open the query log file.


Expected results:
The "prometheus" container starts.


Additional info:

Must gather will be attached in private comment.

We have the cluster running without persistent storage for the weekend and expect the issue can be reproduced by simply reverting to the standard config.  This will be done Monday to see if the problem persists.

Comment 2 Dustin Row 2020-08-19 21:40:48 UTC
Hello, just following up on this. Has someone looked into this yet? If so, can you let us know whether you think it's a problem with the customer's configuration or an issue with the prometheus operator itself?

Comment 4 Simon Pasquier 2020-08-24 14:56:52 UTC
I've just looked at the must-gather and I couldn't see anything suspicious. The initialization of the query tracker is one of the first things that Prometheus does when starting so it might be that the whole volume isn't writable by the Prometheus process. Could you get access to the volume and check its permissions?

Comment 15 Simon Pasquier 2020-09-08 14:48:31 UTC
Closing since we can't reproduce the issue ourselves. As I wrote earlier, neither CMO nor prometheus-operator nor prometheus modifies the permissions/ownership of the volume. If it happened, it is very likely that the volume has been mounted by another pod at some time which has changed the ownership.

Comment 16 Naveen Malik 2020-09-08 15:46:51 UTC
@Simon I deleted the PV and PVC for prometheus in an attempt to reset and get this working.  It didn't help.  This is a production customer cluster, so I hesitate to monkey around with the configuration, but at this time we have no persistent storage on the cluster for prometheus.  If there are specific things you'd like us to collect about the pods and storage while it's in the failed state, please let me know and we can do that.  I assume it'll be broken again if we turn persistent storage back on.

Comment 25 Christoph Blecker 2020-09-18 19:23:00 UTC
In both cases where we have seen this, the customer had granted the `system:authenticated` group access to an SCC with the `RunAsAny` capability.

In one case, this was done by granting access to the default anyuid SCC via RBAC. In the other, a new SCC was created, and bound to system:authenticated.

We need:
- a short-term fix to allow PVs to be created with the right permissions for Prometheus
- a long-term fix so that even if Prometheus has access to more permissive SCCs, it doesn't get into a broken state

This should be reproducible with the following:
oc adm policy add-scc-to-group anyuid system:authenticated
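And, for completeness, reverting the change after testing (a sketch; assumes the grant was made exactly as above):

```shell
# Remove the anyuid grant added for the reproducer
oc adm policy remove-scc-from-group anyuid system:authenticated
```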

Comment 27 Sergiusz Urbaniak 2020-09-21 07:02:43 UTC
Thank you for the thorough investigation! As I understand it, a fix here would involve setting a proper SCC for prometheus (and potentially all other statefulsets); as far as I can see, that would be `openshift.io/scc: restricted`?

Comment 28 Clayton Coleman 2020-09-23 13:18:30 UTC
I think this has broader implications that have not been considered yet.  Granting an additional permission to a class of workloads and thereby breaking them is unexpected; in almost every user security model, adding permissions has no negative effects.  Unfortunately, the difference in defaulting between anyuid and restricted causes the addition of anyuid to have a side effect on running workloads.  The underlying bug is that granting a broader SCC changes defaulting on existing workloads; an admin doing this may break infrastructure workloads or her own workloads.  It may be that we need to split defaulting and restriction capabilities so that only changing the defaulting rules would have an impact on workloads.  Also, the underlying assumption of SCC was that defaulting was OK if the user didn't specify a UID, but that assumption is not correct in the presence of PVs or other on-disk entities.  The right outcome would be that admins are safe to grant anyuid to workloads without breaking things.

Comment 29 Christoph Blecker 2020-09-23 21:38:30 UTC
I've done some further investigation into the root cause:

The `restricted` SCC has .fsGroup.type set to MustRunAs. This causes the .spec.securityContext.fsGroup of pods to be set from the supplemental groups range of the namespace (e.g. 1000260000), and when the container starts, the /prometheus volume is rewritten with those uid/gid permissions.

When the SCC is swapped to `anyuid`, the fsGroup setting is removed and the uid/gid rewrite never happens. This means the container is running with its default "nobody" user (uid/gid 99), while the volume permissions are still set to the old dynamic uid/gid (e.g. 1000260000). This is what causes the permissions issue.

If the permissions of the volume are rewritten to have the uid/gid of 99 (nobody user), then the volume works as before.
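A sketch of how one could confirm the mismatch from a debug copy of the failing pod (the pod name and mount path are from this report; the exact debug workflow may vary):

```shell
# Start a debug copy of the failing pod and inspect ownership of /prometheus.
# Under anyuid the process runs as uid/gid 99 (nobody), while the volume
# would still carry the old dynamic gid (e.g. 1000260000).
oc -n openshift-monitoring debug pod/prometheus-k8s-0 -- ls -ldn /prometheus
```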


As such, the following semi-permanent workaround will restore working Prometheus volume permissions while still using the anyuid SCC:

1. Ensure you have cluster-admin permissions

2. Scale down CVO/CMO/Prom operators (ensures that changes to the prometheus statefulset are not overwritten)
oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=0
oc -n openshift-monitoring scale deploy/cluster-monitoring-operator --replicas=0
oc -n openshift-monitoring scale deploy/prometheus-operator --replicas=0

3. Wait for all operator pods to terminate

4. Patch fsGroup on prometheus statefulset (this will rewrite the permissions to use the uid/gid of 99)
oc -n openshift-monitoring patch statefulset/prometheus-k8s -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup": 99}}}}}'

5. Wait for prometheus pods to restart and come up

6. Restore CVO operator (which will bring up other operators)
oc -n openshift-cluster-version scale deploy/cluster-version-operator --replicas=1

7. Wait for prometheus pods to restart one final time

Comment 30 Simon Pasquier 2020-09-24 11:03:08 UTC
Would it make sense to grant the nonroot SCC to the prometheus service account and run the prometheus pods with an arbitrary user id (e.g. "securityContext: {runAsUser: 1000, fsGroup: 2000}")?
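As a sketch, the suggestion above would look roughly like this in the pod spec (the concrete IDs 1000/2000 are example values from the comment; the eventual fix shipped with 65534, per the doc text):

```yaml
# Example only: arbitrary non-root IDs, not the values that shipped.
securityContext:
  runAsUser: 1000   # arbitrary non-root uid (example value)
  fsGroup: 2000     # kubelet chowns volume contents to this gid at mount
```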

Comment 31 Sergiusz Urbaniak 2020-10-21 13:25:14 UTC
This needs a broader architectural discussion on how SCC settings are handled in conjunction with statefulsets.

Comment 32 Sergiusz Urbaniak 2020-10-21 13:31:42 UTC
Reassigning to me to clarify with architects how to proceed.

Comment 37 Sergiusz Urbaniak 2020-11-13 09:05:49 UTC
UpcomingSprint: This issue is planned for the upcoming sprint (193).

Comment 39 hongyan li 2020-11-24 12:31:57 UTC
Test with payload 4.7.0-0.nightly-2020-11-23-195308

[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml|grep scc
    openshift.io/scc: nonroot

Comment 40 hongyan li 2020-11-24 12:47:29 UTC
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-1 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-0 -oyaml|grep scc
    openshift.io/scc: nonroot
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-1 -oyaml|grep scc
    openshift.io/scc: nonroot

Comment 41 hongyan li 2020-11-24 12:48:05 UTC
(duplicate of comment 40)

Comment 42 hongyan li 2020-11-24 12:53:22 UTC
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod thanos-ruler-user-workload-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml|grep ' fsGroup'
    fsGroup: 65534
[hongyli@hongyli-fed Downloads]$ oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml|grep ' fsGroup'
    fsGroup: 65534

Comment 46 errata-xmlrpc 2021-02-24 15:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

