Please provide the rendered template with `oc get pod -n test-prometheus prometheus-0`
To add color here, this is the result of a file permission error within the container: the user the container process is running as does not have permission on the working directory defined for the container image. The difference in behavior is caused by a change in user; it possibly ran as root in 3.11 and runs as non-root in 4.x.
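A quick way to confirm the user mismatch on 4.x is to compare the UID assigned by the restricted SCC with the image's own defaults; a minimal sketch, assuming the pod and namespace names from the comment above:

```shell
# UID the restricted SCC assigned to the container (a high, non-root UID on 4.x):
oc get pod -n test-prometheus prometheus-0 \
  -o jsonpath='{.spec.containers[0].securityContext.runAsUser}{"\n"}'

# Image metadata (default user, working dir, etc.), for comparison:
oc image info quay.io/prometheus/prometheus:v2.20.1
```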
I suspect this might have something to do with the Dockerfile that creates the image having a VOLUME at the WORKDIR, and some variation in how Docker and CRI-O handle that. https://github.com/prometheus/prometheus/blob/master/Dockerfile#L23
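If it helps, the pairing can be checked straight from the upstream repo (the raw URL below just mirrors the Dockerfile linked above):

```shell
# Show the WORKDIR and VOLUME directives the comment above refers to:
curl -s https://raw.githubusercontent.com/prometheus/prometheus/master/Dockerfile \
  | grep -nE 'WORKDIR|VOLUME'
```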
I could not recreate this with a simplified Prometheus pod using the same image:

```shell
$ cat prometheus.yaml
apiVersion: v1
kind: Pod
metadata:
  name: prometheus
spec:
  containers:
  - name: prometheus
    image: quay.io/prometheus/prometheus:v2.20.1
    command:
    - /bin/sh
    - "-c"
    - "echo 'this should be in the logs' && sleep 86400"
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data-test-prometheus
  volumes:
  - name: prometheus-data-test-prometheus
    emptyDir: {}

$ oc get pod -oyaml prometheus
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.128.2.20"
          ],
          "default": true,
          "dns": {}
      }]
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "",
          "interface": "eth0",
          "ips": [
              "10.128.2.20"
          ],
          "default": true,
          "dns": {}
      }]
    openshift.io/scc: restricted
  creationTimestamp: "2020-09-01T15:18:00Z"
  managedFields:
  ...
  name: prometheus
  namespace: demo2
  resourceVersion: "64710"
  selfLink: /api/v1/namespaces/demo2/pods/prometheus
  uid: e57a9224-7a34-4a30-874d-0c5918cd35c4
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - echo 'this should be in the logs' && sleep 86400
    image: quay.io/prometheus/prometheus:v2.20.1
    imagePullPolicy: IfNotPresent
    name: prometheus
    resources: {}
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000580000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-data-test-prometheus
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-nbmv8
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-nbsvq
  nodeName: ip-10-0-175-173.us-west-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000580000
    seLinuxOptions:
      level: s0:c24,c14
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 0
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: prometheus-data-test-prometheus
  - name: default-token-nbmv8
    secret:
      defaultMode: 420
      secretName: default-token-nbmv8
status:
  conditions:
  ...
  containerStatuses:
  - containerID: cri-o://82038ebb37461109a07d963def3ebb8600df5bb5ff84fb5ef096efa4e69cb25c
    image: quay.io/prometheus/prometheus:v2.20.1
    imageID: quay.io/prometheus/prometheus@sha256:788260ebd13613456c168d2eed8290f119f2b6301af2507ff65908d979c66c17
    lastState: {}
    name: prometheus
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-09-01T15:18:02Z"
  hostIP: 10.0.175.173
  phase: Running
  podIP: 10.128.2.20
  podIPs:
  - ip: 10.128.2.20
  qosClass: BestEffort
  startTime: "2020-09-01T15:18:00Z"

$ oc get pod
NAME         READY   STATUS    RESTARTS   AGE
prometheus   1/1     Running   0          4m28s

$ oc get events
LAST SEEN   TYPE     REASON           OBJECT           MESSAGE
4m2s        Normal   Scheduled        pod/prometheus   Successfully assigned demo2/prometheus to ip-10-0-175-173.us-west-1.compute.internal
4m          Normal   AddedInterface   pod/prometheus   Add eth0 [10.128.2.20/23]
4m          Normal   Pulled           pod/prometheus   Container image "quay.io/prometheus/prometheus:v2.20.1" already present on machine
4m          Normal   Created          pod/prometheus   Created container prometheus
4m          Normal   Started          pod/prometheus   Started container prometheus
```

In the customer's situation, the volume mounted at /prometheus is a PVC vs. an emptyDir, so there might be something there.

If I go onto the node and check the SELinux labels:

```shell
# pwd
/var/lib/kubelet/pods/e57a9224-7a34-4a30-874d-0c5918cd35c4/volumes/kubernetes.io~empty-dir/prometheus-data-test-prometheus
# ls -alZ
total 0
drwxrwsrwx. 2 root 1000580000 system_u:object_r:container_file_t:s0:c14,c24  6 Sep  1 15:18 .
drwxr-xr-x. 3 root root       system_u:object_r:var_lib_t:s0                45 Sep  1 15:18 ..
```

So the SELinux labels and the chown to the runAsUser are working.
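For the customer's case, the analogous check would be against the PVC-backed path rather than the emptyDir; a sketch with placeholder values (the kubernetes.io~nfs plugin directory is an assumption based on an NFS-backed PV):

```shell
# Pod UID and PV name below are placeholders; adjust the plugin directory to
# match the actual volume type on the node.
ls -alZ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<pv-name>
```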
Forgot to mention in my previous comment: the recreate attempt without NFS was just to ensure that the baseline functionality works, which suggests the issue is likely related to this particular storage setup.
Sending to storage to validate the NFS storage configuration being used.
I haven't been able to reproduce with a simple PVC, supplemental groups, and the restricted SCC. First, I create a directory on a host to back a hostPath PV:

```shell
oc debug node/$nodename
sh-4.2# chroot /host
sh-4.4# mkdir -p /mnt/data
sh-4.4# touch /mnt/data/hello
sh-4.4# chgrp -R 7777 /mnt/data/
sh-4.4# chmod -R 2770 /mnt/data
sh-4.4# ls -la /mnt/data
total 0
drwxrws---. 2 root 7777 19 Sep 28 23:36 .
drwxr-xr-x. 3 root root 18 Sep 28 23:15 ..
-rwxrws---. 1 root 7777  0 Sep 28 23:36 hello
```

Then run `oc apply -f /tmp/yaml` on this YAML:

```
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    openshift.io/scc: restricted
  name: test-pod
  labels:
    name: test-pod
spec:
  restartPolicy: Never
  securityContext:
    supplementalGroups: [7777]
    runAsUser: 1000100000
  containers:
  - name: test
    image: quay.io/haircommander/centos-test:latest
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: task-pv-storage
      mountPath: /test
  nodeSelector:
    kubernetes.io/hostname: $nodename # from above
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: task-pv-claim
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/data/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
```

It starts up just fine:

```shell
$ oc get pod/test-pod
NAME       READY   STATUS    RESTARTS   AGE
test-pod   1/1     Running   0          13s
```

Can someone provide me a fuller reproducer?
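Since the customer's volume is NFS-backed rather than a hostPath, a fuller reproducer would probably need an NFS PV in place of task-pv-volume; a minimal sketch, with the server and export path as placeholders:

```
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume-nfs
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    # placeholder export; whether the server squashes root may matter here
    server: nfs.example.com
    path: /exports/prometheus-data
```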
Unfortunately, until I can reproduce this, it will be hard to solve. I will try again next sprint.
The issue is coming from https://github.com/opencontainers/runc/commit/5e0e67d76cc99d76c8228d48f38f37034503f315 (which I believe was introduced in 4.5). This fails because runc attempts to chdir into a volume it has no access to (it is owned by, and in the group of, the container user, not root). The aforementioned commit changes the order of operations so that we chdir before switching to the correct user. I have attached the PR submitted upstream that restores the original order, hopefully satisfying both cases.
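As a rough illustration of why the order matters (not runc's actual code; the paths, UIDs, and root-squashing behaviour below are assumptions on my part):

```shell
# The PVC-backed workdir is only accessible to the container user/group, e.g.:
#   drwxrwx---. 1000580000 1000580000  /prometheus        (hypothetical)

# Order introduced by the commit above: enter the workdir while still running
# as root, then switch user. On a volume where root has no access (e.g. a
# root-squashed NFS export) the chdir fails and the container never starts:
cd /prometheus                                    # -> Permission denied

# Order before the commit (and restored by the upstream fix): switch to the
# container user first, then enter the workdir:
setpriv --reuid 1000580000 --regid 1000580000 --clear-groups \
  sh -c 'cd /prometheus && pwd'                   # -> /prometheus
```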
The fix has been merged; I have backported the relevant PR to our fork and triggered a build in Brew. It should land in the next 4.6.z release and in the initial 4.7 release.
If this BZ needs to be included in the 4.7 release notes as a bug fix, please enter Doc Text by EOB 2/12. Thank you, Michael
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633