Description of problem:

After upgrading OCP 3.4 to 3.5, the registry console is stuck in a crash loop, showing this in the logs as well as in the web interface:

Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied

Version-Release number of selected component (if applicable):
- OS: RHEL Atomic Host 7.3.5 and 7.4.0
- OCP: 3.4 and 3.5
- Registry Console: openshift3/registry-console:v3.5

How reproducible:
Only in customer environment

Steps to Reproduce:

1. Deploy the registry-console pod in OCP 3.5.

2. Launching cockpit-kube-launch manually still gives 'Permission denied':

sh-4.2$ /usr/libexec/cockpit-kube-launch &
[1] 20
sh-4.2$ 2017/08/08 06:11:21 Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied
[1]+  Done(1)  /usr/libexec/cockpit-kube-launch
sh-4.2$ ls -la /etc/os-release
-rwxrwxr-x. 1 root root 495 Sep 27 2016 /etc/os-release
sh-4.2$ rm /etc/os-release
rm: cannot remove '/etc/os-release': Permission denied

The user belongs to the root group:

sh-4.2$ id
uid=1000010000 gid=0(root) groups=0(root),1000010000

SELinux context:

$ oc debug dc/registry-console
Debugging with pod/registry-console-debug, original command: <image entrypoint>
Waiting for pod to start ...
Pod IP: 10.131.0.171
If you don't see a command prompt, try pressing enter.
sh-4.2$ ls -laZ /etc/os-release
-rwxrwxr-x. root root system_u:object_r:svirt_sandbox_file_t:s0:c2,c3 /etc/os-release

3. It also fails when run manually with docker:

# docker run -it -u 1000010000 --entrypoint /bin/bash registry.access.redhat.com/openshift3/registry-console:v3.5
bash-4.2$ /usr/libexec/cockpit-kube-launch &
[1] 5
bash-4.2$ 2017/08/10 07:26:18 Error checking for openshift endpoint open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
2017/08/10 07:26:18 Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied
bash-4.2$ rm /etc/os-release
rm: cannot remove '/etc/os-release': Permission denied
[1]+  Exit 1  /usr/libexec/cockpit-kube-launch
bash-4.2$

Actual results:
The registry-console pod is unable to run in OCP 3.5 on RHEL Atomic Host.

Expected results:
The registry-console pod should be able to run.

Additional info:

$ oc get dc registry-console -o yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  annotations:
    openshift.io/generated-by: OpenShiftNewApp
  creationTimestamp: 2017-07-19T15:19:45Z
  generation: 11
  labels:
    app: registry-console
    createdBy: registry-console-template
    name: registry-console
  name: registry-console
  namespace: default
  resourceVersion: "459460"
  selfLink: /oapi/v1/namespaces/default/deploymentconfigs/registry-console
  uid: b2074efc-6c95-11e7-a927-02aabbf40070
spec:
  replicas: 1
  selector:
    name: registry-console
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      timeoutSeconds: 600
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      annotations:
        openshift.io/generated-by: OpenShiftNewApp
      creationTimestamp: null
      labels:
        app: registry-console
        name: registry-console
    spec:
      containers:
      - env:
        - name: OPENSHIFT_OAUTH_PROVIDER_URL
          value: https://xxx:8443
        - name: OPENSHIFT_OAUTH_CLIENT_ID
          value: cockpit-oauth-client
        - name: KUBERNETES_INSECURE
          value: "false"
        - name: COCKPIT_KUBE_INSECURE
          value: "false"
        - name: REGISTRY_ONLY
          value: "true"
        - name: REGISTRY_HOST
          value: docker-registry-default.xxx
        image: openshift3/registry-console:v3.5
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: registry-console
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: 9090
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
  test: false
  triggers:
  - type: ConfigChange
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: 2017-07-24T13:38:03Z
    lastUpdateTime: 2017-07-24T13:38:03Z
    message: replication controller "registry-console-3" has failed progressing
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  - lastTransitionTime: 2017-07-24T18:47:38Z
    lastUpdateTime: 2017-07-24T18:47:38Z
    message: Deployment config does not have minimum availability.
    status: "False"
    type: Available
  details:
    causes:
    - type: Manual
    message: manual change
  latestVersion: 3
  observedGeneration: 11
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 0
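For context, the 'Permission denied' from both rm and ln in the transcripts above is expected POSIX behaviour: unlinking a file requires write permission on the containing directory, not on the file itself. A minimal sketch reproducing the same failure outside OpenShift; the /tmp/demo path and the test user are illustrative, not taken from the cluster:

# As root: a non-group-writable directory holding a group-writable file,
# mirroring /etc (0755 root:root) and /etc/os-release (0775 root:root):
mkdir -p /tmp/demo && chmod 755 /tmp/demo
touch /tmp/demo/os-release && chmod 775 /tmp/demo/os-release

# As an unprivileged user with gid 0 (like uid=1000010000 gid=0(root) in the pod):
echo edited >> /tmp/demo/os-release                    # succeeds: the file grants group write
rm /tmp/demo/os-release                                # fails: unlink needs write on /tmp/demo
ln -sf /container/registry-brand /tmp/demo/os-release  # fails the same way: ln must unlink first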
I am experiencing this on OCP 3.6 as well.
@shea, can you write elsewhere in the container? What about other containers — if you try, are you able to change files in /etc/?
I should add the following:
- RPM-based install
- The initial deployment succeeded; subsequent deployments have failed.
@peter, in a debug container I can edit and save /etc/os-release, but I cannot seem to create new files or directories in /etc or elsewhere.
@shea, thanks. A couple of follow-up questions: are there any corresponding errors in the journal or container logs when you try to create a new file? And could I get the output of docker info?
I can confirm that in the existing functional pod I can indeed write to /etc/ and create files; the failed pod (in debug mode) cannot.
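For reference, a quick way to run the same write probe non-interactively; the /etc/probe file name is just an illustration:

$ oc debug dc/registry-console -- sh -c 'touch /etc/probe && echo writable || echo not writable'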
Docker info is as follows:

Containers: 20
 Running: 5
 Paused: 0
 Stopped: 15
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker--vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 2.575 GB
 Data Space Total: 21.36 GB
 Data Space Available: 18.79 GB
 Metadata Space Used: 1.004 MB
 Metadata Space Total: 54.53 MB
 Metadata Space Available: 53.52 MB
 Thin Pool Minimum Free Space: 2.136 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: host bridge overlay null
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.26.1.el7.x86_64
Operating System: OpenShift Enterprise
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 15.51 GiB
Name: cwypla-174
ID: KMU4:7RUA:Q2JE:Q4MD:C7QO:6NC6:KNVP:SBKG:TAOG:QG57:ITMT:PXBL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: <manually omitted>
Https Proxy: <manually omitted>
No Proxy: <manually omitted>
Registry: https://registry.access.redhat.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 <manually omitted>
 127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)

As for logs, so far absolutely nothing is showing up when attempting (and failing) to write into that directory. I will increase the logging, continue to troubleshoot, and provide any additional details here.
In my case there is definitely a difference depending on the node it runs on, as a switch from `nodeSelector: infra` to `nodeSelector: app` has produced positive results. This test strongly suggests a node issue rather than a container-specific issue, even though other containers (e.g. the registry) are functioning correctly on infra nodes.
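A sketch of the kind of change that test involved; the `region` label key and its values are assumptions, since the cluster's actual node labels are not shown in this report:

$ oc patch dc/registry-console -n default \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"region":"app"}}}}}'
$ oc rollout latest dc/registry-console -n default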
Right, I think it's something about the way Docker is configured. My best guess is something in /etc/docker/seccomp.json or an SELinux configuration.
Thanks, I will keep digging and post any results. Nothing has come up so far when comparing SELinux or Docker configurations between functional and non-functional systems, but I will keep at it. Thanks @peter!
FWIW, setting SELinux to permissive still does not allow the container to run, nor are any additional messages written to audit.log.
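For completeness, the test sequence for a check like this is roughly the following, run as root on the node hosting the pod (treat it as a sketch):

# setenforce 0                               # switch SELinux to permissive
# oc rollout latest dc/registry-console -n default
# ausearch -m avc -ts recent                 # look for AVC denials in audit.log
# setenforce 1                               # restore enforcing mode afterwards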
OK, I think this may have to do with OpenShift SCCs. There is a READONLYROOTFS option that might be causing this behavior. @shea, could you compare the output of oc get scc on an 'app' vs. an 'infra' node, to see if this might be the setting causing the issue?
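Note that SCCs are cluster-scoped objects, so oc get scc returns the same list regardless of node. What can differ per pod is which SCC the pod was admitted under, which is recorded in an annotation; the pod name below is a placeholder:

$ oc get pod registry-console-1-xxxxx -n default \
    -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
$ oc get scc restricted -o jsonpath='{.readOnlyRootFilesystem}'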
@Peter, I am seeing this issue with a client. Looking at the Dockerfile under /root/build, it sets /etc to permission 775, but when I check the permission in the container it is 755. The SCCs are the defaults:

[root@ip-10-3-3-42 ~]# oc get scc
NAME               PRIV    CAPS   SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY   READONLYROOTFS   VOLUMES
anyuid             false   []     MustRunAs   RunAsAny           RunAsAny    RunAsAny    10         false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
hostaccess         false   []     MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim secret]
hostmount-anyuid   false   []     MustRunAs   RunAsAny           RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim secret]
hostnetwork        false   []     MustRunAs   MustRunAsRange     MustRunAs   MustRunAs   <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
nonroot            false   []     MustRunAs   MustRunAsNonRoot   RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
privileged         true    []     RunAsAny    RunAsAny           RunAsAny    RunAsAny    <none>     false            [*]
restricted         false   []     MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]

I don't know why the permission of the /etc folder is still 755, and I remember the registry-console can deploy with the restricted SCC. Any thoughts?

- Jooho Lee.
So it seems that on some nodes /etc gets set to 755 regardless of what is in the container image itself. Opened a workaround upstream: https://github.com/cockpit-project/cockpit/pull/7752
@peter sorry, notifications went to the wrong folder. I'm not sure exactly how to verify the SCC on different nodes, as I was unaware that there is a "node" context for SCC; I thought it sat above that level. Interestingly, this issue crept up in one other environment with this customer where it was previously working, and (seemingly) changing the SCC to privileged has allowed it to function again. I realize this doesn't identify the root cause yet, but it seems like a viable workaround for now.
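The workaround mentioned here boils down to something like the following. The DC in this report specifies no serviceAccountName, so the pod runs as the default service account in the default project; granting that account the privileged SCC is a broad permission, so treat this as a stopgap sketch:

$ oc adm policy add-scc-to-user privileged -z default -n default
$ oc rollout latest dc/registry-console -n default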
This also makes sense given the discussion above about changing permissions. I'm wondering how we patch this for existing clusters so that we don't have to assign the privileged SCC to that container.
@shea, we'll be putting out an updated image soon that will hopefully work around the issue. I'll update here when it's released.
Great thanks @peter!
My customer has hit this issue. It turns out it only occurred when Dynatrace OneAgent version 1.127.147 was installed on the node; when OneAgent was uninstalled, deployments of registry-console started working again. As far as I know, OneAgent performs some actions on Docker. Without OneAgent installed, the /etc directory permission in the registry-console container is 0775; with OneAgent installed, it is 0755, which blocks modifications to files when running under the restricted SCC.
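A quick way to compare the two states described above; the pod name is a placeholder:

$ oc exec registry-console-1-xxxxx -n default -- stat -c '%a %U:%G %n' /etc
# per the comment above: 775 root:root /etc without OneAgent (group-writable, restricted SCC works),
# 755 root:root /etc with OneAgent (not group-writable, so ln/rm on files in /etc fail)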
lklykken, can you try the cockpit/kubernetes image with OneAgent and see if it works now?
(In reply to Peter from comment #43)
> lklykken, can you try the cockpit/kubernetes image with OneAgent and see if
> it works now?

Unable at this time. The OneAgent issue was fixed by the vendor upon request from the customer. Since the issue was related, people observing this behaviour should verify whether any third-party software is interfering with the Docker engine.
Hi Joel,

Could you please help verify? We need OneAgent in place to see if it works now.
Or other third-party software.
Waiting for the customer to verify this information.
This has been pending verification for a long time, and we lack an environment in which to check this bug. Do you have any idea how it could be moved to VERIFIED if we can't get a response from the customer?
Hi Yapei,

The only thing that comes to mind would be to reproduce the customer environment with this third-party software installed, but I'm not sure how feasible that is at this time. I'm going to reach out to the customer again to see if we can get confirmation from them. I'll keep you posted.
Hi,

Just got feedback from the customer:

"Both clusters running a daemonset with Dynatrace OneAgent v1.131.129
Everything seems to be running fine on both clusters so far"

So most likely there was an issue with that older version of Dynatrace OneAgent.
Thanks Joel! Per customer feedback, will move the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188