Bug 1487672
| Summary: | registry-console stuck in crash loop after upgrade from 3.4 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Joel Rosental R. <jrosenta> |
| Component: | Registry Console | Assignee: | Peter <pvolpe> |
| Status: | CLOSED ERRATA | QA Contact: | Yadan Pei <yapei> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.5.0 | CC: | aos-bugs, jlee, jrosenta, lklykken, pvolpe, rhowe, shea.stewart, smunilla, yapei |
| Target Milestone: | --- | Keywords: | Unconfirmed |
| Target Release: | 3.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-28 22:09:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Joel Rosental R.
2017-09-01 14:49:25 UTC
I am experiencing this on OCP 3.6 as well.

@shea, can you write elsewhere in the container? What about other containers: are you able to change files in /etc/?

I should add the following:
- RPM based install
- The initial deployment succeeded, subsequent deployments have failed.

@peter in a debug container I can edit and save /etc/os-release, but I cannot seem to create new files or directories in /etc or elsewhere.

@shea, thanks. A couple of follow-up questions. Are there any corresponding errors in the journal or container logs when you try to create a new file? Could I get the output of docker info?

I can confirm that in the existing functional pod I can indeed write to /etc/ and create files; in the failed pod (in debug mode) I cannot. Docker info is as follows:

    Containers: 20
    Running: 5
    Paused: 0
    Stopped: 15
    Images: 5
    Server Version: 1.12.6
    Storage Driver: devicemapper
    Pool Name: docker--vg-docker--pool
    Pool Blocksize: 524.3 kB
    Base Device Size: 10.74 GB
    Backing Filesystem: xfs
    Data file:
    Metadata file:
    Data Space Used: 2.575 GB
    Data Space Total: 21.36 GB
    Data Space Available: 18.79 GB
    Metadata Space Used: 1.004 MB
    Metadata Space Total: 54.53 MB
    Metadata Space Available: 53.52 MB
    Thin Pool Minimum Free Space: 2.136 GB
    Udev Sync Supported: true
    Deferred Removal Enabled: true
    Deferred Deletion Enabled: false
    Deferred Deleted Device Count: 0
    Library Version: 1.02.135-RHEL7 (2016-11-16)
    Logging Driver: journald
    Cgroup Driver: systemd
    Plugins:
    Volume: local
    Network: host bridge overlay null
    Authorization: rhel-push-plugin
    Swarm: inactive
    Runtimes: docker-runc runc
    Default Runtime: docker-runc
    Security Options: seccomp selinux
    Kernel Version: 3.10.0-514.26.1.el7.x86_64
    Operating System: OpenShift Enterprise
    OSType: linux
    Architecture: x86_64
    Number of Docker Hooks: 2
    CPUs: 2
    Total Memory: 15.51 GiB
    Name: cwypla-174
    ID: KMU4:7RUA:Q2JE:Q4MD:C7QO:6NC6:KNVP:SBKG:TAOG:QG57:ITMT:PXBL
    Docker Root Dir: /var/lib/docker
    Debug Mode (client): false
    Debug Mode (server): false
    Http Proxy: <manually omitted>
    Https Proxy: <manually omitted>
    No Proxy: <manually omitted>
    Registry: https://registry.access.redhat.com/v1/
    WARNING: bridge-nf-call-iptables is disabled
    WARNING: bridge-nf-call-ip6tables is disabled
    Insecure Registries: <manually omitted> 127.0.0.0/8
    Registries: registry.access.redhat.com (secure), docker.io (secure)

As for logs, so far absolutely nothing is showing up when attempting (and failing) to write into that directory. I will increase the logging, continue to troubleshoot, and provide any additional details here.

In my case there is definitely a difference in the node it is running on, as a switch to `nodeSelector: app` from `nodeSelector: infra` has produced positive results. This test indicates that this is a node issue and not a container-specific issue, even though other containers (i.e. the registry) are functioning appropriately on infra nodes.

Right, I think it's something about the way docker is configured. My best guess is something in /etc/docker/seccomp.json or an selinux configuration.

Thanks, I will keep digging and post with any results.

Nothing has come up in comparing selinux or docker configurations between functional and non-functional systems so far, but I will keep at it. Thanks @peter!

FWIW, setting selinux to permissive still does not cause the container to run, nor are any additional messages placed into audit.log.

OK, I think this may have to do with the OpenShift SCC. There is a READONLYROOTFS option that might be causing this behavior. @shea, could you compare the output of oc get scc on an 'app' vs. an 'infra' node, to see if this might be the setting causing this issue?

@Peter, I am seeing this issue with a client. When I look at the Dockerfile under /root/build, it tries to change the permissions of /etc to 775, but when I check the permissions, they are 755.
SCC is default:

    [root@ip-10-3-3-42 ~]# oc get scc
    NAME               PRIV    CAPS   SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY   READONLYROOTFS   VOLUMES
    anyuid             false   []     MustRunAs   RunAsAny           RunAsAny    RunAsAny    10         false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    hostaccess         false   []     MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim secret]
    hostmount-anyuid   false   []     MustRunAs   RunAsAny           RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim secret]
    hostnetwork        false   []     MustRunAs   MustRunAsRange     MustRunAs   MustRunAs   <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    nonroot            false   []     MustRunAs   MustRunAsNonRoot   RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
    privileged         true    []     RunAsAny    RunAsAny           RunAsAny    RunAsAny    <none>     false            [*]
    restricted         false   []     MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]

I don't know why the permission of the /etc folder is still 755, and I remember the registry-console can deploy with the restricted scc. Any thoughts? - Jooho Lee.

So it seems on some nodes /etc gets set to 755 regardless of what is in the container itself. Opened a workaround upstream: https://github.com/cockpit-project/cockpit/pull/7752

@peter sorry, notifications went to the wrong folder. I'm not sure exactly how to verify the scc on different nodes (as I am/was unaware that there is a "node" context for SCC... I thought it was above that level). Interestingly, this issue crept up in one other environment with this customer where it was previously working, and (seemingly) changing the scc to privileged has allowed it to function again. I realize this doesn't highlight the root cause yet, but it seems like a viable workaround for now.
This also makes sense given the discussion above about changing permissions; I'm wondering how we patch this for existing clusters such that we don't have to specify the privileged scc for that container.

@shea, we'll be putting out an updated image soon that hopefully will work around the issue. I'll update here when it gets released.

Great, thanks @peter!

My customer has hit this issue. It turns out it only applied when Dynatrace Oneagent version 1.127.147 was installed on the node. When Oneagent was uninstalled, deployments of registry-console started working again. As far as I know, Oneagent does some actions towards Docker. Without Oneagent installed, the /etc directory permission in the registry-console container is 0775. With Oneagent installed, the /etc directory permission in the registry-console container is 0755. This would block modifications to files when running under the restricted SCC.

lklykken, can you try the cockpit/kubernetes image with oneagent and see if it works now?

(In reply to Peter from comment #43)
> lklykken, can you try the cockpit/kubernetes image with oneagent and see if
> it works now.

Unable at this time. The Oneagent issue was fixed by the vendor upon request from the customer. As the issue was related, people observing this behaviour should verify whether there is third-party software fiddling with the Docker engine.

Hi Joel, could you please help verify? We need Oneagent, or other third-party software, to see if it works now.

Waiting for the customer to verify this information.

This has been pending verification for a long time, and we lack an environment to check this bug. Do you have any idea how this could be VERIFIED if we can't get a response from the customer?

Hi Yapei, the only thing that comes to my mind would be to test with this third-party software installed, trying to reproduce the customer environment, but I'm not sure how feasible this is at this time. I'm going to reach out to the customer again to see if we can get a confirmation from them. I'll keep you posted.
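
The 0775 versus 0755 difference reported above can be illustrated with a minimal Python sketch of the classic POSIX permission check for creating a file in a directory. It assumes, as is commonly the case under the restricted SCC, that the container process runs with a non-root UID whose groups include GID 0, and that /etc is owned by root:root. The function name `dir_allows_create` and the UID value are hypothetical, and the sketch ignores ACLs, capabilities, SELinux, and read-only mounts.

```python
import stat


def dir_allows_create(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a non-root process may create entries in a directory.

    Creating a file needs both write and execute (search) permission on the
    directory. Only the plain owner/group/other mode bits are considered.
    """
    need_owner = stat.S_IWUSR | stat.S_IXUSR
    need_group = stat.S_IWGRP | stat.S_IXGRP
    need_other = stat.S_IWOTH | stat.S_IXOTH
    if uid == owner_uid:
        return mode & need_owner == need_owner
    if owner_gid in gids:
        return mode & need_group == need_group
    return mode & need_other == need_other


# Hypothetical arbitrary UID assigned to the pod; assumed to be in group 0.
ARBITRARY_UID = 1000120000

# /etc owned by root:root (uid 0, gid 0), as assumed above.
print(dir_allows_create(0o775, 0, 0, ARBITRARY_UID, {0}))  # True
print(dir_allows_create(0o755, 0, 0, ARBITRARY_UID, {0}))  # False
```

Under these assumptions the group write bit is what separates the working pod from the failing one. It is also consistent with the earlier observation that an existing file could be edited (which depends on the file's own mode) while new files could not be created (which depends on the directory's mode).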
Hi, just got feedback from the customer: "Both clusters running a daemonset with Dynatrace OneAgent v1.131.129. Everything seems to be running fine on both clusters so far", so most likely there was an issue with that older version of Dynatrace OneAgent.

Thanks Joel! Per customer feedback, will move the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:3188
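
For readers retracing the SCC comparison from earlier in this thread, the READONLYROOTFS check can be sketched in Python as below. This is only an illustration: the function name `readonly_rootfs_by_scc` is hypothetical, the sample rows are copied from the `oc get scc` output quoted in this report, and the parser assumes the CAPS column is a single token (as in `[]`) so that only the trailing VOLUMES column contains spaces.

```python
def readonly_rootfs_by_scc(table: str) -> dict[str, bool]:
    """Map each SCC name to its READONLYROOTFS flag from `oc get scc` text."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    header = lines[0].split()
    idx = header.index("READONLYROOTFS")
    result = {}
    for ln in lines[1:]:
        # Split into at most len(header) fields so the bracketed VOLUMES
        # list (which contains spaces) stays together in the last field.
        fields = ln.split(None, len(header) - 1)
        result[fields[0]] = fields[idx] == "true"
    return result


SAMPLE = """\
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP PRIORITY READONLYROOTFS VOLUMES
anyuid false [] MustRunAs RunAsAny RunAsAny RunAsAny 10 false [configMap downwardAPI emptyDir persistentVolumeClaim secret]
privileged true [] RunAsAny RunAsAny RunAsAny RunAsAny <none> false [*]
restricted false [] MustRunAs MustRunAsRange MustRunAs RunAsAny <none> false [configMap downwardAPI emptyDir persistentVolumeClaim secret]
"""

flags = readonly_rootfs_by_scc(SAMPLE)
# Names of SCCs that force a read-only root filesystem.
print([name for name, read_only in flags.items() if read_only])  # []
```

For the output quoted in this report no SCC sets READONLYROOTFS, which matches the thread's conclusion that the SCC settings were not the cause.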