Bug 1487672

Summary:	registry-console stuck in crash loop after upgrade from 3.4
Product:	OpenShift Container Platform	Reporter:	Joel Rosental R. <jrosenta>
Component:	Registry Console	Assignee:	Peter <pvolpe>
Status:	CLOSED ERRATA	QA Contact:	Yadan Pei <yapei>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.5.0	CC:	aos-bugs, jlee, jrosenta, lklykken, pvolpe, rhowe, shea.stewart, smunilla, yapei
Target Milestone:	---	Keywords:	Unconfirmed
Target Release:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-11-28 22:09:17 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Joel Rosental R. 2017-09-01 14:49:25 UTC

Description of problem:
After upgrading OCP from 3.4 to 3.5, the registry console is stuck in a crash loop and shows the following error in the logs as well as in the web interface:

Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied

Version-Release number of selected component (if applicable):
- OS: RHEL Atomic Host 7.3.5 and 7.4.0
- OCP: 3.4 and 3.5
- Registry Console: openshift3/registry-console:v3.5

How reproducible:
Only in customer environment

Steps to Reproduce:
1. deploy registry-console pod in OCP 3.5

2. Launching cockpit-kube-launch manually still gives 'Permission denied':

sh-4.2$ /usr/libexec/cockpit-kube-launch &
[1] 20
sh-4.2$ 2017/08/08 06:11:21 Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied

[1]+  Done(1)                 /usr/libexec/cockpit-kube-launch

sh-4.2$ ls -la /etc/os-release 
-rwxrwxr-x. 1 root root 495 Sep 27  2016 /etc/os-release
sh-4.2$ rm /etc/os-release 
rm: cannot remove '/etc/os-release': Permission denied

User belongs to root group:
sh-4.2$ id
uid=1000010000 gid=0(root) groups=0(root),1000010000

SELinux context:
$ oc debug dc/registry-console
Debugging with pod/registry-console-debug, original command: <image entrypoint>
Waiting for pod to start ...
Pod IP: 10.131.0.171
If you don't see a command prompt, try pressing enter.
sh-4.2$ ls -laZ /etc/os-release 
-rwxrwxr-x. root root system_u:object_r:svirt_sandbox_file_t:s0:c2,c3 /etc/os-release

3. This fails even when running it manually with docker:
# docker run -it -u 1000010000  --entrypoint /bin/bash registry.access.redhat.com/openshift3/registry-console:v3.5
bash-4.2$ /usr/libexec/cockpit-kube-launch &
[1] 5
bash-4.2$ 2017/08/10 07:26:18 Error checking for openshift endpoint open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
2017/08/10 07:26:18 Could't link /container/registry-brand to /etc/os-release: exit status 1: ln: cannot remove '/etc/os-release': Permission denied
bash-4.2$ rm /etc/os-release
rm: cannot remove '/etc/os-release': Permission denied
[1]+  Exit 1                  /usr/libexec/cockpit-kube-launch
bash-4.2$
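
The transcripts above show that /etc/os-release itself is group-writable and the user is in the root group, yet rm still fails. On POSIX systems, removing or replacing a directory entry is governed by the permissions of the parent directory, not of the file. The following is a minimal diagnostic sketch, assuming Python is available in the debug environment (the registry-console image may not ship it); the helper name `describe_parent_dir` is hypothetical and not part of cockpit or OpenShift:

```python
import os
import stat


def describe_parent_dir(path):
    """Report the parent directory's mode and whether this process may unlink `path`.

    Hypothetical helper for illustration only.
    """
    parent = os.path.dirname(os.path.abspath(path)) or "/"
    st = os.stat(parent)
    return {
        "parent": parent,
        # Permission bits only, for example "0755" or "0775".
        "mode": format(stat.S_IMODE(st.st_mode), "04o"),
        "uid": st.st_uid,
        "gid": st.st_gid,
        # Unlinking needs write and search (execute) permission on the directory.
        "can_unlink": os.access(parent, os.W_OK | os.X_OK),
    }
```

Calling `describe_parent_dir("/etc/os-release")` as the container user would show the mode of /etc and whether that user can remove entries from it.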


Actual results:
The registry-console pod is unable to run in OCP 3.5 on RHEL Atomic Host.

Expected results:
registry-console pod should be able to run

Additional info:


$ oc get dc registry-console -o yaml

apiVersion: v1
kind: DeploymentConfig
metadata:
  annotations:
    openshift.io/generated-by: OpenShiftNewApp
  creationTimestamp: 2017-07-19T15:19:45Z
  generation: 11
  labels:
    app: registry-console
    createdBy: registry-console-template
    name: registry-console
  name: registry-console
  namespace: default
  resourceVersion: "459460"
  selfLink: /oapi/v1/namespaces/default/deploymentconfigs/registry-console
  uid: b2074efc-6c95-11e7-a927-02aabbf40070
spec:
  replicas: 1
  selector:
    name: registry-console
  strategy:
    activeDeadlineSeconds: 21600
    resources: {}
    rollingParams:
      intervalSeconds: 1
      maxSurge: 25%
      maxUnavailable: 25%
      timeoutSeconds: 600
      updatePeriodSeconds: 1
    type: Rolling
  template:
    metadata:
      annotations:
        openshift.io/generated-by: OpenShiftNewApp
      creationTimestamp: null
      labels:
        app: registry-console
        name: registry-console
    spec:
      containers:
      - env:
        - name: OPENSHIFT_OAUTH_PROVIDER_URL
          value: https://xxx:8443
        - name: OPENSHIFT_OAUTH_CLIENT_ID
          value: cockpit-oauth-client
        - name: KUBERNETES_INSECURE
          value: "false"
        - name: COCKPIT_KUBE_INSECURE
          value: "false"
        - name: REGISTRY_ONLY
          value: "true"
        - name: REGISTRY_HOST
          value: docker-registry-default.xxx
        image: openshift3/registry-console:v3.5
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: registry-console
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ping
            port: 9090
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
  test: false
  triggers:
  - type: ConfigChange
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: 2017-07-24T13:38:03Z
    lastUpdateTime: 2017-07-24T13:38:03Z
    message: replication controller "registry-console-3" has failed progressing
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  - lastTransitionTime: 2017-07-24T18:47:38Z
    lastUpdateTime: 2017-07-24T18:47:38Z
    message: Deployment config does not have minimum availability.
    status: "False"
    type: Available
  details:
    causes:
    - type: Manual
    message: manual change
  latestVersion: 3
  observedGeneration: 11
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 0
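
The status block above already says why the rollout is considered failed. As a rough sketch, the failing conditions can be pulled out of the JSON form of the same object (for example from `oc get dc registry-console -o json`). The function name `failing_conditions` is hypothetical, and `SAMPLE` is trimmed by hand from the output in this report:

```python
def failing_conditions(dc):
    """Return (type, reason, message) for each condition whose status is "False"."""
    conditions = dc.get("status", {}).get("conditions", [])
    return [
        (c.get("type"), c.get("reason"), c.get("message"))
        for c in conditions
        if c.get("status") == "False"
    ]


# Sample trimmed from the status block in this report.
SAMPLE = {
    "status": {
        "conditions": [
            {
                "type": "Progressing",
                "status": "False",
                "reason": "ProgressDeadlineExceeded",
                "message": 'replication controller "registry-console-3" has failed progressing',
            },
            {
                "type": "Available",
                "status": "False",
                "message": "Deployment config does not have minimum availability.",
            },
        ]
    }
}

for cond_type, reason, message in failing_conditions(SAMPLE):
    print(cond_type, reason, message)
```

Conditions without a `reason` field, like the `Available` one here, are reported with `None` in that position.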

Comment 7 shea.stewart 2017-09-25 00:03:04 UTC

I am experiencing this on OCP 3.6 as well.

Comment 8 Peter 2017-09-25 00:04:15 UTC

@shea, can you write elsewhere in the container? What about other containers: are you able to change files in /etc/ there?

Comment 9 shea.stewart 2017-09-25 00:04:52 UTC

I should add the following: 
- RPM based install
- The initial deployment succeeded; subsequent deployments have failed.

Comment 10 shea.stewart 2017-09-25 00:08:31 UTC

@peter in a debug container I can edit and save /etc/os-release, but I cannot seem to create new files or directories in /etc or elsewhere.

Comment 11 Peter 2017-09-25 00:11:58 UTC

@shea, thanks. A couple follow up questions.

Are there any corresponding errors in the journal or container logs when you try to create a new file?

Could I get the output of docker info?

Comment 12 shea.stewart 2017-09-25 00:18:41 UTC

I can confirm that in the existing functional pod I can indeed write to /etc/ and create files; in the failed pod (in debug mode) I cannot.

Comment 13 shea.stewart 2017-09-25 00:28:16 UTC

Docker info is as follows: 

Containers: 20
 Running: 5
 Paused: 0
 Stopped: 15
Images: 5
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker--vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 2.575 GB
 Data Space Total: 21.36 GB
 Data Space Available: 18.79 GB
 Metadata Space Used: 1.004 MB
 Metadata Space Total: 54.53 MB
 Metadata Space Available: 53.52 MB
 Thin Pool Minimum Free Space: 2.136 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: host bridge overlay null
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.26.1.el7.x86_64
Operating System: OpenShift Enterprise
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 15.51 GiB
Name: cwypla-174
ID: KMU4:7RUA:Q2JE:Q4MD:C7QO:6NC6:KNVP:SBKG:TAOG:QG57:ITMT:PXBL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Http Proxy: <manually omitted>
Https Proxy: <manually omitted>
No Proxy: <manually omitted>
Registry: https://registry.access.redhat.com/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
 <manually omitted>
 127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)


As for logs, so far absolutely nothing is showing up when attempting (and failing) to write into that directory. I will increase the logging, continue to troubleshoot, and provide any additional details here.

Comment 14 shea.stewart 2017-09-25 00:50:11 UTC

In my case there is definitely a difference in the node it is running on: switching from `nodeSelector: infra` to `nodeSelector: app` has produced positive results. This test indicates that this is a node issue and not a container-specific issue, even though other containers (i.e. the registry) are functioning properly on infra nodes.

Comment 15 Peter 2017-09-25 00:52:24 UTC

Right, I think it's something about the way docker is configured. My best guess is something in /etc/docker/seccomp.json or an SELinux configuration.

Comment 16 shea.stewart 2017-09-25 01:29:02 UTC

Thanks, I will keep digging and post with any results. Nothing has come up in comparing selinux or docker configurations between functional and non-functional systems so far, but I will keep at it. 
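
One way to make the comparison between a functional and a non-functional node less error-prone is to diff the `docker info` output mechanically. This is only a sketch: `parse_docker_info` and `diff_docker_info` are hypothetical helpers, and the parser assumes the plain "Key: value" layout shown in comment 13.

```python
def parse_docker_info(text):
    """Parse `docker info` style "Key: value" lines into a dict.

    Indented sub-keys are stored under their own stripped name, so repeated
    sub-key names overwrite each other; good enough for a quick diff.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def diff_docker_info(good, bad):
    """Return {key: (good_value, bad_value)} for keys whose values differ."""
    keys = set(good) | set(bad)
    return {
        k: (good.get(k), bad.get(k))
        for k in sorted(keys)
        if good.get(k) != bad.get(k)
    }
```

Feeding it the saved output from an 'app' node and an 'infra' node returns only the keys that differ. It will not surface differences that `docker info` does not report.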

Thanks @peter!

Comment 17 shea.stewart 2017-09-25 01:39:53 UTC

FWIW, setting selinux to permissive still does not cause the container to run nor are any additional messages placed into audit.log.

Comment 18 Peter 2017-09-25 14:32:24 UTC

OK, I think this may have to do with the OpenShift SCC. There is a READONLYROOTFS
option that might be causing this behavior.

@shea, could you compare the output of

oc get scc

on an 'app' node vs an 'infra' node, to see if this setting might be causing the issue?
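
For reference, SCCs are cluster-scoped objects, so the listing is the same regardless of which node it is run from; what can differ is which SCC admitted a given pod. A small sketch for checking both from JSON output (`oc get scc -o json` and `oc get pod <name> -o json`); the function names are hypothetical, while the `readOnlyRootFilesystem` field and the `openshift.io/scc` pod annotation are what OpenShift uses:

```python
def sccs_with_readonly_rootfs(scc_list):
    """Return names of SCCs whose readOnlyRootFilesystem field is true."""
    return [
        item["metadata"]["name"]
        for item in scc_list.get("items", [])
        if item.get("readOnlyRootFilesystem", False)
    ]


def scc_of_pod(pod):
    """Return the SCC that admitted a pod, read from its annotations."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    return annotations.get("openshift.io/scc")
```

An empty list from `sccs_with_readonly_rootfs` would rule out READONLYROOTFS as the cause.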

Comment 19 jooho lee 2017-09-25 16:07:07 UTC

@Peter,

I am seeing this issue with a client. The Dockerfile under /root/build tries to change the permissions of /etc to 775, but when I check the permissions in the container, they are 755.

SCC is default
[root@ip-10-3-3-42 ~]# oc get scc
NAME               PRIV      CAPS      SELINUX     RUNASUSER          FSGROUP     SUPGROUP    PRIORITY   READONLYROOTFS   VOLUMES
anyuid             false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    10         false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
hostaccess         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim secret]
hostmount-anyuid   false     []        MustRunAs   RunAsAny           RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath nfs persistentVolumeClaim secret]
hostnetwork        false     []        MustRunAs   MustRunAsRange     MustRunAs   MustRunAs   <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
nonroot            false     []        MustRunAs   MustRunAsNonRoot   RunAsAny    RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]
privileged         true      []        RunAsAny    RunAsAny           RunAsAny    RunAsAny    <none>     false            [*]
restricted         false     []        MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir persistentVolumeClaim secret]


I don't know why the permissions of the /etc folder are still 755, and I remember that the registry-console can be deployed with the restricted SCC.

Any thoughts?

- Jooho Lee.

Comment 37 Peter 2017-09-26 19:32:44 UTC

So it seems that on some nodes /etc gets set to 755 regardless of what is in the container image itself. I opened a workaround upstream:

https://github.com/cockpit-project/cockpit/pull/7752

Comment 38 shea.stewart 2017-09-27 14:56:44 UTC

@peter sorry, notifications went to the wrong folder. 

I'm not sure exactly how to verify the scc on different nodes (as I am/was unaware that there is a "node" context for SCC... I thought it was above that level). 

Interestingly this issue crept up in one other environment with this customer where it was previously working and (seemingly) changing the scc to privileged has allowed it to function again. I realize this doesn't highlight the root cause yet, but seems like a viable workaround for now.

Comment 39 shea.stewart 2017-09-27 14:58:02 UTC

This also makes sense given the discussion above about changing permissions; I'm wondering how we patch this for existing clusters such that we don't have to specify the privileged scc for that container.

Comment 40 Peter 2017-09-27 15:01:52 UTC

@shea, we'll be putting out an updated image soon that hopefully will work around the issue.

I'll update here when they get released.

Comment 41 shea.stewart 2017-09-27 15:51:49 UTC

Great thanks @peter!

Comment 42 Bosse Klykken 2017-10-09 13:32:00 UTC

My customer has hit this issue. It turns out it only applied when Dynatrace Oneagent version 1.127.147 was installed on the node. When Oneagent was uninstalled, deployments of registry-console started working again.

As far as I know, Oneagent interacts with Docker in some way. Without Oneagent installed, the /etc directory permission in the registry-console container is 0775. With Oneagent installed, it is 0755. Since the container user is not root and only belongs to the root group, the missing group write bit on /etc would block removing or replacing files there when running under the restricted SCC.
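
The effect of the two modes can be worked through with the classic owner/group/other check. This is a simplified model, not what the kernel does in full: it ignores the sticky bit, ACLs, capabilities and SELinux, and it assumes /etc is owned by root:root. The function name `may_remove_entry` is hypothetical; the uid and groups are taken from the `id` output in the description.

```python
def may_remove_entry(dir_mode, dir_uid, dir_gid, uid, gids):
    """Return True if the mode bits let this user remove an entry from the directory."""
    if uid == 0:
        return True  # root bypasses mode bits in this simplified model
    if uid == dir_uid:
        bits = (dir_mode >> 6) & 0o7  # owner class
    elif dir_gid in gids:
        bits = (dir_mode >> 3) & 0o7  # group class
    else:
        bits = dir_mode & 0o7  # other class
    # Removing an entry needs write (2) and search/execute (1) on the directory.
    return (bits & 0o3) == 0o3


# uid and groups from the `id` output in the description.
UID = 1000010000
GROUPS = {0, 1000010000}

print(may_remove_entry(0o775, 0, 0, UID, GROUPS))  # True: group has rwx on /etc
print(may_remove_entry(0o755, 0, 0, UID, GROUPS))  # False: group has only r-x
```

With 0775 the user's membership in the root group is enough to replace /etc/os-release; with 0755 it is not, which matches the 'Permission denied' from ln and rm.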

Comment 43 Peter 2017-10-09 13:33:13 UTC

lklykken, can you try the cockpit/kubernetes image with oneagent and see if it works now.

Comment 44 Bosse Klykken 2017-10-26 13:15:08 UTC

(In reply to Peter from comment #43)
> lklykken, can you try the cockpit/kubernetes image with oneagent and see if
> it works now.

Unable at this time.

The Oneagent issue was fixed by the vendor at the customer's request. Since the issue was related, people observing this behaviour should check whether any third-party software is interfering with the Docker engine.

Comment 46 Yadan Pei 2017-10-27 03:02:21 UTC

Hi Joel,

Could you please help verify this? We need an environment with Oneagent to see if it works now.

Comment 47 Yadan Pei 2017-10-27 03:04:02 UTC

Or other third-party software

Comment 48 Yadan Pei 2017-11-02 02:08:25 UTC

Waiting for customer to verify this information

Comment 49 Yadan Pei 2017-11-08 07:23:35 UTC

This has been pending verification for a long time; we lack an environment to check this bug.

Do you have any idea how this could be VERIFIED if we can't get a response from the customer?

Comment 50 Joel Rosental R. 2017-11-08 08:28:00 UTC

Hi Yapei,

The only thing that comes to mind would be to test with this third-party software installed, trying to reproduce the customer environment, but I'm not sure how feasible this is at this time.

I'm going to reach out to the customer again to see if we can get a confirmation from them.

I'll keep you posted.

Comment 51 Joel Rosental R. 2017-11-08 10:00:24 UTC

Hi,

Just got feedback from the customer:

"Both clusters running a daemonset with Dynatrace OneAgent v1.131.129

Everything seems to be running fine on both clusters so far"

so most likely there was an issue with that older version of Dynatrace OneAgent

Comment 52 Yadan Pei 2017-11-09 01:31:53 UTC

Thanks Joel!

Per customer feedback, will move the bug to VERIFIED.

Comment 56 errata-xmlrpc 2017-11-28 22:09:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188