Since Sunday we have been unable to perform release builds. We initially thought this was an inability to pull from authenticated registries, but a fix for that was supposedly delivered Thursday morning. This blocks most general development work and OCP promotion.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5380
...and the updated RHCoS didn't fix the issue (finishing the first sentence above).
We need to bump `machine-os-content` to point to the latest from https://releases-redhat-coreos.cloud.paas.upshift.redhat.com/, specifically `oscontainer/machine-os-content`:

```
docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
```
```
$ oc --token "$(cat rhcos-apici-secret)" tag rhcos/maipo:latest ocp/4.0:machine-os-content
Tag ocp/4.0:machine-os-content set to rhcos/maipo@sha256:fa5c75c77d54bd4480d2ecf6109ffbe49af9bf2cbc176fa9dc689475ebeda124.
```
Hmm, https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5381/ (which is right after 5380) passed... but then all the other ones failed with the same symptom. Anyway, let's ignore that as some sort of fluke.

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5395/ I see:

```
2019/03/07 19:32:01 Resolved ocp/4.0:machine-os-content to sha256:fa5c75c77d54bd4480d2ecf6109ffbe49af9bf2cbc176fa9dc689475ebeda124
```

*confused*

(time passes)

OK, it turns out I tagged the wrong image. Fixed. However... I am still uncertain whether `machine-os-content` being old is the real problem here. We definitely need to fix it, since it means *upgrades* certainly can't work (as @jsafrane hit), but AIUI this shouldn't block the masters from coming up. I filed https://github.com/openshift/machine-config-operator/issues/530 to improve debugging.
Some further debugging reveals this is probably a CRI-O problem. etcd is a static pod, and it's failing with:

```
Mar 07 21:14:36 osiris-pvh4c-master-1 hyperkube[3672]: E0307 21:14:36.342009 3672 pod_workers.go:186] Error syncing pod 890327cc2d4b2fecb352de02a64279db ("etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)"), skipping: failed to "CreatePodSandbox" for "etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)" with CreatePodSandboxError: "CreatePodSandbox for pod \"etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_etcd-member-osiris-pvh4c-master-1_kube-system_890327cc2d4b2fecb352de02a64279db_0\": Error determining manifest MIME type for docker://registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c: Error reading manifest sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c in registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609: unauthorized: authentication required"
```

Yet `skopeo inspect --authfile=/var/lib/kubelet/config.json docker://registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c` works fine.
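For context (not from the thread, just illustrating the mechanism): the auth file skopeo is reading is a dockercfg-style JSON file in which each registry maps to a base64-encoded `user:password` blob. A minimal sketch with made-up placeholder credentials:

```shell
# Sketch of the dockercfg-style auth file format that skopeo's --authfile
# (and the kubelet's /var/lib/kubelet/config.json) use.
# The registry name matches the thread; "user:secret" is a placeholder.
auth=$(printf '%s' 'user:secret' | base64)
cat > /tmp/pull-auth.json <<EOF
{
  "auths": {
    "registry.svc.ci.openshift.org": {
      "auth": "$auth"
    }
  }
}
EOF
cat /tmp/pull-auth.json
```

skopeo succeeds because `--authfile` points it at a file like this explicitly; the failing sandbox-image pull in CRI-O evidently isn't being told about it.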
For reference, I was debugging this via:

```
env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-07-195609 openshift-installer ...
```

on libvirt, so I could easily ssh to the bootstrap/master machines and debug.
I was able to gather this from CRI-O in debug mode:

```
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932830830Z" level=info msg="Attempting to run pod sandbox with infra container: kube-system/etcd-member-ip-10-0-139-117.ec2.internal/POD"
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932894564Z" level=debug msg="parsed reference into \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\""
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932936392Z" level=debug msg="reference \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\" does not resolve to an image ID"
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932960092Z" level=debug msg="couldn't find image \"registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\", retrieving it"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100048733Z" level=debug msg="parsed reference into \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\""
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100292095Z" level=debug msg="Using registries.d directory /etc/containers/registries.d for sigstore configuration"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100426077Z" level=debug msg=" Using \"default-docker\" configuration"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100439218Z" level=debug msg=" No signature storage configuration found for registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
```
We are failing here:

```go
	podContainer, err := s.StorageRuntimeServer().CreatePodSandbox(s.ImageContext(),
		name, id,
		s.config.PauseImage, "",
		containerName,
		req.GetConfig().GetMetadata().GetName(),
		req.GetConfig().GetMetadata().GetUid(),
		namespace, attempt,
		s.defaultIDMappings,
		nil)
	if errors.Cause(err) == storage.ErrDuplicateName {
		return nil, fmt.Errorf("pod sandbox with name %q already exists", name)
	}
	if err != nil {
		return nil, fmt.Errorf("error creating pod sandbox with name %q: %v", name, err)
	}
	defer func() {
		if err != nil {
			if err2 := s.StorageRuntimeServer().RemovePodSandbox(id); err2 != nil {
				logrus.Warnf("couldn't cleanup pod sandbox %q: %v", id, err2)
			}
		}
	}()
```

The issue is that we don't get auth from the kubelet for the pause container. It is our job to manage the sandbox, including the image and the auth for it. I copied the auth from /var/lib/kubelet/config.json to ~/.docker/config.json, but that doesn't fix the issue either, as we need additional plumbing in the pull code here to read the auth.
Okay, an update: I made an operator error in setting `pause_image`. Once I fixed that, it worked fine. I set the pause image to:

```
pause_image = "registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
pause_command = "/usr/bin/pod"
```

and copied /var/lib/kubelet/config.json to /root/.docker/config.json; then it works. I think the symlink idea that Colin proposed may be the simplest fix.
> I think the symlink idea that Colin proposed may be the simplest fix.

I have a half-done patch for it and I'll work on it tomorrow. But... let's try to fix this in two ways? Any reason not to also patch CRI-O to read the kubelet auth directly too?
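The symlink idea amounts to making root's Docker-style config point at the kubelet's pull secrets, so anything that reads ~/.docker/config.json picks them up automatically. A sketch of the mechanics, demonstrated in a scratch directory rather than on a real node (where the source would be /var/lib/kubelet/config.json and the link /root/.docker/config.json):

```shell
# Demonstrate the symlink workaround in a temp directory. On a real node
# the equivalent would be:
#   ln -sf /var/lib/kubelet/config.json /root/.docker/config.json
demo=/tmp/authlink-demo
rm -rf "$demo"
mkdir -p "$demo/.docker"
printf '{"auths":{}}' > "$demo/kubelet-config.json"  # stand-in for the kubelet auth file
ln -sf "$demo/kubelet-config.json" "$demo/.docker/config.json"
ls -l "$demo/.docker/config.json"
```

One nicety of a symlink over a copy: the kubelet auth file can be rotated without the two files drifting out of sync.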
Yes, sure. We can add configuration to cri-o to read from different auth files. We want to avoid hard-coding the /var/lib/kubelet/config.json path.
I think I have an idea of how to handle this in containers/image more or less transparently. Dealing with it there would allow other users (e.g., Buildah and Podman) to support this use case as well. I'll spin up a PR.
Mrunal, Colin, please take a look at the following PRs:

- https://github.com/containers/image/pull/595 to allow passing additional authentication files through the `SystemContext`.
- https://github.com/containers/skopeo/pull/612 as the mandatory Skopeo PR to test that it's actually working.

You can try the functionality out, for instance, via `$ skopeo inspect --additional-authfile $FILE ...`.
https://github.com/openshift/machine-config-operator/pull/535
Also in this space: https://github.com/kubernetes-sigs/cri-o/pull/2115
I think we're probably good: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5459 passed, though e2e-aws-serial failed for some other reason.
CRI-O 1.12.9 and 1.13.2 contain a `pause_image_auth_file` config option / `--pause-image-auth-file` CLI option. I have filed https://github.com/openshift/machine-config-operator/pull/540 to take advantage of them.
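To sketch how the new option slots in alongside the settings used earlier in the thread (the `[crio.image]` section placement is my assumption about where these keys live in crio.conf, not something stated above):

```
# /etc/crio/crio.conf fragment (sketch)
[crio.image]
pause_image = "registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
pause_command = "/usr/bin/pod"
pause_image_auth_file = "/var/lib/kubelet/config.json"
```

With this in place, the sandbox-image pull reads the kubelet's credentials directly, with no copy or symlink of config.json needed.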