Since Sunday we have been unable to perform release builds. We initially thought this was an inability to pull from authenticated registries, but a fix for that was supposedly delivered Thursday morning. This blocks most general development work and OCP promotion.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5380
...and the updated RHCoS didn't fix the issue (finishing the first sentence above).
We need to bump `machine-os-content` to point to the latest from https://releases-redhat-coreos.cloud.paas.upshift.redhat.com/, specifically `oscontainer/machine-os-content`:

```
docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:c09f455cc09673a1a13ae7b54cc4348cda0411e06dfa79ecd0130b35d62e8670
```
```
$ oc --token "$(cat rhcos-apici-secret)" tag rhcos/maipo:latest ocp/4.0:machine-os-content
Tag ocp/4.0:machine-os-content set to rhcos/maipo@sha256:fa5c75c77d54bd4480d2ecf6109ffbe49af9bf2cbc176fa9dc689475ebeda124.
```
Hmm, https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5381/ (which is right after 5380) passed... but then all the other ones failed with the same symptom. Anyway, let's ignore that as some sort of fluke.

In https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5395/ I see:

```
2019/03/07 19:32:01 Resolved ocp/4.0:machine-os-content to sha256:fa5c75c77d54bd4480d2ecf6109ffbe49af9bf2cbc176fa9dc689475ebeda124
```

*confused*

(time passes)

OK, it turns out I tagged the wrong image. Fixed. However... I am still uncertain whether `machine-os-content` being old is the real problem here. We definitely need to fix it, since it means *upgrades* certainly can't work (as @jsafrane hit), but AIUI this shouldn't block the masters from coming up. I filed https://github.com/openshift/machine-config-operator/issues/530 to improve debugging.
Some further debugging reveals this is probably a CRI-O problem. etcd is a static pod, and it's failing with:

```
Mar 07 21:14:36 osiris-pvh4c-master-1 hyperkube[3672]: E0307 21:14:36.342009 3672 pod_workers.go:186] Error syncing pod 890327cc2d4b2fecb352de02a64279db ("etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)"), skipping: failed to "CreatePodSandbox" for "etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)" with CreatePodSandboxError: "CreatePodSandbox for pod \"etcd-member-osiris-pvh4c-master-1_kube-system(890327cc2d4b2fecb352de02a64279db)\" failed: rpc error: code = Unknown desc = error creating pod sandbox with name \"k8s_etcd-member-osiris-pvh4c-master-1_kube-system_890327cc2d4b2fecb352de02a64279db_0\": Error determining manifest MIME type for docker://registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c: Error reading manifest sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c in registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609: unauthorized: authentication required"
```

Yet `skopeo inspect --authfile=/var/lib/kubelet/config.json docker://registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c` works fine.
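For context (not from the thread, just illustrating the mechanism): the auth file skopeo is reading is a dockercfg-style JSON file in which each registry maps to a base64-encoded `user:password` blob. A minimal sketch with made-up placeholder credentials:

```shell
# Sketch of the dockercfg-style auth file format that skopeo's --authfile
# (and the kubelet's /var/lib/kubelet/config.json) use.
# The registry name matches the thread; "user:secret" is a placeholder.
auth=$(printf '%s' 'user:secret' | base64)
cat > /tmp/pull-auth.json <<EOF
{
  "auths": {
    "registry.svc.ci.openshift.org": {
      "auth": "$auth"
    }
  }
}
EOF
cat /tmp/pull-auth.json
```

skopeo succeeds because `--authfile` points it at a file like this explicitly; the failing sandbox-image pull in CRI-O evidently isn't being told about it.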
For reference, I was debugging this via:

```
env OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-03-07-195609 openshift-installer ...
```

on libvirt, so I could easily ssh to the bootstrap/master machines and debug.
I was able to gather this from CRI-O in debug mode:

```
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932830830Z" level=info msg="Attempting to run pod sandbox with infra container: kube-system/etcd-member-ip-10-0-139-117.ec2.internal/POD"
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932894564Z" level=debug msg="parsed reference into \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\""
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932936392Z" level=debug msg="reference \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\" does not resolve to an image ID"
Mar 07 22:33:24 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:24.932960092Z" level=debug msg="couldn't find image \"registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\", retrieving it"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100048733Z" level=debug msg="parsed reference into \"[overlay@/var/lib/containers/storage+/var/run/containers/storage]registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c\""
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100292095Z" level=debug msg="Using registries.d directory /etc/containers/registries.d for sigstore configuration"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100426077Z" level=debug msg=" Using \"default-docker\" configuration"
Mar 07 22:33:25 ip-10-0-139-117 crio[5563]: time="2019-03-07 22:33:25.100439218Z" level=debug msg=" No signature storage configuration found for registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
```
We are failing here:

```go
	podContainer, err := s.StorageRuntimeServer().CreatePodSandbox(s.ImageContext(),
		name, id,
		s.config.PauseImage, "",
		containerName,
		req.GetConfig().GetMetadata().GetName(),
		req.GetConfig().GetMetadata().GetUid(),
		namespace, attempt,
		s.defaultIDMappings,
		nil)
	if errors.Cause(err) == storage.ErrDuplicateName {
		return nil, fmt.Errorf("pod sandbox with name %q already exists", name)
	}
	if err != nil {
		return nil, fmt.Errorf("error creating pod sandbox with name %q: %v", name, err)
	}
	defer func() {
		if err != nil {
			if err2 := s.StorageRuntimeServer().RemovePodSandbox(id); err2 != nil {
				logrus.Warnf("couldn't cleanup pod sandbox %q: %v", id, err2)
			}
		}
	}()
```

The issue is that we don't get auth from the kubelet for the pause container. It is our job to manage the sandbox, including the image and the auth for it. I copied the auth from /var/lib/kubelet/config.json to ~/.docker/config.json, but that doesn't fix the issue either, as we need additional plumbing in the pull code here to read the auth.
Okay, an update: I made an operator error in setting `pause_image`. Once I fixed that, it worked fine. I set the pause image to:

```
pause_image = "registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
pause_command = "/usr/bin/pod"
```

and copied /var/lib/kubelet/config.json to /root/.docker/config.json; then it works. I think the symlink idea that Colin proposed may be the simplest fix.
> I think the symlink idea that Colin proposed may be the simplest fix.

I have a half-done patch for it and I'll work on it tomorrow. But... let's try to fix this in two ways? Any reason not to also patch CRI-O to read the kubelet auth directly too?
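The symlink idea amounts to making root's Docker-style config point at the kubelet's pull secrets, so anything that reads ~/.docker/config.json picks them up automatically. A sketch of the mechanics, demonstrated in a scratch directory rather than on a real node (where the source would be /var/lib/kubelet/config.json and the link /root/.docker/config.json):

```shell
# Demonstrate the symlink workaround in a temp directory. On a real node
# the equivalent would be:
#   ln -sf /var/lib/kubelet/config.json /root/.docker/config.json
demo=/tmp/authlink-demo
rm -rf "$demo"
mkdir -p "$demo/.docker"
printf '{"auths":{}}' > "$demo/kubelet-config.json"  # stand-in for the kubelet auth file
ln -sf "$demo/kubelet-config.json" "$demo/.docker/config.json"
ls -l "$demo/.docker/config.json"
```

One nicety of a symlink over a copy: the kubelet auth file can be rotated without the two files drifting out of sync.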
Yes, sure. We can add configuration to cri-o to read from different auth files. We want to avoid hard-coding the /var/lib/kubelet/config.json path.
I think I have an idea of how to handle this in containers/image more or less transparently. Dealing with it there would allow other users (e.g., Buildah and Podman) to support this use case as well. I'll spin up a PR.
Mrunal, Colin, please take a look at the following PRs:

- https://github.com/containers/image/pull/595 to allow passing additional authentication files through the `SystemContext`.
- https://github.com/containers/skopeo/pull/612 as the mandatory Skopeo PR to test that it's actually working.

You can try the functionality out, for instance, via `$ skopeo inspect --additional-authfile $FILE ...`.
https://github.com/openshift/machine-config-operator/pull/535
Also in this space: https://github.com/kubernetes-sigs/cri-o/pull/2115
I think we're probably good: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5459 passed, though e2e-aws-serial failed for some other reason.
CRI-O 1.12.9 and 1.13.2 contain a `pause_image_auth_file` config option / `--pause-image-auth-file` CLI option. I have filed https://github.com/openshift/machine-config-operator/pull/540 to take advantage of them.
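To sketch how the new option slots in alongside the settings used earlier in the thread (the `[crio.image]` section placement is my assumption about where these keys live in crio.conf, not something stated above):

```
# /etc/crio/crio.conf fragment (sketch)
[crio.image]
pause_image = "registry.svc.ci.openshift.org/ocp/4.0-2019-03-07-195609@sha256:2501f69c6510834165835ac9cf4e2a71c2f65ed1165c9e1cb1e4a65c3ec7ef2c"
pause_command = "/usr/bin/pod"
pause_image_auth_file = "/var/lib/kubelet/config.json"
```

With this in place, the sandbox-image pull reads the kubelet's credentials directly, with no copy or symlink of config.json needed.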