Bug 1862979 - [OCP 4.6] failed to provision master node due to cannot get image from mirror registry
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sinny Kumari
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Duplicates: 1862948 1863335
Depends On:
Blocks:
Reported: 2020-08-03 12:00 UTC by Yunfei Jiang
Modified: 2021-04-05 17:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:22:34 UTC
Target Upstream Version:
Embargoed:


Attachments
machine-config-daemon-firstboot.service (10.13 KB, text/plain)
2020-08-03 12:00 UTC, Yunfei Jiang
bootstrap log (1.43 MB, application/gzip)
2020-08-03 12:01 UTC, Yunfei Jiang


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1979 0 None closed Bug 1862979: daemon: fallback to podman copy when oc fails to extract image 2021-02-03 12:22:50 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:23:00 UTC

Description Yunfei Jiang 2020-08-03 12:00:07 UTC
Description of problem:
When installing a cluster in a disconnected environment, the master machine cannot pull the image from the mirror registry and fails with the following error:
 
Aug 03 11:19:33 ip-10-0-66-177 machine-config-daemon[1939]: error: failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-608521709 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:21f2620684e969a963316a44b413b9743a78dc83c47df80cc9f6a6acb120c57c failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:21f2620684e969a963316a44b413b9743a78dc83c47df80cc9f6a6acb120c57c: Get "https://quay.io/v2/": Forbidden
Aug 03 11:19:33 ip-10-0-66-177 machine-config-daemon[1939]: : exit status 1
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Consumed 857ms CPU time

Failed to provision master node.

Version-Release number of the following components:
OCP 4.6.0-0.nightly-2020-08-03-054919

How reproducible:
Always

Steps to Reproduce:
1. Create a disconnected cluster

Actual results:
Cluster creation fails

Expected results:
Cluster is created successfully

Additional info:
1. Reproduced the problem on GCP and vSphere
2. The above error message does not appear when creating a 4.5 disconnected cluster

Comment 1 Yunfei Jiang 2020-08-03 12:00:46 UTC
Created attachment 1703284
machine-config-daemon-firstboot.service

Comment 2 Yunfei Jiang 2020-08-03 12:01:25 UTC
Created attachment 1703285
bootstrap log

Comment 3 Yunfei Jiang 2020-08-03 12:02:36 UTC
This bug blocks all tests against disconnected environments.

Comment 4 Stephen Benjamin 2020-08-03 15:21:52 UTC
We see the same on all baremetal IPv6 jobs, which must be disconnected due to quay not supporting IPv6:


Aug 03 12:25:32 master-0.ostest.test.metalkube.org machine-config-daemon[2451]: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70cfcdee7fa0eac2578f32f197b410d0f50d5bb10ac56ba402eb758e50e76d04: Get "https://quay.io/v2/": dial tcp 34.198.42.182:443: connect: network is unreachable

Comment 5 Stephen Benjamin 2020-08-03 15:27:53 UTC
I think this belongs against MCO, which is the source of the problem. Either MCD is running an `oc` command that's not accounting for disconnected installs, or `oc` itself has a problem.

Comment 6 Antonio Murdaca 2020-08-03 16:13:56 UTC
Is https://github.com/openshift/enhancements/pull/334 another option to avoid _not_ using oc?

Comment 7 Kirsten Garrison 2020-08-03 16:33:38 UTC
Adding the upgrade blocker tag as well, since we have no indication so far that it doesn't block upgrades. Feel free to remove it if we find otherwise.

Comment 8 Vadim Rutkovsky 2020-08-03 16:35:03 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 9 Antonio Murdaca 2020-08-04 09:47:31 UTC
*** Bug 1863335 has been marked as a duplicate of this bug. ***

Comment 10 Sinny Kumari 2020-08-04 10:26:29 UTC
After a brainstorming session with Antonio today, we came up with another solution that fixes the problem with minimal changes:
- We keep the current implementation of CoreOS extensions support (i.e. keep using `oc image extract`).
- Until the oc fix that adds mirror registry support lands, when `oc image extract` fails we fall back to copying machine-os-content onto the node using `podman pull osImageURL && podman create osImageURL && podman cp container_ID:/ /run/machine-os-content/os-content-XXXX`.

The fallback is applied only when `oc image extract` has failed.
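
For readers following along, here is a minimal Python sketch of the fallback flow described above. It is not the code from the MCO pull request (the daemon itself is written in Go). The function name `extract_os_content`, the injectable `runner` parameter, and the final `podman rm` cleanup step are assumptions added for illustration; the oc and podman command lines are the ones quoted in this bug.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def run(cmd: Sequence[str]) -> str:
    """Run a command; raises CalledProcessError on a non-zero exit."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def extract_os_content(image: str, dest: str, runner: Runner = run) -> str:
    """Copy the image's root filesystem into dest.

    Returns "oc" if `oc image extract` worked, "podman" if the fallback ran.
    """
    try:
        runner(["oc", "image", "extract", "--path", f"/:{dest}",
                "--registry-config", "/var/lib/kubelet/config.json", image])
        return "oc"
    except (subprocess.CalledProcessError, OSError):
        pass  # oc could not reach the registry; try podman instead

    # Fallback: pull the image, create a stopped container, copy its root out.
    runner(["podman", "pull", image])
    container_id = runner(["podman", "create", image])
    try:
        runner(["podman", "cp", f"{container_id}:/", dest])
    finally:
        # Cleanup is an assumption of this sketch, not part of the comment.
        runner(["podman", "rm", container_id])
    return "podman"
```

Passing the runner as a parameter keeps the control flow testable on a machine that has neither oc nor podman installed: a stub that raises for `oc` exercises the fallback path.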

Comment 11 Vadim Rutkovsky 2020-08-06 07:50:40 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 12 Vadim Rutkovsky 2020-08-06 07:51:51 UTC
This bug seems to affect proxy environments too (which are similar to disconnected - image cannot be downloaded directly from quay and `oc image extract` doesn't take mirrors/proxies into account)
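
To illustrate what "taking mirrors into account" means here, the sketch below rewrites an image reference by source prefix, which is roughly what a mirror-aware client does before it contacts a registry. It is a simplified illustration, not the logic of oc or podman. The function name `rewrite_to_mirror` and the mirror host `mirror.example.com:5000` are hypothetical.

```python
from typing import Dict


def rewrite_to_mirror(image: str, mirrors: Dict[str, str]) -> str:
    """Rewrite image to its mirror location if a source prefix matches.

    mirrors maps a source repository prefix to a mirror prefix. The longest
    matching source wins; the reference is returned unchanged on no match.
    """
    for source in sorted(mirrors, key=len, reverse=True):
        # Match only on a repository boundary, not on a partial name.
        if image == source or image.startswith(
            (source + "@", source + ":", source + "/")
        ):
            return mirrors[source] + image[len(source):]
    return image


# Hypothetical mirror mapping for the repository seen in the logs.
mirrors = {
    "quay.io/openshift-release-dev/ocp-v4.0-art-dev":
        "mirror.example.com:5000/ocp/release",
}
image = ("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:"
         "21f2620684e969a963316a44b413b9743a78dc83c47df80cc9f6a6acb120c57c")
print(rewrite_to_mirror(image, mirrors))
```

A client that skips this step goes straight to `https://quay.io/v2/`, which is the request that fails in the logs above.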

Comment 16 Antonio Murdaca 2020-08-07 09:38:23 UTC
(In reply to Vadim Rutkovsky from comment #12)
> This bug seems to affect proxy environments too (which are similar to
> disconnected - image cannot be downloaded directly from quay and `oc image
> extract` doesn't take mirrors/proxies into account)

thanks Vadim, we're tackling that separately

Comment 17 Sinny Kumari 2020-08-07 09:59:12 UTC
(In reply to Antonio Murdaca from comment #16)
> (In reply to Vadim Rutkovsky from comment #12)
> > This bug seems to affect proxy environments too (which are similar to
> > disconnected - image cannot be downloaded directly from quay and `oc image
> > extract` doesn't take mirrors/proxies into account)
> 
> thanks Vadim, we're tackling that separately

Perhaps we should reopen the bug https://bugzilla.redhat.com/show_bug.cgi?id=1862948 which has proxy setup.

Comment 18 Micah Abbott 2020-08-10 20:01:57 UTC
@yunjiang Mike N. is on paternity leave and additionally, we do not have the infrastructure to test disconnected installs.  Would it be possible that you could retest this and indicate if the BZ is verified?

Comment 19 Vadim Rutkovsky 2020-08-11 07:47:54 UTC
(In reply to Sinny Kumari from comment #17)
> (In reply to Antonio Murdaca from comment #16)
> > (In reply to Vadim Rutkovsky from comment #12)
> > > This bug seems to affect proxy environments too (which are similar to
> > > disconnected - image cannot be downloaded directly from quay and `oc image
> > > extract` doesn't take mirrors/proxies into account)
> > 
> > thanks Vadim, we're tackling that separately
> 
> Perhaps we should reopen the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1862948 which has proxy setup.

No, the proxy issue has the very same root cause (so I closed the proxy bug as a dupe)

Comment 20 Vadim Rutkovsky 2020-08-11 07:49:51 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 21 Yunfei Jiang 2020-08-11 09:57:06 UTC
verified. PASS.
version: 4.6.0-0.nightly-2020-08-10-180431

Comment 23 errata-xmlrpc 2020-10-27 16:22:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 24 W. Trevor King 2021-04-05 17:47:04 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

