Bug 1862979

Summary: [OCP 4.6] failed to provision master node due to cannot get image from mirror registry
Product: OpenShift Container Platform Reporter: Yunfei Jiang <yunjiang>
Component: Machine Config Operator Assignee: Sinny Kumari <skumari>
Status: CLOSED ERRATA QA Contact: Yunfei Jiang <yunjiang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.6 CC: amurdaca, dhellmann, kgarriso, miabbott, skumari, somalley, stbenjam, vrutkovs, walters, wking, xtian, xxia
Target Milestone: --- Keywords: TestBlocker
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:22:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                Flags
machine-config-daemon-firstboot.service    none
bootstrap log                              none

Description Yunfei Jiang 2020-08-03 12:00:07 UTC
Description of problem:
When installing a cluster in a disconnected environment, the master machine cannot get the image from the registry, due to the following error:
 
Aug 03 11:19:33 ip-10-0-66-177 machine-config-daemon[1939]: error: failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-608521709 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:21f2620684e969a963316a44b413b9743a78dc83c47df80cc9f6a6acb120c57c failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:21f2620684e969a963316a44b413b9743a78dc83c47df80cc9f6a6acb120c57c: Get "https://quay.io/v2/": Forbidden
Aug 03 11:19:33 ip-10-0-66-177 machine-config-daemon[1939]: : exit status 1
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Aug 03 11:19:33 ip-10-0-66-177 systemd[1]: machine-config-daemon-firstboot.service: Consumed 857ms CPU time

Failed to provision master node.

Version-Release number of the following components:
OCP 4.6.0-0.nightly-2020-08-03-054919

How reproducible:
Always

Steps to Reproduce:
1. Create a disconnected cluster

Actual results:
Cluster creation fails

Expected results:
Cluster is created successfully

Additional info:
1. Reproduced the problem on GCP and vSphere
2. The above error message does not appear when creating a 4.5 disconnected cluster

Comment 1 Yunfei Jiang 2020-08-03 12:00:46 UTC
Created attachment 1703284 [details]
machine-config-daemon-firstboot.service

Comment 2 Yunfei Jiang 2020-08-03 12:01:25 UTC
Created attachment 1703285 [details]
bootstrap log

Comment 3 Yunfei Jiang 2020-08-03 12:02:36 UTC
This bug blocks all tests against disconnected environments.

Comment 4 Stephen Benjamin 2020-08-03 15:21:52 UTC
We see the same on all baremetal IPv6 jobs, which must be disconnected due to quay not supporting IPv6:


Aug 03 12:25:32 master-0.ostest.test.metalkube.org machine-config-daemon[2451]: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70cfcdee7fa0eac2578f32f197b410d0f50d5bb10ac56ba402eb758e50e76d04: Get "https://quay.io/v2/": dial tcp 34.198.42.182:443: connect: network is unreachable

Comment 5 Stephen Benjamin 2020-08-03 15:27:53 UTC
I think this belongs against MCO, which is the source of the problem. Either MCD is running an `oc` command that's not accounting for disconnected installs, or `oc` itself has a problem.

Comment 6 Antonio Murdaca 2020-08-03 16:13:56 UTC
Is https://github.com/openshift/enhancements/pull/334 another option to avoid _not_ using oc?

Comment 7 Kirsten Garrison 2020-08-03 16:33:38 UTC
Adding upgrade blocker tag as well as we don't have any indication it doesn't block as of now. Feel free to remove if we find otherwise.

Comment 8 Vadim Rutkovsky 2020-08-03 16:35:03 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 9 Antonio Murdaca 2020-08-04 09:47:31 UTC
*** Bug 1863335 has been marked as a duplicate of this bug. ***

Comment 10 Sinny Kumari 2020-08-04 10:26:29 UTC
After having a brainstorming session with Antonio today, we came up with another solution to fix the problem and this involves minimal changes:
- We keep the current implementation (i.e., keep using `oc image extract`) of CoreOS extensions support
- Until the oc fix to support mirror registries lands, when `oc image extract` fails we fall back to copying machine-os-content onto the node using `podman pull osImageURL && podman create osImageURL && podman cp container_ID:/ /run/machine-os-content/os-content-XXXX`

The fallback solution is applied only when oc image extract has failed.
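
For readers following along, the fallback flow described above can be sketched roughly as below. This is only an illustration in Python, not the actual MCO change (the MCO itself is written in Go). The `oc image extract` arguments are taken from the log in the description and the podman commands from this comment; the function names, the injectable `run` parameter, and the final `podman rm` cleanup are assumptions added for the sketch.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_command(argv: List[str]) -> str:
    """Run argv; raise CalledProcessError on a non-zero exit, else return stdout."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def extract_os_content(image: str, dest: str, run: Runner = run_command) -> str:
    """Copy the image's root filesystem into dest (hypothetical helper).

    Returns "oc" if `oc image extract` worked, or "podman" if the
    fallback path was used.
    """
    oc_cmd = [
        "oc", "image", "extract",
        "--path", f"/:{dest}",
        "--registry-config", "/var/lib/kubelet/config.json",
        image,
    ]
    try:
        run(oc_cmd)
        return "oc"
    except subprocess.CalledProcessError:
        # Fall through to podman only when oc image extract has failed.
        pass

    run(["podman", "pull", image])
    # `podman create` prints the new container ID on stdout.
    container_id = run(["podman", "create", image]).strip()
    try:
        run(["podman", "cp", f"{container_id}:/", dest])
    finally:
        # Assumption: remove the temporary container; not mentioned in the comment.
        run(["podman", "rm", container_id])
    return "podman"
```

Passing the runner in as a parameter keeps the sketch testable on a machine without `oc` or `podman`; on a real node the default `run_command` would be used.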

Comment 11 Vadim Rutkovsky 2020-08-06 07:50:40 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 12 Vadim Rutkovsky 2020-08-06 07:51:51 UTC
This bug seems to affect proxy environments too (which are similar to disconnected - image cannot be downloaded directly from quay and `oc image extract` doesn't take mirrors/proxies into account)

Comment 16 Antonio Murdaca 2020-08-07 09:38:23 UTC
(In reply to Vadim Rutkovsky from comment #12)
> This bug seems to affect proxy environments too (which are similar to
> disconnected - image cannot be downloaded directly from quay and `oc image
> extract` doesn't take mirrors/proxies into account)

thanks Vadim, we're tackling that separately

Comment 17 Sinny Kumari 2020-08-07 09:59:12 UTC
(In reply to Antonio Murdaca from comment #16)
> (In reply to Vadim Rutkovsky from comment #12)
> > This bug seems to affect proxy environments too (which are similar to
> > disconnected - image cannot be downloaded directly from quay and `oc image
> > extract` doesn't take mirrors/proxies into account)
> 
> thanks Vadim, we're tackling that separately

Perhaps we should reopen the bug https://bugzilla.redhat.com/show_bug.cgi?id=1862948 which has proxy setup.

Comment 18 Micah Abbott 2020-08-10 20:01:57 UTC
@yunjiang Mike N. is on paternity leave and additionally, we do not have the infrastructure to test disconnected installs.  Would it be possible that you could retest this and indicate if the BZ is verified?

Comment 19 Vadim Rutkovsky 2020-08-11 07:47:54 UTC
(In reply to Sinny Kumari from comment #17)
> (In reply to Antonio Murdaca from comment #16)
> > (In reply to Vadim Rutkovsky from comment #12)
> > > This bug seems to affect proxy environments too (which are similar to
> > > disconnected - image cannot be downloaded directly from quay and `oc image
> > > extract` doesn't take mirrors/proxies into account)
> > 
> > thanks Vadim, we're tackling that separately
> 
> Perhaps we should reopen the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=1862948 which has proxy setup.

No, the proxy issue is caused by the very same root cause (so I closed the proxy bug as a dupe)

Comment 20 Vadim Rutkovsky 2020-08-11 07:49:51 UTC
*** Bug 1862948 has been marked as a duplicate of this bug. ***

Comment 21 Yunfei Jiang 2020-08-11 09:57:06 UTC
verified. PASS.
version: 4.6.0-0.nightly-2020-08-10-180431

Comment 23 errata-xmlrpc 2020-10-27 16:22:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 24 W. Trevor King 2021-04-05 17:47:04 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475