Description of problem:
Bootstrapping a GCP 4.5 cluster in a shared VPC has recently been failing.

Version-Release number of the following components:
4.5.0-0.nightly-2020-05-20-053050

How reproducible:
Often

Steps to Reproduce:
1. Install a GCP cluster in a shared VPC

Actual results:
Installation fails

Expected results:
Installation is successful

Additional info:
May 20 12:17:01 yy4-f7r26-bootstrap.c.openshift-qe.internal bootkube.sh[2130]: Starting temporary bootstrap control plane...
May 20 12:17:01 yy4-f7r26-bootstrap.c.openshift-qe.internal bootkube.sh[2130]: E0520 12:17:01.386322 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods: dial tcp [::1]:6443: connect: connection refused
May 20 12:17:01 yy4-f7r26-bootstrap.c.openshift-qe.internal bootkube.sh[2130]: [#1] failed to fetch discovery: Get https://localhost:6443/api?timeout=32s: dial tcp [::1]:6443: connect: connection refused
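For context on the reproduction step, a shared-VPC (XPN) install on GCP points the installer at a pre-existing network and subnets. The fragment below is a minimal sketch of the relevant install-config.yaml fields; the project, region, network, and subnet names are illustrative placeholders, and exact field support may vary by release.

```yaml
# Sketch only -- names are placeholders, not from this bug report.
platform:
  gcp:
    projectID: service-project        # project the cluster nodes run in
    region: us-central1
    network: shared-vpc               # pre-existing VPC (host project)
    controlPlaneSubnet: control-subnet
    computeSubnet: compute-subnet
```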
Created attachment 1690515 [details] Log file
In your description, you said bootstrapping fails "recently". Does this indicate you had it working with a version prior to 4.5.0-0.nightly-2020-05-20-053050? I'm trying the same version using my XPN script to see if I can reproduce.
The issue happens often, but not always. With 4.5.0-0.nightly-2020-05-18-225907 the installation sometimes works and sometimes fails.
I do not see this error today, so I am removing the testblocker keyword.
It looks like the cluster failed to properly configure the pause image:

May 21 03:00:59 yy4-f7r26-bootstrap.c.openshift-qe.internal crio[2018]: time="2020-05-21 03:00:59.190732656Z" level=warning msg="imageStatus: can't find k8s.gcr.io/pause:3.1" id=9ed10c1f-21df-452b-b7a2-e4d4e772f1aa
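For reference, the pause image cri-o uses comes from the `pause_image` setting in its configuration; the warning above suggests the node fell back to the upstream `k8s.gcr.io/pause:3.1` default rather than an image from the release payload. A sketch of the setting involved, with an illustrative (not actual) image reference:

```toml
# crio.conf fragment (sketch; image digest is a placeholder).
# On OpenShift nodes this is normally rendered to point at the
# release payload's pod image instead of the k8s.gcr.io default.
[crio.image]
pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:..."
pause_image_auth_file = "/var/lib/kubelet/config.json"
```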
Created attachment 1694987 [details] bootstrap log tarball
Created attachment 1697753 [details] Bootstrap log
Moving to ON_QA since we think this should be fixed after the cluster ID changes and the switch to a NAT in auto mode.
Experienced an image pull failure on a worker node 2 out of 5 times with 4.6.0-0.nightly-2020-07-14-035247. I'm not sure whether there's something wrong with quay.io; I'll keep monitoring it.

$ systemctl status machine-config-daemon-pull.service
● machine-config-daemon-pull.service - Machine Config Daemon Pull
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-pull.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2020-07-14 07:34:14 UTC; 1h 0min ago
  Process: 1680 ExecStart=/bin/sh -c /usr/bin/podman pull --authfile=/var/lib/kubelet/config.json --quiet 'quay>
  Process: 1665 ExecStart=/bin/sh -c /bin/mkdir -p /run/bin && chcon --reference=/usr/bin /run/bin (code=exited>
 Main PID: 1680 (code=exited, status=125)
      CPU: 510ms

Jul 14 07:33:39 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Pull...
Jul 14 07:34:14 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal sh[1680]: Error: error pulling image "quay.io/opens>
Jul 14 07:34:14 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-pull.service: Mai>
Jul 14 07:34:14 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-pull.service: Fai>
Jul 14 07:34:14 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal systemd[1]: Failed to start Machine Config Daemon P>
Jul 14 07:34:14 yyxpn12-gq4xq-w-a-0.c.openshift-qe.internal systemd[1]: machine-config-daemon-pull.service: Con>
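Since the unit above runs a single `podman pull` and fails hard on a transient registry error, one way to probe whether quay.io flakiness is the cause is to re-run the pull a few times by hand. A minimal retry sketch; `retry` is a hypothetical helper (not part of the MCO), and the `echo` stands in for the unit's actual `podman pull --authfile=/var/lib/kubelet/config.json --quiet quay.io/...` command:

```shell
#!/bin/sh
# Hypothetical retry helper: runs a command up to $1 times,
# returning 1 only if every attempt fails.
retry() {
    attempts=$1; shift
    n=1
    until "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep 1
    done
}

# Stand-in for the failing podman pull from the unit above.
retry 3 echo "pull succeeded"
```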
It has not reproduced recently. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196