1745811 – OCP 4.2 - Azure image-registry pods in CrashLoopBackoff after cluster install

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	4.2.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Ricardo Maraschini
QA Contact:	XiuJuan Wang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-08-27 02:04 UTC by Walid A.
Modified:	2019-10-16 06:38 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:37:54 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-image-registry-operator pull 382	0	None	None	None	2019-09-03 08:15:23 UTC
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:38:02 UTC

Description Walid A. 2019-08-27 02:04:47 UTC

Description of problem:
After installing a 4.2 OCP cluster in Azure using nightly build 4.2.0-0.nightly-2019-08-25-233755, the image-registry cluster operator has not successfully rolled out. Both image-registry pods in the openshift-image-registry namespace are in the CrashLoopBackOff state.


# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          152m    Unable to apply 4.2.0-0.nightly-2019-08-25-233755: the cluster operator image-registry has not yet successfully rolled out

# oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry             False       True          False      10h

# oc get pods --all-namespaces | grep registry
openshift-image-registry                                cluster-image-registry-operator-564bd5dd8f-rg7dk                  2/2     Running            0          4h16m
openshift-image-registry                                image-registry-5d76bcb8d8-jnb48                                   0/1     CrashLoopBackOff   54         4h16m
openshift-image-registry                                image-registry-89b7dbc78-9kn2k                                    0/1     CrashLoopBackOff   54         4h16m
openshift-image-registry                                node-ca-6kfq8                                                     1/1     Running            0          4h16m
openshift-image-registry                                node-ca-c2g7l                                                     1/1     Running            0          4h16m
openshift-image-registry                                node-ca-g5597                                                     1/1     Running            0          4h16m
openshift-image-registry                                node-ca-mxxcd                                                     1/1     Running            0          4h16m
openshift-image-registry                                node-ca-nw8g5                                                     1/1     Running            0          4h16m

# oc logs -n openshift-image-registry image-registry-5d76bcb8d8-jnb48 
time="2019-08-26T19:18:17.207010527Z" level=info msg="start registry" distribution_version=v2.6.0+unknown go.version=go1.11.6 openshift_version=v4.2.0-201908251340+a3a2106-dirty
time="2019-08-26T19:18:17.207664529Z" level=info msg="caching project quota objects with TTL 1m0s" go.version=go1.11.6
panic: storage: service returned error: StatusCode=400, ErrorCode=InvalidResourceName, ErrorMessage=The specifed resource name contains invalid characters.
RequestId:7faf7f4c-901e-00cc-2443-5cbfc0000000
Time:2019-08-26T19:18:17.3080595Z, RequestInitiated=Mon, 26 Aug 2019 19:18:16 GMT, RequestId=7faf7f4c-901e-00cc-2443-5cbfc0000000, API Version=2016-05-31, QueryParameterName=, QueryParameterValue=

goroutine 1 [running]:
github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers.NewApp(0x199c320, 0xc000042038, 0xc0004f1180, 0xc0002d6a00)
	/go/src/github.com/openshift/image-registry/vendor/github.com/docker/distribution/registry/handlers/app.go:127 +0x335a
github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware.NewApp(0x199c320, 0xc000042038, 0xc0004f1180, 0x19a28e0, 0xc000496ea0, 0x19a4900)
	/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/supermiddleware/app.go:96 +0x85
github.com/openshift/image-registry/pkg/dockerregistry/server.NewApp(0x199c320, 0xc000042038, 0x1982aa0, 0xc00000e738, 0xc0004f1180, 0xc0004006e0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/image-registry/pkg/dockerregistry/server/app.go:138 +0x287
github.com/openshift/image-registry/pkg/cmd/dockerregistry.NewServer(0x199c320, 0xc000042038, 0xc0004f1180, 0xc0004006e0, 0x0, 0x0, 0x19bc4a0)
	/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:187 +0x190
github.com/openshift/image-registry/pkg/cmd/dockerregistry.Execute(0x197f280, 0xc00000e1a8)
	/go/src/github.com/openshift/image-registry/pkg/cmd/dockerregistry/dockerregistry.go:162 +0x96a
main.main()
	/go/src/github.com/openshift/image-registry/cmd/dockerregistry/main.go:93 +0x3f6
 

# oc get pods -n openshift-image-registry 
NAME                                               READY   STATUS             RESTARTS   AGE
cluster-image-registry-operator-564bd5dd8f-rg7dk   2/2     Running            0          4h17m
image-registry-5d76bcb8d8-jnb48                    0/1     CrashLoopBackOff   54         4h17m
image-registry-89b7dbc78-9kn2k                     0/1     CrashLoopBackOff   54         4h17m
node-ca-6kfq8                                      1/1     Running            0          4h17m
node-ca-c2g7l                                      1/1     Running            0          4h17m
node-ca-g5597                                      1/1     Running            0          4h17m
node-ca-mxxcd                                      1/1     Running            0          4h17m
node-ca-nw8g5                                      1/1     Running            0          4h17m


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-25-233755

How reproducible:
Happened once after install

Steps to Reproduce:
1.  Install an Azure 4.2 cluster with  worker nodes and 3 worker nodes in eastus, instance-type=Standard_D4s_v3, using nightly build:
OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE: registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-08-25-233755
2.  Copy the kubeconfig to your jump host and check cluster by running oc commands
3.  Run oc get co, oc get clusterversion, and oc get pods -n openshift-image-registry

Actual results:

image-registry-5d76bcb8d8-jnb48                    0/1     CrashLoopBackOff   54         4h17m
image-registry-89b7dbc78-9kn2k                     0/1     CrashLoopBackOff   54         4h17m

Expected results:
The image-registry cluster operator should be deployed successfully and report Available,
and the image-registry pods should be Running.

Additional info:
Links to the must-gather logs and pod logs will be in the next comment.

Comment 2 XiuJuan Wang 2019-08-27 05:37:15 UTC

I cannot reproduce this issue with the 4.2.0-0.nightly-2019-08-26-202352 payload:
$ oc get pods  -n openshift-image-registry 
NAME                                              READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-dbf75d975-mfb5w   2/2     Running   0          66m
image-registry-58654c58cd-gsctf                   1/1     Running   0          63m
node-ca-9522m                                     1/1     Running   0          63m
node-ca-gsh84                                     1/1     Running   0          63m
node-ca-j2nnp                                     1/1     Running   0          63m
node-ca-tz627                                     1/1     Running   0          63m
node-ca-zd5r2                                     1/1     Running   0          63m
[xiuwang@dhcp-141-173 tmp]$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-08-26-202352   True        False         60m     Cluster version is 4.2.0-0.nightly-2019-08-26-202352


@Walid
Could you paste the version of the installer you used?

Comment 3 Walid A. 2019-08-27 09:22:17 UTC

@XiuJuan I used the installer from the automated installer Jenkins job:

./openshift-install v4.2.0-201908251340-dirty
built from commit c2e6b0afd7f33ae0125d1ac96f3948919748ffc5
release image registry.svc.ci.openshift.org/ocp/release@sha256:0143dc81a5d87bba07c20b7800742b58e24a201f1a1706b0a4f5374cab22e415

Comment 4 XiuJuan Wang 2019-08-28 01:53:01 UTC

Hmm, my Jenkins jobs installed successfully on Azure yesterday with both 4.2.0-0.nightly-2019-08-25-233755 and 4.2.0-0.nightly-2019-08-26-202352.

Comment 5 Nick Stielau 2019-08-29 16:24:48 UTC

Right now we're at about a 50% pass rate on Azure CI, which runs every 4 hours and on demand from PRs. You may be hitting some of the same flakes that CI is.


https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2

Comment 9 XiuJuan Wang 2019-09-09 03:44:22 UTC

When REGISTRY_STORAGE_AZURE_CONTAINER=qe-xiuwang--azure-909-74z7s-image-registry-qyctpisglrltvbixdqrw is set manually on the 4.2.0-0.nightly-2019-09-08-180038 version, the image-registry pod goes into CrashLoopBackOff.

It would be better to reject a container name that contains a sequence of dashes when it is saved in config.imageregistry.operator.openshift.io.
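
To illustrate the kind of check meant here, below is a minimal Go sketch of a container-name validator. It assumes the Azure Blob Storage naming rules as I understand them: 3 to 63 characters, only lowercase letters, digits, and dashes, and every dash directly preceded and followed by a letter or digit. The helper name validAzureContainerName is hypothetical and is not taken from the operator code.

```go
package main

import (
	"fmt"
	"regexp"
)

// containerNameRE matches one or more runs of lowercase letters/digits
// joined by single dashes. This rules out leading, trailing, and
// consecutive dashes in one pattern.
var containerNameRE = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)

// validAzureContainerName is a hypothetical helper (not from the operator)
// that reports whether name satisfies the assumed Azure container rules.
func validAzureContainerName(name string) bool {
	// Assumed length limits: 3 to 63 characters.
	if len(name) < 3 || len(name) > 63 {
		return false
	}
	return containerNameRE.MatchString(name)
}

func main() {
	names := []string{
		// Name from this comment; it contains "--".
		"qe-xiuwang--azure-909-74z7s-image-registry-qyctpisglrltvbixdqrw",
		// Same name with the double dash collapsed (illustrative only).
		"qe-xiuwang-azure-909-74z7s-image-registry-qyctpisglrltvbixdqrw",
	}
	for _, n := range names {
		fmt.Printf("%s valid=%t\n", n, validAzureContainerName(n))
	}
}
```

A check along these lines could reject the value up front instead of letting the registry pod panic with InvalidResourceName at startup.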

Comment 10 Adam Kaplan 2019-09-09 12:05:53 UTC

XiuJuan - the fix ensures that we don't have an invalid container name for IPI installs, or UPI installations where the container name is not provided. If a customer provides a bad container name for UPI installs, failing with a CrashLoopBackoff - though not ideal - is not a release blocker. Please file a separate bug for this.

Please verify that when a container name is not provided (UPI or IPI installs), we generate a valid container name on Azure.

Comment 11 XiuJuan Wang 2019-09-10 03:06:32 UTC

Adam,
Thanks, I will open a new low-severity bug to track the issue in comment #9.

I installed several Azure clusters with IPI installation and also checked Azure CI, and did not find any image-registry pods in CrashLoopBackOff.
None of the generated container names contains a sequence of dashes, for example hongli-azupg-x2v5q-image-registry-uckvbxabnnvpkfmdopbgjxuyuici

Checked with 4.2.0-0.nightly-2019-09-08-180038 payload.

Comment 12 errata-xmlrpc 2019-10-16 06:37:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922