Bug 1702519 - Cluster operator machine-config is reporting a failure: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0) pivot choking on Quay 429s
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.1.z
Assignee: Colin Walters
QA Contact: Micah Abbott
URL:
Whiteboard:
Duplicates: 1702518 (view as bug list)
Depends On:
Blocks:
 
Reported: 2019-04-24 02:35 UTC by Ryan Cook
Modified: 2020-04-22 16:24 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-22 16:24:19 UTC
Target Upstream Version:
Embargoed:


Attachments
must gather results (1.45 MB, application/gzip)
2019-04-24 03:26 UTC, Ryan Cook


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 1674 0 None closed Bug 1702519: rhcos: Bump to 410.8.20190425.0 2020-11-24 21:05:09 UTC
Github openshift pivot pull 51 0 None closed Don't use `skopeo inspect` for canonical (digested) pull specs 2020-11-24 21:05:07 UTC
Red Hat Product Errata RHBA-2020:1446 0 None None None 2020-04-22 16:24:23 UTC

Description Ryan Cook 2019-04-24 02:35:55 UTC
Description of problem: Cannot install new clusters due to the following error

DEBUG Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-rc.0 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0) 

Version-Release number of the following components:
./openshift-install v4.1.0-201904211700-dirty
built from commit f3b726cc151f5a3d66bc7e23e81b3013f1347a7e
release image quay.io/openshift-release-dev/ocp-release@sha256:345ec9351ecc1d78c16cf0853fe0ef2d9f48dd493da5fdffc18fa18f45707867


How reproducible: Attempt to perform an install. I'm using east-1.

Steps to Reproduce:
1. Attempt to deploy
2. Wait to see whether machine-config is successful

Actual results:
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 78% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 89% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 91% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 95% complete 
DEBUG Still waiting for the cluster to initialize: Multiple errors are preventing progress:
* Cluster operator authentication is still updating: missing version information for integrated-oauth-server
* Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-rc.0 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0)
* Cluster operator monitoring is still updating
* Cluster operator openshift-samples is still updating 
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 98% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.1.0-rc.0: 99% complete 
DEBUG Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-rc.0 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0) 
DEBUG Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed to resync 4.1.0-rc.0 because: error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0) 



Expected results:
Clean install

Additional info:
This happened with the 0.16.1 installer and the 4.0.0-0.9 release

Comment 1 Ryan Cook 2019-04-24 03:26:36 UTC
Created attachment 1557952 [details]
must gather results

Comment 3 W. Trevor King 2019-04-24 05:13:24 UTC
*** Bug 1702518 has been marked as a duplicate of this bug. ***

Comment 4 W. Trevor King 2019-04-24 05:14:15 UTC
Linking possibly-related discussion in bug 1671816, bug 1677198, bug 1695721, and bug 1701409 (although they have total: 3, updated: 0, unavailable: 1).

Comment 5 W. Trevor King 2019-04-24 06:29:57 UTC
Poking around in comment 1's must-gather, other operators look happy:

  $ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators.yaml | jq -r '.items[] | (.status.conditions[] | select(.type == "Available").status) + "\t" + (.status.versions[] | select(.name == "operator").version) + "\t" + .metadata.name'
  True	4.1.0-rc.0	authentication
  True	4.1.0-rc.0	cloud-credential
  True	4.1.0-rc.0	cluster-autoscaler
  True	4.1.0-rc.0	console
  True	4.1.0-rc.0	dns
  True	4.1.0-rc.0	image-registry
  True	4.1.0-rc.0	ingress
  True	4.1.0-rc.0	kube-apiserver
  True	4.1.0-rc.0	kube-controller-manager
  True	4.1.0-rc.0	kube-scheduler
  True	4.1.0-rc.0	machine-api
  False	4.1.0-rc.0	machine-config
  True	4.1.0-rc.0	marketplace
  True	4.1.0-rc.0	monitoring
  True	4.1.0-rc.0	network
  True	4.1.0-rc.0	node-tuning
  True	4.1.0-rc.0	openshift-apiserver
  True	4.1.0-rc.0	openshift-controller-manager
  True	4.1.0-rc.0	openshift-samples
  True	4.1.0-rc.0	operator-lifecycle-manager
  True	4.1.0-rc.0	operator-lifecycle-manager-catalog
  True	4.1.0-rc.0	service-ca
  True	4.1.0-rc.0	service-catalog-apiserver
  True	4.1.0-rc.0	service-catalog-controller-manager
  True	4.1.0-rc.0	storage

Time when we started trying to remove the bootstrap crutch:

  $ yaml2json <namespaces/kube-system/core/configmaps.yaml | jq -r '.items[] | select(.metadata.name == "bootstrap").metadata.creationTimestamp'
  2019-04-24T03:07:12Z

Times for those logs:

  $ cat namespaces/openshift-machine-config-operator/pods/machine-config-operator-f868ddcc8-pr64p/machine-config-operator/machine-config-operator/logs/current.log
  2019-04-24T03:04:19.520498032Z I0424 03:04:19.520248       1 start.go:42] Version: 4.1.0-201904211700-dirty
  ...
  2019-04-24T03:04:55.445250004Z I0424 03:04:55.445183       1 sync.go:56] Initialization complete
  2019-04-24T03:07:24.909684422Z W0424 03:07:24.909172       1 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: too old resource version: 3112 (4473)
  2019-04-24T03:07:24.909762056Z W0424 03:07:24.909629       1 reflector.go:270] k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/factory.go:117: watch of *v1beta1.CustomResourceDefinition ended with: too old resource version: 2283 (4016)
  2019-04-24T03:07:24.909803993Z W0424 03:07:24.909709       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1.ServiceAccount ended with: too old resource version: 3178 (4020)
  2019-04-24T03:07:24.909922814Z W0424 03:07:24.909887       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1.DaemonSet ended with: too old resource version: 3365 (4031)
  2019-04-24T03:07:24.910079245Z W0424 03:07:24.910051       1 reflector.go:270] k8s.io/client-go/informers/factory.go:132: watch of *v1.Deployment ended with: too old resource version: 2931 (4031)
  2019-04-24T03:07:24.941899618Z W0424 03:07:24.941869       1 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfigPool ended with: too old resource version: 3416 (4473)
  2019-04-24T03:07:24.996911626Z W0424 03:07:24.996874       1 reflector.go:270] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.Infrastructure ended with: too old resource version: 1700 (5257)
  2019-04-24T03:07:25.017919495Z W0424 03:07:25.015938       1 reflector.go:270] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: watch of *v1.Network ended with: too old resource version: 1573 (5257)
  2019-04-24T03:07:25.154733485Z W0424 03:07:25.154690       1 reflector.go:270] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.ControllerConfig ended with: too old resource version: 2936 (5258)
  2019-04-24T03:08:59.200733656Z E0424 03:08:59.200068       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0)
  ...
  2019-04-24T03:22:56.025285931Z E0424 03:22:56.025237       1 operator.go:279] error pool master is not ready, retrying. Status: (total: 3, updated: 0, unavailable: 0)

So things start to go bad a few seconds after we get into bootstrap teardown.  This suggests the production control-plane was not as ready as cluster-bootstrap thought.  Poking at the Kubernetes API-server operator:

  $ tail -n5 namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-7fccf67d7b-55skm/operator/operator/logs/previous.log
  2019-04-24T03:05:38.198377996Z W0424 03:05:38.198365       1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.crt was created
  2019-04-24T03:05:38.198450313Z I0424 03:05:38.198426       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "5bd1e7ba20a29623886f148284bcc9447886755b3f277c331cf88f1ba6021f64", lastKnown: "")
  2019-04-24T03:05:38.198494543Z F0424 03:05:38.198474       1 leaderelection.go:65] leaderelection lost
  2019-04-24T03:05:38.215799103Z I0424 03:05:38.198582       1 node_controller.go:134] Shutting down NodeController
  2019-04-24T03:05:38.215799103Z F0424 03:05:38.207396       1 builder.go:217] server exited
  $ cat namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-7fccf67d7b-55skm/operator/operator/logs/current.log 
  2019-04-24T03:05:42.773262543Z I0424 03:05:42.773141       1 cmd.go:138] Using service-serving-cert provided certificates
  ...
  2019-04-24T03:05:43.262565928Z I0424 03:05:43.262521       1 leaderelection.go:205] attempting to acquire leader lease  openshift-kube-apiserver-operator/kube-apiserver-operator-lock...
  2019-04-24T03:07:50.430738178Z I0424 03:07:50.430189       1 leaderelection.go:214] successfully acquired lease openshift-kube-apiserver-operator/kube-apiserver-operator-lock
  2019-04-24T03:07:50.430922075Z I0424 03:07:50.430884       1 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator-lock", UID:"a9c8df6d-663d-11e9-a0d9-0e347a4e8faa", APIVersion:"v1", ResourceVersion:"5839", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' d9159df0-663d-11e9-bc7f-0a580a820007 became leader
  2019-04-24T03:07:50.449530023Z I0424 03:07:50.449500       1 certrotationcontroller.go:452] Starting CertRotation
  2019-04-24T03:07:50.44970853Z I0424 03:07:50.449676       1 targetconfigcontroller.go:315] Starting TargetConfigController

Ok, 03:07:50 is a bit after 03:07:12, but this still feels like "picking up after the bootstrap operator went away".

  ...
  2019-04-24T03:07:50.552327567Z I0424 03:07:50.552306       1 prune_controller.go:335] No excluded revisions to prune, skipping
  2019-04-24T03:07:50.552649463Z I0424 03:07:50.552599       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"901b0c8f-663d-11e9-a0d9-0e347a4e8faa", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LoglevelChange' Changed loglevel level to "2"
  2019-04-24T03:07:52.264619167Z E0424 03:07:52.264568       1 monitoring_resource_controller.go:182] key failed with : the server could not find the requested resource
  2019-04-24T03:07:52.264797632Z I0424 03:07:52.264752       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"901b0c8f-663d-11e9-a0d9-0e347a4e8faa", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ServiceMonitorCreateFailed' Failed to create ServiceMonitor.monitoring.coreos.com/v1: the server could not find the requested resource

No monitoring types registered this early.  Probably ok.  Summarizing revision transitions:

$ grep -A6 NodeStatus namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-7fccf67d7b-55skm/operator/operator/logs/current.log | sed 's/^/  /'
  2019-04-24T03:07:54.247361356Z I0424 03:07:54.247313       1 installer_controller.go:305] "ip-10-1-146-153.ec2.internal" moving to (v1.NodeStatus) {
  2019-04-24T03:07:54.247361356Z  NodeName: (string) (len=28) "ip-10-1-146-153.ec2.internal",
  2019-04-24T03:07:54.247361356Z  CurrentRevision: (int32) 1,
  2019-04-24T03:07:54.247361356Z  TargetRevision: (int32) 0,
  2019-04-24T03:07:54.247361356Z  LastFailedRevision: (int32) 0,
  2019-04-24T03:07:54.247361356Z  LastFailedRevisionErrors: ([]string) <nil>
  2019-04-24T03:07:54.247361356Z }
  --
  2019-04-24T03:07:56.447392056Z I0424 03:07:56.447344       1 installer_controller.go:305] "ip-10-1-146-153.ec2.internal" moving to (v1.NodeStatus) {
  2019-04-24T03:07:56.447392056Z  NodeName: (string) (len=28) "ip-10-1-146-153.ec2.internal",
  2019-04-24T03:07:56.447392056Z  CurrentRevision: (int32) 1,
  2019-04-24T03:07:56.447392056Z  TargetRevision: (int32) 0,
  2019-04-24T03:07:56.447392056Z  LastFailedRevision: (int32) 0,
  2019-04-24T03:07:56.447392056Z  LastFailedRevisionErrors: ([]string) <nil>
  2019-04-24T03:07:56.447392056Z }
  --
  2019-04-24T03:07:58.448236954Z I0424 03:07:58.448188       1 installer_controller.go:346] "ip-10-1-138-229.ec2.internal" moving to (v1.NodeStatus) {
  2019-04-24T03:07:58.448236954Z  NodeName: (string) (len=28) "ip-10-1-138-229.ec2.internal",
  2019-04-24T03:07:58.448236954Z  CurrentRevision: (int32) 0,
  2019-04-24T03:07:58.448236954Z  TargetRevision: (int32) 1,
  2019-04-24T03:07:58.448236954Z  LastFailedRevision: (int32) 0,
  2019-04-24T03:07:58.448236954Z  LastFailedRevisionErrors: ([]string) <nil>
  2019-04-24T03:07:58.448236954Z }
  --
  2019-04-24T03:08:31.204416455Z I0424 03:08:31.204361       1 installer_controller.go:305] "ip-10-1-138-229.ec2.internal" moving to (v1.NodeStatus) {
  2019-04-24T03:08:31.204416455Z  NodeName: (string) (len=28) "ip-10-1-138-229.ec2.internal",
  2019-04-24T03:08:31.204416455Z  CurrentRevision: (int32) 1,
  2019-04-24T03:08:31.204416455Z  TargetRevision: (int32) 0,
  2019-04-24T03:08:31.204416455Z  LastFailedRevision: (int32) 0,
  2019-04-24T03:08:31.204416455Z  LastFailedRevisionErrors: ([]string) <nil>
  2019-04-24T03:08:31.204416455Z }
  --
  2019-04-24T03:08:42.006662213Z I0424 03:08:42.006613       1 installer_controller.go:346] "ip-10-1-171-77.ec2.internal" moving to (v1.NodeStatus) {
  2019-04-24T03:08:42.006662213Z  NodeName: (string) (len=27) "ip-10-1-171-77.ec2.internal",
  2019-04-24T03:08:42.006662213Z  CurrentRevision: (int32) 1,
  2019-04-24T03:08:42.006662213Z  TargetRevision: (int32) 2,
  2019-04-24T03:08:42.006662213Z  LastFailedRevision: (int32) 0,
  2019-04-24T03:08:42.006662213Z  LastFailedRevisionErrors: ([]string) <nil>
  2019-04-24T03:08:42.006662213Z }
  --
  2019-04-24T03:10:07.689588613Z I0424 03:10:07.689549       1 installer_controller.go:305] "ip-10-1-171-77.ec2.internal" moving to (v1.NodeStatus) {
  ...

So that's ip-10-1-146-153 already in revision 1, ip-10-1-138-229 transitioning to revision 1 in ~30s during this bootstrap-teardown window.  And then ip-10-1-171-77 showing up still in the window and talking about moving from revision 1 to 2?  Dunno what's going on there.  But all of these revision rotations mean we have no Kubernetes API-server logs going back to the sensitive time:

  $ head -n1 namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-*/kube-apiserver-*/kube-apiserver-*/logs/*.log  | sed 's/^/  /' | sed 's/  *$//'
  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-138-229.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/current.log <==
  2019-04-24T03:15:20.047764603Z I0424 03:15:20.047619       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-138-229.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/previous.log <==

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-138-229.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/current.log <==
  2019-04-24T03:15:20.207484722Z I0424 03:15:20.207272       1 observer_polling.go:106] Starting file observer

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-138-229.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/previous.log <==

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-146-153.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/current.log <==
  2019-04-24T03:13:31.952974714Z I0424 03:13:31.952830       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-146-153.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/previous.log <==

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-146-153.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/current.log <==
  2019-04-24T03:13:32.137985264Z I0424 03:13:32.137796       1 observer_polling.go:106] Starting file observer

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-146-153.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/previous.log <==

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-171-77.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/current.log <==
  2019-04-24T03:11:38.148597366Z I0424 03:11:38.148457       1 plugins.go:84] Registered admission plugin "NamespaceLifecycle"

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-171-77.ec2.internal/kube-apiserver-5/kube-apiserver-5/logs/previous.log <==

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-171-77.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/current.log <==
  2019-04-24T03:11:38.978312453Z I0424 03:11:38.978119       1 observer_polling.go:106] Starting file observer

  ==> namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-1-171-77.ec2.internal/kube-apiserver-cert-syncer-5/kube-apiserver-cert-syncer-5/logs/previous.log <==

Nodes are all marked Degraded:

  $ for NODE in cluster-scoped-resources/core/nodes/*; do yaml2json <"${NODE}" | jq -r '.metadata | .creationTimestamp + " " + .annotations["machine.openshift.io/machine"] + " " + .annotations["machineconfiguration.openshift.io/state"]'; done | sort
  2019-04-24T03:02:56Z openshift-machine-api/east-1-brld6-master-0 Degraded
  2019-04-24T03:03:18Z openshift-machine-api/east-1-brld6-master-1 Degraded
  2019-04-24T03:03:18Z openshift-machine-api/east-1-brld6-master-2 Degraded
  2019-04-24T03:09:22Z openshift-machine-api/east-1-brld6-worker-us-east-1a-77kwr Degraded

and we only have one worker.

Comment 6 W. Trevor King 2019-04-24 06:45:28 UTC
$ grep -B1 'Too Many Requests\|Degraded' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9r8k2/machine-config-daemon/machine-config-daemon/logs/current.log
2019-04-24T03:04:47.841890544Z I0424 03:04:47.431975    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:04:51.591795606Z time="2019-04-24T03:04:51Z" level=fatal msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"
--
2019-04-24T03:04:56.591969019Z I0424 03:04:56.241619    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:04:58.841790389Z time="2019-04-24T03:04:58Z" level=fatal msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"
--
2019-04-24T03:05:08.841855213Z I0424 03:05:08.531882    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:05:10.841823636Z time="2019-04-24T03:05:10Z" level=fatal msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"
--
2019-04-24T03:05:30.841783442Z I0424 03:05:30.515240    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:05:33.091810207Z time="2019-04-24T03:05:32Z" level=fatal msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"
--
2019-04-24T03:06:13.091793091Z I0424 03:06:12.791645    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:06:17.341824502Z time="2019-04-24T03:06:16Z" level=fatal msg="Error determining repository tags: Invalid status code returned when fetching tags list 429 (Too Many Requests)"
--
2019-04-24T03:07:37.34180683Z I0424 03:07:36.934406    4713 run.go:16] Running: skopeo inspect --authfile /var/lib/kubelet/config.json docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b
2019-04-24T03:07:43.404162266Z E0424 03:07:43.404073    4660 writer.go:119] Marking Degraded due to: failed to run pivot: failed to start pivot.service: exit status 1

Ah, there we go.  So this is the recent Quay rate-limit adjustment vs. a maybe overly careful pivot.  Assigning to the RHCOS folks to make pivot more robust in the face of registry 429s.  Possibly by... I dunno, doesn't look like it's hot-looping or anything.  But hopefully something ;).

Comment 7 Colin Walters 2019-04-24 08:19:45 UTC
> Ah, there we go.  So this is the recent Quay rate-limit adjustment vs. a maybe overly careful pivot.  Assigning to the RHCOS folks to make pivot more robust in the face of registry 429s.  Possibly by... I dunno, doesn't look like it's hot-looping or anything.  But hopefully something ;).

Hm, we could indeed change MCD+pivot to back off more gracefully in the face of a 429 and in general treat it as a non-fatal error.

Though we need to keep in mind the early pivot case where there's no MCD.

Also related: https://github.com/openshift/machine-config-operator/issues/585 which I imagine would be a big win.
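
For illustration only, here's a minimal sketch (in Go, since pivot and the MCD are Go) of the kind of 429-aware backoff being discussed.  This is not the actual pivot/MCD change; inspectWithBackoff and its parameters are hypothetical:

  // Hypothetical sketch, not the actual pivot/MCD code: retry the registry
  // query with exponential backoff when the registry answers 429, instead
  // of treating the first 429 as a fatal pivot error.
  package main

  import (
      "fmt"
      "os/exec"
      "strings"
      "time"
  )

  // inspectWithBackoff shells out to skopeo (as pivot did at the time) and
  // retries on "429 (Too Many Requests)" up to maxAttempts, doubling the
  // delay between attempts, rather than failing the whole pivot at once.
  func inspectWithBackoff(pullSpec string, maxAttempts int) (string, error) {
      delay := 5 * time.Second
      var lastErr error
      for attempt := 1; attempt <= maxAttempts; attempt++ {
          out, err := exec.Command("skopeo", "inspect", "docker://"+pullSpec).CombinedOutput()
          if err == nil {
              return string(out), nil
          }
          lastErr = fmt.Errorf("skopeo inspect failed: %v: %s", err, out)
          // Only keep retrying if the failure looks like registry throttling.
          if !strings.Contains(string(out), "429") {
              return "", lastErr
          }
          time.Sleep(delay)
          delay *= 2 // back off: 5s, 10s, 20s, ...
      }
      return "", fmt.Errorf("gave up after %d attempts: %v", maxAttempts, lastErr)
  }

  func main() {
      out, err := inspectWithBackoff("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b", 5)
      if err != nil {
          fmt.Println("error:", err)
          return
      }
      fmt.Println(out)
  }

A real fix would also have to cover the early-pivot case mentioned above, where there's no MCD running yet.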

Comment 9 Ryan Cook 2019-04-24 13:33:59 UTC
Is there any difference in the endpoints or the rate limiting in us-east-1? Last night I had about four failures and this morning a coworker in another time zone had close to five. I ask because after his last failure in us-east-1 I deployed in ca-central without incident. I also had success in us-east-2 and us-west-2 without the error.

Comment 10 Micah Abbott 2019-04-24 13:49:07 UTC
I filed an issue with `containers/image` to handle HTTP 429 a bit more gracefully - https://github.com/containers/image/issues/618

We'll still need to adjust `pivot` to react appropriately, too.

Comment 11 W. Trevor King 2019-04-25 06:11:24 UTC
pivot#51 landed.
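
For context, a rough sketch of the idea behind that PR's title ("Don't use `skopeo inspect` for canonical (digested) pull specs"): when the pull spec is already pinned by digest, there is nothing for `skopeo inspect` to resolve, so the throttling-prone registry round-trip can be skipped entirely.  The helper below is illustrative only, not the actual pivot code:

  // Hypothetical sketch of skipping the registry query for digested pull specs.
  package main

  import (
      "fmt"
      "strings"
  )

  // digestFromPullSpec returns the digest embedded in a canonical pull spec
  // such as "quay.io/foo/bar@sha256:...", or "" if the spec is tag-based and
  // would still need a registry lookup to resolve a digest.
  func digestFromPullSpec(pullSpec string) string {
      if i := strings.LastIndex(pullSpec, "@"); i != -1 {
          return pullSpec[i+1:]
      }
      return ""
  }

  func main() {
      spec := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b83c73b6ff400b6afe9ff5e079e1195379872a0571c002ac0a01f1c37f47aa8b"
      if d := digestFromPullSpec(spec); d != "" {
          fmt.Println("already canonical, no skopeo inspect needed:", d)
      } else {
          fmt.Println("tag-based spec, would still need a registry query")
      }
  }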

Comment 13 Colin Walters 2019-04-25 12:23:38 UTC
https://github.com/openshift/installer/pull/1674

Comment 14 W. Trevor King 2019-04-25 20:29:59 UTC
installer#1674 landed, and is live in CI:

  $ oc adm release info --commits --changes-from registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-04-25-183706 registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-04-25-184318
            FROM                          TO
  Name:     4.1.0-0.ci-2019-04-25-183706  4.1.0-0.ci-2019-04-25-184318
  Created:  1h                            1h

  Version:       4.1.0-0.ci-2019-04-25-183706  4.1.0-0.ci-2019-04-25-184318
  Upgrade From:                                No

  Images Changed:
    installer                                     https://github.com/openshift/installer/compare/5f91f75072cdafa4806b38d9c6f65113a4b1b5c0...4097497cd768e2535b211e68bbe2a831f58513f0
    installer-artifacts                           https://github.com/openshift/installer/compare/5f91f75072cdafa4806b38d9c6f65113a4b1b5c0...4097497cd768e2535b211e68bbe2a831f58513f0

  Images Rebuilt:

That release image still has an old machine-os-content though, so future pivots (e.g. on upgrades) may still hit this if/when Quay brings the limit back down:

  $ oc image info $(oc adm release info --image-for=machine-os-content registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-04-25-184318)
  Name:       registry.svc.ci.openshift.org/ocp/4.1-2019-04-25-184318@sha256:a0cd0eb5ec7f676ea7b64d4d2e48abcad4a7649861d54532fce3fdca9ab19a44
  Media Type: application/vnd.docker.distribution.manifest.v2+json
  Created:    3d ago
  Image Size: 623.1MB
  OS:         linux
  Arch:       amd64
  Entrypoint: /noentry
  Labels:     com.coreos.ostree-commit=125584ea400554bf0a9e743964c3a4b87b0ae8100ba2f4d7c20a1cbadbbb8df5
              version=410.8.20190422.0

I'll see if I can figure out how the machine-os-content bump is getting along.

Comment 15 W. Trevor King 2019-04-25 21:16:03 UTC
$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:0a799822de435fcff54ea64868607eb8b61861a57f16079f8d95f9bb87068c2c
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    8h ago
Image Size: 623.2MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=0f7d2687f9d9dfcbcd64bc030ced149dd196b09b5e3633a9186915b8c0764cd7
            version=410.8.20190425.1

Green promotion from [1,2], but that was before the above image.  Not sure why we haven't had another promotion run since.  Checking a random recent release-promotion job:

$ oc image info $(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/221/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name')
Name:       registry.svc.ci.openshift.org/ocp/4.1-2019-04-25-194305@sha256:a0cd0eb5ec7f676ea7b64d4d2e48abcad4a7649861d54532fce3fdca9ab19a44
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    3d ago
Image Size: 623.1MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=125584ea400554bf0a9e743964c3a4b87b0ae8100ba2f4d7c20a1cbadbbb8df5
            version=410.8.20190422.0

So we need a new, green promotion.

[1]: https://prow.svc.ci.openshift.org/?job=release-promote-openshift-machine-os-content-e2e-aws-4.1
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/46

Comment 16 W. Trevor King 2019-04-26 16:53:22 UTC
Green promotion [1], and recent CI runs have the new machine-os-content:

$ oc image info $(curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/267/artifacts/release-images-latest/release-images-latest | jq -r '.spec.tags[] | select(.name == "machine-os-content").from.name')
Name:       registry.svc.ci.openshift.org/ocp/4.1-2019-04-26-142222@sha256:0a799822de435fcff54ea64868607eb8b61861a57f16079f8d95f9bb87068c2c
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    1d ago
Image Size: 623.2MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=0f7d2687f9d9dfcbcd64bc030ced149dd196b09b5e3633a9186915b8c0764cd7
            version=410.8.20190425.1

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/49

Comment 18 Sergiusz Urbaniak 2019-10-28 09:55:35 UTC
fwiw we saw one of those failures recently:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.1/472

Comment 21 errata-xmlrpc 2020-04-22 16:24:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1446

