Description of problem: Many tests are failing on GCP jobs because the internal registry hostname is not published: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5

Sample job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1435

Sample output:

```
[sig-builds][Feature:Builds][valueFrom] process valueFrom in build strategy environment variables should fail resolving unresolvable valueFrom in docker build environment variable references [Suite:openshift/conformance/parallel]
fail [github.com/openshift/origin/test/extended/builds/valuefrom.go:42]: Unexpected error:
    <*errors.errorString | 0xc0014f26c0>: {
        s: "Timed out waiting for internal registry hostname to be published",
    }
    Timed out waiting for internal registry hostname to be published
occurred
```

How reproducible: frequently failing on GCP; does not appear to be happening on AWS.
Note that this could also be an issue in the openshift controller manager itself picking up the published internal hostname. I see at least some indication the OCM is crashlooping in that run:

```
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]
fail [github.com/openshift/origin/test/extended/operators/cluster.go:114]: Expected <[]string | len:2, cap:2>: [
```

Pod openshift-controller-manager/controller-manager-74652 is not healthy:

```
I0415 22:14:49.423486 1 controller_manager.go:39] Starting controllers on 0.0.0.0:8443 (unknown)
I0415 22:14:49.426100 1 controller_manager.go:50] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.5-2020-04-15-215108@sha256:0564a1ebc3c08432fb92c0d0c7ec2788ec7081e2455385f56d7e16278e3e4c8e"
I0415 22:14:49.426130 1 controller_manager.go:56] Build controller using images from "registry.svc.ci.openshift.org/ocp/4.5-2020-04-15-215108@sha256:a20fbee183c84d16ca8df9664499f43e4097dda6e65dba39244efbbcd26d6d82"
I0415 22:14:49.426266 1 standalone_apiserver.go:98] Started health checks at 0.0.0.0:8443
I0415 22:14:49.426342 1 leaderelection.go:242] attempting to acquire leader lease openshift-controller-manager/openshift-master-controllers...
```

Pod openshift-insights/insights-operator-5fb99cf8-cbqnx is not healthy:

```
:298] The operator is healthy
I0415 22:30:06.892659 1 reflector.go:418] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Watch close - *v1.ConfigMap total 0 items received
I0415 22:30:06.898937 1 reflector.go:418] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Watch close - *v1.ConfigMap total 0 items received
I0415 22:30:07.204060 1 reflector.go:324] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: watch of *v1.ConfigMap ended with: too old resource version: 20430 (23409)
I0415 22:30:07.204246 1 reflector.go:324] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: watch of *v1.ConfigMap ended with: too old resource version: 20432 (23409)
I0415 22:30:08.218251 1 reflector.go:185] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209
I0415 22:30:08.218662 1 reflector.go:185] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209
I0415 22:30:12.716251 1 httplog.go:90] GET /metrics: (8.337405ms) 200 [Prometheus/2.15.2 10.129.2.8:45338]
I0415 22:30:24.573277 1 observer_polling.go:116] Observed file "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt" has been modified (old="08dd5a33beb4e44e9f1fb9716b3d52be7bf5726facff4ebfd314839133c83785", new="868e0fedc6879d0d0b890956a1ac7b838d3e3eb2007e37d72bce6497ae10ac5e")
W0415 22:30:24.573366 1 builder.go:101] Restart triggered because of file /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt was modified
I0415 22:30:24.573698 1 diskrecorder.go:303] Found files to send: [/var/lib/insights-operator/insights-2020-04-15-221612.tar.gz]
I0415 22:30:24.573757 1 insightsuploader.go:126] Uploading latest report since 0001-01-01T00:00:00Z
I0415 22:30:24.575784 1 periodic.go:99] Shutting down
F0415 22:30:24.575840 1 start.go:80] stopped
```
After 0415 22:15:10.823628 and before 0415 22:15:14.616070 the pod cluster-image-registry-operator was restarted:

```
Apr 15 22:14:49.230371 "Started container f211fb609454a511895d247511fc4b3b6d7a83351d3986eb5cefe2f643240c4a: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator" id=dc008b9c-23ed-46ee-b2ae-f669e339682d
Apr 15 22:15:07.402583 "Started container 848e786b3998fb2cae01d295da1566a897ab8f7a7030d4f83f8602f26c54e7a6: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator-watch" id=7df28035-7360-497c-9968-418dd6f1a489
Apr 15 22:15:14.617896 "Started container ea3ed0f67578f6b57831e43ba31ce675d697b33a80d3cbd46df01b109e1ff0b0: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator" id=f4dcccbf-87a4-4d0f-bec8-4f785c06e0ef
Apr 15 22:15:15.040395 "Started container f00fb2c9269efa5a962d81a08e2f5a7cc15af9ad468c5f638c1dd1e5bd64df10: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator-watch" id=a7adc417-ed76-4be5-9f36-4ffeb78489df
```

I don't understand why it was restarted, or why it took more than 15 seconds to start both containers for the first time.

openshift-controller-manager-operator observed the internal registry hostname at ~22:15:00:

```
2020-04-15T22:15:00.241831929Z +   "dockerPullSecret": map[string]interface{}{
2020-04-15T22:15:00.241831929Z +     "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
2020-04-15T22:15:00.241831929Z +   },
```

The test `[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune completed builds based on the successfulBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]` needed the registry hostname half an hour later:

```
Apr 15 22:46:52.527: INFO: Waiting up to 2 minutes for the internal registry hostname to be published
...
Apr 15 22:48:52.838: INFO: the OCM pod logs indicate the build controller was started after the internal registry hostname has been set in the OCM config
Apr 15 22:48:52.859: INFO: OCM rollout still progressing or in error: True
- lastTransitionTime: "2020-04-15T22:15:14Z"
  message: 'daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3'
  reason: DesiredStateNotYetAchieved
  status: "True"
  type: Progressing
```

So the real problem is that the build tests expect openshift-controller-manager to have Progressing=False, but it doesn't.
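To make the expectation concrete, here is a hedged sketch of the test-side check that keeps reporting "OCM rollout still progressing or in error: True": the openshift-controller-manager clusteroperator's Progressing condition must be False before the build tests proceed. The type and function names are illustrative, not the tests' actual code:

```go
package main

import "fmt"

// condition mirrors the minimal shape of a ClusterOperator status condition
// as seen in the YAML above.
type condition struct {
	Type   string
	Status string
}

// ocmRolloutSettled reports whether the rollout has settled: the Progressing
// condition must exist and have status "False". With the daemonset stuck at
// 2/3 scheduled pods, Progressing stays True and the tests time out.
func ocmRolloutSettled(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" {
			return c.Status == "False"
		}
	}
	// No Progressing condition reported yet: treat the rollout as unsettled.
	return false
}

func main() {
	// The state captured in this run: Available but still Progressing.
	stuck := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True"},
	}
	fmt.Println(ocmRolloutSettled(stuck)) // false: tests keep waiting
}
```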
this is also being seen in 4.4. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2181
and it seems to be specific to GCP on both 4.4 and 4.5. https://search.svc.ci.openshift.org/?search=Timed+out+waiting+for+internal+registry+hostname+to+be+published&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520
Moving this to the Node team - the operator is flapping because one of the OCM daemonset pods is being scheduled/de-scheduled.

```
I0415 22:33:14.580397 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"4c3f5c77-260d-448f-a9db-28363d079d1d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "" to "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3"
I0415 22:33:19.470334 1 httplog.go:90] verb="GET" URI="/metrics" latency=20.080164ms resp=200 UserAgent="Prometheus/2.15.2" srcIP="10.131.0.23:49602":
I0415 22:33:34.115390 1 status_controller.go:172] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-04-15T22:05:33Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-04-15T22:15:14Z","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-04-15T22:16:34Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-04-15T22:05:33Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0415 22:33:34.126438 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"4c3f5c77-260d-448f-a9db-28363d079d1d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3" to ""
```

When trying to fetch the OCM pod logs, there is always at least one pod whose logs cannot be obtained.
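For context on the flapping above, here is a hedged sketch of how the operator derives the Progressing message from the daemonset status: whenever fewer updated pods are scheduled than desired, the condition goes True with this message, and it clears once the counts match. The field meanings follow appsv1.DaemonSetStatus, but this is illustrative, not the operator's actual code:

```go
package main

import "fmt"

// progressingMessage returns the Progressing message for the daemonset, or
// "" when the rollout has converged. A pod being de-scheduled flips the
// message on; rescheduling clears it, producing the ""-to-message-to-""
// flapping seen in the operator events.
func progressingMessage(updatedNumberScheduled, desiredNumberScheduled int32) string {
	if updatedNumberScheduled < desiredNumberScheduled {
		return fmt.Sprintf(
			"daemonset/controller-manager: updated number scheduled is %d, desired number scheduled is %d",
			updatedNumberScheduled, desiredNumberScheduled)
	}
	return ""
}

func main() {
	fmt.Println(progressingMessage(2, 3)) // the message quoted in the events
	fmt.Println(progressingMessage(3, 3) == "")
}
```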
We have fixes attempting to merge for this over on 1823949 . *** This bug has been marked as a duplicate of bug 1823949 ***
I'm reopening this: there's again a number of failures with the same symptom as in the description: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231 And since the bug this was closed as a duplicate of already has its fixes in, I would think it actually was not a duplicate.
I searched the masters-journal from run 2231 above. There was no 'controller-manager: updated number scheduled is ' log line (which Adam mentioned on the 16th). Let me dig deeper.
Using the run Tomas noted, I'm seeing the same symptoms Adam noted in comment 5. From the extended test diagnostics, a bunch of these until the test times out and reports the failure:

```
Apr 23 09:48:18.959: INFO: OCM rollout still progressing or in error: True
```

In the OCM operator logs, repeated:

```
I0423 09:48:04.638402 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"bf261058-d6e2-4b7b-b60a-7343c11fc55d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "" to "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3"
```
Reading the masters-journal, I am not sure how a log line for 9:12 appears after 10:02 (maybe the journal is a concatenation of per-node journals):

```
Apr 23 09:23:28.823816 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 09:23:28.823836 1431 kuberuntime_manager.go:955] getPodContainerStatuses for pod "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" failed: rpc error: code = Unknown desc = container with ID starting with b2a8bf049a5ce3fdd69b4e470f86f586c7789ac3395e491f609f0c219541f314 not found: ID does not exist
Apr 23 10:02:24.404127 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: I0423 10:02:24.404087 1431 kubelet_pods.go:933] Pod "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" is terminated, but some containers are still running
Apr 23 09:12:07.456997 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal hyperkube[1444]: I0423 09:12:07.456698 1444 kubelet.go:1913] SyncLoop (ADD, "api"): "kube-controller-manager-operator-85f5495bdc-pwjql_openshift-kube-controller-manager-operator(6a6fcfea-3273-4adf-8e37-2af9cc5bafff)"
```
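The concatenation hypothesis can be sanity-checked by re-sorting lines on their leading timestamp (note the 09:12 line is from m-1 while the surrounding lines are from m-0). This sketch assumes the "Apr 23 09:23:28.823816 ..." prefix format seen above:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// sortJournalLines stably re-orders journal lines by their leading
// "Mon DD HH:MM:SS.ffffff" timestamp, leaving unparsable lines in place.
func sortJournalLines(lines []string) []string {
	const layout = "Jan _2 15:04:05.000000"
	const prefixLen = len("Apr 23 09:23:28.823816")
	sorted := append([]string(nil), lines...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if len(sorted[i]) < prefixLen || len(sorted[j]) < prefixLen {
			return false
		}
		ti, erri := time.Parse(layout, sorted[i][:prefixLen])
		tj, errj := time.Parse(layout, sorted[j][:prefixLen])
		if erri != nil || errj != nil {
			return false
		}
		return ti.Before(tj)
	})
	return sorted
}

func main() {
	lines := []string{
		"Apr 23 10:02:24.404127 ci-op-p5wtd-m-0 ...",
		"Apr 23 09:12:07.456997 ci-op-p5wtd-m-1 ...",
	}
	fmt.Println(sortJournalLines(lines)[0]) // the 09:12 line sorts first
}
```

If the per-node segments are each internally ordered, a single stable sort like this recovers one cluster-wide timeline.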
related to https://bugzilla.redhat.com/show_bug.cgi?id=1827325 ?
For controller-manager-lvcfq : ``` Apr 23 09:20:08.577153 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 09:20:08.576995 1431 pod_workers.go:191] Error syncing pod 8dd9b9f3-947d-41cf-81cb-27769d49969c ("controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)"), skipping: failed to "CreatePodSandbox" for "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded" Apr 23 10:02:22.404615 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 10:02:22.403265 1431 pod_workers.go:191] Error syncing pod 8dd9b9f3-947d-41cf-81cb-27769d49969c ("controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)"), skipping: rpc error: code = Unknown desc = container with ID starting with b2a8bf049a5ce3fdd69b4e470f86f586c7789ac3395e491f609f0c219541f314 not found: ID does not exist ``` I would think the above symptom was the same as BZ 1827325 Toward the end of masters-journal, there was controller-manager-wmkvh whose last log lines showed: ``` Apr 23 10:01:20.885326 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885169 1441 status_manager.go:434] Ignoring same status for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:22:54 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:32:41 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 
09:32:41 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:22:54 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.6 PodIP:10.129.0.33 PodIPs:[{IP:10.129.0.33}] StartTime:2020-04-23 09:22:54 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:controller-manager State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-23 09:32:40 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.4-2020-04-23-085604@sha256:b4a0e930b47be3ff54f414060d36da12130e1b593888d0e67d34bfc2b31b3a86 ImageID:registry.svc.ci.openshift.org/ocp/4.4-2020-04-23-085604@sha256:b4a0e930b47be3ff54f414060d36da12130e1b593888d0e67d34bfc2b31b3a86 ContainerID:cri-o://a0b169bf84e6b6dca602b1f04c3cd441e65f90661397dc564f3a6ee1563683d2 Started:0xc003466e78}] QOSClass:Burstable EphemeralContainerStatuses:[]} Apr 23 10:01:20.885592 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885558 1441 volume_manager.go:372] Waiting for volumes to attach and mount for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)" Apr 23 10:01:20.885719 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885657 1441 volume_manager.go:403] All volumes are attached and mounted for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)" ```
I found some multus-related log lines:

```
Apr 23 09:14:10.627528 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:14:10.625614934Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-kube-controller-manager/installer-4-ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal]: error setting the networks status: SetNetworkStatus: failed to query the pod installer-4-ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal in out of cluster comm: etcdserver: request timed out"
Apr 23 09:22:20.789187 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:22:20.789117669Z" level=info msg="Got pod network &{Name:controller-manager-rjjcb Namespace:openshift-controller-manager ID:dff29d3dab21db12bb78d98b18e0cabd6e7b856a0bab497775e1033952264ef5 NetNS:/proc/37222/ns/net Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Apr 23 09:22:50.843738 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:22:50.843681494Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-controller-manager/controller-manager-rjjcb]: error adding container to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/37222/ns/net found: failed to Statfs \"/proc/37222/ns/net\": no such file or directory"
```

There was more than one occurrence of '"/proc/37222/ns/net": no such file or directory'. I searched the log for 'IpRanges:[' (note the second log line above); all occurrences were IpRanges:[]. It seems this was not normal for multus.
Hi Ted,

When I look at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231/build-log.txt based on comment 7, I cannot find the word 'Multus'/'multus', so this bug does not appear to be multus-related. The logs you mentioned show that Multus could not reach the etcd server (i.e. the Kubernetes API) to get the pod/net-attach-def information, hence something must have happened before those logs. If the pod cannot be launched due to a timeout, the pod network namespace may be removed, so multus would then report '"/proc/37222/ns/net": no such file or directory'. In other words, the multus failure is caused by an earlier failure against the Kube API.

Currently I cannot find the actual log you mentioned from the URL, so for now I can only understand what happened from the above three log lines: https://search.svc.ci.openshift.org/?search=Apr+23+09%3A22%3A50.843738+ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.interna&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job Could you please point me to the logs you saw for further analysis?

In addition, the second log line, 'Got pod network(snip)IpRanges:[]}}}', comes from ocicni, part of cri-o. The 'IpRanges:[]' is part of CNI's runtimeConfig, which is populated before multus is invoked and is unused (i.e. empty) in the default OpenShift case, so it is not the root cause: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L860
You can find the masters-journal under https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231/artifacts/e2e-gcp/nodes/ which is where my quoted snippets came from. W.r.t. IpRanges:[], I saw it in other CI runs as well - so it was not the cause. Thanks for taking a look.
*** This bug has been marked as a duplicate of bug 1827325 ***
We are tracking this bug over on bug 1827325, where the networking is deadlocked. The fix is to pass down golang contexts that time out after 2 minutes.