Description of problem: Many tests are failing on GCP jobs because the internal registry hostname is not published: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-blocking#release-openshift-origin-installer-e2e-gcp-4.5

Sample job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1435

Sample output:

```
[sig-builds][Feature:Builds][valueFrom] process valueFrom in build strategy environment variables should fail resolving unresolvable valueFrom in docker build environment variable references [Suite:openshift/conformance/parallel]
fail [github.com/openshift/origin/test/extended/builds/valuefrom.go:42]: Unexpected error:
    <*errors.errorString | 0xc0014f26c0>: {
        s: "Timed out waiting for internal registry hostname to be published",
    }
    Timed out waiting for internal registry hostname to be published
occurred
```

How reproducible: frequently failing on GCP; does not appear to be happening on AWS.
Note that this could also be an issue in the openshift controller manager itself picking up the published internal hostname. I see at least some indication the OCM is crashlooping in that run:

```
[sig-arch] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]
fail [github.com/openshift/origin/test/extended/operators/cluster.go:114]: Expected <[]string | len:2, cap:2>: [
```

Pod openshift-controller-manager/controller-manager-74652 is not healthy:

```
I0415 22:14:49.423486 1 controller_manager.go:39] Starting controllers on 0.0.0.0:8443 (unknown)
I0415 22:14:49.426100 1 controller_manager.go:50] DeploymentConfig controller using images from "registry.svc.ci.openshift.org/ocp/4.5-2020-04-15-215108@sha256:0564a1ebc3c08432fb92c0d0c7ec2788ec7081e2455385f56d7e16278e3e4c8e"
I0415 22:14:49.426130 1 controller_manager.go:56] Build controller using images from "registry.svc.ci.openshift.org/ocp/4.5-2020-04-15-215108@sha256:a20fbee183c84d16ca8df9664499f43e4097dda6e65dba39244efbbcd26d6d82"
I0415 22:14:49.426266 1 standalone_apiserver.go:98] Started health checks at 0.0.0.0:8443
I0415 22:14:49.426342 1 leaderelection.go:242] attempting to acquire leader lease openshift-controller-manager/openshift-master-controllers...
```

Pod openshift-insights/insights-operator-5fb99cf8-cbqnx is not healthy:

```
:298] The operator is healthy
I0415 22:30:06.892659 1 reflector.go:418] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Watch close - *v1.ConfigMap total 0 items received
I0415 22:30:06.898937 1 reflector.go:418] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: Watch close - *v1.ConfigMap total 0 items received
I0415 22:30:07.204060 1 reflector.go:324] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: watch of *v1.ConfigMap ended with: too old resource version: 20430 (23409)
I0415 22:30:07.204246 1 reflector.go:324] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209: watch of *v1.ConfigMap ended with: too old resource version: 20432 (23409)
I0415 22:30:08.218251 1 reflector.go:185] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209
I0415 22:30:08.218662 1 reflector.go:185] Listing and watching *v1.ConfigMap from k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:209
I0415 22:30:12.716251 1 httplog.go:90] GET /metrics: (8.337405ms) 200 [Prometheus/2.15.2 10.129.2.8:45338]
I0415 22:30:24.573277 1 observer_polling.go:116] Observed file "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt" has been modified (old="08dd5a33beb4e44e9f1fb9716b3d52be7bf5726facff4ebfd314839133c83785", new="868e0fedc6879d0d0b890956a1ac7b838d3e3eb2007e37d72bce6497ae10ac5e")
W0415 22:30:24.573366 1 builder.go:101] Restart triggered because of file /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt was modified
I0415 22:30:24.573698 1 diskrecorder.go:303] Found files to send: [/var/lib/insights-operator/insights-2020-04-15-221612.tar.gz]
I0415 22:30:24.573757 1 insightsuploader.go:126] Uploading latest report since 0001-01-01T00:00:00Z
I0415 22:30:24.575784 1 periodic.go:99] Shutting down
F0415 22:30:24.575840 1 start.go:80] stopped
```
After 0415 22:15:10.823628 and before 0415 22:15:14.616070 the pod cluster-image-registry-operator was restarted:

```
Apr 15 22:14:49.230371 "Started container f211fb609454a511895d247511fc4b3b6d7a83351d3986eb5cefe2f643240c4a: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator" id=dc008b9c-23ed-46ee-b2ae-f669e339682d
Apr 15 22:15:07.402583 "Started container 848e786b3998fb2cae01d295da1566a897ab8f7a7030d4f83f8602f26c54e7a6: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator-watch" id=7df28035-7360-497c-9968-418dd6f1a489
Apr 15 22:15:14.617896 "Started container ea3ed0f67578f6b57831e43ba31ce675d697b33a80d3cbd46df01b109e1ff0b0: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator" id=f4dcccbf-87a4-4d0f-bec8-4f785c06e0ef
Apr 15 22:15:15.040395 "Started container f00fb2c9269efa5a962d81a08e2f5a7cc15af9ad468c5f638c1dd1e5bd64df10: openshift-image-registry/cluster-image-registry-operator-647d769f7f-nmp7k/cluster-image-registry-operator-watch" id=a7adc417-ed76-4be5-9f36-4ffeb78489df
```

I don't understand why it was restarted, or why it took more than 15 seconds to start both containers for the first time.

openshift-controller-manager-operator observed the internal registry hostname at ~22:15:00:

```
2020-04-15T22:15:00.241831929Z +   "dockerPullSecret": map[string]interface{}{
2020-04-15T22:15:00.241831929Z +     "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
2020-04-15T22:15:00.241831929Z +   },
```

The test `[sig-builds][Feature:Builds] prune builds based on settings in the buildconfig should prune completed builds based on the successfulBuildsHistoryLimit setting [Suite:openshift/conformance/parallel]` needed the registry hostname half an hour later:

```
Apr 15 22:46:52.527: INFO: Waiting up to 2 minutes for the internal registry hostname to be published
...
Apr 15 22:48:52.838: INFO: the OCM pod logs indicate the build controller was started after the internal registry hostname has been set in the OCM config
Apr 15 22:48:52.859: INFO: OCM rollout still progressing or in error: True
- lastTransitionTime: "2020-04-15T22:15:14Z"
  message: 'daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3'
  reason: DesiredStateNotYetAchieved
  status: "True"
  type: Progressing
```

So the real problem is that the build tests expect openshift-controller-manager to have Progressing=False, but it doesn't.
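To make the expectation concrete, here is a hedged sketch of the test-side check that keeps reporting "OCM rollout still progressing or in error: True": the openshift-controller-manager clusteroperator's Progressing condition must be False before the build tests proceed. The type and function names are illustrative, not the tests' actual code:

```go
package main

import "fmt"

// condition mirrors the minimal shape of a ClusterOperator status condition
// as seen in the YAML above.
type condition struct {
	Type   string
	Status string
}

// ocmRolloutSettled reports whether the rollout has settled: the Progressing
// condition must exist and have status "False". With the daemonset stuck at
// 2/3 scheduled pods, Progressing stays True and the tests time out.
func ocmRolloutSettled(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" {
			return c.Status == "False"
		}
	}
	// No Progressing condition reported yet: treat the rollout as unsettled.
	return false
}

func main() {
	// The state captured in this run: Available but still Progressing.
	stuck := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True"},
	}
	fmt.Println(ocmRolloutSettled(stuck)) // false: tests keep waiting
}
```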
this is also being seen in 4.4. https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2181
and it seems to be specific to GCP on both 4.4 and 4.5. https://search.svc.ci.openshift.org/?search=Timed+out+waiting+for+internal+registry+hostname+to+be+published&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520
Moving this to the Node team - the operator is flapping because one of the OCM daemonset pods is being scheduled/de-scheduled.

```
I0415 22:33:14.580397 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"4c3f5c77-260d-448f-a9db-28363d079d1d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "" to "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3"
I0415 22:33:19.470334 1 httplog.go:90] verb="GET" URI="/metrics" latency=20.080164ms resp=200 UserAgent="Prometheus/2.15.2" srcIP="10.131.0.23:49602":
I0415 22:33:34.115390 1 status_controller.go:172] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-04-15T22:05:33Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-04-15T22:15:14Z","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-04-15T22:16:34Z","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-04-15T22:05:33Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
I0415 22:33:34.126438 1 event.go:278] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"4c3f5c77-260d-448f-a9db-28363d079d1d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3" to ""
```

When trying to fetch the OCM pod logs, there is always at least one pod whose logs cannot be obtained.
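For context on the flapping above, here is a hedged sketch of how the operator derives the Progressing message from the daemonset status: whenever fewer updated pods are scheduled than desired, the condition goes True with this message, and it clears once the counts match. The field meanings follow appsv1.DaemonSetStatus, but this is illustrative, not the operator's actual code:

```go
package main

import "fmt"

// progressingMessage returns the Progressing message for the daemonset, or
// "" when the rollout has converged. A pod being de-scheduled flips the
// message on; rescheduling clears it, producing the ""-to-message-to-""
// flapping seen in the operator events.
func progressingMessage(updatedNumberScheduled, desiredNumberScheduled int32) string {
	if updatedNumberScheduled < desiredNumberScheduled {
		return fmt.Sprintf(
			"daemonset/controller-manager: updated number scheduled is %d, desired number scheduled is %d",
			updatedNumberScheduled, desiredNumberScheduled)
	}
	return ""
}

func main() {
	fmt.Println(progressingMessage(2, 3)) // the message quoted in the events
	fmt.Println(progressingMessage(3, 3) == "")
}
```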
We have fixes attempting to merge for this over on 1823949 . *** This bug has been marked as a duplicate of bug 1823949 ***
I'm reopening this: there's again a number of failures with the same symptom as in the description: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231 And since the bug this was closed as a duplicate of already has its fixes in, I would think it actually was not a duplicate.
I searched the masters-journal from run 2231 above. There was no 'controller-manager: updated number scheduled is ' log line (which Adam mentioned on the 16th). Let me dig deeper.
Using the run Tomas noted, I'm seeing the same symptoms Adam noted in comment 5. From the extended test diagnostics, a bunch of these until the test times out and reports the failure:

```
Apr 23 09:48:18.959: INFO: OCM rollout still progressing or in error: True
```

In the OCM operator logs, repeated:

```
I0423 09:48:04.638402 1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"bf261058-d6e2-4b7b-b60a-7343c11fc55d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/openshift-controller-manager changed: Progressing message changed from "" to "Progressing: daemonset/controller-manager: updated number scheduled is 2, desired number scheduled is 3"
```
Reading the masters-journal, I am not sure how a log line for 9:12 appears after 10:02 (maybe the journal is a concatenation of per-node journals):

```
Apr 23 09:23:28.823816 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 09:23:28.823836 1431 kuberuntime_manager.go:955] getPodContainerStatuses for pod "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" failed: rpc error: code = Unknown desc = container with ID starting with b2a8bf049a5ce3fdd69b4e470f86f586c7789ac3395e491f609f0c219541f314 not found: ID does not exist
Apr 23 10:02:24.404127 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: I0423 10:02:24.404087 1431 kubelet_pods.go:933] Pod "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" is terminated, but some containers are still running
Apr 23 09:12:07.456997 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal hyperkube[1444]: I0423 09:12:07.456698 1444 kubelet.go:1913] SyncLoop (ADD, "api"): "kube-controller-manager-operator-85f5495bdc-pwjql_openshift-kube-controller-manager-operator(6a6fcfea-3273-4adf-8e37-2af9cc5bafff)"
```
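The concatenation hypothesis can be sanity-checked by re-sorting lines on their leading timestamp (note the 09:12 line is from m-1 while the surrounding lines are from m-0). This sketch assumes the "Apr 23 09:23:28.823816 ..." prefix format seen above:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// sortJournalLines stably re-orders journal lines by their leading
// "Mon DD HH:MM:SS.ffffff" timestamp, leaving unparsable lines in place.
func sortJournalLines(lines []string) []string {
	const layout = "Jan _2 15:04:05.000000"
	const prefixLen = len("Apr 23 09:23:28.823816")
	sorted := append([]string(nil), lines...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if len(sorted[i]) < prefixLen || len(sorted[j]) < prefixLen {
			return false
		}
		ti, erri := time.Parse(layout, sorted[i][:prefixLen])
		tj, errj := time.Parse(layout, sorted[j][:prefixLen])
		if erri != nil || errj != nil {
			return false
		}
		return ti.Before(tj)
	})
	return sorted
}

func main() {
	lines := []string{
		"Apr 23 10:02:24.404127 ci-op-p5wtd-m-0 ...",
		"Apr 23 09:12:07.456997 ci-op-p5wtd-m-1 ...",
	}
	fmt.Println(sortJournalLines(lines)[0]) // the 09:12 line sorts first
}
```

If the per-node segments are each internally ordered, a single stable sort like this recovers one cluster-wide timeline.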
related to https://bugzilla.redhat.com/show_bug.cgi?id=1827325 ?
For controller-manager-lvcfq : ``` Apr 23 09:20:08.577153 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 09:20:08.576995 1431 pod_workers.go:191] Error syncing pod 8dd9b9f3-947d-41cf-81cb-27769d49969c ("controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)"), skipping: failed to "CreatePodSandbox" for "controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded" Apr 23 10:02:22.404615 ci-op-p5wtd-m-0.c.openshift-gce-devel-ci.internal hyperkube[1431]: E0423 10:02:22.403265 1431 pod_workers.go:191] Error syncing pod 8dd9b9f3-947d-41cf-81cb-27769d49969c ("controller-manager-lvcfq_openshift-controller-manager(8dd9b9f3-947d-41cf-81cb-27769d49969c)"), skipping: rpc error: code = Unknown desc = container with ID starting with b2a8bf049a5ce3fdd69b4e470f86f586c7789ac3395e491f609f0c219541f314 not found: ID does not exist ``` I would think the above symptom was the same as BZ 1827325 Toward the end of masters-journal, there was controller-manager-wmkvh whose last log lines showed: ``` Apr 23 10:01:20.885326 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885169 1441 status_manager.go:434] Ignoring same status for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:22:54 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:32:41 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 
09:32:41 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 09:22:54 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.0.6 PodIP:10.129.0.33 PodIPs:[{IP:10.129.0.33}] StartTime:2020-04-23 09:22:54 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:controller-manager State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-04-23 09:32:40 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.svc.ci.openshift.org/ocp/4.4-2020-04-23-085604@sha256:b4a0e930b47be3ff54f414060d36da12130e1b593888d0e67d34bfc2b31b3a86 ImageID:registry.svc.ci.openshift.org/ocp/4.4-2020-04-23-085604@sha256:b4a0e930b47be3ff54f414060d36da12130e1b593888d0e67d34bfc2b31b3a86 ContainerID:cri-o://a0b169bf84e6b6dca602b1f04c3cd441e65f90661397dc564f3a6ee1563683d2 Started:0xc003466e78}] QOSClass:Burstable EphemeralContainerStatuses:[]} Apr 23 10:01:20.885592 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885558 1441 volume_manager.go:372] Waiting for volumes to attach and mount for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)" Apr 23 10:01:20.885719 ci-op-p5wtd-m-2.c.openshift-gce-devel-ci.internal hyperkube[1441]: I0423 10:01:20.885657 1441 volume_manager.go:403] All volumes are attached and mounted for pod "controller-manager-wmkvh_openshift-controller-manager(34902f3e-bac1-4aff-9b4c-6be4e413cc55)" ```
I found some multus-related log lines:

```
Apr 23 09:14:10.627528 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:14:10.625614934Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-kube-controller-manager/installer-4-ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal]: error setting the networks status: SetNetworkStatus: failed to query the pod installer-4-ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal in out of cluster comm: etcdserver: request timed out"
Apr 23 09:22:20.789187 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:22:20.789117669Z" level=info msg="Got pod network &{Name:controller-manager-rjjcb Namespace:openshift-controller-manager ID:dff29d3dab21db12bb78d98b18e0cabd6e7b856a0bab497775e1033952264ef5 NetNS:/proc/37222/ns/net Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Apr 23 09:22:50.843738 ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.internal crio[1372]: time="2020-04-23 09:22:50.843681494Z" level=error msg="Error while adding pod to CNI network \"multus-cni-network\": Multus: [openshift-controller-manager/controller-manager-rjjcb]: error adding container to network \"openshift-sdn\": delegateAdd: cannot set \"openshift-sdn\" interface name to \"eth0\": validateIfName: no net namespace /proc/37222/ns/net found: failed to Statfs \"/proc/37222/ns/net\": no such file or directory"
```

There was more than one occurrence of '"/proc/37222/ns/net": no such file or directory'. I searched the log for 'IpRanges:[' (note the second log line above); all occurrences were IpRanges:[]. It seems this was not normal for multus.
Hi Ted,

When I look at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231/build-log.txt based on comment 7, I cannot find the word 'Multus'/'multus', so this bug does not appear to be multus-related. The logs you mentioned show that Multus could not reach the etcd server (i.e. the Kubernetes API) to get the pod/net-attach-def information, hence something must have happened before those logs. If the pod cannot be launched due to a timeout, the pod network namespace may be removed, so multus would then report '"/proc/37222/ns/net": no such file or directory'. In other words, the multus failure is caused by an earlier failure against the Kube API.

Currently I cannot find the actual log you mentioned from the URL, so for now I can only understand what happened from the above three log lines: https://search.svc.ci.openshift.org/?search=Apr+23+09%3A22%3A50.843738+ci-op-p5wtd-m-1.c.openshift-gce-devel-ci.interna&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job Could you please point me to the logs you saw for further analysis?

In addition, the second log line, 'Got pod network(snip)IpRanges:[]}}}', comes from ocicni, part of cri-o. The 'IpRanges:[]' is part of CNI's runtimeConfig, which is populated before multus is invoked and is unused (i.e. empty) in the default OpenShift case, so it is not the root cause: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L860
You can find the masters-journal under https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/2231/artifacts/e2e-gcp/nodes/ which is where my quoted snippets came from. W.r.t. IpRanges:[], I saw it in other CI runs as well - so it was not the cause. Thanks for taking a look.
*** This bug has been marked as a duplicate of bug 1827325 ***
We are tracking this bug over on bug 1827325, where the networking is deadlocked. The fix is to pass down golang contexts that time out after 2 minutes.