Bug 1851540

Summary: openstack CI jobs fail bootstrap with degraded operators
Product: OpenShift Container Platform
Reporter: Jacob Tanenbaum <jtanenba>
Component: Installer
Installer sub component: OpenShift on OpenStack
Assignee: Mike Fedosin <mfedosin>
QA Contact: David Sanz <dsanzmor>
Docs Contact:
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aconstan, dmellado, pprinett, vrutkovs
Version: 4.6
Keywords: UpcomingSprint
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:09:46 UTC
Type: Bug

Description Jacob Tanenbaum 2020-06-26 21:24:01 UTC
Description of problem:
When running an OpenShift CI job for ovn-kubernetes on OpenStack, bootstrap fails with the following build log:

 level=debug msg="Still waiting for the Kubernetes API: Get https://api.5dklyvvb-57db1.shiftstack.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 128.31.24.246:6443: connect: connection refused"
level=info msg="API v1.18.3 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.5dklyvvb-57db1.shiftstack.devcluster.openshift.com: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=error msg="Cluster operator etcd Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 21m21.218939514s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-5dklyvvb-57db1-znvf5-master-0_openshift-etcd_69b740db-5c05-4b1a-b206-2752e4c4836f_0(1bfcd436e404ede9cbffde5ec95abfe255971410e86a73b6ec2be4d83c9977c6): Multus: [openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '\nStaticPodsDegraded: pods \"etcd-5dklyvvb-57db1-znvf5-master-0\" not found\nStaticPodsDegraded: pods \"etcd-5dklyvvb-57db1-znvf5-master-2\" not found"
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=error msg="Cluster operator kube-apiserver Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: StaticPodsDegraded: pods \"kube-apiserver-5dklyvvb-57db1-znvf5-master-0\" not found\nInstallerPodContainerWaitingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 19m18.928707598s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-5dklyvvb-57db1-znvf5-master-0_openshift-kube-apiserver_c42f0077-99aa-4ee1-9ff3-975f580485b3_0(dda711d79e9644b879ad158ba3fe4cbbc20e23eea57735bf031bc47e364f00b2): Multus: [openshift-kube-apiserver/installer-2-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-apiserver/installer-2-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 2; 0 nodes have achieved new revision 3"
level=error msg="Cluster operator kube-controller-manager Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending\nStaticPodsDegraded: pods \"kube-controller-manager-5dklyvvb-57db1-znvf5-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-5dklyvvb-57db1-znvf5-master-2\" not found\nInstallerPodContainerWaitingDegraded: Pod \"installer-3-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 20m24.958580335s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-3-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-3-5dklyvvb-57db1-znvf5-master-0_openshift-kube-controller-manager_bb3261b9-c77a-43dd-b9de-183016ab3ed9_0(78ba35a0f2b59ddd550baecad58e86d355e4e4a23cc58956649d17734819a958): Multus: [openshift-kube-controller-manager/installer-3-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-controller-manager/installer-3-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 6"
level=info msg="Cluster operator kube-controller-manager Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 6"
level=error msg="Cluster operator kube-scheduler Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: \nInstallerPodContainerWaitingDegraded: Pod \"installer-4-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 18m8.345090966s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-4-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-5dklyvvb-57db1-znvf5-master-0_openshift-kube-scheduler_2648de19-66e2-4c42-861c-c0380ab9952b_0(fcf26e64af606f876a502d2b07f7f3c13d02f623320736beb435de9884b97149): Multus: [openshift-kube-scheduler/installer-4-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-scheduler/installer-4-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 4; 0 nodes have achieved new revision 6"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator machine-config Progressing is True with : Unable to apply 0.0.1-2020-06-25-124534"
level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 0.0.1-2020-06-25-124534: openshift-config-managed/kube-cloud-config configmap is required on platform OpenStack but not found: configmap \"kube-cloud-config\" not found"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 0.0.1-2020-06-25-124534"
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingClusterMonitoringOperatorFailed: Failed to rollout the stack. Error: running task Updating Cluster Monitoring Operator failed: reconciling Cluster Monitoring Operator ServiceMonitor failed: creating ServiceMonitor object failed: the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready"
level=error msg="Cluster operator openshift-apiserver Degraded is True with APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver"
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : "
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1"
level=debug msg="Fetching Bootstrap SSH Key Pair..."
level=debug msg="Loading Bootstrap SSH Key Pair..."
level=debug msg="Using Bootstrap SSH Key Pair loaded from state file"
level=debug msg="Gathering master journals ..."
level=debug msg="Gathering master containers ..."
level=debug msg="Waiting for logs ..."
level=debug msg="Log bundle written to /var/home/core/log-bundle-20200625132610.tar.gz"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20200625132610.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"
2020/06/25 13:27:07 Container setup in pod e2e-openstack failed, exit code 1, reason Error
2020/06/25 13:35:09 Copied 63.33MB of artifacts from e2e-openstack to /logs/artifacts/e2e-openstack
2020/06/25 13:35:09 Releasing lease for "openstack-quota-slice"
2020/06/25 13:35:09 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/06/25 13:35:11 Ran for 51m48s
error: some steps failed:
  * could not run steps: step e2e-openstack failed: template pod "e2e-openstack" failed: the pod ci-op-5dklyvvb/e2e-openstack failed after 48m45s (failed containers: setup): ContainerFailed one or more containers exited
Container setup exited with code 1, reason Error
---
2020/06/25 13:35:11 could not load result reporting options: failed to read file "": open : no such file or directory
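
For reference, the gathered log bundle can be inspected locally along these lines (the filename is the one from the installer output above; the grep pattern matches the CNI error reported by the degraded operators):

# Unpack the bootstrap log bundle and search it for the CNI annotation timeouts
mkdir -p log-bundle && tar -xzf log-bundle-20200625132610.tar.gz -C log-bundle
grep -r "failed to get pod annotation" log-bundle/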




These errors are observed in the ovnkube-master logs:
E0625 13:04:08.030786       1 leaderelection.go:331] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: etcdserver: request timed out
I0625 13:04:31.118452       1 pods.go:225] [openshift-kube-scheduler/revision-pruner-3-5dklyvvb-57db1-znvf5-master-0] addLogicalPort took 15.030661881s
E0625 13:04:31.118489       1 ovn.go:412] error while creating logical port openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 stdout: "", stderr: "2020-06-25T13:04:31Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n" (OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist lsp-add 5dklyvvb-57db1-znvf5-master-0 openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 -- lsp-set-addresses openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 0a:58:0a:80:00:08 10.128.0.8 -- set logical_switch_port openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 external-ids:namespace=openshift-kube-scheduler external-ids:pod=true -- lsp-set-port-security openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 0a:58:0a:80:00:08 10.128.0.8' failed: signal: alarm clock)
I0625 13:04:34.278272       1 pods.go:225] [openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0] addLogicalPort took 15.023252311s
E0625 13:04:34.278324       1 ovn.go:412] failed to get pod addresses for pod openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 on node: 5dklyvvb-57db1-znvf5-master-0, err: Error while obtaining dynamic addresses for openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0: stdout: "", stderr: "2020-06-25T13:04:34Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --if-exists get logical_switch_port openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 dynamic_addresses addresses' failed: signal: alarm clock
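
The failing lookup can be repeated by hand against the OVN northbound database for debugging; the label selector and container name below are assumptions, and the port name is copied from the error above:

# Find the ovnkube-master pods (label is an assumption; adjust for the cluster)
oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master

# Re-run the lookup that died with signal 14 (alarm clock) in the log above;
# replace <ovnkube-master-pod> with one of the pod names from the previous command
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c ovnkube-master -- \
  ovn-nbctl --timeout=15 --if-exists get logical_switch_port \
  openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 dynamic_addresses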

Version-Release number of selected component (if applicable):
4.6 

How reproducible:
All the time with CI testing.

Actual results:
Bootstrap fails with degraded cluster operators and the CI job exits before the e2e tests run.

Expected results:
Bootstrap completes and the e2e tests run.

Additional info:
Link to a CI run:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/9801/rehearse-9801-pull-ci-openshift-cluster-network-operator-master-e2e-openstack/1276133859636809728/

Comment 2 Ricardo Carrillo Cruz 2020-08-03 11:48:13 UTC
Unassigning as I'm on a long vacation.

Comment 3 Alexander Constantinescu 2020-08-03 15:03:19 UTC
Hi Daniel,

I've assigned this to you; could you please have a look at CI and check what is going on with this job on OpenStack specifically?

Thanks, 
Alex

Comment 4 Daniel Mellado 2020-08-04 08:53:38 UTC
So, overall, this 'alarm clock' failure signal seems to come from ovsdb timing out. In any case, fetching the ports through ovn-nbctl no longer matches current ovnkube behavior, since we now talk to the database directly. I'll trigger a new run to see if it still fails.
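
A rough way to tell whether the NB ovsdb is merely slow rather than wedged is to time a trivial read from inside the ovnkube-master container with a timeout larger than the 15s ovnkube uses; this is only an illustrative check, not the eventual fix:

# If this also takes on the order of the 15s default timeout, the NB DB is slow
# across the board rather than just for the one port in the log above
time ovn-nbctl --timeout=60 list NB_Global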

Comment 13 David Sanz 2020-09-03 08:38:49 UTC
Verified on 4.6.0-0.nightly-2020-09-02-210353

Comment 15 errata-xmlrpc 2020-10-27 16:09:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196