Bug 1851540 - openstack CI jobs fail bootstrap with degraded operators
Summary: openstack CI jobs fail bootstrap with degraded operators
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Mike Fedosin
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-26 21:24 UTC by jtanenba
Modified: 2020-09-03 08:38 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System: Github
ID: openshift machine-config-operator pull 2031
Priority: None
Status: closed
Summary: Bug 1851540: OpenStack: set hostnames on early stages of machine provisioning
Last Updated: 2020-09-21 08:54:31 UTC

Description jtanenba 2020-06-26 21:24:01 UTC
Description of problem:
When running an OpenShift CI job for ovn-kubernetes on OpenStack, bootstrap fails with the following build log:

 level=debug msg="Still waiting for the Kubernetes API: Get https://api.5dklyvvb-57db1.shiftstack.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 128.31.24.246:6443: connect: connection refused"
level=info msg="API v1.18.3 up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingEndpoints::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No endpoints found for oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.5dklyvvb-57db1.shiftstack.devcluster.openshift.com: []"
level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
level=info msg="Cluster operator authentication Available is Unknown with NoData: "
level=error msg="Cluster operator etcd Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: InstallerPodContainerWaitingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 21m21.218939514s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-5dklyvvb-57db1-znvf5-master-0_openshift-etcd_69b740db-5c05-4b1a-b206-2752e4c4836f_0(1bfcd436e404ede9cbffde5ec95abfe255971410e86a73b6ec2be4d83c9977c6): Multus: [openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '\nStaticPodsDegraded: pods \"etcd-5dklyvvb-57db1-znvf5-master-0\" not found\nStaticPodsDegraded: pods \"etcd-5dklyvvb-57db1-znvf5-master-2\" not found"
level=info msg="Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 0; 1 nodes are at revision 2"
level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available."
level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=error msg="Cluster operator kube-apiserver Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::StaticPods_Error: StaticPodsDegraded: pods \"kube-apiserver-5dklyvvb-57db1-znvf5-master-0\" not found\nInstallerPodContainerWaitingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 19m18.928707598s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-2-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-2-5dklyvvb-57db1-znvf5-master-0_openshift-kube-apiserver_c42f0077-99aa-4ee1-9ff3-975f580485b3_0(dda711d79e9644b879ad158ba3fe4cbbc20e23eea57735bf031bc47e364f00b2): Multus: [openshift-kube-apiserver/installer-2-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-apiserver/installer-2-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 2; 0 nodes have achieved new revision 3"
level=error msg="Cluster operator kube-controller-manager Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::NodeInstaller_InstallerPodFailed::StaticPods_Error: NodeInstallerDegraded: 1 nodes are failing on revision 2:\nNodeInstallerDegraded: static pod of revision 2 has been installed, but is not ready while new revision 3 is pending\nStaticPodsDegraded: pods \"kube-controller-manager-5dklyvvb-57db1-znvf5-master-0\" not found\nStaticPodsDegraded: pods \"kube-controller-manager-5dklyvvb-57db1-znvf5-master-2\" not found\nInstallerPodContainerWaitingDegraded: Pod \"installer-3-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 20m24.958580335s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-3-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-3-5dklyvvb-57db1-znvf5-master-0_openshift-kube-controller-manager_bb3261b9-c77a-43dd-b9de-183016ab3ed9_0(78ba35a0f2b59ddd550baecad58e86d355e4e4a23cc58956649d17734819a958): Multus: [openshift-kube-controller-manager/installer-3-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-controller-manager/installer-3-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 6"
level=info msg="Cluster operator kube-controller-manager Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 6"
level=error msg="Cluster operator kube-scheduler Degraded is True with InstallerPodContainerWaiting_ContainerCreating::InstallerPodNetworking_FailedCreatePodSandBox::NodeInstaller_InstallerPodFailed: NodeInstallerDegraded: 1 nodes are failing on revision 3:\nNodeInstallerDegraded: \nInstallerPodContainerWaitingDegraded: Pod \"installer-4-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" container \"installer\" is waiting for 18m8.345090966s because \"\"\nInstallerPodNetworkingDegraded: Pod \"installer-4-5dklyvvb-57db1-znvf5-master-0\" on node \"5dklyvvb-57db1-znvf5-master-0\" observed degraded networking: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-4-5dklyvvb-57db1-znvf5-master-0_openshift-kube-scheduler_2648de19-66e2-4c42-861c-c0380ab9952b_0(fcf26e64af606f876a502d2b07f7f3c13d02f623320736beb435de9884b97149): Multus: [openshift-kube-scheduler/installer-4-5dklyvvb-57db1-znvf5-master-0]: error adding container to network \"ovn-kubernetes\": delegateAdd: error invoking confAdd - \"ovn-k8s-cni-overlay\": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-kube-scheduler/installer-4-5dklyvvb-57db1-znvf5-master-0] failed to get pod annotation: timed out waiting for the condition\nInstallerPodNetworkingDegraded: '"
level=info msg="Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 4; 0 nodes have achieved new revision 6"
level=info msg="Cluster operator kube-storage-version-migrator Available is False with _NoMigratorPod: Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available"
level=info msg="Cluster operator machine-config Progressing is True with : Unable to apply 0.0.1-2020-06-25-124534"
level=error msg="Cluster operator machine-config Degraded is True with RenderConfigFailed: Unable to apply 0.0.1-2020-06-25-124534: openshift-config-managed/kube-cloud-config configmap is required on platform OpenStack but not found: configmap \"kube-cloud-config\" not found"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 0.0.1-2020-06-25-124534"
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingClusterMonitoringOperatorFailed: Failed to rollout the stack. Error: running task Updating Cluster Monitoring Operator failed: reconciling Cluster Monitoring Operator ServiceMonitor failed: creating ServiceMonitor object failed: the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)"
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator network Progressing is True with Deploying: DaemonSet \"openshift-multus/network-metrics-daemon\" is waiting for other operators to become ready"
level=error msg="Cluster operator openshift-apiserver Degraded is True with APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver"
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Available is False with : "
level=info msg="Cluster operator operator-lifecycle-manager-packageserver Progressing is True with : Working toward 0.15.1"
level=debug msg="Fetching Bootstrap SSH Key Pair..."
level=debug msg="Loading Bootstrap SSH Key Pair..."
level=debug msg="Using Bootstrap SSH Key Pair loaded from state file"
level=debug msg="Gathering master journals ..."
level=debug msg="Gathering master containers ..."
level=debug msg="Waiting for logs ..."
level=debug msg="Log bundle written to /var/home/core/log-bundle-20200625132610.tar.gz"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20200625132610.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"
2020/06/25 13:27:07 Container setup in pod e2e-openstack failed, exit code 1, reason Error
2020/06/25 13:35:09 Copied 63.33MB of artifacts from e2e-openstack to /logs/artifacts/e2e-openstack
2020/06/25 13:35:09 Releasing lease for "openstack-quota-slice"
2020/06/25 13:35:09 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/06/25 13:35:11 Ran for 51m48s
error: some steps failed:
  * could not run steps: step e2e-openstack failed: template pod "e2e-openstack" failed: the pod ci-op-5dklyvvb/e2e-openstack failed after 48m45s (failed containers: setup): ContainerFailed one or more containers exited
Container setup exited with code 1, reason Error
---
13 to installer's internal agent"
level=debug msg="Gathering master journals ..."
level=debug msg="Gathering master containers ..."
level=debug msg="Waiting for logs ..."
level=debug msg="Log bundle written to /var/home/core/log-bundle-20200625132610.tar.gz"
level=info msg="Bootstrap gather logs captured here \"/tmp/artifacts/installer/log-bundle-20200625132610.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"
---
2020/06/25 13:35:11 could not load result reporting options: failed to read file "": open : no such file or directory
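
Two of the failure modes in the log above can be checked directly once the API is reachable: the machine-config operator reports that the openshift-config-managed/kube-cloud-config configmap is required on OpenStack but missing, and the installer pods fail CNI setup because ovn-kubernetes never writes their network annotation. A rough way to confirm both (assuming oc access with the installer's kubeconfig; the annotation name is the one ovn-kubernetes normally uses, and the pod name is copied from the log):

  # is the cloud config the MCO render step requires actually present?
  oc --kubeconfig auth/kubeconfig -n openshift-config-managed get configmap kube-cloud-config

  # has ovn-kubernetes annotated the stuck installer pod with its network assignment?
  # (an empty result matches the "failed to get pod annotation" CNI errors above)
  oc --kubeconfig auth/kubeconfig -n openshift-etcd get pod \
      installer-2-5dklyvvb-57db1-znvf5-master-0 \
      -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}'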




These errors are observed in the ovnkube-master logs:
E0625 13:04:08.030786       1 leaderelection.go:331] error retrieving resource lock openshift-ovn-kubernetes/ovn-kubernetes-master: etcdserver: request timed out
I0625 13:04:31.118452       1 pods.go:225] [openshift-kube-scheduler/revision-pruner-3-5dklyvvb-57db1-znvf5-master-0] addLogicalPort took 15.030661881s
E0625 13:04:31.118489       1 ovn.go:412] error while creating logical port openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 stdout: "", stderr: "2020-06-25T13:04:31Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n" (OVN command '/usr/bin/ovn-nbctl --timeout=15 --may-exist lsp-add 5dklyvvb-57db1-znvf5-master-0 openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 -- lsp-set-addresses openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 0a:58:0a:80:00:08 10.128.0.8 -- set logical_switch_port openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 external-ids:namespace=openshift-kube-scheduler external-ids:pod=true -- lsp-set-port-security openshift-kube-scheduler_revision-pruner-3-5dklyvvb-57db1-znvf5-master-0 0a:58:0a:80:00:08 10.128.0.8' failed: signal: alarm clock)
I0625 13:04:34.278272       1 pods.go:225] [openshift-etcd/installer-2-5dklyvvb-57db1-znvf5-master-0] addLogicalPort took 15.023252311s
E0625 13:04:34.278324       1 ovn.go:412] failed to get pod addresses for pod openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 on node: 5dklyvvb-57db1-znvf5-master-0, err: Error while obtaining dynamic addresses for openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0: stdout: "", stderr: "2020-06-25T13:04:34Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovn-nbctl --timeout=15 --if-exists get logical_switch_port openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 dynamic_addresses addresses' failed: signal: alarm clock
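
The "terminating with signal 14 (Alarm clock)" lines are ovn-nbctl's own client-side timeout: with --timeout=15 it arms an alarm for itself and aborts if the northbound database has not answered within 15 seconds, which the caller then reports as "failed: signal: alarm clock". A minimal illustration of that failure mode (the invocation is copied from the log and is for diagnosis on a master node only):

  # dies with SIGALRM ("Alarm clock") if the NB DB does not respond within 15s
  ovn-nbctl --timeout=15 --if-exists get logical_switch_port \
      openshift-etcd_installer-2-5dklyvvb-57db1-znvf5-master-0 dynamic_addresses
  echo $?   # typically 142 (128 + SIGALRM) when the timeout fires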

Version-Release number of selected component (if applicable):
4.6 

How reproducible:
Every time in CI testing.

Actual results:
Bootstrap fails with degraded operators and the CI job aborts before the e2e tests run.

Expected results:
Bootstrap completes and the e2e tests run.


Additional info:
Link to a CI run:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/9801/rehearse-9801-pull-ci-openshift-cluster-network-operator-master-e2e-openstack/1276133859636809728/

Comment 2 Ricardo Carrillo Cruz 2020-08-03 11:48:13 UTC
Unassigning as I'm on long vacation.

Comment 3 Alexander Constantinescu 2020-08-03 15:03:19 UTC
Hi Daniel

I've assigned this to you. Could you please have a look at CI and check what is going on with this job on OpenStack specifically?

Thanks, 
Alex

Comment 4 Daniel Mellado 2020-08-04 08:53:38 UTC
So, overall, this 'alarm clock' failure signal seems to come from ovsdb timing out. In any case, the way the ports are being fetched, via ovn-nbctl, no longer matches ovnkube's behavior, since we now talk to the database directly. I'll trigger a new run of this to see whether it still fails.
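
For context, the distinction drawn here, shelling out to ovn-nbctl with a 15-second alarm versus talking to the OVN northbound database directly over the OVSDB protocol, can be sketched roughly as follows (addresses and port names are placeholders, not values from this job; ovnkube itself uses an in-process OVSDB client library rather than ovsdb-client):

  # exec'd client: aborts with SIGALRM after 15s if the NB DB is slow
  ovn-nbctl --timeout=15 --if-exists get logical_switch_port <port-name> dynamic_addresses

  # direct OVSDB query against the NB DB (illustrative; default NB port is 6641)
  ovsdb-client dump tcp:<nb-db-ip>:6641 OVN_Northbound Logical_Switch_Port name dynamic_addresses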

Comment 13 David Sanz 2020-09-03 08:38:49 UTC
Verified on 4.6.0-0.nightly-2020-09-02-210353

