Bug 1927244

Summary: UPI installation with Kuryr timing out on bootstrap stage
Product: OpenShift Container Platform Reporter: rlobillo
Component: Networking Assignee: Maysa Macedo <mdemaced>
Networking sub component: kuryr QA Contact: GenadiC <gcheresh>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: mbridges, mdulko
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Kuryr changed the mechanism it uses to detect the OpenStack Subnet used by the cluster's nodes. Kuryr previously relied on the Network of the nodes' Subnet having a specific tag, but that tag was removed for IPI installations, so the Subnet had to be discovered from the OpenShift Machine objects, whose creation is removed in one of the UPI steps. Consequence: Installations with Kuryr SDN timed out at the Bootstrap stage. Fix: The ID of the Neutron Subnet continues to be passed to Kuryr, instead of relying only on Machine objects. Result: Installation with Kuryr on UPI succeeds.
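The Fix described in the Doc Text can be pictured as a lookup order. The following is only a minimal sketch under assumptions: the function name, its parameters, and the preference for the explicitly passed ID over Machine-derived IDs are hypothetical and are not taken from Kuryr's code.

```python
from typing import Iterable, Optional


def resolve_nodes_subnet(configured_subnet_id: Optional[str],
                         machine_subnet_ids: Iterable[str]) -> Optional[str]:
    """Pick the nodes' Neutron Subnet ID (hypothetical helper, not Kuryr's API).

    Assumption: an ID passed explicitly in the configuration wins; IDs
    discovered from Machine objects are only a fallback, because on UPI
    the Machine objects may not exist.
    """
    if configured_subnet_id:
        return configured_subnet_id
    for subnet_id in machine_subnet_ids:
        if subnet_id:
            return subnet_id
    # Nothing configured and nothing discovered: the caller cannot proceed.
    return None


# UPI case: no Machine objects, but the Subnet ID is still passed in.
print(resolve_nodes_subnet("de581745-c45f-4a9c-8ee8-0cec3b8bacdb", []))
```

With neither source available the helper returns None, which corresponds to the situation that left the UPI installation stuck.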
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:43:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1929168, 1931347    
Attachments:
Description Flags
openshift-installer log bundle none
installation and test logs none

Description rlobillo 2021-02-10 11:55:58 UTC
Description of problem:

Following this:
 
https://docs.openshift.com/container-platform/4.7/installing/installing_openstack/installing-openstack-user-kuryr.html#installation-osp-converting-ignition-resources_installing-openstack-user-kuryr

bootstrap-complete command is timing out:

INFO Waiting up to 20m0s for the Kubernetes API at https://api.ostest.shiftstack.com:6443...
INFO API v1.20.0+ba45583 up
INFO Waiting up to 30m0s for bootstrapping to complete...
ERROR Attempted to gather ClusterOperator status after wait failure: listing ClusterOperator objects: Get "https://api.ostest.shiftstack.com:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 10.46.44.166:6443: connect: connection refused
INFO Use the following commands to gather logs from the cluster
INFO openshift-install gather bootstrap --help
FATAL failed to wait for bootstrapping to complete: timed out waiting for the condition

The keepalived VIP has moved to master-2, but no kube-api containers are running there:

$ openstack port list | grep api
| a8cee914-c40d-4578-b781-99634aeb0ce4 | ostest-vmzfj-api-port                                | fa:16:3e:95:74:9d | ip_address='10.196.0.5', subnet_id='de581745-c45f-4a9c-8ee8-0cec3b8bacdb'     | DOWN   |

[core@ostest-vmzfj-master-2 ~]$ ip a | grep inet
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
    inet 10.196.3.104/16 brd 10.196.255.255 scope global dynamic noprefixroute ens3
    inet 10.196.0.5/32 scope global ens3
    inet6 fe80::1b90:80d6:f7e:dced/64 scope link noprefixroute 

[core@ostest-vmzfj-master-2 ~]$ sudo crictl ps 
CONTAINER           IMAGE                                                                                                                    CREATED             STATE               NAME                 ATTEMPT             POD ID
fb6dd5f32ed1f       5af7159d316af17f38072eef0e7745389989017725a8c320cbd168cfaefe070d                                                         2 minutes ago       Running             kuryr-cni            2                   615c95db5f094
b074bef3bc3c8       97c854b8868a24ef3e5a538145ecbecbba48ee6370be09ae164a3a35bef2932d                                                         22 minutes ago      Running             kube-multus          0                   4820cc531e120
02af4286eb352       0a0c7e16e7894a279f968f623f0f31d1280369bb72e29072292f56bf153d3be4                                                         25 minutes ago      Running             haproxy              1                   ca44cfcd9eacb
2dde7cbca8d75       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e123461c26c61423ad0d4b9e12f231f100369aadf3fdd1ba28aba211f4c222df   26 minutes ago      Running             mdns-publisher       0                   75b0f2690833c
9f3c567e64ee2       f513bff2bbca49470048b7f39d65544d8090270061c667bc3e1b3545863aa2c2                                                         26 minutes ago      Running             keepalived-monitor   0                   7191d9c878c03
11f673d4f3d5c       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:38787bc323485664a97880ab37d0d51cdc13d50df8ffd58fa95be8196a16b0d6   26 minutes ago      Running             keepalived           0                   7191d9c878c03
ea2d11b0bbff2       f513bff2bbca49470048b7f39d65544d8090270061c667bc3e1b3545863aa2c2                                                         26 minutes ago      Running             coredns-monitor      0                   3321725141fca
a9f1c88f94768       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8d159c7e01d99c9ccaf26e1d997a5f56830b9e2e7b2799928b3b8663e04903d8   26 minutes ago      Running             coredns              0                   3321725141fca
8337c04ff07e8       f513bff2bbca49470048b7f39d65544d8090270061c667bc3e1b3545863aa2c2                                                         27 minutes ago      Running             haproxy-monitor      0                   ca44cfcd9eacb
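The same check can be scripted. This is a rough sketch, assuming the `crictl ps` column layout shown above (STATE, NAME, ATTEMPT and a single-token POD ID as the last four fields) and assuming the API server container name contains "kube-apiserver"; the helper name is made up for illustration.

```python
def running_containers(crictl_ps_output):
    """Return the names of containers in the Running state from `crictl ps` text."""
    names = set()
    for line in crictl_ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Counting from the right: POD ID, ATTEMPT, NAME, STATE.
        state, name = fields[-4], fields[-3]
        if state == "Running":
            names.add(name)
    return names


# Three rows abridged from the output above (image IDs shortened).
SAMPLE = """\
CONTAINER      IMAGE         CREATED         STATE    NAME        ATTEMPT  POD ID
fb6dd5f32ed1f  5af7159d316a  2 minutes ago   Running  kuryr-cni   2        615c95db5f094
02af4286eb352  0a0c7e16e789  25 minutes ago  Running  haproxy     1        ca44cfcd9eacb
11f673d4f3d5c  38787bc32348  26 minutes ago  Running  keepalived  0        7191d9c878c03
"""

names = running_containers(SAMPLE)
api_running = any("kube-apiserver" in name for name in names)
print(sorted(names), api_running)
```

Fields are read from the right because the CREATED column ("2 minutes ago") spans a variable number of words.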


Version-Release number of selected component (if applicable):

Observed on 4.7.0-0.nightly-2021-02-09-024347

The last successful UPI installation took place with 4.7.0-0.nightly-2021-01-27-110023 (https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp16.1-ocp4.7-upi/4)

Furthermore, the installation is successful if OpenShiftSDN is configured.

How reproducible: Always

Steps to Reproduce: Run the Kuryr CI job:
https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/osasinfra/view/shiftstack_ci/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp16.1-ocp4.7-upi/

Actual results: Installation failure.


Expected results: Successful installation.


Additional info: Attaching sosreport and OCP installation logs

Comment 1 rlobillo 2021-02-10 13:11:08 UTC
sos-report: http://rhos-release.virt.bos.redhat.com/log/bz1927244/

Comment 2 rlobillo 2021-02-10 13:29:34 UTC
Created attachment 1756207 [details]
openshift-installer log bundle

Comment 4 rlobillo 2021-02-22 08:44:20 UTC
Verified on OCP4.8.0-0.nightly-2021-02-21-102854 over OSP13 (2021-01-20.1) with Amphora provider.

OCP installation with UPI succeeded:

time="2021-02-21T13:54:03-05:00" level=debug msg="Cluster is initialized"
time="2021-02-21T13:54:03-05:00" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."
time="2021-02-21T13:54:03-05:00" level=debug msg="Route found in openshift-console namespace: console"
time="2021-02-21T13:54:03-05:00" level=debug msg="OpenShift console route is admitted"
time="2021-02-21T13:54:03-05:00" level=info msg="Install complete!"
time="2021-02-21T13:54:03-05:00" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/cloud-user/ostest/auth/kubeconfig'"
time="2021-02-21T13:54:03-05:00" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ostest.shiftstack.com"
time="2021-02-21T13:54:03-05:00" level=info msg="Login to the console with user: \"kubeadmin\", and password: \"fYnBX-8rtrM-KeteY-fhoDT\""
time="2021-02-21T13:54:03-05:00" level=debug msg="Time elapsed per stage:"
time="2021-02-21T13:54:03-05:00" level=debug msg="Cluster Operators: 17m8s"
time="2021-02-21T13:54:03-05:00" level=info msg="Time elapsed: 17m8s"


Bootstrapping stage was performed successfully:

time="2021-02-21T13:13:22-05:00" level=info msg="API v1.20.0+01ab7fd up"
time="2021-02-21T13:13:22-05:00" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2021-02-21T13:29:08-05:00" level=debug msg="Bootstrap status: complete"
time="2021-02-21T13:29:08-05:00" level=info msg="It is now safe to remove the bootstrap resources"
time="2021-02-21T13:29:08-05:00" level=debug msg="Time elapsed per stage:"
time="2021-02-21T13:29:08-05:00" level=debug msg="Bootstrap Complete: 16m40s"
time="2021-02-21T13:29:08-05:00" level=debug msg="               API: 54s"
time="2021-02-21T13:29:08-05:00" level=info msg="Time elapsed: 16m40s"


Tempest tests were executed successfully: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp13-ocp4.7-upi/7//artifact/tempest-results/tempest-results-kuryr.1.html

NP tests were executed successfully: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp13-ocp4.7-upi/7//artifact/np_test_results/np_kubetest.html#a7c8b2ea-dafb-435a-ae63-ea6c5c596374

"NetworkPolicy_between_server_and_client_should_enforce_policy_based_on_PodSelector_and_NamespaceSelector_[Feature:NetworkPolicy-07]" had to be re-executed, and it passed on the second run.

Comment 5 rlobillo 2021-02-22 08:57:48 UTC
Verified on OCP4.8.0-0.nightly-2021-02-21-102854 over OSP13 (2021-01-20.1) with Amphora provider.

OCP installation with UPI succeeded:

time="2021-02-21T13:54:03-05:00" level=debug msg="Cluster is initialized"
time="2021-02-21T13:54:03-05:00" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."
time="2021-02-21T13:54:03-05:00" level=debug msg="Route found in openshift-console namespace: console"
time="2021-02-21T13:54:03-05:00" level=debug msg="OpenShift console route is admitted"
time="2021-02-21T13:54:03-05:00" level=info msg="Install complete!"
time="2021-02-21T13:54:03-05:00" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/cloud-user/ostest/auth/kubeconfig'"
time="2021-02-21T13:54:03-05:00" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ostest.shiftstack.com"
time="2021-02-21T13:54:03-05:00" level=info msg="Login to the console with user: \"kubeadmin\", and password: \"fYnBX-8rtrM-KeteY-fhoDT\""
time="2021-02-21T13:54:03-05:00" level=debug msg="Time elapsed per stage:"
time="2021-02-21T13:54:03-05:00" level=debug msg="Cluster Operators: 17m8s"
time="2021-02-21T13:54:03-05:00" level=info msg="Time elapsed: 17m8s"


Bootstrapping stage was performed successfully:

time="2021-02-21T13:13:22-05:00" level=info msg="API v1.20.0+01ab7fd up"
time="2021-02-21T13:13:22-05:00" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2021-02-21T13:29:08-05:00" level=debug msg="Bootstrap status: complete"
time="2021-02-21T13:29:08-05:00" level=info msg="It is now safe to remove the bootstrap resources"
time="2021-02-21T13:29:08-05:00" level=debug msg="Time elapsed per stage:"
time="2021-02-21T13:29:08-05:00" level=debug msg="Bootstrap Complete: 16m40s"
time="2021-02-21T13:29:08-05:00" level=debug msg="               API: 54s"
time="2021-02-21T13:29:08-05:00" level=info msg="Time elapsed: 16m40s"
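The figures in this excerpt can be cross-checked from the timestamps. A small sketch, assuming that the "Bootstrap Complete" total is the API wait plus the time between the "API ... up" line and the "Bootstrap status: complete" line; that reading is an interpretation of the log, not something the installer documents here.

```python
from datetime import datetime, timedelta


def fmt(delta):
    """Format a timedelta the way the installer log does, e.g. 16m40s."""
    total = int(delta.total_seconds())
    return f"{total // 60}m{total % 60}s"


# Timestamps copied from the log lines above.
api_up = datetime.fromisoformat("2021-02-21T13:13:22-05:00")
bootstrap_complete = datetime.fromisoformat("2021-02-21T13:29:08-05:00")

bootstrap_wait = bootstrap_complete - api_up
api_wait = timedelta(seconds=54)  # "API: 54s" from the log

print(fmt(bootstrap_wait))             # 15m46s
print(fmt(bootstrap_wait + api_wait))  # 16m40s
```

The sum matches the reported 16m40s, well inside the 30m0s bootstrap timeout that the failing run exceeded.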


Tempest tests passed: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp13-ocp4.7-upi/7//artifact/tempest-results/tempest-results-kuryr.1.html

NP tests passed: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp13-ocp4.7-upi/7//artifact/np_test_results/np_kubetest.html#a7c8b2ea-dafb-435a-ae63-ea6c5c596374 (*)

  (*) "NetworkPolicy_between_server_and_client_should_enforce_policy_based_on_PodSelector_and_NamespaceSelector_[Feature:NetworkPolicy-07]" failed on the first attempt but passed on the second. Logs attached.

Conformance tests passed: https://rhos-ci-staging-jenkins.lab.eng.tlv2.redhat.com/job/DFG-osasinfra-shiftstack_ci-ocp_verification-osp13-ocp4.7-upi/7//artifact/conformance-test-results/conformance_ocp-tests.html (**)

  (**) [sig-scheduling]_SchedulerPredicates_[Serial]_validates_resource_limits_of_pods_that_are_allowed_to_run and [sig-api-machinery]_AdmissionWebhook_[Privileged:ClusterAdmin]_should_mutate_pod_and_apply_defaults_after_mutation failed on the first attempt but passed on the second execution. Logs attached.

Installation and test logs attached.

Comment 6 rlobillo 2021-02-22 08:58:56 UTC
Created attachment 1758574 [details]
installation and test logs

Comment 9 errata-xmlrpc 2021-07-27 22:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438