Bug 1874914
| Summary: | operator install ingress | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pawel Krupa <pkrupa> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | amcdermo, aos-bugs, bparees, kgarriso, wking |
| Version: | 4.6 | Keywords: | Reopened |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: |
operator install ingress
|
|
| Last Closed: | 2020-09-22 14:55:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Pawel Krupa
2020-09-02 14:57:29 UTC
Reassigning to network edge team after sdn team investigation showed that network layer is working (however incorrectly reporting as degraded). (In reply to Pawel Krupa from comment #1) > Reassigning to network edge team after sdn team investigation showed that > network layer is working (however incorrectly reporting as degraded). Per a slack conversation (https://coreos.slack.com/archives/CBN38N3MW/p1599057596438400): https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1301147570361339904/artifacts/e2e-vsphere/gather-extra/nodes.json shows there are no worker nodes, which prevented the router deployment from scheduling pods. @Pawel, did your investigations lead elsewhere in terms of reassigning this? Closing as there were no machines for the router pods to run on. (Comment #2). Last time I was involved in a "missing worker nodes" thread, installer+MCO were pointed to. Sending to MCO. Just recording this here from the Slack conversation, conclusion seems to have been MAO -> AWS API problem paulfantom 1 day ago Seems like MAO cannot contact external API and schedule worker nodes because of network issue: E0902 14:22:06.006230 1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp 52.95.20.2:443: i/o timeout" "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ci-op-88nqgzcz-f516a-rdlqw-worker-us-east-2a-8gcxs"} As per Scott's comment above moving this over to MAO so they can take a look >Reassigning to network edge team after sdn team investigation showed that network layer is working (however incorrectly reporting as degraded). @Pawel Is there a bug to track this operator reports degraded incorrectly? >Just recording this here from the Slack conversation, conclusion seems to have been MAO -> AWS API problem How could this be the conclusion if the Bug is reporting this on vSphere? Checking logs in the link in the BZ description: E0902 13:39:24.611990 1 controller.go:279] ci-op-88nqgzcz-0aec4-sp29m-master-2: error updating machine: admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 60: diskGiB is below minimum value (120) This should be fixed by [1] Looking at other periodic jobs in the search link: 1 - release-openshift-ocp-installer-e2e-gcp-rt-4.4 [2]: - All machines are "Provisioned" phase but they fail to become nodes [3]. - MCO pool reports no workers [4]. Something must be preventing the machine config daemon from running on the node. This can be ignition, CSRs or systemd failures. - The machine approver is not authorising CSRs [5] "CSR csr-kfpxh not authorized: failed to find machine for node ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-" Looking at the machines manifests [6] "address": "ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-ci.internal". So it seems the CSR is truncating the nodeName resulting in a miss match with the address stored in the machine resource. This should be addressed in >=4.5 by https://github.com/openshift/cluster-api-provider-gcp/pull/88 and https://github.com/openshift/machine-config-operator/pull/1711 2 - periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy [7]: The machine API can't reach AWS and requests timeout. This is likely to be because the infra pre created for the test is miss configured and does not let the cloud requests to succeed. 3 - release-openshift-ocp-installer-e2e-azure-ovn-4.6 [8]: Operators have not egress traffic. See [9] "time="2020-09-04T07:33:35Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.30:52694->172.30.0.10:53: read: connection refused'" controller=secretannotator" 4 - release-openshift-ocp-installer-e2e-azure-serial-4.6 Operators have not egress traffic. See [10] time="2020-09-04T07:49:42Z" level=error msg="error while creating service principal" controller=secretannotator error="adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: i/o timeout'" time="2020-09-04T07:49:42Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: i/o timeout'" controller=secretannotator time="2020-09-04T07:49:42Z" level=info msg="syncing cluster operator status" controller=secretannotator_status time="2020-09-04T07:49:42Z" level=info msg="cluster operator status updated" controller=secretannotator_status time="2020-09-04T07:49:47Z" level=info msg="validating cloud cred secret" controller=secretannotator I'm closing this as 4.5 is fixed in >=4.5. I'm creating https://bugzilla.redhat.com/show_bug.cgi?id=1875773 to track 2. I'm creating https://bugzilla.redhat.com/show_bug.cgi?id=1875774 to track 3 and 4. [1] https://github.com/openshift/release/pull/11534 [2] https://prow.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4 [3] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/pods/openshift-machine-api_machine-api-controllers-5d8c99b9dc-5vhxf_machine-controller.log [4] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/machineconfigpools.json [5] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/pods/openshift-cluster-machine-approver_machine-approver-6cf86d9544-lk58q_machine-approver-controller.log [6] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/machines.json [7] https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1301773744699609088/artifacts/e2e-aws-proxy/gather-extra/pods/openshift-machine-api_machine-api-controllers-64f45bf95d-rq6jw_machine-controller.log [8] https://prow.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.6 [9] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.6/1301773738949218304/artifacts/e2e-azure/pods/openshift-cloud-credential-operator_cloud-credential-operator-57c5dfd7-q4wph_cloud-credential-operator.log [10] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.6/1301773746377330688/artifacts/e2e-azure-serial/pods/openshift-cloud-credential-operator_cloud-credential-operator-57c5dfd7-nf5b8_cloud-credential-operator.log closing as per https://bugzilla.redhat.com/show_bug.cgi?id=1874914#c8 *** This bug has been marked as a duplicate of bug 1875773 *** closing as per https://bugzilla.redhat.com/show_bug.cgi?id=1874914#c8 |