Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1874914

Summary:	operator install ingress
Product:	OpenShift Container Platform	Reporter:	Pawel Krupa <pkrupa>
Component:	Cloud Compute	Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers	QA Contact:	sunzhaohua <zhsun>
Status:	CLOSED DUPLICATE	Docs Contact:
Severity:	high
Priority:	unspecified	CC:	amcdermo, aos-bugs, bparees, kgarriso, wking
Version:	4.6	Keywords:	Reopened
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:	operator install ingress
Last Closed:	2020-09-22 14:55:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Pawel Krupa 2020-09-02 14:57:29 UTC

test:
operator install ingress 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=operator+install+ingress


https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1301147570361339904

On vSphere jobs, the network operator seems to hang and reports only:

  Operator unavailable (Startup): The network is starting up

This in turn causes cluster installation to fail.

Comment 1 Pawel Krupa 2020-09-02 15:11:45 UTC

Reassigning to the network edge team after the SDN team's investigation showed that the network layer is working (although it incorrectly reports as degraded).

Comment 2 Andrew McDermott 2020-09-02 17:02:02 UTC

(In reply to Pawel Krupa from comment #1)
> Reassigning to network edge team after sdn team investigation showed that
> network layer is working (however incorrectly reporting as degraded).

Per a Slack conversation (https://coreos.slack.com/archives/CBN38N3MW/p1599057596438400):
 
https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-vsphere/1301147570361339904/artifacts/e2e-vsphere/gather-extra/nodes.json shows there are no worker nodes, which prevented the router deployment from scheduling pods.
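A quick way to check a gathered nodes.json for this condition is to count nodes by role. The following is a minimal sketch, assuming the file is a standard Kubernetes NodeList and that workers carry the node-role.kubernetes.io/worker label; the helper name and the sample data are hypothetical and are not taken from the CI run.

```python
import json

# Assumption: worker nodes are identified by this role label.
WORKER_LABEL = "node-role.kubernetes.io/worker"


def count_workers(node_list: dict) -> int:
    """Count items in a NodeList-style dict that carry the worker role label."""
    return sum(
        1
        for node in node_list.get("items", [])
        if WORKER_LABEL in (node.get("metadata", {}).get("labels") or {})
    )


# Hypothetical sample shaped like a NodeList that contains masters only.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "master-0", "labels": {"node-role.kubernetes.io/master": ""}}},
  {"metadata": {"name": "master-1", "labels": {"node-role.kubernetes.io/master": ""}}}
]}
""")

# A count of zero means the router deployment has no worker to schedule onto.
print(count_workers(sample))
```

To run it against a real artifact, load the downloaded file with json.load() and pass the result to count_workers().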

@Pawel, did your investigations lead elsewhere in terms of reassigning this?

Comment 3 Andrew McDermott 2020-09-03 15:53:22 UTC

Closing as there were no machines for the router pods to run on. (Comment #2).

Comment 4 Ben Parees 2020-09-03 18:23:25 UTC

Last time I was involved in a "missing worker nodes" thread, installer+MCO were pointed to.

Sending to MCO.

Comment 6 Scott Dodson 2020-09-03 18:46:11 UTC

Just recording this here from the Slack conversation: the conclusion seems to have been a MAO -> AWS API problem.

paulfantom  1 day ago
Seems like MAO cannot contact external API and schedule worker nodes because of network issue:
E0902 14:22:06.006230       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp 52.95.20.2:443: i/o timeout"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ci-op-88nqgzcz-f516a-rdlqw-worker-us-east-2a-8gcxs"}

Comment 7 Kirsten Garrison 2020-09-03 18:48:51 UTC

As per Scott's comment above, moving this over to MAO so they can take a look.

Comment 8 Alberto 2020-09-04 10:44:11 UTC

>Reassigning to network edge team after sdn team investigation showed that network layer is working (however incorrectly reporting as degraded).

@Pawel, is there a bug to track the operator incorrectly reporting degraded?

>Just recording this here from the Slack conversation, conclusion seems to have been MAO -> AWS API problem

How could this be the conclusion if the Bug is reporting this on vSphere?

Checking logs in the link in the BZ description:

E0902 13:39:24.611990       1 controller.go:279] ci-op-88nqgzcz-0aec4-sp29m-master-2: error updating machine: admission webhook "validation.machine.machine.openshift.io" denied the request: providerSpec.diskGiB: Invalid value: 60: diskGiB is below minimum value (120)

This should be fixed by [1]
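For illustration, the rejection above amounts to a lower-bound check on providerSpec.diskGiB. This is a minimal sketch that only mirrors the wording of the log message; the function name is hypothetical and this is not the webhook's actual implementation.

```python
from typing import List

MIN_DISK_GIB = 120  # minimum value quoted in the error message


def validate_disk_gib(disk_gib: int, minimum: int = MIN_DISK_GIB) -> List[str]:
    """Return validation errors for diskGiB; an empty list means accepted."""
    if disk_gib < minimum:
        return [
            f"providerSpec.diskGiB: Invalid value: {disk_gib}: "
            f"diskGiB is below minimum value ({minimum})"
        ]
    return []


# 60 is the value from the failing vSphere job; 120 is the smallest accepted value.
print(validate_disk_gib(60))
print(validate_disk_gib(120))
```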

Looking at other periodic jobs in the search link:
1 - release-openshift-ocp-installer-e2e-gcp-rt-4.4 [2]:

- All machines are "Provisioned" phase but they fail to become nodes [3].
- MCO pool reports no workers [4]. Something must be preventing the machine config daemon from running on the node. This can be ignition, CSRs or systemd failures.
- The machine approver is not authorising CSRs [5]
"CSR csr-kfpxh not authorized: failed to find machine for node ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-"
Looking at the machines manifests [6] "address": "ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-ci.internal".

So it seems the CSR is truncating the nodeName, resulting in a mismatch with the address stored in the machine resource.
This should be addressed in >=4.5 by https://github.com/openshift/cluster-api-provider-gcp/pull/88 and https://github.com/openshift/machine-config-operator/pull/1711
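The mismatch can be reproduced with the two strings quoted above. This is a minimal sketch, assuming the approver looks up a machine by exact comparison of the CSR node name against the machine's recorded addresses; the lookup helper and the machine key are hypothetical and do not reflect the approver's real code.

```python
from typing import Dict, List, Optional

# Node name as reported in the machine-approver log [5].
csr_node_name = "ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-"
# Address as stored in the machine manifest [6].
machine_address = "ci-op-9bfj80yl-c4901-stgcz-worker-c-x25rd.c.openshift-gce-devel-ci.internal"


def find_machine(node_name: str, addresses: Dict[str, List[str]]) -> Optional[str]:
    """Return the machine whose addresses contain node_name exactly, else None."""
    for machine, addrs in addresses.items():
        if node_name in addrs:
            return machine
    return None


# Hypothetical index of machine name -> recorded addresses.
machines = {"worker-c-x25rd": [machine_address]}

print(find_machine(csr_node_name, machines))        # truncated name: no match
print(find_machine(machine_address, machines))      # full name: match
print(machine_address.startswith(csr_node_name))    # the CSR name is a prefix of the address
```

The last line shows that the CSR name is a strict prefix of the stored address, which is consistent with truncation rather than with a different host.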

2 - periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy [7]:
The machine API can't reach AWS and requests time out. This is likely because the infrastructure pre-created for the test is misconfigured and does not let the cloud requests succeed.

3 - release-openshift-ocp-installer-e2e-azure-ovn-4.6 [8]:
Operators have no egress traffic. See [9]

"time="2020-09-04T07:33:35Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: lookup login.microsoftonline.com on 172.30.0.10:53: read udp 10.128.0.30:52694->172.30.0.10:53: read: connection refused'" controller=secretannotator"

4 - release-openshift-ocp-installer-e2e-azure-serial-4.6
Operators have no egress traffic. See [10]

time="2020-09-04T07:49:42Z" level=error msg="error while creating service principal" controller=secretannotator error="adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: i/o timeout'"
time="2020-09-04T07:49:42Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: i/o timeout'" controller=secretannotator
time="2020-09-04T07:49:42Z" level=info msg="syncing cluster operator status" controller=secretannotator_status
time="2020-09-04T07:49:42Z" level=info msg="cluster operator status updated" controller=secretannotator_status
time="2020-09-04T07:49:47Z" level=info msg="validating cloud cred secret" controller=secretannotator 
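When scanning logs like these, it can help to split each line into fields and keep only the error-level entries. This is a minimal sketch, assuming the lines follow the key=value text layout shown above; the helper name is hypothetical, and shlex is used only as a convenient tokenizer for quoted values.

```python
import shlex
from typing import Dict


def parse_line(line: str) -> Dict[str, str]:
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Two lines copied from the cloud-credential-operator log above.
LOG = r"""
time="2020-09-04T07:49:42Z" level=error msg="error while validating cloud credentials: failed checking create cloud creds: adal: Failed to execute the refresh request. Error = 'Post \"https://login.microsoftonline.com/6047c7e9-b2ad-488d-a54e-dc3f6be6a7ee/oauth2/token?api-version=1.0\": dial tcp: i/o timeout'" controller=secretannotator
time="2020-09-04T07:49:42Z" level=info msg="syncing cluster operator status" controller=secretannotator_status
"""

entries = [parse_line(line) for line in LOG.strip().splitlines()]
errors = [e for e in entries if e.get("level") == "error"]

for entry in errors:
    # Flag entries whose message points at an outbound connection timeout.
    print(entry["controller"], "i/o timeout" in entry["msg"])
```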

I'm closing this as 1 is fixed in >=4.5.

I'm creating https://bugzilla.redhat.com/show_bug.cgi?id=1875773 to track 2.

I'm creating https://bugzilla.redhat.com/show_bug.cgi?id=1875774 to track 3 and 4.


[1] https://github.com/openshift/release/pull/11534

[2] https://prow.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4

[3] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/pods/openshift-machine-api_machine-api-controllers-5d8c99b9dc-5vhxf_machine-controller.log

[4] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/machineconfigpools.json

[5] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/pods/openshift-cluster-machine-approver_machine-approver-6cf86d9544-lk58q_machine-approver-controller.log

[6] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.4/1301785600797446144/artifacts/e2e-gcp/machines.json

[7] https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1301773744699609088/artifacts/e2e-aws-proxy/gather-extra/pods/openshift-machine-api_machine-api-controllers-64f45bf95d-rq6jw_machine-controller.log

[8] https://prow.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.6

[9] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.6/1301773738949218304/artifacts/e2e-azure/pods/openshift-cloud-credential-operator_cloud-credential-operator-57c5dfd7-q4wph_cloud-credential-operator.log

[10] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.6/1301773746377330688/artifacts/e2e-azure-serial/pods/openshift-cloud-credential-operator_cloud-credential-operator-57c5dfd7-nf5b8_cloud-credential-operator.log

Comment 10 Alberto 2020-09-22 14:55:01 UTC

closing as per https://bugzilla.redhat.com/show_bug.cgi?id=1874914#c8

*** This bug has been marked as a duplicate of bug 1875773 ***

Comment 11 Alberto 2020-09-22 14:55:39 UTC

closing as per https://bugzilla.redhat.com/show_bug.cgi?id=1874914#c8