Bug 1725259
| Summary: | [ci] [azure] Managed cluster should should expose cluster services outside the cluster | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abhinav Dahiya <adahiya> |
| Component: | Networking | Assignee: | Miciah Dashiel Butler Masters <mmasters> |
| Networking sub component: | router | QA Contact: | Hongan Li <hongli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aos-bugs, dmace, wking |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:32:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Abhinav Dahiya
2019-06-28 21:25:44 UTC
The following error from the logs:

```
curl -X GET -s -S -o /tmp/body -D /tmp/headers "https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com" -w '{"code":%{http_code}}' -k 2>/tmp/error 1>/tmp/output || rc=$?
echo "{\"test\":1,\"rc\":$(echo $rc),\"curl\":$(cat /tmp/output),\"error\":$(cat /tmp/error | json_escape),\"body\":\"$(cat /tmp/body | base64 -w 0 -)\",\"headers\":$(cat /tmp/headers | json_escape)}"'

Jun 28 20:48:54.596: INFO: stderr: "+ set -euo pipefail\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\ncat: /tmp/body: No such file or directory\n++ base64 -w 0 -\n++ cat /tmp/headers\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\n++ base64 -w 0 -\ncat: /tmp/body: No such file or directory\n++ json_escape\n++ cat /tmp/headers\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n"
Jun 28 20:48:54.597: INFO: stdout: "{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n"
```

This indicates that the router is refusing connections?

IMO we don't have enough data to consider this a blocker:

https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=failed%3A+.*Managed+cluster+should+should+expose+cluster+services+outside+the+cluster&maxAge=48h&context=2&type=all&name=pull-ci-.*-azure

There is no testgrid data for Azure jobs yet.

We have testgrid now to help find out:

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=
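For reference, a minimal sketch of reproducing the same probe by hand. The hostnames are taken from the failing job above; the 30-second connect timeout is an assumption and is not part of the original test script:

```bash
#!/usr/bin/env bash
# Hedged reproduction sketch: probe the same routes the e2e test hits.
# Substitute your own cluster's *.apps domain; the timeout is an assumption.
set -euo pipefail

for host in \
  console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com \
  prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com; do
  # -k skips certificate verification, matching the test; exit code 7 here
  # corresponds to the "Failed connect ... Connection timed out" errors above.
  curl -k -s -S -o /dev/null --connect-timeout 30 \
    -w "${host}: HTTP %{http_code}\n" "https://${host}" \
    || echo "${host}: curl exited with rc=$?"
done
```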
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52

New error for this case:

```
fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
    <*errors.errorString | 0xc0002733f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```

(In reply to Chuan Yu from comment #4)
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52
>
> New error for this case:
> fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
>     <*errors.errorString | 0xc0002733f0>: {
>         s: "timed out waiting for the condition",
>     }
>     timed out waiting for the condition
> occurred

The Azure public routes are currently broken; see https://bugzilla.redhat.com/show_bug.cgi?id=1743728. I would like to keep this bug open to track the flake rate, leaving the current breakage aside. Also, it would be nice if the ingress-operator reported that the DNS records were not actually created.

> Also it would be nice if ingress-operator provided information that the DNS records were actually not created.
The ingresscontroller's conditions indicate the failure:

```
- lastTransitionTime: 2019-08-19T19:00:47Z
  message: 'The record failed to provision in some zones: [{/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/ci.azure.devcluster.openshift.com
    map[]}]'
  reason: FailedZones
  status: "False"
  type: DNSReady
```
However, the clusteroperator does not:

```
{
    "lastTransitionTime": "2019-08-19T19:00:21Z",
    "reason": "NoIngressControllersDegraded",
    "status": "False",
    "type": "Degraded"
},
// ...
{
    "lastTransitionTime": "2019-08-19T19:04:40Z",
    "message": "desired and current number of IngressControllers are equal",
    "status": "True",
    "type": "Available"
}
```
We should set the clusteroperator's "Available" condition to false when an ingresscontroller has a "DNSReady" condition that is false.
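To observe this mismatch on a live cluster, a rough sketch using oc. It assumes the default ingresscontroller in the openshift-ingress-operator namespace and the ingress clusteroperator; the jsonpath expressions are illustrative and not taken from this bug:

```bash
#!/usr/bin/env bash
# Hedged sketch: compare the ingresscontroller's DNSReady condition with the
# ingress clusteroperator's Available condition.
set -euo pipefail

# DNSReady on the ingresscontroller (reported False/FailedZones in this bug).
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{range .status.conditions[?(@.type=="DNSReady")]}{.type}={.status} ({.reason}){"\n"}{end}'

# Available on the clusteroperator (still True despite the DNS failure).
oc get clusteroperator ingress \
  -o jsonpath='{range .status.conditions[?(@.type=="Available")]}{.type}={.status}{"\n"}{end}'
```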
So far I have determined that traffic to the Azure LB from the host network of a worker node that is not running an ingress-controller pod is dropped. The clusters that CI deploys have 3 worker nodes and ingress-controller pods on 2 of them, and the execpod uses host networking, so if it happens to be scheduled on the odd node, its requests get dropped, which causes the test failures. I am still investigating why the traffic to the LB is dropped.

We believe this was solved by https://github.com/openshift/origin/pull/23688. Will open new bugs as necessary.

Checked testgrid and the test passed in a recent CI job, so moving to VERIFIED.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
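As a rough aid to the host-network analysis above, a sketch of how one might map router pods to worker nodes and spot the worker without a router pod. The openshift-ingress namespace is the standard router namespace; the rest is a generic oc query, not taken from this bug:

```bash
#!/usr/bin/env bash
# Hedged sketch: list where the router (ingress-controller) pods run so they
# can be compared with the node a host-network test pod lands on. With 3
# workers and 2 router replicas, one worker carries no router pod; per the
# analysis above, LB traffic from that node's host network was being dropped.
set -euo pipefail

echo "Router pods and their nodes:"
oc -n openshift-ingress get pods -o wide

echo "Worker nodes:"
oc get nodes -l node-role.kubernetes.io/worker -o name
```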