The "[Feature:Platform] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]" e2e test is failing heavily on Azure. Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23301/pull-ci-openshift-origin-master-e2e-azure/19 This is blocking Azure CI from going green.
The following error from the logs:

```
curl -X GET -s -S -o /tmp/body -D /tmp/headers "https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com" -w '{"code":%{http_code}}' -k 2>/tmp/error 1>/tmp/output || rc=$?
echo "{\"test\":1,\"rc\":$(echo $rc),\"curl\":$(cat /tmp/output),\"error\":$(cat /tmp/error | json_escape),\"body\":\"$(cat /tmp/body | base64 -w 0 -)\",\"headers\":$(cat /tmp/headers | json_escape)}"'

Jun 28 20:48:54.596: INFO: stderr: "+ set -euo pipefail\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\ncat: /tmp/body: No such file or directory\n++ base64 -w 0 -\n++ cat /tmp/headers\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\n++ base64 -w 0 -\ncat: /tmp/body: No such file or directory\n++ json_escape\n++ cat /tmp/headers\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n"

Jun 28 20:48:54.597: INFO: stdout: "{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n"
```

Both requests time out with curl exit code 7; does this indicate that the router is refusing connections?
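For anyone trying to reproduce this outside the test harness, the probe boils down to something like the following (a simplified sketch of what the test's exec pod runs; the route hostname is the one from this particular CI run and will differ per cluster):

```
# Probe the console route through the router's Azure load balancer, the same way the test does.
rc=0
curl -X GET -s -S -k -o /tmp/body -D /tmp/headers \
  -w '{"code":%{http_code}}' \
  "https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com" \
  2>/tmp/error 1>/tmp/output || rc=$?
# curl exit code 7 ("Failed connect") means no TCP connection to the LB was ever established.
echo "rc=$rc"; cat /tmp/error
```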
IMO we don't have enough data to consider this a blocker. https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=failed%3A+.*Managed+cluster+should+should+expose+cluster+services+outside+the+cluster&maxAge=48h&context=2&type=all&name=pull-ci-.*-azure There is no testgrid data for Azure jobs yet.
We now have testgrid to help find out: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52

New error for this case:

```
fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
    <*errors.errorString | 0xc0002733f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```
(In reply to Chuan Yu from comment #4)
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52
>
> New error for this case:
> fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
> <*errors.errorString | 0xc0002733f0>: { s: "timed out waiting for the condition", }
> timed out waiting for the condition occurred

Currently the Azure public routes are broken; see https://bugzilla.redhat.com/show_bug.cgi?id=1743728. I would like to keep this bug open to track the flake rate, leaving the current breakage aside. Also, it would be nice if the ingress-operator reported that the DNS records were actually not created.
> Also it would be nice if ingress-operator provided information that the DNS records were actually not created.

The ingresscontroller's conditions indicate the failure:

```
- lastTransitionTime: 2019-08-19T19:00:47Z
  message: 'The record failed to provision in some zones: [{/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/ci.azure.devcluster.openshift.com map[]}]'
  reason: FailedZones
  status: "False"
  type: DNSReady
```

However the clusteroperator does not:

```
{
    "lastTransitionTime": "2019-08-19T19:00:21Z",
    "reason": "NoIngressControllersDegraded",
    "status": "False",
    "type": "Degraded"
},
// ...
{
    "lastTransitionTime": "2019-08-19T19:04:40Z",
    "message": "desired and current number of IngressControllers are equal",
    "status": "True",
    "type": "Available"
}
```

We should set the clusteroperator's "Available" condition to false when an ingresscontroller has a "DNSReady" condition that is false.
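For reference, a quick way to compare the two conditions on a live cluster is a pair of `oc` jsonpath queries (a sketch; `default` and `ingress` are the usual ingresscontroller and clusteroperator names, adjust if yours differ):

```
# Ingresscontroller's DNSReady condition ("False" in the failing case above).
oc get ingresscontroller default -n openshift-ingress-operator \
  -o jsonpath='{.status.conditions[?(@.type=="DNSReady")].status}{"\n"}'

# Clusteroperator's Available condition (still "True" despite the DNS failure).
oc get clusteroperator ingress \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'
```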
So far I have determined that traffic to the Azure LB from the host network of a worker node that is not running an ingress-controller pod is dropped. The clusters that CI deploys have 3 worker nodes and ingress-controller pods on 2 of them, and the execpod uses host networking, so if it happens to be scheduled on the odd node, its requests get dropped, which causes the test failures. I am still investigating why the traffic to the LB is dropped.
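A rough way to reproduce the asymmetry described above (a sketch; the node and domain names are placeholders, and it assumes `oc debug node` access with `curl` available on the host):

```
# See which workers actually host router pods.
oc get pods -n openshift-ingress -o wide

# Probe a route through the Azure LB from a worker that does NOT host a router pod.
# In the failing case this times out, while the same probe from the other workers succeeds.
oc debug node/<worker-without-router-pod> -- chroot /host \
  curl -sk -o /dev/null -w '%{http_code}\n' \
  https://console-openshift-console.apps.<cluster-domain>
```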
We believe this was solved by https://github.com/openshift/origin/pull/23688. Will open new bugs as necessary.
Checked testgrid and the test passed in a recent CI job, so moving to VERIFIED: https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922