Bug 1725259

Summary: [ci] [azure] Managed cluster should should expose cluster services outside the cluster
Product: OpenShift Container Platform
Component: Networking
Sub component: router
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Abhinav Dahiya <adahiya>
Assignee: Miciah Dashiel Butler Masters <mmasters>
QA Contact: Hongan Li <hongli>
CC: aos-bugs, dmace, wking
Type: Bug
Last Closed: 2019-10-16 06:32:46 UTC

Description Abhinav Dahiya 2019-06-28 21:25:44 UTC
[Feature:Platform] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]

This e2e test is failing pretty heavily on Azure.

example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23301/pull-ci-openshift-origin-master-e2e-azure/19

This is blocking the Azure CI jobs from going green.

Comment 1 Abhinav Dahiya 2019-06-28 21:27:34 UTC
The following error appears in the logs:

```
curl -X GET -s -S -o /tmp/body -D /tmp/headers "https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com" -w '{"code":%{http_code}}' -k 2>/tmp/error 1>/tmp/output || rc=$? echo "{\"test\":1,\"rc\":$(echo $rc),\"curl\":$(cat /tmp/output),\"error\":$(cat /tmp/error | json_escape),\"body\":\"$(cat /tmp/body | base64 -w 0 -)\",\"headers\":$(cat /tmp/headers | json_escape)}"'

Jun 28 20:48:54.596: INFO: stderr: "+ set -euo pipefail\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\ncat: /tmp/body: No such file or directory\n++ base64 -w 0 -\n++ cat /tmp/headers\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n+ rc=0\n+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{\"code\":%{http_code}}' -k\n+ rc=7\n++ echo 7\n++ cat /tmp/output\n++ cat /tmp/error\n++ json_escape\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n++ cat /tmp/body\n++ base64 -w 0 -\ncat: /tmp/body: No such file or directory\n++ json_escape\n++ cat /tmp/headers\n++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'\n+ echo '{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}'\n"

Jun 28 20:48:54.597: INFO: stdout: "{\"test\":0,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n{\"test\":1,\"rc\":7,\"curl\":{\"code\":000},\"error\":\"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\\n\",\"body\":\"\",\"headers\":\"\"}\n"
```

This indicates that the router is refusing connections?
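
For reference, the check in the output above boils down to the following probe. This is a hand-written sketch, not the exact test code: the hostname is copied from this CI run and will differ for other clusters, and --connect-timeout is added only so the sketch fails fast.

```
# Minimal sketch of the per-route probe the e2e test performs.
HOST="https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com"
rc=0
curl -X GET -s -S -k --connect-timeout 30 \
  -o /tmp/body -D /tmp/headers "$HOST" -w '{"code":%{http_code}}' || rc=$?
# In the failing runs curl exits with rc=7 ("Failed connect ...; Connection
# timed out") for both the console and prometheus routes.
echo "curl exit code: $rc"
```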

Comment 4 Chuan Yu 2019-08-20 02:56:16 UTC
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52

New error for this case:
fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
    <*errors.errorString | 0xc0002733f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

Comment 5 Abhinav Dahiya 2019-08-20 16:37:18 UTC
(In reply to Chuan Yu from comment #4)
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52
> 
> New error for this case:
> fail [github.com/openshift/origin/test/extended/util/url/url.go:134]:
> Unexpected error:
>     <*errors.errorString | 0xc0002733f0>: {
>         s: "timed out waiting for the condition",
>     }
>     timed out waiting for the condition
> occurred

Currently the Azure public routes are broken; see https://bugzilla.redhat.com/show_bug.cgi?id=1743728.

I would like to keep this bug open to track the flake rate, setting aside the current breakage.

Also, it would be nice if the ingress-operator reported that the DNS records were not actually created.
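
A quick way to confirm that on a live cluster (a sketch, not part of the test; the hostname is taken from the run in comment 1, substitute your cluster's apps domain):

```
# Resolve a route host to see whether the *.apps wildcard record exists.
dig +short console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com
# An empty answer means the record was never created, consistent with a
# FailedZones/DNSReady failure on the ingresscontroller.
```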

Comment 6 Miciah Dashiel Butler Masters 2019-08-21 01:25:54 UTC
> Also, it would be nice if the ingress-operator reported that the DNS records were not actually created.

The ingresscontroller's conditions indicate the failure:

      - lastTransitionTime: 2019-08-19T19:00:47Z
        message: 'The record failed to provision in some zones: [{/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/ci.azure.devcluster.openshift.com
          map[]}]'
        reason: FailedZones
        status: "False"
        type: DNSReady

However the clusteroperator does not:

                    {
                        "lastTransitionTime": "2019-08-19T19:00:21Z",
                        "reason": "NoIngressControllersDegraded",
                        "status": "False",
                        "type": "Degraded"
                    },
                    // ...
                    {
                        "lastTransitionTime": "2019-08-19T19:04:40Z",
                        "message": "desired and current number of IngressControllers are equal",
                        "status": "True",
                        "type": "Available"
                    }

We should set the clusteroperator's "Available" condition to false when an ingresscontroller has a "DNSReady" condition that is false.
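
To see the mismatch directly on an affected cluster (a sketch assuming the default ingresscontroller and the resource names of a standard install):

```
# Compare the ingresscontroller's DNSReady status with the ingress
# clusteroperator's Available status.
oc -n openshift-ingress-operator get ingresscontroller default \
  -o jsonpath='{.status.conditions[?(@.type=="DNSReady")].status}{"\n"}'
oc get clusteroperator ingress \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'
# On the affected clusters the first command prints "False" while the second
# still prints "True", which is the inconsistency described above.
```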

Comment 7 Miciah Dashiel Butler Masters 2019-08-26 18:06:38 UTC
So far I have determined that traffic to the Azure LB from the host network of a worker node that is not running an ingress-controller pod is dropped.  The clusters that CI deploys have 3 worker nodes and ingress-controller pods on 2 of them, and the execpod uses host networking, so if it happens to be scheduled on the odd node, its requests get dropped, which causes the test failures.  I am still investigating why the traffic to the LB is dropped.
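
A quick way to check that layout on a given cluster (sketch; standard namespaces and labels assumed):

```
# List which nodes host router (ingress-controller) pods, then list all workers;
# the execpod can land on the worker that has no router pod.
oc -n openshift-ingress get pods -o wide
oc get nodes -l node-role.kubernetes.io/worker
# With 3 workers and router pods on only 2 of them, requests from the host
# network of the third worker to the Azure LB were observed to be dropped.
```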

Comment 8 Dan Mace 2019-08-30 15:04:08 UTC
We believe this was solved by https://github.com/openshift/origin/pull/23688. We will open new bugs as necessary.

Comment 10 Hongan Li 2019-09-04 02:41:03 UTC
Checked TestGrid and the test passed in recent CI jobs, so moving to VERIFIED.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=

Comment 11 errata-xmlrpc 2019-10-16 06:32:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922