Bug 1725259 - [ci] [azure] Managed cluster should should expose cluster services outside the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-28 21:25 UTC by Abhinav Dahiya
Modified: 2022-08-04 22:24 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:32:46 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/origin pull 23688: "e2e: use container network to access routes" (closed; last updated 2019-12-03 03:09:18 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:32:57 UTC)

Description Abhinav Dahiya 2019-06-28 21:25:44 UTC
[Feature:Platform] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]

This e2e test is failing at a high rate on Azure.

Example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23301/pull-ci-openshift-origin-master-e2e-azure/19

This is blocking Azure CI from going green.
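
For anyone reproducing outside CI, a single conformance test like this can be re-run against a live cluster with the openshift-tests binary built from openshift/origin (a sketch; the exact invocation can vary by release):

```
# Assumes $KUBECONFIG points at a running 4.x cluster and openshift-tests
# was built from the openshift/origin repository.
export KUBECONFIG=/path/to/kubeconfig
openshift-tests run-test \
  "[Feature:Platform] Managed cluster should should expose cluster services outside the cluster [Suite:openshift/conformance/parallel]"
```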

Comment 1 Abhinav Dahiya 2019-06-28 21:27:34 UTC
The following error appears in the logs:

The probe script run by the test (excerpt):

```
curl -X GET -s -S -o /tmp/body -D /tmp/headers "https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com" -w '{"code":%{http_code}}' -k 2>/tmp/error 1>/tmp/output || rc=$?
echo "{\"test\":1,\"rc\":$(echo $rc),\"curl\":$(cat /tmp/output),\"error\":$(cat /tmp/error | json_escape),\"body\":\"$(cat /tmp/body | base64 -w 0 -)\",\"headers\":$(cat /tmp/headers | json_escape)}"
```

Jun 28 20:48:54.596: INFO: stderr:

```
+ set -euo pipefail
+ rc=0
+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{"code":%{http_code}}' -k
+ rc=7
++ echo 7
++ cat /tmp/output
++ cat /tmp/error
++ json_escape
++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'
++ cat /tmp/body
cat: /tmp/body: No such file or directory
++ base64 -w 0 -
++ cat /tmp/headers
++ json_escape
++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'
+ echo '{"test":0,"rc":7,"curl":{"code":000},"error":"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\n","body":"","headers":""}'
+ rc=0
+ curl -X GET -s -S -o /tmp/body -D /tmp/headers https://prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com -w '{"code":%{http_code}}' -k
+ rc=7
++ echo 7
++ cat /tmp/output
++ cat /tmp/error
++ json_escape
++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'
++ cat /tmp/body
++ base64 -w 0 -
cat: /tmp/body: No such file or directory
++ json_escape
++ cat /tmp/headers
++ python -c 'import json,sys; print json.dumps(sys.stdin.read())'
+ echo '{"test":1,"rc":7,"curl":{"code":000},"error":"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\n","body":"","headers":""}'
```

Jun 28 20:48:54.597: INFO: stdout:

```
{"test":0,"rc":7,"curl":{"code":000},"error":"curl: (7) Failed connect to console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\n","body":"","headers":""}
{"test":1,"rc":7,"curl":{"code":000},"error":"curl: (7) Failed connect to prometheus-k8s-openshift-monitoring.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com:443; Connection timed out\n","body":"","headers":""}
```

The curl exit code 7 with "Connection timed out" indicates that connections to the router are hanging and timing out rather than being actively refused.
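
To separate a DNS problem from a refused or dropped connection, the route can be probed from outside the cluster along these lines (a sketch; the hostnames are the ones from the CI run above):

```
# Does the wildcard *.apps record resolve to the load balancer at all?
dig +short console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com

# An actively refused connection fails immediately ("Connection refused"),
# while a silently dropped packet hangs until the timeout ("Connection timed
# out"); both surface as curl exit code 7, and the log above shows the latter.
curl -kIv --connect-timeout 10 \
  https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com
```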

Comment 4 Chuan Yu 2019-08-20 02:56:16 UTC
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52

New error for this case:

```
fail [github.com/openshift/origin/test/extended/util/url/url.go:134]: Unexpected error:
    <*errors.errorString | 0xc0002733f0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```

Comment 5 Abhinav Dahiya 2019-08-20 16:37:18 UTC
(In reply to Chuan Yu from comment #4)
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/53
> https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/52
> 
> New error for this case:
> fail [github.com/openshift/origin/test/extended/util/url/url.go:134]:
> Unexpected error:
>     <*errors.errorString | 0xc0002733f0>: {
>         s: "timed out waiting for the condition",
>     }
>     timed out waiting for the condition
> occurred

Currently the Azure public routes are broken; see https://bugzilla.redhat.com/show_bug.cgi?id=1743728.

I would like to keep this bug open to track the flake rate, setting aside the current breakage.

Also, it would be nice if the ingress-operator surfaced the fact that the DNS records were not actually created.

Comment 6 Miciah Dashiel Butler Masters 2019-08-21 01:25:54 UTC
> Also, it would be nice if the ingress-operator surfaced the fact that the DNS records were not actually created.

The ingresscontroller's conditions indicate the failure:

```
  - lastTransitionTime: 2019-08-19T19:00:47Z
    message: 'The record failed to provision in some zones: [{/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/os4-common/providers/Microsoft.Network/dnszones/ci.azure.devcluster.openshift.com
      map[]}]'
    reason: FailedZones
    status: "False"
    type: DNSReady
```

However the clusteroperator does not:

```
{
    "lastTransitionTime": "2019-08-19T19:00:21Z",
    "reason": "NoIngressControllersDegraded",
    "status": "False",
    "type": "Degraded"
},
// ...
{
    "lastTransitionTime": "2019-08-19T19:04:40Z",
    "message": "desired and current number of IngressControllers are equal",
    "status": "True",
    "type": "Available"
}
```

We should set the clusteroperator's "Available" condition to false when an ingresscontroller has a "DNSReady" condition that is false.
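
The mismatch can be seen directly with oc; a sketch, using the condition types quoted above (the default ingresscontroller lives in the openshift-ingress-operator namespace):

```
# Per-ingresscontroller DNS status: reports DNSReady=False with reason FailedZones.
oc -n openshift-ingress-operator get ingresscontroller/default \
  -o jsonpath='{.status.conditions[?(@.type=="DNSReady")]}{"\n"}'

# Aggregated operator status: still reports Available=True despite the failed records.
oc get clusteroperator/ingress \
  -o jsonpath='{.status.conditions[?(@.type=="Available")]}{"\n"}'
```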

Comment 7 Miciah Dashiel Butler Masters 2019-08-26 18:06:38 UTC
So far I have determined that traffic to the Azure LB from the host network of a worker node that is not running an ingress-controller pod is dropped.  The clusters that CI deploys have 3 worker nodes and ingress-controller pods on 2 of them, and the execpod uses host networking, so if it happens to be scheduled on the odd node, its requests get dropped, which causes the test failures.  I am still investigating why the traffic to the LB is dropped.
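
One way to confirm that per-node behavior is to curl the route through the LB from each worker's host network in turn, e.g. with oc debug (a sketch; assumes cluster-admin access):

```
# Nodes running an ingress-controller (router) pod should return an HTTP code;
# on the odd node without one, the request hangs and times out (000).
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "=== $node ==="
  oc debug "$node" -- chroot /host \
    curl -sk -o /dev/null -w '%{http_code}\n' --connect-timeout 10 \
    https://console-openshift-console.apps.ci-op-nin15svi-5cef7.ci.azure.devcluster.openshift.com || true
done
```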

Comment 8 Dan Mace 2019-08-30 15:04:08 UTC
We believe this was solved by https://github.com/openshift/origin/pull/23688. We will open new bugs as necessary.

Comment 10 Hongan Li 2019-09-04 02:41:03 UTC
Checked TestGrid and the test passed in recent CI jobs, so moving to VERIFIED.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-azure-4.2&sort-by-flakiness=

Comment 11 errata-xmlrpc 2019-10-16 06:32:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

