Bug 1694226

Summary: cluster upgrade should maintain a functioning cluster during upgrade: Available: v1.quota.openshift.io is not ready: 503
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: medium
Priority: medium
Version: 4.1.0
CC: adahiya, aos-bugs, jokerman, mifiedle, mmccomas, nmoraiti, wking, yapei
Target Milestone: ---
Keywords: BetaBlocker
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:46:40 UTC
Type: Bug
Bug Depends On: 1698672, 1698950, 1700504    
Attachments: Recent instances of this error in CI (flags: none)

Description Ben Parees 2019-03-29 20:03:54 UTC
Description of problem:
fail [k8s.io/kubernetes/test/e2e/framework/util.go:2396]: Expected error:
    <*errors.errorString | 0xc421fb9920>: {
        s: "failed to get logs from pod-secrets-d75c0e4f-51d8-11e9-9953-0a58ac101164 for secret-env-test: an error on the server (\"unknown\") has prevented the request from succeeding (get pods pod-secrets-d75c0e4f-51d8-11e9-9953-0a58ac101164)",
    }
    failed to get logs from pod-secrets-d75c0e4f-51d8-11e9-9953-0a58ac101164 for secret-env-test: an error on the server ("unknown") has prevented the request from succeeding (get pods pod-secrets-d75c0e4f-51d8-11e9-9953-0a58ac101164)
not to have occurred

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/726

It looks like the kube-apiserver failed to respond to requests during the upgrade.

Comment 3 Michal Fojtik 2019-04-01 10:11:46 UTC
Ben, do we have any info on how often we see this flake?

Comment 4 Ben Parees 2019-04-01 13:54:08 UTC
The two I linked were the two I saw in the two days of history I went through, but you can query all job runs for the last 7 days here:

https://search.svc.ci.openshift.org/?search=failed+to+get+logs+from+pod&maxAge=168h&context=2&type=all

Comment 5 Russell Teague 2019-04-03 13:06:04 UTC
Seeing consistent failures on this test.  The search linked above is not picking up failures in the last 12 hours.

https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/

(Build Cop)

Comment 6 Abhinav Dahiya 2019-04-04 17:33:40 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/908

The CVO reported a successful upgrade (i.e. Available) at 12:58:01, but the history shows completion at 13:48:42.
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterVersion",
            "metadata": {
                "creationTimestamp": "2019-04-04T12:38:16Z",
                "generation": 2,
                "name": "version",
                "resourceVersion": "46307",
                "selfLink": "/apis/config.openshift.io/v1/clusterversions/version",
                "uid": "850c19ae-56d6-11e9-97a2-122b11cdb986"
            },
            "spec": {
                "channel": "stable-4.0",
                "clusterID": "5250a589-158a-42c9-a86b-e312876f4705",
                "desiredUpdate": {
                    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-04-04-121901",
                    "version": ""
                },
                "upstream": "https://api.openshift.com/api/upgrades_info/v1/graph"
            },
            "status": {
                "availableUpdates": null,
                "conditions": [
                    {
                        "lastTransitionTime": "2019-04-04T12:58:01Z",
                        "message": "Done applying 4.0.0-0.ci-2019-04-04-121901",
                        "status": "True",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-04-04T13:43:27Z",
                        "status": "False",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-04T13:48:42Z",
                        "message": "Cluster version is 4.0.0-0.ci-2019-04-04-121901",
                        "status": "False",
                        "type": "Progressing"
                    },
                    {
                        "lastTransitionTime": "2019-04-04T12:38:36Z",
                        "message": "Unable to retrieve available updates: unknown version 4.0.0-0.ci-2019-04-04-121901",
                        "reason": "RemoteFailed",
                        "status": "False",
                        "type": "RetrievedUpdates"
                    }
                ],
                "desired": {
                    "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-04-04-121901",
                    "version": "4.0.0-0.ci-2019-04-04-121901"
                },
                "history": [
                    {
                        "completionTime": "2019-04-04T13:48:42Z",
                        "image": "registry.svc.ci.openshift.org/ocp/release:4.0.0-0.ci-2019-04-04-121901",
                        "startedTime": "2019-04-04T13:00:46Z",
                        "state": "Completed",
                        "version": "4.0.0-0.ci-2019-04-04-121901"
                    },
                    {
                        "completionTime": "2019-04-04T13:00:46Z",
                        "image": "registry.svc.ci.openshift.org/ocp/release@sha256:38615fee13cc324aded26048a26e075cc6d3247f87cea90e49f0685bf798c304",
                        "startedTime": "2019-04-04T12:38:36Z",
                        "state": "Completed",
                        "version": "4.0.0-0.ci-2019-04-04-081851"
                    }
                ],
                "observedGeneration": 2,
                "versionHash": "S3imd-IFzHk="
            }
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}
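The gap between the Available transition and the recorded completion can be read straight off the dump above. The following is a minimal Python sketch, not part of the original report: the helper names `parse_ts` and `available_vs_completion` are hypothetical, the sample dict copies only the relevant fields, and it assumes the newest history entry comes first, as it does in this dump.

```python
from datetime import datetime, timezone

def parse_ts(ts: str) -> datetime:
    # Timestamps in the dump are UTC in the form 2019-04-04T12:58:01Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def available_vs_completion(cluster_version: dict):
    """Return (available_at, completed_at, gap) for one ClusterVersion object."""
    status = cluster_version["status"]
    available = next(c for c in status["conditions"] if c["type"] == "Available")
    latest = status["history"][0]  # assumption: newest history entry is listed first
    available_at = parse_ts(available["lastTransitionTime"])
    completed_at = parse_ts(latest["completionTime"])
    return available_at, completed_at, completed_at - available_at

# Fields copied from the ClusterVersion dump above; everything else is omitted.
sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True",
             "lastTransitionTime": "2019-04-04T12:58:01Z"},
            {"type": "Progressing", "status": "False",
             "lastTransitionTime": "2019-04-04T13:48:42Z"},
        ],
        "history": [
            {"state": "Completed", "startedTime": "2019-04-04T13:00:46Z",
             "completionTime": "2019-04-04T13:48:42Z"},
            {"state": "Completed", "startedTime": "2019-04-04T12:38:36Z",
             "completionTime": "2019-04-04T13:00:46Z"},
        ],
    }
}

available_at, completed_at, gap = available_vs_completion(sample)
print(gap)  # prints 0:50:41
```

The Available condition here predates the start of the second history entry (13:00:46), so the timestamp most likely reflects the initial install rather than the upgrade itself.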


openshift-apiserver, meanwhile, went Available=False at 13:50:12:
curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/908/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items[] | select(.status.conditions[] | .type == "Available" and .status != "True") | [.metadata.name, .status.conditions]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 61290  100 61290    0     0   126k      0 --:--:-- --:--:-- --:--:--  125k
[
  "openshift-apiserver",
  [
    {
      "lastTransitionTime": "2019-04-04T13:35:42Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-04-04T13:35:48Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-04-04T13:50:12Z",
      "message": "Available: v1.quota.openshift.io is not ready: 503",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-04-04T13:35:42Z",
      "reason": "AsExpected",
      "status": "True",
      "type": "Upgradeable"
    }
  ]
]
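For anyone without jq at hand, the filter used above can be approximated in Python. This is a rough sketch under stated assumptions: `unavailable_operators` is a hypothetical helper, the input is an already-downloaded and parsed clusteroperators.json, and `example-operator` in the sample is invented purely to show a healthy entry being skipped.

```python
def unavailable_operators(cluster_operators: dict) -> list:
    """Return [name, conditions] for every operator whose Available condition is not True."""
    result = []
    for item in cluster_operators.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        # Same predicate as the jq select(): type == "Available" and status != "True".
        if any(c.get("type") == "Available" and c.get("status") != "True"
               for c in conditions):
            result.append([item["metadata"]["name"], conditions])
    return result

sample = {
    "items": [
        {
            "metadata": {"name": "openshift-apiserver"},
            "status": {"conditions": [
                {"type": "Failing", "status": "False", "reason": "AsExpected"},
                {"type": "Available", "status": "False", "reason": "Available",
                 "message": "Available: v1.quota.openshift.io is not ready: 503"},
            ]},
        },
        {
            # Hypothetical healthy operator, not taken from the CI artifacts.
            "metadata": {"name": "example-operator"},
            "status": {"conditions": [
                {"type": "Available", "status": "True", "reason": "AsExpected"},
            ]},
        },
    ]
}

names = [name for name, _ in unavailable_operators(sample)]
print(names)  # prints ['openshift-apiserver']
```

Unlike the jq `select`, which emits an item once per matching condition, `any()` reports each operator at most once.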

Comment 7 Abhinav Dahiya 2019-04-04 17:36:22 UTC
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/907/artifacts/e2e-aws-upgrade/ shows a similar openshift-apiserver failure.

curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/907/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items[] | select(.status.conditions[] | .type == "Available" and .status != "True") | [.metadata.name, .status.conditions]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 61530  100 61530    0     0   130k      0 --:--:-- --:--:-- --:--:--  130k
[
  "openshift-apiserver",
  [
    {
      "lastTransitionTime": "2019-04-04T13:05:44Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-04-04T13:06:02Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-04-04T13:22:01Z",
      "message": "Available: v1.quota.openshift.io is not ready: 503",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-04-04T13:05:44Z",
      "reason": "AsExpected",
      "status": "True",
      "type": "Upgradeable"
    }
  ]
]

Comment 8 Abhinav Dahiya 2019-04-04 17:42:38 UTC
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/904 is failing with a similar error.

curl https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/905/artifacts/e2e-aws-upgrade/clusteroperators.json | jq '.items[] | select(.status.conditions[] | .type == "Available" and .status != "True") | [.metadata.name, .status.conditions]'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 61288  100 61288    0     0   130k      0 --:--:-- --:--:-- --:--:--  130k
[
  "openshift-apiserver",
  [
    {
      "lastTransitionTime": "2019-04-04T11:04:25Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2019-04-04T11:04:31Z",
      "reason": "AsExpected",
      "status": "False",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-04-04T11:20:38Z",
      "message": "Available: v1.quota.openshift.io is not ready: 503",
      "reason": "Available",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-04-04T11:04:25Z",
      "reason": "AsExpected",
      "status": "True",
      "type": "Upgradeable"
    }
  ]
]

Comment 9 Michal Fojtik 2019-04-05 11:02:50 UTC
(In reply to Abhinav Dahiya from comment #8)

This is the same failure as https://bugzilla.redhat.com/show_bug.cgi?id=1696387


Comment 10 Michal Fojtik 2019-04-09 11:20:29 UTC
*** Bug 1696387 has been marked as a duplicate of this bug. ***

Comment 11 W. Trevor King 2019-04-09 17:15:20 UTC
*** Bug 1698033 has been marked as a duplicate of this bug. ***

Comment 12 Michal Fojtik 2019-04-10 17:59:26 UTC
https://github.com/openshift/origin/pull/22425 has merged; we should no longer see "message": "Available: v1.quota.openshift.io is not ready: 503".

Comment 13 W. Trevor King 2019-04-10 18:57:47 UTC
[1] (launched just before origin#22425 landed) hit this.  I'll check back in a few hours to make sure these have gone away.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1585/pull-ci-openshift-installer-master-e2e-aws/5108

Comment 14 W. Trevor King 2019-04-10 21:24:52 UTC
[1] hit this again despite starting well after origin#22425 landed, but for some reason it is still running an older origin commit:

  $ curl -s 'https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-samples-operator/129/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-image-ecosystem/343?log#log' | grep 'Available: v1.quota.openshift.io is not ready: 503'
  Apr 10 19:41:39.739 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  Apr 10 19:41:46.944 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  Apr 10 19:41:56.542 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  Apr 10 19:42:03.754 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  Apr 10 19:42:10.944 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  Apr 10 19:42:18.141 W clusteroperator/openshift-apiserver changed Available to False: Available: Available: v1.quota.openshift.io is not ready: 503
  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-samples-operator/129/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-image-ecosystem/343/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "hyperkube").annotations'
  {
    "io.openshift.build.commit.id": "af45cda5bce85838501f67afade94c6871fd1e4f",
    "io.openshift.build.commit.ref": "master",
    "io.openshift.build.source-location": "https://github.com/openshift/origin",
    "io.openshift.build.versions": "kubernetes=1.13.4"
  }
  $ git log --first-parent --format='%ad %h %d %s' --date=iso -3 origin/master |cat
  2019-04-10 12:59:01 -0700 2108314cd8  (origin/release-4.0, origin/master, origin/HEAD) Merge pull request #22504 from smarterclayton/handle_multiple_target_path
  2019-04-10 10:40:39 -0700 d212b13acc  Merge pull request #22425 from mfojtik/crq-to-crd
  2019-04-10 08:09:03 -0400 af45cda5bc  Merge pull request #22521 from deads2k/quota-pick

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-samples-operator/129/pull-ci-openshift-cluster-samples-operator-master-e2e-aws-image-ecosystem/343
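The check done by eye above (is the payload's hyperkube commit at or after the merge of #22425 on the first-parent history?) can be sketched in Python. This is an illustration only: `includes_fix` is a hypothetical helper, and it works on the abbreviated hashes copied from the `git log --first-parent` output above, newest first. With a local clone, `git merge-base --is-ancestor <fix> <build>` answers the same question directly.

```python
def includes_fix(first_parent_log, build_commit, fix_commit):
    """Return True if build_commit is the fix merge or a newer first-parent commit.

    first_parent_log: commit hashes (possibly abbreviated), newest first.
    """
    def index_of(commit):
        for i, h in enumerate(first_parent_log):
            # Accept full or abbreviated hashes on either side.
            if commit.startswith(h) or h.startswith(commit):
                return i
        raise ValueError(f"commit {commit} not found in log")

    # A smaller index means a newer commit, because the log is newest first.
    return index_of(build_commit) <= index_of(fix_commit)

# Hashes copied from the git log output above, newest first.
log = ["2108314cd8", "d212b13acc", "af45cda5bc"]
fix_merge = "d212b13acc"  # Merge pull request #22425
payload_commit = "af45cda5bce85838501f67afade94c6871fd1e4f"  # io.openshift.build.commit.id

has_fix = includes_fix(log, payload_commit, fix_merge)
print(has_fix)  # prints False
```

The payload commit is the parent of the fix merge, which matches the observation that the job was still running pre-fix code.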

Comment 15 Xingxing Xia 2019-04-11 08:37:17 UTC
Still reproducible in the latest payload, 4.0.0-0.nightly-2019-04-10-182914, which does not yet include the fix PR above. Will check again when a new payload includes it.

Comment 16 Mike Fiedler 2019-04-11 12:06:44 UTC
Marking as BetaBlocker based on the apiserver upgrade failure in the duplicate https://bugzilla.redhat.com/show_bug.cgi?id=1696387

Comment 18 W. Trevor King 2019-04-11 18:37:51 UTC
Created attachment 1554620
Recent instances of this error in CI

The only instances since the fix are in upgrade tests, so I think we're good :).

Comment 19 Xingxing Xia 2019-04-19 06:30:29 UTC
Verified in the latest payload, 4.1.0-0.nightly-2019-04-18-210657; the error message is no longer seen.

Comment 21 errata-xmlrpc 2019-06-04 10:46:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758