Bug 1628961 - [free-stg] upgrade failed due to '/apis/metrics.k8s.io/v1beta1' -> "Error from server (ServiceUnavailable): the server is currently unable to handle the request"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.0
Assignee: Scott Dodson
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-09-14 14:07 UTC by Justin Pierce
Modified: 2018-11-26 15:51 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-26 15:51:58 UTC
Target Upstream Version:
Embargoed:


Attachments
listings from apiservice / pods (7.66 KB, text/plain)
2018-09-14 14:07 UTC, Justin Pierce
master api log excerpt (11.45 KB, text/plain)
2018-09-14 14:25 UTC, Justin Pierce

Description Justin Pierce 2018-09-14 14:07:43 UTC
Created attachment 1483339 [details]
listings from apiservice / pods

Description of problem:
During an openshift-ansible based upgrade from 3.11.0-0.32.0 -> 3.11.3, the upgrade stopped after trying and failing for several minutes (30 attempts) to query the metrics API:

fatal: [free-stg-master-03fb6]: FAILED! => {"attempts": 30, "changed": true, "cmd": ["oc", "get", "--raw", "/apis/metrics.k8s.io/v1beta1"], "delta": "0:00:00.358927", "end": "2018-09-14 13:26:36.640073", "msg": "non-zero return code", "rc": 1, "start": "2018-09-14 13:26:36.281146", "stderr": "Error from server (ServiceUnavailable): the server is currently unable to handle the request", "stderr_lines": ["Error from server (ServiceUnavailable): the server is currently unable to handle the request"], "stdout": "", "stdout_lines": []}
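
For triage, the failure line above can be split into its host and JSON result. The following is only a minimal Python sketch, not part of openshift-ansible; the helper name `parse_ansible_failure` is hypothetical, and the sample line is a trimmed copy of the output above (timing fields omitted).

```python
import json


def parse_ansible_failure(line):
    """Split a 'fatal: [host]: FAILED! => {...}' line into (host, result dict)."""
    head, sep, payload = line.partition(" => ")
    if not sep:
        raise ValueError("not an Ansible failure line")
    # The host name sits between the first pair of square brackets.
    host = head[head.index("[") + 1 : head.index("]")]
    return host, json.loads(payload)


# Trimmed copy of the failure reported above (timing fields left out).
sample = (
    'fatal: [free-stg-master-03fb6]: FAILED! => {"attempts": 30, "changed": true, '
    '"cmd": ["oc", "get", "--raw", "/apis/metrics.k8s.io/v1beta1"], '
    '"msg": "non-zero return code", "rc": 1, '
    '"stderr": "Error from server (ServiceUnavailable): '
    'the server is currently unable to handle the request"}'
)

host, result = parse_ansible_failure(sample)
print(host, result["attempts"], result["rc"], " ".join(result["cmd"]))
```

The useful fields here are `attempts` (the check was retried 30 times), `rc`, and `stderr`, which together show that the API endpoint never became available during the retry window.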


Version-Release number of selected component (if applicable):
v3.11.3

Steps to Reproduce:
1. Run an openshift-ansible upgrade from 3.11.0-0.32.0 -> 3.11.3; the error occurred during that upgrade.

Additional info:
See attachments for detailed listings

Comment 1 Justin Pierce 2018-09-14 14:25:29 UTC
Created attachment 1483342 [details]
master api log excerpt

Comment 2 David Eads 2018-09-14 14:38:37 UTC
This is neat. The "listings from apiservice / pods" attachment indicates that the metrics pod is running, but the core apiserver cannot find a route to the service IP.
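
One way to see this kind of mismatch (backing pod running, aggregated API unavailable) is to read the `Available` condition on the APIService object, for example from `oc get apiservice v1beta1.metrics.k8s.io -o json`. The sketch below is an illustration only: the function name is hypothetical, and the sample object is made up for the example, not taken from the attachment.

```python
import json


def apiservice_available(obj):
    """Return (available, reason, message) from an APIService's status.conditions."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == "Available":
            return (
                cond.get("status") == "True",
                cond.get("reason", ""),
                cond.get("message", ""),
            )
    # No Available condition reported at all: treat as unavailable.
    return False, "NoAvailableCondition", ""


# Hypothetical sample object; the reason and message are placeholders,
# not values copied from this cluster.
sample = json.loads("""
{
  "kind": "APIService",
  "metadata": {"name": "v1beta1.metrics.k8s.io"},
  "status": {
    "conditions": [
      {"type": "Available", "status": "False",
       "reason": "ExampleReason", "message": "example message"}
    ]
  }
}
""")

available, reason, message = apiservice_available(sample)
print(available, reason, message)
```

If the condition is `False` while the metrics pod itself is `Running`, the problem is more likely between the apiserver and the service (routing, endpoints) than in the pod.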

Comment 3 Justin Pierce 2018-09-14 15:04:39 UTC
Further investigation found that the controller pods were in CrashLoopBackOff, failing with:

F0914 14:21:01.166381       1 plugins.go:136] Could not create hostpath recycler pod from file /etc/origin/master/recycler_pod.yaml: failed to read file path /etc/origin/master/recycler_pod.yaml: open /etc/origin/master/recycler_pod.yaml: no such file or directory

This was traced to an openshift-ansible problem.
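
The fatal line above follows the usual glog/klog layout (severity letter, month and day, time, thread id, `file:line]`, message). A minimal Python sketch for pulling out the source location and the missing path is shown below; the helper names are hypothetical, and the regular expressions assume that layout and the "open <path>: no such file or directory" wording seen in this log.

```python
import re

# Assumed glog/klog header layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)
MISSING_RE = re.compile(r"open (\S+): no such file or directory")


def parse_klog_line(line):
    """Return the header fields and message as a dict, or None if no match."""
    match = KLOG_RE.match(line)
    return match.groupdict() if match else None


def missing_file(message):
    """Return the path from an 'open <path>: no such file or directory' error."""
    match = MISSING_RE.search(message)
    return match.group(1) if match else None


log_line = (
    "F0914 14:21:01.166381       1 plugins.go:136] Could not create hostpath "
    "recycler pod from file /etc/origin/master/recycler_pod.yaml: failed to read "
    "file path /etc/origin/master/recycler_pod.yaml: open "
    "/etc/origin/master/recycler_pod.yaml: no such file or directory"
)

fields = parse_klog_line(log_line)
print(fields["severity"], fields["file"], fields["line"])
print(missing_file(fields["message"]))
```

Severity `F` marks a fatal exit, which matches the crash loop: the controller process terminates each time it fails to read the recycler pod definition.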

Comment 4 Scott Dodson 2018-09-14 15:05:05 UTC
At least in this specific environment (free-stg), controllers were crashing due to a misconfiguration introduced in last night's build. Deploying the recycler pod definition and restarting the controllers seems to have cleared up the metrics server API failure. I imagine the controllers dying prevented the network from properly converging after the control plane was restarted.

https://github.com/openshift/openshift-ansible/pull/10082 to fix that

I'm taking this bug for Upgrade component and verifying in my environment.
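
To re-check the metrics API after applying a fix like this, a poll loop similar to the upgrade's retried check can be used. This is a rough Python sketch and not the openshift-ansible task itself: the function name, the injectable `run`/`sleep` parameters, and the 10-second delay are assumptions; only the `oc get --raw` command and the 30 attempts come from the failure output in the description.

```python
import subprocess
import time

METRICS_CMD = ["oc", "get", "--raw", "/apis/metrics.k8s.io/v1beta1"]


def wait_for_metrics_api(attempts=30, delay=10.0, run=None, sleep=time.sleep):
    """Retry the metrics API check; return the attempt that succeeded, or None.

    `run` must return a process exit code. By default it shells out to `oc`.
    The delay between attempts is an assumed value, not the playbook's.
    """
    if run is None:
        def run():
            proc = subprocess.run(METRICS_CMD, capture_output=True, text=True)
            return proc.returncode

    for attempt in range(1, attempts + 1):
        if run() == 0:
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next try, but not after the last one
    return None
```

Calling `wait_for_metrics_api()` on a master with a logged-in `oc` returns the attempt number on success, or `None` if every attempt exits non-zero, which corresponds to the "attempts": 30 failure reported in the description.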

Comment 5 Scott Dodson 2018-09-14 18:05:21 UTC
https://github.com/openshift/openshift-ansible/pull/10089 release-3.11 pick

Comment 6 Wei Sun 2018-09-17 02:48:06 UTC
PR 10089 has been merged into openshift-ansible-3.11.7-1.

