Bug 1599428

Summary:

Need a better error message when oc commands timeout/fail during storage upgrade

Product:

OpenShift Container Platform

Reporter:

Mike Fiedler <mifiedle>

Component:

Assignee:

Juan Vallejo <jvallejo>

Status:

CLOSED ERRATA

QA Contact:

Mike Fiedler <mifiedle>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

3.10.0

CC:

aos-bugs, deads, jokerman, mifiedle, mmccomas, vlaad

Target Milestone:

---

Target Release:

3.11.0

Hardware:

x86_64

OS:

Linux

Last Closed:

2018-10-11 07:21:36 UTC

Type:

Bug

Bug Depends On:

1616840

Bug Blocks:

Attachments:

Description	Flags
loglevel=8 output	none
oc output with loglevel=8	none
patched oc log	none

Description Mike Fiedler 2018-07-09 19:03:18 UTC

Description of problem:


Upgrading from 3.9 to 3.10.  When the upgrade gets to the "Upgrade all storage" task, oc commands either stop working or take a very long time to time out.

It looks like this:

oc get pods -n default
E0709 18:59:34.877759  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:00:38.886405  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:10.889660  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:42.893057  102108 round_trippers.go:169] CancelRequest not implemented
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-dkhpr    1/1       Running   0          47m
registry-console-1-6jglj   1/1       Running   1          46m
router-1-g4d5h             1/1       Running   0          47m


Version-Release number of selected component (if applicable): 3.10.15


How reproducible: Always during storage upgrade phase of upgrade


Steps to Reproduce:
1. Create a small 3.9 cluster - verify oc commands work
2. Run the upgrade.yml playbook
3. When the upgrade hits "Upgrade all storage", try to run an oc get command

Actual results:

See above

Expected results:

A more human-consumable error message.

Additional info:

Comment 1 Mike Fiedler 2018-07-10 15:16:18 UTC

Created attachment 1457853 [details]
loglevel=8 output

Comment 2 Juan Vallejo 2018-07-16 23:00:25 UTC

Does the oc binary still give the "CancelRequest not implemented" error if used against a different cluster than the one that was just upgraded?

Also, can you confirm that the version of `oc` that you're getting this error on is 3.10?

Based on your attachment, the error message appears to be originating from the UserAgent round-tripper's CancelRequest method [2]. It could be that, due to altered config during the upgrade process, the round-tripper being used here [3] does not implement a CancelRequest method, causing the error message seen (and the delay in executing commands).

Adding David in case he can provide more information as well.

1. https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-master/private/upgrade.yml#L69
2. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L169
3. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L37

Comment 3 Mike Fiedler 2018-07-17 20:05:10 UTC

1.  Using the client against a different cluster is successful.  To be clear, the cluster where the error occurs is in the middle of an upgrade; it has not yet been fully upgraded.

2.  At the time the error occurs, the client is 3.10 and the api servers are still 3.9:

root@ip-172-31-20-191: ~ # oc get pods
E0717 20:02:00.400949    8386 round_trippers.go:169] CancelRequest not implemented
^C


root@ip-172-31-20-191: ~ # oc version
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-20-191.us-west-2.compute.internal:8443
openshift v3.9.33
kubernetes v1.9.1+a0ce1bc657

Comment 4 Juan Vallejo 2018-08-01 19:47:51 UTC

Mike, could you provide --loglevel 8 output?
This is most likely happening because a wrapped round tripper does not implement the CancelRequest method.

Comment 5 Vikas Laad 2018-08-02 15:54:49 UTC

Created attachment 1472760 [details]
oc output with loglevel=8

oc v3.10.27
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-61-1.us-west-2.compute.internal:8443
openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657

Comment 6 Vikas Laad 2018-08-06 15:28:26 UTC

Created attachment 1473652 [details]
patched oc log

Comment 7 Juan Vallejo 2018-08-06 19:37:07 UTC

Origin PR: https://github.com/openshift/origin/pull/20554

Comment 8 Mike Fiedler 2018-08-08 12:15:29 UTC

Moving to MODIFIED until a build is ready for QE

Comment 9 Xingxing Xia 2018-08-14 03:15:15 UTC

Mike, FYI, the PR is merged in OCP new puddles >= v3.11.0-0.12.0, thx

Comment 10 Mike Fiedler 2018-08-28 21:46:14 UTC

Verified on 3.11.0-0.24.0

Comment 12 errata-xmlrpc 2018-10-11 07:21:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652