Bug 1599428 - Need a better error message when oc commands timeout/fail during storage upgrade
Summary: Need a better error message when oc commands timeout/fail during storage upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Juan Vallejo
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 1616840
Blocks:
Reported: 2018-07-09 19:03 UTC by Mike Fiedler
Modified: 2018-10-11 07:21 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:21:36 UTC
Target Upstream Version:
Embargoed:


Attachments
loglevel=8 output (7.09 KB, text/plain)
2018-07-10 15:16 UTC, Mike Fiedler
oc output with loglevel=8 (7.19 KB, text/plain)
2018-08-02 15:54 UTC, Vikas Laad
patched oc log (6.61 KB, text/plain)
2018-08-06 15:28 UTC, Vikas Laad


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2652 0 None None None 2018-10-11 07:21:55 UTC

Description Mike Fiedler 2018-07-09 19:03:18 UTC
Description of problem:


While upgrading from 3.9 to 3.10, once the upgrade reaches the "Upgrade all storage" task, oc commands either stop working or take a very long time to time out.

It looks like this:

oc get pods -n default
E0709 18:59:34.877759  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:00:38.886405  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:10.889660  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:42.893057  102108 round_trippers.go:169] CancelRequest not implemented
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-dkhpr    1/1       Running   0          47m
registry-console-1-6jglj   1/1       Running   1          46m
router-1-g4d5h             1/1       Running   0          47m


Version-Release number of selected component (if applicable): 3.10.15


How reproducible: Always during storage upgrade phase of upgrade


Steps to Reproduce:
1. Create a small 3.9 cluster - verify oc commands work
2. Run the upgrade.yml playbook
3. When the upgrade hits "Upgrade all storage", try running an oc get command

Actual results:

See above

Expected results:

A more human-readable error message.

Additional info:

Comment 1 Mike Fiedler 2018-07-10 15:16:18 UTC
Created attachment 1457853 [details]
loglevel=8 output

Comment 2 Juan Vallejo 2018-07-16 23:00:25 UTC
Does the oc binary still give the "CancelRequest not implemented" error if used against a different cluster than the one that was just upgraded?

Also, can you confirm that the version of `oc` that you're getting this error on is 3.10?

Based on your attachment, the error message appears to be originating from the UserAgent round-tripper's CancelRequest method [2]. It could be that, due to altered config during the upgrade process, the round-tripper being used here [3] does not implement a CancelRequest method, causing the error message seen (and the delay in executing commands).

Adding David in case he can provide more information as well.

1. https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-master/private/upgrade.yml#L69
2. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L169
3. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L37

Comment 3 Mike Fiedler 2018-07-17 20:05:10 UTC
1.  Using the client against a different cluster is successful.  To be clear, the cluster where the error occurs is in the middle of an upgrade; it has not yet been fully upgraded.

2.  At the time the error occurs, the client is 3.10 and the api servers are still 3.9:

root@ip-172-31-20-191: ~ # oc get pods
E0717 20:02:00.400949    8386 round_trippers.go:169] CancelRequest not implemented
^C


root@ip-172-31-20-191: ~ # oc version
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-20-191.us-west-2.compute.internal:8443
openshift v3.9.33
kubernetes v1.9.1+a0ce1bc657

Comment 4 Juan Vallejo 2018-08-01 19:47:51 UTC
Mike, could you provide --loglevel 8 output?
This is most likely happening because a wrapped round tripper does not implement the CancelRequest method.

Comment 5 Vikas Laad 2018-08-02 15:54:49 UTC
Created attachment 1472760 [details]
oc output with loglevel=8

oc v3.10.27
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-61-1.us-west-2.compute.internal:8443
openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657

Comment 6 Vikas Laad 2018-08-06 15:28:26 UTC
Created attachment 1473652 [details]
patched oc log

Comment 7 Juan Vallejo 2018-08-06 19:37:07 UTC
Origin PR: https://github.com/openshift/origin/pull/20554

Comment 8 Mike Fiedler 2018-08-08 12:15:29 UTC
Moving to MODIFIED until a build is ready for QE

Comment 9 Xingxing Xia 2018-08-14 03:15:15 UTC
Mike, FYI, the PR is merged in new OCP puddles >= v3.11.0-0.12.0, thanks

Comment 10 Mike Fiedler 2018-08-28 21:46:14 UTC
Verified on 3.11.0-0.24.0

Comment 12 errata-xmlrpc 2018-10-11 07:21:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

