Bug 1599428

Summary: Need a better error message when oc commands timeout/fail during storage upgrade
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: oc Assignee: Juan Vallejo <jvallejo>
Status: CLOSED ERRATA QA Contact: Mike Fiedler <mifiedle>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: aos-bugs, deads, jokerman, mifiedle, mmccomas, vlaad
Target Milestone: ---   
Target Release: 3.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-10-11 07:21:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1616840    
Bug Blocks:    
Attachments:
Description                  Flags
loglevel=8 output            none
oc output with loglevel=8    none
patched oc log               none

Description Mike Fiedler 2018-07-09 19:03:18 UTC
Description of problem:


While upgrading from 3.9 to 3.10, once the upgrade reaches the "Upgrade all storage" task, oc commands either stop working or take a very long time to time out.

It looks like this:

oc get pods -n default
E0709 18:59:34.877759  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:00:38.886405  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:10.889660  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:42.893057  102108 round_trippers.go:169] CancelRequest not implemented
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-dkhpr    1/1       Running   0          47m
registry-console-1-6jglj   1/1       Running   1          46m
router-1-g4d5h             1/1       Running   0          47m


Version-Release number of selected component (if applicable): 3.10.15


How reproducible: Always during storage upgrade phase of upgrade


Steps to Reproduce:
1. Create a small 3.9 cluster - verify oc commands work
2. Run the upgrade.yml playbook
3. When the upgrade hits "Upgrade all storage", try to run an oc get command

Actual results:

See above

Expected results:

A more human-readable error message.

Additional info:

Comment 1 Mike Fiedler 2018-07-10 15:16:18 UTC
Created attachment 1457853
loglevel=8 output

Comment 2 Juan Vallejo 2018-07-16 23:00:25 UTC
Does the oc binary still give the "CancelRequest not implemented" error if used against a different cluster than the one that was just upgraded?

Also, can you confirm that the version of `oc` that you're getting this error on is 3.10?

Based on your attachment, the error message appears to be originating from the UserAgent round-tripper's CancelRequest method [2]. It could be that, due to altered config during the upgrade process, the round-tripper being used here [3] does not implement a CancelRequest method, causing the error message seen (and the delay in executing commands).

Adding David in case he can provide more information as well.

1. https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-master/private/upgrade.yml#L69
2. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L169
3. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L37

Comment 3 Mike Fiedler 2018-07-17 20:05:10 UTC
1.  Using the client against a different cluster is successful.  To be clear, the cluster where the error occurs is in the middle of an upgrade; it has not yet been fully upgraded.

2.  At the time the error occurs, the client is 3.10 and the api servers are still 3.9:

root@ip-172-31-20-191: ~ # oc get pods
E0717 20:02:00.400949    8386 round_trippers.go:169] CancelRequest not implemented
^C


root@ip-172-31-20-191: ~ # oc version
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-20-191.us-west-2.compute.internal:8443
openshift v3.9.33
kubernetes v1.9.1+a0ce1bc657

Comment 4 Juan Vallejo 2018-08-01 19:47:51 UTC
Mike, could you provide --loglevel 8 output?
This is most likely happening because a wrapped round tripper does not implement the CancelRequest method.

Comment 5 Vikas Laad 2018-08-02 15:54:49 UTC
Created attachment 1472760
oc output with loglevel=8

oc v3.10.27
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-61-1.us-west-2.compute.internal:8443
openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657

Comment 6 Vikas Laad 2018-08-06 15:28:26 UTC
Created attachment 1473652
patched oc log

Comment 7 Juan Vallejo 2018-08-06 19:37:07 UTC
Origin PR: https://github.com/openshift/origin/pull/20554

Comment 8 Mike Fiedler 2018-08-08 12:15:29 UTC
Moving to MODIFIED until a build is ready for QE

Comment 9 Xingxing Xia 2018-08-14 03:15:15 UTC
Mike, FYI, the PR is merged in OCP new puddles >= v3.11.0-0.12.0, thx

Comment 10 Mike Fiedler 2018-08-28 21:46:14 UTC
Verified on 3.11.0-0.24.0

Comment 12 errata-xmlrpc 2018-10-11 07:21:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652