1599428 – Need a better error message when oc commands timeout/fail during storage upgrade

Summary: Need a better error message when oc commands timeout/fail during storage upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	oc
Sub Component:
Version:	3.10.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	3.11.0
Assignee:	Juan Vallejo
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Depends On:	1616840
Blocks:

Reported:	2018-07-09 19:03 UTC by Mike Fiedler
Modified:	2018-10-11 07:21 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-11 07:21:36 UTC
Target Upstream Version:
Embargoed:

Attachments
loglevel=8 output (7.09 KB, text/plain) 2018-07-10 15:16 UTC, Mike Fiedler
oc output with loglevel=8 (7.19 KB, text/plain) 2018-08-02 15:54 UTC, Vikas Laad
patched oc log (6.61 KB, text/plain) 2018-08-06 15:28 UTC, Vikas Laad

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:2652	0	None	None	None	2018-10-11 07:21:55 UTC

Description Mike Fiedler 2018-07-09 19:03:18 UTC

Description of problem:


Upgrading from 3.9 to 3.10.  When the upgrade reaches the "Upgrade all storage" task, oc commands either stop working or take a very long time to time out.

It looks like this:

oc get pods -n default
E0709 18:59:34.877759  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:00:38.886405  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:10.889660  102108 round_trippers.go:169] CancelRequest not implemented
E0709 19:01:42.893057  102108 round_trippers.go:169] CancelRequest not implemented
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-dkhpr    1/1       Running   0          47m
registry-console-1-6jglj   1/1       Running   1          46m
router-1-g4d5h             1/1       Running   0          47m


Version-Release number of selected component (if applicable): 3.10.15


How reproducible: Always during storage upgrade phase of upgrade


Steps to Reproduce:
1. Create a small 3.9 cluster - verify oc commands work
2. Run the upgrade.yml playbook
3. When the upgrade hits "Upgrade all storage", try to run an oc get command

Actual results:

See above

Expected results:

A more human-readable error message.

Additional info:

Comment 1 Mike Fiedler 2018-07-10 15:16:18 UTC

Created attachment 1457853
loglevel=8 output

Comment 2 Juan Vallejo 2018-07-16 23:00:25 UTC

Does the oc binary still give the "CancelRequest not implemented" error if used against a different cluster than the one that was just upgraded?

Also, can you confirm that the version of `oc` that you're getting this error on is 3.10?

Based on your attachment, the error message appears to be originating from the UserAgent round-tripper's CancelRequest method [2]. It could be that, due to altered config during the upgrade process, the round-tripper being used here [3] does not implement a CancelRequest method, causing the error message seen (and the delay in executing commands).

Adding David in case he can provide more information as well.

1. https://github.com/openshift/openshift-ansible/blob/release-3.10/playbooks/openshift-master/private/upgrade.yml#L69
2. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L169
3. https://github.com/openshift/origin/blob/release-3.10/vendor/k8s.io/kubernetes/staging/src/k8s.io/client-go/transport/round_trippers.go#L37

Comment 3 Mike Fiedler 2018-07-17 20:05:10 UTC

1.  Using the client against a different cluster is successful.  To be clear, the cluster where the error occurs is in the middle of an upgrade, it has not yet been fully upgraded.

2.  At the time the error occurs, the client is 3.10 and the api servers are still 3.9:

root@ip-172-31-20-191: ~ # oc get pods
E0717 20:02:00.400949    8386 round_trippers.go:169] CancelRequest not implemented
^C


root@ip-172-31-20-191: ~ # oc version
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-20-191.us-west-2.compute.internal:8443
openshift v3.9.33
kubernetes v1.9.1+a0ce1bc657

Comment 4 Juan Vallejo 2018-08-01 19:47:51 UTC

Mike, could you provide --loglevel 8 output?
This is most likely happening because a wrapped round tripper does not implement the CancelRequest method.
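
To illustrate the suspected failure mode, here is a minimal standalone sketch, not the client-go code itself. All type and function names in it (requestCanceler, cancelingTransport, headerWrapper, tryCancel) are hypothetical and exist only for this example. It assumes that cancellation is discovered through a type assertion on the outermost round tripper, so a wrapper that does not forward CancelRequest hides the method of the transport it wraps.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// requestCanceler is the optional interface checked via type assertion.
// (Hypothetical name for this sketch.)
type requestCanceler interface {
	CancelRequest(*http.Request)
}

// cancelingTransport is a hypothetical inner transport that supports cancellation.
type cancelingTransport struct{ canceled bool }

func (t *cancelingTransport) RoundTrip(*http.Request) (*http.Response, error) {
	return nil, errors.New("RoundTrip is not exercised in this sketch")
}

func (t *cancelingTransport) CancelRequest(*http.Request) { t.canceled = true }

// headerWrapper is a hypothetical wrapping round tripper. It forwards
// RoundTrip but does NOT define CancelRequest, so the inner method is hidden.
type headerWrapper struct{ inner http.RoundTripper }

func (w *headerWrapper) RoundTrip(req *http.Request) (*http.Response, error) {
	return w.inner.RoundTrip(req)
}

// tryCancel cancels req if rt supports it; otherwise it returns an error
// that names the concrete type lacking CancelRequest.
func tryCancel(rt http.RoundTripper, req *http.Request) error {
	if c, ok := rt.(requestCanceler); ok {
		c.CancelRequest(req)
		return nil
	}
	return fmt.Errorf("CancelRequest not implemented by %T", rt)
}

func main() {
	inner := &cancelingTransport{}

	// Direct use: the type assertion succeeds and the request is canceled.
	fmt.Println(tryCancel(inner, nil), inner.canceled)

	// Wrapped use: the wrapper hides the inner CancelRequest, so the
	// assertion fails even though the innermost transport supports it.
	fmt.Println(tryCancel(&headerWrapper{inner: &cancelingTransport{}}, nil))
}
```

Including the concrete type in the error, as tryCancel does here, is one way such a message could point to the wrapper responsible; whether the eventual fix takes that approach is not established in this thread.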

Comment 5 Vikas Laad 2018-08-02 15:54:49 UTC

Created attachment 1472760
oc output with loglevel=8

oc v3.10.27
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-61-1.us-west-2.compute.internal:8443
openshift v3.9.40
kubernetes v1.9.1+a0ce1bc657

Comment 6 Vikas Laad 2018-08-06 15:28:26 UTC

Created attachment 1473652
patched oc log

Comment 7 Juan Vallejo 2018-08-06 19:37:07 UTC

Origin PR: https://github.com/openshift/origin/pull/20554

Comment 8 Mike Fiedler 2018-08-08 12:15:29 UTC

Moving to MODIFIED until a build is ready for QE

Comment 9 Xingxing Xia 2018-08-14 03:15:15 UTC

Mike, FYI, the PR is merged in OCP new puddles >= v3.11.0-0.12.0, thx

Comment 10 Mike Fiedler 2018-08-28 21:46:14 UTC

Verified on 3.11.0-0.24.0

Comment 12 errata-xmlrpc 2018-10-11 07:21:36 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
