Bug 1589252 - upgrade_control_plane.yml playbook - not able to run commands against master api during upgrade
Summary: upgrade_control_plane.yml playbook - not able to run commands against master api during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.9.z
Assignee: Maciej Szulik
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-08 14:17 UTC by Vladislav Walek
Modified: 2019-04-09 14:20 UTC
CC: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-09 14:20:17 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:0619 (last updated 2019-04-09 14:20:25 UTC)

Description Vladislav Walek 2018-06-08 14:17:55 UTC
Description of problem:

While the upgrade_control_plane.yml playbook is running, all oc commands take too long to complete.
After checking, the commands were found to be waiting on the servicecatalog.k8s.io/v1beta1 API group (the service catalog).
The log shows tasks that restart the masters one by one, so the load balancer in front of the master API should be distributing requests across the remaining masters.
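
For reference, one way to observe which API group discovery is stalling is to run any oc command with verbose client logging (a minimal sketch, not taken from the original report; the grep pattern is only an example):

# With --loglevel=9 the client logs every discovery request to stderr, so a
# stalled group such as servicecatalog.k8s.io shows up as a long gap before
# its response line.
oc get pods --loglevel=9 2>&1 | grep -E 'servicecatalog\.k8s\.io|round_trippers'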

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch

rpm -q ansible
ansible-2.5.0-2.el7.noarch

ansible --version
ansible 2.5.0

How reproducible:

Steps to Reproduce:
1.
2.
3.

Comment 3 Scott Dodson 2018-06-08 15:41:46 UTC
BTW, why are you using Ansible 2.5? Please follow the documentation, which instructs you to enable the Ansible 2.4 repo and install Ansible from there; we've done zero testing of 3.9 with Ansible 2.5.

Comment 4 Paul Morie 2018-06-11 14:35:37 UTC
Hi-

What version are you upgrading from? Was service-catalog installed before the upgrade? If so, was service-catalog working correctly before the upgrade?

Thanks!

Comment 5 Vladislav Walek 2018-06-13 12:19:44 UTC
What version are you upgrading from? Which versions exactly?
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657

Was service-catalog installed before the upgrade? 
Yes

If so, was service-catalog working correctly before the upgrade?
We don't really use the service catalog at the moment, so there are no real-life examples to show that it worked. But all pods in the service-catalog projects were running and were "ready".

Comment 6 Jeff Peeler 2018-06-15 20:25:49 UTC
I installed 3.9.14 and upgraded to 3.9.27 without any issues. I'd like to see the inventory file for the upgrade along with catalog logs before/after upgrade:

# kubectl logs -n kube-service-catalog -lapp=controller-manager
# kubectl logs -n kube-service-catalog -lapp=apiserver

I don't see anything in the attached ansible log that points to any issue as of yet.

Comment 14 Jeff Peeler 2018-07-24 18:47:10 UTC
I did an upgrade from 3.9.14 to 3.9.33 with multiple masters and could not reproduce any appreciable delays in oc get requests, except during the following expected windows:
- When the first master API was restarted (about 5 seconds)
- During the SDN upgrade (about 45 seconds)

In order to verify that having the catalog run on multiple masters wasn't an issue (Service Catalog wasn't set to run on all the masters until 3.9.21), I installed 3.9.30 and did an upgrade to 3.9.33.

I never saw any non-catalog operations become completely unresponsive during the upgrade. I was monitoring this output during the entire upgrade:

#!/bin/bash
while true; do
  now=$(date)
  echo "$now: checking pods"
  oc get po --all-namespaces --loglevel=9
  now=$(date)
  echo "==="
  echo "$now: checking serviceinstances"
  oc get serviceinstances --loglevel=9
  now=$(date)
  echo "$now: $(oc version)"
  echo ""
  sleep 5
done

I'm told that the customer did not experience any problems in the latest upgrade (3.9.33), so I'm waiting to hear back whether the next upgrade also works correctly.

Comment 19 Jay Boyd 2018-11-15 14:04:45 UTC
I've reproduced this in the following environment: upgrading an AWS multi-master cluster (3 masters) from 3.10.76-1 to 3.11.44-1. During the storage upgrade, most oc commands hang for 10-15 minutes. Prior to the upgrade I loaded about 125 namespaces with data via svt/openshift_scalability/cluster-loader.py.

I see https://bugzilla.redhat.com/show_bug.cgi?id=1599428 reported this same basic issue and was fixed, but that fix was focused on clarifying the "CancelRequest not implemented" message. It doesn't appear to me that cluster availability was addressed.

Excerpts from the installer logs:

2018-11-14 16:49:02,018 p=20116 u=root |  TASK [openshift_control_plane : Wait for APIs to become available] 
2018-11-14 16:49:05,719 p=20116 u=root |  TASK [openshift_control_plane : Get API logs] 
2018-11-14 16:49:05,803 p=20116 u=root |  TASK [openshift_control_plane : debug] 
2018-11-14 16:49:05,888 p=20116 u=root |  TASK [openshift_control_plane : fail] 
2018-11-14 16:49:05,975 p=20116 u=root |  TASK [openshift_control_plane : Check for apiservices/v1beta1.metrics.k8s.io registration] 
2018-11-14 16:49:06,478 p=20116 u=root |  TASK [openshift_control_plane : Wait for /apis/metrics.k8s.io/v1beta1 when registered] 
2018-11-14 16:49:06,563 p=20116 u=root |  TASK [openshift_control_plane : Check for apiservices/v1beta1.servicecatalog.k8s.io registration] 
2018-11-14 16:49:07,013 p=20116 u=root |  TASK [openshift_control_plane : Wait for /apis/servicecatalog.k8s.io/v1beta1 when registered] 
2018-11-14 16:49:07,530 p=20116 u=root |  TASK [Upgrade all storage] 


    "E1114 16:50:58.764325   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:51:30.769271   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:52:02.772536   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:52:34.775768   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:53:06.779309   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:53:38.782635   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:54:42.789704   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:55:14.792854   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:55:46.796156   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:56:18.800079   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:56:50.803123   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:57:22.806648   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:58:26.814095   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:58:58.817702   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 16:59:30.820948   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:00:02.824252   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:00:34.827742   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:01:06.937605   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:01:38.941411   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:02:10.944769   79576 round_trippers.go:169] CancelRequest not implemented", 
    "E1114 17:03:13.247153   84510 round_trippers.go:169] CancelRequest not implemented\nerror: could not calculate the list of available resources: Get https://ip-172-18-6-50.ec2.internal:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)", 
    "E1114 17:03:13.247153   84510 round_trippers.go:169] CancelRequest not implemented", 




2018-11-14 17:06:01,011 p=20116 u=root |  TASK [Migrate legacy HPA scale target refs] 
2018-11-14 17:06:01,475 p=20116 u=root |  TASK [openshift_master_facts : Verify required variables are set] 
2018-11-14 17:06:01,691 p=20116 u=root |  TASK [openshift_master_facts : Set g_metrics_hostname] 


The storage upgrade ran from 16:49:07 to 17:06 (a sketch of what the [Upgrade all storage] task does follows below). It is interesting to note that there is very limited Service Catalog data to be migrated: I had the Ansible Service Broker and Template Service Broker installed, but no ServiceInstances were present. I've reproduced this numerous times, including without loading data first, so I don't believe the SVT data loading is necessary. Also of interest, the storage upgrade seems to take about the same amount of time regardless of whether no data is loaded or approximately 125, 500, or 1,000 namespaces are loaded.
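
For context, the [Upgrade all storage] task in openshift-ansible essentially invokes the oc client to rewrite every stored object into the current storage version, which is what keeps the API servers busy during this window. Roughly (an approximation; exact flags and kubeconfig handling vary by release):

# Run against the first master: walks every API group, including aggregated
# ones such as servicecatalog.k8s.io, and rewrites each object so it is
# persisted in the current storage version.
oc adm migrate storage --include='*' --confirm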


I had a script running in another terminal executing oc commands from a jump box (external to the cluster) with --request-timeout=35, and it consistently showed these same timeouts (CancelRequest not implemented) during the same timeframe. The oc commands included `oc get clusterservicebrokers`, `oc get hostsubnets`, `oc get pods`, `oc get endpoints`, etc. The actual command doesn't seem to matter; most of them failed during this period with the same "CancelRequest not implemented" error. A sketch of that loop follows below.
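
A minimal sketch of that external monitoring loop (the commands and the --request-timeout=35 value are from this comment; the loop structure itself is an assumption):

# Run from a jump box outside the cluster; --request-timeout=35 makes hung
# requests fail after ~35 seconds instead of blocking indefinitely.
while true; do
  for cmd in "get clusterservicebrokers" "get hostsubnets" "get pods" "get endpoints"; do
    echo "$(date): oc $cmd"
    oc $cmd --request-timeout=35 || echo "FAILED: oc $cmd"
  done
  sleep 5
done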

I captured events in the kube-service-catalog namespace and I don't believe anything terribly abnormal is occurring with Service Catalog. I have a variety of logs for review including:
 * openshift-ansible log (run with -vvv)
 * scripts & their output for executing oc commands with --loglevel=9
 * output from oc get events -n kube-service-catalog
 * from 2 of the masters I collected the API, MASTER, and NODE logs

I don't think Service Catalog is the problem here. It could be an issue with the aggregated API layer, or maybe it's something else.

Comment 26 Jay Boyd 2018-12-07 19:22:47 UTC
Waiting on https://github.com/openshift/origin/pull/21633, which backports kubernetes/kubernetes#66929 and kubernetes/kubernetes#67024 to 3.10. Will retest with these in place.

Comment 27 Jay Boyd 2019-01-04 23:06:49 UTC
Tested upgrading from openshift v3.10.96 to v3.11.66, and this appears to be rectified by https://github.com/openshift/origin/pull/21633. I could not reproduce the hang or timeouts.

Comment 28 Jay Boyd 2019-01-07 16:25:28 UTC
Moving back to ASSIGNED; these fixes are needed in 3.9 as well.

Comment 29 Juan Vallejo 2019-01-07 16:35:01 UTC
3.9 backport: https://github.com/openshift/origin/pull/21741

Comment 32 Jay Boyd 2019-03-06 13:28:20 UTC
Fix first seen in v3.9.67-1.

Comment 35 Xingxing Xia 2019-03-28 01:37:03 UTC
Will verify it today

Comment 36 Xingxing Xia 2019-03-28 09:36:47 UTC
Per comment 32, verified both the v3.9.65 --> v3.9.74 upgrade (the version in the comment 33 link above) and the v3.9.68 --> v3.9.74 upgrade; neither hit the hang/timeout issue, with ASB/TSB installed and 100 projects with app data loaded before the upgrade.

Comment 38 errata-xmlrpc 2019-04-09 14:20:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0619

