Description of problem:

Upgrading 3.9.27 to 3.10.0-0.67.0 using openshift-ansible 3.10.0-0.67.0 results in a storage upgrade failure.

1. Installed an HA cluster at 3.9.27 (1 lb, 3 master/etcd colocated, 2 infra, 5 compute) on AWS - inventory attached. Loaded the cluster up with:
   - 10000 build configs
   - 20000 completed builds
   - 10000 deployments
   - 10000 services
   - 40000 secrets
   - 10000 routes
2. Verified the cluster is working properly - nodes are ready, new applications can be deployed from templates, pods are running OK.
3. Ran openshift_node_group.yml to create the node config groups.
4. Ran upgrade_control_plane.yml to upgrade the control plane. The storage upgrade fails:

"stderr_lines": [
    "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
    "ERROR: logging before flag.Parse: E0613 19:57:45.903432 70009 round_trippers.go:169] CancelRequest not implemented",
    [... the same CancelRequest error repeats every ~32 seconds through E0613 20:05:45.954826 ...]
    "error: the server doesn't have a resource type \"clusterservicebrokers\""
],
"stdout": "",
"stdout_lines": []

Version-Release number of selected component (if applicable):
Upgrade 3.9.27 to 3.10.0-0.67.0 using openshift-ansible 3.10.0-0.67.0

How reproducible:
Once so far in one attempt

Steps to Reproduce:
1. See above
Created attachment 1451077 [details]
inventory, ansible log, syslogs from master

1. openshift-ansible inventory used for upgrade_control_plane
2. ansible -vvv log
3. journals from the 3 masters
Service broker and catalog pods:

# oc get clusterservicebroker
NAME                      AGE
ansible-service-broker    8h
template-service-broker   8h

# oc get pods --all-namespaces | grep -E "service"
kube-service-catalog                apiserver-424rn            1/1   Running   1   8h
kube-service-catalog                apiserver-wg44n            1/1   Running   1   8h
kube-service-catalog                apiserver-zlghf            1/1   Running   1   8h
kube-service-catalog                controller-manager-mjxpf   1/1   Running   4   8h
kube-service-catalog                controller-manager-wbnwv   1/1   Running   4   8h
kube-service-catalog                controller-manager-zd9cb   1/1   Running   3   8h
openshift-ansible-service-broker    asb-1-97dfq                1/1   Running   2   8h
openshift-ansible-service-broker    asb-etcd-1-lhgpz           1/1   Running   0   8h
openshift-template-service-broker   apiserver-9d2ww            1/1   Running   0   8h
openshift-template-service-broker   apiserver-bd4v6            1/1   Running   0   8h
Re-started the upgrade_control_plane.yml playbook after the failure and it completed successfully. Will retry on a fresh cluster and see if the original failure reoccurs.
The storage upgrade is now retried twice with a delay of 30 secs; see PR https://github.com/openshift/openshift-ansible/pull/8661
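For readers unfamiliar with how that kind of change looks, this is standard Ansible retry semantics on the migration task. A rough sketch only, not the literal task from the PR (the command line matches the failure output elsewhere in this bug):

```yaml
# Sketch: retried twice with a 30-second delay, per the comment above.
- name: Upgrade all storage
  command: >
    oc adm --config=/etc/origin/master/admin.kubeconfig
    migrate storage --include=* --confirm
  register: l_migrate_result
  until: l_migrate_result.rc == 0
  retries: 2
  delay: 30
```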
That change would be in openshift-ansible-3.10.0-0.67.0 already. I'm not sure what else we can do. One common mitigation against storage migration problems is to run the migration prior to performing the upgrade and then skip it during the upgrade:

oc adm migrate storage --include=*

Then set openshift_upgrade_pre_storage_migration_enabled=false in your inventory. Moving this to master in case there's anything for them to debug here.
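Concretely, the two-step mitigation above looks like the following. The [OSEv3:vars] group name is the usual openshift-ansible inventory convention; adjust to your inventory layout:

```
# Step 1: on a master, run the migration manually before starting the upgrade:
#   oc adm migrate storage --include=*
#
# Step 2: then skip the pre-upgrade migration in the Ansible inventory:
[OSEv3:vars]
openshift_upgrade_pre_storage_migration_enabled=false
```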
Re: comment 7. When I reproduced the issue it failed with the same message about clusterservicebrokers. Coincidence? Or something about migrating this part of storage?

fatal: [ec2-35-164-67-17.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:32.583116",
    "end": "2018-06-14 17:28:25.422116",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-14 17:19:52.839000",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "ERROR: logging before flag.Parse: E0614 17:20:25.369073 66519 round_trippers.go:169] CancelRequest not implemented",
        [... the same CancelRequest error repeats every ~32 seconds through E0614 17:28:25.418992 ...]
        "error: the server doesn't have a resource type \"clusterservicebrokers\""
    ],
    "stdout": "",
    "stdout_lines": []
}
I saw a similar problem on a different cluster; this cluster has the CRI-O runtime and a separate etcd install.

fatal: [ec2-54-212-222-67.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:00:00.131388",
    "end": "2018-06-14 18:56:47.807243",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-14 18:56:47.675855",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "error: could not calculate the list of available resources: the server could not find the requested resource"
    ],
    "stdout": "",
    "stdout_lines": []
}
Marking this as TestBlocker. Unable to upgrade this configuration without hitting failed storage migrations.
I've dropped the --confirm from that migration invocation since it only complains about stuff we're not using any more (based on https://github.com/openshift/origin/pull/19691), see https://github.com/openshift/openshift-ansible/pull/8781.
The problem with the missing clusterservicebrokers resource is related to the service catalog API server having some issues. I'm moving this to the service catalog team to figure out what's going on with their server. I think it'll be helpful for them to get the logs from the API server for investigation.
Any chance we can get the logs from the Service Catalog API Server and Controller Pods?
From 19:39:34 through 19:55:09 all 3 masters are reporting errors communicating with the Catalog API server:

available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.26.131.21:443: dial tcp 172.26.131.21:443: getsockopt: no route to host

These errors only list one API server IP - I would expect there are 3 running (1 on each master). I'm not certain if this is the only SC API pod or if the others are working properly at this time. I'd like to see all of this immediately after the failure:

oc describe endpoints -n kube-service-catalog apiserver
oc describe pods -l app=apiserver -n kube-service-catalog
oc describe apiservices v1beta1.servicecatalog.k8s.io -n kube-service-catalog

along with logs from all pods in the kube-service-catalog ns.
Comment 2 shows there are in fact 3 SC API server pods along with 3 SC controller pods, and at the time all were available; the SC API servers had only one restart each.
That cluster was terminated (AWS). I am spinning up an identical one now - it is in the process of loading. When/if the failure is reproduced I will gather the requested info in comment 14 and comment 13.
One more log might help: kube-proxy

Thanks very much Mike. If it gets into that state and you are done with it, I'd like the chance to jump on it and poke around, although it kind of sounds like it may be timing specific as well.
From:
# yum list installed | grep openshift
atomic-openshift.x86_64    3.9.27-1.git.0.964617d.el7

To:
# yum list installed | grep openshift
atomic-openshift.x86_64    3.10.0-0.69.0.git.0.248ae38.el7

CRI-O as runtime. Hit this once, and rerunning the upgrade playbook worked around it. Afterwards, I hit this: https://bugzilla.redhat.com/show_bug.cgi?id=1591752

This one does NOT have the "logging before flag.Parse" part in the error msg:

TASK [Upgrade all storage] **************************************************************************************************
fatal: [ec2-52-34-224-227.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:28.225847",
    "end": "2018-06-15 16:38:03.922025",
    "failed_when_result": true,
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-15 16:29:35.696178",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "E0615 16:30:08.449939 129872 round_trippers.go:169] CancelRequest not implemented",
        [... the same CancelRequest error repeats every ~32 seconds through E0615 16:37:36.502617 ...]
        "error: the server doesn't have a resource type \"clusterservicebrokers\""
    ],
    "stdout": "",
    "stdout_lines": []
}
        to retry, use: --limit @/root/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

PLAY RECAP ******************************************************************************************************************
ec2-18-236-173-89.us-west-2.compute.amazonaws.com  : ok=41   changed=6    unreachable=0   failed=0
ec2-52-11-68-23.us-west-2.compute.amazonaws.com    : ok=241  changed=74   unreachable=0   failed=0
ec2-52-34-224-227.us-west-2.compute.amazonaws.com  : ok=418  changed=128  unreachable=0   failed=1
ec2-54-245-184-161.us-west-2.compute.amazonaws.com : ok=241  changed=74   unreachable=0   failed=0
localhost                                          : ok=22   changed=0    unreachable=0   failed=0

Failure summary:

1. Hosts:   ec2-52-34-224-227.us-west-2.compute.amazonaws.com
   Play:    Pre master upgrade - Upgrade all storage
   Task:    Upgrade all storage
   Message: non-zero return code
*** Bug 1591758 has been marked as a duplicate of this bug. ***
great, thank you, I'll take a look.
And of course my attempt to recreate (comment 16) did not reproduce it. The best we have right now is comment 20. I will attempt to reproduce it over the weekend.
Thanks very much Mike. I'm still reviewing the deployment from comment 20, but I'm not finding much, as the Service Catalog resources were restarted as part of the 2nd upgrade attempt.
Trying to reproduce to gather additional data and troubleshoot.
I tried 3 times on different platforms to upgrade an HA env with service-catalog, and I did not reproduce this issue.
I think you have to have data loaded - I've never tried to do it on an empty cluster, but am hitting it on the majority of upgrades on loaded clusters (2000 projects loaded as per the description).
Our upgrade tests don't load anything prior to performing the upgrade. We're doing the following and still seeing this intermittently:

1. Install v3.9
2. Upgrade from v3.9 to v3.10
3. Run our CI tests to ensure our broker is functional after upgrade.
I haven't been able to reproduce this yet. I have a set of scripts to collect info every 5 seconds (including some oc commands to get hostsubnets, pods, service and endpoint status, iptables-save, and a curl against the catalog service) that I'd like to start prior to the upgrade. Plus after discussing with Dan Williams I understand that we need to collect the logs from the sdn pods in the openshift-sdn namespace after the failure is observed.
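A minimal sketch of the kind of periodic collection loop described above. The specific probes, paths, and defaults here are illustrative, not the actual scripts used; each probe is guarded so the loop keeps running on hosts where a tool is missing:

```shell
#!/bin/sh
# Periodic debug-data collection loop (sketch; probe commands are illustrative).
OUTDIR="${OUTDIR:-/tmp/sc-debug}"
INTERVAL="${INTERVAL:-5}"   # seconds between samples
SAMPLES="${SAMPLES:-2}"     # bounded here so the sketch terminates
mkdir -p "$OUTDIR"
i=0
while [ "$i" -lt "$SAMPLES" ]; do
  ts="$(date +%Y%m%d-%H%M%S)-$i"
  {
    # Guarded probes: skip quietly if the tool is not on this host.
    command -v oc >/dev/null 2>&1 && oc get hostsubnets
    command -v oc >/dev/null 2>&1 && oc get pods -n kube-service-catalog -o wide
    command -v oc >/dev/null 2>&1 && oc get endpoints -n kube-service-catalog
    command -v iptables-save >/dev/null 2>&1 && iptables-save
    command -v curl >/dev/null 2>&1 && \
      curl -sk -m 5 https://apiserver.kube-service-catalog.svc:443/healthz
    true  # keep a sample file even if every probe was unavailable or failed
  } > "$OUTDIR/sample-$ts.log" 2>&1
  i=$((i + 1))
  [ "$i" -lt "$SAMPLES" ] && sleep "$INTERVAL"
done
echo "collected $i samples in $OUTDIR"
```

On a live master this would run unbounded (or with a large SAMPLES) in the background while the upgrade playbook executes, leaving one timestamped file per sample to correlate against the Ansible failure time.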
mffiedler got a reproduction for me while I was monitoring the process.

oc commands executed on the master node hung from 16:39:42 to 16:46:03 (perhaps fine).

Some scripted curl commands against the catalog apiserver service https://apiserver.kube-service-catalog.svc:443 had errors prior to 16:33, but nothing after that - always got a quick response.

However, the master API logs show:

16:40:40 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request
16:41:50 repeat of above
16:43:00 repeat of above
16:44:10 repeat of above
16:45:20 repeat of above
16:45:58 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.21.0.1:45285->172.27.237.234:443: read: no route to host'\nTrying to reach: 'https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding

The Ansible task failed here at 16:44:39:

2018-06-20 16:36:05,803 p=70646 u=root | PLAY [Pre master upgrade - Upgrade all storage] ...
2018-06-20 16:36:06,450 p=70646 u=root | TASK [Upgrade all storage]
2018-06-20 16:36:06,450 p=70646 u=root | task path: /usr/share/ansible/openshift-ansible/playbooks/openshift-master/private/upgrade.yml:63
2018-06-20 16:36:06,495 p=70646 u=root | Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
2018-06-20 16:44:39,323 p=70646 u=root | FAILED - RETRYING: Upgrade all storage (2 retries left). Result was: {
    "attempts": 1,
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:08:32.557575",
    "end": "2018-06-20 16:44:39.292596",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "retries": 3,
    "start": "2018-06-20 16:36:06.735021",
    "stderr": "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored E0620 16:36:39.238143 117200 round_trippers.go:169] CancelRequest not implemented ...repeats same msg... error: the server doesn't have a resource type \"clusterservicebrokers\"",
    .......
A very similar error was seen last Friday:

Jun 15 20:21:43 ip-172-31-33-38.us-west-2.compute.internal atomic-openshift-master-api[101277]: E0615 20:21:43.612129 101277 available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.24.53.127:443: dial tcp 172.24.53.127:443: getsockopt: no route to host
20:22:06 master-api[101277]: E0615 20:22:06.927984 101277 available_controller.go:295] v1beta1.servicecatalog.k8s.io failed with: Get https://172.24.53.127:443: dial tcp 172.24.53.127:443: getsockopt: no route to host
20:31:45 master-api[101277]: E0615 20:31:45.028654 101277 memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding
20:31:45 master-controllers[101881]: E0615 20:31:45.035700 101881 resource_quota_controller.go:428] failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding
20:31:45 master-controllers[101881]: W0615 20:31:45.036080 101881 garbagecollector.go:590] failed to discover some groups: %vmap[servicecatalog.k8s.io/v1beta1:an error on the server ("Error: 'read tcp 172.20.0.1:43300->172.24.53.127:443: read: no route to host'\nTrying to reach: 'https://172.24.53.127:443/apis/servicecatalog.k8s.io/v1beta1'") has prevented the request from succeeding]

with the ansible log showing the failure at 20:30:42: error: the server doesn't have a resource type "clusterservicebrokers"
@wmeng, if you have a chance, can you try the failure case from comment 24 using openshift-ansible 3.10.1 or higher? I have attempted to reproduce with that level a couple of times and have not seen the error. 3.10.1 has additional retries for storage migration.
openshift-ansible-3.10.1-1.git.157.2bb6250.el7.noarch.rpm was used in comment 34; however, the instances were deleted due to resource limitations.
Created attachment 1453415 [details]
ansible-playbook -vvv log for upgrade failure under 3.10.1

I hit the same upgrade failure as wmeng in comment 41 in 2 out of 3 upgrades with 3.10.1 - full ansible log attached.

@sdodson @jaboyd Open a new bugzilla or work it under this one, since it is still apparently in service catalog data migration?

could not calculate the list of available resources: Get https://ec2-35-161-126-109.us-west-2.compute.amazonaws.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

===== Full task output - ansible log attached

FAILED! => {
    "attempts": 2,
    "changed": true,
    "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"],
    "delta": "0:00:32.181916",
    "end": "2018-06-21 01:36:35.489250",
    "failed_when_result": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm",
            "_uses_shell": false, "chdir": null, "creates": null, "executable": null,
            "removes": null, "stdin": null, "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 1,
    "start": "2018-06-21 01:36:03.307334",
    "stderr_lines": [
        "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored",
        "error: could not calculate the list of available resources: Get https://ec2-35-161-126-109.us-west-2.compute.amazonaws.com:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
    ],
    "stdout": "",
    "stdout_lines": []
}
Created attachment 1453513 [details] logs that go with comment 37
Looking for some help from the network team regarding analysis of the logs from comment 37. Dan, can you help review? I've attached SDN logs. The time period of the issue is 16:41 or so.

SDN logs look OK to me; grepping across all nodes for the catalog apiserver service, I only get these entries around this time period:

16:33:43.848601  122992 roundrobin.go:276] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.4:6443 172.21.0.5:6443 172.23.0.6:6443]
16:33:43.901614  122992 service.go:310] Adding new service port "kube-service-catalog/apiserver:secure" at 172.27.237.234:443/TCP
16:34:01.260158  120871 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.260203  120871 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.260219  120871 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.261035   56906 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.261099   56906 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.261116   56906 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.262471  122992 roundrobin.go:310] LoadBalancerRR: Setting endpoints for kube-service-catalog/apiserver:secure to [172.20.0.6:6443 172.21.0.5:6443 172.23.0.6:6443]
16:34:01.262514  122992 roundrobin.go:240] Delete endpoint 172.20.0.6:6443 for service "kube-service-catalog/apiserver:secure"
16:34:01.262530  122992 roundrobin.go:240] Delete endpoint 172.20.0.4:6443 for service "kube-service-catalog/apiserver:secure"

Yet the master API is complaining about connectivity:

16:40:40 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request
16:41:50 repeat of above
16:43:00 repeat of above
16:44:10 repeat of above
16:45:20 repeat of above
16:45:58 master-api[36110]: memcache.go:153] couldn't get resource list for servicecatalog.k8s.io/v1beta1: an error on the server ("Error: 'read tcp 172.21.0.1:45285->172.27.237.234:443: read: no route to host'\nTrying to reach: 'https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1'")
Created attachment 1453646 [details] log on wmeng39syscon-master-etcd-1.0621-e03.qe.rhcloud.com
Reviewing the master log from comment 45: the storage upgrade failed with a more general error, with no mention of clusterservicebrokers or catalog in any of the Ansible installer output. The command was:

oc adm --config=/etc/origin/master/admin.kubeconfig migrate storage --include=* --confirm

"msg": "non-zero return code", "rc": 1, "retries": 3, "start": "2018-06-21 05:39:57.052415",
"stderr": "error: could not calculate the list of available resources: the server was unable to return a response in the time allotted, but may still be processing the request"

It looks like this command was tried 3 times and each attempt failed with the same issue:

"start": "2018-06-21 05:38:26.276022", "end": "2018-06-21 05:39:26.623248"
"start": "2018-06-21 05:39:57.052415", "end": "2018-06-21 05:40:57.314836"
"start": "2018-06-21 05:41:27.734848", "end": "2018-06-21 05:42:27

Reviewing the master logs attached via comment 45, I do see some of the same errors about connectivity to the Catalog apiserver services:

failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server was unable to return a response in the time allotted, but may still be processing the request

and "no route to host" issues, all around 05:29 - 05:42. It's unclear to me if these resulted in the errors seen in the Ansible logs. I noted in comment 37 that oc commands I was executing on the master during the storage upgrade hung for over 7 minutes. SDodson said this wasn't unexpected, but I do wonder if perhaps there should be more retries on this command and, even better, some backoff injected after a failure, waiting for the system to settle down.
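The retry-plus-backoff idea floated above can be sketched generically. The real task would wrap `oc adm migrate storage`; the `flaky` stand-in command below is hypothetical, used only so the wrapper can be exercised without a cluster:

```shell
#!/bin/sh
# Generic retry wrapper with exponential backoff between attempts (sketch only).
retry_with_backoff() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while :; do
    if "$@"; then
      return 0
    fi
    [ "$n" -ge "$attempts" ] && return 1
    echo "attempt $n failed; sleeping ${delay}s before retry" >&2
    sleep "$delay"
    delay=$((delay * 2))  # double the wait after each failure
    n=$((n + 1))
  done
}

# Hypothetical stand-in for the migration command: fails twice, then succeeds.
STATE=/tmp/backoff-demo-count
rm -f "$STATE"
flaky() {
  count=$(cat "$STATE" 2>/dev/null || echo 0)
  count=$((count + 1))
  echo "$count" > "$STATE"
  [ "$count" -ge 3 ]
}

retry_with_backoff 5 1 flaky && echo "succeeded after $(cat "$STATE") attempts"
```

Compared with the fixed 30-second delay the installer uses, the growing gaps give a briefly-unavailable aggregated API server more time to settle before the final attempt.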
With openshift-ansible-3.10.8-1.git.230.830efc0.el7.noarch I am repeatedly getting:

TASK [Upgrade all storage] *******************************************************************************************************************************************************************
FAILED - RETRYING: Upgrade all storage (2 retries left).
FAILED - RETRYING: Upgrade all storage (1 retries left).
fatal: [192.168.121.107.nip.io]: FAILED! => {"attempts": 2, "changed": true, "cmd": ["oc", "adm", "--config=/etc/origin/master/admin.kubeconfig", "migrate", "storage", "--include=*", "--confirm"], "delta": "0:00:32.271083", "end": "2018-06-26 17:43:34.909451", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2018-06-26 17:43:02.638368", "stderr": "Flag --confirm has been deprecated, storage migration does not support dry run, this flag is ignored\nE0626 17:43:34.901012 12864 round_trippers.go:169] CancelRequest not implemented\nerror: could not calculate the list of available resources: Get https://192.168.121.107.nip.io:8443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "stderr_lines": [same messages split per line], "stdout": "", "stdout_lines": []}
root@ip-172-31-7-80: ~ # curl -v -k https://api-server.kube-service-catalog.svc:443/apis/servicecatalog.k8s.io/v1beta1
* Could not resolve host: api-server.kube-service-catalog.svc; Unknown error
* Closing connection 0
curl: (6) Could not resolve host: api-server.kube-service-catalog.svc; Unknown error
root@ip-172-31-7-80: ~ # curl -v -k https://172.xx.xx.xx:443/apis/servicecatalog.k8s.io/v1beta1
* Could not resolve host: 172.xx.xx.xx; Unknown error
* Closing connection 0
curl: (6) Could not resolve host: 172.xx.xx.xx; Unknown error
Jordan, re comment 52: I had a shell script that ran every 5 seconds doing

oc get hostsubnets -o wide
oc get pods --all-namespaces -o wide
oc get services -n kube-service-catalog -o wide
oc describe endpoints -n kube-service-catalog

but found that the oc commands hung for over 6 minutes even with --request-timeout="35". I was running this on one of the master hosts.

@Vikas, re comment 53: You first want to get the service IP address:

$ oc get services apiserver -n kube-service-catalog -o wide

which will show something like:

NAME        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE   SELECTOR
apiserver   ClusterIP   172.27.237.234   <none>        443/TCP   2h    app=apiserver

You then want to use that CLUSTER-IP in the 2nd curl command - not the literal 172.xx.xx.xx. So in this case it would be:

curl -v -k https://172.27.237.234:443/apis/servicecatalog.k8s.io/v1beta1

and either do this on the master node or within a container.
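The two steps above (look up the service's cluster IP, then curl it) can be combined into one helper; this is just a convenience sketch, the function name is made up, and the jsonpath lookup assumes the service object looks like the output pasted above:

```shell
# Sketch: resolve the catalog apiserver's cluster IP and probe the
# v1beta1 discovery endpoint in one step (run on a master node).
probe_catalog_api() {
  local ip
  ip=$(oc get services apiserver -n kube-service-catalog \
        -o jsonpath='{.spec.clusterIP}') || return 1
  curl -v -k "https://${ip}:443/apis/servicecatalog.k8s.io/v1beta1"
}
```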
If I understand correctly, the upgrade process is:

1. run pre-master upgrade (does storage migration prior to changing cluster state)
2. upgrade components (apiservers, nodes, sdn, router, registry, etc)
3. run post upgrade storage migration

The pre-master upgrade storage migration failed, and no other steps were performed. I see the SDN pods have already been upgraded to 3.10:

oc get pods -n openshift-sdn -o wide -o yaml | grep image:
    image: registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10
    (same line repeated for all 6 SDN pods)

the apiservers are still at 3.9:

oc get --raw /version/openshift
{
  "major": "3",
  "minor": "9+",
  "gitCommit": "964617d",
  "gitVersion": "v3.9.27",
  "buildDate": "2018-04-26T16:48:23Z"
}

I did not expect to see *any* 3.9->3.10 component changes in the cluster, given that the storage migration failed. Scott, does that mean some upgrades are happening before or in parallel with the storage migration?

For the networking team, is it expected that rolling the SDN pods would leave the cluster in a state in which the API server could not resolve service/endpoint IPs? Do we have test coverage of that specific scenario (independent of any upgrade)?
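The two version checks above can be wrapped into a small helper so they're easy to re-run at different points during an upgrade; the namespace and /version endpoint are the ones from the paste, while the helper name is made up:

```shell
# Sketch: summarize which image tags the SDN pods are running and what
# version the apiservers report, deduplicating the repeated image lines.
upgrade_state_summary() {
  echo "== SDN pod images =="
  oc get pods -n openshift-sdn -o yaml | grep 'image:' | sort | uniq -c
  echo "== apiserver version =="
  oc get --raw /version/openshift
}
```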
I need to verify, but I believe the 3.10 daemonsets are created prior to upgrading the control plane. Also, the node *on the control plane hosts* is updated prior to upgrading the control plane from 3.9 to 3.10, so before step #1 in comment 55.
I would assume that if the SDN got upgraded to 3.10 and everything else was still running 3.9 that massive breakage would be expected...
in particular, you'd be running both the 3.9 in-process SDN code and the 3.10 daemonset SDN code *at the same time*, right?
> in particular, you'd be running both the 3.9 in-process SDN code and the 3.10 daemonset SDN code *at the same time*, right? on all the non-master nodes, yes. that does seem bad. do the service catalog pods run only on master nodes?
Yes, catalog pods run on all master nodes
SDN pods should crashloop until the host-level SDN services are shut down, at which point they take over.
@weliang could not reproduce it yet. He's trying a different config to see if we can get it to show up.
@Mike: Do you have a broken cluster I can look at?
Would it be possible to get logs from the SDN pod? These will not be in the system journal; they will only be available from "docker logs" or "kubectl logs". Ben observed that the symptoms match those of a kube-proxy (service proxy) issue. If both the 3.10 and 3.9 proxies are running at the same time, that would be bad. (this is just a guess).
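Collecting those SDN pod logs could look like the sketch below; the namespace matches the earlier `oc get pods -n openshift-sdn` output, while the function name and output directory are made up:

```shell
# Sketch: dump each SDN pod's logs to files, since they are not in the
# system journal and are only reachable via docker/kubectl/oc logs.
collect_sdn_logs() {
  local outdir=${1:-/tmp/sdn-logs} pod
  mkdir -p "$outdir"
  for pod in $(oc get pods -n openshift-sdn -o name); do
    # pod looks like "pod/sdn-xxxxx"; strip the prefix for the filename
    oc logs -n openshift-sdn "$pod" > "$outdir/${pod##*/}.log"
  done
}
```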
@vlaad: api-server is incorrect (in comment #53). It needs to be all one word: apiserver. This curl worked on the test cluster that @mfiedler set up: curl -v -k https://api-server.kube-service-catalog.svc:443/apis/servicecatalog.k8s.io/v1beta1
The update is that the test cluster that showed the problem with ansible had no visible networking problems when I got on it. The masters had a working SDN, and DNS was functioning properly. I could curl the service:

curl -v -k https://apiserver.kube-service-catalog.svc.cluster.local:443/apis/servicecatalog.k8s.io/v1beta1

[my curl line in comment 68 is wrong, it still has the -]

When @vlaad re-ran the ansible playbook it completed the second time. Perhaps some more time needs to be given for things to settle. @vlaad is still re-running the upgrade to see if we can catch the failure.
When I did a recreate with Mike in comment 37 I had scripts that executed commands every 5 seconds, started at the same time the upgrade was started. 30 or 40 minutes in we hit the failure and some things experienced issues (oc commands hung, the master logs showed errors communicating with the catalog service, etc), but those issues only lasted 6-10 minutes. I think manually poking around during the issue is going to be hard since it's a short-lived window in which things are failing.
I'm passing this back to the installer because it looks like things settle down; we may just need to increase the timeouts to allow things to settle. I'm going to keep working to try to reproduce this at a useful log level, but our test cluster hit another bug blocking further reproduction.
https://github.com/openshift/openshift-ansible/pull/9090 is the proposed fix. It waits for all core APIs to come up and increases the number of retries on storage migration to 6, with a 30 second delay between each one.
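The "waits for all core apis to come up" part of the proposed fix could be sketched as a simple poll loop; this is illustrative only (the real PR implements this as Ansible tasks), and the retry/delay defaults here are assumptions:

```shell
# Sketch: poll the apiserver's /healthz endpoint until it reports "ok",
# giving the control plane time to settle before storage migration runs.
wait_for_api() {
  local retries=${1:-24} delay=${2:-5} i
  for ((i = 1; i <= retries; i++)); do
    if oc get --raw /healthz 2>/dev/null | grep -q '^ok$'; then
      return 0
    fi
    sleep "$delay"
  done
  echo "apiserver not healthy after $retries checks" >&2
  return 1
}
```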
The PR https://github.com/openshift/openshift-ansible/pull/9099 is merged into openshift-ansible-3.10.15-1; please check the bug.
@Scott I think this is a performance/timing issue. Based on my upgrade test experience, I suggest increasing the number of retries or the delay between them.
I agree that this is either performance related or some sort of problem that takes time for the API service to get back to a state that allows successful storage migration. Since the pre-upgrade storage migration can be completed very early in the process, I'm going to simply move it into an early upgrade pre-check phase that happens before we do anything about converting a host to static pods and before any services have been restarted.
I see a higher failure rate with a system container install on RHEL: OCP HA v3.9.33 with the service catalog enabled, upgraded to OCP v3.10. @Mike, did you cover this upgrade path? Could you give it a try? Thanks.
We previously covered this path with 3.9.27 -> 3.10.15 and did not see the error. We will re-test today with 3.9.33 -> 3.10.16
@wmeng We hit a failure on 1 out of 4 upgrades from 3.9.33 to 3.10.17 today. The task it failed on was different from comment 80, but the actual failure is still an issue with a resource type that should be there. New bz?

1. Hosts: ec2-54-202-210-187.us-west-2.compute.amazonaws.com
   Play: Upgrade Service Catalog
   Task: scale up asb deploymentconfig
   Message: {u'cmd': u'/usr/bin/oc get dc asb -o json -n openshift-ansible-service-broker', u'returncode': 1, u'results': [{}], u'stderr': u'error: the server doesn\'t have a resource type "dc"\n', u'stdout': u''}

I'll attach the log
Created attachment 1458776 [details] ansible log - server doesn't have a resource type "dc"
Thanks Mike. There is bug tracking the issue. Bug 1579261 - [upgrade] ASB upgrade to 3.10 failed at 'scale up asb deploymentconfig'
Verified on 3.10.17. System test team performed multiple 3.9.33 -> 3.10.17 upgrades and did not encounter this issue. See comment 89 and comment 91.