Bug 1579261
Summary: [upgrade] ASB upgrade to 3.10 failed at 'scale up asb deploymentconfig'

Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED DEFERRED
Keywords: Reopened
Reporter: Zihan Tang <zitang>
Assignee: Russell Teague <rteague>
QA Contact: Johnny Liu <jialiu>
CC: anli, aos-bugs, chezhang, jiazha, jmatthew, jmontleo, jokerman, mifiedle, mmccomas, sdodson, sgaikwad, shurley, spadgett, vrutkovs, wmeng, zhsun
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-10-14 20:13:33 UTC
Description
Zihan Tang
2018-05-17 09:19:42 UTC
Can you post the content of `oc logs -f job/asb-etcd-migration -n openshift-ansible-service-broker`?

> the server doesn't have a resource type "dc"

This is very weird and probably caused by an API/etcd pod restart. Could you attach journalctl logs from the master nodes?
This specific issue seems to be unrelated to the broker; it is likely an issue with the origin master API being down at the point the request is made. I think there might be a second issue lurking, which is why the migration failed in the first place, but that will probably need its own BZ (once I've had a chance to look at the job logs).

Adding more information following comment 4: the master-controllers could not be started. The pod reports the following messages (details in the attached file):

I0518 06:52:54.589754 1 client_builder.go:233] Verified credential for cluster-quota-reconciliation-controller/openshift-infra
I0518 06:52:54.743179 1 request.go:1099] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "service-catalog-signer")' Trying to reach: 'https://172.30.249.84:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s''
I0518 06:52:55.551712 1 request.go:1099] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "service-catalog-signer")' Trying to reach: 'https://172.30.249.84:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s''
F0518 06:52:55.552142 1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

Created attachment 1438369 [details]
The master-controllers error logs
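As a diagnostic aid for the x509 error above, one way to check which CA is actually registered for the service-catalog aggregated API is to pull the CA bundle off the APIService object and print its issuer with openssl. This is a sketch, not part of the bug's reproduction steps; `v1beta1.servicecatalog.k8s.io` is the conventional APIService name for the catalog and should be confirmed with `oc get apiservice` on the affected cluster.

```shell
#!/bin/sh
# Sketch: print the issuer of a PEM certificate read from stdin, to compare
# against the "service-catalog-signer" CA named in the controller error.
extract_issuer() {
    openssl x509 -noout -issuer
}

# Against a live cluster (assumed invocation, not taken from the bug log):
#   oc get apiservice v1beta1.servicecatalog.k8s.io \
#       -o jsonpath='{.spec.caBundle}' | base64 -d | extract_issuer
```

If the issuer printed here does not match the CA that signed the catalog apiserver's serving certificate, the controller's "certificate signed by unknown authority" failure is expected.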
In my testing I see the API randomly getting stuck at various points of the hosted resources install:

May 18 09:57:14 ip-172-18-1-93.ec2.internal origin-node[22961]: W0518 09:56:43.986619 22961 prober.go:103] No ref for container "docker://fafbd4c98301b50d4ecd2d685bf07d6966bbf6d555a36abe4aa" (master-api-ip-172-18-1-93.ec2.internal_kube-system(d87e86d962f1fde9fa7904cf1a1c6e53):api)
May 18 09:57:14 ip-172-18-1-93.ec2.internal origin-node[22961]: I0518 09:56:43.986641 22961 prober.go:111] Readiness probe for "master-api-ip-172-18-1-93.ec2.internal_kube-system(d87e86d962f1fde9fa7904cf1a1c6e53):api" failed (failure): Get https://172.18.1.93:8443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Created PR https://github.com/openshift/openshift-ansible/pull/8443 - API restarts should happen less frequently now.

(In reply to Fabian von Feilitzsch from comment #2)
> Can you post the content of `oc logs -f job/asb-etcd-migration -n openshift-ansible-service-broker`?

In the openshift-ansible-service-broker namespace there is no asb-etcd-migration pod:

[root@ip-172-18-12-195 ~]# oc get pod
NAME               READY   STATUS    RESTARTS   AGE
asb-1-m5z2w        1/1     Running   5          1h
asb-etcd-1-vxn2c   1/1     Running   1          2h
[root@ip-172-18-12-195 ~]# oc logs -f job/asb-etcd-migration
^C
[root@ip-172-18-12-195 ~]# oc logs job/asb-etcd-migration
^C

But in another environment the `asb-etcd-migration` job is triggered,
but it failed with:

[root@qe-zitang-39up-master-etcd-1 ~]# oc logs -f asb-etcd-migration-v5hnx
time="2018-05-21T09:01:35Z" level=info msg="etcd configuration: {asb-etcd.openshift-ansible-service-broker.svc /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt /var/run/asb-etcd-auth/client.crt /var/run/asb-etcd-auth/client.key 2379}"
time="2018-05-21T09:01:35Z" level=info msg="== ETCD CX =="
time="2018-05-21T09:01:35Z" level=info msg="EtcdHost: asb-etcd.openshift-ansible-service-broker.svc"
time="2018-05-21T09:01:35Z" level=info msg="EtcdPort: 2379"
time="2018-05-21T09:01:35Z" level=info msg="Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]"
2018/05/21 09:01:35 Dao::BatchGetRaw
2018/05/21 09:01:35 Successfully loaded [ 4 ] objects from etcd dir [ /spec ]
2018/05/21 09:01:35 Batch idx [ 0 ] -> [ 73ead67495322cc462794387fa9884f5 ]
2018/05/21 09:01:35 Batch idx [ 1 ] -> [ d5915e05b253df421efe6e41fb6a66ba ]
2018/05/21 09:01:35 Batch idx [ 2 ] -> [ 03b69500305d9859bb9440d9f9023784 ]
2018/05/21 09:01:35 Batch idx [ 3 ] -> [ 2c259ddd8059b9bc65081e07bf20058f ]
2018/05/21 09:01:35 set spec: 73ead67495322cc462794387fa9884f5
2018/05/21 09:01:35 set spec: d5915e05b253df421efe6e41fb6a66ba
2018/05/21 09:01:35 set spec: 03b69500305d9859bb9440d9f9023784
2018/05/21 09:01:35 set spec: 2c259ddd8059b9bc65081e07bf20058f
2018/05/21 09:01:35 Dao::BatchGetRaw
2018/05/21 09:01:35 Successfully loaded [ 3 ] objects from etcd dir [ /service_instance ]
2018/05/21 09:01:35 set service instance: 55f7fa4e-4557-4a40-ace8-451ee80ff04f
2018/05/21 09:01:35 unable to save service instance - bundleinstances.automationbroker.io is forbidden: User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in the namespace "openshift-ansible-service-broker": User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in project "openshift-ansible-service-broker"
time="2018-05-21T09:01:35Z" level=info msg="reverted service instances"
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 73ead67495322cc462794387fa9884f5 ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ d5915e05b253df421efe6e41fb6a66ba ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 03b69500305d9859bb9440d9f9023784 ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 2c259ddd8059b9bc65081e07bf20058f ]
time="2018-05-21T09:01:35Z" level=info msg="reverted saved specs - exiting now - migration failed"
panic: Unable to migrate all the service instances set service instance - bundleinstances.automationbroker.io is forbidden: User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in the namespace "openshift-ansible-service-broker": User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in project "openshift-ansible-service-broker"
goroutine 1 [running]:
main.main()
	/builddir/build/BUILD/ansible-service-broker-1.2.11/cmd/migration/main.go:126 +0x357c

This is caused by bug 1579269.

FYI, I'm looking at a similar upgrade issue with Service Catalog - similar in that it's a 3.9 to 3.10 upgrade on AWS; Catalog is not upgraded to 3.10 (no errors reported). I'm seeing that the master controller is crash looping (it's obvious if you run `oc get pods --all-namespaces`). I'm not certain that is your issue, but I wanted to call it out. The controller manager log ends with:

error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-07bf0c9c6b2f7f248: "error listing AWS instances: \"AuthFailure: AWS was not able to validate the provided access credentials\\n\\tstatus code: 401, request id: 1b9749b1-0985-4297-985b-0a59125f3678\""

Fabian, can you look at why the ASB etcd-to-CRD migration is failing here? It looks like it could be a permissions problem.
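The forbidden error above says the `asb` service account lacks `create` on `bundleinstances.automationbroker.io`. The real fix shipped in the broker's RBAC (tracked in bug 1579269); purely as an illustration of the missing rule, a ClusterRole granting that access might look like the sketch below. The role name `asb-bundleinstances` and the verb list are made up for this example.

```shell
#!/bin/sh
# Hypothetical RBAC sketch (names invented, not from the bug): a ClusterRole
# allowing the migration job's service account to manage bundleinstances.
rbac_manifest() {
    cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: asb-bundleinstances   # hypothetical name
rules:
- apiGroups: ["automationbroker.io"]
  resources: ["bundleinstances"]
  verbs: ["create", "get", "list", "update", "delete"]
EOF
}

# Against a cluster (assumed invocation), apply it and bind it to the
# asb service account named in the error message:
#   rbac_manifest | oc apply -f -
#   oc adm policy add-cluster-role-to-user asb-bundleinstances \
#       system:serviceaccount:openshift-ansible-service-broker:asb
```

This is only a workaround sketch; the supported path is upgrading to a broker build whose installed RBAC already includes the rule.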
Scott, the logs indicate the same failure as https://bugzilla.redhat.com/show_bug.cgi?id=1579269. It could also have been the same failure as the dc issue, where the API server is not up and we therefore cannot create the resources necessary to perform the migration. The fix for that was verified yesterday, so the first case should be solved. Can this be re-tested using the latest, per comment 15?

I used the latest openshift-ansible-3.10.0-0.53.0 to upgrade the same env; it still failed with the same error.

Play: Upgrade Service Catalog
Task: scale up asb deploymentconfig
Message: {u'cmd': u'/usr/bin/oc get dc asb -o json -n openshift-ansible-service-broker', u'returncode': 1, u'results': [{}], u'stderr': u'error: the server doesn\'t have a resource type "dc"\n', u'stdout': u''}

It seems like this error is related to the stability of the master API server or controller, not the migration. If the API is not consistently up, I'm not sure there's anything we can do to fix it other than add some retries and hope. @sdodson, thoughts?

Closing all 'Openshift API broken' bugs.

*** This bug has been marked as a duplicate of bug 1579676 ***

I reopen this bug because we hit it 3 times today. The error is the same as in the description. Ansible version: openshift-ansible-3.10.1-1. This blocks the service-catalog and asb upgrade tests, so adding the TestBlocker tag.

I tried the upgrade on other platforms. On AWS it is reproduced; on GCE and OpenStack the upgrade succeeded and it does not reproduce. It's not always reproducible, so removing the TestBlocker keyword.

With no active cases associated with this bug nor a reproducer, we're closing this.
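The "add some retries and hope" suggestion in the comments above could be sketched as a small wrapper around the failing `oc` call. This is a hypothetical helper, not code from openshift-ansible (the actual PRs added retries at the Ansible task level); the counts and delay are illustrative.

```shell
#!/bin/sh
# Hypothetical retry wrapper: run a command up to MAX_TRIES times with a
# fixed delay between attempts, to ride out transient API-server outages.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    i=1
    while true; do
        # Succeed as soon as the wrapped command succeeds.
        "$@" && return 0
        [ "$i" -ge "$max_tries" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example (assumed invocation, mirroring the failing upgrade task):
#   retry 5 10 oc get dc asb -o json -n openshift-ansible-service-broker
```

This only papers over the instability, of course; if the master API flaps for longer than the retry window, the task still fails.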