Bug 1579261
Summary: [upgrade] ASB upgrade to 3.10 failed at 'scale up asb deploymentconfig'

Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-ansible
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED DEFERRED
Keywords: Reopened
Reporter: Zihan Tang <zitang>
Assignee: Russell Teague <rteague>
QA Contact: Johnny Liu <jialiu>
CC: anli, aos-bugs, chezhang, jiazha, jmatthew, jmontleo, jokerman, mifiedle, mmccomas, sdodson, sgaikwad, shurley, spadgett, vrutkovs, wmeng, zhsun
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-10-14 20:13:33 UTC
Description
Zihan Tang
2018-05-17 09:19:42 UTC
Can you post the content of `oc logs -f job/asb-etcd-migration -n openshift-ansible-service-broker`?

> the server doesn't have a resource type "dc"

This is very weird and probably caused by an API/etcd pod restart. Could you attach journalctl logs from the master nodes?
This specific issue seems to be unrelated to the broker; it is likely an issue with the origin master API being down at the point the request is made. I think there might be a second issue lurking, which is why the migration failed in the first place, but that will probably need its own BZ (once I've had a chance to look at the job logs).

Adding more information following comment 4: the master-controllers could not be started. The pod reports the following messages (details in the attached file):

I0518 06:52:54.589754 1 client_builder.go:233] Verified credential for cluster-quota-reconciliation-controller/openshift-infra
I0518 06:52:54.743179 1 request.go:1099] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "service-catalog-signer")' Trying to reach: 'https://172.30.249.84:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s''
I0518 06:52:55.551712 1 request.go:1099] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "service-catalog-signer")' Trying to reach: 'https://172.30.249.84:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s''
F0518 06:52:55.552142 1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

Created attachment 1438369 [details]
The master-controllers error logs
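As a diagnostic aid for the x509 error above, one way to check which CA is actually registered for the service-catalog aggregated API is to pull the CA bundle off the APIService object and print its issuer with openssl. This is a sketch, not part of the bug's reproduction steps; `v1beta1.servicecatalog.k8s.io` is the conventional APIService name for the catalog and should be confirmed with `oc get apiservice` on the affected cluster.

```shell
#!/bin/sh
# Sketch: print the issuer of a PEM certificate read from stdin, to compare
# against the "service-catalog-signer" CA named in the controller error.
extract_issuer() {
    openssl x509 -noout -issuer
}

# Against a live cluster (assumed invocation, not taken from the bug log):
#   oc get apiservice v1beta1.servicecatalog.k8s.io \
#       -o jsonpath='{.spec.caBundle}' | base64 -d | extract_issuer
```

If the issuer printed here does not match the CA that signed the catalog apiserver's serving certificate, the controller's "certificate signed by unknown authority" failure is expected.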
In my testing I see the API randomly getting stuck at various points of the hosted resources install:

May 18 09:57:14 ip-172-18-1-93.ec2.internal origin-node[22961]: W0518 09:56:43.986619 22961 prober.go:103] No ref for container "docker://fafbd4c98301b50d4ecd2d685bf07d6966bbf6d555a36abe4aa" (master-api-ip-172-18-1-93.ec2.internal_kube-system(d87e86d962f1fde9fa7904cf1a1c6e53):api)
May 18 09:57:14 ip-172-18-1-93.ec2.internal origin-node[22961]: I0518 09:56:43.986641 22961 prober.go:111] Readiness probe for "master-api-ip-172-18-1-93.ec2.internal_kube-system(d87e86d962f1fde9fa7904cf1a1c6e53):api" failed (failure): Get https://172.18.1.93:8443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Created PR https://github.com/openshift/openshift-ansible/pull/8443 - API restarts should happen less frequently now.

(In reply to Fabian von Feilitzsch from comment #2)
> Can you post the content of `oc logs -f job/asb-etcd-migration -n openshift-ansible-service-broker`?

In the openshift-ansible-service-broker namespace there is no asb-etcd-migration pod:

[root@ip-172-18-12-195 ~]# oc get pod
NAME               READY   STATUS    RESTARTS   AGE
asb-1-m5z2w        1/1     Running   5          1h
asb-etcd-1-vxn2c   1/1     Running   1          2h
[root@ip-172-18-12-195 ~]# oc logs -f job/asb-etcd-migration
^C
[root@ip-172-18-12-195 ~]# oc logs job/asb-etcd-migration
^C

But in another environment the `asb-etcd-migration` job is triggered,
but it failed with:

[root@qe-zitang-39up-master-etcd-1 ~]# oc logs -f asb-etcd-migration-v5hnx
time="2018-05-21T09:01:35Z" level=info msg="etcd configuration: {asb-etcd.openshift-ansible-service-broker.svc /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt /var/run/asb-etcd-auth/client.crt /var/run/asb-etcd-auth/client.key 2379}"
time="2018-05-21T09:01:35Z" level=info msg="== ETCD CX =="
time="2018-05-21T09:01:35Z" level=info msg="EtcdHost: asb-etcd.openshift-ansible-service-broker.svc"
time="2018-05-21T09:01:35Z" level=info msg="EtcdPort: 2379"
time="2018-05-21T09:01:35Z" level=info msg="Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]"
2018/05/21 09:01:35 Dao::BatchGetRaw
2018/05/21 09:01:35 Successfully loaded [ 4 ] objects from etcd dir [ /spec ]
2018/05/21 09:01:35 Batch idx [ 0 ] -> [ 73ead67495322cc462794387fa9884f5 ]
2018/05/21 09:01:35 Batch idx [ 1 ] -> [ d5915e05b253df421efe6e41fb6a66ba ]
2018/05/21 09:01:35 Batch idx [ 2 ] -> [ 03b69500305d9859bb9440d9f9023784 ]
2018/05/21 09:01:35 Batch idx [ 3 ] -> [ 2c259ddd8059b9bc65081e07bf20058f ]
2018/05/21 09:01:35 set spec: 73ead67495322cc462794387fa9884f5
2018/05/21 09:01:35 set spec: d5915e05b253df421efe6e41fb6a66ba
2018/05/21 09:01:35 set spec: 03b69500305d9859bb9440d9f9023784
2018/05/21 09:01:35 set spec: 2c259ddd8059b9bc65081e07bf20058f
2018/05/21 09:01:35 Dao::BatchGetRaw
2018/05/21 09:01:35 Successfully loaded [ 3 ] objects from etcd dir [ /service_instance ]
2018/05/21 09:01:35 set service instance: 55f7fa4e-4557-4a40-ace8-451ee80ff04f
2018/05/21 09:01:35 unable to save service instance - bundleinstances.automationbroker.io is forbidden: User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in the namespace "openshift-ansible-service-broker": User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in project "openshift-ansible-service-broker"
time="2018-05-21T09:01:35Z" level=info msg="reverted service instances"
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 73ead67495322cc462794387fa9884f5 ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ d5915e05b253df421efe6e41fb6a66ba ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 03b69500305d9859bb9440d9f9023784 ]
2018/05/21 09:01:35 Dao::DeleteSpec-> [ 2c259ddd8059b9bc65081e07bf20058f ]
time="2018-05-21T09:01:35Z" level=info msg="reverted saved specs - exiting now - migration failed"
panic: Unable to migrate all the service instances set service instance - bundleinstances.automationbroker.io is forbidden: User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in the namespace "openshift-ansible-service-broker": User "system:serviceaccount:openshift-ansible-service-broker:asb" cannot create bundleinstances.automationbroker.io in project "openshift-ansible-service-broker"
goroutine 1 [running]:
main.main()
	/builddir/build/BUILD/ansible-service-broker-1.2.11/cmd/migration/main.go:126 +0x357c

This is caused by bug 1579269.

FYI, I'm looking at a similar upgrade issue with Service Catalog - similar in that it's a 3.9 to 3.10 upgrade on AWS; Catalog is not upgraded to 3.10 (no errors reported). I'm seeing that the master controller is crash looping (it's obvious if you run `oc get pods --all-namespaces`). I'm not certain that is your issue, but I wanted to call it out. The controller manager log ends with:

error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-07bf0c9c6b2f7f248: "error listing AWS instances: \"AuthFailure: AWS was not able to validate the provided access credentials\\n\\tstatus code: 401, request id: 1b9749b1-0985-4297-985b-0a59125f3678\""

Fabian, can you look at why the ASB etcd-to-CRD migration is failing here? It looks like it could be a permissions problem.
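The forbidden error above says the `asb` service account lacks `create` on `bundleinstances.automationbroker.io`. The real fix shipped in the broker's RBAC (tracked in bug 1579269); purely as an illustration of the missing rule, a ClusterRole granting that access might look like the sketch below. The role name `asb-bundleinstances` and the verb list are made up for this example.

```shell
#!/bin/sh
# Hypothetical RBAC sketch (names invented, not from the bug): a ClusterRole
# allowing the migration job's service account to manage bundleinstances.
rbac_manifest() {
    cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: asb-bundleinstances   # hypothetical name
rules:
- apiGroups: ["automationbroker.io"]
  resources: ["bundleinstances"]
  verbs: ["create", "get", "list", "update", "delete"]
EOF
}

# Against a cluster (assumed invocation), apply it and bind it to the
# asb service account named in the error message:
#   rbac_manifest | oc apply -f -
#   oc adm policy add-cluster-role-to-user asb-bundleinstances \
#       system:serviceaccount:openshift-ansible-service-broker:asb
```

This is only a workaround sketch; the supported path is upgrading to a broker build whose installed RBAC already includes the rule.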
Scott, the logs indicate the same failure as https://bugzilla.redhat.com/show_bug.cgi?id=1579269. It could also have been the same failure as the dc issue, where the API server is not up and we therefore cannot create the resources necessary to perform the migration. The fix for that was verified yesterday, so the first case should be solved. Can this be re-tested using the latest, per comment 15?

I used the latest openshift-ansible-3.10.0-0.53.0 to upgrade the same env; it still failed with the same error.

Play: Upgrade Service Catalog
Task: scale up asb deploymentconfig
Message: {u'cmd': u'/usr/bin/oc get dc asb -o json -n openshift-ansible-service-broker', u'returncode': 1, u'results': [{}], u'stderr': u'error: the server doesn\'t have a resource type "dc"\n', u'stdout': u''}

It seems like this error is related to the stability of the master API server or controller, not the migration. If the API is not consistently up, I'm not sure there's anything we can do to fix it other than add some retries and hope. @sdodson, thoughts?

Closing all 'Openshift API broken' bugs.

*** This bug has been marked as a duplicate of bug 1579676 ***

I reopen this bug because we hit it 3 times today. The error is the same as in the description. Ansible version: openshift-ansible-3.10.1-1. This blocks the service-catalog and asb upgrade tests, so adding the TestBlocker tag.

I tried the upgrade on other platforms. On AWS it is reproduced; on GCE and OpenStack the upgrade succeeded and it does not reproduce. It's not always reproducible, so removing the TestBlocker keyword.

With no active cases associated with this bug nor a reproducer, we're closing this.
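The "add some retries and hope" suggestion in the comments above could be sketched as a small wrapper around the failing `oc` call. This is a hypothetical helper, not code from openshift-ansible (the actual PRs added retries at the Ansible task level); the counts and delay are illustrative.

```shell
#!/bin/sh
# Hypothetical retry wrapper: run a command up to MAX_TRIES times with a
# fixed delay between attempts, to ride out transient API-server outages.
retry() {
    max_tries=$1
    delay=$2
    shift 2
    i=1
    while true; do
        # Succeed as soon as the wrapped command succeeds.
        "$@" && return 0
        [ "$i" -ge "$max_tries" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example (assumed invocation, mirroring the failing upgrade task):
#   retry 5 10 oc get dc asb -o json -n openshift-ansible-service-broker
```

This only papers over the instability, of course; if the master API flaps for longer than the retry window, the task still fails.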