Bug 1569397
| Summary: | [SC] Provision APB failed during restart of the node that the leader controller is running on | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> |
| Component: | Service Catalog | Assignee: | Jay Boyd <jaboyd> |
| Status: | CLOSED CANTFIX | QA Contact: | Jian Zhang <jiazha> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.9.0 | CC: | bleanhar, chezhang, jaboyd, jiazha, zhsun, zitang |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-06-18 15:57:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Jian - Would you provide the logs from the controller pods so we can try to determine why the provision failed? I'd like to make sure we get the catalog controller-manager logs (for the pod that was rebooted and for the pod that took over as leader) as well as the APB logs.

Paul & Jay - I'm sorry, that cluster was deleted yesterday, so I created a new 3.9.24 cluster on GCE for this. But I could NOT reproduce this issue even after rebooting/shutting down the nodes several times; all provisions eventually succeeded, although they stayed in Pending status for some time. In this test, the pod on the rebooted node always showed "Running" status, which is different from the last test. I will double-confirm once our AWS environment is ready for building.

Paul & Jay - I double-checked this issue in 3.9.24 on AWS and cannot reproduce it anymore. I confirm the HA feature works well when a node reboots. I also killed its process and stopped the docker service; it still works well in these scenarios.

```
[root@ip-172-18-7-157 ~]# oc version
oc v3.9.24
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-81-46-243.compute-1.amazonaws.com
openshift v3.9.24
kubernetes v1.9.1+a0ce1bc657
```

Excellent, thanks for retesting and verifying this. Moving forward to close as not reproducible.
Description of problem:
After enabling HA for the service catalog, provisioning an APB failed while rebooting the node that the leader controller is running on.

Version-Release number of selected component (if applicable):
The SC image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.20
The SC version: v0.1.9

How reproducible:
Always

Steps to Reproduce:
1. Check the current leader controller pod. In the example below, the current leader is "controller-manager-msl9f":

```
[root@ip-172-18-4-34 ~]# oc get endpoints service-catalog-controller-manager -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-msl9f-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-04-19T03:18:18Z","renewTime":"2018-04-19T03:18:48Z","leaderTransitions":8}'
  creationTimestamp: 2018-04-19T01:48:04Z
  name: service-catalog-controller-manager
  namespace: kube-service-catalog
  resourceVersion: "23623"
  selfLink: /api/v1/namespaces/kube-service-catalog/endpoints/service-catalog-controller-manager
  uid: b36798a9-4373-11e8-add1-0ed341274e62
subsets: null

[root@ip-172-18-4-34 ~]# oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP          NODE
apiserver-77tsc            1/1       Running   0          1h        10.2.4.4    ip-172-18-3-228.ec2.internal
apiserver-hppbr            1/1       Running   0          1h        10.2.6.3    ip-172-18-4-34.ec2.internal
apiserver-whh68            1/1       Running   0          1h        10.2.8.4    ip-172-18-7-242.ec2.internal
controller-manager-8fgsm   1/1       Running   0          16m       10.2.8.9    ip-172-18-7-242.ec2.internal
controller-manager-msl9f   1/1       Running   0          13m       10.2.4.10   ip-172-18-3-228.ec2.internal
controller-manager-zm292   1/1       Running   0          3m        10.2.6.8    ip-172-18-4-34.ec2.internal
```

2. Provision APBs (for example MediaWiki and PostgreSQL, selecting "create binding") in the web console, then reboot the node "ip-172-18-3-228.ec2.internal" that the leader controller pod is running on.
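For scripted checks, the leader pod can be derived from the `holderIdentity` field of the `control-plane.alpha.kubernetes.io/leader` annotation shown in step 1. The sketch below is runnable as-is: the lease JSON is copied from this report, and the `-external-service-catalog-controller` suffix convention is an observation from this environment, not a documented contract. On a live cluster the annotation value would come from `oc get endpoints service-catalog-controller-manager -n kube-service-catalog -o yaml`.

```shell
# Lease JSON copied verbatim from the endpoint annotation in this report.
lease='{"holderIdentity":"controller-manager-msl9f-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-04-19T03:18:18Z","renewTime":"2018-04-19T03:18:48Z","leaderTransitions":8}'

# Extract the holderIdentity value (simple sed match; real JSON parsing
# would be more robust).
holder=$(printf '%s' "$lease" | sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p')

# The identity observed here is "<pod name>-external-service-catalog-controller";
# strip the suffix to recover the pod name.
leader_pod=${holder%-external-service-catalog-controller}
echo "$leader_pod"   # → controller-manager-msl9f
```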
```
[root@ip-172-18-4-34 ~]# oc get nodes
NAME                           STATUS     ROLES     AGE       VERSION
ip-172-18-3-228.ec2.internal   NotReady   master    1h        v1.9.1+a0ce1bc657
ip-172-18-4-208.ec2.internal   Ready      compute   1h        v1.9.1+a0ce1bc657
ip-172-18-4-34.ec2.internal    Ready      master    1h        v1.9.1+a0ce1bc657
ip-172-18-7-242.ec2.internal   Ready      master    1h        v1.9.1+a0ce1bc657
ip-172-18-8-191.ec2.internal   Ready      compute   1h        v1.9.1+a0ce1bc657

[root@ip-172-18-4-34 ~]# oc get pods -o wide
NAME                       READY     STATUS      RESTARTS   AGE       IP         NODE
apiserver-77tsc            0/1       Completed   0          1h        <none>     ip-172-18-3-228.ec2.internal
apiserver-hppbr            1/1       Running     0          1h        10.2.6.3   ip-172-18-4-34.ec2.internal
apiserver-whh68            1/1       Running     0          1h        10.2.8.4   ip-172-18-7-242.ec2.internal
controller-manager-8fgsm   1/1       Running     0          17m       10.2.8.9   ip-172-18-7-242.ec2.internal
controller-manager-msl9f   0/1       Error       0          14m       <none>     ip-172-18-3-228.ec2.internal
controller-manager-zm292   1/1       Running     0          5m        10.2.6.8   ip-172-18-4-34.ec2.internal
```

Actual results:
Provision APB failed.

```
[root@ip-172-18-4-34 ~]# oc describe serviceinstance rh-postgresql-apb-9cfvx -n jian5
Name:         rh-postgresql-apb-9cfvx
Namespace:    jian5
Labels:       <none>
Annotations:  <none>
API Version:  servicecatalog.k8s.io/v1beta1
Kind:         ServiceInstance
Metadata:
  Creation Timestamp:  2018-04-19T03:21:17Z
  Finalizers:
    kubernetes-incubator/service-catalog
  Generate Name:     rh-postgresql-apb-
  Generation:        1
  Resource Version:  24419
  Self Link:         /apis/servicecatalog.k8s.io/v1beta1/namespaces/jian5/serviceinstances/rh-postgresql-apb-9cfvx
  UID:               b901a3e4-4380-11e8-a31c-0a580a020404
Spec:
  Cluster Service Class External Name:  rh-postgresql-apb
  Cluster Service Class Ref:
    Name:  d5915e05b253df421efe6e41fb6a66ba
  Cluster Service Plan External Name:  dev
  Cluster Service Plan Ref:
    Name:       9783fc2e859f9179833a7dd003baa841
  External ID:  18048f79-1a38-4ea6-b713-6ad79e167368
  Parameters From:
    Secret Key Ref:
      Key:          parameters
      Name:         rh-postgresql-apb-parameters35056
  Update Requests:  0
  User Info:
    Extra:
      Scopes . Authorization . Openshift . Io:  user:full
    Groups:
      system:authenticated:oauth
      system:authenticated
    UID:
    Username:  jiazha2
Status:
  Async Op In Progress:  false
  Conditions:
    Last Transition Time:  2018-04-19T03:21:17Z
    Message:               Provision call failed: Error occurred during provision. Please contact administrator if it persists.
    Reason:                ProvisionCallFailed
    Status:                False
    Type:                  Ready
    Last Transition Time:  2018-04-19T03:22:28Z
    Message:               Provision call failed: Error occurred during provision. Please contact administrator if it persists.
    Reason:                ProvisionCallFailed
    Status:                True
    Type:                  Failed
  Deprovision Status:             Required
  Orphan Mitigation In Progress:  false
  Reconciled Generation:          1
Events:
  Type     Reason                             Age              From                                Message
  ----     ------                             ----             ----                                -------
  Warning  ErrorWithParameters                8m               service-catalog-controller-manager  failed to prepare parameters nil: secrets "rh-postgresql-apb-parameters35056" not found
  Normal   Provisioning                       8m               service-catalog-controller-manager  The instance is being provisioned asynchronously
  Warning  ReferencesNonexistentServiceClass  7m               service-catalog-controller-manager  The instance references a non-existent ClusterServiceClass (K8S: "d5915e05b253df421efe6e41fb6a66ba" ExternalName: "rh-postgresql-apb")
  Warning  ErrorPollingLastOperation          7m               service-catalog-controller-manager  Error polling last operation: Status: 401; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>
  Warning  ProvisionCallFailed                7m (x6 over 7m)  service-catalog-controller-manager  Provision call failed: Error occurred during provision. Please contact administrator if it persists.
```
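The terminal failure above is recorded as a condition of type `Failed` with status `True` in `.status.conditions`. A minimal sketch of checking for that condition follows; the conditions fragment is hand-copied from this report (on a live cluster the JSON would come from `oc get serviceinstance rh-postgresql-apb-9cfvx -n jian5 -o json`), and the substring match is a simplification of real JSON parsing:

```shell
# Reduced .status.conditions fragment from the describe output above.
conditions='[{"type":"Ready","status":"False","reason":"ProvisionCallFailed"},{"type":"Failed","status":"True","reason":"ProvisionCallFailed"}]'

# An instance has failed terminally when a Failed condition is True.
if printf '%s' "$conditions" | grep -q '"type":"Failed","status":"True"'; then
  state=failed
else
  state=not-failed
fi
echo "$state"   # → failed
```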
Another problem is that the corresponding deploy namespaces still exist even though the deploy pods did NOT hit any errors (note: the "keep_namespace" option is false):

```
[root@ip-172-18-4-34 ~]# oc get pods -n rh-mediawiki-apb-prov-q64rj
NAME                                       READY     STATUS      RESTARTS   AGE
apb-c90efb5b-b0f7-4465-9ed3-cb69eb05268c   0/1       Completed   0          2h

[root@ip-172-18-4-34 ~]# oc get pods -n rh-postgresql-apb-prov-79mkt
NAME                                       READY     STATUS      RESTARTS   AGE
apb-d39b0c6b-319c-45ab-bfb2-24a1f1437948   0/1       Completed   0          2h
```

Expected results:
Provisioning the APB succeeds, and the corresponding deploy namespaces are deleted successfully.

Additional info:

```
[root@ip-172-18-4-34 ~]# uname -a
Linux ip-172-18-4-34.ec2.internal 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-172-18-4-34 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)
```
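The leftover sandbox namespaces in this report follow an `<apb-name>-prov-<suffix>` pattern. A sketch of filtering such leftovers out of a namespace list; the sample list is hypothetical apart from the two leftover namespaces taken from this report, and on a live cluster the list would come from `oc get namespaces -o name`:

```shell
# Sample namespace list: two ordinary namespaces plus the two leftover
# APB provision sandboxes observed in this report.
namespaces='default
kube-service-catalog
rh-mediawiki-apb-prov-q64rj
rh-postgresql-apb-prov-79mkt'

# Filter for APB provision sandbox namespaces by their naming pattern.
leftovers=$(printf '%s\n' "$namespaces" | grep -- '-apb-prov-')
echo "$leftovers"
```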