Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1569397

Summary: [SC] Provision APB failed during restart of the node which the leader controller is running on
Product: OpenShift Container Platform
Component: Service Catalog
Reporter: Jian Zhang <jiazha>
Assignee: Jay Boyd <jaboyd>
QA Contact: Jian Zhang <jiazha>
Status: CLOSED CANTFIX
Severity: high
Priority: high
Version: 3.9.0
Target Milestone: ---
Target Release: 3.9.z
CC: bleanhar, chezhang, jaboyd, jiazha, zhsun, zitang
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2018-06-18 15:57:59 UTC
Type: Bug

Description Jian Zhang 2018-04-19 08:25:53 UTC
Description of problem:
After enabling HA for the service catalog, provisioning an APB fails while rebooting the node that the leader controller is running on.

Version-Release number of selected component (if applicable):
The SC image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.20
The SC version: v0.1.9

How reproducible:
Always

Steps to Reproduce:
1. Check the current leader controller pod. In the output below, the current leader controller pod is "controller-manager-msl9f".
[root@ip-172-18-4-34 ~]# oc get endpoints service-catalog-controller-manager -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-msl9f-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-04-19T03:18:18Z","renewTime":"2018-04-19T03:18:48Z","leaderTransitions":8}'
  creationTimestamp: 2018-04-19T01:48:04Z
  name: service-catalog-controller-manager
  namespace: kube-service-catalog
  resourceVersion: "23623"
  selfLink: /api/v1/namespaces/kube-service-catalog/endpoints/service-catalog-controller-manager
  uid: b36798a9-4373-11e8-add1-0ed341274e62
subsets: null


[root@ip-172-18-4-34 ~]# oc get pods -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP          NODE
apiserver-77tsc            1/1       Running   0          1h        10.2.4.4    ip-172-18-3-228.ec2.internal
apiserver-hppbr            1/1       Running   0          1h        10.2.6.3    ip-172-18-4-34.ec2.internal
apiserver-whh68            1/1       Running   0          1h        10.2.8.4    ip-172-18-7-242.ec2.internal
controller-manager-8fgsm   1/1       Running   0          16m       10.2.8.9    ip-172-18-7-242.ec2.internal
controller-manager-msl9f   1/1       Running   0          13m       10.2.4.10   ip-172-18-3-228.ec2.internal
controller-manager-zm292   1/1       Running   0          3m        10.2.6.8    ip-172-18-4-34.ec2.internal
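The leader above is read from the `control-plane.alpha.kubernetes.io/leader` annotation on the endpoints object. As a minimal sketch of how that lease record is interpreted (the helper names here are hypothetical; the record string is copied verbatim from the annotation above), a standby controller may only take over once `renewTime + leaseDurationSeconds` has passed without a renewal, which is why failover after a node reboot is not instantaneous:

```python
import json
from datetime import datetime, timedelta

# Leader-election record copied from the endpoints annotation shown above.
annotation = ('{"holderIdentity":"controller-manager-msl9f-external-service-catalog-controller",'
              '"leaseDurationSeconds":15,"acquireTime":"2018-04-19T03:18:18Z",'
              '"renewTime":"2018-04-19T03:18:48Z","leaderTransitions":8}')

def leader_pod(record_json):
    """Return the leader pod name by stripping the component suffix
    from holderIdentity (hypothetical helper, not product code)."""
    record = json.loads(record_json)
    return record["holderIdentity"].removesuffix(
        "-external-service-catalog-controller")

def lease_expired(record_json, now):
    """True once the lease is stale, i.e. a standby may acquire leadership."""
    record = json.loads(record_json)
    renewed = datetime.strptime(record["renewTime"], "%Y-%m-%dT%H:%M:%SZ")
    return now > renewed + timedelta(seconds=record["leaseDurationSeconds"])

print(leader_pod(annotation))  # controller-manager-msl9f
```

With `leaseDurationSeconds: 15`, the lease renewed at 03:18:48 becomes stale at 03:19:03, so the surviving controllers can be leaderless for up to that window after the node goes down.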

2. Provision APBs (for example, MediaWiki and PostgreSQL, selecting binding creation) in the web console, then reboot the node "ip-172-18-3-228.ec2.internal" that the leader controller pod is running on.
[root@ip-172-18-4-34 ~]# oc get nodes
NAME                           STATUS     ROLES     AGE       VERSION
ip-172-18-3-228.ec2.internal   NotReady   master    1h        v1.9.1+a0ce1bc657
ip-172-18-4-208.ec2.internal   Ready      compute   1h        v1.9.1+a0ce1bc657
ip-172-18-4-34.ec2.internal    Ready      master    1h        v1.9.1+a0ce1bc657
ip-172-18-7-242.ec2.internal   Ready      master    1h        v1.9.1+a0ce1bc657
ip-172-18-8-191.ec2.internal   Ready      compute   1h        v1.9.1+a0ce1bc657

[root@ip-172-18-4-34 ~]# oc get pods -o wide
NAME                       READY     STATUS      RESTARTS   AGE       IP         NODE
apiserver-77tsc            0/1       Completed   0          1h        <none>     ip-172-18-3-228.ec2.internal
apiserver-hppbr            1/1       Running     0          1h        10.2.6.3   ip-172-18-4-34.ec2.internal
apiserver-whh68            1/1       Running     0          1h        10.2.8.4   ip-172-18-7-242.ec2.internal
controller-manager-8fgsm   1/1       Running     0          17m       10.2.8.9   ip-172-18-7-242.ec2.internal
controller-manager-msl9f   0/1       Error       0          14m       <none>     ip-172-18-3-228.ec2.internal
controller-manager-zm292   1/1       Running     0          5m        10.2.6.8   ip-172-18-4-34.ec2.internal

Actual results:
Provisioning the APB failed.

[root@ip-172-18-4-34 ~]# oc describe serviceinstance rh-postgresql-apb-9cfvx -n jian5
Name:         rh-postgresql-apb-9cfvx
Namespace:    jian5
Labels:       <none>
Annotations:  <none>
API Version:  servicecatalog.k8s.io/v1beta1
Kind:         ServiceInstance
Metadata:
  Creation Timestamp:  2018-04-19T03:21:17Z
  Finalizers:
    kubernetes-incubator/service-catalog
  Generate Name:     rh-postgresql-apb-
  Generation:        1
  Resource Version:  24419
  Self Link:         /apis/servicecatalog.k8s.io/v1beta1/namespaces/jian5/serviceinstances/rh-postgresql-apb-9cfvx
  UID:               b901a3e4-4380-11e8-a31c-0a580a020404
Spec:
  Cluster Service Class External Name:  rh-postgresql-apb
  Cluster Service Class Ref:
    Name:                              d5915e05b253df421efe6e41fb6a66ba
  Cluster Service Plan External Name:  dev
  Cluster Service Plan Ref:
    Name:       9783fc2e859f9179833a7dd003baa841
  External ID:  18048f79-1a38-4ea6-b713-6ad79e167368
  Parameters From:
    Secret Key Ref:
      Key:          parameters
      Name:         rh-postgresql-apb-parameters35056
  Update Requests:  0
  User Info:
    Extra:
      Scopes . Authorization . Openshift . Io:
        user:full
    Groups:
      system:authenticated:oauth
      system:authenticated
    UID:       
    Username:  jiazha2
Status:
  Async Op In Progress:  false
  Conditions:
    Last Transition Time:         2018-04-19T03:21:17Z
    Message:                      Provision call failed: Error occurred during provision. Please contact administrator if it persists.
    Reason:                       ProvisionCallFailed
    Status:                       False
    Type:                         Ready
    Last Transition Time:         2018-04-19T03:22:28Z
    Message:                      Provision call failed: Error occurred during provision. Please contact administrator if it persists.
    Reason:                       ProvisionCallFailed
    Status:                       True
    Type:                         Failed
  Deprovision Status:             Required
  Orphan Mitigation In Progress:  false
  Reconciled Generation:          1
Events:
  Type     Reason                             Age              From                                Message
  ----     ------                             ----             ----                                -------
  Warning  ErrorWithParameters                8m               service-catalog-controller-manager  failed to prepare parameters nil: secrets "rh-postgresql-apb-parameters35056" not found
  Normal   Provisioning                       8m               service-catalog-controller-manager  The instance is being provisioned asynchronously
  Warning  ReferencesNonexistentServiceClass  7m               service-catalog-controller-manager  The instance references a non-existent ClusterServiceClass (K8S: "d5915e05b253df421efe6e41fb6a66ba" ExternalName: "rh-postgresql-apb")
  Warning  ErrorPollingLastOperation          7m               service-catalog-controller-manager  Error polling last operation: Status: 401; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>
  Warning  ProvisionCallFailed                7m (x6 over 7m)  service-catalog-controller-manager  Provision call failed: Error occurred during provision. Please contact administrator if it persists.

Another problem is that the corresponding deploy namespaces still exist even though the deploy pods did NOT hit any errors (note: the "keep_namespace" option is false):
[root@ip-172-18-4-34 ~]# oc get pods -n rh-mediawiki-apb-prov-q64rj
NAME                                       READY     STATUS      RESTARTS   AGE
apb-c90efb5b-b0f7-4465-9ed3-cb69eb05268c   0/1       Completed   0          2h
[root@ip-172-18-4-34 ~]# oc get pods -n rh-postgresql-apb-prov-79mkt
NAME                                       READY     STATUS      RESTARTS   AGE
apb-d39b0c6b-319c-45ab-bfb2-24a1f1437948   0/1       Completed   0          2h
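The leftover namespaces above follow an `<apb-name>-prov-<suffix>` sandbox pattern. A hedged sketch for spotting such leftovers in a namespace listing (the regex and helper are assumptions for illustration, not the broker's actual naming contract or cleanup code):

```python
import re

# Assumed pattern for APB provision sandbox namespaces,
# e.g. "rh-mediawiki-apb-prov-q64rj" from the output above.
SANDBOX_RE = re.compile(r"^[a-z0-9-]+-apb-prov-[a-z0-9]+$")

def leftover_sandboxes(namespaces):
    """Return APB provision sandbox namespaces that should have been
    deleted once the deploy pod completed (keep_namespace is false)."""
    return [ns for ns in namespaces if SANDBOX_RE.match(ns)]

print(leftover_sandboxes(
    ["rh-mediawiki-apb-prov-q64rj", "rh-postgresql-apb-prov-79mkt", "jian5"]))
```

Against the namespaces in this report, the filter keeps the two sandbox namespaces and skips the user project "jian5".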

Expected results:
Provisioning the APB succeeds and the corresponding deploy namespaces are deleted successfully.

Additional info:
[root@ip-172-18-4-34 ~]# uname -a
Linux ip-172-18-4-34.ec2.internal 3.10.0-862.el7.x86_64 #1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@ip-172-18-4-34 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.5 (Maipo)

Comment 1 Paul Morie 2018-04-19 22:28:39 UTC
Jian-

Would you provide the logs from the controller pods so we can try to determine why the provision failed?

Comment 2 Jay Boyd 2018-04-20 00:18:33 UTC
I'd like to make sure we get the catalog controller-manager logs (for the pod that was rebooted and the pod that took over as leader) as well as the APB logs.

Comment 3 Jian Zhang 2018-04-20 10:21:00 UTC
Paul & Jay

I'm sorry, that cluster was deleted yesterday, so I created a new 3.9.24 cluster on GCE for this. But I could NOT reproduce this issue even after rebooting/shutting down the nodes several times; all provisions eventually succeed, although they stay in Pending status for some time.
In this test, the pod that was rebooted always displayed "Running" status, which is different from the last test. I will double-check it once our AWS env is ready for building.

Comment 4 Jian Zhang 2018-04-23 05:47:09 UTC
Paul & Jay

I double-checked this issue in 3.9.24 on AWS and cannot reproduce it anymore.
I confirm the HA feature works well when the node reboots. I also killed its process and stopped the docker service; it still works well in these scenarios.

[root@ip-172-18-7-157 ~]# oc version
oc v3.9.24
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ec2-54-81-46-243.compute-1.amazonaws.com
openshift v3.9.24
kubernetes v1.9.1+a0ce1bc657

Comment 5 Jay Boyd 2018-04-23 12:39:43 UTC
Excellent, thanks for retesting and verifying this. Moving forward to close as not reproducible.

Comment 6 Jian Zhang 2018-04-24 01:27:04 UTC
Verified successfully according to comments 3 and 4, thanks all!