Bug 1554623
| Summary: | apiserver and controller-manager pods from kube-service-catalog run on a single master | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Suresh <sgaikwad> |
| Component: | Service Catalog | Assignee: | Jay Boyd <jaboyd> |
| Status: | CLOSED ERRATA | QA Contact: | Jian Zhang <jiazha> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.7.0 | CC: | chezhang, hgomes, jaboyd, jforrest, jiazha, jmalde, jpeeler, nick, pmorie, wzheng, yapei, zitang |
| Target Milestone: | --- | | |
| Target Release: | 3.9.z | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | The service-catalog pods were previously deployed with a node selector that matched only the first master. The pods are now scheduled using a label that applies to all masters, so the service-catalog pods deploy to every master. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-08-09 22:13:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1571088 | | |
| Bug Blocks: | | | |
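As a rough illustration of the fix described in the Doc Text, the following sketch checks where the service-catalog workloads are allowed to schedule. It assumes the apiserver and controller-manager are deployed as DaemonSets in the kube-service-catalog namespace by openshift-ansible; the exact node-selector key depends on the inventory, so the names here are only examples.

```bash
# Sketch only: confirm the kube-service-catalog daemonsets select all masters,
# not just the first one. Object names assume the openshift-ansible defaults.

# Show the node selector each daemonset schedules against.
oc get daemonset apiserver controller-manager -n kube-service-catalog \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.nodeSelector}{"\n"}{end}'

# Confirm every master node carries the matching label (the label key is
# whatever the selector above reports, not necessarily "master").
oc get nodes --show-labels | grep master
```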
Description
Suresh
2018-03-13 03:02:04 UTC
I discussed this with Paul and he agrees: yes, put the label on all masters. The controller-manager has leader election code that will ensure only one instance is actively handling load.

As I know, we have a related PR for the controller that enables leader election: https://github.com/kubernetes-incubator/service-catalog/pull/647

*** Bug 1566039 has been marked as a duplicate of this bug. ***

@Jeff, I want to double-confirm that you still want to provide the fix in 3.9.z. Thanks.

Yes, I do.

Based on the current test results, I think service catalog HA is blocked by the issue "Service Catalog does not refresh ClusterServicePlan after removing from catalog". Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1548122#c20 It was fixed in 3.10 but not in 3.9.z, so we need to merge the related fix into 3.9.z. I opened another bug with target release "3.9.z" to track it: https://bugzilla.redhat.com/show_bug.cgi?id=1568815 I'm changing the status to "Modified" and adding a dependency on BZ 1568815; please correct me if I'm mistaken. Thanks.

I'm really not sure how cluster service plans sticking around would influence multiple catalog pods running properly. To be more specific, this bug is only about running multiple catalog pods versus just one; I think any odd interactions with deleting the current leader should be tracked separately. As far as endpoints versus configmaps go, I regret not merging https://github.com/openshift/openshift-ansible/pull/7530 with https://github.com/openshift/openshift-ansible/pull/8001, which only happened 5 hours ago. Once a build is tested with the latter PR, endpoints aren't used for anything related to HA any more.

To be clearer, I opened a new bug, 1569397, to track the reboot issue (step 1 of comment 15). And now we encounter bug 1568815 more frequently, regardless of whether we kill the leader pod or reboot the node the leader pod is running on.

Jay, will you take a look at this bug? There seems to be an entanglement with the bug for removed plans.

As noted in bug 1568815, the fix for the RemovedFromBrokerCatalog problem is in builds ose-service-catalog:v3.9.21-1 and newer (this build is too old). Are there any remaining problems for development in this issue?

Verified bug 1568815 with ose-service-catalog:v3.9.21-1. LGTM; we will double-confirm the service catalog HA function with v3.9.21-1 or a newer version.
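Given the leader election mentioned above, a rough sketch of how one might confirm that only one controller-manager pod is actively handling load at a time; pod names are illustrative, and the log strings match the leaderelection output shown later in this report.

```bash
# Sketch: each controller-manager pod logs its leader-election state; exactly
# one should report that it acquired the lease, while the others keep retrying.
for pod in $(oc get pods -n kube-service-catalog -o name | grep controller-manager); do
  echo "== $pod =="
  oc logs "$pod" -n kube-service-catalog | grep -i leaderelection | tail -n 2
done
```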
Jeff,

Tested with the 3.9.24 cluster; the "Endpoints" type for leader election has been removed.
[root@qe-jiazha-3920-gcemaster-etcd-1 ~]# oc get endpoints
NAME ENDPOINTS AGE
apiserver 10.2.0.3:6443,10.2.4.3:6443,10.2.6.3:6443 17m
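A quick, hedged way to double-check that no Endpoints object still carries the leader-election lock (the annotation key is the same one visible on the ConfigMap output below):

```bash
# Sketch: if Endpoints were still used as the resource lock, one of them would
# carry the leader annotation; no output is expected here on 3.9.24 and newer.
oc get endpoints -n kube-service-catalog -o yaml \
  | grep 'control-plane.alpha.kubernetes.io/leader'
```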
(1) Get the leader pod by checking the ConfigMap; it is pod "controller-manager-vx2hs", running on "qe-jiazha-3920-gcemaster-etcd-2".
[root@qe-jiazha-3920-gcemaster-etcd-1 ~]# oc get cm openshift-master-controllers -o yaml -n kube-system
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"qe-jiazha-3920-gcemaster-etcd-2","leaseDurationSeconds":15,"acquireTime":"2018-04-20T08:14:08Z","renewTime":"2018-04-20T08:53:55Z","leaderTransitions":0}'
creationTimestamp: 2018-04-20T08:14:08Z
name: openshift-master-controllers
namespace: kube-system
resourceVersion: "49085"
selfLink: /api/v1/namespaces/kube-system/configmaps/openshift-master-controllers
uid: cc64bf03-4472-11e8-b925-42010af0002d
[root@qe-jiazha-3920-gcemaster-etcd-1 ~]# oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
apiserver-9f75n 1/1 Running 0 17m 10.2.0.3 qe-jiazha-3920-gcemaster-etcd-2
apiserver-pm27l 1/1 Running 0 17m 10.2.6.3 qe-jiazha-3920-gcemaster-etcd-1
apiserver-zhvhb 1/1 Running 0 17m 10.2.4.3 qe-jiazha-3920-gcemaster-etcd-3
controller-manager-lmp26 1/1 Running 0 17m 10.2.4.4 qe-jiazha-3920-gcemaster-etcd-3
controller-manager-ssnfg 1/1 Running 2 17m 10.2.6.4 qe-jiazha-3920-gcemaster-etcd-1
controller-manager-vx2hs 1/1 Running 0 17m 10.2.0.4 qe-jiazha-3920-gcemaster-etcd-2
(2) But the real leader pod is "controller-manager-ssnfg", running on node "qe-jiazha-3920-gcemaster-etcd-1". I think this is a problem, so I changed the status to "ASSIGNED"; please correct me if I'm wrong.
[root@qe-jiazha-3920-gcemaster-etcd-1 ~]# oc logs -f controller-manager-ssnfg
I0420 08:34:32.323932 1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0420 08:34:32.374433 1 hyperkube.go:188] Service Catalog version v3.9.24 (built 2018-04-18T18:07:34Z)
I0420 08:34:32.408271 1 leaderelection.go:174] attempting to acquire leader lease...
E0420 08:34:32.500860 1 event.go:260] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"service-catalog-controller-manager", GenerateName:"", Namespace:"kube-service-catalog", SelfLink:"/api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager", UID:"99cdcde5-4475-11e8-a149-42010af0002d", ResourceVersion:"44897", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63659810051, loc:(*time.Location)(0x31c6c00)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"controller-manager-ssnfg-external-service-catalog-controller\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2018-04-20T08:34:11Z\",\"renewTime\":\"2018-04-20T08:34:32Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil)}' due to: 'no kind is registered for the type v1.ConfigMap'. Will not report event: 'Normal' 'LeaderElection' 'controller-manager-ssnfg-external-service-catalog-controller became leader'
I0420 08:34:32.500984 1 leaderelection.go:184] successfully acquired lease kube-service-catalog/service-catalog-controller-manager
I believe the command you're using to find the leader for the catalog is incorrect. I'm guessing there's a configmap in the catalog namespace that is used, not in kube-system. (I don't know exactly what it is - the command I gave earlier was only an example.)
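As a sketch of that suggestion, one could look for a resource-lock ConfigMap in the catalog namespace and pull out just the leader annotation; the ConfigMap name used here is an assumption that happens to match the one shown in the next comment.

```bash
# Sketch: list ConfigMaps in the catalog namespace, then extract the leader
# annotation from the lock ConfigMap (dots in the key escaped per jsonpath rules).
oc get cm -n kube-service-catalog
oc get cm service-catalog-controller-manager -n kube-service-catalog \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
```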
Jeff,

Oh, maybe I misunderstood; the ConfigMap "openshift-master-controllers" has nothing to do with the leader election of the "service-catalog", right? Sorry about that.
Now we can see the ConfigMap "service-catalog-controller-manager" in the kube-service-catalog namespace, as below:
[root@qe-jiazha-3923-gcemaster-etcd-1 ~]# oc get cm service-catalog-controller-manager -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"controller-manager-d6pvk-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-04-23T02:39:57Z","renewTime":"2018-04-23T02:49:19Z","leaderTransitions":1}'
creationTimestamp: 2018-04-23T02:04:56Z
name: service-catalog-controller-manager
namespace: kube-service-catalog
resourceVersion: "52499"
selfLink: /api/v1/namespaces/kube-service-catalog/configmaps/service-catalog-controller-manager
uid: b8571c0b-469a-11e8-8685-42010af00019
It works well, and we can see "leaderTransitions" change to 1 after one leader change. But "Events" is <none>; is that expected? Why isn't the leader change notice displayed in the "Events" field? Or am I missing something?
[root@qe-jiazha-3923-gcemaster-etcd-1 ~]# oc describe service-catalog-controller-manager
the server doesn't have a resource type "service-catalog-controller-manager"
[root@qe-jiazha-3923-gcemaster-etcd-1 ~]# oc describe cm service-catalog-controller-manager
Name: service-catalog-controller-manager
Namespace: kube-service-catalog
Labels: <none>
Annotations: control-plane.alpha.kubernetes.io/leader={"holderIdentity":"controller-manager-d6pvk-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-04-23T02:39:57Z","renewTime":"20...
Data
====
Events: <none>
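For completeness, a sketch of the failover check itself; the pod name below is illustrative, and holderIdentity/leaderTransitions are the fields visible in the annotation shown above.

```bash
# Sketch: force a leadership change and confirm the lock moves to another pod.
# 1. Read the current holder from the lock ConfigMap.
oc get cm service-catalog-controller-manager -n kube-service-catalog \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 2. Delete the pod named in holderIdentity (illustrative name below).
oc delete pod controller-manager-d6pvk -n kube-service-catalog

# 3. Re-read the annotation: holderIdentity should change and leaderTransitions
#    should increment within roughly the 15-second lease duration.
oc get cm service-catalog-controller-manager -n kube-service-catalog \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
```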
I don't know why the events are missing. The configmap resource lock implements RecordEvent just like the endpoint resource lock.

Jeff, I think we should fix this issue, so I opened a new bug, 1571088, to track the missing-event issue.

Removing "NeedsTestCase" since a test case already covers it.

Since bug 1571088 is only fixed in 3.10, I don't think it's a good choice to ship this service catalog HA feature in 3.9.z. Below is the latest test on 3.9.z; the bug 1571088 issue (missing events) still exists.

[root@qe-jiazha-39master-etcd-1 ~]# oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
apiserver-77w6w 1/1 Running 0 2h 10.2.2.4 qe-jiazha-39master-etcd-1
apiserver-78l5b 1/1 Running 0 2h 10.2.0.4 qe-jiazha-39master-etcd-3
apiserver-wpxql 1/1 Running 0 2h 10.2.4.3 qe-jiazha-39master-etcd-2
controller-manager-5x2rc 1/1 Running 0 8s 10.2.2.7 qe-jiazha-39master-etcd-1
controller-manager-dpgmb 1/1 Running 0 8m 10.2.4.7 qe-jiazha-39master-etcd-2
controller-manager-rdc7j 1/1 Running 0 2h 10.2.0.5 qe-jiazha-39master-etcd-3
[root@qe-jiazha-39master-etcd-1 ~]# oc describe cm service-catalog-controller-manager
Name: service-catalog-controller-manager
Namespace: kube-service-catalog
Labels: <none>
Annotations: control-plane.alpha.kubernetes.io/leader={"holderIdentity":"controller-manager-k2tpk-external-service-catalog-controller","leaseDurationSeconds":15,"acquireTime":"2018-07-26T05:16:46Z","renewTime":"20...
Data
====
Events: <none>
[root@qe-jiazha-39master-etcd-1 ~]# oc version
oc v3.9.38
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://qe-jiazha-39lb-1.0725-tms.qe.rhcloud.com
openshift v3.9.38
kubernetes v1.9.1+a0ce1bc657

Bug 1571088 is limited to setting the events on the configmap when a leadership change occurs. The change information is still logged in the controller logs when a new pod takes leadership, so this comes down to a serviceability issue of being able to easily understand leadership changes. The entries above indicate HA is working properly with the exception of the configmap events. The bottom line is that HA is working for the controller manager: leadership properly changes when the current leader is made unavailable. We agree it would be ideal to fix all bugs in all releases, but we have to determine where to focus our limited development resources. This decision was not made lightly, and there was much discussion about whether 1571088 should be backported to 3.9.z; the end decision was "no". I will make sure to include the test in future discussions about whether specific bugs will be backported. I'm going to put this back to ON_QA, as the core issue in this bug has been addressed in 3.9.z.

Jay, thanks for your patient explanation! Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2335
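Since the ConfigMap events are missing on 3.9.z (bug 1571088), a short sketch of how leadership changes can still be observed, per the explanation above; the pod name is illustrative.

```bash
# Sketch: on 3.9.z no LeaderElection events are recorded on the ConfigMap,
# but the controller logs still show when a pod takes over leadership.
oc get events -n kube-service-catalog | grep -i leader   # expected to be empty on 3.9.z
oc logs controller-manager-5x2rc -n kube-service-catalog | grep -i 'acquired lease'
```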