Although this fix works in OSE 3.3, it does not work in OSE 3.2. QE tested on openshift v3.2.1.9-1-g2265530, kubernetes v1.2.0-36-g4a3f9c5, etcd 2.2.5. The endpoints are updated to follow the API servers' status, but deployments and builds cannot be triggered while the API server is stopped on one master.

Reproduce steps:

1. Check the current endpoint status:

# oc describe svc kubernetes -n default
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver,provider=kubernetes
Selector:          <none>
Type:              ClusterIP
IP:                172.31.0.1
Port:              https    443/TCP
Endpoints:         192.168.1.118:443,192.168.1.119:443
Port:              dns      53/UDP
Endpoints:         192.168.1.118:8053,192.168.1.119:8053
Port:              dns-tcp  53/TCP
Endpoints:         192.168.1.118:8053,192.168.1.119:8053
Session Affinity:  None
No events.

2. Stop the API server on one master:

# systemctl stop atomic-openshift-master-api

3. Check the endpoint status again; the stopped master's addresses have been removed:

# oc describe svc kubernetes -n default
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver,provider=kubernetes
Selector:          <none>
Type:              ClusterIP
IP:                172.31.0.1
Port:              https    443/TCP
Endpoints:         192.168.1.118:443
Port:              dns      53/UDP
Endpoints:         192.168.1.118:8053
Port:              dns-tcp  53/TCP
Endpoints:         192.168.1.118:8053
Session Affinity:  None
No events.

4. Create an app from the dancer-example template:

# oc new-app --template=dancer-example -n cheng

5. Check the build and pod status:

# oc get build
# oc get pod

Actual results:
The deployment and build are not triggered; both commands in step 5 return nothing.

Expected results:
The deployment and build are triggered, e.g.:

# oc get build
NAME               TYPE      FROM          STATUS    STARTED          DURATION
dancer-example-1   Source    Git@11c93c3   Running   56 seconds ago   56s
# oc get pod
NAME                     READY     STATUS    RESTARTS   AGE
dancer-example-1-build   1/1       Running   0          1m

Additional info:
6. The build was triggered once the API server stopped in step 2 was started again:

# systemctl start atomic-openshift-master-api
# oc get build
NAME               TYPE      FROM          STATUS    STARTED          DURATION
dancer-example-1   Source    Git@11c93c3   Running   56 seconds ago   56s
# oc get pod
NAME                     READY     STATUS    RESTARTS   AGE
dancer-example-1-build   1/1       Running   0          1m
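For reference, a quick way to watch the kubernetes service endpoints react while stopping and starting an API server (this watch command is a suggested aid, not part of the original reproduce steps):

# oc get endpoints kubernetes -n default -w

The -w flag streams updates, so after step 2 you can see when the stopped master's addresses actually drop out of the list.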
Could you please retest, and if it fails again, capture the information we need to be able to debug:

- Where did you test (e.g. AWS, local, ...)?
- How many masters?
- How many nodes?
- What load balancer did you use? If haproxy, include logs from it.
- Logs from all atomic-openshift-master-api services
- Logs from all atomic-openshift-master-controllers services
- Logs from all atomic-openshift-node services
- oc describe buildconfig/<name of build config>
- oc get events --all-namespaces

Thanks!
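For reference, one way to capture the requested service logs on each host, assuming the services run under systemd (the output file names are just suggestions):

# journalctl -u atomic-openshift-master-api --no-pager > master-api.log
# journalctl -u atomic-openshift-master-controllers --no-pager > master-controllers.log
# journalctl -u atomic-openshift-node --no-pager > node.log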
Also, please include the master and node config files.
@Andy Goldstein MTV2 is on OpenStack, so you may not be able to access it. The attachments exclude master1-controller-service-log; it is about 93 MB, which exceeds the attachment size limit. I will send it to you by mail instead.
@Andy Goldstein Because master1-controller-service-log also exceeds the email size limit, I have put it on my Google Drive and shared it with you.
There are a few items to point out:

1) Until https://github.com/openshift/openshift-ansible/issues/1563 is resolved, you will have to manually configure /etc/origin/master/openshift-master.kubeconfig to point either to the load balancer or to kubernetes.default.svc.cluster.local. This file is how the controllers know the URL and credentials for the masters, and out of the box it is not configured to talk to an HA endpoint. Given that the fix for this bug updates the endpoints for kubernetes.default.svc.cluster.local, I would recommend updating the config to point to that URL (see the sketch after this list).

2) The controllers talk directly to etcd to attempt to acquire the lease to become the active controller. As long as the active controller is still able to talk to etcd, it will remain active. In the event that the active controller is configured to talk only to its colocated master, and not to the load balancer or kubernetes service, it will happily continue being the active controller, even after the master goes down.

3) As mentioned before, it may take 10 to 20 seconds before a now-dead master's endpoint is removed from the list of endpoints for the kubernetes service.
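For illustration, a minimal sketch of what /etc/origin/master/openshift-master.kubeconfig might look like after repointing it at the kubernetes service. The cluster/context/user names and the base64 data are placeholders; the "server" URL is the only part that matters here, and it could equally be the load balancer's URL per item 1:

apiVersion: v1
kind: Config
clusters:
- cluster:
    certificate-authority-data: <base64-encoded CA bundle>
    server: https://kubernetes.default.svc.cluster.local:443
  name: ha-endpoint
contexts:
- context:
    cluster: ha-endpoint
    user: master-credentials
  name: default
current-context: default
users:
- name: master-credentials
  user:
    client-certificate-data: <base64-encoded client cert>
    client-key-data: <base64-encoded client key>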
@Andy Goldstein Thanks for your clarification. After manually configuring /etc/origin/master/openshift-master.kubeconfig to point to the load balancer, the deployment and build are triggered successfully. I will mark the status as VERIFIED per the discussion above; the remaining scenario is already tracked in https://github.com/openshift/openshift-ansible/issues/1563.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1466
*** Bug 1370610 has been marked as a duplicate of this bug. ***