Bug 1261981 - kube-apiserver init failing to start when spawned 3 in parallel
Status: POST
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kubernetes
Version: 7.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jan Chaloupka
atomic-bugs@redhat.com
: Extras
Depends On:
Blocks: 1203710 1298318
Reported: 2015-09-10 10:33 EDT by Jaroslav Henner
Modified: 2016-11-03 19:54 EDT
CC List: 8 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1301217
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Jaroslav Henner 2015-09-10 10:33:49 EDT
Description of problem:

My script creates an etcd cluster from 3 Atomic nodes, which works fine. The next step is to configure and restart kube-apiserver on each of the nodes. When the restarts are done serially, this works 100% of the time. When done in parallel, it fails in 27 out of 41 cases with:

> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10829]: F0909 19:14:39.705468   10829 controller.go:80] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: another caller has already initialized the resource
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Failed to start Kubernetes API Server.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Unit kube-apiserver.service entered failed state.

Then, as I understand it, systemd tries to restart the service, and the second attempt works correctly:
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Stopping Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...

but the initial failure causes the start command to return a nonzero exit code.

> [aa.bb.cc.dd] sudo: systemctl start kube-apiserver
> Fatal error: sudo() received nonzero return code 1 while executing!
>
> Requested: systemctl start kube-apiserver
> Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "systemctl start kube-apiserver"
>
> Aborting.
> [aa.bb.cc.dd] out: Job for kube-apiserver.service failed. See 'systemctl status kube-apiserver.service' and 'journalctl -xn' for details.
> [aa.bb.cc.dd] out:
>
> [aa.bb.cc.dd] sudo: systemctl status kube-apiserver -l || true
> [aa.bb.cc.dd] out: kube-apiserver.service - Kubernetes API Server
> [aa.bb.cc.dd] out:    Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; disabled)
> [aa.bb.cc.dd] out:    Active: activating (start) since Wed 2015-09-09 19:14:39 UTC; 387ms ago
> [aa.bb.cc.dd] out:      Docs: https://github.com/GoogleCloudPlatform/kubernetes
> [aa.bb.cc.dd] out:  Main PID: 10834 (kube-apiserver)
> [aa.bb.cc.dd] out:    CGroup: /system.slice/kube-apiserver.service
> [aa.bb.cc.dd] out:            └─10834 /usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd_servers=http://172.16.80.10:2379,http://172.16.80.4:2379,http://172.16.80.8:2379 --address=0.0.0.0 --allow_privileged=false --service-cluster-ip-range=192.168.0.0/24 --admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota
> [aa.bb.cc.dd] out:
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823584   10834 plugins.go:69] No cloud provider specified.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823830   10834 master.go:273] Node port range unspecified. Defaulting to 30000-32767.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.824090   10834 master.go:295] Will report 172.16.80.8 as public IP address.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] listing is available at https://172.16.80.8:6443/swaggerapi/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] https://172.16.80.8:6443/swaggerui/ is mapped to folder /swagger-ui/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853735   10834 reflector.go:136] Failed to list *api.ResourceQuota: Get http://0.0.0.0:8080/api/v1/resourcequotas: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853846   10834 reflector.go:136] Failed to list *api.LimitRange: Get http://0.0.0.0:8080/api/v1/limitranges: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853904   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853971   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out:

I am not sure whether this is expected behavior or not, but I don't like it much. The problematic part is most likely

func (e *Etcd) CreateOrUpdate(snapshot *api.RangeAllocation) error {

https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/pkg/registry/service/allocator/etcd/etcd.go#L210 
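
To illustrate the race, here is a minimal, self-contained Go sketch. This is not the kubernetes code; all names below are made up. Each of the three apiservers reads an empty allocation record and then attempts a guarded create/update of the initial serviceipallocation snapshot. Because the write is conditional on the resource version the caller read, only the first apiserver wins; the others get a conflict ("another caller has already initialized the resource"), which the controller treats as fatal (the F0909 ... controller.go:80 line above).

package main

import (
	"errors"
	"fmt"
	"sync"
)

// allocation stands in for the serviceipallocation record kept in etcd.
type allocation struct {
	version int // 0 means "not created yet"
	data    string
}

// store stands in for etcd with a compare-and-swap write.
type store struct {
	mu  sync.Mutex
	obj allocation
}

var errConflict = errors.New("another caller has already initialized the resource")

// createOrUpdate only writes if the stored version still matches the
// version the caller read earlier, the way the guarded etcd update does.
func (s *store) createOrUpdate(expectedVersion int, data string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.obj.version != expectedVersion {
		return errConflict
	}
	s.obj = allocation{version: expectedVersion + 1, data: data}
	return nil
}

func main() {
	s := &store{}
	var wg sync.WaitGroup
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Every apiserver starts from "nothing there yet" (version 0)
			// and tries to create the initial allocation snapshot.
			err := s.createOrUpdate(0, fmt.Sprintf("initialized by apiserver %d", id))
			if err != nil {
				// In the real binary this conflict is fatal, which is why
				// systemd sees a failed start.
				fmt.Printf("apiserver %d: %v\n", id, err)
				return
			}
			fmt.Printf("apiserver %d: initialization succeeded\n", id)
		}(i)
	}
	wg.Wait()
}

Running this typically prints one "initialization succeeded" line and two conflict lines, which mirrors what happens when three kube-apiservers initialize the allocation in parallel.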


Version-Release number of selected component (if applicable):
Kubernetes v1.0.0-290-gb2dafdaef5acea on Atomic host


How reproducible:
27 out of 41 attempts.

Steps to Reproduce:
See the description above: set up an etcd cluster on 3 Atomic nodes, then configure and restart kube-apiserver on all of them in parallel.


Actual results:
The init script exits with a nonzero status.


Expected results:
The init script succeeds.

Additional info:
Comment 1 Jaroslav Henner 2015-09-10 10:34:41 EDT
I forgot to note that the subsequent successful start attempt apparently hides the original error message from "systemctl status kube-apiserver". One has to search the logs for it:

> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823584   10834 plugins.go:69] No cloud provider specified.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823830   10834 master.go:273] Node port range unspecified. Defaulting to 30000-32767.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.824090   10834 master.go:295] Will report 172.16.80.8 as public IP address.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] listing is available at https://172.16.80.8:6443/swaggerapi/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] https://172.16.80.8:6443/swaggerui/ is mapped to folder /swagger-ui/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853735   10834 reflector.go:136] Failed to list *api.ResourceQuota: Get http://0.0.0.0:8080/api/v1/resourcequotas: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853846   10834 reflector.go:136] Failed to list *api.LimitRange: Get http://0.0.0.0:8080/api/v1/limitranges: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853904   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853971   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:40.239197   10834 server.go:441] Serving securely on 0.0.0.0:6443
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:40.239230   10834 server.go:483] Serving insecurely on 0.0.0.0:8080
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:40.631190   10834 repair.go:52] unable to persist the updated port allocations: servicenodeportallocation "" cannot be updated: the provided resource version does not match
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:40.784381   10834 repair.go:67] unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: the provided resource version does not match
> [aa.bb.cc.dd] out:

Some restart attempts show only one of the "unable to persist" messages.
Comment 3 Jeremy Eder 2015-10-22 08:34:14 EDT
I see this in a standalone OpenShift 3.0.2 environment that uses the built-in (single-node) etcd deployment.

Noted this upstream issue: https://github.com/kubernetes/kubernetes/issues/9450

We need to confirm whether this is still a bug in OpenShift 3.1...
Comment 4 Ryan Howe 2015-12-30 13:49:55 EST
A similar issue is seen in a 3.1 HA setup. The error message is slightly different:

Dec 30 13:16:42 master3.example.com atomic-openshift-master-api[59189]: F1230 13:16:42.932154   59189 controller.go:83] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: the provided resource version does not match


https://bugzilla.redhat.com/show_bug.cgi?id=1294513
Comment 5 Andy Goldstein 2016-01-20 09:37:04 EST
Upstream issue: https://github.com/kubernetes/kubernetes/issues/19848

Supporting PR: https://github.com/kubernetes/kubernetes/pull/16067

Once 16067 is merged, we can implement a fix for this bug.
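
For illustration only (a hypothetical sketch, not the code from PR 16067 or PR 20007): the general shape of a fix is to treat the initialization conflict as a benign race, re-read the latest record, and retry instead of exiting fatally, so concurrently started apiservers all come up. All types and names below are made up.

package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConflict = errors.New("resource version does not match")

// record stands in for the serviceipallocation object in etcd.
type record struct {
	mu      sync.Mutex
	version int
	data    string
}

func (r *record) read() (int, string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.version, r.data
}

// compareAndSet writes only if the caller saw the latest version.
func (r *record) compareAndSet(sawVersion int, data string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.version != sawVersion {
		return errConflict
	}
	r.version++
	r.data = data
	return nil
}

// initialize retries on conflict instead of treating it as fatal.
func initialize(r *record, id, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		v, _ := r.read() // refresh the latest resource version
		err := r.compareAndSet(v, fmt.Sprintf("initialized (last writer: apiserver %d)", id))
		if err == nil {
			return nil
		}
		if errors.Is(err, errConflict) {
			continue // another apiserver got there first; re-read and retry
		}
		return err
	}
	return fmt.Errorf("still conflicting after %d attempts", maxAttempts)
}

func main() {
	r := &record{}
	var wg sync.WaitGroup
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := initialize(r, id, 5); err != nil {
				fmt.Printf("apiserver %d: %v\n", id, err)
				return
			}
			fmt.Printf("apiserver %d: ok\n", id)
		}(i)
	}
	wg.Wait()
}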
Comment 6 Dana Safford 2016-01-21 22:54:38 EST
American Express says they are now blocked by this bug. I set the "Customer Escalation" flag.
Comment 11 Andy Goldstein 2016-02-01 12:50:15 EST
Upstream fix: https://github.com/kubernetes/kubernetes/pull/20007
Comment 12 Jan Chaloupka 2016-02-01 13:05:08 EST
The fix is not shipped in the kubernetes package for 7.2.2, so the state should be at most POST. Still, it will do no harm to apply the patch and include this issue in 7.2.3.

What do you think, Andy?
Comment 13 Andy Goldstein 2016-02-01 13:09:34 EST
Let's apply it for 7.2.3
Comment 15 Mike McCune 2016-03-28 18:44:25 EDT
This bug was accidentally moved from POST to MODIFIED by an error in automation; please contact mmccune@redhat.com with any questions.
