1261981 – kube-apiserver init failing to start when spawned 3 in parallel

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1261981 - kube-apiserver init failing to start when spawned 3 in parallel

Summary: kube-apiserver init failing to start when spawned 3 in parallel

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kubernetes
Sub Component:
Version:	7.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jan Chaloupka
QA Contact:	atomic-bugs@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1203710 1298318
TreeView+	depends on / blocked

Reported:	2015-09-10 14:33 UTC by Jaroslav Henner
Modified:	2020-12-15 07:36 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1301217 (view as bug list)
Environment:
Last Closed:	2020-12-15 07:36:39 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Jaroslav Henner 2015-09-10 14:33:49 UTC

Description of problem:

My script is creating an etcd cluster from 3 atomic nodes. This works fine. Next step is to configure and restart kube-apiserver. When restarting it serially, it works 100% times. When done in parallel, it fails in 27 out of 41 cases on

> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10829]: F0909 19:14:39.705468   10829 controller.go:80] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: another caller has already initialized the resource
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Failed to start Kubernetes API Server.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Unit kube-apiserver.service entered failed state.

Then I think the systemd tries to restart it again, then the service works correctly
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Stopping Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...

but the failure above causes the starting command to return nonzero.

> [aa.bb.cc.dd] sudo: systemctl start kube-apiserver
> [31m
> Fatal error: sudo() received nonzero return code 1 while executing!
>
> Requested: systemctl start kube-apiserver
> Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "systemctl start kube-apiserver"
> [0m[31m
> Aborting.
> [0m[aa.bb.cc.dd] out: Job for kube-apiserver.service failed. See 'systemctl status kube-apiserver.service' and 'journalctl -xn' for details.
> [aa.bb.cc.dd] out:
>
> [aa.bb.cc.dd] sudo: systemctl status kube-apiserver -l || true
> [aa.bb.cc.dd] out: kube-apiserver.service - Kubernetes API Server
> [aa.bb.cc.dd] out:    Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; disabled)
> [aa.bb.cc.dd] out:    Active: activating (start) since Wed 2015-09-09 19:14:39 UTC; 387ms ago
> [aa.bb.cc.dd] out:      Docs: https://github.com/GoogleCloudPlatform/kubernetes
> [aa.bb.cc.dd] out:  Main PID: 10834 (kube-apiserver)
> [aa.bb.cc.dd] out:    CGroup: /system.slice/kube-apiserver.service
> [aa.bb.cc.dd] out:            └─10834 /usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd_servers=http://172.16.80.10:2379,http://172.16.80.4:2379,http://172.16.80.8:2379 --address=0.0.0.0 --allow_privileged=false --service-cluster-ip-range=192.168.0.0/24 --admission_control=NamespaceLifecycle,NamespaceExists,LimitRanger,SecurityContextDeny,ResourceQuota
> [aa.bb.cc.dd] out:
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823584   10834 plugins.go:69] No cloud provider specified.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823830   10834 master.go:273] Node port range unspecified. Defaulting to 30000-32767.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.824090   10834 master.go:295] Will report 172.16.80.8 as public IP address.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] listing is available at https://172.16.80.8:6443/swaggerapi/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] https://172.16.80.8:6443/swaggerui/ is mapped to folder /swagger-ui/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853735   10834 reflector.go:136] Failed to list *api.ResourceQuota: Get http://0.0.0.0:8080/api/v1/resourcequotas: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853846   10834 reflector.go:136] Failed to list *api.LimitRange: Get http://0.0.0.0:8080/api/v1/limitranges: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853904   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853971   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out:

I am not sure is this expected behavior or not, but I don't like it too much. The problematic part is surely the

func (e *Etcd) CreateOrUpdate(snapshot *api.RangeAllocation) error {

https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/pkg/registry/service/allocator/etcd/etcd.go#L210 


Version-Release number of selected component (if applicable):
Kubernetes v1.0.0-290-gb2dafdaef5acea on Atomic host


How reproducible:
27 out of 41 

Steps to Reproduce:
above


Actual results:
init script exit status nonzero


Expected results:
init script succeeds

Additional info:

Comment 1 Jaroslav Henner 2015-09-10 14:34:41 UTC

I have forgot to note that apparently the fact that the subsequent start
attempt succeeds is hiding the original error message (the systemctl status kube-apiserver). One have to search the logs for that:

- hide quoted text -
> aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal systemd[1]: Starting Kubernetes API Server...
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823584   10834 plugins.go:69] No cloud provider specified.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.823830   10834 master.go:273] Node port range unspecified. Defaulting to 30000-32767.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:39.824090   10834 master.go:295] Will report 172.16.80.8 as public IP address.
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] listing is available at https://172.16.80.8:6443/swaggerapi/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: [restful] 2015/09/09 19:14:39 log.go:30: [restful/swagger] https://172.16.80.8:6443/swaggerui/ is mapped to folder /swagger-ui/
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853735   10834 reflector.go:136] Failed to list *api.ResourceQuota: Get http://0.0.0.0:8080/api/v1/resourcequotas: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853846   10834 reflector.go:136] Failed to list *api.LimitRange: Get http://0.0.0.0:8080/api/v1/limitranges: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853904   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:39 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:39.853971   10834 reflector.go:136] Failed to list *api.Namespace: Get http://0.0.0.0:8080/api/v1/namespaces: dial tcp 0.0.0.0:8080: connection refused
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:40.239197   10834 server.go:441] Serving securely on 0.0.0.0:6443
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: I0909 19:14:40.239230   10834 server.go:483] Serving insecurely on 0.0.0.0:8080
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:40.631190   10834 repair.go:52] unable to persist the updated port allocations: servicenodeportallocation "" cannot be updated: the provided resource version does not match
> [aa.bb.cc.dd] out: Sep 09 19:14:40 ci-cluster-fa3cb6cf-1.novalocal kube-apiserver[10834]: E0909 19:14:40.784381   10834 repair.go:67] unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: the provided resource version does not match
> [aa.bb.cc.dd] out:

Some restart attempts are showing only one of the "unable to persist" messages

Comment 3 Jeremy Eder 2015-10-22 12:34:14 UTC

I see this on a standalone openshift 3.0.2 environment that uses the built-in etcd (single node) deployment.

Noted this upstream issue: https://github.com/kubernetes/kubernetes/issues/9450

We need to confirm this is still a bug in openshift 3.1...

Comment 4 Ryan Howe 2015-12-30 18:49:55 UTC

Similar issue seen in 3.1 HA setup.  Error message is a little different

Dec 30 13:16:42 master3.example.com atomic-openshift-master-api[59189]: F1230 13:16:42.932154   59189 controller.go:83] Unable to perform initial IP allocation check: unable to persist the updated service IP allocations: serviceipallocation "" cannot be updated: the provided resource version does not match


https://bugzilla.redhat.com/show_bug.cgi?id=1294513

Comment 5 Andy Goldstein 2016-01-20 14:37:04 UTC

Upstream issue: https://github.com/kubernetes/kubernetes/issues/19848

Supporting PR: https://github.com/kubernetes/kubernetes/pull/16067

Once 16067 is merged, we can implement a fix for this bug.

Comment 6 Dana Safford 2016-01-22 03:54:38 UTC

American Express says they are now blocked by this bug. I set the "Customer Escalation" flag.

Comment 11 Andy Goldstein 2016-02-01 17:50:15 UTC

Upstream fix: https://github.com/kubernetes/kubernetes/pull/20007

Comment 12 Jan Chaloupka 2016-02-01 18:05:08 UTC

The fix is not shipped in kubernetes for 7.2.2 so the state should be at most POST. Still, it will not do any harm to apply the patch and include this issue for 7.2.3.

What do you think Andy?

Comment 13 Andy Goldstein 2016-02-01 18:09:34 UTC

Let's apply it for 7.2.3

Comment 15 Mike McCune 2016-03-28 22:44:25 UTC

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 17 RHEL Program Management 2020-12-15 07:36:39 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Note You need to log in before you can comment on or make changes to this bug.