Bug 1377483 - [dev-preview-int] atomic-openshift-master-controllers crash repeatedly
Summary: [dev-preview-int] atomic-openshift-master-controllers crash repeatedly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Pod
Version: 3.x
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.x
Assignee: Paul Morie
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3 1378274
 
Reported: 2016-09-19 20:43 UTC by Stefanie Forrester
Modified: 2017-02-16 22:11 UTC
CC: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1378274
Environment:
Last Closed: 2017-02-16 22:11:44 UTC
Target Upstream Version:
Embargoed:



Description Stefanie Forrester 2016-09-19 20:43:53 UTC
Description of problem:

The master controllers are unable to stay online for more than a few hours. The service crashes and leaves core files in /var/lib/origin.

[root@dev-preview-int-master-d41bf origin]# gdb /usr/bin/openshift core.19361
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/openshift...done.
[New LWP 19366]
[New LWP 19367]
[New LWP 19368]
[New LWP 19362]
[New LWP 19363]
[New LWP 19364]
[New LWP 19365]
[New LWP 19361]
[New LWP 19371]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/openshift start master controllers --config=/etc/origin/master/master-'.
Program terminated with signal 6, Aborted.
#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:110
110     /usr/lib/golang/src/runtime/sys_linux_amd64.s: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts
of file /usr/bin/openshift
Use `info auto-load python [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64
(gdb) bt
#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:110
#1  0x00000000004433a8 in runtime.dieFromSignal (sig=6) at /usr/lib/golang/src/runtime/signal1_unix.go:192
#2  0x00000000004434cf in runtime.crash () at /usr/lib/golang/src/runtime/signal1_unix.go:247
#3  0x0000000000430bce in runtime.dopanic_m (gp=0xc822381500, pc=4391174, sp=859719858496) at /usr/lib/golang/src/runtime/panic.go:659
#4  0x00000000004576b2 in runtime.dopanic.func1 () at /usr/lib/golang/src/runtime/panic.go:534
#5  0x0000000000460ad9 in runtime.systemstack () at /usr/lib/golang/src/runtime/asm_amd64.s:291
#6  0x0000000000434a30 in runtime.startTheWorldWithSema () at /usr/lib/golang/src/runtime/proc.go:986
#7  0x000000c820019500 in ?? ()
#8  0x0000000000000000 in ?? ()


# goroutine stack trace:

#0  runtime.systemstack_switch ()
    at /usr/lib/golang/src/runtime/asm_amd64.s:245
#1  0x0000000000430531 in runtime.dopanic (unused=0)
    at /usr/lib/golang/src/runtime/panic.go:535
#2  0x0000000000430106 in runtime.gopanic (e=...)
    at /usr/lib/golang/src/runtime/panic.go:481
#3  0x000000000042e8c5 in runtime.panicmem ()
    at /usr/lib/golang/src/runtime/panic.go:62
#4  0x000000000044550a in runtime.sigpanic ()
    at /usr/lib/golang/src/runtime/sigpanic_unix.go:24
#5  0x0000000001da7c6d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).processDelta (s=0xc826ae2000,
    delta=0xc82e3dd880, ~r1=..., ~r2=0)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243
#6  0x0000000001dad8a7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices.func1 (obj=...,
    ~r1=...)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:198
#7  0x00000000019359bf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.(*DeltaFIFO).Pop (f=0xc827f4f8c0, process=
    {void (interface {}, error *)} 0xc826e17f38, ~r1=..., ~r2=...)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:420
#8  0x0000000001da6563 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices (s=0xc826ae2000,
    serviceQueue=0xc827f4f8c0)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:212

Version-Release number of selected component (if applicable):

atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

How reproducible:
Several times per day.

Steps to Reproduce:
1. Start up master controllers with Online master-config.yaml settings. Wait a few hours for a crash.

Actual results:


Expected results:


Additional info:

Comment 1 Stefanie Forrester 2016-09-19 20:44:38 UTC
Related bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1374569

Comment 2 Derek Carr 2016-09-20 18:18:48 UTC
Seth -- can you figure out what may be causing the panic?

Comment 3 Seth Jennings 2016-09-20 18:46:07 UTC
vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243 is

glog.V(2).Infof("Got new %s delta for service: %v", delta.Type, namespacedName)

I'm not seeing how we could get to this point in the function with namespacedName being nil. However, delta.Type could possibly be nil, just from looking at this function.

I'll continue looking.
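
For illustration only (this is not the controller code): a minimal standalone Go program showing why a panic from a nil pointer dereference can be reported on the logging line itself; the dereference happens while the format arguments are being evaluated.

======================================
package main

import "fmt"

type service struct {
	Namespace string
	Name      string
}

func main() {
	// nil pointer, analogous to a cached lastState that was never set
	var svc *service
	// The nil dereference happens while the format arguments are evaluated,
	// so the runtime reports the panic on the logging line itself.
	fmt.Printf("Got new delta for service: %s/%s\n", svc.Namespace, svc.Name)
}
======================================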

Comment 4 Seth Jennings 2016-09-20 18:50:48 UTC
Worth noting that this whole section of code has been reworked in Kubernetes upstream in this PR:
https://github.com/kubernetes/kubernetes/pull/25189

Comment 5 Seth Jennings 2016-09-20 18:57:51 UTC
Is it the same goroutine stack trace every time?

Comment 6 Seth Jennings 2016-09-20 19:31:15 UTC
Also, can I get that core file please?

Comment 7 Timothy St. Clair 2016-09-20 19:40:40 UTC
Are

Comment 8 Timothy St. Clair 2016-09-20 19:43:41 UTC
This is the standard upstream service controller. Is there anything else going on?

Comment 9 Stefanie Forrester 2016-09-21 00:46:30 UTC
I just reproduced this issue on a regular OCP host, without any of the Online-specific configs. Here are the only non-default configs I can think of on that cluster:

kubeletArguments:
  cloud-config:
  - /etc/aws/aws.conf
  cloud-provider:
  - aws
  enable-controller-attach-detach:
  - 'true'

And...

volumeConfig:
  dynamicProvisioningEnabled: false

I'll upload the core file for this one. This is ded-int-aws-master-a84a0, from our Dedicated environment.

Comment 12 Derek Carr 2016-09-21 19:22:25 UTC
PR posted here:

https://github.com/openshift/ose/pull/377

Comment 13 Seth Jennings 2016-09-21 19:37:21 UTC
Looking at the dev-preview core.23261:

======================================
(gdb) where
#0  runtime.systemstack_switch () at /usr/lib/golang/src/runtime/asm_amd64.s:245
#1  0x0000000000430531 in runtime.dopanic (unused=0) at /usr/lib/golang/src/runtime/panic.go:535
#2  0x0000000000430106 in runtime.gopanic (e=...) at /usr/lib/golang/src/runtime/panic.go:481
#3  0x000000000042e8c5 in runtime.panicmem () at /usr/lib/golang/src/runtime/panic.go:62
#4  0x000000000044550a in runtime.sigpanic () at /usr/lib/golang/src/runtime/sigpanic_unix.go:24
#5  0x0000000001da7c6d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).processDelta (s=0xc828790500, 
    delta=0xc83440ecc0, ~r1=..., ~r2=0)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243
#6  0x0000000001dad8a7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices.func1 (obj=..., ~r1=...)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:198
#7  0x00000000019359bf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.(*DeltaFIFO).Pop (f=0xc8287b8420, process=
    {void (interface {}, error *)} 0xc836e11f38, ~r1=..., ~r2=...)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:420
#8  0x0000000001da6563 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices (s=0xc828790500, 
    serviceQueue=0xc8287b8420)
    at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:212
#9  0x0000000000463521 in runtime.goexit () at /usr/lib/golang/src/runtime/asm_amd64.s:1998
#10 0x000000c828790500 in ?? ()
#11 0x000000c8287b8420 in ?? ()
#12 0x6136393461303234 in ?? ()
#13 0x3736323465343134 in ?? ()
#14 0x6236383631376236 in ?? ()
#15 0x3737393337343936 in ?? ()
#16 0x000000c836e10d18 in ?? ()
#17 0x0000000000f700a7 in net/http.send (ireq=0x0, rt=..., deadline=..., ~r3=0x0, ~r4=...) at /usr/lib/golang/src/net/http/client.go:260
#18 0x0000000000000000 in ?? ()
(gdb) print delta
$9 = (github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.Delta *) 0xc83440ecc0
(gdb) print deltaService
$10 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api.Service *) 0x0
(gdb) print namespacedName
$11 = {Namespace = "", Name = ""}
(gdb) print cachedService
$12 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.cachedService *) 0xc8344e18c0
======================================

This is the case where the crash is on the glog call at line 243.

deltaService is nil.  That means that cachedService.lastState is nil.

Current theory is that:

- the else branch on line 238 runs, and the NamespacedName is added to the cache
- line 254 fails to find the service, so line 275 is never reached to set lastState
- on the next processDelta() the cache has an entry, but its lastState is nil
- dereferencing lastState crashes (is the Go compiler deferring the dereference until the glog line in some cases?)

Some noise here but:
======================================
(gdb) info locals
service = 0xc820088c58
deltaService = 0x0
cachedService = 0xc8344e18c0
~r0 = ""
~r0 = 
~r0 = 
message = ""
message = 
err = {tab = 0x0, data = 0x0}
err = {tab = 0x408242 <runtime.selectnbsend+82>, data = 0x35a6240}
n·2 = {Namespace = "37b5f2f55e9542e4a34dfccda09b4d81", Name = ""}
n·2 = {Namespace = , Name = }
n·2 = {
  Namespace = "Failed to process service delta. Retrying in 5m0s: AccessDenied: User: arn:aws:iam::704252977135:user/cloud_provider is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers\n\tstatus co"..., Name = ""}
namespacedName = {Namespace = "", Name = ""}
key = {Key = "ops-hello-openshift-dev-preview-int-master-968f8-coebzs/hello-openshift", Obj = {_type = 0x4261300, data = 0xc8344e18c0}}
======================================

There is an AWS error string showing up as the Namespace in one of the runtime variables. That might explain why the service lookup fails.
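
To make that theorized sequence concrete, here is a condensed, self-contained Go simulation of it. This is not the vendored servicecontroller.go source; the type and helper names below (serviceCache, getOrCreate, describeLoadBalancers) are illustrative stand-ins only.

======================================
package main

import (
	"errors"
	"fmt"
)

type Service struct {
	Namespace, Name string
}

type cachedService struct {
	lastState *Service // only set once the service has been processed successfully
}

type serviceCache map[string]*cachedService

func (c serviceCache) getOrCreate(key string) *cachedService {
	if entry, ok := c[key]; ok {
		return entry
	}
	entry := &cachedService{}
	c[key] = entry
	return entry
}

// Stand-in for the cloud-provider call that fails with AccessDenied here.
func describeLoadBalancers() error {
	return errors.New("AccessDenied: not authorized to perform: elasticloadbalancing:DescribeLoadBalancers")
}

func main() {
	cache := serviceCache{}

	// Delta 1: the service is seen for the first time; a cache entry is
	// created, but the cloud-provider call fails, so lastState is never set.
	_ = cache.getOrCreate("demo/hello-openshift")
	if err := describeLoadBalancers(); err != nil {
		fmt.Println("Failed to process service delta. Retrying:", err)
		// processing stops here on error, so the entry's lastState stays nil
	}

	// Delta 2: a later delta for the same (now deleted) service finds the
	// cached entry and uses its nil lastState as the service object; the
	// next field access panics, as in the core dump.
	entry := cache.getOrCreate("demo/hello-openshift")
	deltaService := entry.lastState // nil
	fmt.Printf("Got new Delete delta for service: %s/%s\n",
		deltaService.Namespace, deltaService.Name) // nil pointer dereference
}
======================================

Run as-is, this prints the "Failed to process service delta" retry message and then panics with a nil pointer dereference on the final Printf, mirroring the crash reported at the glog line.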

Comment 14 Andy Goldstein 2016-09-21 20:52:42 UTC
QE- to try to reproduce this issue, do the following:

1. Set up OpenShift on AWS
2. Configure the AWS cloud provider for OpenShift with an AWS access key that does not have permissions to describe load balancers in EC2.
3. Create and delete Services in OpenShift. You may need to create and then delete very quickly in a loop.

I'm not entirely certain this will cause it to panic, but it's a working theory.

Comment 16 DeShuai Ma 2016-09-22 07:42:47 UTC
QE has followed the steps from Comment 14 and tested on AWS with ELB, but still can't reproduce the bug.

Test version:
atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

steps:
1. On master and node, change AWS access key
cd /etc/sysconfig
sed -i 's/<old_aws_access_key_id>/<new_aws_access_key_id>/g' atomic-openshift-*
sed -i "s/<old_aws_secret_access_key>/<new_aws_secret_access_key>/g" atomic-openshift-*
2. Restart master and node service
systemctl restart atomic-openshift-master-api
systemctl restart atomic-openshift-master-controllers
systemctl restart atomic-openshift-node

3. On client run a loop to create/delete svc
while true; do oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/services/multi-portsvc.json; oc delete svc multi-portsvc; done

Test results:
In the environment there are lots of errors like the ones below, which indicate the account is configured as intended, but we still can't reproduce the crash:
#atomic-openshift-master-controllers log: 
servicecontroller.go:201]  Failed to process service delta. Retrying in 5s: Error getting LB for service dma-6/frontend: AccessDenied: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers
servicecontroller.go:243] Got new Sync delta for service: dma-5/frontend

Comment 18 Brenton Leanhardt 2016-09-22 12:18:05 UTC
I logged in to both of the masters QE used while attempting to reproduce the problem. The following files appear to have been set up correctly:

/etc/origin/master/master-config.yaml
/etc/sysconfig/atomic-openshift-master-controllers
/etc/origin/cloudprovider/aws.conf

I tried to create a load balancer with the credentials and hit the same error mentioned at the bottom of Comment #13:

An error occurred fetching load balancer data: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers

Comment 19 Bing Li 2016-09-30 10:11:31 UTC
This bug should be fixed now in Online 3.3.0.33. We didn't hit the same issue as in bug 1374569.

Comment 21 Derek Carr 2016-10-07 15:24:42 UTC
The aws quota issue is a duplicate of

Comment 22 Derek Carr 2016-10-07 15:25:34 UTC
*** Bug 1381745 has been marked as a duplicate of this bug. ***

Comment 23 Derek Carr 2016-10-07 15:28:40 UTC
The excessive use of AWS quota for load balancers is being worked on here:
https://bugzilla.redhat.com/show_bug.cgi?id=1367229

Comment 24 Abhishek Gupta 2016-10-13 22:39:27 UTC
The fix has been applied to dev-preview INT.

Comment 25 Bing Li 2016-10-18 06:16:15 UTC
We didn't find any core files under /var/lib/origin on the 3 masters, and the dev-preview INT environment works fine in our tests.

Comment 26 Bing Li 2016-10-20 10:35:06 UTC
For a few minutes there was a long delay in project deletion and new projects couldn't be created, just like bug 1374569. But after a few minutes the projects were cleared successfully and a new project could be created.

$ oc new-project bingli730
Error from server: projectrequests "bingli730" is forbidden: user bingli7 cannot create more than 1 project(s).
$ 
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating

After a few minutes:

$ oc get project
$ oc new-project bingli731
Now using project "bingli731" on server "https://api.dev-preview-stg.openshift.com:443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.

