Description of problem:

The master controllers are unable to stay online for more than a few hours. The service crashes and leaves core files in /var/lib/origin.

[root@dev-preview-int-master-d41bf origin]# gdb /usr/bin/openshift core.19361
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/openshift...done.
[New LWP 19366]
[New LWP 19367]
[New LWP 19368]
[New LWP 19362]
[New LWP 19363]
[New LWP 19364]
[New LWP 19365]
[New LWP 19361]
[New LWP 19371]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/openshift start master controllers --config=/etc/origin/master/master-'.
Program terminated with signal 6, Aborted.
#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:110
110     /usr/lib/golang/src/runtime/sys_linux_amd64.s: No such file or directory.
warning: Missing auto-load scripts referenced in section .debug_gdb_scripts of file /usr/bin/openshift
Use `info auto-load python [REGEXP]' to list them.
Missing separate debuginfos, use: debuginfo-install atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

(gdb) bt
#0  runtime.raise () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:110
#1  0x00000000004433a8 in runtime.dieFromSignal (sig=6) at /usr/lib/golang/src/runtime/signal1_unix.go:192
#2  0x00000000004434cf in runtime.crash () at /usr/lib/golang/src/runtime/signal1_unix.go:247
#3  0x0000000000430bce in runtime.dopanic_m (gp=0xc822381500, pc=4391174, sp=859719858496) at /usr/lib/golang/src/runtime/panic.go:659
#4  0x00000000004576b2 in runtime.dopanic.func1 () at /usr/lib/golang/src/runtime/panic.go:534
#5  0x0000000000460ad9 in runtime.systemstack () at /usr/lib/golang/src/runtime/asm_amd64.s:291
#6  0x0000000000434a30 in runtime.startTheWorldWithSema () at /usr/lib/golang/src/runtime/proc.go:986
#7  0x000000c820019500 in ?? ()
#8  0x0000000000000000 in ?? ()

# goroutine stack trace:
#0  runtime.systemstack_switch () at /usr/lib/golang/src/runtime/asm_amd64.s:245
#1  0x0000000000430531 in runtime.dopanic (unused=0) at /usr/lib/golang/src/runtime/panic.go:535
#2  0x0000000000430106 in runtime.gopanic (e=...) at /usr/lib/golang/src/runtime/panic.go:481
#3  0x000000000042e8c5 in runtime.panicmem () at /usr/lib/golang/src/runtime/panic.go:62
#4  0x000000000044550a in runtime.sigpanic () at /usr/lib/golang/src/runtime/sigpanic_unix.go:24
#5  0x0000000001da7c6d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).processDelta (s=0xc826ae2000, delta=0xc82e3dd880, ~r1=..., ~r2=0) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243
#6  0x0000000001dad8a7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices.func1 (obj=..., ~r1=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:198
#7  0x00000000019359bf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.(*DeltaFIFO).Pop (f=0xc827f4f8c0, process={void (interface {}, error *)} 0xc826e17f38, ~r1=..., ~r2=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:420
#8  0x0000000001da6563 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices (s=0xc826ae2000, serviceQueue=0xc827f4f8c0) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:212

Version-Release number of selected component (if applicable):
atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

How reproducible:
Several times per day.

Steps to Reproduce:
1. Start up master controllers with Online master-config.yaml settings. Wait a few hours for a crash.
2.
3.

Actual results:

Expected results:

Additional info:
Related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1374569
Seth -- can you figure out what may be causing the panic?
vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243 is:

glog.V(2).Infof("Got new %s delta for service: %v", delta.Type, namespacedName)

I'm not seeing how we could get to this point in the function without namespacedName being non-nil. However, delta.Type could possibly be nil, just looking at this function. I'll continue looking.
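For reference, the runtime.panicmem / runtime.sigpanic frames in the backtrace are what a plain nil-pointer field access looks like in Go. A minimal stand-alone sketch (not the controller code; the type name below is made up), assuming the crash is a nil dereference of a Service pointer somewhere in that function:

package main

import "fmt"

// Hypothetical stand-in for api.Service; the real controller works with a
// *api.Service pointer.
type service struct {
	Namespace, Name string
}

func main() {
	var svc *service // a Service pointer that was never populated
	// Reading a field through a nil pointer triggers a SIGSEGV that the Go
	// runtime converts into a panic via runtime.sigpanic -> runtime.panicmem,
	// which is the frame sequence in the backtrace above.
	fmt.Println(svc.Namespace)
}

Depending on inlining/optimization, the crash line reported by gdb may not be the exact statement that introduced the nil value, which could be why the trace points at the glog line.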
Worth noting that this whole section of code has been reworked in Kubernetes upstream in this PR: https://github.com/kubernetes/kubernetes/pull/25189
Is it the same goroutine stack trace every time?
Also, can I get that core file please?
Are
This is the standard upstream service controller; is there anything else going on?
I just reproduced this issue on a regular OCP host, without any of the Online-specific configs. Here are the only non-default configs I can think of on that cluster:

kubeletArguments:
  cloud-config:
  - /etc/aws/aws.conf
  cloud-provider:
  - aws
  enable-controller-attach-detach:
  - 'true'

And...

volumeConfig:
  dynamicProvisioningEnabled: false

I'll upload the core file for this one. This is ded-int-aws-master-a84a0, from our Dedicated environment.
PR posted here: https://github.com/openshift/ose/pull/377
Looking at the dev-preview core.23261:
======================================
(gdb) where
#0  runtime.systemstack_switch () at /usr/lib/golang/src/runtime/asm_amd64.s:245
#1  0x0000000000430531 in runtime.dopanic (unused=0) at /usr/lib/golang/src/runtime/panic.go:535
#2  0x0000000000430106 in runtime.gopanic (e=...) at /usr/lib/golang/src/runtime/panic.go:481
#3  0x000000000042e8c5 in runtime.panicmem () at /usr/lib/golang/src/runtime/panic.go:62
#4  0x000000000044550a in runtime.sigpanic () at /usr/lib/golang/src/runtime/sigpanic_unix.go:24
#5  0x0000000001da7c6d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).processDelta (s=0xc828790500, delta=0xc83440ecc0, ~r1=..., ~r2=0) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243
#6  0x0000000001dad8a7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices.func1 (obj=..., ~r1=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:198
#7  0x00000000019359bf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.(*DeltaFIFO).Pop (f=0xc8287b8420, process={void (interface {}, error *)} 0xc836e11f38, ~r1=..., ~r2=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:420
#8  0x0000000001da6563 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices (s=0xc828790500, serviceQueue=0xc8287b8420) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:212
#9  0x0000000000463521 in runtime.goexit () at /usr/lib/golang/src/runtime/asm_amd64.s:1998
#10 0x000000c828790500 in ?? ()
#11 0x000000c8287b8420 in ?? ()
#12 0x6136393461303234 in ?? ()
#13 0x3736323465343134 in ?? ()
#14 0x6236383631376236 in ?? ()
#15 0x3737393337343936 in ?? ()
#16 0x000000c836e10d18 in ?? ()
#17 0x0000000000f700a7 in net/http.send (ireq=0x0, rt=..., deadline=..., ~r3=0x0, ~r4=...) at /usr/lib/golang/src/net/http/client.go:260
#18 0x0000000000000000 in ?? ()

(gdb) print delta
$9 = (github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.Delta *) 0xc83440ecc0
(gdb) print deltaService
$10 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api.Service *) 0x0
(gdb) print namespacedName
$11 = {Namespace = "", Name = ""}
(gdb) print cachedService
$12 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.cachedService *) 0xc8344e18c0
======================================

This is the case where the crash is on the glog line 243: deltaService is nil, which means that cachedService.lastState is nil.

Current theory:
- the else branch on line 238 runs, and the NamespacedName is added to the cache
- line 254 fails to find the service, so line 275 is never reached to set lastState
- on the next processDelta() the cache has an entry, but lastState is nil
- dereferencing lastState crashes (is golang optimizing so there is no deref until the glog line sometimes?)
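To make that theory concrete, here is a rough, self-contained sketch of the sequence it describes. This is not the vendored servicecontroller code; the type and function names are made up for illustration, and it assumes the delta falls back to the cached lastState when the service object is missing:

package main

import "fmt"

// Hypothetical stand-ins for api.Service and the controller's cachedService.
type apiService struct {
	Namespace, Name string
}

type cachedService struct {
	lastState *apiService // only set once a delta is processed successfully
}

var cache = map[string]*cachedService{}

func getOrCreate(key string) *cachedService {
	if c, ok := cache[key]; ok {
		return c
	}
	c := &cachedService{}
	cache[key] = c
	return c
}

// processDelta mimics the suspected flow: if the delta does not carry the
// service object, fall back to the cached lastState, which may still be nil
// because the earlier attempt failed (e.g. on the AWS permission error)
// before lastState was ever assigned.
func processDelta(key string, fromDelta *apiService) {
	cached := getOrCreate(key)
	deltaService := fromDelta
	if deltaService == nil {
		deltaService = cached.lastState // nil if lastState was never set
	}
	// With deltaService == nil, these field loads panic with a nil-pointer
	// dereference, matching the runtime.panicmem frame in the backtrace.
	fmt.Printf("Got new delta for service: %s/%s\n", deltaService.Namespace, deltaService.Name)
	cached.lastState = deltaService // the real code only records state on success
}

func main() {
	// First delta: the cache entry gets created, but suppose processing fails
	// (the service lookup error) before lastState is ever recorded.
	_ = getOrCreate("dma-6/frontend")
	// Next delta for the same key arrives without the service object: the
	// cache entry exists, lastState is still nil, and the dereference panics.
	processDelta("dma-6/frontend", nil)
}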
Some noise here, but:
======================================
(gdb) info locals
service = 0xc820088c58
deltaService = 0x0
cachedService = 0xc8344e18c0
~r0 = ""
~r0 =
~r0 =
message = ""
message =
err = {tab = 0x0, data = 0x0}
err = {tab = 0x408242 <runtime.selectnbsend+82>, data = 0x35a6240}
n·2 = {Namespace = "37b5f2f55e9542e4a34dfccda09b4d81", Name = ""}
n·2 = {Namespace = , Name = }
n·2 = {Namespace = "Failed to process service delta. Retrying in 5m0s: AccessDenied: User: arn:aws:iam::704252977135:user/cloud_provider is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers\n\tstatus co"..., Name = ""}
namespacedName = {Namespace = "", Name = ""}
key = {Key = "ops-hello-openshift-dev-preview-int-master-968f8-coebzs/hello-openshift", Obj = {_type = 0x4261300, data = 0xc8344e18c0}}
======================================

There is an AWS error sitting in the Namespace field of a runtime variable. That might explain why the service lookup fails.
QE- to try to reproduce this issue, do the following: 1. Set up OpenShift on AWS 2. Configure the AWS cloud provider for OpenShift with an AWS access key that does not have permissions to describe load balancers in EC2. 3. Create and delete Services in OpenShift. You may need to create and then delete very quickly in a loop. I'm not entirely certain this will cause it to panic, but it's a working theory.
QE followed the steps from Comment 14 on AWS with ELB, but still can't reproduce the bug.

Test version: atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

Steps:
1. On the master and node, change the AWS access key:
cd /etc/sysconfig
sed -i 's/<old_aws_access_key_id>/<new_aws_access_key_id>/g' atomic-openshift-*
sed -i "s/<old_aws_secret_access_key>/<new_aws_secret_access_key>/g" atomic-openshift-*

2. Restart the master and node services:
systemctl restart atomic-openshift-master-api
systemctl restart atomic-openshift-master-controllers
systemctl restart atomic-openshift-node

3. On a client, run a loop to create/delete the svc:
while true; do oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/services/multi-portsvc.json; oc delete svc multi-portsvc; done

Test results:
The environment shows lots of errors like the ones below, which indicates the account is configured as intended, but we still can't reproduce the crash.

# atomic-openshift-master-controllers log:
servicecontroller.go:201] Failed to process service delta. Retrying in 5s: Error getting LB for service dma-6/frontend: AccessDenied: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers
servicecontroller.go:243] Got new Sync delta for service: dma-5/frontend
I logged in to both of the masters QE used while attempting to reproduce the problem. The following files appear to have been set correctly:

/etc/origin/master/master-config.yaml
/etc/sysconfig/atomic-openshift-master-controllers
/etc/origin/cloudprovider/aws.conf

I tried to create a load balancer with the credentials and hit the same error mentioned at the bottom of Comment #13:

An error occurred fetching load balancer data: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers
This bug should be fixed now in Online 3.3.0.33. We did not encounter the same issue as bug 1374569.
The aws quota issue is a duplicate of
*** Bug 1381745 has been marked as a duplicate of this bug. ***
The excessive use of AWS quota around load balancers is being worked on here: https://bugzilla.redhat.com/show_bug.cgi?id=1367229
The fix has been applied to devpreview INT.
We didn't find any core files under /var/lib/origin on the 3 masters, and the dev preview INT environment works fine in our tests.
For a few minutes there was a long delay deleting projects, and new projects could not be created, just like bug 1374569. But after a few minutes the projects were cleaned up successfully and a new project could be created.

$ oc new-project bingli730
Error from server: projectrequests "bingli730" is forbidden: user bingli7 cannot create more than 1 project(s).
$
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating

After a few minutes:
$ oc get project
$ oc new-project bingli731
Now using project "bingli731" on server "https://api.dev-preview-stg.openshift.com:443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.