Bug 1377483
| Summary: | [dev-preview-int] atomic-openshift-master-controllers crash repeatedly | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Stefanie Forrester <dakini> |
| Component: | Pod | Assignee: | Paul Morie <pmorie> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.x | CC: | abhgupta, agoldste, anli, aos-bugs, bingli, bleanhar, dakini, decarr, eparis, jeder, jokerman, mifiedle, mmccomas, sspeiche, tdawson, tstclair, wmeng, xtian, xxia |
| Target Milestone: | --- | | |
| Target Release: | 3.x | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1378274 (view as bug list) | Environment: | |
| Last Closed: | 2017-02-16 22:11:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1303130, 1378274 | | |
Description
Stefanie Forrester
2016-09-19 20:43:53 UTC
Related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1374569

Seth -- can you figure out what may be causing the panic?

vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243 is:

```go
glog.V(2).Infof("Got new %s delta for service: %v", delta.Type, namespacedName)
```

I'm not seeing how we could get to this point in the function without namespacedName being non-nil. However, delta.Type could possibly be nil, just looking at this function. I'll continue looking.
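For readers following along, here is a minimal, self-contained Go sketch (using stand-in `Service` and `NamespacedName` types, not the vendored Kubernetes ones) of why a zero-value namespacedName cannot itself cause the SIGSEGV at that line, while dereferencing a nil service pointer when building it can -- which matches the nil deltaService seen in the core dump below:

```go
// Illustration only, with stand-in types (NOT the vendored Kubernetes code):
// a value struct like namespacedName can never be nil, and formatting a nil
// pointer with %v does not panic on its own -- but dereferencing a nil
// pointer to populate the struct does.
package main

import "fmt"

// Stand-ins for api.Service and types.NamespacedName.
type Service struct {
	Namespace, Name string
}

type NamespacedName struct {
	Namespace, Name string
}

func main() {
	var svc *Service // nil pointer, as seen for deltaService in the core dump

	// Printing the nil pointer itself is safe: this prints "<nil>".
	fmt.Printf("Got new %s delta for service: %v\n", "Sync", svc)

	// Building the NamespacedName from the nil pointer is what blows up:
	// this field access panics with "invalid memory address or nil pointer
	// dereference".
	name := NamespacedName{Namespace: svc.Namespace, Name: svc.Name}
	fmt.Println(name)
}
```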
Worth noting that this whole section of code has been reworked in Kubernetes upstream in this PR: https://github.com/kubernetes/kubernetes/pull/25189

Is it the same goroutine stack trace every time? Also, can I get that core file please?

This is the standard upstream service controller. Is there anything else that is going on?

I just reproduced this issue on a regular OCP host, without any of the Online-specific configs. Here are the only non-default configs I can think of on that cluster:

```yaml
kubeletArguments:
  cloud-config:
  - /etc/aws/aws.conf
  cloud-provider:
  - aws
  enable-controller-attach-detach:
  - 'true'
```

And...

```yaml
volumeConfig:
  dynamicProvisioningEnabled: false
```

I'll upload the core file for this one. This is ded-int-aws-master-a84a0, from our Dedicated environment.

PR posted here: https://github.com/openshift/ose/pull/377

Looking at the dev-preview core.23261:

```
(gdb) where
#0  runtime.systemstack_switch () at /usr/lib/golang/src/runtime/asm_amd64.s:245
#1  0x0000000000430531 in runtime.dopanic (unused=0) at /usr/lib/golang/src/runtime/panic.go:535
#2  0x0000000000430106 in runtime.gopanic (e=...) at /usr/lib/golang/src/runtime/panic.go:481
#3  0x000000000042e8c5 in runtime.panicmem () at /usr/lib/golang/src/runtime/panic.go:62
#4  0x000000000044550a in runtime.sigpanic () at /usr/lib/golang/src/runtime/sigpanic_unix.go:24
#5  0x0000000001da7c6d in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).processDelta (s=0xc828790500, delta=0xc83440ecc0, ~r1=..., ~r2=0) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:243
#6  0x0000000001dad8a7 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices.func1 (obj=..., ~r1=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:198
#7  0x00000000019359bf in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.(*DeltaFIFO).Pop (f=0xc8287b8420, process={void (interface {}, error *)} 0xc836e11f38, ~r1=..., ~r2=...) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache/delta_fifo.go:420
#8  0x0000000001da6563 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.(*ServiceController).watchServices (s=0xc828790500, serviceQueue=0xc8287b8420) at /builddir/build/BUILD/atomic-openshift-git-0.aede597/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service/servicecontroller.go:212
#9  0x0000000000463521 in runtime.goexit () at /usr/lib/golang/src/runtime/asm_amd64.s:1998
#10 0x000000c828790500 in ?? ()
#11 0x000000c8287b8420 in ?? ()
#12 0x6136393461303234 in ?? ()
#13 0x3736323465343134 in ?? ()
#14 0x6236383631376236 in ?? ()
#15 0x3737393337343936 in ?? ()
#16 0x000000c836e10d18 in ?? ()
#17 0x0000000000f700a7 in net/http.send (ireq=0x0, rt=..., deadline=..., ~r3=0x0, ~r4=...) at /usr/lib/golang/src/net/http/client.go:260
#18 0x0000000000000000 in ?? ()
(gdb) print delta
$9 = (github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/cache.Delta *) 0xc83440ecc0
(gdb) print deltaService
$10 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api.Service *) 0x0
(gdb) print namespacedName
$11 = {Namespace = "", Name = ""}
(gdb) print cachedService
$12 = (struct github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/service.cachedService *) 0xc8344e18c0
```

This is the case where the crash is on the glog line 243. deltaService is nil. That means that cachedService.lastState is nil.

Current theory is that:

- else on 238 is run, NamespacedName is added to cache
- 254 fails to find service, therefore 275 is never reached to set lastState
- on next processDelta() cache has an entry but lastState is nil
- deref'ing lastState crashes (golang optimizing so no deref until glog line sometimes?)

Some noise here but:

```
(gdb) info locals
service = 0xc820088c58
deltaService = 0x0
cachedService = 0xc8344e18c0
~r0 = ""
~r0 =
~r0 =
message = ""
message =
err = {tab = 0x0, data = 0x0}
err = {tab = 0x408242 <runtime.selectnbsend+82>, data = 0x35a6240}
n·2 = {Namespace = "37b5f2f55e9542e4a34dfccda09b4d81", Name = ""}
n·2 = {Namespace = , Name = }
n·2 = {Namespace = "Failed to process service delta. Retrying in 5m0s: AccessDenied: User: arn:aws:iam::704252977135:user/cloud_provider is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers\n\tstatus co"..., Name = ""}
namespacedName = {Namespace = "", Name = ""}
key = {Key = "ops-hello-openshift-dev-preview-int-master-968f8-coebzs/hello-openshift", Obj = {_type = 0x4261300, data = 0xc8344e18c0}}
```

There is an AWS error as the Namespace in a runtime variable. Might explain why the service lookup fails.
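To make the theory concrete, here is a minimal, self-contained Go sketch of the suspected sequence, using stand-in types and a hypothetical nil guard. This is not the vendored controller code, and the guard is only illustrative -- the actual change went in via the PR referenced above and the upstream rework:

```go
// Simplified sketch of the theory above -- NOT the vendored
// servicecontroller.go. Stand-in types only; line numbers (238/254/275)
// refer to the real file, not to this sketch.
package main

import "fmt"

type service struct {
	Namespace, Name string
}

// cachedService mirrors the idea that the controller caches the last
// successfully processed state of a service.
type cachedService struct {
	lastState *service // stays nil if processing never got far enough
}

type serviceCache map[string]*cachedService

// getOrCreate corresponds to the "else on 238" step: an entry is added to
// the cache keyed by namespace/name, with lastState still nil.
func (c serviceCache) getOrCreate(key string) *cachedService {
	if cs, ok := c[key]; ok {
		return cs
	}
	cs := &cachedService{}
	c[key] = cs
	return cs
}

func main() {
	cache := serviceCache{}

	// First delta: the service is added to the cache, but the cloud
	// provider call fails (e.g. AccessDenied on DescribeLoadBalancers),
	// so lastState is never set.
	cache.getOrCreate("dev-preview-int/hello-openshift")

	// A later delta falls back to the cached entry and trusts lastState.
	cached := cache["dev-preview-int/hello-openshift"]
	deltaService := cached.lastState // nil

	// Without a nil check, building the namespaced name below dereferences
	// the nil pointer and the controller panics, exactly as in the core
	// dump: deltaService == 0x0, namespacedName == {"", ""}.
	if deltaService == nil {
		// One possible guard, shown for illustration only.
		fmt.Println("cached entry has no lastState; skipping/retrying delta")
		return
	}
	fmt.Printf("Got new Sync delta for service: %s/%s\n",
		deltaService.Namespace, deltaService.Name)
}
```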
QE -- to try to reproduce this issue, do the following:

1. Set up OpenShift on AWS.
2. Configure the AWS cloud provider for OpenShift with an AWS access key that does not have permissions to describe load balancers in EC2.
3. Create and delete Services in OpenShift. You may need to create and then delete very quickly in a loop.

I'm not entirely certain this will cause it to panic, but it's a working theory.

QE has followed the steps from Comment 14 on AWS with ELB, but still can't reproduce the bug.

Test version: atomic-openshift-3.3.0.31-1.git.0.aede597.el7.x86_64

Steps:

1. On master and node, change the AWS access key:

   ```
   cd /etc/sysconfig
   sed -i 's/<old_aws_access_key_id>/<new_aws_access_key_id>/g' atomic-openshift-*
   sed -i "s/<old_aws_secret_access_key>/<new_aws_secret_access_key>/g" atomic-openshift-*
   ```

2. Restart the master and node services:

   ```
   systemctl restart atomic-openshift-master-api
   systemctl restart atomic-openshift-master-controllers
   systemctl restart atomic-openshift-node
   ```

3. On a client, run a loop to create/delete the svc:

   ```
   while true; do oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/services/multi-portsvc.json; oc delete svc multi-portsvc; done
   ```

Test results: the env shows lots of errors like the ones below, which indicate the account is configured correctly, but we still can't reproduce the crash.

atomic-openshift-master-controllers log:

```
servicecontroller.go:201] Failed to process service delta. Retrying in 5s: Error getting LB for service dma-6/frontend: AccessDenied: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers
servicecontroller.go:243] Got new Sync delta for service: dma-5/frontend
```

I logged in to both of the Masters QE used while attempting to reproduce the problem. The following files appear to have been set correctly:

- /etc/origin/master/master-config.yaml
- /etc/sysconfig/atomic-openshift-master-controllers
- /etc/origin/cloudprovider/aws.conf

I tried to create a load balancer with the credentials and hit the same error mentioned at the bottom of Comment #13:

```
An error occurred fetching load balancer data: User: arn:aws:iam::531415883065:user/openshift-qe-nolb is not authorized to perform: elasticloadbalancing:DescribeLoadBalancers
```

This bug should be fixed now in online 3.3.0.33.

We didn't meet the same issue with bug 1374569. The aws quota issue is a duplicate of

*** Bug 1381745 has been marked as a duplicate of this bug. ***

The excessive use of aws quota around load balancers is being worked here: https://bugzilla.redhat.com/show_bug.cgi?id=1367229

The fix has been applied to devpreview INT.

We didn't find any core files under /var/lib/origin on the 3 masters, and the dev preview INT environment works fine in our tests. For a few minutes there was a long delay for project deletion and we couldn't create a new project, just like bug 1374569. But after a few minutes the projects were cleared successfully and a new project could be created.

```
$ oc new-project bingli730
Error from server: projectrequests "bingli730" is forbidden: user bingli7 cannot create more than 1 project(s).
$
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating
$ oc get project
NAME        DISPLAY NAME   STATUS
bingli726                  Terminating
bingli727                  Terminating
bingli728                  Terminating
```

After a few minutes:

```
$ oc get project
$ oc new-project bingli731
Now using project "bingli731" on server "https://api.dev-preview-stg.openshift.com:443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.
```