Bug 1486416 - [free-int] Core file generated by OCP 3.7
Summary: [free-int] Core file generated by OCP 3.7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Dan Mace
QA Contact: zhou ying
URL:
Whiteboard:
Duplicates: 1477233
Depends On:
Blocks: 1519277
 
Reported: 2017-08-29 17:45 UTC by Justin Pierce
Modified: 2017-11-30 14:17 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1519277
Environment:
Last Closed: 2017-11-28 22:08:20 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
  System:       Red Hat Product Errata
  ID:           RHSA-2017:3188
  Private:      0
  Priority:     normal
  Status:       SHIPPED_LIVE
  Summary:      Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
  Last Updated: 2017-11-29 02:34:54 UTC

Description Justin Pierce 2017-08-29 17:45:13 UTC
Description of problem:
Operations cluster is reporting out-of-space conditions due to core files being generated. Attaching an example created by a relatively recent 3.7 install.

[/var/log/origin] ls
-rw-------. 1 root root 522891264 Aug 29 03:22 core.104479


Version-Release number of selected component (if applicable):
oc v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.free-int.openshift.com:443
openshift v3.7.0-0.104.0
kubernetes v1.7.0+695f48a16f


How reproducible:
?

Steps to Reproduce:
1. Little to no load on this cluster at the time.


Actual results:
http://file.rdu.redhat.com/~jupierce/core/
core.104479.tgz - The gzipped core which was created
core-logs.tgz - Logs from all masters


Expected results:
No cores

Comment 1 Dan Mace 2017-09-05 18:22:03 UTC
Quick update while I continue investigating.

Here are the stacks of two goroutines obtained from the provided core dump:

(dlv) goroutine 14266 bt 1000
 0  0x000000000045b5b0 in runtime.systemstack_switch
    at /usr/lib/golang/src/runtime/asm_amd64.s:281
 1  0x000000000042ef51 in runtime.dopanic
    at /usr/lib/golang/src/runtime/panic.go:579
 2  0x000000000042f045 in runtime.throw
    at /usr/lib/golang/src/runtime/panic.go:596
 3  0x000000000040ce41 in runtime.mapassign
    at /usr/lib/golang/src/runtime/hashmap.go:589
 4  0x00000000010e64b5 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/v1.Convert_v1_Pod_To_api_Pod
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/v1/conversion.go:625
 5  0x00000000038e2988 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.toInternalPodOrError
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:200
 6  0x00000000038e2a49 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.podMatchesScopeFunc
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:213
 7  0x00000000038ddf92 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/generic.CalculateUsageStats
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/generic/evaluator.go:104
 8  0x00000000038e1e72 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.(*podEvaluator).UsageStats
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:156
 9  0x00000000038dd75b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota.CalculateUsage
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/resources.go:241
10  0x0000000003b2006a in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).syncResourceQuota
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:304
11  0x0000000003b1fd0b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).syncResourceQuotaFromKey
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:280
12  0x0000000003b21b7e in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).(github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.syncResourceQuotaFromKey)-fm
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:98
13  0x0000000003b2165c in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).worker.func1
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:212
14  0x0000000003b2176b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).worker.func2
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:224
15  0x0000000000578ede in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:97
16  0x00000000005784cd in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:98
17  0x000000000057839d in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:52
18  0x000000000045e191 in runtime.goexit
    at /usr/lib/golang/src/runtime/asm_amd64.s:2197


(dlv) goroutine 14260 bt 1000
 0  0x00000000004b700c in reflect.Value.Field
    at /usr/lib/golang/src/reflect/value.go:757
 1  0x000000000074c4b3 in encoding/json.(*decodeState).object
    at /usr/lib/golang/src/encoding/json/decode.go:708
 2  0x000000000074aca4 in encoding/json.(*decodeState).value
    at /usr/lib/golang/src/encoding/json/decode.go:402
 3  0x000000000074b663 in encoding/json.(*decodeState).array
    at /usr/lib/golang/src/encoding/json/decode.go:555
 4  0x000000000074ac37 in encoding/json.(*decodeState).value
    at /usr/lib/golang/src/encoding/json/decode.go:399
 5  0x000000000074a11a in encoding/json.(*decodeState).unmarshal
    at /usr/lib/golang/src/encoding/json/decode.go:184
 6  0x0000000000749ad8 in encoding/json.Unmarshal
    at /usr/lib/golang/src/encoding/json/decode.go:104
 7  0x00000000010e63d9 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/v1.Convert_v1_Pod_To_api_Pod
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/v1/conversion.go:629
 8  0x00000000038e2988 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.toInternalPodOrError
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:200
 9  0x00000000038e2a49 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.podMatchesScopeFunc
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:213
10  0x00000000038ddf92 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/generic.CalculateUsageStats
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/generic/evaluator.go:104
11  0x00000000038e1e72 in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core.(*podEvaluator).UsageStats
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/evaluator/core/pods.go:156
12  0x00000000038dd75b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota.CalculateUsage
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/quota/resources.go:241
13  0x0000000003b2006a in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).syncResourceQuota
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:304
14  0x0000000003b1fd0b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).syncResourceQuotaFromKey
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:280
15  0x0000000003b21b7e in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).(github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.syncResourceQuotaFromKey)-fm
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:98
16  0x0000000003b2165c in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).worker.func1
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:212
17  0x0000000003b2176b in github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota.(*ResourceQuotaController).worker.func2
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/resourcequota/resource_quota_controller.go:224
18  0x0000000000578ede in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:97
19  0x00000000005784cd in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:98
20  0x000000000057839d in github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait.Until
    at /builddir/build/BUILD/atomic-openshift-git-0.c420cf9/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:52
21  0x000000000045e191 in runtime.goexit
    at /usr/lib/golang/src/runtime/asm_amd64.s:2197


My theory is that this is a race between resource quota controller workers when the component configuration ConcurrentResourceQuotaSyncs value is > 2 and shared informers are used. Each worker will:

1. Receive the same *v1.Pod instance for processing
2. Pass the *v1.Pod to Convert_v1_Pod_To_api_Pod
3. Panic on a concurrent map write to the *v1.Pod's annotation map (as Convert_v1_Pod_To_api_Pod unsafely mutates its input)

I'll try to reproduce in a test.
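
To make the failure mode concrete, here's a minimal, self-contained Go sketch (illustrative types and names only, not the actual controller or conversion code). Two goroutines stand in for quota workers and both write to the annotation map of one shared object, which is enough to hit Go's fatal "concurrent map writes" error, matching the runtime.mapassign/runtime.throw frames in the core:

package main

import "sync"

// sharedPod stands in for the object handed out by a shared informer; both
// workers receive a pointer to the same instance.
type sharedPod struct {
	Annotations map[string]string
}

// mutatingConvert mimics a conversion function that unsafely writes to and
// deletes from the input's annotation map.
func mutatingConvert(p *sharedPod) {
	for i := 0; i < 10000; i++ {
		p.Annotations["pod.beta.kubernetes.io/init-containers"] = "[]"
		delete(p.Annotations, "pod.beta.kubernetes.io/init-containers")
	}
}

func main() {
	pod := &sharedPod{Annotations: map[string]string{}}

	// Two concurrent "quota workers" processing the same shared object.
	// With GOMAXPROCS > 1 this typically dies with
	// "fatal error: concurrent map writes".
	var wg sync.WaitGroup
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mutatingConvert(pod)
		}()
	}
	wg.Wait()
}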

Comment 2 Dan Mace 2017-09-05 21:26:56 UTC
The kube-controller-manager `--concurrent-resource-quota-syncs` flag (default: 5) controls the number of resource quota controller worker threads. In Kube 1.7, competing resource quota controller workers can crash the process through concurrent write attempts during pod type conversion, because the conversion function unsafely mutates its input[1]. The most obvious conditions for triggering the race are the quota controllers analyzing pods with any of the following annotations:

* PodInitContainersBetaAnnotationKey
* PodInitContainersAnnotationKey
* PodInitContainerStatusesBetaAnnotationKey
* PodInitContainerStatusesAnnotationKey

This list of conditions covers just the ones I identified easily, and isn't assumed to be exhaustive.

The problematic pod conversion mutations have been removed in Kube 1.8. However, it's not clear to me yet whether conversion functions should generally be considered thread-safe. If not, the controller code may need to be refactored to be more defensive in dealing with conversion functions. I'll open a discussion upstream.

In the meantime, a temporary stabilizing workaround would be to reduce `--concurrent-resource-quota-syncs` to zero, at the cost of quota calculation performance.

[1] https://github.com/kubernetes/kubernetes/blob/release-1.7/pkg/api/v1/conversion.go#L592
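
For illustration, a sketch of the defensive pattern being discussed (hypothetical type and helper names, not the upstream patch): deep-copy the shared object before handing it to a conversion that mutates its input, so each worker only ever writes to a private copy:

package main

import "fmt"

// podObj is a stand-in for the shared pod object from the informer cache
// (illustrative type, not the real API struct).
type podObj struct {
	Annotations map[string]string
}

// deepCopyPod copies the field the conversion is known to touch, so each
// quota worker hands the conversion a private object.
func deepCopyPod(in *podObj) *podObj {
	out := &podObj{Annotations: make(map[string]string, len(in.Annotations))}
	for k, v := range in.Annotations {
		out.Annotations[k] = v
	}
	return out
}

// unsafeConvert mimics a conversion that writes to and deletes from the
// input's annotation map.
func unsafeConvert(in *podObj) {
	in.Annotations["pod.beta.kubernetes.io/init-containers"] = "[]"
	delete(in.Annotations, "pod.beta.kubernetes.io/init-containers")
}

func main() {
	shared := &podObj{Annotations: map[string]string{"app": "demo"}}

	// The worker converts its own copy; the shared object is never mutated,
	// so concurrent workers can no longer race on its maps.
	private := deepCopyPod(shared)
	unsafeConvert(private)
	fmt.Println(shared.Annotations) // still map[app:demo]
}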

Comment 3 Dan Mace 2017-09-06 15:02:39 UTC
(In reply to Dan Mace from comment #2)
> The kube-controller-manager `--concurrent-resource-quota-syncs` flag
> (default: 5) controls the number of resource quota controller worker
> threads. In Kube 1.7, competing quota resource controller instances can
> crash due to concurrent write attempts during pod type conversion using an
> unsafe conversion function that mutates input[1]. The most obvious
> conditions for triggering the race would be the quota controllers analyzing
> pods with any of the following annotations:
> 
> * PodInitContainersBetaAnnotationKey
> * PodInitContainersAnnotationKey
> * PodInitContainerStatusesBetaAnnotationKey
> * PodInitContainerStatusesAnnotationKey

I failed to provide the serialized annotation key names to look for:

* pod.beta.kubernetes.io/init-containers
* pod.alpha.kubernetes.io/init-containers
* pod.beta.kubernetes.io/init-container-statuses
* pod.alpha.kubernetes.io/init-container-statuses
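
As an aid for spotting potentially affected pods, a small self-contained Go sketch (the key list comes from the names above; the annotation map would come from a Pod object or `oc get pod -o json` output, which is not shown here):

package main

import "fmt"

// riskyInitContainerAnnotations lists the serialized keys named above.
var riskyInitContainerAnnotations = []string{
	"pod.beta.kubernetes.io/init-containers",
	"pod.alpha.kubernetes.io/init-containers",
	"pod.beta.kubernetes.io/init-container-statuses",
	"pod.alpha.kubernetes.io/init-container-statuses",
}

// hasRiskyAnnotation reports whether a pod's annotation map contains any of
// the keys above.
func hasRiskyAnnotation(annotations map[string]string) bool {
	for _, key := range riskyInitContainerAnnotations {
		if _, ok := annotations[key]; ok {
			return true
		}
	}
	return false
}

func main() {
	example := map[string]string{"pod.beta.kubernetes.io/init-containers": "[]"}
	fmt.Println(hasRiskyAnnotation(example)) // true
}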

Comment 4 Dan Mace 2017-09-07 14:02:16 UTC
Upstream Kubernetes 1.7 PR: https://github.com/kubernetes/kubernetes/pull/52092

Comment 5 Dan Mace 2017-09-08 13:18:44 UTC
OpenShift PR: https://github.com/openshift/origin/pull/16241

Comment 10 Dan Mace 2017-09-15 19:36:30 UTC
*** Bug 1477233 has been marked as a duplicate of this bug. ***

Comment 11 zhou ying 2017-09-26 09:27:46 UTC
Can't reproduce this issue with the latest OCP 3.7; will verify it.
openshift version
openshift v3.7.0-0.127.0
kubernetes v1.7.0+80709908fd
etcd 3.2.1

Comment 15 errata-xmlrpc 2017-11-28 22:08:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

