Bug 1549902
| Summary: | internal image registry was down due to data race | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Kenjiro Nakayama <knakayam> |
| Component: | Image Registry | Assignee: | Oleg Bulatov <obulatov> |
| Status: | CLOSED ERRATA | QA Contact: | Dongbo Yan <dyan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.1 | CC: | aos-bugs, bparees, yapei |
| Target Milestone: | --- | | |
| Target Release: | 3.5.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Cause: concurrent writes to the cache. Consequence: a panic occurs. Fix: protect writes to the cache with a mutex. Result: the cache is safe to use concurrently. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1549916 1549917 | Environment: | |
| Last Closed: | 2018-04-30 05:00:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1549916, 1549917 | | |
The customer would like to keep using OCP v3.5 for their cluster, so backporting the fix [1] to v3.5.x would be ideal. However, if the v3.6.x or v3.7.x registry image is compatible with OCP 3.5, they are happy to use it on OCP 3.5. (v3.7 does not have the fix yet, though.)

[1] Make remoteBlobGetterService thread-safe https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f

Verified on:
- openshift v3.5.5.31.66
- kubernetes v1.5.2+43a9be4
- etcd 3.1.0

No changes for this bug; changing back to VERIFIED, since it was mistakenly changed by scripts.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1235
Description of problem:
After running for several hours or days, the image registry suddenly restarted on its own. Below are the critical panic logs from when the issue happened:

    dockerd-current[79274]: fatal error: concurrent map writes
    dockerd-current[79274]:
    dockerd-current[79274]: goroutine 174282 [running]:
    dockerd-current[79274]: runtime.throw(0x19ac0f9, 0x15)
    dockerd-current[79274]: /usr/lib/golang/src/runtime/panic.go:566 +0x95 fp=0xc425eab1a8 sp=0xc425eab188
    dockerd-current[79274]: runtime.mapassign1(0x1758e60, 0xc421d01b90, 0xc425eab390, 0xc425eab380)
    dockerd-current[79274]: /usr/lib/golang/src/runtime/hashmap.go:458 +0x8ef fp=0xc425eab290 sp=0xc425eab1a8
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).proxyStat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0x2530f80, 0xc4228862d0, 0xc425eab808, 0xc425fce5a5, 0x47, 0x0, 0x0, ...)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:181 +0xcf9 fp=0xc425eab720 sp=0xc425eab290
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).findCandidateRepository(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc421addb90, 0x1, 0x1, 0xc4252d5b60, 0xc421c35900, 0x2, 0x2, ...)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:228 +0x1ab fp=0xc425eab908 sp=0xc425eab720
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).Stat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:94 +0x3e1 fp=0xc425eabc38 sp=0xc425eab908
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).copyContent(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x7f8e692b1c78, 0xc4206b98a0, 0x0, ...)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:152 +0xa9 fp=0xc425eabd50 sp=0xc425eabc38
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).storeLocal(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:199 +0x1d2 fp=0xc425eabe58 sp=0xc425eabd50
    dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).ServeBlob.func1(0x7f8e69351528, 0xc425785590, 0xc425377760, 0x254fd00, 0xc421b6e300, 0xc425fce5a5, 0x47)
    dockerd-current[79274]: /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:85 +0x19f fp=0xc425eabf18 sp=0xc425eabe58

Version-Release number of selected component (if applicable):
- ose-docker-registry:v3.5.5.31

Steps to Reproduce:
1. This is a data race, so it cannot be reproduced easily. However, the logs above are evidence of it.

Actual results:
- The fatal error ("concurrent map writes") shown above occurs.

Expected results:
- No data race occurs.

Additional info:
- Upstream has already fixed it, but apparently there is no backport for the enterprise version at this moment (28 Feb).
- Make remoteBlobGetterService thread-safe https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f