Bug 1549902

Summary: internal image registry was down due to data race
Product: OpenShift Container Platform
Reporter: Kenjiro Nakayama <knakayam>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Dongbo Yan <dyan>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.5.1
CC: aos-bugs, bparees, yapei
Target Milestone: ---
Target Release: 3.5.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: concurrent write to the cache. Consequence: panic occurs. Fix: protect write to the cache with mutex. Result: the cache is safe to use concurrently.
Story Points: ---
Clone Of:
: 1549916 1549917
Environment:
Last Closed: 2018-04-30 05:00:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:
Bug Blocks: 1549916, 1549917

Description Kenjiro Nakayama 2018-02-28 02:30:59 UTC
Description of problem:

After running for several hours or days, the image registry suddenly restarted on its own. Below are the critical panic logs from when the issue happened:

  dockerd-current[79274]: fatal error: concurrent map writes
  dockerd-current[79274]:
  dockerd-current[79274]: goroutine 174282 [running]:
  dockerd-current[79274]: runtime.throw(0x19ac0f9, 0x15)
  dockerd-current[79274]:         /usr/lib/golang/src/runtime/panic.go:566 +0x95 fp=0xc425eab1a8 sp=0xc425eab188
  dockerd-current[79274]: runtime.mapassign1(0x1758e60, 0xc421d01b90, 0xc425eab390, 0xc425eab380)
  dockerd-current[79274]:         /usr/lib/golang/src/runtime/hashmap.go:458 +0x8ef fp=0xc425eab290 sp=0xc425eab1a8
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).proxyStat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0x2530f80, 0xc4228862d0, 0xc425eab808, 0xc425fce5a5, 0x47, 0x0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:181 +0xcf9 fp=0xc425eab720 sp=0xc425eab290
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).findCandidateRepository(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc421addb90, 0x1, 0x1, 0xc4252d5b60, 0xc421c35900, 0x2, 0x2, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:228 +0x1ab fp=0xc425eab908 sp=0xc425eab720
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).Stat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:94 +0x3e1 fp=0xc425eabc38 sp=0xc425eab908
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).copyContent(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x7f8e692b1c78, 0xc4206b98a0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:152 +0xa9 fp=0xc425eabd50 sp=0xc425eabc38
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).storeLocal(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:199 +0x1d2 fp=0xc425eabe58 sp=0xc425eabd50
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).ServeBlob.func1(0x7f8e69351528, 0xc425785590, 0xc425377760, 0x254fd00, 0xc421b6e300, 0xc425fce5a5, 0x47)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:85 +0x19f fp=0xc425eabf18 sp=0xc425eabe58

Version-Release number of selected component (if applicable):
 - ose-docker-registry:v3.5.5.31

Steps to Reproduce:
1. This is a data race, so we cannot reproduce it easily. However, the logs above are evidence of it.

Actual results:
  - The Go panic shown above occurs.

Expected results:
  - No data race occurs and the registry keeps running.

Additional info:
  - Upstream has already fixed this, but apparently there is no backport to the enterprise version as of 28 Feb.
    - Make remoteBlobGetterService thread-safe
      https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f

Comment 2 Kenjiro Nakayama 2018-02-28 02:42:01 UTC
The customer would like to keep using OCP v3.5 for their cluster, so backporting the fix [1] to v3.5.x is ideal. However, if the v3.6.x or v3.7.x registry image is compatible with OCP 3.5, they are happy to use that instead. (v3.7 does not have the fix yet, though.)

[1] Make remoteBlobGetterService thread-safe
https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f

Comment 6 Dongbo Yan 2018-04-04 18:54:35 UTC
Verified
openshift v3.5.5.31.66
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 9 Yadan Pei 2018-04-23 08:10:08 UTC
No changes were made for this bug. Changing the status back to VERIFIED since it was mistakenly changed by scripts.

Comment 12 errata-xmlrpc 2018-04-30 05:00:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1235