1549902 – internal image registry was down due to data race

Summary: internal image registry was down due to data race

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	3.5.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	3.5.z
Assignee:	Oleg Bulatov
QA Contact:	Dongbo Yan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1549916 1549917

Reported:	2018-02-28 02:30 UTC by Kenjiro Nakayama
Modified:	2021-06-10 14:55 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: concurrent write to the cache. Consequence: panic occurs. Fix: protect write to the cache with mutex. Result: the cache is safe to use concurrently.
Clone Of:
Clones:	1549916 1549917
Environment:
Last Closed:	2018-04-30 05:00:57 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3391611	0	None	None	None	2018-03-26 01:20:01 UTC
Red Hat Product Errata	RHSA-2018:1235	0	None	None	None	2018-04-30 05:01:54 UTC

Description Kenjiro Nakayama 2018-02-28 02:30:59 UTC

Description of problem:

After running for several hours or days, the image registry suddenly restarted on its own. Below are the critical panic logs from when the issue happened:

  dockerd-current[79274]: fatal error: concurrent map writes
  dockerd-current[79274]:
  dockerd-current[79274]: goroutine 174282 [running]:
  dockerd-current[79274]: runtime.throw(0x19ac0f9, 0x15)
  dockerd-current[79274]:         /usr/lib/golang/src/runtime/panic.go:566 +0x95 fp=0xc425eab1a8 sp=0xc425eab188
  dockerd-current[79274]: runtime.mapassign1(0x1758e60, 0xc421d01b90, 0xc425eab390, 0xc425eab380)
  dockerd-current[79274]:         /usr/lib/golang/src/runtime/hashmap.go:458 +0x8ef fp=0xc425eab290 sp=0xc425eab1a8
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).proxyStat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0x2530f80, 0xc4228862d0, 0xc425eab808, 0xc425fce5a5, 0x47, 0x0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:181 +0xcf9 fp=0xc425eab720 sp=0xc425eab290
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).findCandidateRepository(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc421addb90, 0x1, 0x1, 0xc4252d5b60, 0xc421c35900, 0x2, 0x2, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:228 +0x1ab fp=0xc425eab908 sp=0xc425eab720
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*remoteBlobGetterService).Stat(0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/remoteblobgetter.go:94 +0x3e1 fp=0xc425eabc38 sp=0xc425eab908
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).copyContent(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x7f8e692b1c78, 0xc4206b98a0, 0x0, ...)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:152 +0xa9 fp=0xc425eabd50 sp=0xc425eabc38
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).storeLocal(0xc425377760, 0x254fd00, 0xc421b6e300, 0x7f8e69351528, 0xc425785590, 0xc425fce5a5, 0x47, 0x0, 0x0)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:199 +0x1d2 fp=0xc425eabe58 sp=0xc425eabd50
  dockerd-current[79274]: github.com/openshift/origin/pkg/dockerregistry/server.(*pullthroughBlobStore).ServeBlob.func1(0x7f8e69351528, 0xc425785590, 0xc425377760, 0x254fd00, 0xc421b6e300, 0xc425fce5a5, 0x47)
  dockerd-current[79274]:         /builddir/build/BUILD/atomic-openshift-git-0.b6f55a2/_output/local/go/src/github.com/openshift/origin/pkg/dockerregistry/server/pullthroughblobstore.go:85 +0x19f fp=0xc425eabf18 sp=0xc425eabe58

Version-Release number of selected component (if applicable):
 - ose-docker-registry:v3.5.5.31

Steps to Reproduce:
1. This is a data race, so we cannot reproduce it easily, but the logs above are evidence.

Actual results:
  - The Go fatal error shown above (concurrent map writes) occurs.

Expected results:
  - No data race occurs.

Additional info:
  - Upstream has already fixed it, but apparently there is no backport for the enterprise version at this moment (28 Feb).
    - Make remoteBlobGetterService thread-safe
      https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f

Comment 2 Kenjiro Nakayama 2018-02-28 02:42:01 UTC

The customer would like to keep using OCP v3.5 for their cluster, so backporting the fix [1] to v3.5.x would be ideal. However, if the v3.6.x or v3.7.x registry image is compatible with OCP 3.5, they are happy to use it on OCP. (v3.7 does not have the fix yet, though...)

[1] Make remoteBlobGetterService thread-safe
https://github.com/openshift/image-registry/commit/8cac4cd531f7770745f9d8aea5d349ab77e5c28f

Comment 6 Dongbo Yan 2018-04-04 18:54:35 UTC

Verified
openshift v3.5.5.31.66
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 9 Yadan Pei 2018-04-23 08:10:08 UTC

No changes for this bug. Changing the status back to VERIFIED, since it was mistakenly changed by scripts.

Comment 12 errata-xmlrpc 2018-04-30 05:00:57 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1235
