Bug 1897520

Summary: After restarting nodes, the image-registry co goes into a Degraded=True state.
Product: OpenShift Container Platform
Reporter: Alisha <aprabhu>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: high
Priority: high
Version: 4.7
CC: aos-bugs, aprabhak, danili, dgilmore, jpoulin, manokuma
Target Milestone: ---
Target Release: 4.7.0
Hardware: ppc64le
OS: Linux
Doc Type: No Doc Update
Last Closed: 2021-02-24 15:33:26 UTC
Type: Bug
Bug Blocks: 1949040, 1949086

Description Alisha 2020-11-13 10:13:12 UTC
Description of problem:
After restarting nodes, the image-registry co is in a Degraded=True state.

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.7.0-0.nightly-ppc64le-2020-11-05-140212
Server Version: 4.7.0-0.nightly-ppc64le-2020-11-05-140212
Kubernetes Version: v1.19.2+6bd0f34


Steps to Reproduce:
1. Restarted a worker node.
2. The image-registry co went into a Degraded=True state (see the status check below).
3. Worked around the issue by deleting the failed image-pruner job; the image-registry co then returned to a healthy state.
4. Restarted the cluster nodes again, including the bastion; the image-registry co went Degraded=True again.
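
For reference, the operator's Degraded condition can be inspected with standard oc commands; this check is added for illustration and is not part of the original report:

# oc get co image-registry
# oc get co image-registry -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'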

Actual results:
After restarting the nodes, the image-registry co went into a Degraded=True state.

Expected results:
The image-registry co should not go Degraded=True after the cluster nodes are restarted.

Additional info:
Temporary workaround for this issue:
When the image-registry co goes Degraded=True, the following is observed in the pruner jobs:
# oc -n openshift-image-registry get jobs
NAME                      COMPLETIONS   DURATION   AGE
image-pruner-1604880000   1/1           3s         2d11h
image-pruner-1604966400   1/1           4s         35h
image-pruner-1605052800   0/1           11h        11h

The image-registry co returns to Degraded=False after the affected job is deleted.
# oc -n openshift-image-registry delete jobs/image-pruner-1605052800
job.batch "image-pruner-1605052800" deleted
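
Note that the pruner jobs are created by a CronJob, so deleting a failed job only clears the current failure; the next scheduled run creates a fresh job. Assuming the CronJob carries the operator's usual image-pruner name (not shown in the original report), this can be confirmed with:

# oc -n openshift-image-registry get cronjob image-pruner
# oc get co image-registry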

The following is observed after restarting the cluster nodes:
- The image-registry co is again in a Progressing=True state.
- The following pods have issues:
# oc get  pods --all-namespaces -o wide | grep -v Running | grep -v Completed
NAMESPACE                                          NAME                                                      READY   STATUS             RESTARTS   AGE     IP               NODE       NOMINATED NODE   READINESS GATES
openshift-image-registry                           image-pruner-1605139200-qkx2x                             0/1     Error              0          5h2m    10.131.0.154     worker-0   <none>           <none>
openshift-image-registry                           image-registry-5d4446bdd8-8fxwp                           0/1     CrashLoopBackOff   155        20h     10.128.2.20      worker-1   <none>           <none>

- The registry pod is in CrashLoopBackOff state; its logs show:
# oc logs -f po/image-registry-5d4446bdd8-8fxwp -n openshift-image-registry
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/edk2': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/java': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/openssl': File exists
mkdir: cannot create directory '/etc/pki/ca-trust/extracted/pem': File exists
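
The mkdir messages above appear to come from the container entrypoint re-running the CA trust extraction on restart and are not necessarily the crash cause; the previous exit code can be read from the container's last termination state (an illustrative check, not from the original report):

# oc -n openshift-image-registry get pod image-registry-5d4446bdd8-8fxwp -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'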

# oc logs -f po/image-pruner-1605139200-qkx2x -n openshift-image-registry
F1112 00:00:38.597063       1 helpers.go:115] error: failed to ping registry https://image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/": dial tcp 172.30.55.81:5000: i/o timeout
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc00169c580, 0xf0, 0x141)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:996 +0xac
k8s.io/klog/v2.(*loggingT).output(0x142c7740, 0xc000000003, 0x0, 0x0, 0xc000a2c620, 0x1404c036, 0xa, 0x73, 0x0)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:945 +0x1a0
k8s.io/klog/v2.(*loggingT).printDepth(0x142c7740, 0x3, 0x0, 0x0, 0x2, 0xc0018236a0, 0x1, 0x1)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:718 +0x128
k8s.io/klog/v2.FatalDepth(...)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:1449
k8s.io/kubectl/pkg/cmd/util.fatal(0xc001068000, 0xc1, 0x1)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:93 +0x1b4
k8s.io/kubectl/pkg/cmd/util.checkErr(0x12c78920, 0xc001df16b0, 0x129c4608)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:188 +0x80c
k8s.io/kubectl/pkg/cmd/util.CheckErr(...)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:115
github.com/openshift/oc/pkg/cli/admin/prune/images.NewCmdPruneImages.func1(0xc00163f340, 0xc0016cee80, 0x0, 0x8)
        /go/src/github.com/openshift/oc/pkg/cli/admin/prune/images/images.go:161 +0x1b4
github.com/spf13/cobra.(*Command).execute(0xc00163f340, 0xc0016cee00, 0x8, 0x8, 0xc00163f340, 0xc0016cee00)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:846 +0x208
github.com/spf13/cobra.(*Command).ExecuteC(0xc000859080, 0x2, 0xc000859080, 0x2)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:950 +0x29c
github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:887
main.main()
        /go/src/github.com/openshift/oc/cmd/oc/oc.go:110 +0x930
goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x142c7740)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:1131 +0x78
created by k8s.io/klog/v2.init.0
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:416 +0xe0
goroutine 9 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x142c7660)
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/klog.go:1010 +0x78
created by k8s.io/klog.init.0
        /go/src/github.com/openshift/oc/vendor/k8s.io/klog/klog.go:411 +0xe0
goroutine 66 [IO wait]:
internal/poll.runtime_pollWait(0x7fff9028ee98, 0x72, 0xfadf)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/runtime/netpoll.go:220 +0x50
internal/poll.(*pollDesc).wait(0xc0016cf418, 0x72, 0xc0018ba000, 0xfadf, 0xfadf)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/internal/poll/fd_poll_runtime.go:87 +0x40
internal/poll.(*pollDesc).waitRead(...)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0016cf400, 0xc0018ba000, 0xfadf, 0xfadf, 0x0, 0x0, 0x0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/internal/poll/fd_unix.go:159 +0x188
net.(*netFD).Read(0xc0016cf400, 0xc0018ba000, 0xfadf, 0xfadf, 0x0, 0x0, 0x103185ac)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/net/fd_posix.go:55 +0x48
net.(*conn).Read(0xc0003fa040, 0xc0018ba000, 0xfadf, 0xfadf, 0x0, 0x0, 0x0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/net/net.go:182 +0x84
crypto/tls.(*atLeastReader).Read(0xc000eae640, 0xc0018ba000, 0xfadf, 0xfadf, 0x1031aa4c, 0x223, 0xc00003e000)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/crypto/tls/conn.go:779 +0x60
bytes.(*Buffer).ReadFrom(0xc000ece280, 0x12c78660, 0xc000eae640, 0xc0018ba005, 0x123da540, 0x126d03a0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/bytes/buffer.go:204 +0xac
crypto/tls.(*Conn).readFromUntil(0xc000ece000, 0x12c7dbe0, 0xc0003fa040, 0x5, 0xc0003fa040, 0x223)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/crypto/tls/conn.go:801 +0xd4
crypto/tls.(*Conn).readRecordOrCCS(0xc000ece000, 0x0, 0x0, 0xc0013cc080)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/crypto/tls/conn.go:608 +0xf4
crypto/tls.(*Conn).readRecord(...)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/crypto/tls/conn.go:576
crypto/tls.(*Conn).Read(0xc000ece000, 0xc001896000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/crypto/tls/conn.go:1252 +0x150
bufio.(*Reader).Read(0xc001892840, 0xc0004a4658, 0x9, 0x9, 0x129c7f01, 0x10515144, 0x19982220e1ce)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/bufio/bufio.go:227 +0x238
io.ReadAtLeast(0x12c78400, 0xc001892840, 0xc0004a4658, 0x9, 0x9, 0x9, 0xc000090060, 0x105128e8, 0xd6)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/io/io.go:314 +0x84
io.ReadFull(...)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/io/io.go:333
golang.org/x/net/http2.readFrameHeader(0xc0004a4658, 0x9, 0x9, 0x12c78400, 0xc001892840, 0x0, 0x0, 0xc000ead260, 0x21a)
        /go/src/github.com/openshift/oc/vendor/golang.org/x/net/http2/frame.go:237 +0x60
golang.org/x/net/http2.(*Framer).ReadFrame(0xc0004a4620, 0xc000ead260, 0x0, 0x0, 0x0)
        /go/src/github.com/openshift/oc/vendor/golang.org/x/net/http2/frame.go:492 +0x84
golang.org/x/net/http2.(*clientConnReadLoop).run(0xc001813f98, 0x0, 0x0)
        /go/src/github.com/openshift/oc/vendor/golang.org/x/net/http2/transport.go:1794 +0xc0
golang.org/x/net/http2.(*ClientConn).readLoop(0xc000ecc300)
        /go/src/github.com/openshift/oc/vendor/golang.org/x/net/http2/transport.go:1716 +0x68
created by golang.org/x/net/http2.(*Transport).newClientConn
        /go/src/github.com/openshift/oc/vendor/golang.org/x/net/http2/transport.go:695 +0x554
goroutine 25 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x129c4530, 0x12c7c6a0, 0xc000eca090, 0x1, 0xc0000b0360)
        /go/src/github.com/openshift/oc/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x11c
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x129c4530, 0x12a05f200, 0x0, 0xc000750001, 0xc0000b0360)
        /go/src/github.com/openshift/oc/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x90
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /go/src/github.com/openshift/oc/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/apimachinery/pkg/util/wait.Forever(0x129c4530, 0x12a05f200)
        /go/src/github.com/openshift/oc/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:81 +0x50
created by k8s.io/component-base/logs.InitLogs
        /go/src/github.com/openshift/oc/vendor/k8s.io/component-base/logs/logs.go:58 +0x88
goroutine 26 [select]:
io.(*pipe).Read(0xc0014e1a40, 0xc0012d0000, 0x1000, 0x1000, 0x122a1c40, 0x1, 0xc0012d0000)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/io/pipe.go:57 +0xc8
io.(*PipeReader).Read(0xc0013c4d78, 0xc0012d0000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/io/pipe.go:134 +0x48
bufio.(*Scanner).Scan(0xc001593480, 0x0)
        /opt/rh/go-toolset-1.15/root/usr/lib/go-toolset-1.15-golang/src/bufio/scan.go:214 +0xa8
github.com/openshift/oc/pkg/cli/admin/mustgather.newPrefixWriter.func1(0xc001593480, 0x12c7e0c0, 0xc000012018, 0x127dd833, 0x17)
        /go/src/github.com/openshift/oc/pkg/cli/admin/mustgather/mustgather.go:424 +0xfc
created by github.com/openshift/oc/pkg/cli/admin/mustgather.newPrefixWriter
        /go/src/github.com/openshift/oc/pkg/cli/admin/mustgather/mustgather.go:423 +0x1b4
# oc -n openshift-image-registry get jobs
NAME                      COMPLETIONS   DURATION   AGE
image-pruner-1604880000   1/1           3s         3d5h
image-pruner-1604966400   1/1           4s         2d5h
image-pruner-1605139200   0/1           5h16m      5h16m
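
The pruner failure above is a plain connectivity error ("dial tcp ... i/o timeout") rather than an auth or TLS failure, so a natural first check is whether the registry Service still had healthy endpoints at that point; for illustration (not captured in the original report):

# oc -n openshift-image-registry get svc image-registry
# oc -n openshift-image-registry get endpoints image-registry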

Comment 1 Oleg Bulatov 2020-11-13 11:34:19 UTC
Steps to reproduce the problem:

1. get registry pods: oc get pods -n openshift-image-registry
2. rsh into a pod: oc -n openshift-image-registry rsh image-registry-584b6c5f4f-h88rr
3. kill the registry process (it always has pid=1 in the container): kill 1 (in rsh session)
4. check the pod status:

NAME                                               READY   STATUS             RESTARTS   AGE
image-registry-584b6c5f4f-h88rr                    0/1     CrashLoopBackOff   4          21m
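
The same reproducer can be run non-interactively; this is a sketch using the pod name from the output above (substitute the name reported by oc get pods):

# oc -n openshift-image-registry exec image-registry-584b6c5f4f-h88rr -- /bin/sh -c 'kill 1'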

Comment 4 Wenjing Zheng 2020-11-26 08:32:02 UTC
Image registry pod now just restarts and then goes back to Running:
$ oc get pods
NAME                                               READY   STATUS      RESTARTS   AGE
cluster-image-registry-operator-84bfd498b7-hwm4r   1/1     Running     0          8h
image-pruner-1606379280-gn9xb                      0/1     Completed   0          2m35s
image-pruner-1606379340-kcwkl                      0/1     Completed   0          95s
image-pruner-1606379400-btg4j                      0/1     Completed   0          35s
image-registry-74d66d97bf-dlstg                    1/1     Running     1          8h
image-registry-74d66d97bf-xffcp                    1/1     Running     1          8h
node-ca-7c4vr                                      1/1     Running     0          8h
node-ca-8cgzc                                      1/1     Running     0          8h
node-ca-8sd2w                                      1/1     Running     0          8h
node-ca-ml6xg                                      1/1     Running     0          8h
node-ca-sxg7z                                      1/1     Running     0          8h
node-ca-ztfxf                                      1/1     Running     0          8h
zhengwenjings-MacBook-Pro:4.7 wzheng$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-25-114114   True        False         8h      Cluster version is 4.7.0-0.nightly-2020-11-25-114114
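
The three pruner jobs above completed exactly 60 seconds apart, which suggests the pruner schedule was shortened for verification; assuming this was done through the operator's imagepruner CR (not stated in the comment), it would look something like:

# oc patch imagepruner/cluster --type merge -p '{"spec":{"schedule":"*/1 * * * *"}}'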

Comment 7 errata-xmlrpc 2021-02-24 15:33:26 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633