Description of problem:
After we upgraded the cluster to 4.8, all imagestream imports started to fail. After some investigation, we found that the allowed request timeout is shorter than the imports need. See: https://github.com/kubernetes/kubernetes/pull/96901
The above PR landed in Kubernetes v1.21, and OpenShift 4.8 is rebased onto that version.

Some logs:
1 trace.go:205] Trace[1676786077]: "Create" url:/apis/image.openshift.io/v1/namespaces/ci-op-frj9n6jn/imagestreamimports,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:image-import-controller,client:10.0.140.81,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (26-May-2021 09:20:32.145) (total time: 34000ms):
Trace[1676786077]: [34.000650239s] [34.000650239s] END
I0526 09:21:13.364660 1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:13.364699 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.159.123:2379 <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:13.364707 1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:13.364772 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01bfc3f40, {CONNECTING <nil>}
I0526 09:21:13.371195 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01bfc3f40, {READY <nil>}
I0526 09:21:13.371746 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0526 09:21:27.184293 1 trace.go:205] Trace[1909494622]: "Create" url:/apis/image.openshift.io/v1/namespaces/ci-ln-tqd27xk/imagestreamimports,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:image-import-controller,client:10.0.140.81,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (26-May-2021 09:20:53.183) (total time: 34000ms):
Trace[1909494622]: [34.000478205s] [34.000478205s] END
I0526 09:21:37.200374 1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:37.200409 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.175.171:2379 <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:37.200416 1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:37.200480 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc017aba090, {CONNECTING <nil>}
I0526 09:21:37.207769 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc017aba090, {READY <nil>}
I0526 09:21:37.208891 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0526 09:21:42.784476 1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:42.784515 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.140.81:2379 <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:42.784523 1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:42.784597 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc04706ec10, {CONNECTING <nil>}
I0526 09:21:42.792328 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc04706ec10, {READY <nil>}
I0526 09:21:42.793530 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
The openshift-apiserver imagestreamimports resource is not etcd-bound (etcd-bound requests carry an implicit ~30s timeout); instead it calls out to image registries. Oleg: "for 200 images there are about 600 requests to registries". This takes (much) longer than the 34s we are allowed in 1.21+.

Background: https://github.com/kubernetes/kubernetes/pull/96901 started capping the request timeout parameter to the effective ~30s bound we have on etcd requests. This made the much longer timeout previously used for imagestreamimports ineffective.
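For anyone hitting this, a minimal reproduction sketch (hedged: it assumes a large imagestream such as the test-is built in comment 8 below, and <apiserver-pod> is a placeholder): an import that needs more than ~34s is cut off by the capped timeout, and the matching ~34s "Create" traces show up in the openshift-apiserver pod logs.

$ oc import-image test-is --all -n openshift --v 6
# expected symptom, as in the comments below:
#   POST .../apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports 504 Gateway Timeout in ~34000 milliseconds
#   Error from server (Timeout): Timeout: request did not complete within requested timeout context deadline exceeded

$ oc logs -n openshift-apiserver <apiserver-pod> | grep -o '"Create" url:.*imagestreamimports.*(total time: [^)]*)'
# entries capped at (total time: 34000ms) correspond to the failed imports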
Sorry, I did not check this in time; I was fully occupied with burning down ON_QA bugs and other focused team work. Today I researched the above description and PR, and tested in a 4.8.0-0.nightly-2021-06-14-145150 env:

$ for i in $IS; do oc import-image $i --all -n openshift &> /dev/null; [ "$?" != "0" ] && echo "Failed to import $i" || echo "Succeeded in importing $i"; done
Succeeded in importing apicast-gateway
Succeeded in importing apicurito-ui
Succeeded in importing cli
Succeeded in importing cli-artifacts
...
Succeeded in importing tools
Succeeded in importing ubi8-openjdk-11
Succeeded in importing ubi8-openjdk-8

All imports are successful. Check OAS pod logs:

grep -o '"Create" url:/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports.*(total time: .*)' OAS_POD_LOGS | grep -o "total time: [^)]*" | sort -n -k 3
total time: 560ms
total time: 561ms
total time: 568ms
...
total time: 7103ms
total time: 7363ms
total time: 7546ms
total time: 7817ms
total time: 9728ms

Though the times vary, all imports are successful. The "transport is closing" messages are tracked in bug 1923976.
Today I double-checked the above times: 9728ms is far less than 34s, so this is not enough to exercise the fix. I managed to design a big imagestream whose import takes more than 34s and re-tested; see the next comment.
1. ** Check the top imagestreams that take the longest time to import **

IS=`oc get is -n openshift --no-headers | grep -o "^[^ ]*"`
for i in $IS; do
  oc import-image $i --all -n openshift --v 6 &> test.log
  [ "$?" != "0" ] && echo "Failed to import $i" || echo "Succeeded importing $i"
  TIME=$(grep -o "/imagestreamimports.*Created in.*$" test.log | grep -o "[0-9]* milliseconds")
  echo "$i imported in $TIME"
done > import_time.log
grep "milliseconds" import_time.log | sort -n -k 4

Got the top ones:
jboss-amq-62 imported in 5424 milliseconds
fuse7-java-openshift imported in 5467 milliseconds
jboss-datavirt64-driver-openshift imported in 5776 milliseconds
eap-cd-openshift imported in 6101 milliseconds
ruby imported in 6561 milliseconds
jboss-decisionserver64-openshift imported in 6973 milliseconds
python imported in 7068 milliseconds
apicast-gateway imported in 8064 milliseconds

2. ** We can see 5424 + ... + 8064 milliseconds ≈ 51.4s > 34s. So choose them to build one big imagestream as below **

LONG_IMPORT_IS="jboss-amq-62 fuse7-java-openshift jboss-datavirt64-driver-openshift eap-cd-openshift ruby jboss-decisionserver64-openshift python apicast-gateway"
for i in $LONG_IMPORT_IS; do
  oc get is $i -n openshift -o yaml > is.yaml
  L1=`sed -n "/^spec:/=" is.yaml`
  L2=`sed -n "/^status:/=" is.yaml`
  let L1=$L1+4 # "tags:" line
  let L2=$L2-1
  sed -n "$L1,$L2 p" is.yaml | sed "s/^    name: /    name: $i-/"
done > spec.yaml
cat > test-is.yaml << EOF
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: test-is
  namespace: openshift
spec:
  tags:
`cat spec.yaml`
EOF

3. ** Create it and import **

$ oc create -f test-is.yaml
imagestream.image.openshift.io/test-is created
$ oc import-image test-is --all -n openshift --v 6

Got:
I0617 17:27:43.186457 11765 round_trippers.go:454] POST https://api.xxxx.qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports 504 Gateway Timeout in 34078 milliseconds
I0617 17:27:43.187167 11765 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Timeout: request did not complete within requested timeout context deadline exceeded",
  "reason": "Timeout",
  "details": {},
  "code": 504
}]
F0617 17:27:43.187244 11765 helpers.go:115] Error from server (Timeout): Timeout: request did not complete within requested timeout context deadline exceeded
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000134001, 0xc000d10600, 0xa0, 0xf1)
  /builddir/build/BUILD/openshift-clients-4.8.0/__gopath/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
...
oc get is test-is -n openshift -o jsonpath='{range .status.tags[*]}{.tag}{"\n"}{end}' | wc -l # got only 64 tags imported
grep "^    name: " spec.yaml | wc -l # got 76 tags in the spec

Check openshift-apiserver pod logs:
2021-06-17T09:27:43.147878176Z I0617 09:27:43.147791 1 trace.go:205] Trace[1725700142]: "Create" url:/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports,user-agent:oc/4.8.0 (linux/amd64) kubernetes/10cec43,client:172.31.248.140,accept:application/json, */*,protocol:HTTP/2.0 (17-Jun-2021 09:27:09.146) (total time: 34001ms):
2021-06-17T09:27:43.147878176Z Trace[1725700142]: [34.001069324s] [34.001069324s] END
2021-06-17T09:28:06.574375167Z I0617 09:28:06.574324 1 client.go:360] parsed scheme: "passthrough"
2021-06-17T09:28:06.574375167Z I0617 09:28:06.574368 1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://172.31.248.137:2379 <nil> 0 <nil>}] <nil> <nil>}
2021-06-17T09:28:06.574422094Z I0617 09:28:06.574375 1 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-06-17T09:28:06.574462327Z I0617 09:28:06.574445 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0056efb60, {CONNECTING <nil>}
2021-06-17T09:28:06.584707320Z I0617 09:28:06.584656 1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0056efb60, {READY <nil>}
2021-06-17T09:28:06.586118613Z I0617 09:28:06.586085 1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

Does the above "request did not complete within requested timeout" mean the import still fails?
xxia,

> oc --request-timeout=5m create -f request.yaml -o yaml

You will need to specify a timeout, otherwise the default of 34s will be applied. See the Slack thread: https://coreos.slack.com/archives/CBN38N3MW/p1622110734332000?thread_ts=1622014795.242100&cid=CBN38N3MW
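For example (usage sketch only; test-is is the big imagestream from the earlier comments, and the 5m value matches the verification below):

$ oc import-image test-is --all -n openshift --request-timeout=5m --v 6
# with --v 6, the verbose output shows the timeout being forwarded as the ?timeout= query parameter, e.g.:
#   POST https://api.<cluster>:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports?timeout=5m0s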
Using comment 9 steps, I tried the following envs for comparison:
4.6 (4.6.0-0.nightly-2021-06-17-075809)
4.7 (4.7.17)
latest 4.8 (as above), which DOES include the attached PR code

They all reproduced the same result as comment 8 on 4.8. If adding --request-timeout=5m:
oc import-image --request-timeout=5m test-is --all -n openshift --v 6 &> test.log
they all succeeded, as below:
I0618 07:37:30.868063 18040 round_trippers.go:454] POST https://api.xxx.qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports?timeout=5m0s 201 Created in 58262 milliseconds

To compare, I also tried an env of:
4.8.0-fc.6 (created from 4.8.0-0.nightly-2021-05-27-064741), which does NOT include the attached PR code.
Got the same behavior as above.
So whether or not the attached PR code is included, the test result shows no difference! Weird. Nikolaos, Abu, any thoughts?

Anyway, since the comment 6 usage of daily imports works, and --request-timeout solves comment 8, I move it to VERIFIED.
(In reply to Xingxing Xia from comment #11)
> Using comment 9 steps, I tried the following envs for comparison:
Typo: I intended to say "Using comment 8 steps".
> 4.8.0-fc.6 (created from 4.8.0-0.nightly-2021-05-27-064741), which does NOT include the attached PR code.
> Got the same behavior as above.
> So whether or not the attached PR code is included, the test result shows no difference! Weird. Nikolaos, Abu, any thoughts?

Sorry, I double-checked again: for 4.8.0-fc.6, even with --request-timeout=5m, `oc import-image --request-timeout=5m test-is --all -n openshift --v 6 &> test.log` still fails with a timeout. But on the latest 4.8, with --request-timeout=5m, the import succeeded. So this difference shows the fix works. No confusion now!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438