Bug 1964888

Summary: opoenshift-apiserver imagestreamimports depend on >34s timeout support, WAS: transport: loopyWriter.run returning. connection error: desc = "transport is closing"
Product: OpenShift Container Platform Reporter: Nikolaos Leandros Moraitis <nmoraiti>
Component: openshift-apiserverAssignee: Abu Kashem <akashem>
Status: CLOSED ERRATA QA Contact: Xingxing Xia <xxia>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.8CC: aaleman, akashem, aos-bugs, dmistry, mfojtik, pmuller, sttts, vlaad, wking, wlewis, xxia
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:10:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Nikolaos Leandros Moraitis 2021-05-26 09:36:48 UTC
Description of problem:


After we upgraded the cluster to the 4.8 version, all imagestream imports started to fail. After some investigation, we found out that the problem is that the timeout is less than the imports need.

See: https://github.com/kubernetes/kubernetes/pull/96901

The above PR was landed in Kubernetes v1.21 and Openshift 4.8 is rebased in that version. 



Some logs:
1 trace.go:205] Trace[1676786077]: "Create" url:/apis/image.openshift.io/v1/namespaces/ci-op-frj9n6jn/imagestreamimports,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:image-import-controller,client:10.0.140.81,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (26-May-2021 09:20:32.145) (total time: 34000ms):
Trace[1676786077]: [34.000650239s] [34.000650239s] END
I0526 09:21:13.364660       1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:13.364699       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.159.123:2379  <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:13.364707       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:13.364772       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01bfc3f40, {CONNECTING <nil>}
I0526 09:21:13.371195       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc01bfc3f40, {READY <nil>}
I0526 09:21:13.371746       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0526 09:21:27.184293       1 trace.go:205] Trace[1909494622]: "Create" url:/apis/image.openshift.io/v1/namespaces/ci-ln-tqd27xk/imagestreamimports,user-agent:openshift-controller-manager/v0.0.0 (linux/amd64) kubernetes/$Format/system:serviceaccount:openshift-infra:image-import-controller,client:10.0.140.81,accept:application/vnd.kubernetes.protobuf,application/json,protocol:HTTP/2.0 (26-May-2021 09:20:53.183) (total time: 34000ms):
Trace[1909494622]: [34.000478205s] [34.000478205s] END
I0526 09:21:37.200374       1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:37.200409       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.175.171:2379  <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:37.200416       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:37.200480       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc017aba090, {CONNECTING <nil>}
I0526 09:21:37.207769       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc017aba090, {READY <nil>}
I0526 09:21:37.208891       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I0526 09:21:42.784476       1 client.go:360] parsed scheme: "passthrough"
I0526 09:21:42.784515       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://10.0.140.81:2379  <nil> 0 <nil>}] <nil> <nil>}
I0526 09:21:42.784523       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
I0526 09:21:42.784597       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc04706ec10, {CONNECTING <nil>}
I0526 09:21:42.792328       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc04706ec10, {READY <nil>}
I0526 09:21:42.793530       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

Comment 1 Stefan Schimanski 2021-05-26 12:45:52 UTC
Openshift-apiserver's imagestreamimports resource is not etcd bound (which has an implicit 30s timeout), but calls out to the image registry. Oleg: "for 200 images there are about 600 requests to registries". This takes (much) longer than our admitted 34s in 1.21+.

Background: https://github.com/kubernetes/kubernetes/pull/96901 started to cap the request timeout parameter to the effective O(30s) seconds that we have as bound on etcd requests. This made the much longer timeout used for imagestreamimports ineffective.

Comment 6 Xingxing Xia 2021-06-16 14:32:58 UTC
Sorry did not timely check it. Was fully occupied by burning down on_qa bugs and other focused team work.

Today researched above description and PR, tested it in 4.8.0-0.nightly-2021-06-14-145150 env:
$ for i in $IS; do oc import-image $i --all -n openshift &> /dev/null; [ "$?" != "0" ] && echo "Failed to import $i" || echo "Succeeded in importing $i"; done
Succeeded in importing apicast-gateway
Succeeded in importing apicurito-ui
Succeeded in importing cli
Succeeded in importing cli-artifacts
...
Succeeded in importing tools
Succeeded in importing ubi8-openjdk-11
Succeeded in importing ubi8-openjdk-8

All imports are successful.

Check OAS pod logs:
grep -o '"Create" url:/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports.*(total time: .*)' OAS_POD_LOGS | grep -o "total time: [^)]*" | sort -n -k
total time: 560ms
total time: 561ms
total time: 568ms
...
total time: 7103ms
total time: 7363ms
total time: 7546ms
total time: 7817ms
total time: 9728ms

Though the time is various, all imports are successful.

The "transport is closing" is tracked in bug 1923976

Comment 7 Xingxing Xia 2021-06-17 09:31:15 UTC
Today I double checked above time, 9728ms is less than 34 ms, so this is not enough. I managed to design a big imagestream > 34 ms and have a rest, see next comment.

Comment 8 Xingxing Xia 2021-06-17 10:25:30 UTC
1. ** Check the top imagestreams that take longest time in import **
IS=`oc get is -n openshift --no-headers | grep -o "^[^ ]*"`
for i in $IS; do
  oc import-image $i --all -n openshift --v 6 &> test.log
  [ "$?" != "0" ] && echo "Failed to import $i" || echo "Succeeded importing $i"
  TIME=$(grep -o "/imagestreamimports.*Created in.*$" test.log | grep -o "[0-9]* milliseconds")
  echo "$i imported in $TIME"
done > import_time.log

grep "milliseconds" import_time.log | sort -n -k 4
Got the top ones:
jboss-amq-62 imported in 5424 milliseconds
fuse7-java-openshift imported in 5467 milliseconds
jboss-datavirt64-driver-openshift imported in 5776 milliseconds
eap-cd-openshift imported in 6101 milliseconds
ruby imported in 6561 milliseconds
jboss-decisionserver64-openshift imported in 6973 milliseconds
python imported in 7068 milliseconds
apicast-gateway imported in 8064 milliseconds

2. ** We can see 5424 + ... + 8064 milliseconds > 34s. So choose them to design a big imagestream as below **
LONG_IMPORT_IS="jboss-amq-62 fuse7-java-openshift jboss-datavirt64-driver-openshift eap-cd-openshift ruby jboss-decisionserver64-openshift python apicast-gateway"
for i in $LONG_IMPORT_IS; do
  oc get is $i -n openshift -o yaml > is.yaml
  L1=`sed -n "/^spec:/=" is.yaml`
  L2=`sed -n "/^status:/=" is.yaml`
  let L1=$L1+4 # "tags:" line
  let L2=$L2-1
  sed -n "$L1,$L2 p" is.yaml | sed "s/^    name: /    name: $i-/"
done > spec.yaml

cat > test-is.yaml << EOF
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: test-is
  namespace: openshift
spec:
  tags:
`cat spec.yaml`
EOF

3. ** Create it and import **
$ oc create -f test-is.yaml
imagestream.image.openshift.io/test-is created

$ oc import-image test-is --all -n openshift --v 6
Got:
I0617 17:27:43.186457   11765 round_trippers.go:454] POST https://api.xxxx.qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports 504 Gateway Timeout in 34078 milliseconds
I0617 17:27:43.187167   11765 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "Timeout: request did not complete within requested timeout context deadline exceeded",
  "reason": "Timeout",
  "details": {},
  "code": 504
}]
F0617 17:27:43.187244   11765 helpers.go:115] Error from server (Timeout): Timeout: request did not complete within requested timeout context deadline exceeded
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000134001, 0xc000d10600, 0xa0, 0xf1)
        /builddir/build/BUILD/openshift-clients-4.8.0/__gopath/src/github.com/openshift/oc/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
...

oc get is test-is -n openshift -o jsonpath='{range .status.tags[*]}{.tag}{"\n"}{end}' | wc -l # got only 64 tags imported
grep "^    name: " spec.yaml | wc -l # got 76 tags

Check openshift-apiserver pod logs:
2021-06-17T09:27:43.147878176Z I0617 09:27:43.147791       1 trace.go:205] Trace[1725700142]: "Create" url:/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports,user-agent:oc/4.8.0 (linux/amd64) kubernetes/10cec43,client:172.31.248.140,accept:application/json, */*,protocol:HTTP/2.0 (17-Jun-2021 09:27:09.146) (total time: 34001ms):
2021-06-17T09:27:43.147878176Z Trace[1725700142]: [34.001069324s] [34.001069324s] END
2021-06-17T09:28:06.574375167Z I0617 09:28:06.574324       1 client.go:360] parsed scheme: "passthrough"
2021-06-17T09:28:06.574375167Z I0617 09:28:06.574368       1 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{https://172.31.248.137:2379  <nil> 0 <nil>}] <nil> <nil>}
2021-06-17T09:28:06.574422094Z I0617 09:28:06.574375       1 clientconn.go:948] ClientConn switching balancer to "pick_first"
2021-06-17T09:28:06.574462327Z I0617 09:28:06.574445       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0056efb60, {CONNECTING <nil>}
2021-06-17T09:28:06.584707320Z I0617 09:28:06.584656       1 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc0056efb60, {READY <nil>}
2021-06-17T09:28:06.586118613Z I0617 09:28:06.586085       1 controlbuf.go:508] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

Above "request did not complete within requested timeout" means import still fails?

Comment 9 Abu Kashem 2021-06-17 14:53:06 UTC
xxia,

> oc --request-timeout=5m create -f request.yaml -o yaml

You will need to specify a timeout, otherwise the default of 34s will be applied. See the slack thread - https://coreos.slack.com/archives/CBN38N3MW/p1622110734332000?thread_ts=1622014795.242100&cid=CBN38N3MW

Comment 11 Xingxing Xia 2021-06-18 11:20:26 UTC
Using comment 9 steps, I tried in below envs to do comparison:
4.6 (4.6.0-0.nightly-2021-06-17-075809) 
4.7 (4.7.17) 
latest 4.8 (as above) which DOES include the attached PR code

They all got same result of comment 8 as 4.8. If adding --request-timeout=5m:
oc import-image --request-timeout=5m test-is --all -n openshift --v 6 &> test.log
They all succeeded as below:
I0618 07:37:30.868063   18040 round_trippers.go:454] POST https://api.xxx.qe.devcluster.openshift.com:6443/apis/image.openshift.io/v1/namespaces/openshift/imagestreamimports?timeout=5m0s 201 Created in 58262 milliseconds

To compare, I also tried in env of:
4.8.0-fc.6 (created from 4.8.0-0.nightly-2021-05-27-064741) which does NOT include the attached PR code.
Got same bahavior as above.

So, no matter DOES or NOT include the attached PR code, test result has no any difference! Weird, Nikolaos, Abu, any thought?

Anyway, since comment 6 usage of daily import works, and --request-timeout can solve comment 8, I move it VERIFIED.

Comment 12 Xingxing Xia 2021-06-18 11:42:22 UTC
(In reply to Xingxing Xia from comment #11)
> Using comment 9 steps, I tried in below envs to do comparison:
Typo, I intended to say Using comment 8 steps.

Comment 13 Xingxing Xia 2021-06-18 12:05:34 UTC
> 4.8.0-fc.6 (created from 4.8.0-0.nightly-2021-05-27-064741) which does NOT include the attached PR code.
> Got same bahavior as above.
> So, no matter DOES or NOT include the attached PR code, test result has no any difference! Weird, Nikolaos, Abu, any thought?
Sorry, I double checked again: for 4.8.0-fc.6, with --request-timeout=5m, `oc import-image --request-timeout=5m test-is --all -n openshift --v 6 &> test.log` also fails with timeout. But for latest 4.8, with --request-timeout=5m, import succeeded. So, this difference means the fix makes it! No confusion now!

Comment 15 errata-xmlrpc 2021-07-27 23:10:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438