Since it landed way back in 2019-04-04 [1], there have been some bugs in the extraction match-and-replace implementation: * It uses len(value) of the marker when searching the incoming bytes. Especially for short strings like version replacement (which is new in 4.3 [2]), this meant we were only looking for five bytes for 4.2.7, which is \x00_REL. That's not very specific. It does not distinguish between \x00_RELEASE_IMAGE_LOCATION_\x00XX... and \x00_RELEASE_VERSION_LOCATION_\x00XX.... It doesn't even distinguish between the defaultVersionPadded and defaultVersionPrefix constants in pkg/version. We should search for the full marker, regardless of len(value). * I has: nextOffset := end - len(marker) ... _, wErr := w.Write(buf[:end-nextOffset]) ... copy(buf[:nextOffset], buf[end-nextOffset:end]) offset = nextOffset That is problematic for a few reasons: * It doesn't write much data. Substituting for nextOffset, we were writing to: end - nextOffset = end - (end - len(marker)) = len(marker) ``` this makes the buffer size largely meaningless, and means lots of inefficient, small reads/writes. We should try to write everything except the *last* `len(marker)`. * It ignores the amount of data written. You can want to write 1k bytes but only actually write 50 bytes with a given Write call. That didn't happen often before; because of the previous list entry we were only attempting to write a hundred or so bytes at a time. But once we start trying to write the bulk of the buffer size, it will happen more often. We should keep attempting Write until it errors out on us or we finish pushing all the bytes we no longer need. [1]: https://github.com/openshift/origin/pull/22439#event-2252366741 [2]: https://github.com/openshift/oc/pull/88
Compared with old version, the issue still could reproduce : [root@dhcp-140-138 ~]# ./oc-old version Client Version: version.Info{Major:"", Minor:"", GitVersion:"openshift-clients-4.2.0-201909020729", GitCommit:"9ff96feb1aea1217938e2f1aeaf0be091cc59728", GitTreeState:"clean", BuildDate:"2019-09-02T15:22:22Z", GoVersion:"go1.12.8", Compiler:"gc", Platform:"linux/amd64"} [root@dhcp-140-138 ~]# strace -fo strace-43 ./oc-old adm release extract --command=oc registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2019-12-17-223355 [root@dhcp-140-138 ~]# oc version Client Version: v4.4.0 [root@dhcp-140-138 ~]# strace -fo strace-44 oc adm release extract --command=oc registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2019-12-17-223355 [root@dhcp-140-138 ~]# grep write strace-44 |tail -n 20 5251 write(12, "ine a default period\\n\"\n\"\\t\\tbef"..., 16384 <unfinished ...> 5251 <... write resumed> ) = 16384 5251 write(12, "rt of the kubernetes api and als"..., 16384) = 16384 5249 write(11, "\27\3\3\0%\0\0\0\0\0\0\3\201\303\354\7\tX\320 \262\364;\32kS\322\372/\256\302\250"..., 42 <unfinished ...> 5249 <... write resumed> ) = 42 5249 write(12, "s of all of the pods in the clus"..., 16384) = 16384 5249 write(12, "xisting image, ignored otherwise"..., 16384) = 16384 5249 write(12, "\nmsgid \"\"\n\"The resource requirem"..., 16384) = 16384 5249 write(12, "h the app=nginx label\\n\"\n\"\\t\\t# "..., 16384) = 16384 5249 write(12, "# Listen on ports 5000 and 6000 "..., 16384) = 16384 5249 write(12, "<arg1> ... <argN>\\n\"\n\"\\n\"\n\"\\t\\t#"..., 16384 <unfinished ...> 5249 <... write resumed> ) = 16384 5249 write(12, " autoscaler can automatically in"..., 16384 <unfinished ...> 5249 <... write resumed> ) = 16384 5249 write(12, "e spec must be provided. This ca"..., 16384) = 16384 5249 write(12, "' and the value 'my \"\n\"frontend "..., 16384 <unfinished ...> 5249 <... write resumed> ) = 16384 5249 write(12, "kubernetes/blob/master/pkg/kubec"..., 16384) = 16384 5249 write(12, "ed object.\"\nmsgstr \"The name for"..., 11928 <unfinished ...> 5249 <... write resumed> ) = 11928 [root@dhcp-140-138 ~]# [root@dhcp-140-138 ~]# grep write strace-43 |tail -n 20 5291 write(10, " 'foo' only if the resource is u"..., 16384) = 16384 5291 write(10, "ine a default period\\n\"\n\"\\t\\tbef"..., 16384) = 16384 5291 write(10, "rt of the kubernetes api and als"..., 16384) = 16384 5280 write(9, "\27\3\3\0%\0\0\0\0\0\0\5\326\227\26\0\\\345\31q\345\r\230\230|CH\243\332\222(3"..., 42) = 42 5280 write(10, "s of all of the pods in the clus"..., 16384) = 16384 5280 write(10, "xisting image, ignored otherwise"..., 16384 <unfinished ...> 5280 <... write resumed> ) = 16384 5280 write(10, "\nmsgid \"\"\n\"The resource requirem"..., 16384 <unfinished ...> 5280 <... write resumed> ) = 16384 5280 write(10, "h the app=nginx label\\n\"\n\"\\t\\t# "..., 16384) = 16384 5280 write(10, "# Listen on ports 5000 and 6000 "..., 16384) = 16384 5280 write(10, "<arg1> ... <argN>\\n\"\n\"\\n\"\n\"\\t\\t#"..., 16384 <unfinished ...> 5280 <... write resumed> ) = 16384 5280 write(10, " autoscaler can automatically in"..., 16384 <unfinished ...> 5280 <... write resumed> ) = 16384 5280 write(10, "e spec must be provided. This ca"..., 16384) = 16384 5280 write(10, "' and the value 'my \"\n\"frontend "..., 16384 <unfinished ...> 5280 <... write resumed> ) = 16384 5280 write(10, "kubernetes/blob/master/pkg/kubec"..., 16384) = 16384 5280 write(10, "ed object.\"\nmsgstr \"The name for"..., 11928) = 11928
Ah, I think I understand what's wrong with our verification procedure. By looking at the tail, we're only hitting the post-replacement phase where even the old code made nice, fat writes. I'll see if I can get a more-specific test to use...
I'm pushing this over to 4.5, since this is not a regression and I don't think we'll be able to fix that in 4.4. Trevor any chances you can pick it up in 4.5 or share what/how needs changing?
I'll take another look, but https://github.com/openshift/oc/pull/167 did not get reverted, so I'm pretty happy with where things stand ;). The issue may just be how we're trying to verify existing behavior, but I'm not sure.
Based on Trevor previous comment I'd move this for verification, we can revisit this at any time.
This failed verification back in comment 2. I think it's unlikely that the current code will pass verification unless we either land additional patches or figure out why the previous verification procedure didn't turn out like we'd expected.
> ... or figure out why the previous verification procedure didn't turn out like we'd expected. I still don't have a procedure for this, but if nobody else thinks one up I will work one up eventually.
If folks want to test this, it's been in 4.3 oc since: $ git --no-pager log -1 'origin/release-4.3^{/replacements to copyAndReplace}' commit a82d2ce4efdbfa9836480d8567acccfde19d297b Merge: 57997ba65 960c5070b Author: OpenShift Merge Robot <openshift-merge-robot.github.com> Date: Tue Dec 3 06:06:00 2019 -0800 Merge pull request #198 from openshift-cherrypick-robot/cherry-pick-167-to-release-4.3 Bug 1778882: pkg/cli/admin/release/extract_tools: Pass []replacements to copyAndReplace and 4.4 oc since: $ git --no-pager log -1 'origin/release-4.4^{/replacements to copyAndReplace}' commit d3f137b710de331abc946b9ad42e42cc8d5ded52 Merge: a8894f3e0 267009391 Author: OpenShift Merge Robot <openshift-merge-robot.github.com> Date: Mon Dec 2 16:22:23 2019 -0800 Merge pull request #167 from wking/release-extract-cleanup Bug 1775875: pkg/cli/admin/release/extract_tools: Pass []replacements to copyAndReplace So anything 4.3.0 [1] or later should be fine for testing. Double-checking: $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.3.0-x86_64 | grep cli-artifacts cli-artifacts https://github.com/openshift/oc 6a937dfe56ff26255d09702c69b8406040c14505 $ git log --first-parent --oneline origin/release-4.3 | grep 'a82d2ce4\|6a937dfe' 6a937dfe5 Merge pull request #208 from openshift-cherrypick-robot/cherry-pick-207-to-release-4.3 a82d2ce4e Merge pull request #198 from openshift-cherrypick-robot/cherry-pick-167-to-release-4.3 [1]: https://mirror.openshift.com/pub/openshift-v4/clients/ocp/4.3.0/
Hi W. Trevor King : Since this issue has been On_QA for a long time, I'm wondering do we have some method to verify this issue ? thanks.
I've checked no regression issue for command `oc adm release extract` , will move to verified.
I guess I'm off-the-hook for figuring out why my previous testing plan wasn't giving the expected results :p
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409