Bug 2046686 - Importer pod keeps restarting when DataImportCron has a reference to an invalid image SHA
Summary: Importer pod keeps restarting when DataImportCron has a reference to an invali...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.1
Assignee: Arnon Gilboa
QA Contact: Yan Du
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-27 08:46 UTC by Yan Du
Modified: 2022-05-18 20:27 UTC
CC List: 4 users

Fixed In Version: CNV v4.10.1-75
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-18 20:26:57 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 2169 0 None Merged Delete erroneous DataVolume on DataImportCron desired digest update 2022-03-02 11:11:38 UTC
Github kubevirt containerized-data-importer pull 2212 0 None closed [release-v1.43] Delete erroneous DataVolume on DataImportCron desired digest update 2022-03-31 10:26:01 UTC
Github kubevirt containerized-data-importer pull 2220 0 None Merged [release-v1.43] Delete erroneous DVs on DataImportCron digest update 2022-04-10 11:17:04 UTC
Github kubevirt containerized-data-importer pull 2239 0 None Merged [release-v1.43] Fix DataImportCron watch race 2022-04-15 12:49:19 UTC
Red Hat Product Errata RHSA-2022:4668 0 None None None 2022-05-18 20:27:18 UTC

Description Yan Du 2022-01-27 08:46:18 UTC
Description of problem:
Importer pod keeps restarting when the DataImportCron has a reference to an invalid image SHA

Version-Release number of selected component (if applicable):
CNV-4.10.0-605

How reproducible:
Always

Steps to Reproduce:
1. Make sure enableCommonBootImageImport is enabled
$ oc get hco -n openshift-cnv -o yaml | grep -i bootimage
      enableCommonBootImageImport: true
2. Annotate the DataImportCron so that it references an invalid image SHA:
$ oc annotate dic centos-stream8-image-cron --overwrite cdi.kubevirt.io/storage.import.sourceDesiredDigest=sha256:12345678900987654321
3. Check the import process
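
If the feature gate turns out to be disabled, it can be enabled through the HyperConverged CR before step 2. A minimal sketch, assuming the default CR name and feature-gate path of a standard CNV install (verify both against your cluster before patching):

$ oc patch hco kubevirt-hyperconverged -n openshift-cnv --type merge \
    -p '{"spec":{"featureGates":{"enableCommonBootImageImport":true}}}'

For step 3, the objects created by the cron import can be watched in the golden-images namespace (namespace taken from the outputs below):

$ oc get dv,pvc,pods -n openshift-virtualization-os-images -w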

Actual results:
Importer pod keeps restarting
$ oc get dv centos-stream8-123456789009
NAME                          PHASE              PROGRESS   RESTARTS   AGE
centos-stream8-123456789009   ImportInProgress   N/A        10         29m

$ oc get pvc centos-stream8-123456789009
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
centos-stream8-123456789009   Bound    pvc-47e7df1e-852b-424e-9454-e74ec8697043   68Gi       RWO            hostpath-provisioner   30m

$ oc get pvc centos-stream8-123456789009-scratch
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
centos-stream8-123456789009-scratch   Bound    pvc-f40008cc-bec4-4635-8e71-ce43f324333d   68Gi       RWO            hostpath-provisioner   30m


Got some unfriendly warning messages:

From DV:

Events:
  Type     Reason                Age                 From                   Message
  ----     ------                ----                ----                   -------
  Normal   Pending               20m                 datavolume-controller  PVC centos-stream8-123456789009 Pending
  Normal   CreatingScratchSpace  20m                 datavolume-controller  target PVC centos-stream8-123456789009 Pending and Creating scratch space
  Normal   Claim Pending         20m                 datavolume-controller  target PVC centos-stream8-123456789009 Pending and Claim Pending
  Normal   ImportScheduled       20m                 datavolume-controller  Import into centos-stream8-123456789009 scheduled
  Normal   Claim Pending         20m                 datavolume-controller  Claim Pending
  Normal   Bound                 20m                 datavolume-controller  PVC centos-stream8-123456789009 Bound
  Normal   ImportInProgress      20m                 datavolume-controller  Import into centos-stream8-123456789009 in progress
  Warning  Error                 4m4s (x8 over 20m)  datavolume-controller  Unable to process data: invalid reference format Could not parse image kubevirt.io/containerized-data-importer/pkg/importer.readImageSource   /remote-source/app/pkg/importer/transport.go:73 kubevirt.io/containerized-data-importer/pkg/importer.copyRegistryImage   /remote-source/app/pkg/importer/transport.go:182 kubevirt.io/containerized-data-importer/pkg/importer.CopyRegistryImage   /remote-source/app/pkg/importer/transport.go:262 kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer   /remote-source/app/pkg/importer/registry-datasource.go:83 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:197 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581 Failed to read registry image kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer   /remote-source/app/pkg/importer/registry-datasource.go:85 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:197 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581 Unable to transfer source data to scratch space kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:202 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581

From PVC:

Events:
  Type     Reason                 Age                   From                                                                                              Message
  ----     ------                 ----                  ----                                                                                              -------
  Normal   WaitForFirstConsumer   32m (x2 over 32m)     persistentvolume-controller                                                                       waiting for first consumer to be created before binding
  Normal   WaitForPodScheduled    32m (x2 over 32m)     persistentvolume-controller                                                                       waiting for pod importer-centos-stream8-123456789009 to be scheduled
  Normal   ExternalProvisioning   32m (x2 over 32m)     persistentvolume-controller                                                                       waiting for a volume to be created, either by external provisioner "kubevirt.io/hostpath-provisioner" or manually created by system administrator
  Normal   Provisioning           32m                   kubevirt.io/hostpath-provisioner_hostpath-provisioner-4hj57_27cafcf6-dd69-4e09-aef9-fbf16a0beed4  External provisioner is provisioning volume for claim "openshift-virtualization-os-images/centos-stream8-123456789009"
  Normal   ProvisioningSucceeded  32m                   kubevirt.io/hostpath-provisioner_hostpath-provisioner-4hj57_27cafcf6-dd69-4e09-aef9-fbf16a0beed4  Successfully provisioned volume pvc-47e7df1e-852b-424e-9454-e74ec8697043
  Warning  ErrImportFailed        110s (x946 over 31m)  import-controller                                                                                 Unable to process data: invalid reference format Could not parse image kubevirt.io/containerized-data-importer/pkg/importer.readImageSource   /remote-source/app/pkg/importer/transport.go:73 kubevirt.io/containerized-data-importer/pkg/importer.copyRegistryImage   /remote-source/app/pkg/importer/transport.go:182 kubevirt.io/containerized-data-importer/pkg/importer.CopyRegistryImage   /remote-source/app/pkg/importer/transport.go:262 kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer   /remote-source/app/pkg/importer/registry-datasource.go:83 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:197 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581 Failed to read registry image kubevirt.io/containerized-data-importer/pkg/importer.(*RegistryDataSource).Transfer   /remote-source/app/pkg/importer/registry-datasource.go:85 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:197 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581 Unable to transfer source data to scratch space kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessDataWithPause   /remote-source/app/pkg/importer/data-processor.go:202 kubevirt.io/containerized-data-importer/pkg/importer.(*DataProcessor).ProcessData   /remote-source/app/pkg/importer/data-processor.go:172 main.main   /remote-source/app/cmd/cdi-importer/importer.go:138 runtime.main   /usr/lib/golang/src/runtime/proc.go:255 runtime.goexit   /usr/lib/golang/src/runtime/asm_amd64.s:1581


Expected results:
1. The importer pod should not keep trying to import and restarting
2. The PVC should not be bound, since the image reference has an invalid format
3. Ideally, the stack traces should not be included in the event messages
 
Additional info:

Comment 1 Adam Litke 2022-02-03 13:28:57 UTC
Arnon, I am pretty sure this is behaving as expected but I want to confirm a few things.  When a DataImportCron import is failing do we emit an event?  Are we raising an alert as well?  If so, then I think we can close this.  If not, we must have events and alerts to help customers understand that there is a problem that needs their attention.

I also want to note our discussion about handling the case where an updated image appears in the registry while there is a currently failing DataImportCron.  We will want to delete the old import and switch to the new image automatically.

Comment 2 Arnon Gilboa 2022-02-06 09:47:14 UTC
Adam,

First of all, when a DataImportCron import is failing, the usual PVC and DV events are emitted, e.g.:

openshift-virtualization-os-images   16m         Warning   Error                     datavolume/centos8.4-123456789009                                             Unable to process data: Unable to transfer source data to scratch space: Failed to read registry image: Could not parse image: invalid reference format
openshift-virtualization-os-images   16m         Warning   ErrImportFailed           persistentvolumeclaim/centos8.4-123456789009                                  Unable to process data: Unable to transfer source data to scratch space: Failed to read registry image: Could not parse image: invalid reference format
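
These warnings can also be listed directly with a standard events query (same namespace as above):

$ oc get events -n openshift-virtualization-os-images --field-selector type=Warning --sort-by=.lastTimestamp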

In addition, when the DataImportCron is outdated (as in this case), an alert is raised (kubevirt_cdi_dataimportcron_outdated).
When the import pod is restarted too many times (as in this case), another alert is triggered (kubevirt_cdi_import_dv_unusual_restartcount_total).
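
A rough sketch for checking these alerts, assuming the rules ship in the openshift-cnv namespace (the metric names come from the lines above; the exact alert names and thresholds live in the shipped rules):

# Find the CDI alerting rules that reference the two metrics
$ oc get prometheusrules -n openshift-cnv -o yaml | grep -E -B2 -A2 'dataimportcron_outdated|unusual_restartcount'

# In the console (Observe -> Metrics), these expressions should return series
# while the cron import keeps failing:
#   kubevirt_cdi_dataimportcron_outdated > 0
#   kubevirt_cdi_import_dv_unusual_restartcount_total > 0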

As a manual solution, before deleting the failing import DV, you should first tell the DataImportCron to give up trying to import using the current source digest and revert to the digest of the last successful import, as noted in the DataSource. Otherwise, deleting the failing DV will cause its recreation by the DataImportCron.

$ oc describe das -n openshift-virtualization-os-images centos8.4
...
Spec:
  Source:
    Pvc:
      Name:       centos8.4-4a6a0eee7b4a

Use the pvc name suffix to annotate the DataImportCron sourceDesiredDigest:

$ oc annotate --overwrite dic -n openshift-virtualization-os-images centos8.4-import-cron cdi.kubevirt.io/storage.import.sourceDesiredDigest=sha256:4a6a0eee7b4a

Now you can safely delete the failing DV.
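
For example (DV name and namespace taken from the events above; substitute your own):

$ oc delete dv -n openshift-virtualization-os-images centos8.4-123456789009

With the desired digest reverted as above, the DataImportCron should not recreate this DV.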

Comment 3 Yan Du 2022-04-12 05:01:39 UTC
Tested on CNV-v4.10.1-62; the issue has been fixed.

$ oc get po
NAME                                   READY   STATUS             RESTARTS      AGE
importer-centos-stream8-123456789009   1/2     InvalidImageName   2 (56s ago)   3m8s

  Normal   Created           80s (x2 over 2m22s)  kubelet            Created container importer
  Normal   Started           80s (x2 over 2m22s)  kubelet            Started container importer
  Warning  InspectFailed     80s (x8 over 2m22s)  kubelet            Failed to apply default image tag "quay.io/containerdisks/centos-stream:8@sha256:12345678900987654321": couldn't parse image reference "quay.io/containerdisks/centos-stream:8@sha256:12345678900987654321": invalid reference format
  Warning  Failed            80s (x8 over 2m22s)  kubelet            Error: InvalidImageName

Comment 5 Yan Du 2022-04-13 01:19:40 UTC
The fix PR introduced a regression; moving back to ASSIGNED.

{"level":"error","ts":1649760447.0815444,"logger":"controller.dataimportcron-controller","msg":"Reconciler error","name":"","namespace":"openshift-virtualization-os-images","error":"values[0][cdi.kubevirt.io/dataImportCron]: Invalid value: \"openshift-virtualization-os-images.\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')","errorCauses":[{"error":"values[0][cdi.kubevirt.io/dataImportCron]: Invalid value: \"openshift-virtualization-os-images.\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')"}],"stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227"}

Comment 6 Maya Rashish 2022-04-18 13:46:42 UTC
Marking "ON_QA" because there's a downstream build of this that QA can try.
(MODIFIED is for upstream code merged, but not yet in a downstream release)

Comment 7 Maya Rashish 2022-04-18 13:50:10 UTC
Oops, the Fixed In Version failed QA; updated it to the build that includes the pull request fixing the regression.

Comment 8 Yan Du 2022-04-20 06:53:22 UTC
Tested on CNV v4.10.1-78; the issue has been fixed.

Comment 13 errata-xmlrpc 2022-05-18 20:26:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.1 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4668

