Bug 1656740

Summary: Samples-operator pod go to crash when set skippedimagestreams|skippedtemplates to invalid values
Product: OpenShift Container Platform Reporter: XiuJuan Wang <xiuwang>
Component: ImageStreamsAssignee: Gabe Montero <gmontero>
Status: CLOSED ERRATA QA Contact: XiuJuan Wang <xiuwang>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.1.0CC: aos-bugs, bparees, jokerman, mmccomas, wzheng, xiuwang
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
undefined
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:41:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description XiuJuan Wang 2018-12-06 08:41:22 UTC
Description of problem:
The valid value of skippedimagestreams|skippedtemplates should be in an array.
Then samples-operator pod go to crash when set skippedimagestreams|skippedtemplate to invalid value(such as jenkins,ruby|jenkins-ephemeral,jenkins-persistent)

Version-Release number of selected component (if applicable):

$ oc get clusterversion version 
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.alpha-2018-12-06-005835   True        False         5h        Cluster version is 4.0.0-0.alpha-2018-12-06-005835

How reproducible:
always

Steps to Reproduce:
1.Set skippedTemplates: jenkins-ephemeral,jenkins-persistent in samplesresource
2.Check samples operator pod
3.

Actual results:
$ oc get pods 
NAME                                        READY     STATUS             RESTARTS   AGE
cluster-samples-operator-859558f8bd-q42n6   0/1       CrashLoopBackOff   6          11m

$ oc logs  -f cluster-samples-operator-859558f8bd-q42n6 
time="2018-12-06T08:04:35Z" level=info msg="Go Version: go1.10.3"
time="2018-12-06T08:04:35Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-12-06T08:04:35Z" level=info msg="operator-sdk Version: 0.0.5+git"
time="2018-12-06T08:04:35Z" level=info msg="Metrics service cluster-samples-operator created"
time="2018-12-06T08:04:35Z" level=info msg="Watching samplesoperator.config.openshift.io/v1alpha1, SamplesResource, openshift-cluster-samples-operator, 600"
time="2018-12-06T08:04:35Z" level=info msg="Watching secrets"
time="2018-12-06T08:04:35Z" level=info msg="template client &v1.TemplateV1Client{restClient:(*rest.RESTClient)(0xc42035d980)}"
time="2018-12-06T08:04:35Z" level=info msg="image client &v1.ImageV1Client{restClient:(*rest.RESTClient)(0xc42035da40)}"
time="2018-12-06T08:04:35Z" level=info msg="creating default SamplesResource"
time="2018-12-06T08:04:38Z" level=info msg="got already exists error on create default"
ERROR: logging before flag.Parse: E1206 08:04:39.013860       1 runtime.go:66] Observed a panic: &errors.errorString{s:"failed to decode json data with gvk(samplesoperator.config.openshift.io/v1alpha1, Kind=SamplesResource): v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found \", error found in #10 byte of ...|mplates\":\"jenkins-ep|..., bigger context ...|egistry\":\"registry.redhat.io\",\"skippedTemplates\":\"jenkins-ephemeral,jenkins-persistent\"},\"status\":{\"|..."} (failed to decode json data with gvk(samplesoperator.config.openshift.io/v1alpha1, Kind=SamplesResource): v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found ", error found in #10 byte of ...|mplates":"jenkins-ep|..., bigger context ...|egistry":"registry.redhat.io","skippedTemplates":"jenkins-ephemeral,jenkins-persistent"},"status":{"|...)
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/util/k8sutil/k8sutil.go:83
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:77
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:52
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:36
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: failed to decode json data with gvk(samplesoperator.config.openshift.io/v1alpha1, Kind=SamplesResource): v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found ", error found in #10 byte of ...|mplates":"jenkins-ep|..., bigger context ...|egistry":"registry.redhat.io","skippedTemplates":"jenkins-ephemeral,jenkins-persistent"},"status":{"|... [recovered]
	panic: failed to decode json data with gvk(samplesoperator.config.openshift.io/v1alpha1, Kind=SamplesResource): v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found ", error found in #10 byte of ...|mplates":"jenkins-ep|..., bigger context ...|egistry":"registry.redhat.io","skippedTemplates":"jenkins-ephemeral,jenkins-persistent"},"status":{"|...

goroutine 158 [running]:
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x1092380, 0xc42077f060)
	/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/util/k8sutil.RuntimeObjectFromUnstructured(0xc42000e0f8, 0xc42000e0f8, 0x11)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/util/k8sutil/k8sutil.go:83 +0x3c9
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).sync(0xc4200d8240, 0xc4200dbde0, 0x11, 0x1053460, 0xc4202f1e20)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:77 +0xcc
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).processNextItem(0xc4200d8240, 0xc4200b2700)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:52 +0xd2
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).runWorker(0xc4200d8240)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:36 +0x2b
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).(github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.runWorker)-fm()
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x2a
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4202f0a20)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4202f0a20, 0x3b9aca00, 0x0, 0x8b00000001, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4202f0a20, 0x3b9aca00, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).Run
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x209

Expected results:
Should reject to save the invalid value for samplesresource.

Additional info:

Comment 1 Gabe Montero 2018-12-06 17:54:17 UTC
Your yaml most likely is in the wrong format.

Unfortunate that the sdk panics:

/usr/local/go/src/runtime/panic.go:502
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/util/k8sutil/k8sutil.go:83

Looking at the latest version of the sdk, they no longer call panic on formatting errors like this.  In fact, only their metrics package calls panic on its initialization trying to get its kube client from the in cluster config.

So presumably @Ben this bug could serve as the motivation / argument for upgrading the sdk for this operator.

In the interim, please provide an copy/paste of the yaml you used @XiuJuan and I'll help you correct the formatting to assist your testing.

Once we get that sorted out, Ben and I should probably be able to get to a consensus on upgrading the sdk.

Comment 2 XiuJuan Wang 2018-12-07 01:20:22 UTC
Gabe, I know my setting invalid. As my expect result in the comment #0, this invalid setting should be rejected to save in samplesresource, otherwise this bug will happen.

$ oc get samplesresources  -o yaml 
apiVersion: v1
items:
- apiVersion: samplesoperator.config.openshift.io/v1alpha1
  kind: SamplesResource
  metadata:
    creationTimestamp: 2018-12-07T00:56:06Z
    finalizers:
    - samplesoperator.config.openshift.io/finalizer
    generation: 1
    name: openshift-samples
    namespace: ""
    resourceVersion: "20398"
    selfLink: /apis/samplesoperator.config.openshift.io/v1alpha1/samplesresources/openshift-samples
    uid: e09b17b1-f9ba-11e8-a3ef-0ea829f86d10
  spec:
    architectures:
    - x86_64
    installType: centos
    managementState: Managed
    skippedTemplates: jenkins-persistent,jenkins-ephemeral
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: 2018-12-07T00:56:26Z
      lastUpdateTime: 2018-12-07T00:56:26Z
      status: "True"
      type: SamplesExist
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "False"
      type: ImportCredentialsExists
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: 2018-12-07T00:58:07Z
      lastUpdateTime: 2018-12-07T00:58:07Z
      status: "False"
      type: ChangesInProgress
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "False"
      type: PendingRemove
    installType: centos
    managementState: Managed
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 3 XiuJuan Wang 2018-12-07 01:42:25 UTC
Samplesresource also should treat below senario correctly:

should save repeat values for skippedTemplates as one.
$ oc get samplesresources -o yaml 
apiVersion: v1
items:
- apiVersion: samplesoperator.config.openshift.io/v1alpha1
  kind: SamplesResource
  metadata:
    creationTimestamp: 2018-12-07T00:56:06Z
    finalizers:
    - samplesoperator.config.openshift.io/finalizer
    generation: 1
    name: openshift-samples
    namespace: ""
    resourceVersion: "26876"
    selfLink: /apis/samplesoperator.config.openshift.io/v1alpha1/samplesresources/openshift-samples
    uid: e09b17b1-f9ba-11e8-a3ef-0ea829f86d10
  spec:
    architectures:
    - x86_64
    installType: centos
    managementState: Managed
    skippedTemplates:
    - jenkins-persistent
    - jenkins-persistent
  status:
    architectures:
    - x86_64
    conditions:
    - lastTransitionTime: 2018-12-07T00:56:26Z
      lastUpdateTime: 2018-12-07T00:56:26Z
      status: "True"
      type: SamplesExist
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "False"
      type: ImportCredentialsExists
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "True"
      type: ConfigurationValid
    - lastTransitionTime: 2018-12-07T01:26:23Z
      lastUpdateTime: 2018-12-07T01:26:23Z
      status: "False"
      type: ChangesInProgress
    - lastTransitionTime: 2018-12-07T00:56:03Z
      lastUpdateTime: 2018-12-07T00:56:03Z
      status: "False"
      type: PendingRemove
    installType: centos
    managementState: Managed
    skippedTemplates:
    - jenkins-persistent
    - jenkins-persistent
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Comment 4 Gabe Montero 2018-12-07 14:56:00 UTC
ah ... right, sorry, I missed what you said in you description XiuJuan about putting an invalid value on purpose.

And absolutely, it should not crash the operator.

As I mentioned to Ben, I think this is what pushes us to upgrading the sdk in the samples operator, to get the changes so it does not call panic when invalid yaml is provided.

Comment 5 XiuJuan Wang 2019-01-08 07:48:22 UTC
oc get clusterversion 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-9   True        False         3h        Cluster version is 4.0.0-9

The invalid value still could be saved, but operator pod doesn't crash now.

  Management State:        Managed
  Skipped Imagestreams:
    jenkins,ruby

Comment 6 Gabe Montero 2019-01-08 20:30:35 UTC
OK with the changes in https://github.com/openshift/cluster-samples-operator/pull/71 (still under review with Ben),
I've confirmed that the invalid yaml for the samples struct noted in the description does not cause the panic.

That said, the expectation that the yaml won't be saved will not hold for 4.0.

As long as the yaml is valid from at valid yaml perspective, the yaml will be saved.

The entry `skippedTemplates: jenkins-persistent,jenkins-ephemeral` is valid yaml.  It is not a string array
as expected for the SamplesResource, but a yaml key with a string value of `jenkins-persistent,jenkins-ephemeral`

You'll see logs like:

ERROR: logging before flag.Parse: E0108 20:06:58.490247       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found ", error found in #10 byte of ...|mplates":"jenkins-pe|..., bigger context ...|","managementState":"Managed","skippedTemplates":"jenkins-persistent,jenkins-ephemeral","version":"4|...
ERROR: logging before flag.Parse: W0108 20:06:58.490437       1 reflector.go:341] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:58: watch of *v1alpha1.SamplesResource ended with: very short watch: github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:58: Unexpected watch close - watch lasted less than a second and no items received
ERROR: logging before flag.Parse: E0108 20:06:59.498704       1 reflector.go:205] github.com/openshift/cluster-samples-operator/pkg/generated/informers/externalversions/factory.go:58: Failed to list *v1alpha1.SamplesResource: v1alpha1.SamplesResourceList.Items: []v1alpha1.SamplesResource: v1alpha1.SamplesResource.Spec: v1alpha1.SamplesResourceSpec.SkippedTemplates: []string: decode slice: expect [ or n, but found ", error found in #10 byte of ...|mplates":"jenkins-pe|..., bigger context ...|","managementState":"Managed","skippedTemplates":"jenkins-persistent,jenkins-ephemeral","version":"4|...

with the reflector.go log repeating.

There is initial discussions around custom validators getting provided for the 4.0 CRDs, but that most likely will be post 4.0.

Comment 7 Gabe Montero 2019-01-09 04:04:16 UTC
PR has merged ... look for the next installable level with the change

Comment 9 XiuJuan Wang 2019-01-11 12:30:14 UTC
Can't reproduce this bug 
registry.svc.ci.openshift.org/openshifr/origin-release:4.0.0-0.alpha-2019-01-11-075335
#oc get clusterversion 
NAME      VERSION                           AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.0-0.alpha-2019-01-11-075335   True        False         1h        Cluster version is 4.0.0-0.alpha-2019-01-11-075335

Comment 12 errata-xmlrpc 2019-06-04 10:41:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758