Bug 1645463 - Samples operator pod crashes when settings differ between the configmap and samplesresources
Summary: Samples operator pod crashes when settings differ between configmap...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: ImageStreams
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.1.0
Assignee: Gabe Montero
QA Contact: XiuJuan Wang
Depends On:
Reported: 2018-11-02 09:49 UTC by XiuJuan Wang
Modified: 2019-06-04 10:40 UTC
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:40:52 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:40:58 UTC

Description XiuJuan Wang 2018-11-02 09:49:26 UTC
Description of problem:
When skippedImageStreams is set in the openshift-samples samplesresource but not in the configmap, the samples operator pod goes into CrashLoopBackOff.
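For context on the stack trace reported below: a hedged, hypothetical Go sketch of this kind of failure mode (the `ConfigMap` struct and the `lookupSkipped`/`safeLookupSkipped` names are illustrative, not the operator's actual code). Dereferencing a nil pointer panics exactly like the trace shows, while a guard turns the missing data into a reportable config error:

```go
package main

import "fmt"

// ConfigMap mimics the shape of a Kubernetes ConfigMap; purely illustrative.
type ConfigMap struct {
	Data map[string]string
}

// lookupSkipped dereferences cm without a nil check -- if the configmap
// was never populated, cm is nil and this panics with
// "invalid memory address or nil pointer dereference".
func lookupSkipped(cm *ConfigMap) string {
	return cm.Data["skipped-imagestreams"]
}

// safeLookupSkipped guards against the missing configmap instead of crashing.
func safeLookupSkipped(cm *ConfigMap) (string, bool) {
	if cm == nil || cm.Data == nil {
		return "", false
	}
	v, ok := cm.Data["skipped-imagestreams"]
	return v, ok
}

func main() {
	var missing *ConfigMap // simulates the configmap lookup returning nothing
	if _, ok := safeLookupSkipped(missing); !ok {
		fmt.Println("configmap missing; report a config error instead of panicking")
	}
}
```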

Version-Release number of selected component (if applicable):

openshift v4.0.0-0.43.0
openshift/ose-cluster-samples-operator:latest(2873803d5632) ->v4.0.0-0.43.0

How reproducible:

Steps to Reproduce:
1. Deploy the samples operator manually, using the ose-cluster-samples-operator image.
2. Edit the samplesresource to add skippedImageStreams:
# oc edit samplesresources openshift-samples
  - jenkins
  - perl

Keep the configmap unchanged:
# oc describe configmap openshift-samples
 Name:         openshift-samples
Namespace:    openshift-cluster-samples-operator
Labels:       <none>
Annotations:  <none>




Events:  <none>

3. Check the samples operator pod.

Actual results:

# oc get pods
NAME                                        READY     STATUS             RESTARTS   AGE
cluster-samples-operator-7d4cdbb8fb-6crcd   0/1       CrashLoopBackOff   12         32m

# oc logs -f cluster-samples-operator-7d4cdbb8fb-6crcd

time="2018-11-02T08:58:54Z" level=info msg="updated template nodejs-mongodb-example"
time="2018-11-02T08:58:54Z" level=info msg="processing file templates from dir /opt/openshift/operator/okd-x86_64/nodejs"
time="2018-11-02T08:58:54Z" level=info msg="processing file nodejs from dir /opt/openshift/operator/okd-x86_64"
time="2018-11-02T08:58:54Z" level=info msg="processing subdir perl from dir /opt/openshift/operator/okd-x86_64"
time="2018-11-02T08:58:54Z" level=info msg="processing subdir imagestreams from dir /opt/openshift/operator/okd-x86_64/perl"
time="2018-11-02T08:58:54Z" level=info msg="processing file perl-centos7.json from dir /opt/openshift/operator/okd-x86_64/perl/imagestreams"
ERROR: logging before flag.Parse: E1102 08:58:54.248287       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xb0 pc=0xf71570]

goroutine 114 [running]:
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x10a5660, 0x1aa7670)
	/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:502 +0x229
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0xc42060cf40, 0x34, 0xc4206a0f30, 0x1, 0x1, 0xc420319680, 0xc42060cf40, 0x34)
	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:919 +0xdf0
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0xc420454360, 0x27, 0xc4206a0db0, 0x1, 0x1, 0xc420319680, 0xc420454360, 0x27)
	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873 +0x18ae
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0x120dfac, 0x22, 0xc42037a280, 0x13, 0x13, 0xc420319680, 0x0, 0xc420438ed0)
	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873 +0x18ae
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).Handle(0xc4203b6000, 0x1301d60, 0xc420046038, 0x12ebb00, 0xc420319680, 0x42c800, 0x0, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:812 +0x9cb
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).sync(0xc4200f07e0, 0xc42041dba0, 0x11, 0x1049e40, 0xc42054c810)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:85 +0x11a
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).processNextItem(0xc4200f07e0, 0xc4200f4300)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:52 +0xd2
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:36 +0x2b
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x2a
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42030a190, 0x3b9aca00, 0x0, 0x1, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42030a190, 0x3b9aca00, 0x0)
	/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).Run
	/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x209

Expected results:
The operator should explicitly report the forbidden configuration instead of crashing.

Additional info:

Comment 1 Gabe Montero 2018-11-02 17:27:21 UTC
Ran into this recently as well.

https://github.com/openshift/cluster-samples-operator/pull/41 among other things has the fix for this.

Will move to on qa when that PR merges.

Comment 2 Gabe Montero 2018-11-05 16:03:01 UTC
OK I've received the notifications that new builds have completed with the fix from https://github.com/openshift/cluster-samples-operator/pull/41

I'll copy/paste the content here ... hopefully this is sufficient for QA to pull what is needed to verify:

Image Name: openshift/ose-cluster-samples-operator
Status: Succeeded
Submitted by: ocp-build/buildvm.openshift.eng.bos.redhat.com
Logs: http://download.eng.bos.redhat.com/brewroot/work/tasks/3122/19063122

Package: ose-cluster-samples-operator-container
NVR: ose-cluster-samples-operator-container-v4.0.0-
User: osbs
Status: complete
Tag Operation: tagged
Into Tag: rhaos-4.0-rhel-7-candidate

ose-cluster-samples-operator-container-v4.0.0- successfully tagged into rhaos-4.0-rhel-7-candidate by osbs

Comment 3 XiuJuan Wang 2018-11-06 03:37:04 UTC
Two modifications in the samplesresource openshift-samples:
a. Change installtype to rhel
  Skipped Image Streams:

After waiting a long time (about 1h), the samplesresource still hadn't picked up my change, and the configmap had no data updated.

See details here, http://pastebin.test.redhat.com/666428

Comment 4 Gabe Montero 2018-11-06 14:55:19 UTC
I do see that the timestamps in the status condition are an hour before the latest timestamps in your pod logs.

It also looks like the pod logs have been truncated ... none of the logging from 2018-11-06T02:48:38Z is present.

I wonder if there was an issue with the initial insertion of data in the config map.

And the lack of data in the config map is certainly problematic ... lack of data there could result in ignoring subsequent events and break config validations.

I did hear yesterday that there was an issue with generic API access ... that up to half of the interactions could fail.

@Ben Parees - did you happen to hear of a tracking item for that? This feels like that issue on initial reaction.

@XiuJuan Wang - Do you have the environment in question still up?  Or can you reproduce?

If possible, I'd like:
- oc get cm -o yaml, oc  get events, oc logs -f of the pod after initial bring up
- Same data after you make the initial change to the skipped lists
- Same data after you make the switch to rhel
- Same data after you make new entries to the skipped lists

In parallel I'll be trying to reproduce here, first with the version of the installer I currently have, then I'll update that.  Will also see about tracking down that rumor about api server access if Ben does not know how that is being tracked.

Comment 5 Ben Parees 2018-11-06 15:11:50 UTC
> @Ben Parees - did you happen to hear of a tracking item for that? Feels like that upon initial reaction.

I think I heard there was a PR to fix it, but I don't have details.

Regardless, as Eric noted when discussing it, our operators need to be able to tolerate those sorts of failures and do the right thing, so if there are flows where failing api calls leave us in a broken state we can't recover from, we need to look at that and fix it.

Comment 6 Gabe Montero 2018-11-06 16:11:50 UTC
OK thanks Ben ... I'll spend some time tracking the PR just so we know what the playing field is

A couple of follow up items

- @XiuJuan Wang, Ben also noted during our scrum that `oc logs -f` *MIGHT* truncate some data

Can you go back to the failed env, or reproduce, and just run `oc logs <podname>` without the `-f` option ... capturing the details of any errors on api interactions will help shape any solutions.

- @Ben, understood on the general recovery / tolerance standpoint ... but at some point, based on the nature of the error, recovering while still enforcing our "cannot make certain config changes" notions is going to be a friction point ... I fear trade-offs in that regard will be inevitable.

So we need to start considering those trade-offs.

That said, there is the probable/upcoming simplification of things like the rhel/centos choice going away (yes @XiuJuan Wang, a possible design change for the samples operator is coming, based on discussions we've had with the installer team and stakes in the ground getting placed by the architecture team)

...and with the ppc integration somewhat up in the air ...

Short term, re-adding the needinfo to get the full pod logs could mitigate things a bit.

Comment 7 Gabe Montero 2018-11-06 16:48:00 UTC
Actually, before I have @XiuJuan gather any logs or reproduce, my attempts to reproduce have uncovered an endless cycle of updating once the skipped lists have been modified.

Going to use this defect to correct that first.

Comment 8 Gabe Montero 2018-11-06 18:55:19 UTC
PR https://github.com/openshift/cluster-samples-operator/pull/43 has the additional fix for the continual updating when you change the skipped lists.

Moving to POST 

Will move on QA when I get notification of a new build after the merge

Comment 9 XiuJuan Wang 2018-11-07 08:51:06 UTC
brewregistry.stage.redhat.io/openshift/ose-cluster-samples-operator   latest              70b6dac29355        4 hours ago         387 MB
brewregistry.stage.redhat.io/openshift/ose-cluster-samples-operator   v4.0.0-     70b6dac29355        4 hours ago         387 MB
docker.io/openshift/origin-cluster-samples-operator                   latest              5ae59e6fb645        12 hours ago        413 MB

I checked both the OCP and origin images. They should include https://github.com/openshift/cluster-samples-operator/pull/43 .

a. After startup, the status for the samplesresource is updated, but then there are no further updates even after setting skippedImageStreams. The configmap doesn't sync with the skippedImageStreams change.

Details here, http://pastebin.test.redhat.com/666940

b. Change installtype to rhel.
   Delete the configmap.

Details here, http://pastebin.test.redhat.com/666939

Comment 10 Gabe Montero 2018-11-07 16:26:44 UTC
Hmmmmmmmm .... yeah, that is not what I'm seeing when I try your item a) and the config map. The fact that the streams are missing per your pastebin is good news though.

For your item b), the results are what I expected.  The ConfigurationValid condition is false, and it mentions that you cannot change from centos to rhel.  I also see the logs in the pod log.

So good there. You say you deleted the config map afterward. However, the code was faster than you :-) as it flagged the error by checking with the config map before the delete occurred. The code then created a new config map with the default, which is centos. So good there as well. In general, admins should not be deleting the config map, but I assume you were trying to break things @XiuJuan Wang :-)

If you wanted to clear the system and start off with rhel, that is more complicated. As I mentioned during test case review, exactly how you do that was TBD, as we needed to sort out some details with the installer and architecture teams. There is a way to do it, but as we discussed with Ben during the test case review, we'd rather QA wait until we sort out the installer integration. FYI, it looks like the centos/rhel choice is going away. We'll always install rhel, and use the image pull secret from the kube-system namespace. But there are some logistics with having that pull secret work with the samples that are still pending.

Back to a) and the configmap update... if there was some error trying to update the configmap, it would have shown up in the pod log and the samplesoperator status.

At this point, the only debug option I can think of, without providing a debug patch, is to get a thread dump of the sample operator after you add the skipped streams, and see if there are some hung threads somewhere.

Reproduce a) and after gathering what you already gathered, do the following
- oc rsh <samples operator pod> /bin/bash
- ps -ef
- look for the cluster-samples-operator pid ... should be "1", but double check
- kill -3 <pid>
- the thread dump should show up in the pod log

If something shows up from the thread dump, we'll go from there.  Otherwise I'll build a debug version of the operator, and explain how you can run with that.

Overall ... progress for sure ... thanks.

Comment 11 XiuJuan Wang 2018-11-08 07:07:13 UTC
Hi Gabe, no error showed up after the thread dump in the samplesresources or operator pod.

You can see `SkippedImagestreams: []string(nil)` in the pod log.


For b), yeah, I was just trying destructive operations to see how they affect the operator.

Comment 12 Gabe Montero 2018-11-08 14:48:03 UTC
Hey @XiuJuan Wang ... apologies, my instructions on getting the thread dump were insufficient for the uninitiated. You have to be tailing the pod before you send the -3 signal, then copy/paste the thread dump. In your attempt, a restart occurred as a result of the termination, and your attempt to capture the logs came too late.

That said, upon reflection, most likely there is not a deadlock; otherwise, none of the subsequent activity would have occurred.

I'll work on creating a debug version that highlights all the details of the skipped imagestream processing to the config map, and we'll go from there. I may just update our current version temporarily, or push the debug version to docker hub and then provide instructions for enabling it (though that is a multi-step process, so I'm leaning towards just temporarily adding extraneous debug).

I'm still somewhat wary of the issue I noted previously about the bootstrap server still being considered an api server and updates getting lost, but that was supposedly fixed (though not 100% sure about that still).  We might need to engage the master team.

I'll update here when the debug is ready.

Comment 13 Gabe Montero 2018-11-08 23:59:15 UTC
OK https://github.com/openshift/cluster-samples-operator/pull/49 has merged @XiuJuan Wang.

Please reproduce with that level, and we'll see what the bz1645463 messages in the pod logs say.

If it ends up being an api server sort of thing, I contend we mark this bug verified, given the validations I noted in comment 10, and we open a new bug against the master team.

Comment 14 XiuJuan Wang 2018-11-09 03:17:10 UTC
Here are the pasted logs: http://pastebin.test.redhat.com/667720 .

I'm not skilled at this, but there is no interaction between the api server and the samples operator pods in the log. So this is an issue of the samples operator itself, right?

An additional piece of info: I deployed the samples operator by creating the 00-05 yaml files and clusteroperator-crd.yaml manually.

Comment 15 Gabe Montero 2018-11-09 15:39:52 UTC
Hey @XiuJuan Wang

There certainly is interaction between the samples operator and the api server in the logs, ranging from directly updating config maps to getting events when the samplesresource has changed.

Some specifics on the current data below, but apologies, I'm going to need one more reproduction with the changes from https://github.com/openshift/cluster-samples-operator/pull/50 before we can decide on any additional bugs.

This run:
- the thread dump confirmed there is no deadlock. You can quit getting the thread dumps for this bugzilla.
- the initial config map update was OK
- we never processed any events for when you added to the samplesresource skipped imagestreams ... so either we
1) did not get the event ... if so, a master and/or operator sdk bug
2) incorrectly ignored the event ... if so, a fix in the samples operator

I definitely see neither 1) nor 2) in my test cluster when I do the same thing you are doing.

The additional debug should provide the final piece.


Comment 16 XiuJuan Wang 2018-11-12 05:52:10 UTC
The initial config map update works well.
See details: http://pastebin.test.redhat.com/668167

Comment 17 Gabe Montero 2018-11-12 19:53:02 UTC
OK, the debug shows that a samplesresource event with the skippedImagestreams field set never arrived at the operator:

We always got:

time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - srcfg skipped streams []string(nil)"
time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - cm skipped streams  and ok true"
time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - cm array len 0 and contents []string{} and srcfg len 0 and contents []string(nil)"
time="2018-11-12T03:20:18Z" level=info msg="Incoming samplesresource unchanged from last processed version"

Since @XiuJuan Wang previously showed a samplesresource with that field set, that implies to me an issue with the SDK propagating events to the operator ... whether that is the result of an underlying api server issue would most likely be the contention point.

My thoughts on next steps:
- given the samples operator is at an older version of the SDK, being told by that team to upgrade the sdk first is to be expected ...
- do we start down that path
- or do we hold off on that until the sdk/no-sdk decision is made

@Ben - thoughts/opinions ?

Comment 18 Ben Parees 2018-11-12 20:29:02 UTC
I'm pretty skeptical this is evidence of an api server/watch issue (including at the sdk level); it's just not a problem we've ever seen anywhere else. Despite having speculated about it at various times, it's never borne out.

In addition, you're doing (if I recall correctly) a relist every 10 minutes, so even if you missed the event, you should see the object again within 10 minutes and process it at that point.

I haven't followed this bug closely or looked through the code, but are we sure it's not an issue in the samples operator of either:

1) missing the event while the operator was down?
2) intentionally ignoring the event because of some of the version checking or other logic?

Comment 19 Gabe Montero 2018-11-12 21:25:50 UTC
@Ben - Reader's Digest details:
- we're only seeing this in @xiujuanwang's env
- the operator has not gone down during the test, based on the pod logs
- I've been sprinkling in debug as we have iterated ... at this point, the spots in the event handling prior to my latest debug:
   a) a problem updating the operator status would result in an error log we are not seeing

   b) the process mgmt field says unmanaged or removed ... we would see a log there, which is not showing up

   c) spec validation flags an error ... again, would have shown up in error log and samplesoperator status

   d) then our config check for changes, where my debug says we do not see any changes to the skipped imagestream list

    also, my logging shows events coming in every 9 to 10 seconds 

@XiuJuan - you did not provide an oc get samplesresource -o yaml with your last pastebin ... given the prior pastebins all had it and corroborated timing, I did not harp on it. But given Ben's skeptical response, please reproduce again and include oc get samplesresource -o yaml ... and don't bother with the events, they have nothing of interest for us here.

I'll corroborate the timestamps in the samplesresource yaml with the latest debug, and we'll level set with Ben and go from there.

Comment 20 Ben Parees 2018-11-12 21:32:20 UTC
Yeah, at a minimum I'd make sure you have debug that shows whether or not you're getting the expected relist events every 10 minutes (possibly even crank that value down for now while this is investigated).

Comment 21 XiuJuan Wang 2018-11-13 06:52:10 UTC
@gmontero @bparees
I have to apologize to both of you.

I have been updating the samplesresource with 'skippedImageStream', not 'skippedImageStreams', since comment #9. Today when I set 'skippedImageStream' for the first time, the field was always gone a few minutes later. Then I realized I had used the wrong keyword.

Now I think we can mark this bug as verified; the skippedImageStreams feature works well.

Comment 22 Gabe Montero 2018-11-13 14:32:36 UTC
Ahhh ... yep that would do it.

And no worries @XiuJuan Wang - I should not have assumed after comment 9 that the samples resource was the same.

Yes let's mark as verified ... sending to you on qa

and yes Ben's skepticism was certainly warranted :-)

Comment 23 XiuJuan Wang 2018-11-14 02:57:48 UTC
Test openshift/ose-cluster-samples-operator:latest ---> v4.0.0-

The skippedImageStreams and skippedTemplates features work well when added/modified/removed from the samplesresource.

Moving this bug to verified.
Thanks Gabe and Ben~

Comment 26 errata-xmlrpc 2019-06-04 10:40:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

