Description of problem:
When skippedImageStreams is set in the openshift-samples samplesresource but is not reflected in the configmap, the samples operator pod goes into CrashLoopBackOff.

Version-Release number of selected component (if applicable):
openshift v4.0.0-0.43.0
openshift/ose-cluster-samples-operator:latest (2873803d5632) -> v4.0.0-0.43.0

How reproducible:
Always

Steps to Reproduce:
1. Deploy the samples operator manually, using the ose-cluster-samples-operator image.
2. Edit the samplesresource to add skippedImageStreams:

   # oc edit samplesresources openshift-samples
     skippedImageStreams:
     - jenkins
     - perl

   Keep the configmap unchanged:

   # oc describe configmap openshift-samples
   Name:         openshift-samples
   Namespace:    openshift-cluster-samples-operator
   Labels:       <none>
   Annotations:  <none>

   Data
   ====
   keyForInstallTypeField:
   ----
   centos
   keyForSamplesRegistryField:
   ----
   keyForSkippedImageStreamsField:
   ----
   keyForSkippedTemplatesField:
   ----
   x86_64:
   ----
   x86_64
   Events:  <none>

3. Check the samples operator pod.

Actual results:
# oc get pods
NAME                                        READY     STATUS             RESTARTS   AGE
cluster-samples-operator-7d4cdbb8fb-6crcd   0/1       CrashLoopBackOff   12         32m

# oc logs -f cluster-samples-operator-7d4cdbb8fb-6crcd
time="2018-11-02T08:58:54Z" level=info msg="updated template nodejs-mongodb-example"
time="2018-11-02T08:58:54Z" level=info msg="processing file templates from dir /opt/openshift/operator/okd-x86_64/nodejs"
time="2018-11-02T08:58:54Z" level=info msg="processing file nodejs from dir /opt/openshift/operator/okd-x86_64"
time="2018-11-02T08:58:54Z" level=info msg="processing subdir perl from dir /opt/openshift/operator/okd-x86_64"
time="2018-11-02T08:58:54Z" level=info msg="processing subdir imagestreams from dir /opt/openshift/operator/okd-x86_64/perl"
time="2018-11-02T08:58:54Z" level=info msg="processing file perl-centos7.json from dir /opt/openshift/operator/okd-x86_64/perl/imagestreams"
ERROR: logging before flag.Parse: E1102 08:58:54.248287       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:573
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:502
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:63
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/signal_unix.go:388
/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:919
/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873
/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873
/go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:812
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:85
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:52
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:36
/go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/asm_amd64.s:2361
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xb0 pc=0xf71570]

goroutine 114 [running]:
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x10a5660, 0x1aa7670)
    /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/runtime/panic.go:502 +0x229
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0xc42060cf40, 0x34, 0xc4206a0f30, 0x1, 0x1, 0xc420319680, 0xc42060cf40, 0x34)
    /go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:919 +0xdf0
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0xc420454360, 0x27, 0xc4206a0db0, 0x1, 0x1, 0xc420319680, 0xc420454360, 0x27)
    /go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873 +0x18ae
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).processFiles(0xc4203b6000, 0x120dfac, 0x22, 0xc42037a280, 0x13, 0x13, 0xc420319680, 0x0, 0xc420438ed0)
    /go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:873 +0x18ae
github.com/openshift/cluster-samples-operator/pkg/stub.(*Handler).Handle(0xc4203b6000, 0x1301d60, 0xc420046038, 0x12ebb00, 0xc420319680, 0x42c800, 0x0, 0x0)
    /go/src/github.com/openshift/cluster-samples-operator/pkg/stub/handler.go:812 +0x9cb
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).sync(0xc4200f07e0, 0xc42041dba0, 0x11, 0x1049e40, 0xc42054c810)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:85 +0x11a
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).processNextItem(0xc4200f07e0, 0xc4200f4300)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:52 +0xd2
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).runWorker(0xc4200f07e0)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer-sync.go:36 +0x2b
github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).(github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.runWorker)-fm()
    /go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x2a
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42030a190)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42030a190, 0x3b9aca00, 0x0, 0x1, 0x0)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42030a190, 0x3b9aca00, 0x0)
    /go/src/github.com/openshift/cluster-samples-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk.(*informer).Run
    /go/src/github.com/openshift/cluster-samples-operator/vendor/github.com/operator-framework/operator-sdk/pkg/sdk/informer.go:92 +0x209

Expected results:
The operator should explicitly report the forbidden/invalid configuration instead of crashing with a nil pointer dereference.

Additional info:
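A minimal sketch of applying the same change as step 2 non-interactively, in case it helps reproduce. This assumes the skippedImageStreams field sits under .spec, which may not match the actual SamplesResource layout; verify against the CRD before using:

```
oc patch samplesresources openshift-samples --type=merge \
  -p '{"spec":{"skippedImageStreams":["jenkins","perl"]}}'
```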
Ran into this recently as well. https://github.com/openshift/cluster-samples-operator/pull/41, among other things, has the fix for this. Will move to ON_QA when that PR merges.
OK, I've received the notifications that new builds have completed with the fix from https://github.com/openshift/cluster-samples-operator/pull/41. I'll copy/paste the content here ... hopefully this is sufficient for QA to pull what is needed to verify:

Image Name: openshift/ose-cluster-samples-operator
Repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:rhaos-4.0-rhel-7-containers-candidate-24620-20181105145213
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:v4.0.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:v4.0.0-0.47.0.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:v4.0
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift/ose-cluster-samples-operator:v4.0.0.20181105.080123
Status: Succeeded
Submitted by: ocp-build/buildvm.openshift.eng.bos.redhat.com
Logs: http://download.eng.bos.redhat.com/brewroot/work/tasks/3122/19063122

Package: ose-cluster-samples-operator-container
NVR: ose-cluster-samples-operator-container-v4.0.0-0.47.0.0
User: osbs
Status: complete
Tag Operation: tagged
Into Tag: rhaos-4.0-rhel-7-candidate

ose-cluster-samples-operator-container-v4.0.0-0.47.0.0 successfully tagged into rhaos-4.0-rhel-7-candidate by osbs
Made two modifications to the samplesresource openshift-samples:
a. Changed installtype to rhel
b. Added skipped image streams: jenkins, perl, ruby

A long time after the modification (about 1 hour), the change had still not been picked up by the samplesresource, and the configmap had no updated data. See details here: http://pastebin.test.redhat.com/666428
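A sketch of how those two edits might be applied non-interactively. The field names installType/skippedImageStreams and their placement under .spec are assumptions based on this thread, not confirmed against the CRD:

```
oc patch samplesresources openshift-samples --type=merge \
  -p '{"spec":{"installType":"rhel","skippedImageStreams":["jenkins","perl","ruby"]}}'
```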
I do see that the timestamps in the status conditions are an hour before the latest timestamps in your pod logs. It also looks like the pod logs have been truncated ... none of the logging from 2018-11-06T02:48:38Z is present. I wonder if there was an issue with the initial insertion of data into the config map. The lack of data in the config map is certainly problematic ... lack of data there could result in ignoring subsequent events and break config validations.

I did hear yesterday that there was an issue with generic API access ... that up to half of the interactions could fail. @Ben Parees - did you happen to hear of a tracking item for that? Feels like that upon initial reaction.

@XiuJuan Wang - do you still have the environment in question up? Or can you reproduce? If possible, I'd like (a command sketch follows below):
- oc get cm -o yaml, oc get events, and oc logs -f of the pod after initial bring-up
- the same data after you make the initial change to the skipped lists
- the same data after you make the switch to rhel
- the same data after you make new entries to the skipped lists

In parallel I'll be trying to reproduce here, first with the version of the installer I currently have, then I'll update that. I will also see about tracking down that rumor about api server access if Ben does not know how it is being tracked.
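A sketch of the requested data-gathering commands, assuming the configmap and pod live in the openshift-cluster-samples-operator namespace (per the oc describe output in the description); the pod name is a placeholder:

```
oc get configmap openshift-samples -o yaml -n openshift-cluster-samples-operator
oc get events -n openshift-cluster-samples-operator
oc logs -f <samples-operator-pod> -n openshift-cluster-samples-operator
```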
> @Ben Parees - did you happen to hear of a tracking item for that? Feels like that upon initial reaction.

I think I heard there was a PR to fix it, but I don't have details. Regardless, as Eric noted when discussing it, our operators need to be able to tolerate those sorts of failures and do the right thing, so if there are flows where failing api calls leave us in a broken state we can't recover from, we need to look at that and fix it.
OK, thanks Ben ... I'll spend some time tracking the PR just so we know what the playing field is.

A couple of follow-up items:

- @XiuJuan Wang, Ben also noted during our scrum that `oc logs -f` *MIGHT* truncate some data. Can you go back to the failed env (or reproduce) and just run `oc logs <podname>` without the `-f` option? Capturing the details of any errors on api interactions will help shape any solutions.

- @Ben, understood on the general recovery/tolerance standpoint ... but at some point, based on the nature of the error, recovering while still enforcing our "cannot make certain config changes" notions is going to be a friction point ... I fear trade-offs in that regard will be inevitable, so we need to start considering them. That said, there is the probable/upcoming simplification of things like the rhel/centos choice going away (yes @XiuJuan Wang, a possible design change for the samples operator is coming, based on discussions we've had with the installer team and stakes in the ground getting placed by the architecture team) ... and the ppc integration is somewhat up in the air ...

Short term, re-adding the needinfo to get the full pod logs could mitigate things a bit.
Actually, before I have @XiuJuan gather any logs or reproduce, my attempts to reproduce have uncovered an endless cycle of updating once the skipped lists have been modified. Going to use this defect to correct that first.
PR https://github.com/openshift/cluster-samples-operator/pull/43 has the additional fix for the continual updating when you change the skipped lists. Moving to POST. Will move to ON_QA when I get notification of a new build after the merge.
brewregistry.stage.redhat.io/openshift/ose-cluster-samples-operator   latest            70b6dac29355   4 hours ago    387 MB
brewregistry.stage.redhat.io/openshift/ose-cluster-samples-operator   v4.0.0-0.49.0.0   70b6dac29355   4 hours ago    387 MB
docker.io/openshift/origin-cluster-samples-operator                   latest            5ae59e6fb645   12 hours ago   413 MB

I checked both the OCP and origin images; both should include https://github.com/openshift/cluster-samples-operator/pull/43.

a. After the startup updates the status for the samplesresource, there are no further updates, even after setting skippedImageStreams. The configmap doesn't sync with the skippedImageStreams change. Details here: http://pastebin.test.redhat.com/666940
b. Changed installtype to rhel and deleted the configmap. Details here: http://pastebin.test.redhat.com/666939
Hmmmmmmmm ... yeah, that is not what I'm seeing when I try your item a) and the config map. The fact that the streams are missing per your pastebin is good news though.

For your item b), the results are what I expected. The ConfigurationValid condition is false, and it mentions that you cannot change from centos to rhel. I also see the logs in the pod log. So good there. You say you deleted the config map afterward; however, the code was faster than you :-) as it flagged the error by checking the config map before the delete occurred. The code then created a new config map with the default, which is centos. So good there as well. In general, admins should not be deleting the config map, but I assume you were trying to break things @XiuJuan Wang :-)

If you wanted to clear the system and start off with rhel, that is more complicated. As I mentioned during test case review, exactly how you do that was TBD, as we needed to sort out some details with the installer and architecture teams. There is a way to do it, but as we discussed with Ben during the test case review, we'd rather QA wait until we sort out the installer integration. FYI, it looks like the centos/rhel choice is going away. We'll always install rhel, and use the image pull secret from the kube-system namespace. But there are some logistics with having that pull secret work with the samples that are still pending.

Back to a) and the configmap update ... if there was some error trying to update the configmap, it would have shown up in the pod log and the samples operator status. At this point, the only debug option I can think of, without providing a debug patch, is to get a thread dump of the samples operator after you add the skipped streams and see if there are some hung threads somewhere. Reproduce a), and after gathering what you already gathered, do the following (a consolidated sketch follows below):
- oc rsh <samples operator pod> /bin/bash
- ps -ef
- look for the cluster-samples-operator pid ... should be "1", but double check
- kill -3 <pid>
- the thread dump should show up in the pod log

If something shows up from the thread dump, we'll go from there. Otherwise I'll build a debug version of the operator and explain how you can run with that.

Overall ... progress for sure ... thanks.
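A consolidated sketch of those thread-dump steps; the pod name and pid are placeholders:

```
oc rsh <samples-operator-pod> /bin/bash   # shell into the operator pod
ps -ef                                    # locate the cluster-samples-operator pid (expected to be 1)
kill -3 <pid>                             # SIGQUIT; the Go runtime writes the goroutine dump to the pod log
```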
Hi Gabe, no error showed up after the thread dump, in either the samplesresource or the operator pod. You can see SkippedImagestreams:[]string(nil) in the pod log: http://pastebin.test.redhat.com/667328

For b), yeah, I was just trying a destructive operation to see how it affects the operator.
Hey @XiuJuan Wang ... apologies, my instructions for getting the thread dump were insufficient for the uninitiated. You have to be tailing the pod log before you send the -3 signal, and then copy/paste the thread dump. In your attempt, a restart occurs as a result of the termination, so your attempt to capture the logs is too late.

That said, upon reflection, most likely there is not a deadlock; otherwise, none of the subsequent activity would have occurred.

I'll work on creating a debug version that highlights all the details of the skipped imagestream processing into the config map, and we'll go from there. I may just update our current version temporarily, or push the debug version to docker hub and then provide instructions for enabling it (though that is a multi-step process, so I'm leaning towards just temporarily adding extraneous debug).

I'm still somewhat wary of the issue I noted previously about the bootstrap server still being considered an api server and updates getting lost, but that was supposedly fixed (though I'm still not 100% sure about that). We might need to engage the master team.

I'll update here when the debug is ready.
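A sketch of the corrected two-terminal flow described above; pod name is a placeholder and pid 1 is the expected value per the earlier comment:

```
# Terminal 1: keep tailing the operator pod log so the dump is captured before any restart
oc logs -f <samples-operator-pod> -n openshift-cluster-samples-operator

# Terminal 2: send SIGQUIT to the operator process inside the pod
oc rsh <samples-operator-pod> kill -3 1
```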
OK, https://github.com/openshift/cluster-samples-operator/pull/49 has merged, @XiuJuan Wang. Please reproduce with that level, and we'll see what the bz1645463 messages in the pod logs say. If it ends up being an api server sort of thing, I contend we mark this bug verified, given the validations I noted in #Comment 10, and open a new bug against the master team.
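A quick sketch for pulling just those new debug lines out of the operator pod log (pod name is a placeholder; namespace per earlier comments):

```
oc logs <samples-operator-pod> -n openshift-cluster-samples-operator | grep bz1645463
```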
Gabe, here are the pasted logs: http://pastebin.test.redhat.com/667720. I am not skilled at reading these, but I don't see any interaction between the api server and the samples operator pod in the log. So is this an issue in the samples operator itself? One additional piece of info: I deployed the samples operator manually by creating the 00-05 yaml files and clusteroperator-crd.yaml.
Hey @XiuJuan Wang, there certainly is interaction between the samples operator and the api server in the logs, both from directly updating config maps and from getting events when the samplesresource has changed. Some specifics on the current data are below, but apologies, I'm going to need one more reproduction with the changes from https://github.com/openshift/cluster-samples-operator/pull/50 before we can decide on any additional bugs.

This run:
- the thread dump confirmed there is no deadlock; you can quit getting thread dumps for this bugzilla
- the initial config map update was OK
- we never processed any events for when you added to the samplesresource skipped imagestreams ... so either we 1) did not get the event (if so, a master and/or operator sdk bug), or 2) incorrectly ignored the event (a fix in the samples operator)

I definitely see neither 1) nor 2) in my test cluster when I do the same thing you are doing. The additional debug should provide the final piece.

Thanks.
Gabe, the initial config map update works well. See details http://pastebin.test.redhat.com/668167
OK, the debug shows that a samplesresource event with the skippedImagestreams field set never arrived at the operator. We always got:
```
time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - srcfg skipped streams []string(nil)"
time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - cm skipped streams and ok true"
time="2018-11-12T03:20:18Z" level=info msg="bz1645463 DBG - cm array len 0 and contents []string{} and srcfg len 0 and contents []string(nil)"
time="2018-11-12T03:20:18Z" level=info msg="Incoming samplesresource unchanged from last processed version"
```
Since @XiuJuan Wang previously showed a samplesresource with that field set, that would imply to me an issue with the SDK propagating events to the operator ... whether that is the result of an underlying api server type of issue would most likely be the point of contention.

My thoughts on next steps:
- given the samples operator is at an older version of the SDK, being told by that team to upgrade the sdk first is to be expected ...
- do we start down that path, or do we hold off until the sdk/no-sdk decision is made?

@Ben - thoughts/opinions?
I'm pretty skeptical this is evidence of an api server/watch issue (including at the sdk level); it's just not a problem we've ever seen anywhere else, and despite having speculated about it at various times, it's never borne out. In addition, you're doing (if I recall correctly) a relist every 10 minutes, so even if you missed the event, you should see the object again within 10 minutes and process it at that point.

I haven't followed this bug closely or looked through the code, but are we sure it's not an issue in the samples operator of either:
1) missing the event while the operator was down?
2) intentionally ignoring the event because of the version checking or other logic?
@Ben - Reader's Digest details:
- we are only seeing this in @xiujuanwang's env
- the operator has not gone down during the test, based on the pod logs
- I've been sprinkling in debug as we have iterated ... at this point, the spots in the event handling prior to my latest debug:
  a) a problem updating the operator status would result in an error log, which we are not seeing
  b) if the management state field says unmanaged or removed, we would see a log there, which is not showing up
  c) if spec validation flags an error, again, it would show up in an error log and the samples operator status
  d) then there is our config check for changes, where my debug says we do not see any changes to the skipped imagestream list

Also, my logging shows events coming in every 9 to 10 seconds.

@XiuJuan - you did not provide an oc get samplesresource -o yaml with your last pastebin ... given the prior pastebins all had it and corroborated the timing, I did not harp on it. But given Ben's skeptical response, please reproduce again and include oc get samplesresource -o yaml ... and don't bother with the events; they have nothing of interest for us here.

I'll corroborate the timestamps in the samplesresource yaml with the latest debug, and we'll level set with Ben and go from there.
Yeah, at a minimum I'd make sure you have debug that shows whether or not you're getting the expected relist events every 10 minutes (possibly even crank that value down for now while this is investigated).
@gmontero @bparees I have to apologize to you both. I had been updating the samplesresource with 'skippedImageStream', not 'skippedImageStreams', since comment #9. Today, when I set 'skippedImageStream' the first time, the field kept disappearing a few minutes later, and then I realized I had used the wrong keyword. Now I think we can mark this bug as verified; the skippedImageStreams feature works well. http://pastebin.test.redhat.com/668585
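For anyone retracing this later, a quick sketch for spotting the mix-up (resource and field names taken from this thread; the grep is illustrative only):

```
oc get samplesresources openshift-samples -o yaml | grep -i skipped
#   skippedImageStreams:  <- the plural field name the operator acts on (per this BZ)
#   skippedImageStream:   <- the misspelled singular field, which the operator does not act on and which disappeared on update
```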
Ahhh ... yep, that would do it. And no worries @XiuJuan Wang - I should not have assumed that after comment 9 the samplesresource was the same. Yes, let's mark this as verified ... sending it to you ON_QA, and yes, Ben's skepticism was certainly warranted :-)
Tested openshift/ose-cluster-samples-operator:latest ---> v4.0.0-0.56.0.0. The skippedImageStreams and skippedTemplates features work well when adding/modifying/removing entries in the samplesresource. Moving this bug to verified. Thanks Gabe and Ben~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758