Is the mgr in a crash loop?
Also, I've noticed this:

020-03-23T14:47:07.220987726Z 2020-03-23 14:47:07.220899 I | op-mgr: starting monitoring deployment
2020-03-23T14:47:07.221602153Z W0323 14:47:07.221384 7 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-03-23T14:47:07.231352078Z 2020-03-23 14:47:07.231276 E | op-mgr: failed to enable service monitor. service monitor could not be enabled: failed to update servicemonitor. resource name may not be empty
2020-03-23T14:47:07.23360796Z W0323 14:47:07.233272 7 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-03-23T14:47:07.251538117Z 2020-03-23 14:47:07.251439 E | op-mgr: failed to deploy prometheus rule. prometheus rule could not be deployed: failed to update prometheusRule. prometheusrules.monitoring.coreos.com "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

In
http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/1816492/mar-24/must-gather-OCS/quay-io-rhceph-dev-ocs-must-gather-sha256-c8aa70af719ef8fb99d59bef11cf3335b97e48e7ac14756db0725b773e5f46f6/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-577cb7dfd9-xlzxw/rook-ceph-operator/rook-ceph-operator/logs/current.log

Umanga, any idea?
Thanks.
(In reply to leseb from comment #4)
> 2020-03-23 14:47:07.231276 E | op-mgr: failed to enable service monitor. service monitor could not be enabled: failed to update servicemonitor. resource name may not be empty
> 2020-03-23 14:47:07.251439 E | op-mgr: failed to deploy prometheus rule. prometheus rule could not be deployed: failed to update prometheusRule. prometheusrules.monitoring.coreos.com "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
> Umanga, any idea?

Here's the upstream issue for this: https://github.com/rook/rook/issues/4528
It was fixed by https://github.com/rook/rook/pull/4603, which is probably missing from Rook v1.2.z.

Also, these errors have been there for a long time and do not affect the mgr pod in any way; our strategy was to log the error and move on. This BZ is caused by something else.
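For context, the second error is just Kubernetes refusing an Update() on an object whose metadata.name / metadata.resourceVersion were not carried over from the live resource. I haven't re-verified the exact diff in that PR, but the usual fix is a get-before-update, roughly like the sketch below (dynamic client from a recent client-go; the function and namespace handling are illustrative only, not the actual Rook code):

package monitoring

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var prometheusRuleGVR = schema.GroupVersionResource{
	Group:    "monitoring.coreos.com",
	Version:  "v1",
	Resource: "prometheusrules",
}

// updatePrometheusRule (hypothetical helper) sends the desired rule back with
// the live object's name and resourceVersion, which is what the failing code
// path in the log above was missing.
func updatePrometheusRule(ctx context.Context, dyn dynamic.Interface, ns string, desired *unstructured.Unstructured) error {
	existing, err := dyn.Resource(prometheusRuleGVR).Namespace(ns).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Without these two fields the API server answers exactly as in the log:
	// "resource name may not be empty" / "resourceVersion ... must be specified".
	desired.SetName(existing.GetName())
	desired.SetResourceVersion(existing.GetResourceVersion())
	_, err = dyn.Resource(prometheusRuleGVR).Namespace(ns).Update(ctx, desired, metav1.UpdateOptions{})
	return err
}

Either way, as noted above, this error path is only logged and does not explain the mgr restarts.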
Can you tell if the mgr died during an orchestration? Or did it happen while the operator was idle? Thanks.
Ok, after further investigation it looks like the prometheus exporter crashed. I can't really explain it, maybe a prometheus hiccup? However, looking at the mgr status, it seems to be running, and the prometheus exporter too. This is not a blocker AFAIC. I've opened an issue on the Ceph tracker, https://tracker.ceph.com/issues/44726, to better track this. I'm moving this to Ceph so someone from the mgr team can look into that failure. Thanks.
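For anyone who wants to double-check the exporter from inside the cluster, here is the quick probe I use. The service name, namespace, and port are the Rook/OCS defaults as far as I know (rook-ceph-mgr / openshift-storage / 9283), so adjust them for your deployment; this is only a sanity check, not part of any fix:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed defaults: rook-ceph-mgr service in openshift-storage, mgr
	// prometheus module on port 9283. A 200 with a metrics payload means
	// the exporter is up; a timeout or refused connection means it is not.
	const url = "http://rook-ceph-mgr.openshift-storage.svc:9283/metrics"
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("mgr prometheus exporter unreachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("mgr prometheus exporter responded with", resp.Status)
}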
Moving to 4.4.0 for now. Not sure we can get a Ceph code fix in that timeframe.
As the fix is available upstream, and this can cause issues if the mgr is restarted without the real root cause being known, marking this as a blocker for 4.4.
Hi @Pratik, Could you please answer Michael's question about the impact?
The clone is fixed and shipped in RHCS, so I believe this can be moved to ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754