Bug 1816492 - [BAREMETAL] rook-ceph-mgr pod restarted with assert message
Summary: [BAREMETAL] rook-ceph-mgr pod restarted with assert message
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Neha Ojha
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On: 1831119
Blocks:
 
Reported: 2020-03-24 06:12 UTC by Pratik Surve
Modified: 2020-09-15 10:16 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1831119 (view as bug list)
Environment:
Last Closed: 2020-09-15 10:16:04 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 34326 0 None closed nautilus: mgr: synchronize ClusterState's health and mon_status. 2020-12-01 04:45:22 UTC
Red Hat Product Errata RHBA-2020:3754 0 None None None 2020-09-15 10:16:32 UTC

Comment 3 Sébastien Han 2020-03-24 09:28:44 UTC
Is the mgr in a crash loop?

Comment 4 Sébastien Han 2020-03-24 09:29:42 UTC
Also, I've noticed this:


2020-03-23T14:47:07.220987726Z 2020-03-23 14:47:07.220899 I | op-mgr: starting monitoring deployment
2020-03-23T14:47:07.221602153Z W0323 14:47:07.221384       7 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-03-23T14:47:07.231352078Z 2020-03-23 14:47:07.231276 E | op-mgr: failed to enable service monitor. service monitor could not be enabled: failed to update servicemonitor. resource name may not be empty
2020-03-23T14:47:07.23360796Z W0323 14:47:07.233272       7 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-03-23T14:47:07.251538117Z 2020-03-23 14:47:07.251439 E | op-mgr: failed to deploy prometheus rule. prometheus rule could not be deployed: failed to update prometheusRule. prometheusrules.monitoring.coreos.com "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update


In http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/1816492/mar-24/must-gather-OCS/quay-io-rhceph-dev-ocs-must-gather-sha256-c8aa70af719ef8fb99d59bef11cf3335b97e48e7ac14756db0725b773e5f46f6/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-577cb7dfd9-xlzxw/rook-ceph-operator/rook-ceph-operator/logs/current.log


Umanga, any idea?
Thanks.
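The `metadata.resourceVersion: ... must be specified for an update` error above comes from Kubernetes' optimistic-concurrency rule: an update request must carry the live object's resourceVersion, which the client normally obtains with a get-before-update. Below is a minimal illustrative sketch (not the actual Rook code; `FakeAPIServer` and `update_rule` are hypothetical names) of that rule and the get-then-update pattern that avoids the error:

```python
class Conflict(Exception):
    pass


class FakeAPIServer:
    """Toy stand-in for the Kubernetes API server's update semantics."""

    def __init__(self):
        self._store = {}  # name -> object dict
        self._rv = 0      # monotonically increasing resourceVersion

    def create(self, name, spec):
        self._rv += 1
        obj = {"metadata": {"name": name, "resourceVersion": str(self._rv)},
               "spec": spec}
        self._store[name] = obj
        return obj

    def get(self, name):
        return self._store[name]

    def update(self, name, obj):
        rv = obj["metadata"].get("resourceVersion", "")
        if not rv:
            # The failure mode seen in the operator log above.
            raise ValueError(
                "metadata.resourceVersion: must be specified for an update")
        if rv != self._store[name]["metadata"]["resourceVersion"]:
            raise Conflict("resourceVersion mismatch")
        self._rv += 1
        obj["metadata"]["resourceVersion"] = str(self._rv)
        self._store[name] = obj
        return obj


def update_rule(api, name, new_spec):
    """Get-then-update: copy the live resourceVersion before updating."""
    current = api.get(name)
    current["spec"] = new_spec
    return api.update(name, current)
```

An update built from scratch (empty resourceVersion) is rejected exactly as in the log, while `update_rule` succeeds because it reuses the version from the live object.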

Comment 6 umanga 2020-03-24 10:07:51 UTC
(In reply to leseb from comment #4)
> Also, I've noticed this:
> 
> 
> 2020-03-23T14:47:07.220987726Z 2020-03-23 14:47:07.220899 I | op-mgr:
> starting monitoring deployment
> 2020-03-23T14:47:07.221602153Z W0323 14:47:07.221384       7
> client_config.go:549] Neither --kubeconfig nor --master was specified. 
> Using the inClusterConfig.  This might not work.
> 2020-03-23T14:47:07.231352078Z 2020-03-23 14:47:07.231276 E | op-mgr: failed
> to enable service monitor. service monitor could not be enabled: failed to
> update servicemonitor. resource name may not be empty
> 2020-03-23T14:47:07.23360796Z W0323 14:47:07.233272       7
> client_config.go:549] Neither --kubeconfig nor --master was specified. 
> Using the inClusterConfig.  This might not work.
> 2020-03-23T14:47:07.251538117Z 2020-03-23 14:47:07.251439 E | op-mgr: failed
> to deploy prometheus rule. prometheus rule could not be deployed: failed to
> update prometheusRule. prometheusrules.monitoring.coreos.com
> "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid
> value: 0x0: must be specified for an update
> 
> 
> In
> http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/1816492/mar-24/
> must-gather-OCS/quay-io-rhceph-dev-ocs-must-gather-sha256-
> c8aa70af719ef8fb99d59bef11cf3335b97e48e7ac14756db0725b773e5f46f6/ceph/
> namespaces/openshift-storage/pods/rook-ceph-operator-577cb7dfd9-xlzxw/rook-
> ceph-operator/rook-ceph-operator/logs/current.log
> 
> 
> Umanga, any idea?
> Thanks.

Here's the upstream issue for this: https://github.com/rook/rook/issues/4528
It was fixed by https://github.com/rook/rook/pull/4603. It is probably missing from Rook v1.2.z?

Also, these errors have been there for a long time; they do not affect the mgr pod in any way.
Our strategy was to log the error and move on.

This BZ is caused by something else.

Comment 7 Sébastien Han 2020-03-24 10:55:08 UTC
Can you tell if the mgr died during an orchestration? Or did it happen when the operator was idle? Thanks

Comment 8 Sébastien Han 2020-03-24 11:09:08 UTC
OK, after further investigation it looks like the prometheus exporter crashed.
I can't really explain it; maybe a prometheus hiccup?

However, looking at the mgr status, both the mgr and the prometheus exporter seem to be running.
This is not a blocker AFAIC; I've opened an issue on the Ceph tracker, https://tracker.ceph.com/issues/44726, to better track this.

I'm moving this to Ceph so that someone from the mgr team can look into that failure.
Thanks.
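The linked nautilus backport ("mgr: synchronize ClusterState's health and mon_status") points at the general hazard class here: two related pieces of shared mgr state updated from different threads without a common lock, so a reader can observe an inconsistent pair and trip an assert. A minimal illustrative sketch of the lock-based fix pattern follows; the class and method names are hypothetical and this is not the actual ceph-mgr C++ code:

```python
import threading


class ClusterState:
    """Illustrative only: guard related fields with one shared lock so a
    reader never sees health and mon_status from different updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._health = {}
        self._mon_status = {}

    def set_health(self, health):
        with self._lock:
            self._health = health

    def set_mon_status(self, status):
        with self._lock:
            self._mon_status = status

    def snapshot(self):
        # Take both fields under the same lock, so the returned pair is
        # always mutually consistent.
        with self._lock:
            return dict(self._health), dict(self._mon_status)
```

Without the shared lock, a snapshot taken between the two setters could mix an old `mon_status` with a new `health`, which is the kind of inconsistency an internal assert is meant to catch.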

Comment 9 Michael Adam 2020-03-26 16:51:10 UTC
Moving to 4.4.0 for now.
Not sure we can get a ceph code fix in that timeframe.

Comment 11 Raz Tamir 2020-04-12 14:01:45 UTC
As the fix is available upstream, and this can cause issues if the mgr is restarted without knowing the real root cause, marking as a blocker for 4.4.

Comment 20 Raz Tamir 2020-05-11 11:48:33 UTC
Hi @Pratik,

Could you please answer Michael's question about the impact?

Comment 22 Yaniv Kaul 2020-06-25 10:25:20 UTC
The clone is fixed and shipped in RHCS; I believe this can be moved to ON_QA.

Comment 28 errata-xmlrpc 2020-09-15 10:16:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

