Is the mgr in a crash loop?
Also, I've noticed this:

020-03-23T14:47:07.220987726Z 2020-03-23 14:47:07.220899 I | op-mgr: starting monitoring deployment
2020-03-23T14:47:07.221602153Z W0323 14:47:07.221384 7 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-03-23T14:47:07.231352078Z 2020-03-23 14:47:07.231276 E | op-mgr: failed to enable service monitor. service monitor could not be enabled: failed to update servicemonitor. resource name may not be empty
2020-03-23T14:47:07.23360796Z W0323 14:47:07.233272 7 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2020-03-23T14:47:07.251538117Z 2020-03-23 14:47:07.251439 E | op-mgr: failed to deploy prometheus rule. prometheus rule could not be deployed: failed to update prometheusRule. prometheusrules.monitoring.coreos.com "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

In
http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/1816492/mar-24/must-gather-OCS/quay-io-rhceph-dev-ocs-must-gather-sha256-c8aa70af719ef8fb99d59bef11cf3335b97e48e7ac14756db0725b773e5f46f6/ceph/namespaces/openshift-storage/pods/rook-ceph-operator-577cb7dfd9-xlzxw/rook-ceph-operator/rook-ceph-operator/logs/current.log

Umanga, any idea?
Thanks.
(In reply to leseb from comment #4)
> 2020-03-23 14:47:07.231276 E | op-mgr: failed to enable service monitor. service monitor could not be enabled: failed to update servicemonitor. resource name may not be empty
> 2020-03-23 14:47:07.251439 E | op-mgr: failed to deploy prometheus rule. prometheus rule could not be deployed: failed to update prometheusRule. prometheusrules.monitoring.coreos.com "prometheus-ceph-v14-rules" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
> Umanga, any idea?

Here's the upstream issue for this: https://github.com/rook/rook/issues/4528
It was fixed by https://github.com/rook/rook/pull/4603, which is probably missing from Rook v1.2.z.

Also, these errors have been there for a long time and do not affect the mgr pod in any way; our strategy was to log the error and move on. This BZ is caused by something else.
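For context, the second error is just Kubernetes refusing an Update() on an object whose metadata.name / metadata.resourceVersion were not carried over from the live resource. I haven't re-verified the exact diff in that PR, but the usual fix is a get-before-update, roughly like the sketch below (dynamic client from a recent client-go; the function and namespace handling are illustrative only, not the actual Rook code):

package monitoring

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var prometheusRuleGVR = schema.GroupVersionResource{
	Group:    "monitoring.coreos.com",
	Version:  "v1",
	Resource: "prometheusrules",
}

// updatePrometheusRule (hypothetical helper) sends the desired rule back with
// the live object's name and resourceVersion, which is what the failing code
// path in the log above was missing.
func updatePrometheusRule(ctx context.Context, dyn dynamic.Interface, ns string, desired *unstructured.Unstructured) error {
	existing, err := dyn.Resource(prometheusRuleGVR).Namespace(ns).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Without these two fields the API server answers exactly as in the log:
	// "resource name may not be empty" / "resourceVersion ... must be specified".
	desired.SetName(existing.GetName())
	desired.SetResourceVersion(existing.GetResourceVersion())
	_, err = dyn.Resource(prometheusRuleGVR).Namespace(ns).Update(ctx, desired, metav1.UpdateOptions{})
	return err
}

Either way, as noted above, this error path is only logged and does not explain the mgr restarts.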
Can you tell if the mgr died during an orchestration? Or did it happen while the operator was idle? Thanks.
Ok, after further investigation it looks like the prometheus exporter crashed. I can't really explain it, maybe a prometheus hiccup? However, looking at the mgr status, it seems to be running, and the prometheus exporter too. This is not a blocker AFAIC. I've opened an issue on the Ceph tracker, https://tracker.ceph.com/issues/44726, to better track this. I'm moving this to Ceph so someone from the mgr team can look into that failure. Thanks.
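For anyone who wants to double-check the exporter from inside the cluster, here is the quick probe I use. The service name, namespace, and port are the Rook/OCS defaults as far as I know (rook-ceph-mgr / openshift-storage / 9283), so adjust them for your deployment; this is only a sanity check, not part of any fix:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumed defaults: rook-ceph-mgr service in openshift-storage, mgr
	// prometheus module on port 9283. A 200 with a metrics payload means
	// the exporter is up; a timeout or refused connection means it is not.
	const url = "http://rook-ceph-mgr.openshift-storage.svc:9283/metrics"
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("mgr prometheus exporter unreachable:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("mgr prometheus exporter responded with", resp.Status)
}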
Moving to 4.4.0 for now. Not sure we can get a Ceph code fix in that timeframe.
As the fix is available upstream, and this can cause issues if the mgr is restarted without the real root cause being known, marking this as a blocker for 4.4.
Hi @Pratik, Could you please answer Michael's question about the impact?
The clone is fixed and shipped in RHCS, so I believe this can be moved to ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754