1826450 – leader monitor marked as down after "tcmalloc: large alloc" despite service and container are running

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	RADOS
Sub Component:
Version:	4.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	5.*
Assignee:	Neha Ojha
QA Contact:	Vasishta
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-04-21 17:41 UTC by Vasishta
Modified:	2021-05-17 05:41 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-01 12:51:42 UTC
Embargoed:
Dependent Products:

Description Vasishta 2020-04-21 17:41:41 UTC

Description of problem:
Monitor was marked as down after "tcmalloc: large alloc", even though the service and container are running (retrying handle_auth_request).

Version-Release number of selected component (if applicable):
14.2.8-35
Upgraded from luminous to nautilus
3.3.z4 to 4.1 latest build


How reproducible:
(Tried only once)

Steps followed:
1. Upgrade cluster from luminous to nautilus
2. Convert all filestore OSDs to bluestore OSDs; one host at a time

Actual results:
Monitor service is down

Expected results:
Monitor must be up and running

Additional info:

Comment 2 Yaniv Kaul 2020-04-23 14:32:00 UTC

The steps are somewhat too simplistic.
- does it happen all the time?
- is there a workaround?
- full logs would be great, not just the journal?

Comment 3 Josh Durgin 2020-04-23 17:48:08 UTC

Is there a crash here? If the monitor was just marked down due to some slowness but kept running, that does not meet the blocker criteria: https://mojo.redhat.com/docs/DOC-1146159

Comment 4 Vasishta 2020-04-24 07:27:34 UTC

Hi Josh,

(In reply to Josh Durgin from comment #3)
> Is there a crash here? If the monitor was just marked down due to some
> slowness but kept running, that does not meet the blocker criteria:
> https://mojo.redhat.com/docs/DOC-1146159

No, I couldn't observe any crash. Clearing blocker?


(In reply to Yaniv Kaul from comment #2)
> The steps are somewhat too simplistic.
> - does it happen all the time?
> - is there a workaround?
> - full logs would be great, not just the journal?

Hi Yaniv,

Will try to reproduce in a couple of days.
We couldn't procure logs, as logs aren't stored by default in a containerized scenario.
Ref - https://bugzilla.redhat.com/show_bug.cgi?id=1794409 .
Will get full logs if reproduced.

Regards,
Vasishta Shastry
QE, Ceph

Comment 5 Josh Durgin 2020-04-27 13:14:29 UTC

(In reply to Vasishta from comment #4)
> Hi Josh,
> 
> (In reply to Josh Durgin from comment #3)
> > Is there a crash here? If the monitor was just marked down due to some
> > slowness but kept running, that does not meet the blocker criteria:
> > https://mojo.redhat.com/docs/DOC-1146159
> 
> No, I couldn't observe any crash. Clearing blocker?

Thanks, moving out of 4.1 to the backlog to clarify that it isn't a blocker.
