Bug 1826450 - leader monitor marked as down after "tcmalloc: large alloc" despite service and container are running
Summary: leader monitor marked as down after "tcmalloc: large alloc" despite service a...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.*
Assignee: Neha Ojha
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-21 17:41 UTC by Vasishta
Modified: 2021-05-17 05:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-01 12:51:42 UTC
Embargoed:


Description Vasishta 2020-04-21 17:41:41 UTC
Description of problem:
The monitor was marked as down after "tcmalloc: large alloc", even though the service and container were still running (retrying handle_auth_request).

Version-Release number of selected component (if applicable):
14.2.8-35
Upgraded from luminous to nautilus
3.3.z4 to 4.1 latest build


How reproducible:
(Tried only once)

Steps followed:
1. Upgrade cluster from luminous to nautilus
2. Convert all filestore OSDs to bluestore OSDs; one host at a time

Actual results:
Monitor service is down

Expected results:
Monitor must be up and running

Additional info:

Comment 2 Yaniv Kaul 2020-04-23 14:32:00 UTC
The steps are somewhat too simplistic.
- does it happen all the time?
- is there a workaround?
- full logs would be great, not just the journal?

Comment 3 Josh Durgin 2020-04-23 17:48:08 UTC
Is there a crash here? If the monitor was just marked down due to some slowness but kept running, that does not meet the blocker criteria: https://mojo.redhat.com/docs/DOC-1146159

Comment 4 Vasishta 2020-04-24 07:27:34 UTC
Hi Josh,

(In reply to Josh Durgin from comment #3)
> Is there a crash here? If the monitor was just marked down due to some
> slowness but kept running, that does not meet the blocker criteria:
> https://mojo.redhat.com/docs/DOC-1146159

No, I couldn't observe any crash, clearing blocker?


(In reply to Yaniv Kaul from comment #2)
> The steps are somewhat too simplistic.
> - does it happen all the time?
> - is there a workaround?
> - full logs would be great, not just the journal?

Hi Yaniv,

Will try to reproduce in a couple of days.
We couldn't collect logs because logs aren't stored by default in a containerized scenario.
Ref - https://bugzilla.redhat.com/show_bug.cgi?id=1794409 .
Will get full logs if reproduced.

Regards,
Vasishta Shastry
QE, Ceph

Comment 5 Josh Durgin 2020-04-27 13:14:29 UTC
(In reply to Vasishta from comment #4)
> Hi Josh,
> 
> (In reply to Josh Durgin from comment #3)
> > Is there a crash here? If the monitor was just marked down due to some
> > slowness but kept running, that does not meet the blocker criteria:
> > https://mojo.redhat.com/docs/DOC-1146159
> 
> No, I couldn't observe any crash, clearing blocker?

Thanks, moving out of 4.1 to the backlog to clarify that it isn't a blocker.

