Multus is Tech Preview, so this doesn't seem like a blocker. Rohan, can you take a look?
Removing the blocker flag, which was added due to the urgent severity.
Can you pinpoint exactly when this issue started reproducing? When it happens, can we get a connection to the live cluster to debug? (Rohan/Seb/me)
Not a blocker for 4.8
Rohan, can you take a look?
I see from the logs that there is a network connectivity issue in the cluster:

debug 2021-08-04 19:20:05.453 7fb954953700 1 osd.0 pg_epoch: 89 pg[10.5( empty local-lis/les=74/75 n=0 ec=48/48 lis/c 83/68 les/c/f 84/69/0 89/89/89) [0,2,1] r=0 lpr=89 pi=[68,89)/3 crt=0'0 mlcod 0'0 unknown mbc={}] state<Start>: transitioning to Primary
debug 2021-08-04 19:20:06.455 7fb961767700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.0 down, but it is still running
debug 2021-08-04 19:20:06.455 7fb961767700 0 log_channel(cluster) log [DBG] : map e90 wrongly marked me down at e90
debug 2021-08-04 19:20:06.455 7fb961767700 0 osd.0 90 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
debug 2021-08-04 19:20:06.455 7fb961767700 1 osd.0 90 start_waiting_for_healthy
debug 2021-08-04 19:20:06.459 7fb954152700 1 osd.0 pg_epoch: 90 pg[10.1f( empty local-lis/les=81/82 n=0 ec=48/48 lis/c 83/81 les/c/f 84/82/0 90/90/74) [1,2] r=-1 lpr=90 pi=[81,90)/1 crt=0'0 unknown NOTIFY mbc={}] start_peering_interval up [1,2,0] -> [1,2], acting [1,2,0] -> [1,2], acting_primary 1 -> 1, up_primary 1 -> 1, role 2 -> -1, features acting 4611087858330828799 upacting 4611087858330828799

At certain intervals, network connectivity between the pods on the Multus network gets broken.
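To make the shutdown condition in the log above concrete: the OSD stops itself once it has been "wrongly marked down" more than osd_max_markdown_count (5) times within the 600-second window, which is exactly what the "marked down 6 > osd_max_markdown_count 5" line reports. Below is a minimal, hypothetical sketch (not Ceph's actual implementation) that scans OSD log lines for these markdown events and checks whether the same threshold would trip; the function names and parsing regex are assumptions for illustration only.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helpers for triaging OSD logs; the line format matches the
# "debug <date> <time> ..." prefix seen in the pasted log excerpt.
TS_RE = re.compile(r"^debug (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")

def markdown_events(log_lines):
    """Extract timestamps of 'wrongly marked me down' events from OSD log lines."""
    events = []
    for line in log_lines:
        if "wrongly marked me down" in line:
            m = TS_RE.match(line)
            if m:
                events.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return events

def exceeds_markdown_limit(events, max_markdown=5, window_s=600.0):
    """True if more than max_markdown events fall within any trailing window,
    mirroring the osd_max_markdown_count / 600s check from the log."""
    win = timedelta(seconds=window_s)
    for i, t in enumerate(events):
        recent = [e for e in events[: i + 1] if t - e <= win]
        if len(recent) > max_markdown:
            return True
    return False
```

Running this over the full OSD log for the affected time range would show how quickly the markdown events accumulate, which should correlate with the intervals where the Multus network drops.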
Sidhant, is this still reproducible?