Bug 1902034

Summary:	Module 'crash' has failed: dictionary changed size during iteration
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	Vasishta <vashastr>
Component:	Ceph-Mgr Plugins	Assignee:	Boris Ranto <branto>
Status:	CLOSED ERRATA	QA Contact:	Vasishta <vashastr>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.2	CC:	akupczyk, bhubbard, ceph-eng-bugs, ceph-qe-bugs, dzafman, gmeno, jdurgin, kchai, nia, nojha, rzarzyns, sangadi, sseshasa, tserlin, vereddy, vumrao
Target Milestone:	---	Flags:	nia: needinfo-
Target Release:	4.2
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	ceph-14.2.11-93.el8cp, ceph-14.2.11-93.el7cp	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-01-12 14:58:11 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Vasishta 2020-11-26 15:52:10 UTC

Description of problem:
In a containerized cluster where a daemon was crashing continuously, the cluster reached ERR state and reported:
Module 'crash' has failed: dictionary changed size during iteration

Version-Release number of selected component (if applicable):
14.2.11-82.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Configure a cluster and make daemons crash continuously.


Actual results:
Module 'crash' has failed: dictionary changed size during iteration

Expected results:
The crash module should not fail.

Additional info:

Comment 3 Vasishta 2020-12-02 05:50:12 UTC

Hi Neha,

Found a workaround from upstream, though I had to modify the steps. Let me know whether this can be suggested to customers.
If not, we need to move this to 4.2 as a blocker, since it takes the cluster to ERR state.

Ref - https://www.mail-archive.com/ceph-users@ceph.io/msg05459.html

The change needed was to empty the posted folder on the node where crashes were seen the most.

Regards,
Vasishta Shastry
QE, Ceph

Comment 4 Vasishta 2020-12-02 06:08:53 UTC

Hi,

The workaround seems to be temporary. All crash/posted entries were restored and the cluster went back to ERR state.
Temporarily moving this back to 4.2 and proposing it as a blocker.
Please let me know if there are any concerns.

Regards,
Vasishta Shastry
QE, Ceph

Comment 5 Neha Ojha 2020-12-02 17:58:26 UTC

(In reply to Vasishta from comment #4)
> Hi,
> 
> The workaround seems to be temporary, All crash/posted entries were restored
> and cluster rolled back to ERR state
> Temporarily moving this back to 4.2 with a proposition as blocker
> Please let me know if there are any concerns.
> 
> Regards,
> Vasishta Shastry
> QE, Ceph

Do you have a live environment where we can debug this? Since nothing has changed in the crash module in 4.2, I don't believe this is a regression. It is most likely a Python 3 issue where we are iterating over a dict while modifying it. We'd like to fix it, but it isn't a blocker IMHO.
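
For reference, the failure mode suspected here can be reproduced outside Ceph. The sketch below is not the crash module's code and not the fix that shipped in the erratum; the names `prune_unsafe`, `prune_safe`, and the `timestamp` field are hypothetical. It only shows that on Python 3, where `dict.items()` returns a live view rather than a list, deleting entries while iterating raises the same RuntimeError, and that iterating over a snapshot avoids it.

```python
def prune_unsafe(crashes, cutoff):
    """Delete old entries while iterating the live dict (hypothetical helper).

    On Python 3 this raises RuntimeError once an entry has been deleted
    and the iterator is advanced again.
    """
    for crash_id, meta in crashes.items():
        if meta["timestamp"] < cutoff:
            del crashes[crash_id]


def prune_safe(crashes, cutoff):
    """Same pruning, but iterate over a snapshot of the items (hypothetical helper)."""
    # list(...) copies the (key, value) pairs up front, so the dict
    # itself can be modified inside the loop.
    for crash_id, meta in list(crashes.items()):
        if meta["timestamp"] < cutoff:
            del crashes[crash_id]
    return crashes


if __name__ == "__main__":
    sample = {
        "a": {"timestamp": 1},
        "b": {"timestamp": 5},
        "c": {"timestamp": 2},
    }
    try:
        prune_unsafe(dict(sample), cutoff=3)
    except RuntimeError as exc:
        print("unsafe:", exc)
    print("safe:", prune_safe(dict(sample), cutoff=3))
```

If the dict is modified from another thread rather than inside the loop, a snapshot narrows the window but a lock around both the iteration and the modification would be the more robust choice; which of the two applies in the crash module is not established in this report.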

Comment 6 Josh Durgin 2020-12-02 19:03:45 UTC

(In reply to Neha Ojha from comment #5)
> (In reply to Vasishta from comment #4)
> > Hi,
> > 
> > The workaround seems to be temporary, All crash/posted entries were restored
> > and cluster rolled back to ERR state
> > Temporarily moving this back to 4.2 with a proposition as blocker
> > Please let me know if there are any concerns.
> > 
> > Regards,
> > Vasishta Shastry
> > QE, Ceph
> 
> Do you have a live environment where we can debug this? Since nothing has
> changed in the crash module in 4.2, I don't believe this is a regression. It
> is most likely a python3 issue where we are iterating over a dict while
> modifying it. We'd like to fix it but isn't a blocker IMHO.

Agreed, retargeting for 4.2z1.

Comment 23 Veera Raghava Reddy 2020-12-07 08:17:15 UTC

Based on the inputs in comment 22, customers might observe this issue more frequently because HTTPS is enabled by default for the Dashboard. Considering the customer experience and the support calls this issue might generate, I would propose this fix for 4.2.

Comment 27 Vasishta 2020-12-17 11:33:27 UTC

Working fine with ceph version 14.2.11-94.el8cp.
Moving to VERIFIED state.

(Had a cluster with the mgr crash module in failed state with 5800+ crashes. After upgrading to the version mentioned above, the crash module is no longer in failed state.)

Comment 29 errata-xmlrpc 2021-01-12 14:58:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081