Bug 1902034

Summary: Module 'crash' has failed: dictionary changed size during iteration
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasishta <vashastr>
Component: Ceph-Mgr Plugins Assignee: Boris Ranto <branto>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2 CC: akupczyk, bhubbard, ceph-eng-bugs, ceph-qe-bugs, dzafman, gmeno, jdurgin, kchai, nia, nojha, rzarzyns, sangadi, sseshasa, tserlin, vereddy, vumrao
Target Milestone: --- Flags: nia: needinfo-
Target Release: 4.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-14.2.11-93.el8cp, ceph-14.2.11-93.el7cp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-01-12 14:58:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vasishta 2020-11-26 15:52:10 UTC
Description of problem:
In a containerized cluster where a daemon was crashing continuously, the cluster reached the ERR state and reported:
Module 'crash' has failed: dictionary changed size during iteration

Version-Release number of selected component (if applicable):
14.2.11-82.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Configure a cluster and make daemons crash continuously


Actual results:
Module 'crash' has failed: dictionary changed size during iteration

Expected results:
The module should not have failed.

Additional info:

Comment 3 Vasishta 2020-12-02 05:50:12 UTC
Hi Neha,

I found a workaround upstream, though I had to modify the steps. Let me know whether this can be suggested to customers.
If not, we need to move this to 4.2 as a blocker, since it takes the cluster to the ERR state.

Ref - https://www.mail-archive.com/ceph-users@ceph.io/msg05459.html

The change needed was to empty the posted folder on the node where the most crashes were seen.

Regards,
Vasishta Shastry
QE, Ceph

Comment 4 Vasishta 2020-12-02 06:08:53 UTC
Hi,

The workaround seems to be temporary: all crash/posted entries were restored and the cluster went back to the ERR state.
Temporarily moving this back to 4.2 and proposing it as a blocker.
Please let me know if there are any concerns.

Regards,
Vasishta Shastry
QE, Ceph

Comment 5 Neha Ojha 2020-12-02 17:58:26 UTC
(In reply to Vasishta from comment #4)
> Hi,
> 
> The workaround seems to be temporary, All crash/posted entries were restored
> and cluster rolled back to ERR state
> Temporarily moving this back to 4.2 with a proposition as blocker
> Please let me know if there are any concerns.
> 
> Regards,
> Vasishta Shastry
> QE, Ceph

Do you have a live environment where we can debug this? Since nothing has changed in the crash module in 4.2, I don't believe this is a regression. It is most likely a python3 issue where we are iterating over a dict while modifying it. We'd like to fix it, but it isn't a blocker IMHO.
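
For readers unfamiliar with this class of error, the following is a minimal, self-contained sketch of the pattern described above. It is not the actual crash module code: the function names, the dict layout (crash ID mapped to age in days), and the 14-day threshold are all hypothetical. In Python 3, dict.items() returns a live view rather than a list copy as in Python 2, so deleting entries inside the loop raises the RuntimeError seen in this bug. Iterating over a snapshot avoids it.

```python
def prune_unsafe(crashes, max_age):
    """Buggy pattern: deletes from the dict while iterating over its live view."""
    for crash_id, age in crashes.items():
        if age > max_age:
            del crashes[crash_id]  # changes the dict size mid-iteration


def prune_safe(crashes, max_age):
    """Safer pattern: iterate over a snapshot (list copy) of the items."""
    for crash_id, age in list(crashes.items()):
        if age > max_age:
            del crashes[crash_id]
    return crashes


def demo():
    # Hypothetical data: crash ID -> age in days (not real crash metadata).
    crashes = {"crash-a": 1, "crash-b": 30, "crash-c": 45}

    try:
        prune_unsafe(dict(crashes), max_age=14)
        error = None
    except RuntimeError as exc:
        error = str(exc)

    remaining = prune_safe(dict(crashes), max_age=14)
    return error, remaining


if __name__ == "__main__":
    error, remaining = demo()
    print("unsafe version raised:", error)
    print("safe version kept:", remaining)
```

If the dict is modified from another thread rather than inside the loop itself, a snapshot alone may not be sufficient and a lock around both the iteration and the modification would likely be needed; which of the two cases applies here is not established in this report.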

Comment 6 Josh Durgin 2020-12-02 19:03:45 UTC
(In reply to Neha Ojha from comment #5)
> (In reply to Vasishta from comment #4)
> > Hi,
> > 
> > The workaround seems to be temporary, All crash/posted entries were restored
> > and cluster rolled back to ERR state
> > Temporarily moving this back to 4.2 with a proposition as blocker
> > Please let me know if there are any concerns.
> > 
> > Regards,
> > Vasishta Shastry
> > QE, Ceph
> 
> Do you have a live environment where we can debug this? Since nothing has
> changed in the crash module in 4.2, I don't believe this is a regression. It
> is most likely a python3 issue where we are iterating over a dict while
> modifying it. We'd like to fix it but isn't a blocker IMHO.

agreed, retargeting for 4.2z1

Comment 23 Veera Raghava Reddy 2020-12-07 08:17:15 UTC
Based on the inputs in comment 22, customers might observe this issue more frequently because https is enabled by default for the Dashboard. Considering the customer experience and the support calls this issue might generate, I would propose including this fix in 4.2.

Comment 27 Vasishta 2020-12-17 11:33:27 UTC
Working fine with ceph version 14.2.11-94.el8cp.
Moving to VERIFIED state.

(I had a cluster where the mgr crash module was in a failed state with 5800+ crashes. After upgrading to the version mentioned above, the crash module is no longer in a failed state.)

Comment 29 errata-xmlrpc 2021-01-12 14:58:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081