Bug 1902034 - Module 'crash' has failed: dictionary changed size during iteration
Summary: Module 'crash' has failed: dictionary changed size during iteration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Mgr Plugins
Version: 4.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2
Assignee: Boris Ranto
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-26 15:52 UTC by Vasishta
Modified: 2021-01-12 14:58 UTC (History)

Fixed In Version: ceph-14.2.11-93.el8cp, ceph-14.2.11-93.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 14:58:11 UTC
Embargoed:
nia: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 48573 0 None None None 2020-12-11 22:58:19 UTC
Github ceph ceph pull 38453 0 None closed mgr/crash: Serialize command handling 2021-01-13 19:51:11 UTC
Red Hat Product Errata RHSA-2021:0081 0 None None None 2021-01-12 14:58:31 UTC

Description Vasishta 2020-11-26 15:52:10 UTC
Description of problem:
In a containerized cluster where a daemon was crashing continuously, the cluster reached the ERR state and reported:
Module 'crash' has failed: dictionary changed size during iteration

Version-Release number of selected component (if applicable):
14.2.11-82.el8cp

How reproducible:
Tried once

Steps to Reproduce:
1. Configure a cluster and make daemons crash continuously


Actual results:
Module 'crash' has failed: dictionary changed size during iteration

Expected results:
The module should not have failed

Additional info:

Comment 3 Vasishta 2020-12-02 05:50:12 UTC
Hi Neha,

Found a workaround from upstream, though I had to modify the steps. Let me know whether this can be suggested to customers.
If not, we need to move this to 4.2 as a blocker, since it takes the cluster to the ERR state.

Ref - https://www.mail-archive.com/ceph-users@ceph.io/msg05459.html

The change needed was to empty the posted folder on the node where crashes were seen the most.

Regards,
Vasishta Shastry
QE, Ceph

Comment 4 Vasishta 2020-12-02 06:08:53 UTC
Hi,

The workaround seems to be temporary. All crash/posted entries were restored and the cluster went back to the ERR state.
Temporarily moving this back to 4.2, proposed as a blocker.
Please let me know if there are any concerns.

Regards,
Vasishta Shastry
QE, Ceph

Comment 5 Neha Ojha 2020-12-02 17:58:26 UTC
(In reply to Vasishta from comment #4)
> Hi,
> 
> The workaround seems to be temporary, All crash/posted entries were restored
> and cluster rolled back to ERR state
> Temporarily moving this back to 4.2 with a proposition as blocker
> Please let me know if there are any concerns.
> 
> Regards,
> Vasishta Shastry
> QE, Ceph

Do you have a live environment where we can debug this? Since nothing has changed in the crash module in 4.2, I don't believe this is a regression. It is most likely a Python 3 issue where we are iterating over a dict while modifying it. We'd like to fix it, but it isn't a blocker IMHO.
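
For illustration, the failure mode suspected here can be reproduced in a few lines of plain Python 3. This is a minimal sketch, not the mgr crash module code: the `crashes` table, its keys, and both function names are hypothetical. In Python 3, iterating a dict (or its `.keys()`/`.items()` views) while adding or removing entries raises RuntimeError; iterating over a snapshot such as `list(table)` avoids it.

```python
# Hypothetical stand-in for a crash table; NOT the actual mgr/crash data structure.
crashes = {f"crash-{i}": {"archived": False} for i in range(3)}


def iterate_while_modifying(table):
    """Add an entry in the middle of iteration; Python 3 raises RuntimeError."""
    try:
        for crash_id in table:
            # Changing the dict's size while it is being iterated triggers the error.
            table[crash_id + "-new"] = {"archived": False}
    except RuntimeError as exc:
        return str(exc)
    return None


def iterate_over_snapshot(table):
    """Iterate over a copy of the keys so the table itself can change safely."""
    for crash_id in list(table):
        table[crash_id + "-new"] = {"archived": False}
    return len(table)


# Each call gets its own copy so the two runs do not affect each other.
error_message = iterate_while_modifying(dict(crashes))
snapshot_size = iterate_over_snapshot(dict(crashes))
print(error_message)
print(snapshot_size)
```

The message captured in `error_message` is the same text the cluster reported for the failed module.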

Comment 6 Josh Durgin 2020-12-02 19:03:45 UTC
(In reply to Neha Ojha from comment #5)
> (In reply to Vasishta from comment #4)
> > Hi,
> > 
> > The workaround seems to be temporary, All crash/posted entries were restored
> > and cluster rolled back to ERR state
> > Temporarily moving this back to 4.2 with a proposition as blocker
> > Please let me know if there are any concerns.
> > 
> > Regards,
> > Vasishta Shastry
> > QE, Ceph
> 
> Do you have a live environment where we can debug this? Since nothing has
> changed in the crash module in 4.2, I don't believe this is a regression. It
> is most likely a python3 issue where we are iterating over a dict while
> modifying it. We'd like to fix it but isn't a blocker IMHO.

agreed, retargeting for 4.2z1
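
Going only by the title of the linked pull request ("mgr/crash: Serialize command handling"), one general way to avoid this class of error is to serialize every access to the shared dict behind a single lock. The sketch below shows that pattern under stated assumptions: `CrashTable`, `post`, and `summary` are hypothetical names and do not describe the actual change in that pull request.

```python
import threading


class CrashTable:
    """Hypothetical stand-in for shared crash state; NOT the Ceph mgr module."""

    def __init__(self):
        self._lock = threading.Lock()
        self._crashes = {}

    def post(self, crash_id, meta):
        # Writers take the lock, so the dict never changes size under a reader.
        with self._lock:
            self._crashes[crash_id] = meta

    def summary(self):
        # Readers hold the same lock for the whole iteration.
        with self._lock:
            return sum(
                1 for meta in self._crashes.values() if not meta.get("archived")
            )


table = CrashTable()


def writer(start):
    """Post 1000 hypothetical crash entries starting at the given index."""
    for i in range(start, start + 1000):
        table.post(f"crash-{i}", {"archived": False})


threads = [threading.Thread(target=writer, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()

# Read concurrently with the writers; the lock keeps each iteration consistent.
unarchived_seen = [table.summary() for _ in range(100)]

for t in threads:
    t.join()

final_count = table.summary()
print(final_count)
```

The trade-off is that readers and writers block each other for the duration of each call, which is usually acceptable for a small in-memory table.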

Comment 23 Veera Raghava Reddy 2020-12-07 08:17:15 UTC
Based on the inputs in comment 22, customers might observe this issue more frequently because HTTPS is enabled by default for the Dashboard. Considering the customer experience and the support calls this issue might generate, I would propose this fix for 4.2.

Comment 27 Vasishta 2020-12-17 11:33:27 UTC
Working fine with ceph version 14.2.11-94.el8cp 
Moving to VERIFIED state.

(Had a cluster with the mgr crash module in a failed state with 5800+ crashes. After upgrading to the version mentioned above, the crash module is no longer in a failed state.)

Comment 29 errata-xmlrpc 2021-01-12 14:58:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081

