Bug 1949166
| Summary: | OCS 4.7 Arbiter Mode Cluster becomes stuck when entire zone is shutdown | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Greg Farnum <gfarnum> |
| Component: | RADOS | Assignee: | Greg Farnum <gfarnum> |
| Status: | CLOSED ERRATA | QA Contact: | Pawan <pdhiran> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.2 | CC: | aclewett, akupczyk, bhubbard, bniver, ceph-eng-bugs, ceph-qe-bugs, dzafman, gfarnum, jelopez, kchai, madam, mbukatov, muagarwa, nojha, ocs-bugs, owasserm, pdhiran, ratamir, rzarzyns, sostapov, sseshasa, tnielsen, tserlin, vumrao |
| Target Milestone: | --- | ||
| Target Release: | 4.2z1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-14.2.11-147.el8cp, ceph-14.2.11-147.el7cp | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 1946837 | Environment: | |
| Last Closed: | 2021-04-28 20:13:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1946837 | ||
|
Comment 1
Greg Farnum
2021-04-13 19:45:32 UTC
@Pawan
1. Did you also bring down the two mons in the same zone as the down OSDs, or are the OSDs only down?
2. In what way were the OSDs brought down? The nodes were turned off, or something else?
3. Is this a manual test, or with automation?
4. Was it hit on the first attempt, or was this after multiple tests of bringing the OSDs up/down?

(In reply to Travis Nielsen from comment #9)
> @Pawan
> 1. Did you also bring down the two mons in the same zone as the down OSDs,
> or are the OSDs only down?

Only the OSD nodes belonging to Site 2 were brought down. The two mons were still running.

> 2. In what way were the OSDs brought down? The nodes were turned off, or
> something else?

I brought down the OSD service using systemctl.

> 3. Is this a manual test, or with automation?

The test was run manually.

> 4. Was it hit on the first attempt, or was this after multiple tests of
> bringing the OSDs up/down?

I had been power cycling a few OSDs on the site to check the behavior.
1. Even when just over half of the OSDs in a site are brought down, the cluster stops serving client data (before we see the 100% PG inactive error).
2. The 100% PG inactive error is seen once all the OSDs are brought down.
3. I hit the 100% PG inactive issue on the first attempt where I brought down all the OSDs of the site (I had been bringing a few OSDs up/down earlier, but not all).

Observation:
1. Once the mons belonging to the same site are also brought down, the 100% inactive error is gone and the cluster starts serving data again.
2. The cluster recovers without any errors once the site is brought up.

So the DU scenario arises when only some/all OSDs belonging to a site go down, with the mons still running.

Ok, thanks for the clarification on the scenario. Since the BZ was originally about bringing down the entire zone including mons, I'd suggest we open a new BZ for this issue instead of failing this BZ.
(In reply to Travis Nielsen from comment #11)
> Ok thanks for the clarification on the scenario. Since the BZ was originally
> about bringing down the entire zone including mons, I'd suggest we open a
> new BZ for this issue instead of failing this BZ.

Sure, opened a new bug for the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1953640, and marked this Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage security, bug fix, and enhancement Update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1452