Bug 1861780
Summary: | [Tracker BZ1866386][IBM s390x] Mount Failed for CEPH while running couple of OCS test cases. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Chidanand Harlapur <charlapu> |
Component: | ceph | Assignee: | Scott Ostapovicz <sostapov> |
Status: | CLOSED ERRATA | QA Contact: | Elad <ebenahar> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.4 | CC: | aeyal, bkunal, bniver, brueckner, ebenahar, hwolf, jlayton, jligon, khiremat, madam, mrajanna, muagarwa, nberry, ocs-bugs, pdonnell, ratamir, rcyriac, sarumuga, sostapov, tdesala, uweigand, vpiniset |
Target Milestone: | --- | Keywords: | AutomationBackLog, Tracking |
Target Release: | OCS 4.6.0 | ||
Hardware: | s390x | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-12-17 06:23:13 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Chidanand Harlapur
2020-07-29 14:23:41 UTC
Log files for OCS and OCP can be found at this location:
https://drive.google.com/drive/folders/1CK2PcG63pW9Z1XB74aXD8v50WmVlOzZe?usp=sharing

What kernel was the client running? It probably needs this patch (which is already merged for RHEL8.3):
https://marc.info/?l=ceph-devel&m=158807659304587&w=2
You may want to update the kernel on the client to the latest RHEL8.3 beta kernel and see whether this is still reproducible.

The client is running the following version:
Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux

Thanks. Yeah, that kernel doesn't have the endianness fix, as that went into -207.el8. See:
https://bugzilla.redhat.com/show_bug.cgi?id=1827767
If you have the ability to run this on a RHEL8.3 kernel, then it should (hopefully) work.

I don't have access to view this bug ("You are not authorized to access bug #1827767"):
https://bugzilla.redhat.com/show_bug.cgi?id=1827767

Chidanand, I cc'ed you on the RHEL8.3 update bug, but it's not very interesting as it's just a big rollup of upstream patches. What I'd probably suggest is getting the latest RHEL-8.3.0 candidate kernel you can find, and seeing if this is still reproducible with it. See:
https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=1231

Assigning this to Jeff so he can make sure it gets tested with the latest RHEL-8.3.0 candidate kernel.

Chidanand, do you have the ability to run this test and override the kernel that it uses?

Jeff, I have never tried this and don't have much idea how to override the kernel. If you can share some instructions, I can give it a try.

I think that depends on the testing infrastructure you were using. Maybe Raz (our QA contact) can help?

Hi Jeff,

I'm not sure what is the ask here.
Are you suggesting to run the tests over RHEL 8.3? or any other modifications to the kernel itself?
We are running our tests over RHEL which is supported in OCP.
(In reply to Raz Tamir from comment #12)
> Hi Jeff,
>
> I'm not sure what is the ask here.
> Are you suggesting to run the tests over RHEL 8.3? or any other
> modifications to the kernel itself?
> We are running our tests over RHEL which is supported in OCP.

The initial description was "Setup OCP and OCS on KVM guest on s390x hardware" - I don't think you are running on that one. Setting NEEDINFO on the initial reporter to ask whether the kernel can be updated on that VM.

The setup is on an s390x LPAR, and I installed OCP 4.4 using libvirt by following the steps in this link:
https://github.com/openshift/installer/tree/release-4.4/docs/dev/libvirt
The OCP cluster is set up with 3 master and 3 worker nodes. All nodes are running Red Hat CoreOS with this version:
Linux test1-f9xps-worker-0-rwvmr 4.18.0-193.13.2.el8_2.s390x #1 SMP Mon Jul 13 23:23:50 UTC 2020 s390x s390x s390x GNU/Linux
Regarding the kernel update, I can try to update the kernel on those KVM guests. It would be helpful if you could give me a link to the latest kernel that supports s390x hardware.

A quick update on the nature of the kernel bug. A while ago, the kernel cephfs component was extended to support a new element of the on-the-wire protocol between cephfs and the MDS daemon that allows negotiation of the supported feature set. The initial version of that patch was broken on big-endian machines, causing *any* cephfs mount to fail. This bug was fixed by the patch in comment #3.

So we have three possible states:
1) old cephfs without feature selection - works on s390x
2) cephfs with (buggy) feature selection - broken on s390x
3) cephfs with fixed feature selection - works on s390x

When we initially noticed and fixed that problem, it looked like the RHEL kernels didn't have that issue, because at the time they were still using an old kernel (1) without the feature selection code. That's why we just fixed it upstream and didn't ask for a backport.
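The failure mode described above can be illustrated with a minimal Python sketch. This is not the kernel code: it only shows the general class of bug, where a feature bitmask that travels in little-endian byte order is misread when a big-endian host interprets the raw bytes in its native order. The bit number and the function name are hypothetical.

```python
# Illustration only: a 64-bit feature mask encoded little-endian on the wire.
FEATURE_BIT = 3  # hypothetical feature bit, chosen for the example


def has_feature(wire_bytes: bytes, bit: int, byteorder: str) -> bool:
    """Decode the mask using the given byte order and test one bit."""
    mask = int.from_bytes(wire_bytes, byteorder)
    return bool(mask & (1 << bit))


# The sender sets the feature bit and encodes the mask as little-endian.
wire = (1 << FEATURE_BIT).to_bytes(8, "little")

# Decoding as little-endian finds the feature bit.
print(has_feature(wire, FEATURE_BIT, "little"))

# Reading the same bytes in big-endian order (what a big-endian host does
# if it skips the conversion) looks at the wrong bit and misses the feature.
print(has_feature(wire, FEATURE_BIT, "big"))
```

On a little-endian machine such as x86_64, the native order happens to match the wire order, which is why a bug of this kind can stay hidden there and only show up on big-endian hardware such as s390x.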
However, it looks like we didn't notice that in the meantime, the feature selection code was actually backported to the RHEL kernel -- unfortunately (initially) without the fix, so now we have the situation (2) in the current RHEL kernel where cephfs mounts are completely broken. As mentioned in comment #5, the fix has now been backported as well, and will be in the RHEL 8.3 kernel.

This means as far as RHEL kernels are concerned, we have the following status:
before 4.18.0-154: state (1) - works on s390x
since 4.18.0-154 but before 4.18.0-207: state (2) - broken on s390x
since 4.18.0-207: state (3) - works on s390x

For RHEL 8.x releases this implies:
RHEL 8.1 - GA kernel 4.18.0-147 - works on s390x
RHEL 8.2 - GA kernel 4.18.0-193 - broken on s390x
RHEL 8.3 - GA kernel >4.18.0-207 - will work on s390x

So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed again in RHEL 8.3. Given that this is a regression, I guess one question would be whether this ought to be fixed in the RHEL 8.2 maintenance stream as well.

However, the RHEL product isn't really what is relevant for our use case here, because we're using OCP on RH CoreOS, we are not actually using RHEL. (OCP is only supported on top of CoreOS on Z.) So the interesting question is how to get the fix into *CoreOS* (and therefore OCP). It looks like both OCP 4.4 and OCP 4.5 currently use the same -193 kernel that is in RHEL 8.2, and are therefore broken on s390x.

I'm not so familiar with how the CoreOS upgrade process works, so here are a few questions:
- What kernel will OCP 4.6 use? How can we ensure this bug will be fixed there?
- Should this regression get fixed as a maintenance update for OCP 4.5 (and possibly 4.4)?
- If there will be no official fix on 4.5, is there some way we can work around the bug? I believe you cannot simply install another kernel in CoreOS ...

As an aside, note that it now appears that installing OCP on KVM vs. z/VM actually doesn't make any difference here, it's just that our z/VM install was still based on OCP 4.3 (using the RHEL 8.1 kernel), and therefore did not yet show this bug.

Thanks for the succinct explanation Ulrich.
> So in a sense, RHEL 8.2 introduced a cephfs regression that will be fixed again in RHEL 8.3. Given that this is a regression, I guess one question would be whether this ought to be fixed in the RHEL 8.2 maintenance stream as well.
I'd be fine with that. The patch is pretty safe, so backporting should be no big deal. I'm not well versed enough in CoreOS to know what we'd need to do for that though.
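The kernel ranges listed in the explanation above can be turned into a quick check, sketched here in Python. The function name and the `STATES` table are illustrative assumptions; the thresholds (154 and 207) and the state descriptions come from that comment. The sketch looks only at the base build number after `4.18.0-`, so a z-stream kernel that carries a backported fix would still be reported according to its base build.

```python
import re

# State descriptions as given in the thread.
STATES = {
    1: "old cephfs without feature selection - works on s390x",
    2: "cephfs with (buggy) feature selection - broken on s390x",
    3: "cephfs with fixed feature selection - works on s390x",
}


def cephfs_state(release: str) -> int:
    """Map a RHEL 8 kernel release string to state 1, 2 or 3.

    Only the base build number is inspected (for example 193 in
    4.18.0-193.13.2.el8_2.s390x); z-stream backports are not detected.
    """
    match = re.match(r"4\.18\.0-(\d+)", release)
    if match is None:
        raise ValueError(f"not a 4.18.0 kernel release: {release!r}")
    build = int(match.group(1))
    if build < 154:
        return 1
    if build < 207:
        return 2
    return 3


for release in ("4.18.0-147.el8.s390x", "4.18.0-193.13.2.el8_2.s390x"):
    state = cephfs_state(release)
    print(f"{release}: state ({state}) - {STATES[state]}")
```

On a node, the output of `uname -r` can be passed to `cephfs_state` directly. Because of the z-stream caveat, a result of state (2) for an 8.2.z kernel should be treated as a hint to check the kernel changelog rather than as a final answer.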
Oof, spoke a bit too soon about the safety of that patch. I just opened BZ#1866018 today, which is a regression that was caused by that endianness fix. If you pull that patch into CoreOS, you'll also want this one (not yet merged upstream):
https://marc.info/?l=ceph-devel&m=159655872206314&w=2

In an offline conversation with Elad and Chidanand it was clear that OCP 4.5 CoreOS is running with kernel version 4.18.0-193, which is causing the problem on s390x hardware. The CephFS issue, which is fixed in kernel version 207 and above, is why the OCS tier1 test cases are failing.

This can be hit with any OCS version under these conditions, so it can be a test blocker, but I don't think it should be a blocker for OCS 4.5. I would like to remove the blocker flag for 4.5 and/or move it to the next release unless someone thinks otherwise.

Jeff, I don't seem to be able to access BZ#1866018, could you add me on CC?

Following up on comment #14, I looked more closely at the CoreOS situation:
- The CoreOS/OCP 4.4 GA release actually still uses the RHEL 8.1 kernel, so this works on s390x
- However, at some point (around 2020-07-14) the nightly dev-prerelease stream of OCP 4.4 switched over to the RHEL 8.2 kernel, from which point on it fails
- CoreOS/OCP 4.5 has always used the RHEL 8.2 kernel, and therefore always fails
- The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel, so should work again (however, note the new regression in BZ#1866018)

So I believe the next steps should be:
- Get the regression fix in BZ#1866018 accepted upstream and included into RHEL 8.3 (and then CoreOS 4.6)
- Port both fixes into the RHEL 8.2.z maintenance stream
- Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
Does that look reasonable?

As to release blocker status, I agree that this bug is not tied to a particular *OCS version* as such; any OCS version will fail if the kernel has this bug.
However, I'd consider presence of this bug a release blocker for *OCS on Z* in general. Depending on which version of OCS we're targeting for the initial Z release, this would then become a blocker for that version. I believe at this point, we still have not made the final decision; it could be either some post-GA OCS 4.5.z release or else OCS 4.6.

Yes, that's more or less what I'm planning to do. We should probably clone this bug for 8.2.z and we'll just make sure we pull in both patches for that.

RHEL-8.2.z bug: https://bugzilla.redhat.com/show_bug.cgi?id=1866386

> - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> so should work again (however, note the new regression in BZ#1866018)
>

We are currently using a bespoke 8.3 kernel to work around a selinux patch that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3 during its lifecycle, so please be sure to continue backporting to 8.2 for OCP 4.6 fixes.

> So I believe the next steps should be:
> - Get the regression fix in BZ#1866018 accepted upstream and included into
> RHEL 8.3 (and then CoreOS 4.6)
> - Port both fixes into the RHEL 8.2.z maintenance stream
> - Update the CoreOS 4.5 kernel with the latest RHEL 8.2.z kernel
> Does that look reasonable?
>

That sounds like a reasonable approach to make sure RHCOS consumes this fix correctly, but there could be a brief window between RHCOS switching back to the 8.2 kernel and patches landing in the proper z streams.

(In reply to Jeff Layton from comment #16)
> I'd be fine with that. The patch is pretty safe, so backporting should be no
> big deal. I'm not well versed enough in CoreOS to know what we'd need to do
> for that though.

How to update new kernel in CoreOS for testing/development?
------------------------------------------------------------
Download all the relevant kernel rpm packages, such as kernel, kernel-core, kernel-modules, etc.

// For replacing a test kernel:
# rpm-ostree override replace /path/to/kernel-XYZ*.rpm \
    /path/to/kernel-core*.rpm \
    /path/to/kernel-modules*.rpm

Reboot the node and ensure the latest kernel is running.
(Note: You can still select the old kernel using the grub menu during bootup.)

It is important to note that this is only for "development/testing" purposes. Once the actual update is available in the official kernel, you first need to revert to the original kernel (1) and then follow this to upgrade the OS:
https://github.com/openshift/os/blob/master/FAQ.md#q-how-do-i-upgrade-the-os

Note:
(1)
// To undo some of the overrides you have done in the past
# rpm-ostree override reset
// To discard all the local modifications and go back to the original tree
# rpm-ostree reset

This should help us to proceed with testing.

Ref: https://github.com/openshift/os/blob/master/FAQ.md#q-what-happens-when-i-use-rpm-ostree-override-replace-to-replace-an-rpm

(In reply to Jeff Ligon from comment #25)
> > - The current CoreOS/OCP 4.6 nightlies seem to be using the RHEL 8.3 kernel,
> > so should work again (however, note the new regression in BZ#1866018)
> >
>
> We are currently using a bespoke 8.3 kernel to work around a selinux patch
> that is being backported to RHEL 8.2. OCP 4.6 will not move to RHEL 8.3
> during its lifecycle, so please be sure to continue backporting to 8.2 for
> OCP 4.6 fixes.

Jeff Layton opened a bug to track backporting to RHEL 8.2 (see above). Does this mean that the change will then flow to OCP 4.6 automatically, or do we need to open *another* bug against OCP/CoreOS to track that?

As noted, this issue is only a tracker for BZ#1866018. I guess there is a backport BZ for 8.2 (https://bugzilla.redhat.com/show_bug.cgi?id=1875787), so that fix is already backported to 8.2 and hence can be tested there.
Based on the automation run results of BUILD ID: v4.6.0-131.ci RUN ID: 1603961686 (tier1 over IBM ROKS), in which this test case passed, I am moving to VERIFIED:
tests/manage/pv_services/test_dynamic_pvc_accessmodes_with_reclaim_policies.py::TestDynamicPvc::test_rwx_dynamic_pvc[CephFileSystem-Delete]

With OCP 4.6.3 (with kernel version 4.18.0-193.28.1.el8_2.s390x) the issue has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days