Bug 2246920
Summary: | clevis error message regression when remote tang server is down | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Martin Pitt <mpitt> |
Component: | stratisd | Assignee: | mulhern <amulhern> |
Status: | POST --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 41 | CC: | amulhern, bgurney, jbaublitz, rust-sig |
Target Milestone: | --- | Keywords: | Regression, Reopened |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
URL: | https://cockpit-logs.us-east-1.linodeobjects.com/pull-19541-20231028-061240-f17dc026-fedora-39-updates-testing/log.html#234-2 | ||
Whiteboard: | CockpitTest | ||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2023-11-27 05:27:00 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Martin Pitt
2023-10-30 06:04:00 UTC
When ignoring the weird error message, the more serious part of that regression comes later on: After restarting tangd.socket on the tang server and trying to unlock, stratis does not seem to recognize the tang server: stratis pool list --stopped # Name UUID # Devices Key Description Clevis # pool0 669ec8d2-e2e5-4305-841a-dda5b1e1996e 1 pool0 present busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 669ec8d2-e2e5-4305-841a-dda5b1e1996e uuid true clevis # (b(oaoao))qs false "/" 0 0 1 "The key description \"pool0\" is not currently set." With 3.5.9 the unlocking with clevis/tang works: stratis pool list --stopped # Name UUID # Devices Key Description Clevis # pool0 3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce 1 pool0 present busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce uuid true clevis # (b(oaoao))qs true "/org/storage/stratis3/4" 1 "/org/storage/stratis3/5" 0 0 "Ok" So it feels like with 3.6.0 it somehow "remembers" that tang previously didn't work? Sorry if that is a bit vague, I don't *really* know what I am doing here. The change in the error message is due to the fact that stratisd is using the libcryptsetup API and a callback via libcryptsetup-rs rather than the cryptsetup command-line interface as it did previously. We can look at enriching the error message. The second behavior that you observed in https://bugzilla.redhat.com/show_bug.cgi?id=2246920#c1 we will investigate. I think I found the cause of the second regression and I'm putting up a bug fix PR now so that I can test that it resolves the problem. As for the error message, this is a little bit hard to fix largely because the callback that we're performing the Clevis operation in only returns an integer since it goes through the C cryptsetup API. We do log the full error message from Clevis in the system logs but I'm not sure if that's considered adequate. Thanks John! I'm glad that my vague description sufficed. Of course I'm not asking to reproduce the original error message to 100%. But the part of it that said "Error communicating with server 10.111.112.5" at least gave a hint that it is due to the tang server malfunctioning (or not being present), while "IO error occurred: Operation not permitted (os error 1)" is rather vague. We can of course adjust cockpit's tests to be less picky, but admins are going to see this. Cheers! I agree that this is a less helpful message and I'm little bit torn mainly because we have kind of two conflicting interests here. On one side, we are trying to use Clevis more programmatically as much as possible since it doesn't have an API and we've gotten feedback that bubbling up some of the raw error messages can be confusing in some places where it spits out usage information for example. Then on the other side of things, the cause can sometimes be included in the message which is ultimately helpful. I'm not entirely what to do about this because it seems to be a case by case basis. In the case of standalone Stratis, we definitely do log the error message so users checking the system logs would be aware that the server is down. However, I can see how this is less than ideal in the case of Cockpit which relies on errors to display to the user. There's also the technical consideration that doing it the way we do it in 3.6.0 is better from a maintenance and code cleanliness standpoint. I'm not sure exactly how to proceed, but I'll discuss it with the team and see if we can come up with something better than what we have now. Right, I fully sympathize with moving from calling a CLI to a proper API. This very probably needs some change in libcryptsetup to report a more useful error message -- if the only thing that you get is "1" (aka EPERM), this is a rather thin and sad error.. The PR fixing the unlock regression and adding a regression test is up upstream: https://github.com/stratis-storage/stratisd/pull/3486. Stratisd #3486 was merged yesterday, and is incorporated in stratisd-3.6.1. This has been released on Rawhide, as stratisd-3.6.1-1.fc40. Thanks! I tested with stratisd 3.6.1, and confirm that "the more serious part" from comment #2 is fixed. Adjusting the title for the error message, and lowering priority. For those reading along, TestStorageStratisNBDE.testBasic still fails now, but later on due to https://github.com/stratis-storage/stratisd/issues/3348 Hang on, 3.6.1 fixes the bug from comment #1, but not the bad error message. This message is a reminder that Fedora Linux 39 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '39'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 39 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed. I re-tested this on current F41 with stratisd-3.7.3-1.fc41.x86_64 and tang-15-6.fc41.x86_64. The error message when the server isn't reachable is still the same unintelligible one as in the description:
> Cryptsetup error: IO error occurred: Operation not permitted (os error 1)
We expect the error message to be more helpful in stratisd 3.8: https://github.com/stratis-storage/stratisd/pull/3741 . |