Bug 2246920

Summary: clevis error message regression when remote tang server is down
Product: [Fedora] Fedora Reporter: Martin Pitt <mpitt>
Component: stratisdAssignee: mulhern <amulhern>
Status: POST --- QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: low    
Version: 41CC: amulhern, bgurney, jbaublitz, rust-sig
Target Milestone: ---Keywords: Regression, Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
URL: https://cockpit-logs.us-east-1.linodeobjects.com/pull-19541-20231028-061240-f17dc026-fedora-39-updates-testing/log.html#234-2
Whiteboard: CockpitTest
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-27 05:27:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin Pitt 2023-10-30 06:04:00 UTC
Cockpit's tests recently started to run nightly checks on Fedora 39 updates-testing (until Friday it was still F38). This found some stratis regressions from the recent 3.5.9 → 3.6.0 update [1].

I have half of a CLI reproducer, I'm not very familiar with stratis. If this isn't enough, I can spend some more time to write a fully standalone one.

Reproducible: Always

Steps to Reproduce:
1. Set up a stratis pool with clevis unlocking with a remote tang server
2. Stop the tang server (`systemctl stop tangd.socket`)
3. Get pool UUID from `stratis pool  list --stopped`
4. Try to start it:
   busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 283752180886442e9d922c3e3c453d04 uuid true clevis

Of course, substitute the actual UUID
Actual Results:  
With 3.6.0 this fails in a weird way:

(b(oaoao))qs false "/" 0 0 1 "Cryptsetup error: IO error occurred: Operation not permitted (os error 1)"

Expected Results:  
With 3.5.9 it failed in a more useful way:

b(oaoao))qs false "/" 0 0 1 "Command failed: cmd: \"/usr/bin/clevis\" \"luks\" \"unlock\" \"-d\" \"/dev/sda\" \"-n\" \"stratis-1-private-38f728d7e68f4cefa9a6db78eae8f9ee-crypt\", exit reason: 1 stdout:  stderr: Error communicating with server 10.111.112.5\n/dev/sda could not be opened.\n"

[1] https://bodhi.fedoraproject.org/updates/FEDORA-2023-c464ef1f0a

These versions didn't change between F39 release and updates-testing:
clevis-19-4.fc39.x86_64
tang-14-2.fc39.x86_64

stratis changed from stratisd-3.5.9-1.fc39.x86_64 to stratisd-3.6.0-1.fc39.x86_64

Comment 1 Martin Pitt 2023-10-30 06:12:05 UTC
When ignoring the weird error message, the more serious part of that regression comes later on: After restarting tangd.socket on the tang server and trying to unlock, stratis does not seem to recognize the tang server:

stratis pool list --stopped
# Name    UUID                                   # Devices   Key Description   Clevis 
# pool0   669ec8d2-e2e5-4305-841a-dda5b1e1996e           1   pool0             present

busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 669ec8d2-e2e5-4305-841a-dda5b1e1996e uuid true clevis
# (b(oaoao))qs false "/" 0 0 1 "The key description \"pool0\" is not currently set."

With 3.5.9 the unlocking with clevis/tang works:

stratis pool list --stopped
# Name    UUID                                   # Devices   Key Description   Clevis 
# pool0   3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce           1   pool0             present

busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce uuid true clevis
# (b(oaoao))qs true "/org/storage/stratis3/4" 1 "/org/storage/stratis3/5" 0 0 "Ok"

So it feels like with 3.6.0 it somehow "remembers" that tang previously didn't work?

Sorry if that is a bit vague, I don't *really* know what I am doing here.

Comment 2 mulhern 2023-10-30 13:49:23 UTC
The change in the error message is due to the fact that stratisd is using the libcryptsetup API and a callback via libcryptsetup-rs rather than the cryptsetup command-line interface as it did previously. We can look at enriching the error message.

The second behavior that you observed in https://bugzilla.redhat.com/show_bug.cgi?id=2246920#c1 we will investigate.

Comment 3 John Baublitz 2023-10-30 14:29:33 UTC
I think I found the cause of the second regression and I'm putting up a bug fix PR now so that I can test that it resolves the problem.

As for the error message, this is a little bit hard to fix largely because the callback that we're performing the Clevis operation in only returns an integer since it goes through the C cryptsetup API. We do log the full error message from Clevis in the system logs but I'm not sure if that's considered adequate.

Comment 4 Martin Pitt 2023-10-30 14:34:17 UTC
Thanks John! I'm glad that my vague description sufficed.

Of course I'm not asking to reproduce the original error message to 100%. But the part of it that said "Error communicating with server 10.111.112.5" at least gave a hint that it is due to the tang server malfunctioning (or not being present), while "IO error occurred: Operation not permitted (os error 1)" is rather vague. We can of course adjust cockpit's tests to be less picky, but admins are going to see this. Cheers!

Comment 5 John Baublitz 2023-10-30 15:14:15 UTC
I agree that this is a less helpful message and I'm little bit torn mainly because we have kind of two conflicting interests here. On one side, we are trying to use Clevis more programmatically as much as possible since it doesn't have an API and we've gotten feedback that bubbling up some of the raw error messages can be confusing in some places where it spits out usage information for example. Then on the other side of things, the cause can sometimes be included in the message which is ultimately helpful. I'm not entirely what to do about this because it seems to be a case by case basis. In the case of standalone Stratis, we definitely do log the error message so users checking the system logs would be aware that the server is down. However, I can see how this is less than ideal in the case of Cockpit which relies on errors to display to the user. There's also the technical consideration that doing it the way we do it in 3.6.0 is better from a maintenance and code cleanliness standpoint. I'm not sure exactly how to proceed, but I'll discuss it with the team and see if we can come up with something better than what we have now.

Comment 6 Martin Pitt 2023-10-30 15:30:11 UTC
Right, I fully sympathize with moving from calling a CLI to a proper API. This very probably needs some change in libcryptsetup to report a more useful error message -- if the only thing that you get is "1" (aka EPERM), this is a rather thin and sad error..

Comment 7 John Baublitz 2023-10-31 14:55:31 UTC
The PR fixing the unlock regression and adding a regression test is up upstream: https://github.com/stratis-storage/stratisd/pull/3486.

Comment 8 Bryan Gurney 2023-11-01 13:27:23 UTC
Stratisd #3486 was merged yesterday, and is incorporated in stratisd-3.6.1.

This has been released on Rawhide, as stratisd-3.6.1-1.fc40.

Comment 9 Martin Pitt 2023-11-03 05:22:33 UTC
Thanks! I tested with stratisd 3.6.1, and confirm that "the more serious part" from comment #2 is fixed. Adjusting the title for the error message, and lowering priority.

For those reading along, TestStorageStratisNBDE.testBasic still fails now, but later on due to https://github.com/stratis-storage/stratisd/issues/3348

Comment 10 Martin Pitt 2023-11-27 05:27:58 UTC
Hang on, 3.6.1 fixes the bug from comment #1, but not the bad error message.

Comment 11 Aoife Moloney 2024-11-13 09:57:49 UTC
This message is a reminder that Fedora Linux 39 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 39 on 2024-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '39'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 39 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 12 Martin Pitt 2024-11-13 12:39:44 UTC
I re-tested this on current F41 with stratisd-3.7.3-1.fc41.x86_64 and tang-15-6.fc41.x86_64. The error message when the server isn't reachable is still the same unintelligible one as in the description:

> Cryptsetup error: IO error occurred: Operation not permitted (os error 1)

Comment 13 mulhern 2025-01-21 15:05:46 UTC
We expect the error message to be more helpful in stratisd 3.8: https://github.com/stratis-storage/stratisd/pull/3741 .