Bug 2246920 - clevis error message regression when remote tang server is down
Summary: clevis error message regression when remote tang server is down
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: stratisd
Version: 39
Hardware: Unspecified
OS: Linux
low
low
Target Milestone: ---
Assignee: mulhern
QA Contact: Fedora Extras Quality Assurance
URL: https://cockpit-logs.us-east-1.linode...
Whiteboard: CockpitTest
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-10-30 06:04 UTC by Martin Pitt
Modified: 2023-11-27 05:27 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-27 05:27:00 UTC
Type: ---
Embargoed:


Attachments (Terms of Use)

Description Martin Pitt 2023-10-30 06:04:00 UTC
Cockpit's tests recently started to run nightly checks on Fedora 39 updates-testing (until Friday it was still F38). This found some stratis regressions from the recent 3.5.9 → 3.6.0 update [1].

I have half of a CLI reproducer, I'm not very familiar with stratis. If this isn't enough, I can spend some more time to write a fully standalone one.

Reproducible: Always

Steps to Reproduce:
1. Set up a stratis pool with clevis unlocking with a remote tang server
2. Stop the tang server (`systemctl stop tangd.socket`)
3. Get pool UUID from `stratis pool  list --stopped`
4. Try to start it:
   busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 283752180886442e9d922c3e3c453d04 uuid true clevis

Of course, substitute the actual UUID
Actual Results:  
With 3.6.0 this fails in a weird way:

(b(oaoao))qs false "/" 0 0 1 "Cryptsetup error: IO error occurred: Operation not permitted (os error 1)"

Expected Results:  
With 3.5.9 it failed in a more useful way:

b(oaoao))qs false "/" 0 0 1 "Command failed: cmd: \"/usr/bin/clevis\" \"luks\" \"unlock\" \"-d\" \"/dev/sda\" \"-n\" \"stratis-1-private-38f728d7e68f4cefa9a6db78eae8f9ee-crypt\", exit reason: 1 stdout:  stderr: Error communicating with server 10.111.112.5\n/dev/sda could not be opened.\n"

[1] https://bodhi.fedoraproject.org/updates/FEDORA-2023-c464ef1f0a

These versions didn't change between F39 release and updates-testing:
clevis-19-4.fc39.x86_64
tang-14-2.fc39.x86_64

stratis changed from stratisd-3.5.9-1.fc39.x86_64 to stratisd-3.6.0-1.fc39.x86_64

Comment 1 Martin Pitt 2023-10-30 06:12:05 UTC
When ignoring the weird error message, the more serious part of that regression comes later on: After restarting tangd.socket on the tang server and trying to unlock, stratis does not seem to recognize the tang server:

stratis pool list --stopped
# Name    UUID                                   # Devices   Key Description   Clevis 
# pool0   669ec8d2-e2e5-4305-841a-dda5b1e1996e           1   pool0             present

busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 669ec8d2-e2e5-4305-841a-dda5b1e1996e uuid true clevis
# (b(oaoao))qs false "/" 0 0 1 "The key description \"pool0\" is not currently set."

With 3.5.9 the unlocking with clevis/tang works:

stratis pool list --stopped
# Name    UUID                                   # Devices   Key Description   Clevis 
# pool0   3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce           1   pool0             present

busctl call org.storage.stratis3 /org/storage/stratis3 org.storage.stratis3.Manager.r5 StartPool 'ss(bs)' 3a08c37c-2b5c-4ffa-923e-68a4fd0e4bce uuid true clevis
# (b(oaoao))qs true "/org/storage/stratis3/4" 1 "/org/storage/stratis3/5" 0 0 "Ok"

So it feels like with 3.6.0 it somehow "remembers" that tang previously didn't work?

Sorry if that is a bit vague, I don't *really* know what I am doing here.

Comment 2 mulhern 2023-10-30 13:49:23 UTC
The change in the error message is due to the fact that stratisd is using the libcryptsetup API and a callback via libcryptsetup-rs rather than the cryptsetup command-line interface as it did previously. We can look at enriching the error message.

The second behavior that you observed in https://bugzilla.redhat.com/show_bug.cgi?id=2246920#c1 we will investigate.

Comment 3 John Baublitz 2023-10-30 14:29:33 UTC
I think I found the cause of the second regression and I'm putting up a bug fix PR now so that I can test that it resolves the problem.

As for the error message, this is a little bit hard to fix largely because the callback that we're performing the Clevis operation in only returns an integer since it goes through the C cryptsetup API. We do log the full error message from Clevis in the system logs but I'm not sure if that's considered adequate.

Comment 4 Martin Pitt 2023-10-30 14:34:17 UTC
Thanks John! I'm glad that my vague description sufficed.

Of course I'm not asking to reproduce the original error message to 100%. But the part of it that said "Error communicating with server 10.111.112.5" at least gave a hint that it is due to the tang server malfunctioning (or not being present), while "IO error occurred: Operation not permitted (os error 1)" is rather vague. We can of course adjust cockpit's tests to be less picky, but admins are going to see this. Cheers!

Comment 5 John Baublitz 2023-10-30 15:14:15 UTC
I agree that this is a less helpful message and I'm little bit torn mainly because we have kind of two conflicting interests here. On one side, we are trying to use Clevis more programmatically as much as possible since it doesn't have an API and we've gotten feedback that bubbling up some of the raw error messages can be confusing in some places where it spits out usage information for example. Then on the other side of things, the cause can sometimes be included in the message which is ultimately helpful. I'm not entirely what to do about this because it seems to be a case by case basis. In the case of standalone Stratis, we definitely do log the error message so users checking the system logs would be aware that the server is down. However, I can see how this is less than ideal in the case of Cockpit which relies on errors to display to the user. There's also the technical consideration that doing it the way we do it in 3.6.0 is better from a maintenance and code cleanliness standpoint. I'm not sure exactly how to proceed, but I'll discuss it with the team and see if we can come up with something better than what we have now.

Comment 6 Martin Pitt 2023-10-30 15:30:11 UTC
Right, I fully sympathize with moving from calling a CLI to a proper API. This very probably needs some change in libcryptsetup to report a more useful error message -- if the only thing that you get is "1" (aka EPERM), this is a rather thin and sad error..

Comment 7 John Baublitz 2023-10-31 14:55:31 UTC
The PR fixing the unlock regression and adding a regression test is up upstream: https://github.com/stratis-storage/stratisd/pull/3486.

Comment 8 Bryan Gurney 2023-11-01 13:27:23 UTC
Stratisd #3486 was merged yesterday, and is incorporated in stratisd-3.6.1.

This has been released on Rawhide, as stratisd-3.6.1-1.fc40.

Comment 9 Martin Pitt 2023-11-03 05:22:33 UTC
Thanks! I tested with stratisd 3.6.1, and confirm that "the more serious part" from comment #2 is fixed. Adjusting the title for the error message, and lowering priority.

For those reading along, TestStorageStratisNBDE.testBasic still fails now, but later on due to https://github.com/stratis-storage/stratisd/issues/3348

Comment 10 Martin Pitt 2023-11-27 05:27:58 UTC
Hang on, 3.6.1 fixes the bug from comment #1, but not the bad error message.


Note You need to log in before you can comment on or make changes to this bug.