Bug 2065543 - First host report turns host non-operational with FIPS_INCOMPATIBLE_WITH_CLUSTER
Summary: First host report turns host non-operational with FIPS_INCOMPATIBLE_WITH_CLUSTER
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.5.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Liran Rotenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-18 07:16 UTC by Michal Skrivanek
Modified: 2022-06-23 05:57 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-12 15:18:34 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: devel_ack+




Links
Github oVirt ovirt-engine pull 235: Merged, "Fix racing cluster FIPS settings", last updated 2022-04-06 15:23:00 UTC
Github oVirt ovirt-engine pull 255: Merged, "fix FIPS handling", last updated 2022-04-12 15:18:34 UTC
Red Hat Issue Tracker RHV-45355: last updated 2022-03-18 07:23:48 UTC

Description Michal Skrivanek 2022-03-18 07:16:51 UTC
When a host is added and comes up, the first capabilities report includes "fipsEnabled": true, but the engine sets the host non-operational anyway:

2022-03-18 05:04:41,378Z DEBUG [org.ovirt.engine.core.bll.HandleVdsFipsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] Permission check skipped for internal action HandleVdsFips.
2022-03-18 05:04:41,404Z DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] method: get, params: [414771dc-eaee-4c7a-889d-42c7d3f8c56b], timeElapsed: 19ms
2022-03-18 05:04:41,405Z INFO  [org.ovirt.engine.core.bll.HandleVdsFipsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] Running command: HandleVdsFipsCommand(VdsId = 414771dc-eaee-4c7a-889d-42c7d3f8c56b, RunSilent = false) internal: true. Entities affected :  ID: 414771dc-eaee-4c7a-889d-42c7d3f8c56b Type: VDS
2022-03-18 05:04:41,443Z DEBUG [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] Permission check skipped for internal action SetNonOperationalVds.
2022-03-18 05:04:41,458Z DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] method: get, params: [414771dc-eaee-4c7a-889d-42c7d3f8c56b], timeElapsed: 12ms
2022-03-18 05:04:41,482Z INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] Running command: SetNonOperationalVdsCommand(NonOperationalReason = FIPS_INCOMPATIBLE_WITH_CLUSTER, StorageDomainId = 00000000-0000-0000-0000-000000000000, CustomLogValues = {}, Internal = true, StopGlusterService = false, VdsId = 414771dc-eaee-4c7a-889d-42c7d3f8c56b, RunSilent = false) internal: true. Entities affected :  ID: 414771dc-eaee-4c7a-889d-42c7d3f8c56b Type: VDS

This may be a timing issue; it doesn't happen all the time, but frequently enough.
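
For illustration only, here is a minimal, hypothetical Java sketch of the kind of comparison that goes wrong. This is not the actual HandleVdsFipsCommand code; the class, enum, and method names are all invented. It just models the race described above: the cluster FIPS flag effectively defaults to "disabled" before any host has reported, so a host that reports fipsEnabled=true is judged incompatible.

// Hypothetical illustration, not ovirt-engine code.
public class FipsRaceSketch {

    enum ClusterFipsMode { UNDEFINED, DISABLED, ENABLED }

    // A host is compatible if its FIPS state matches the cluster's,
    // or if the cluster's state was never derived from a real host report.
    static boolean isHostCompatible(boolean hostFipsEnabled, ClusterFipsMode clusterMode) {
        if (clusterMode == ClusterFipsMode.UNDEFINED) {
            return true;
        }
        return hostFipsEnabled == (clusterMode == ClusterFipsMode.ENABLED);
    }

    public static void main(String[] args) {
        // The problematic case: the cluster flag silently defaulted to DISABLED,
        // so the first host reporting FIPS enabled is flagged as incompatible.
        System.out.println(isHostCompatible(true, ClusterFipsMode.DISABLED));  // false -> non-operational
        // With an explicit "undefined" state the first host would be accepted.
        System.out.println(isHostCompatible(true, ClusterFipsMode.UNDEFINED)); // true
    }
}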

Comment 1 Michal Skrivanek 2022-03-18 07:17:26 UTC
breaks OST fairly often, please fix ASAP

Comment 2 Arik 2022-03-20 09:42:34 UTC
It would have been better to include the full log or point to a failed job, but I suppose we can find it on the OST channel if it's that frequent.

I don't remember any changes in that area in 4.5, and as far as I know QE didn't notice this.
So let's handle the regression and the gaps in dedicated CPUs first, and then look into this once we stabilize 4.5.

Comment 3 Arik 2022-03-20 10:09:08 UTC
(In reply to Arik from comment #2)
> It would have been better to include the full log or point to a failed job,
> but I suppose we can find it on the OST channel if it's that frequent.

https://rhv-devops-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ds-ost-baremetal_manual/31001/

Comment 4 Liran Rotenberg 2022-03-20 10:40:53 UTC
Looking at the logs:
VDSM was in recovery and we failed to get the capabilities of the host.
The FIPS DB entry is NOT NULL DEFAULT FALSE. Once we have a fresh cluster and add the first host, we apply a one-time setting to the cluster.

A better approach would be to not set the cluster as long as we don't have the host data. We need to find some indication for this state and pass it along, or maybe change the DB entry to default NULL and handle the null case.
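
A minimal sketch of the second idea (a nullable cluster value), with invented names and types and no relation to the real ovirt-engine schema or code: treat a null cluster FIPS value as "not yet known" and adopt the first real host report instead of relying on a hard-coded FALSE default.

// Hypothetical sketch of the "default NULL" idea, not ovirt-engine code.
public class NullableClusterFipsSketch {

    // Boxed Boolean so null can represent "no host has reported yet".
    private Boolean clusterFipsMode;

    // Called once the host's capabilities (including fipsEnabled) are actually available.
    public boolean handleHostFipsReport(boolean hostFipsEnabled) {
        if (clusterFipsMode == null) {
            // First usable report: adopt the host's value instead of a default.
            clusterFipsMode = hostFipsEnabled;
            return true;
        }
        return clusterFipsMode == hostFipsEnabled;
    }

    public static void main(String[] args) {
        NullableClusterFipsSketch cluster = new NullableClusterFipsSketch();
        System.out.println(cluster.handleHostFipsReport(true));  // true: cluster adopts FIPS enabled
        System.out.println(cluster.handleHostFipsReport(false)); // false: a second host that mismatches
    }
}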

Comment 5 Arik 2022-03-20 12:35:14 UTC
(In reply to Liran Rotenberg from comment #4)
> A better approach would be to not set the cluster as long as we don't have
> the host data. We need to find some indication for this state and pass it
> along, or maybe change the DB entry to default NULL and handle the null case.

The second approach makes sense to me

Comment 6 Martin Perina 2022-03-21 08:17:34 UTC
(In reply to Arik from comment #5)
> (In reply to Liran Rotenberg from comment #4)
> > A better approach would be to not set the cluster as long as we don't have
> > the host data. We need to find some indication for this state and pass it
> > along, or maybe change the DB entry to default NULL and handle the null case.
> 
> The second approach makes sense to me

Wouldn't it be better to handle that in InitVdsOnUpCommand? Only there are we really sure that full communication with the host has been established. I know there is a race where we mark the host as Up even though it's not yet connected to storage, for example, but if that doesn't cause an issue for storage, why would it be harmful for FIPS?

It would be great to have a PreparingForUp host status, but I don't believe we have enough resources to introduce it in 4.5.

Comment 7 Liran Rotenberg 2022-04-05 14:05:30 UTC
Since we don't share the FIPS value of the host (it's internal and not exposed anywhere in the API), switching the place where we call the FIPS handling is a much easier solution.
Thanks Martin.
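
A rough sketch of the ordering change being discussed, purely for illustration; the method and field names below are invented, and the real change is in the linked pull requests. The point is simply that the cluster FIPS handling runs only once the host is actually moving to Up, after full communication is established and its capabilities are known.

// Hypothetical sketch of deferring FIPS handling to host init-on-up, not ovirt-engine code.
public class InitOnUpOrderingSketch {

    private boolean capabilitiesAvailable;
    private boolean hostFipsEnabled;

    // Capabilities report arrives from the host (possibly after a VDSM recovery).
    void onCapabilitiesReport(boolean fipsEnabled) {
        hostFipsEnabled = fipsEnabled;
        capabilitiesAvailable = true;
    }

    // Runs only when the host is moving to Up, i.e. full communication is established.
    void initHostOnUp() {
        if (capabilitiesAvailable) {
            reconcileClusterFips(hostFipsEnabled);
        }
        // ...storage connection and other init steps would follow here...
    }

    private void reconcileClusterFips(boolean fipsEnabled) {
        System.out.println("Cluster FIPS settings reconciled from a real host report: " + fipsEnabled);
    }

    public static void main(String[] args) {
        InitOnUpOrderingSketch host = new InitOnUpOrderingSketch();
        host.onCapabilitiesReport(true); // capabilities arrive first
        host.initHostOnUp();             // FIPS handling is deferred until the host is really up
    }
}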

Comment 9 Michal Skrivanek 2022-04-08 15:26:36 UTC
There has been a suspicious report of the same problem even with this patch included. Reopening until it's investigated (or not reproduced for a week).

Comment 10 Sandro Bonazzola 2022-06-23 05:57:04 UTC
This bug is included in the oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

