Bug 2065543

Summary: First host report turns host non-operational with FIPS_INCOMPATIBLE_WITH_CLUSTER
Product: [oVirt] ovirt-engine
Component: Backend.Core
Version: 4.5.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: urgent
Target Milestone: ovirt-4.5.1
Target Release: ---
Keywords: Reopened
Flags: pm-rhel: ovirt-4.5?
       pm-rhel: devel_ack+
Reporter: Michal Skrivanek <michal.skrivanek>
Assignee: Liran Rotenberg <lrotenbe>
QA Contact: Qin Yuan <qiyuan>
Docs Contact:
CC: ahadas, asocha, bugs, lrotenbe, mperina
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-04-12 15:18:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michal Skrivanek 2022-03-18 07:16:51 UTC
When a host is added and comes up, the first capabilities report includes "fipsEnabled": true, but the engine sets the host non-operational anyway:

2022-03-18 05:04:41,378Z DEBUG [org.ovirt.engine.core.bll.HandleVdsFipsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] Permission check skipped for internal action HandleVdsFips.
2022-03-18 05:04:41,404Z DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] method: get, params: [414771dc-eaee-4c7a-889d-42c7d3f8c56b], timeElapsed: 19ms
2022-03-18 05:04:41,405Z INFO  [org.ovirt.engine.core.bll.HandleVdsFipsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [fa6fe67] Running command: HandleVdsFipsCommand(VdsId = 414771dc-eaee-4c7a-889d-42c7d3f8c56b, RunSilent = false) internal: true. Entities affected :  ID: 414771dc-eaee-4c7a-889d-42c7d3f8c56b Type: VDS
2022-03-18 05:04:41,443Z DEBUG [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] Permission check skipped for internal action SetNonOperationalVds.
2022-03-18 05:04:41,458Z DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] method: get, params: [414771dc-eaee-4c7a-889d-42c7d3f8c56b], timeElapsed: 12ms
2022-03-18 05:04:41,482Z INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-62) [4243d84] Running command: SetNonOperationalVdsCommand(NonOperationalReason = FIPS_INCOMPATIBLE_WITH_CLUSTER, StorageDomainId = 00000000-0000-0000-0000-000000000000, CustomLogValues = {}, Internal = true, StopGlusterService = false, VdsId = 414771dc-eaee-4c7a-889d-42c7d3f8c56b, RunSilent = false) internal: true. Entities affected :  ID: 414771dc-eaee-4c7a-889d-42c7d3f8c56b Type: VDS

It may be a timing issue; it doesn't happen all the time, but frequently enough.
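
For illustration only: the non-operational reason comes from a FIPS compatibility check between the host and its cluster. The standalone Java sketch below just models that comparison; it is not the actual HandleVdsFipsCommand code, and the class, method, and field names are hypothetical.

public class FipsCheck {
    enum NonOperationalReason { NONE, FIPS_INCOMPATIBLE_WITH_CLUSTER }

    // If the host's FIPS mode differs from the cluster's, the host goes non-operational.
    static NonOperationalReason check(boolean hostFipsEnabled, boolean clusterFipsEnabled) {
        return hostFipsEnabled == clusterFipsEnabled
                ? NonOperationalReason.NONE
                : NonOperationalReason.FIPS_INCOMPATIBLE_WITH_CLUSTER;
    }

    public static void main(String[] args) {
        // The situation reported here: the host says fipsEnabled=true, but the value
        // the engine compares against does not reflect that yet (see comment 4).
        System.out.println(check(true, false)); // prints FIPS_INCOMPATIBLE_WITH_CLUSTER
    }
}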

Comment 1 Michal Skrivanek 2022-03-18 07:17:26 UTC
Breaks OST fairly often; please fix ASAP.

Comment 2 Arik 2022-03-20 09:42:34 UTC
I would have been better to include the full log or point to a failed job, but I suppose we can find it on the OST channel if it's that frequent

I don't remember changes in that area in 4.5, and as far as I know QE didn't notice that.
So let's handle the regression and the gaps in dedicated CPUs first, and then look at this when we stabilize 4.5.

Comment 3 Arik 2022-03-20 10:09:08 UTC
(In reply to Arik from comment #2)
> I would have been better to include the full log or point to a failed job,
> but I suppose we can find it on the OST channel if it's that frequent

It would*
https://rhv-devops-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/ds-ost-baremetal_manual/31001/

Comment 4 Liran Rotenberg 2022-03-20 10:40:53 UTC
Looking at the logs:
VDSM was in recovery and we failed to get the capabilities of the host.
The FIPS DB entry is NOT NULL DEFAULT FALSE. Once we have a fresh cluster and we add the first host, we do a one-time setting of the cluster.

A better approach would be to not set the cluster as long as we don't have the host data. We need to find some indication of this state and pass it along, or maybe change the DB entry to default NULL and handle the null case.
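
A rough Java sketch of the second idea, under the assumption that the host's FIPS flag can be kept nullable; the names here (ClusterFipsInit, maybeInitClusterFips, clusterFipsMode) are hypothetical, not actual ovirt-engine code:

public class ClusterFipsInit {
    private Boolean clusterFipsMode; // null until decided from real host data

    // Called with the host's reported FIPS value, or null if no capabilities report yet.
    void maybeInitClusterFips(Boolean hostFipsEnabled) {
        if (hostFipsEnabled == null) {
            // Host data not available yet (e.g. VDSM still recovering) -> don't decide.
            return;
        }
        if (clusterFipsMode == null) {
            // One-time setting of the cluster, now based on the first real report.
            clusterFipsMode = hostFipsEnabled;
        }
    }

    public static void main(String[] args) {
        ClusterFipsInit cluster = new ClusterFipsInit();
        cluster.maybeInitClusterFips(null); // first refresh fails -> cluster left undecided
        cluster.maybeInitClusterFips(true); // real capabilities arrive -> cluster set once
        System.out.println(cluster.clusterFipsMode); // true
    }
}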

Comment 5 Arik 2022-03-20 12:35:14 UTC
(In reply to Liran Rotenberg from comment #4)
> A better approach would be to not set the cluster as long as we don't have
> the host data. We need to find some indication of this state and pass it
> along, or maybe change the DB entry to default NULL and handle the null case.

The second approach makes sense to me

Comment 6 Martin Perina 2022-03-21 08:17:34 UTC
(In reply to Arik from comment #5)
> (In reply to Liran Rotenberg from comment #4)
> > A better approach would be to not set the cluster as long as we don't have
> > the host data. We need to find some indication of this state and pass it
> > along, or maybe change the DB entry to default NULL and handle the null
> > case.
> 
> The second approach makes sense to me

Wouldn't it be better to handle that in InitVdsOnUpCommand? Only there are we really sure that full communication with the host has been established. I know there is a race where we mark the host as Up even though it's not yet connected to storage, for example, but if that doesn't cause an issue for storage, why would it be harmful for FIPS?

It would be great to have a PreparingForUp host status, but I don't believe we have enough resources to introduce it in 4.5.
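
To illustrate the suggestion, a minimal Java sketch, assuming the FIPS handling can simply be triggered from the point where the host is confirmed Up (which is what InitVdsOnUpCommand represents); everything except the command names quoted in this bug is hypothetical, and this is not the real ovirt-engine flow:

public class InitVdsOnUpSketch {
    // Stands in for invoking HandleVdsFipsCommand; not the real engine interface.
    interface FipsHandler {
        void handleVdsFips(java.util.UUID vdsId);
    }

    private final FipsHandler fipsHandler;

    InitVdsOnUpSketch(FipsHandler fipsHandler) {
        this.fipsHandler = fipsHandler;
    }

    // Runs only once full communication with the host has been established,
    // so the reported "fipsEnabled" value is already known to the engine.
    void onHostUp(java.util.UUID vdsId) {
        fipsHandler.handleVdsFips(vdsId);
        // ... rest of the host initialization ...
    }

    public static void main(String[] args) {
        InitVdsOnUpSketch init = new InitVdsOnUpSketch(
                vdsId -> System.out.println("FIPS handling runs for host " + vdsId));
        init.onHostUp(java.util.UUID.randomUUID());
    }
}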

Comment 7 Liran Rotenberg 2022-04-05 14:05:30 UTC
Since we don't expose the FIPS value of the host (it's internal and not available anywhere in the API), switching the place where we call the FIPS handling is a much easier solution.
Thanks Martin.

Comment 9 Michal Skrivanek 2022-04-08 15:26:36 UTC
There's been a suspicious report of the same problem even with this patch included. Reopening until it's investigated (or until it is not reproduced for a week).

Comment 10 Sandro Bonazzola 2022-06-23 05:57:04 UTC
This bug is included in the oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.