Bug 1758024

Summary: Long running Ansible tasks timeout and abort for RHV-H hosts with STIG/Security Profiles applied
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: John Call <jcall>
Assignee: Dana <delfassy>
QA Contact: Qin Yuan <qiyuan>
CC: mavital, mperina, mtessun
Keywords: ZStream
Target Milestone: ovirt-4.4.2
Target Release: 4.4.2
Fixed In Version: rhv-4.4.2-2 ovirt-engine-4.4.2.1
Doc Type: No Doc Update
oVirt Team: Infra
Type: Bug
Last Closed: 2020-09-23 16:11:04 UTC

Description John Call 2019-10-03 03:49:02 UTC
Description of problem:
Update/upgrade tasks time out, abort, and fail when an RHV-H host has been deployed using the "VPP - Protection Profile for Virtualization v. 1.0 for Red Hat Enterprise Linux Hypervisor (RHELH)" Security Policy during the Anaconda installation.

This could also be caused by simply adjusting the "ClientAlive..." options in /etc/ssh/sshd_config like this:
[root@rhhi1 ~]# grep -B1 '^ClientAlive' /etc/ssh/sshd_config
# Per CCE-CCE-27082-7: Set ClientAliveCountMax 0 in /etc/ssh/sshd_config
ClientAliveCountMax 0
--
# Per CCE-CCE-27433-2: Set ClientAliveInterval 300 in /etc/ssh/sshd_config
ClientAliveInterval 300
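
(With ClientAliveCountMax 0, the OpenSSH shipped with RHEL 7 drops any connection that produces no client traffic within ClientAliveInterval seconds, i.e. after 5 minutes here, which matches the observed timeout. As a sanity check, the effective values can be read back with sshd's extended test mode; expected output shown, assuming the config above:

[root@rhhi1 ~]# sshd -T | grep -i clientalive
clientaliveinterval 300
clientalivecountmax 0
)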

When the task is aborted, yum, rpm, and imgbased are left in an inconsistent ("angry") state, and the update/upgrade can't easily be retried.



How reproducible:
Install/deploy a host with that Security Profile, or with those sshd_config options.  Upgrade the host from the RHVM GUI.  Wait 5 minutes.  The task fails.
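
Any SSH session that stays silent past the 300-second window should be dropped the same way; a hypothetical ad-hoc check against an affected host (hostname and sleep length are illustrative):

# A task that is silent for >300s should come back UNREACHABLE,
# just like the upgrade task does
ansible all -i 'rhv1.cluster.net,' -m shell -a 'sleep 360 && echo still-alive'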



Additional info:
It seems like this could easily be avoided by passing a "ServerAliveInterval=60" option to the SSH connection somewhere in the ansible playbook / variables. In ansible.cfg that would be ssh_args under [ssh_connection] (ansible_ssh_common_args is an inventory/play variable, not an ansible.cfg key).  Maybe like this...

cat << EOF >> /usr/share/ovirt-engine/playbooks/ansible.cfg

[ssh_connection]
# Maintain a heartbeat for long-running tasks like ovirt-host-upgrade
# (ssh_args replaces Ansible's defaults, so the defaults are repeated here)
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=60
EOF
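
Alternatively (a sketch, not necessarily the shipped fix), the same option could be attached to the affected hosts through the ansible_ssh_common_args connection variable in inventory; the group name below is made up:

# Hypothetical inventory snippet applying the keepalive per host group
[rhvh_hosts]
rhv1.cluster.net

[rhvh_hosts:vars]
ansible_ssh_common_args=-o ServerAliveInterval=60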

Comment 1 John Call 2019-10-03 04:17:54 UTC
Here are the RHVM (ovirt-engine) logs that show the disconnect after 5 minutes waiting for yum to complete (which takes ~8 minutes on these hosts), followed by the `yum history` from the affected host.

[root@rhvm host-deploy]# cat ovirt-host-mgmt-ansible-20190930160816-rhv1.cluster.net-bddbd03f-3a7f-4e6c-b79d-d4651f411d0f.log [[[[ output from successful tasks omitted]]]]
2019-09-30 16:08:16,580 p=29853 u=ovirt |  Using /usr/share/ovirt-engine/playbooks/ansible.cfg as config file
2019-09-30 16:08:17,459 passlib.registry registered 'md5_crypt' handler: <class 'passlib.handlers.md5_crypt.md5_crypt'>
2019-09-30 16:08:17,466 p=29853 u=ovirt |  PLAY [all] *********************************************************************
2019-09-30 16:08:17,476 p=29853 u=ovirt |  TASK [Gathering Facts] *********************************************************
2019-09-30 16:08:22,727 p=29853 u=ovirt |  ok: [rhv1.cluster.net]
2019-09-30 16:08:22,763 p=29853 u=ovirt |  TASK [Check host certificate] **************************************************
2019-09-30 16:08:25,915 p=29853 u=ovirt |  changed: [rhv1.cluster.net] => (item={u'ca': u'/etc/pki/vdsm/certs/cacert.pem', u'cert': u'/etc/pki/vdsm/certs/vdsmcert.pem'}) => {
2019-09-30 16:08:28,919 p=29853 u=ovirt |  changed: [rhv1.cluster.net] => (item={u'ca': u'/etc/pki/libvirt/../CA/cacert.pem', u'cert': u'/etc/pki/libvirt/clientcert.pem'}) => {
2019-09-30 16:08:31,950 p=29853 u=ovirt |  changed: [rhv1.cluster.net] => (item={u'ca': u'/etc/pki/vdsm/libvirt-spice/ca-cert.pem', u'cert': u'/etc/pki/vdsm/libvirt-spice/server-cert.pem'}) => {
2019-09-30 16:08:31,990 p=29853 u=ovirt |  TASK [Set enrollment variable] *************************************************
2019-09-30 16:08:32,009 p=29853 u=ovirt |  ok: [rhv1.cluster.net] => {
2019-09-30 16:08:32,045 p=29853 u=ovirt |  TASK [ovirt-host-enroll-certificate : include_tasks] ***************************
2019-09-30 16:08:32,058 p=29853 u=ovirt |  skipping: [rhv1.cluster.net] => (item=loadconfig)  => {
2019-09-30 16:08:32,062 p=29853 u=ovirt |  skipping: [rhv1.cluster.net] => (item=vdsm)  => {
2019-09-30 16:08:32,065 p=29853 u=ovirt |  skipping: [rhv1.cluster.net] => (item=vmconsole)  => {
2019-09-30 16:08:32,100 p=29853 u=ovirt |  TASK [ovirt-host-upgrade : Install ovirt-host package if it isn't installed] ***
2019-09-30 16:08:45,971 p=29853 u=ovirt |  ok: [rhv1.cluster.net] => {
    "changed": false, 
    "changes": {
        "installed": [], 
        "updated": []
    }, 
    "obsoletes": {
        "redhat-virtualization-host-image-update-placeholder": {
            "dist": "noarch", 
            "repo": "installed", 
            "version": "4.3.5-3.el7ev"
        }
    }, 
    "rc": 0, 
    "results": [
        "All packages providing ovirt-host are up to date", 
        ""
    ]
}
2019-09-30 16:08:46,004 p=29853 u=ovirt |  TASK [ovirt-host-upgrade : Update system] **************************************
2019-09-30 16:13:48,011 p=29853 u=ovirt |  fatal: [rhv1.cluster.net]: UNREACHABLE! => {
    "changed": false, 
    "unreachable": true
}

MSG:

Failed to connect to the host via ssh: FIPS mode initialized
Shared connection to rhv1.cluster.net closed.

2019-09-30 16:13:48,012 p=29853 u=ovirt |  PLAY RECAP *********************************************************************
2019-09-30 16:13:48,012 p=29853 u=ovirt |  rhv1.cluster.net           : ok=4    changed=1    unreachable=1    failed=0    skipped=1    rescued=0    ignored=0   
[root@rhvm host-deploy]# 





And here is the yum history output from the host where the update failed/aborted:
[root@rhv1 ~]# yum history
Loaded plugins: enabled_repos_upload, imgbased-persist, package_upload, product-id, search-disabled-repos, subscription-manager,
              : vdsmupgrade, versionlock
ID     | Login user               | Date and time    | Action(s)      | Altered
-------------------------------------------------------------------------------
     4 | root <root>              | 2019-09-30 16:09 | Obsoleting     |    2 **
     3 | root <root>              | 2019-09-30 15:02 | Erase          |    1   
     2 | root <root>              | 2019-09-30 13:47 | Install        |    1   
     1 | System <unset>           | 2019-09-30 11:59 | Install        |   11   
history list
Uploading Enabled Repositories Report
Loaded plugins: product-id, subscription-manager, versionlock
Unable to upload Enabled Repositories Report

[root@rhv1 ~]# yum history info 4
Loaded plugins: enabled_repos_upload, imgbased-persist, package_upload, product-id, search-disabled-repos, subscription-manager,
              : vdsmupgrade, versionlock
Transaction ID : 4
Begin time     : Mon Sep 30 16:09:40 2019
Begin rpmdb    : 899:25e583f565e071183b4b6d2b29f8489f6e5463d1
User           : root <root>
Return-Code    : ** Aborted **                           ######################## ABORTED #########################
Command Line   : -d 2 -y update
Transaction performed with:
    Installed     rpm-4.11.3-40.el7.x86_64                    installed
    Installed     subscription-manager-1.24.13-1.el7.x86_64   installed
    Installed     yum-3.4.3-163.el7.noarch                    installed
    Installed     yum-plugin-versionlock-1.1.31-52.el7.noarch installed
Packages Altered:
 ** Obsoleting redhat-virtualization-host-image-update-4.3.5-20190920.0.el7_7.noarch    @rhel-7-server-rhvh-4-rpms
Excluding 5 updates due to versionlock (use "yum versionlock status" to show them)
 ** Obsoleted  redhat-virtualization-host-image-update-placeholder-4.3.5-3.el7ev.noarch ?
Scriptlet output:
   1 warning: %post(redhat-virtualization-host-image-update-4.3.5-20190920.0.el7_7.noarch) scriptlet failed, signal 1
history info
Uploading Enabled Repositories Report
Loaded plugins: product-id, subscription-manager, versionlock
Unable to upload Enabled Repositories Report

[root@rhv1 ~]#

Comment 5 Qin Yuan 2020-08-21 15:47:24 UTC
Verified with:
ovirt-engine-4.4.2.2-0.3.el8ev.noarch

Steps:
1. Test with VPP security profile selected
(Use RHVH 4.3 in this test, because RHVH 4.4 can't be added to engine when VPP is selected, see bug 1833254)
1) install redhat-virtualization-host-4.3.10-20200513.0.el7_8 host, select VPP security profile during installation
2) create 4.3 data center, cluster on engine, and add rhvh 4.3 host to engine
3) create a repo providing redhat-virtualization-host-4.3.10-20200615.0.el7_8 update
4) upgrade rhvh 4.3 host on engine

2. Test with /etc/ssh/sshd_config configured
1) install redhat-virtualization-host-4.4.1-20200722.0.el8_2 host
2) add rhvh 4.4 host to engine
3) modify /etc/ssh/sshd_config and restart sshd.service on host (see the sketch after this list):
ClientAliveInterval 300
ClientAliveCountMax 0
4) create a repo providing redhat-virtualization-host-4.4.2-20200812.3.el8_2 update
5) upgrade rhvh 4.4 on engine
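
For step 3), a minimal sketch of the host-side change (note sshd honors the first occurrence of an option, so this assumes the options are not already set earlier in the file):

cat << EOF >> /etc/ssh/sshd_config
ClientAliveInterval 300
ClientAliveCountMax 0
EOF
systemctl restart sshd.service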

Results:
1. When VPP security profile was selected during host installation, upgrade packages task succeeded:
2020-08-19 10:45:33 IDT - TASK [ovirt-host-upgrade : Upgrade packages] ***********************************
2020-08-19 10:45:33 IDT -
2020-08-19 10:45:33 IDT - {
  "status" : "OK",
  "msg" : "",
  "data" : {
    "event_data" : {
      "playbook" : "ovirt-host-upgrade.yml",
      "playbook_uuid" : "f22e9551-699e-43ec-a56e-fb296996f8e2",
      "play" : "all",
      "play_uuid" : "001a4a16-1064-9792-4690-000000000009",
      "play_pattern" : "all",
      "task" : "Upgrade packages",
      "task_uuid" : "001a4a16-1064-9792-4690-000000000034",
      "task_action" : "yum",
      "task_args" : "",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-upgrade/tasks/main.yml:30",
      "role" : "ovirt-host-upgrade",
      "host" : "10.73.131.97",
      "remote_addr" : "10.73.131.97",
      "res" : {
        "results" : [ {
          "changed" : true,
          "results" : [ "Installed:\n  redhat-virtualization-host-image-update.noarch 0:4.3.10-20200615.0.el7_8      \n\nReplaced:\n  redhat-virtualization-host-image-update-placeholder.noarch 0:4.3.10-1.el7ev   \n\nComplete!\nUploading Enabled Repositories Report\n" ],
      "start" : "2020-08-19T07:34:50.825566",
      "end" : "2020-08-19T07:45:32.197702",


2. When /etc/ssh/sshd_config was configured, upgrade packages task succeeded:

2020-08-19 16:37:29 IDT - TASK [ovirt-host-upgrade : Upgrade packages] ***********************************
2020-08-19 16:46:25 IDT - {
  "status" : "OK",
  "data" : {
    "event_data" : {
      "playbook" : "ovirt-host-upgrade.yml",
      "playbook_uuid" : "44f35c32-1975-4e6b-b27d-be87eabc0936",
      "play" : "all",
      "play_uuid" : "001a4a16-1064-115d-9337-00000000000a",
      "play_pattern" : "all",
      "task" : "Upgrade packages",
      "task_uuid" : "001a4a16-1064-115d-9337-000000000035",
      "task_action" : "yum",
      "task_args" : "",
      "task_path" : "/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-host-upgrade/tasks/main.yml:30",
      "role" : "ovirt-host-upgrade",
      "host" : "10.73.73.13",
      "remote_addr" : "10.73.73.13",
      "res" : {
        "results" : [ {
          "msg" : "",
          "changed" : true,
          "results" : [ "Installed: redhat-virtualization-host-image-update-4.4.2-20200812.3.el8_2.noarch", "Removed: redhat-virtualization-host-image-update-placeholder-4.4.1-1.el8ev.noarch" ],
      "start" : "2020-08-19T13:37:29.373697",
      "end" : "2020-08-19T13:46:24.113720",
      "duration" : 534.740023,

Comment 9 errata-xmlrpc 2020-09-23 16:11:04 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Virtualization security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3807