Bug 1645412 - RHEL 7.6 - After yum update on overcloud, existing VM's are not starting
Summary: RHEL 7.6 - After yum update on overcloud, existing VM's are not starting
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: z10
Target Release: 10.0 (Newton)
Assignee: yogananth subramanian
QA Contact: Sanjay Upadhyay
URL:
Whiteboard:
Depends On: 1649408
Blocks:
Reported: 2018-11-02 07:48 UTC by Sanjay Upadhyay
Modified: 2020-11-24 08:22 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
After performing an OSP 10 update on overcloud nodes, new VMs can fail to start. When executing a modprobe command, you see an error similar to the following:
    modprobe: ERROR: could not insert 'kvm_intel': Unknown symbol in module, or unknown parameter (see dmesg)
Workaround: On each overcloud node machine, remove the following option from /etc/modprobe.d/kvm.rt.tuned.conf:
    kvm_intel ple_gap=0
Then, reboot each node machine.
Clone Of:
: 1649408 (view as bug list)
Environment:
Last Closed: 2019-03-13 21:08:01 UTC
Target Upstream Version:
Embargoed:



Description Sanjay Upadhyay 2018-11-02 07:48:56 UTC
Description of problem:

Deploy RHOSP 10 with RHEL 7.6 using our public CDN.
As the overcloud images have not been updated since 10.z9, the overcloud deploy is fine.

Then I ran "yum update" on all of my (overcloud) nodes and rebooted them (into RHEL 7.6), after which I cannot start any new VM, for an obvious reason:


openstack server show my_failed_vm
| fault                                | {u'message': u'Exceeded maximum number of retries. Exceeded max          |
|                                      | scheduling attempts 3 for instance e1f3a644-c9de-4526-aee4-4f0a9743f721. |
|                                      | Last exception: invalid argument: could not find capabilities for        |
|                                      | domaintype=kvm ', u'code': 500, u'details': u'  File "/usr/lib/python2.7 |
|                                      | /site-packages/nova/conductor/manager.py", line 493, in                  |
|                                      | build_instances\n    filter_properties, instances[0].uuid)\n  File       |
|                                      | "/usr/lib/python2.7/site-packages/nova/scheduler/utils.py", line 184, in |
|                                      | populate_retry\n    raise exception.MaxRetriesExceeded(reason=msg)\n',   |
|                                      | u'created': u'2018-10-31T20:44:00Z'}                                     |

On a compute node, kvm_intel is not loaded!!

[root@overcloud-compute-0 etc]# modprobe kvm_intel
modprobe: ERROR: could not insert 'kvm_intel': Unknown symbol in module, or unknown parameter (see dmesg)
[root@overcloud-compute-0 etc]# dmesg
[ 8923.582176] kvm_intel: Unknown parameter `ple_gap'
[root@overcloud-compute-0 etc]# grep -r ple_gap /etc/
/etc/modprobe.d/kvm.rt.tuned.conf:options kvm_intel ple_gap=0
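
The check in the transcript above can be generalized. The following is a minimal sketch, not part of the original report: the function names supports_param and files_setting_param are hypothetical, and it assumes that `modinfo -p` prints one parameter per line, each beginning with the parameter name followed by a colon.

```shell
#!/bin/sh
# Hypothetical helpers (not from the bug report) for spotting a modprobe
# option that the installed kernel module no longer accepts.

# supports_param PARAM
# Reads `modinfo -p MODULE` output on stdin and succeeds if PARAM is listed.
# Assumption: each line starts with "<name>:".
supports_param() {
    grep -q "^$1:"
}

# files_setting_param DIR MODULE PARAM
# Prints every file under DIR that has an "options MODULE ... PARAM=" line.
files_setting_param() {
    grep -rlE "^[[:space:]]*options[[:space:]]+$2[[:space:]]+(.*[[:space:]])?$3=" "$1"
}

# Example use on a compute node (commented out; needs the real module):
#   modinfo -p kvm_intel | supports_param ple_gap ||
#       files_setting_param /etc/modprobe.d kvm_intel ple_gap
```

On an affected node, the commented example would be expected to print the kvm.rt.tuned.conf path found by the grep above.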



I removed that line from kvm.rt.tuned.conf, rebooted (as I'm lazy), and now I can start VMs. This is not an outstanding bug, but unless an update script that I skipped is supposed to remove that line, we have a regression. Can someone urgently check whether we have a regression?
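
The manual workaround above could be scripted roughly as follows. This is a sketch under assumptions, not a supported fix: the function name remove_ple_gap_option is hypothetical, it only deletes lines that set nothing but ple_gap for kvm_intel (as in the file shown in this report), and it keeps a .bak copy of the original. The path is passed as an argument so the function can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): drop the stale
# "options kvm_intel ple_gap=<value>" line from a modprobe config file.

# remove_ple_gap_option FILE
# Deletes lines of the form "options kvm_intel ple_gap=<value>" and leaves
# every other line untouched. The original is kept as FILE.bak.
remove_ple_gap_option() {
    conf="$1"
    # Nothing to do if the file does not exist.
    [ -f "$conf" ] || return 0
    sed -i.bak \
        '/^[[:space:]]*options[[:space:]]\{1,\}kvm_intel[[:space:]]\{1,\}ple_gap=[^[:space:]]*[[:space:]]*$/d' \
        "$conf"
}

# Example (run as root on each overcloud node, then reboot as in the report):
#   remove_ple_gap_option /etc/modprobe.d/kvm.rt.tuned.conf
```

Matching only the single-parameter form is deliberate: an options line that also sets other kvm_intel parameters is left alone rather than silently dropped.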




Version-Release number of selected component (if applicable):
RHEL 7.6
Openstack 10 - z9

How reproducible:
Always


Steps to Reproduce:
1. Deploy RHOSP 10 z9 with base undercloud image as 7.6
2. Deploy the overcloud with the default overcloud image (which is still 7.5)
3. On the overcloud nodes, do a yum update to 7.6; the issue is then seen.

Actual results:


Expected results:


Additional info:

Comment 1 Sanjay Upadhyay 2018-11-02 08:03:43 UTC
We have also tested it via 'overcloud stack update', i.e.:

1. Deployed the undercloud (RHEL 7.6), then deployed the overcloud (OSP version 10)
2. Pointed the repositories to 7.6
3. Ran openstack overcloud deploy --update-plan-only ...
4. Ran openstack overcloud update stack -i overcloud

Still the same issue: the kvm_intel module is not loaded.

Comment 2 Mike Burns 2018-11-02 12:29:55 UTC
This might be better directed to the RHEL team (either kernel or virt or RT team).

> /etc/modprobe.d/kvm.rt.tuned.conf

The RT makes me think this is RealTime, is that correct?  Did you have the RT kernel running on 7.6 but update to the non-RT kernel?

Comment 3 Sanjay Upadhyay 2018-11-02 14:17:47 UTC
(In reply to Mike Burns from comment #2)
> This might be better directed to the RHEL team (either kernel or virt or RT
> team).
> 
> > /etc/modprobe.d/kvm.rt.tuned.conf
> 
> The RT makes me think this is RealTime, is that correct?  did you have the
> RT kernel running on 7.6 but update to the non-RT kernel?

These overcloud nodes are not RT-KVM images. The naming is certainly confusing; however, these are normal compute nodes with 7.5 images. After deployment and spawning VMs, the overcloud nodes were updated (via update stack). After the update, a reboot was done, after which we see that the kvm_intel module is not loaded.

I guess this bug should first be investigated by the OSP update folks?

Comment 10 Sanjay Upadhyay 2019-01-17 18:38:01 UTC
This was verified. 

We did an upgrade and reboot of the overcloud nodes, and the VMs did come up.
The issue with loading the kvm_intel module is not seen when updating from 7.5 to 7.6.

Comment 11 Cody Swanson 2019-01-29 22:14:49 UTC
Hi all,

My customer just hit this issue while upgrading their RHOSP 10 update 5 environment to RHOSP 10 update 10 (RHEL 7.4 -> RHEL 7.6). To perform the update they followed the standard overcloud update procedure. The workaround of removing the option:

options kvm_intel ple_gap=0

from /etc/modprobe.d/kvm.rt.tuned.conf resolved their issue. They've since put a post-install script step in their deployment templates to work around this issue. As far as I can tell, this issue is not yet resolved. For reference, this is the version they ended up on post-upgrade:

tuned-2.10.0-6.el7.noarch                                   Wed Jan 23 01:45:49 2019
tuned-profiles-cpu-partitioning-2.10.0-6.el7.noarch         Wed Jan 23 02:02:27 2019

Let me know if you need any additional data to troubleshoot further.

Comment 15 Shelley Dunne 2019-03-13 21:08:01 UTC
Since the problem described in this bug report should be
resolved in build tuned-2.10.0-6.el7_6.3, which shipped 13-Mar-19,
it has been closed with a resolution of CURRENTRELEASE.

For information, please reference https://bugzilla.redhat.com/show_bug.cgi?id=1653767

