Bug 1472310

Summary: Compute node freezes after two server instance deploys in ROL-Staging
Product: Red Hat OpenStack
Reporter: Philip Sweany <psweany>
Component: openstack-nova
Assignee: Eoghan Glynn <eglynn>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Joe H. Rahme <jhakimra>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: berrange, dasmith, eglynn, kchamart, psweany, rlocke, sbauza, sferdjao, sgordon, srevivo, svanders, vromanso
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-15 02:06:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
The steps taken starting from a clean CL210-OSP10 ROL-stage environment | none
An overview of working with ROL-stage, for those unfamiliar with Red Hat Training | none

Description Philip Sweany 2017-07-18 12:49:55 UTC
Description of problem:

In the Red Hat Training CL210-OSP10 student classroom environment installed in ROL (Staging), deploying project server instances causes the single compute node to freeze completely, and it never recovers.

Version-Release number of selected component (if applicable):

How reproducible:

Consistently fails in the ROL environment.  Similar steps on physical hardware (ILT classroom) do not fail.  Reproducing this bug requires our ROL-Staging test environment.  Two text files are attached to this bug report: "steps to reproduce", and "rol-stage overview", which introduces the student classroom ROL environment (if you are not already familiar with it) and documents the procedures we use to access ROL systems through both the GUI and external SSH.

Steps to Reproduce:

Click "Online Lab" tab.
Click "Provision Lab" button.
Wait for classroom, workstation, director, power to say "STARTED".

Click workstation's "Open Console" button.
Log in as student, password student.
Menu: Applications -> System Tools -> Settings
Dialog: Displays ->  Unknown Display -> Resolution 1920x1080 -> Keep Changes.
Close Dialog.
Menu: Applications -> Favorites -> Terminal

[student@workstation ~]$ ssh stack@director

[stack@director ~]$ openstack compute service list
     # Ensure that all four services are up.
     # If not up, wait however long it takes

[stack@director ~]$ openstack server list
     # All three nodes display SHUTOFF

[stack@director ~]$ openstack server start overcloud-controller-0
[stack@director ~]$ openstack server start overcloud-cephstorage-0
[stack@director ~]$ openstack server start overcloud-compute-0

Leave this browser window tab as is.
Click on "Red Hat Online Learning" tab.
Observe compute0, controller0 and ceph0 starting.  Wait until "STARTED".

Click controller0's "Open Console" button, for only one purpose, to observe the boot process and accurately know when controller0 is completely finished booting.
    # This opens a "controller0" browser tab.
    # Do not disturb the boot process.
    # Wait until "overcloud-controller-0 login:"
    # Do not log in, just close the "controller0" browser tab.

Return to "workstation" browser tab.

[stack@director ~]$ openstack server list
     # All three nodes display ACTIVE

[stack@director ~]$ exit
[student@workstation ~]$ lab troubleshooting-compute-nodes setup
     # this will take 7+ minutes
     # this has built resources and deployed the "production-web1" server instance

[student@workstation ~]$ source admin-rc
[student@workstation ~]$ openstack compute service set --enable overcloud-compute-0.localdomain nova-compute
     # to re-enable a service disabled by the setup script (intentionally, for students to find the problem)

[student@workstation ~]$ source developer1-finance-rc
[student@workstation ~]$ openstack server create --nic net-id=finance-network1 --security-group finance-web --image rhel7 --flavor m1.web --key-name developer1-keypair1 --wait finance-web1
[student@workstation ~]$ openstack server create --nic net-id=finance-network1 --security-group finance-web --image rhel7 --flavor m1.web --key-name developer1-keypair1 --wait finance-web2
     # compute0 locks up on this second deploy.
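
The "wait however long it takes" checks in the steps above could be scripted rather than watched by hand. The following is only a minimal sketch, not part of the course materials: the function names all_match and wait_until_all, the retry count, and the interval are assumptions made for illustration. It relies on the "-f value -c <column>" output options of the openstack client.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are illustrative only.

# all_match WANT: read one value per line on stdin; succeed only if at
# least one non-empty line was read and every such line equals WANT.
all_match() {
    local want="$1" value seen=0
    while read -r value || [ -n "$value" ]; do
        [ -z "$value" ] && continue
        seen=1
        [ "$value" = "$want" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# wait_until_all WANT TRIES INTERVAL CMD...: rerun CMD until every line
# of its output equals WANT, or give up after TRIES attempts.
wait_until_all() {
    local want="$1" tries="$2" interval="$3" i
    shift 3
    for ((i = 0; i < tries; i++)); do
        if "$@" | all_match "$want"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

# Possible use on director (assumed values: 60 tries, 10 seconds apart):
#   wait_until_all up 60 10 openstack compute service list -f value -c State
#   wait_until_all ACTIVE 60 10 openstack server list -f value -c Status
```

An empty listing is treated as "not ready", so a failing client call does not count as success.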

Actual results:

The "compute0" compute node freezes when the second instance is deployed.  The node cannot be accessed by SSH or by using the ROL environment to open that system's console.  This is not a temporary freeze; the node never returns.

Expected results:

The second instance should deploy cleanly, exactly like the first instance.

Additional info:

Comment 1 Philip Sweany 2017-07-18 12:51:24 UTC
Created attachment 1300469 [details]
The steps taken starting from a clean CL210-OSP10 ROL-stage environment

Comment 2 Philip Sweany 2017-07-18 12:53:09 UTC
Created attachment 1300470 [details]
An overview of working with ROL-stage, for those unfamiliar with Red Hat Training

Comment 3 Sven Anderson 2017-07-21 14:08:13 UTC
Is the compute node deployed on a bare-metal machine or on a VM? In any case, that could not be a nova issue; nova should never be in a position to freeze a machine. ;-)

Comment 4 Philip Sweany 2017-07-21 20:02:59 UTC
All machines (undercloud and overcloud) are VMs. It appears that I, as a novice Bugzilla submitter, have chosen the wrong category.  Since my troubleshooting has failed to point to a root cause, and there is no category for just "openstack", I chose openstack-nova.  This might be caused by a nova overcommit misconfiguration, errant CPU detection or classification, or many other things.  Because this is running on top of our ROL platform (Ravello emulation), I am unclear about how to proceed.

If you know that this cannot be a nova issue, then please help me determine who should be looking at this.  This is critical to the Red Hat Training group's ability to deploy this course on our ROL platform for our customers, and we do not have the engineering knowledge depth we think is needed to track this one down.  This has already stumped us for over two weeks.
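
One way to look at the overcommit hypothesis mentioned above is to compare what the scheduler would allow with what the compute VM really has. The sketch below is a simplified model under stated assumptions: the helper name headroom is hypothetical, the formula (capacity times allocation ratio, minus usage) ignores reserved host resources, and the figures in the comments are made-up examples, not measurements from this environment. The inputs would come from "openstack hypervisor stats show" (vcpus, vcpus_used, memory_mb, memory_mb_used) and from cpu_allocation_ratio and ram_allocation_ratio in nova.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper; a simplified model of scheduler headroom.

# headroom CAPACITY RATIO USED
# Prints CAPACITY * RATIO - USED, truncated to an integer. Use it with
# vCPU counts or with memory in MB.
headroom() {
    awk -v c="$1" -v r="$2" -v u="$3" 'BEGIN { printf "%d\n", c * r - u }'
}

# Made-up example figures, for illustration only:
#   headroom 4 16.0 2        # vCPUs still schedulable
#   headroom 8192 1.5 4096   # MB of RAM still schedulable
```

If the headroom is much larger than the resources the compute VM really has, the scheduler will keep placing instances that the nested host may not be able to back.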

Comment 5 Sven Anderson 2017-07-27 12:17:04 UTC
(In reply to Philip Sweany from comment #4)
> All machines (undercloud and overcloud) are VMs. It appears that I, as a
> novice bugzilla submitter, have chosen the wrong category.  Since my
> troubleshooting has been unsuccessful to point to a root cause, and there is
> no category for just "openstack", I chose openstack-nova.  This might be
> caused by nova overcommit misconfiguration, errant CPU detection or
> classification, or many other things.  Being that this is running on top of
> our ROL platform (Ravello emulation), I am unclear about how to proceed.

So, you say the node "freezes", but you just can't reach it anymore, right? Does it still respond to ping? So I only see two possibilities: the network gets messed up or the machine really freezes. I'm not 100% confident about the network case, but if the machine really freezes, that cannot be a nova issue, even if it triggers it. The node shouldn't freeze, no matter what nova does.

> If you know that this cannot be a nova issue, then please help me determine
> who should be looking at this.  This is critical to the Red Hat Training
> group's ability to deploy this course on our ROL platform for our customers,
> and we do not have the engineering knowledge depth we think is needed to
> track this one down.  This has already stumped us for over two weeks.

Who is operating this platform? They would be the first people I would talk to. Other than that, it depends on whether we can find out if the VM actually freezes or just becomes inaccessible. If it actually freezes, I would ask the kernel people; if it is only inaccessible, maybe OpenStack Neutron (networking).
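
The ping question above could be answered with a quick probe from director. This is a rough sketch only: the function name classify_node and the timeouts are assumptions, and the result is a hint rather than a diagnosis, since a node that answers neither ping nor SSH is consistent with both a frozen machine and a broken network.

```shell
#!/usr/bin/env bash
# Hypothetical helper; name and timeouts are illustrative only.

# classify_node HOST: print one of
#   responsive   ping answers and TCP port 22 accepts a connection
#   ping-only    ping answers but port 22 does not
#   unreachable  ping gets no answer
classify_node() {
    local host="$1"
    if ! ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "unreachable"
    elif timeout 5 bash -c "exec 3<>/dev/tcp/$host/22" 2>/dev/null; then
        echo "responsive"
    else
        echo "ping-only"
    fi
}

# Possible use (the address is a placeholder for the compute node's IP):
#   classify_node <compute0-address>
```

A "ping-only" result would suggest the kernel is still alive while userspace or sshd is stuck; "unreachable" needs a look at the VM console from the hosting platform to tell the two cases apart.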

Comment 6 Philip Sweany 2017-08-15 02:06:17 UTC
No further information.  The environment is Ravello, used for our internal Red Hat course development staging.  We were counting on assistance to work out *how* to determine the cause, as the undercloud gives no indication of what happened.  Closing this, since we are still where we were at the beginning.  (To answer your question: yes, the node just freezes.  No, I had never experienced that before.  Being frozen, it did not give us much to look at.)
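
Since a frozen node leaves nothing to inspect afterwards, one possible approach is to stream a few basic readings off the node before triggering the second deploy, so that the last sample before the freeze is kept elsewhere. This is only a sketch under assumptions: the function name record_until_freeze, the 5-second interval, and the choice of uptime and free as samples are all illustrative, and nothing like this was actually run for this bug.

```shell
#!/usr/bin/env bash
# Hypothetical helper; name, interval, and sampled commands are
# illustrative only.

# record_until_freeze TARGET LOGFILE
# Runs a sampling loop on TARGET over SSH and appends its output to
# LOGFILE on the local machine. The ServerAlive options make ssh give
# up after roughly 15 seconds of silence instead of hanging forever.
record_until_freeze() {
    local target="$1" logfile="$2"
    ssh -o ServerAliveInterval=5 -o ServerAliveCountMax=3 "$target" \
        'while true; do date -u +%FT%TZ; uptime; free -m; sleep 5; done' \
        >> "$logfile"
}

# Possible use, started before the second "openstack server create"
# (user and address are placeholders):
#   record_until_freeze <user>@<compute0-address> compute0-samples.log &
```

The final timestamp in the log then bounds when the node stopped responding, and the last load and memory figures give a first hint about resource exhaustion.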