Bug 1898407 - [Deployment timing regression] Deployment takes longer with 4.7
Summary: [Deployment timing regression] Deployment takes longer with 4.7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Bob Fournier
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-17 01:40 UTC by Sai Sindhur Malleni
Modified: 2021-02-24 15:34 UTC
7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:33:36 UTC
Target Upstream Version:
Embargoed:


Attachments
console-log-during-provisioning (371.98 KB, image/png)
2020-11-17 01:49 UTC, Sai Sindhur Malleni
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:34:06 UTC

Description Sai Sindhur Malleni 2020-11-17 01:40:59 UTC
Version:
4.7.0-0.nightly-2020-11-12-200927


Platform:

Baremetal on Dell Hardware driven through IPI install (BIOS/IPMI)

Please specify:
* IPI

What happened?
The time to deploy has regressed in 4.7 compared to 4.6.

For a 3 master node + 2 worker node deployment on baremetal

                      4.6.4                         4.7 nightly
Bootstrap Complete    46 mins                        51 mins
Install Complete      45 mins                        1 hr 28 mins

The install finishes 45 minutes after successful bootstrap in the case of 4.6 and 1hr 28 minutes after bootstrap completion in the case of 4.7.

So total bootstrap+install time in 4.6 = 1hr 31 mins
total bootstrap + install time in 4.7 = 2 hr 19 mins

It looked like the "provisioning" stage of the workers took a long time in 4.7. Instead of a single PXE boot, there was one PXE boot after which the ironic-python-agent kept sleeping, and eventually another reboot to PXE, after which the RHCOS image was written. The ironic-python-agent kept sleeping for several minutes, as can be seen on the console.

The "provisioning" stage itself took around 31 minutes in 4.7. I will attach a  screenshot of the console of a worker node during the provisioning where ironic python agent kept on sleeping until an eventual reboot to pxe and subsequent OS installation and run of OStree.

I suspect this is a regression in Ironic, as no rate limiting through the BMO or the like has been implemented yet.

Comment 2 Sai Sindhur Malleni 2020-11-17 01:49:48 UTC
Created attachment 1729960 [details]
console-log-during-provisioning

Please find attached the ironic-python-agent logs as seen on the console of a worker during "provisioning".

Comment 3 Bob Fournier 2020-11-17 14:48:48 UTC
Sai - which ironic rpm is being used in this test?

There are these log messages regarding the agent token in the ironic-conductor.log file:

The token is properly generated:
2020-11-17 00:11:43.066 1 DEBUG ironic.conductor.manager [req-3d7af0d9-a790-44cd-968c-5c765cb072e8 - - - - -] Generating agent token for node 8af71141-c7af-4922-bc13-2a610d5360c6 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3535[00m

After a power cycle the conductor can't generate a new token:
2020-11-17 00:17:45.722 1 WARNING ironic.conductor.manager [req-52cb6f42-5470-4aa8-84d3-a6493c9374b4 - - - - -] An agent token generation request is being refused as one is already present for node 8af71141-c7af-4922-bc13-2a610d5360c6[00m

Following another power cycle the token is properly generated:
2020-11-17 00:34:07.890 1 DEBUG ironic.common.utils [req-766d4d1f-89f6-47a0-84ce-2c6b52fd8730 - - - - -] Command stdout is: "Chassis Power Control: Down/Off
...
2020-11-17 00:38:59.848 1 DEBUG ironic.conductor.manager [req-10c4c7a6-7443-47af-a4fe-a03c7457dcb2 - - - - -] Generating agent token for node 8af71141-c7af-4922-bc13-2a610d5360c6 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3535[00m


We have seen problems in which IPA can't get a token, which ends up resulting in an extra timeout/reset cycle. I thought that was fixed with https://review.opendev.org/#/c/756479/ for https://bugzilla.redhat.com/show_bug.cgi?id=1885355, but we need to check whether that fix is in the version you are testing.

Comment 4 Sai Sindhur Malleni 2020-11-17 20:11:33 UTC
The version of ironic should be whatever was carried in 4.7.0-0.nightly-2020-11-12-200927. I do not have the env now to check.

Comment 5 Sai Sindhur Malleni 2020-11-18 15:24:47 UTC
Bob, can you please let me know which nightly has the fix so I can try it and report back?

Comment 6 Derek Higgins 2020-11-18 18:14:12 UTC
Looking at your logs I'm seeing a lot of "timed out" messages, e.g.

2020-11-17 00:11:42.328 1 DEBUG eventlet.wsgi.server [req-c31f9184-5780-4838-be45-42646e36973d - - - - -] (1) timed out ('::ffff:172.22.0.3', 57408, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823[00m


When I run 4.7 locally, my inspection stalls for 10 minutes or so while posting the collected data. I'm thinking that connections from the BMO to the inspector possibly aren't being closed, and as a result they hold workers open until they time out.
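
To illustrate the failure mode (a minimal Go sketch using plain net/http, not the actual baremetal-operator code; the function names and endpoint URL are hypothetical): if an HTTP response body is never drained and closed, the client keeps the TCP connection open instead of reusing or releasing it, and the server-side worker that accepted it can stay tied up until its own timeout fires.

package main

import (
    "fmt"
    "io"
    "net/http"
)

// leakyStatusCheck mimics a status probe that never drains or closes the
// response body. Each call leaves a TCP connection in ESTABLISHED state,
// and the worker serving it on the far side can stay occupied until its
// own timeout expires (consistent with the pile-up described above).
func leakyStatusCheck(url string) (bool, error) {
    resp, err := http.Get(url)
    if err != nil {
        return false, err
    }
    // Bug: resp.Body is never read or closed.
    return resp.StatusCode == http.StatusOK, nil
}

// closedStatusCheck drains and closes the body, so the connection is either
// reused for the next request or released immediately.
func closedStatusCheck(url string) (bool, error) {
    resp, err := http.Get(url)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    _, _ = io.Copy(io.Discard, resp.Body) // drain so keep-alive can reuse the connection
    return resp.StatusCode == http.StatusOK, nil
}

func main() {
    // Hypothetical inspector endpoint, for illustration only.
    ok, err := closedStatusCheck("http://localhost:5050/v1")
    fmt.Println(ok, err)
    _ = leakyStatusCheck
}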

I can't see your IPA log, so I don't know if the same thing is happening here. Do you have the IPA logs? And if so, is there a long delay between these lines?

Nov 18 11:50:51 worker-1 ironic-python-agent[733]: 2020-11-18 11:50:51.627 733 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory': {...
Nov 18 12:00:51 worker-1 ironic-python-agent[733]: 2020-11-18 12:00:51.617 733 INFO ironic_python_agent.inspector [-] inspection finished successfull

Comment 8 Derek Higgins 2020-11-18 23:34:33 UTC
(In reply to Derek Higgins from comment #6)
> Looking at your logs I'm seeing a lot of "timed out" messages, e.g.
> 
> 2020-11-17 00:11:42.328 1 DEBUG eventlet.wsgi.server
> [req-c31f9184-5780-4838-be45-42646e36973d - - - - -] (1) timed out
> ('::ffff:172.22.0.3', 57408, 0, 0) process_request
> /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823[00m
> 
> 
> When I run 4.7 locally my inspection, stalls for 10 minutes or so, while
> posting the collected data, I'm thinking that possibly connections from the
> BMO to inspector aren't being closed and as a result holding workers open
> until they timeout.
> 
> I can't see your IPA log so don't know if the same thing is happening here,
> do you have the IPA logs? And if so is there a long delay between these lines
> 
> Nov 18 11:50:51 worker-1 ironic-python-agent[733]: 2020-11-18 11:50:51.627
> 733 DEBUG ironic_python_agent.inspector [-] collected data: {'inventory':
> {...
> Nov 18 12:00:51 worker-1 ironic-python-agent[733]: 2020-11-18 12:00:51.617
> 733 INFO ironic_python_agent.inspector [-] inspection finished successfull

In my local dev-scripts environment, watching sockets to ironic-inspector during inspection, the number of ESTABLISHED sockets rises to 100 and then remains there.

[root@master-1 /]# netstat -apn | grep :5050 | grep barem                                          
tcp6       0      0 fd00:1101::3:54118      fd00:1101::3:5050       ESTABLISHED 25332/baremetal-ope
tcp6       0      0 fd00:1101::3:39390      fd00:1101::3:5050       ESTABLISHED 25332/baremetal-ope
tcp6       0      0 fd00:1101::3:49100      fd00:1101::3:5050       ESTABLISHED 25332/baremetal-ope
tcp6       0      0 fd00:1101::3:52448      fd00:1101::3:5050       ESTABLISHED 25332/baremetal-ope
tcp6       0      0 fd00:1101::3:39982      fd00:1101::3:5050       ESTABLISHED 25332/baremetal-ope
<snip>

[root@master-2 tmp]# netstat -apn | grep 5050 | grep barem | wc
    102     714   10302

Then the inspector stops responding until the connections hit the 15-minute timeout.

During inspection in my tests this accounted for roughly 12 to 15 minutes, but the same thing may be happening in ironic (I haven't looked through those logs as closely):
[root@master-2 tmp]# netstat -apn | grep 6385 | grep barem | wc
    101     707   10201
 
I'm not sure this accounts for all of the additional time, but if the same thing is happening in your environment it's at least some of it.

E.g. I think the response from this BMO call needs to be closed, but there may be others:
https://github.com/openshift/baremetal-operator/blob/3f71294/pkg/provisioner/ironic/dependencies.go#L48

It might also be that the default server behaviour has changed to enable keep-alive.
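
For what it's worth, here is a minimal hedged sketch (plain net/http, not the actual BMO or gophercloud change) of two client-side ways to keep one-shot status checks from pinning connections: disabling keep-alive outright, or sharing a single client with a small idle pool instead of building a new client per check. The variable names and pool sizes below are illustrative assumptions.

package main

import (
    "net/http"
    "time"
)

// Option 1: disable client-side keep-alive so each status check opens and
// then fully closes its own connection.
var oneShotClient = &http.Client{
    Timeout: 10 * time.Second,
    Transport: &http.Transport{
        DisableKeepAlives: true,
    },
}

// Option 2: keep keep-alive on, but share one client (and its small idle
// connection pool) across all checks rather than creating a new client per call.
var sharedClient = &http.Client{
    Timeout: 10 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        10,
        MaxIdleConnsPerHost: 2,
        IdleConnTimeout:     30 * time.Second,
    },
}

func main() {
    _ = oneShotClient
    _ = sharedClient
}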

Comment 11 Sai Sindhur Malleni 2020-11-19 15:43:23 UTC
So, I agree with Derek that there is a combination of problems. With the nightly 4.7.0-0.nightly-2020-11-18-085225:


[kni@e16-h12-b01-fc640 ~]$ grep "agent token" conductor
2020-11-18 22:41:14.764 1 DEBUG ironic.conductor.manager [req-d3cb2911-1143-4cc9-a9b7-2c677f221dbd - - - - -] Generating agent token for node 9e19b004-be4f-456f-bc5b-5dceb8c50650 get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3540
2020-11-18 22:41:14.785 1 DEBUG ironic.conductor.manager [req-4d4de588-7935-427c-9dd8-b02edc4a9f10 - - - - -] Generating agent token for node 764dedc7-b92e-45d6-a8a4-852bd4a80e1e get_node_with_token /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3540

So the token issue Bob mentioned is no longer a problem. But let me look into Derek's comments, as the "install" after bootstrap still took 56 minutes, which is higher than in 4.6.

Comment 12 Sai Sindhur Malleni 2020-11-19 15:49:08 UTC
Can you guys help me figure out how to get access to the IPA logs in the inspector container?

I tried "crictl exec 07dc56e736e92 /bin/bash" from the master node running the ironic-inspector pod, and "oc exec metal3-5f5678b464-72kp4 -c metal3-ironic-inspector -- bash", neither of which works.

Comment 13 Sai Sindhur Malleni 2020-11-19 15:54:33 UTC
There are a bunch of "timed out" messages in the ironic-inspector logs:

 inspector:24190:2020-11-19 15:28:02.593 1 DEBUG eventlet.wsgi.server [req-1d178150-4ef6-4176-8f2f-557a9dd87693 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37264, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24192:2020-11-19 15:28:02.597 1 DEBUG eventlet.wsgi.server [req-928acc5f-6b8a-4c01-8a91-9b660ac1eee1 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37266, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24193:2020-11-19 15:28:02.664 1 DEBUG eventlet.wsgi.server [req-b74e6395-0e8b-4430-a854-f19bb9ad3522 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37278, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24194:2020-11-19 15:28:04.669 1 DEBUG eventlet.wsgi.server [req-64abc33c-028b-4bd3-89d1-c1f3832a8ee2 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37280, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24195:2020-11-19 15:28:18.513 1 DEBUG eventlet.wsgi.server [req-8ed534e6-390a-4cd2-ae76-6832a4e1aeb2 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37612, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24196:2020-11-19 15:28:21.234 1 DEBUG eventlet.wsgi.server [req-092f61f6-8624-46af-b5ba-559ee82d12af - - - - -] (1) timed out ('::ffff:172.22.0.17', 37614, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24207:2020-11-19 15:28:33.707 1 DEBUG eventlet.wsgi.server [req-53aa1eee-49d5-4e8b-a16f-fb02f2c69ec7 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37618, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24208:2020-11-19 15:28:35.752 1 DEBUG eventlet.wsgi.server [req-eb5bc448-0a36-4f07-b3ee-0995b223f551 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37760, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24212:2020-11-19 15:28:51.361 1 DEBUG eventlet.wsgi.server [req-04c0ab62-61e8-4291-8b95-9e1779a36699 - - - - -] (1) timed out ('::ffff:172.22.0.17', 37850, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823
inspector:24213:2020-11-19 15:29:04.738 1 DEBUG eventlet.wsgi.server [req-50225c96-7f4a-4dab-b2ab-fb01fad29906 - - - - -] (1) timed out ('::ffff:172.22.0.17', 38192, 0, 0) process_request /usr/lib/python3.6/site-packages/eventlet/wsgi.py:823


With the agent token fix that Bob referenced, I see the deployment taking 11 minutes longer, which is roughly in line with Derek's observations.

I missed a -it in my oc and crictl commands pasted above (need more coffee), so I will get you that information as well.

Comment 14 Sai Sindhur Malleni 2020-11-19 18:41:49 UTC
The ramdisk logs for both nodes in my deployment truncate at /ironic_python_agent/inspector.py:224:
Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.283 1804 DEBUG ironic_python_agent.inspector [-] BMC IP address: 10.1.30.154 collect_default /usr/lib/python3.6/site-packages/ironic_python_agent/inspector.py:225
Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.284 1804 INFO ironic_python_agent.utils [-] Collecting system logs and debugging information
Nov 18 17:37:53 e16-h14-b02-fc640.rdu2.scalelab.redhat.com ironic-python-agent[1804]: 2020-11-18 17:37:53.290 1804 DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): journalctl --full --no-pager -b -n 10000 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:377


So I couldn't find the log messages Derek was referencing.

Comment 16 Sai Sindhur Malleni 2020-11-20 13:49:21 UTC
I'm OK with removing the TestBlocker flag now.

Comment 17 Derek Higgins 2020-11-20 23:52:12 UTC
> e.g. I think this BMO call needs to be closed but there may be others
> https://github.com/openshift/baremetal-operator/blob/3f71294/pkg/provisioner/
> ironic/dependencies.go#L48

A workaround for the TCP connection problem has been submitted upstream: https://github.com/metal3-io/baremetal-operator/pull/726
It looks like gophercloud has had functionality added to reuse TCP connections [1], but it isn't reusing them in inspector status checks.

1- https://github.com/gophercloud/gophercloud/pull/1952

Comment 18 Dmitry Tantsur 2020-11-26 10:08:39 UTC
It's possible that https://github.com/metal3-io/baremetal-operator/pull/712 has fixed the bug.

Comment 19 Bob Fournier 2020-12-04 13:51:58 UTC
It does appear that https://github.com/metal3-io/baremetal-operator/pull/712 has fixed the TCP socket issue. That fix was merged downstream in https://github.com/openshift/baremetal-operator/pull/113.

At this point both fixes - the token generation fix and the TCP socket fix - should be available in 4.7. Moving this to MODIFIED.

Comment 21 Raviv Bar-Tal 2020-12-07 06:48:21 UTC
Hey Sai,
I'm closing this BZ, as it seems to be solved.
If you still get timeouts and long installation times, please open a new BZ.
B.R.,
Raviv

Comment 24 errata-xmlrpc 2021-02-24 15:33:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

