Bug 1564918 - On ipmitool failure, it doesn't look like ironic is retrying eventhough it says after "4 retry"
Summary: On ipmitool failure, it doesn't look like ironic is retrying eventhough it sa...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: z2
: 13.0 (Queens)
Assignee: Ilya Etingof
QA Contact: bjacot
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-08 22:53 UTC by David Hill
Modified: 2018-08-29 16:32 UTC (History)
8 users (show)

Fixed In Version: openstack-ironic-10.1.3-2.el7ost
Doc Type: Enhancement
Doc Text:
Previously, Ironic considered just one IPMI error as retryable. That might have caused unjustified Ironic failure. With this enhancement, Ironic treats more types of IPMI error messages as retryable by the IPMI-backed hardware interfaces, such as power and management hardware interfaces. Specifically, "Node busy", "Timeout", "Out of space", and "BMC initialization in progress" IPMI errors cause Ironic to retry the IPMI command. The result is improved reliability of IPMI based communication with BMC.
Clone Of:
Environment:
Last Closed: 2018-08-29 16:32:24 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 575458 0 'None' MERGED Adds more `ipmitool` errors as retryable 2021-01-01 18:05:33 UTC
Red Hat Product Errata RHBA-2018:2593 0 None None None 2018-08-29 16:32:53 UTC

Description David Hill 2018-04-08 22:53:52 UTC
Description of problem:


2018-04-08 18:45:26.852 32338 ERROR ironic.drivers.modules.ipmitool [-] IPMI power on timed out after 4 retries on node ffdf17c9-cfb1-4351-87bb-3005fd04e469.
[root@undercloud-0-rhosp10 ironic]# ipmitool -I lanplus -H 192.168.122.65 -L ADMINISTRATOR -U root -R 3 -N 5 -f /tmp/tmpV2WhTS power on



2018-04-08 15:45:24,476.476 8208 DEBUG VirtualBMC [-] Get power state called for domain control-2-rhosp10
2018-04-08 15:45:29,522.522 8208 DEBUG VirtualBMC [-] Power on called for domain control-2-rhosp10
2018-04-08 15:45:31,548.548 8208 ERROR VirtualBMC [-] Error powering on the domain control-2-rhosp10. Error: Network not found: no network with matching name 'default'
2018-04-08 15:45:36,585.585 8208 DEBUG VirtualBMC [-] Get power state called for domain control-2-rhosp10
2018-04-08 15:45:41,617.617 8208 DEBUG VirtualBMC [-] Get power state called for domain control-2-rhosp10


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Deploy undercloud in kvm
2. Use virtualbmc to power on 3 controllers/ 1 compute /3 ceph nodes 
3.

Actual results:
Fails to power on 4 nodes almost each time introspection is ran but appears to only be trying once

Expected results:
Should retry 

Additional info:
Adding retry in virtual BMC solves the problem but I'm not convinced this is the best place to solve this.

--- a/vbmc.py.orig
+++ b/vbmc.py
@@ -15,6 +15,8 @@ import xml.etree.ElementTree as ET
 import libvirt
 import pyghmi.ipmi.bmc as bmc
 
+import time
+
 from virtualbmc import log
 from virtualbmc import utils
 
@@ -137,17 +139,28 @@ class VirtualBMC(bmc.Bmc):
 
     def power_on(self):
         LOG.debug('Power on called for domain %s', self.domain_name)
-        try:
+        ret = self.retry_power_on()
+        if ret != 0:
+          return 0xd5
+
+    def retry_power_on(self):
+        ret=0xd5
+        for i in range(3):
+          LOG.debug("%s", i)
+          try:
             with utils.libvirt_open(**self._conn_args) as conn:
                 domain = utils.get_libvirt_domain(conn, self.domain_name)
                 if not domain.isActive():
                     domain.create()
-        except libvirt.libvirtError as e:
-            LOG.error('Error powering on the domain %(domain)s. '
-                      'Error: %(error)s' % {'domain': self.domain_name,
-                                            'error': e})
-            # Command not supported in present state
-            return 0xd5
+                    ret = 0
+                    break
+          except libvirt.libvirtError as e:
+              LOG.error('Error powering on the domain %(domain)s. '
+                        'Error: %(error)s' % {'domain': self.domain_name,
+                                              'error': e})
+              time.sleep(5)
+        return ret
+
 
     def power_shutdown(self):
         LOG.debug('Soft power off called for domain %s', self.domain_name)

Comment 1 Julia Kreger 2018-04-26 20:19:40 UTC
Greetings,

I think retry logic in vbmc is likely the right course of action to take.  The reason being is that ipmitool only retries when there are cases of no reply received from the BMC. Vbmc is in essence indicating that there has been a hard failure powering on the "node", at which point ipmitool and ironic don't try any sort of retry operation. The best way to think about it is connectivity or transitory issue retry as opposed to hard failure retries.

Hopefully that makes sense!

-Julia

Comment 2 Ilya Etingof 2018-04-27 21:31:51 UTC
It also seems that Ironic's notion of retryable failures is limited to a single  `lanplus` specific error message. Thus, it may make sense to hardcode more error messages into Ironic to appropriate other error messages that `ipmitool` can potentially produce when reporting soft failures. 

Once Ironic retries on soft errors like "timeout" or "node busy", we could make VirtualBMC reporting one of these soft errors on libvirt faults what finally brings all the pieces together.

I think it makes sense to retry on more error conditions because in the field, various BMCs may behave differently.

For error messages see [1], particularly generic "completion codes" (p.44) and "device specific codes" (p.154).

1. https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/ipmi-second-gen-interface-spec-v2-rev1-1.pdf

Comment 3 David Hill 2018-05-09 13:35:38 UTC
Is there any reasons for not simply retrying on every kind of failures ?  I'd expect the retry to always retry unless it's in the specifications... and if a specification says "try only once if this error is returned" , I'd be saying it's a bad specification.

Comment 4 Ilya Etingof 2018-05-14 08:29:18 UTC
> Is there any reasons for not simply retrying on every kind of failures?

I so not think they have specific instructions in the spec on the retries and timing. But they enumerate many errors and some of them look "softer" e.g. retryable.

Consider for example "authentication failure" versus "BMC initialization in progress" errors. To me it feels like there is no much point in pondering unsuccessful authentication (you can even get blacklisted at the BMC if you do), but it does make sense to wait a little for BMC to come up and try again.

Also, retrying implies wasted time. It may make sense to fail fast if we have little hope to recover.

Comment 12 bjacot 2018-08-07 17:21:52 UTC
Installed OSP13

cat /etc/yum.repos.d/latest-installed 
13   -p 2018-07-30.2

verified that /ironic/drivers/modules/ipmitool.py had latest gerrit changes.  Changes are included.

Please reopen if you experience this problem again.

Comment 13 Joanne O'Flynn 2018-08-15 13:50:18 UTC
This bug is marked for inclusion in the errata but does not currently contain draft documentation text. To ensure the timely release of this advisory please provide draft documentation text for this bug as soon as possible.

If you do not think this bug requires errata documentation, set the requires_doc_text flag to "-".


To add draft documentation text:

* Select the documentation type from the "Doc Type" drop down field.

* A template will be provided in the "Doc Text" field based on the "Doc Type" value selected. Enter draft text in the "Doc Text" field.

Comment 14 Ilya Etingof 2018-08-27 13:15:48 UTC
May be this sentence "Ironic fails on potentially recoverable IPMI failures." should be re-phrased to indicate that this is what has been happening *prior* to this erratum.

Comment 15 Ilya Etingof 2018-08-27 13:25:49 UTC
I think the logic is inverse here.

Before this enhancement ironic considered just one IPMI error as retryable. That may have caused unjustified ironic failure. This enhancement made ironic considering more types of IPMI errors as recoverable e.g. ironic would retry the failed command if the command failed with one of those now retryable errors.

Does it make sense?

Comment 16 Joanne O'Flynn 2018-08-27 14:12:34 UTC
Thanks Ilya Etingof

Comment 17 Joanne O'Flynn 2018-08-27 14:12:56 UTC
Thanks Ilya Etingof

Comment 19 errata-xmlrpc 2018-08-29 16:32:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2593


Note You need to log in before you can comment on or make changes to this bug.