Bug 496748
Summary: QA_ONLY: Dell iDRAC support with fence_ipmilan
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Vinny Valdez <vvaldez>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, ctatman, cward, djansa, edamato, jfeeney, jkortus, kevin_guinn, liko, llim, martinez, mgrac, narayanan_d, praveen_paladugu, rmccabe, saidur_hasan, swhiteho, tao, wmealing, wwlinuxengineering
Target Milestone: rc
Target Release: ---
Keywords: TestOnly
Doc Type: Bug Fix
Clones: 516548
Bug Blocks: 496328, 496749, 516548, 545216
Last Closed: 2010-03-30 08:39:53 UTC
Description
Vinny Valdez
2009-04-20 22:06:39 UTC
Changing '-c' to also support this device is not a problem. We will just replace the string with a list of strings (as used in other agents). I can write a new fence agent for iDRAC, but it will be easier if I have access to a machine (a few hours of direct access should be enough). Is that possible?

Unfortunately it is not. I have to go physically on-site and sign in to be able to test anything. I will be out of town this week, but I can test something for you next week. Also, Praveen Paladugu (cc on this bug) is an on-site employee and could test something.

Mine is the enterprise one. Copy & paste from the system page:

Main System Chassis
System Information
Description PowerEdge R710
BIOS Version 1.1.4
Service Tag 6PC9C4J
Host Name drac-chywoon
OS Name
Auto Recovery
Recovery Action None
Initial Countdown 15
Present Countdown 15
Remote Access Controller
RAC Information
Name iDRAC
Product Information Integrated Dell Remote Access Controller 6 - Enterprise
Date/Time Sat Oct 24 16:07:31 2009
Firmware Version 1.20.01
Firmware Updated Tue Oct 13 15:38:12 2009
Hardware Version 0.01
MAC Address 00:22:19:35:5E:90
IPv4 Information
Enabled Yes
IP Address 10.44.1.18
Subnet Mask 255.255.255.0
Gateway 10.44.1.1
DHCP Enabled No
IPv6 Information
Enabled No
IP Address 1 ::
Prefix Length 64
IP Gateway ::
Link Local Address ::
IP Address 2 ::
Auto Config Yes

After upgrading the firmware to 2.10 on the M710, the original fence agent (racadm) works as expected. SMCLP also contains enough information and it can be used, but I usually end up with 'maximum smash connection reached'. Is it possible to kill such connections? On the R710, the path to the managed system is /admin1/system2 on Steve's machine. Is it the same on all machines (according to the manual it should be /admin1/system1)? I can use 'show system*' but I can't use 'start system*' (it is possible to remember the number, but imho there is a better solution).

Created attachment 367157 [details]
Fence agent for iDrac on R* series
Preliminary version of the fence agent, with fencing library, for iDrac on R-series. You will need the pexpect package.
If your system does not use the /admin1/system2 id, you will have to modify it. Timeout problems can be solved by using the timeout options; look at the help (the defaults work for me).
usage:
python fence_idrac.py -o reboot -a drac-ch -l root -p root -x
Dell would participate in testing this feature in RHEL 5.5.

Narayanan D: Hi, can you look at the preliminary agent (comment #26)? It works as expected on our testing machine, but it is possible that it does not cover something (it is a generalization based on two machines).

Narayanan -- Just sent you an email. We need Dell's immediate testing feedback as requested in comments #32 and #26. If we don't receive this feedback by the end of the week, I'm afraid this fix will miss the next RHEL and RHEV updates. Thanks!

Created attachment 377033 [details]
capture from script command
Attached is the capture from script command.
After changing the "/admin1/system2" -> "/admin1/system1" in fence_idrac.py, everything worked fine.
Following are the details of the R710 I used for testing:
------------------------------------------------------------------------------
Description PowerEdge R710
BIOS Version 1.2.6
Service Tag 8HDPBK1
Host Name vse-8-254.vmware.lab
OS Name VMware ESX Server
Auto Recovery
Recovery Action None
Initial Countdown 15
Present Countdown 15
Embedded NIC MAC Addresses
NIC1 Ethernet 00:24:e8:67:99:15
iSCSI 00:24:e8:67:99:16
NIC2 Ethernet 00:24:e8:67:99:17
iSCSI 00:24:e8:67:99:18
NIC3 Ethernet 00:24:e8:67:99:19
iSCSI 00:24:e8:67:99:1a
NIC4 Ethernet 00:24:e8:67:99:1b
iSCSI 00:24:e8:67:99:1c
Remote Access Controller
RAC Information
Name iDRAC6
Product Information Integrated Dell Remote Access Controller 6 - Enterprise
Date/Time Tue Dec 8 21:37:38 2009
Firmware Version 1.30 (Build 23)
Firmware Updated Thu Dec 3 21:40:35 2009
Hardware Version 0.01
MAC Address 00:24:E8:67:99:1D
IPv4 Information
IPv4 Enabled Yes
IP Address 172.17.5.232
Subnet Mask 255.255.0.0
Gateway 172.17.1.150
DHCP Enabled Yes
Use DHCP to obtain DNS server addresses No
Preferred DNS Server 0.0.0.0
Alternate DNS Server 0.0.0.0
IPv6 Information
IPv6 Enabled No
IP Address 1 ::
IP Gateway ::
Link Local Address ::
Autoconfig Enabled Yes
Use DHCPv6 to obtain DNS Server Addresses No
Preferred DNS Server ::
Alternate DNS Server ::
------------------------------------------------------------------------------
Everything is working fine on an R910 as well.

Praveen: Does the R910 also need the system2 -> system1 modification? Is there a direct way to determine whether we should work with /admin1/system2 or /admin1/system1? I'm not able to find such info in the Dell documentation. If there is no direct way, we can set the default target to /admin1/system1 and make it configurable via the command line.

Marek, Yes, the R910 also required the "/admin1/system2" -> "/admin1/system1" change. Also, in the file I attached in comment #36, I ran the command "show". This helped me understand that I am supposed to change "/admin1/system2" -> "/admin1/system1" for fence_idrac to work. Can't that command be used in the script to determine whether "system2" or "system1" has to be used?

Marek, Could you please advise on how I can test the fencing of blade servers? The attached fence_idrac doesn't seem to accept "module_name" as input. I would like to test the fencing on blades as well. Praveen

Not sure, I have never seen them - perhaps we only need to change the target. Please post me the output.

I logged into the idrac of the R710 with ssh and ran the command "show". Following are the details of the same.
----------------------------------------------------------------------
[user1 idrac]$ ssh root@172.17.5.232
root@172.17.5.232's password:
/admin1-> show
/admin1
  properties
    ElementName = SM CLP Admin Domain
  associations
    systemcomponent :
      GroupComponent = /admin1
      PartComponent = /admin1/system1
    serviceaffectselement :
      AffectedElement = /admin1
      AffectingElement = /admin1/system1/sp1/clpsvc1
      AssignedSequence = 0
      ElementEffects = NULL
      OtherElementEffectsDescriptions = NULL
    owningcollectionelement :
      OwnedElement = /admin1/hdwr1
      OwningElement = /admin1
    rolelimitedtotarget :
      DefiningRole = /admin1/system1/sp1/rolesvc3/role1
      TargetElement = /admin1
    rolelimitedtotarget :
      DefiningRole = /admin1/system1/sp1/rolesvc3/role2
      TargetElement = /admin1
    rolelimitedtotarget :
      DefiningRole = /admin1/system1/sp1/rolesvc3/role3
      TargetElement = /admin1
    owningcollectionelement :
      OwnedElement = /admin1/profiles1
      OwningElement = /admin1
    elementconformstoprofile :
      ConformantStandard = /admin1/profiles1/profile3
      ManagedElement = /admin1
  targets
    hdwr1
    profiles1
    system1
  verbs
    cd
    show
    help
    version
/admin1-> exit
--------------------------------------------------------------------
Please let me know if you need any more details.

Where should the module name be entered? Ideally, please post a procedure that will power the machine on/off. Thanks

Marek, In one of your previous posts (comment #25) above, you mentioned you were able to fence an M710 blade server. How did you do that? Are you using the CMC or are you using the idrac of the blade directly? If you are using the CMC, you have to provide the module name (which points to a particular blade in the chassis). On the CMC command prompt you have to run the following command to "powercycle" a blade plugged in at slot 4:

racadm serveraction -m server-4 powercycle

Other commands include "poweron", "powerdown", "hardreset", "graceshutdown", "powerstatus". In the above command, the "-m" option points to the module name.

If you are using the idrac on the blade directly, you have to ssh to the idrac and run commands like "reset system1" to reset the server. So, my question comes down to: what was used when you mentioned racadm works fine on M710 in comment #25? And what script was used to test that? Praveen

M710 with new firmware + fence_drac5 (it uses racadm), so the CMC was used.

Praveen, remember we did this when I worked onsite there, which is why I started this bug? I documented it here: http://linux.dell.com/wiki/index.php/Products/HA/DellRedHatHALinuxCluster/Cluster#Additional_Configuration_for_DRAC_Fencing I listed instructions on manually changing the fence agent to fence_drac5. Also, bug 466788 is where fencing a blade through the CMC was discussed.

I get it now. I was assuming fence_idrac was going to be used for blades as well. So, fence_idrac is for servers with idrac, and fence_drac5 is for servers with drac and for blade servers. And fence_drac5 uses the CMC for fencing blade servers. Hope I got this correct.

Marek, Please let me know if you need any more testing for this fix to be pushed to the next update of RHEL. Thank you Praveen

Created attachment 377710 [details]
Fence agent for iDrac on R* series
Fence agent that automatically determines target (tested on /admin1/system2). Please retest on other system. Only detection of target was added so nothing else should change.
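The target detection discussed above (using "show" to find out whether the machine exposes system1 or system2) could look roughly like the following. This is only a sketch, not the code from the attached agent: the function name find_system_target is hypothetical, it is written for Python 3, and it assumes the names under "targets" are whitespace-separated tokens, as in the "show" output quoted in this report.

```python
import re

# Assumption: section keywords as they appear in the "show" output quoted in
# this report; anything after one of them is no longer part of "targets".
_STOP_WORDS = {"properties", "associations", "verbs"}


def find_system_target(show_output, admin_path="/admin1"):
    """Return e.g. '/admin1/system1' for the first systemN target, else None."""
    tokens = show_output.split()
    if "targets" not in tokens:
        return None
    for token in tokens[tokens.index("targets") + 1:]:
        if token in _STOP_WORDS or token.startswith("/"):
            break  # we have left the targets section (next section or prompt)
        if re.fullmatch(r"system\d+", token):
            return f"{admin_path}/{token}"
    return None


if __name__ == "__main__":
    sample = "/admin1\ntargets\nhdwr1\nprofiles1\nsystem1\n/admin1->"
    print(find_system_target(sample))
```

The returned path would then be used in place of a hard-coded "/admin1/system2" when sending the start/stop commands.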
Hey Marek, Could you please check the attachment in comment #49? I still see the lines

if options["-o"] == "on":
    conn.send("start /admin1/system2\n")
else:
    conn.send("stop -f /admin1/system2\n")

in fence_idrac.py. The target seems to be hard-coded. When I tried this version of fence_idrac, I got timeout errors. Following is the output:

[user1]# ./fence_idrac.py -o on -a 172.17.6.243 -l root -p calvin -x
Failed: Timed out waiting to power ON

Thank you Praveen

Created attachment 378224 [details]
Fence agent for iDrac on R* series
You are right, wrong version. I believe that now there is a correct one.
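For readers wondering where a message like "Timed out waiting to power ON" can come from: a fence agent typically issues the power command and then polls the power status until it matches or a timeout expires. The sketch below illustrates that idea only. The names wait_for_power and get_status are hypothetical, the one-second polling interval is an assumption, and this is not the fencing library's actual implementation.

```python
import time


def wait_for_power(get_status, wanted, power_timeout=20, sleep=time.sleep):
    """Poll get_status() up to power_timeout times, about one second apart.

    get_status is any callable returning "on" or "off" (hypothetical helper).
    Returns True as soon as the status equals `wanted`, False on timeout.
    """
    for _ in range(int(power_timeout)):
        if get_status() == wanted:
            return True
        sleep(1)
    return False


if __name__ == "__main__":
    # Simulated device that reports "on" at the third poll; no real sleeping.
    states = iter(["off", "off", "on"])
    ok = wait_for_power(lambda: next(states), "on", power_timeout=5,
                        sleep=lambda seconds: None)
    print("powered on" if ok else "Failed: Timed out waiting to power ON")
```

If the status query itself never returns usable output, raising the timeout does not help, because the loop never sees the wanted state.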
Hey Marek, This agent worked fine without any changes. I tested it on an R710 and an R910. Everything went fine on the R710. But on the R910, when I tried "reboot" from fence_idrac, at times it would go down but never come back up. This could be because the server I have is still a PT level server. There could be something wrong on the idrac of this server. At times the "reboot" command also worked fine on the R910. Thank you Praveen

Hi Praveen, Can you send me verbose output from that R910? If it works sometimes, perhaps we will just have to set better defaults for the timeouts.

Marek, Sorry for the delay. I upgraded the R910 server with new parts and the server didn't boot for a while. I finally got it to boot yesterday. I will get the verbose output today. Praveen

Hey Marek, There surely seems to be a problem with fence_idrac.py. When I ssh to the server myself, I am able to log in. But when I try to "reboot" the server using fence_idrac.py, I am unable to log in. Following are the details of my testing:

[praveen@vse-r610 idrac]$ ssh root@172.17.7.227
root@172.17.7.227's password:
/admin1-> exit
CLP Session terminated
Connection to 172.17.7.227 closed.
[praveen@vse-r610 idrac]$ ssh root@172.17.7.227
root@172.17.7.227's password:
/admin1-> exit
CLP Session terminated
Connection to 172.17.7.227 closed.
[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
Unable to connect/login to fencing device
[praveen@vse-r610 idrac]$ exit
exit

The problem I reported above seems to be with the RSA keys. After removing the key for 172.17.7.227 (idrac of R910) from my known_hosts, I got a pop up asking me to confirm that it is ok to continue the ssh when I ran fence_idrac.py the first time.

[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
"""""PRAVEEN>> I SAW THE POP UP HERE"""""""""
Unable to connect/login to fencing device
[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
The authenticity of host '172.17.7.227 (172.17.7.227)' can't be established.
RSA key fingerprint is 3d:40:69:85:42:91:5d:19:eb:5e:79:01:c6:9c:73:70.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.17.7.227' (RSA) to the list of known hosts.
root@172.17.7.227's password:
/admin1-> show -display targets
/admin1
targets
hdwr1
profiles1
system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1->
Connection timed out
[praveen@vse-r610 idrac]$

Maybe the timeout values have to be changed as you mentioned. Praveen

I tried to increase all the timeouts in fence_idrac.py, but that doesn't solve the problem:

./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin --login-timeout 10 --power-timeout 10 --shell-timeout 10 --power-wait 10 -x
root@172.17.7.227's password:
/admin1-> show -display targets
/admin1
targets
hdwr1
profiles1
system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1->
Connection timed out

Praveen: Thanks, from your output we can be pretty sure that it is not a timing issue. IMHO the problem is that 'show -display properties ..' does not return the correct output. At #58 we can see that the output is just the target name, but on my machine it is:

/admin1/system2
properties
ElementName = Computer System
EnabledState = 2 (Enabled)
...

Version of my system according to 'version':

SM CLP Version: 1.0.2
SM ME Addressing Version: 1.0.0b

Do you think upgrading the firmware on your machine will help? Can you run 'show -display properties /admin1/system1' on your system directly, so we know that it is not a parsing problem in the fence agent? Thanks,

Praveen: Can I have access to that problematic machine?
R910 output of show -display properties /admin1/system1:

cmdstat
status : 3
status_tag : COMMAND EXECUTION FAILED
job
job_id : 2
joberr
errtype : 9
errtype_desc : Unavailable Resource Error
cimstat : 6
cimstat_desc : CIM_ERR_NOT_FOUND
severity : 2
severity_desc : Low

R710 output:

/admin1-> show -display properties /admin1/system1
/admin1/system1
properties
ElementName = Computer System
EnabledState = 2 (Enabled)
HealthState = 5 (OK)
OperationalStatus[0] = 2 (OK)
OperationalStatus[1] = 3 (Degraded)
RequestedState = 0 (Unknown)
powerstate = 2 (On)
/admin1->

I can see that something is wrong with the R910 output. I will change the firmware version and check if that helps. Praveen

Unfortunately, I cannot give you access to the failing machine today. I am not sure how to do that either. Praveen

Praveen, I was given VPN access to a lab in Round Rock 5 for another project. You may wish to speak with Dustin Orrick (ISV partner manager) and see if you might be able to take that machine to his lab and provide VPN access. I could also come onsite and assist if you need. Marek, I am very close to Dell.

Hey Vinny, Thanks for the contact. I contacted Dustin; he is checking if we can do this today.

Marek, I updated the firmware from 1.35(build16) to 1.35(build17) and the following is the output from the R910:

/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1-> show -display properties /admin1/system1
/admin1/system1

Is that an acceptable output? This is way shorter than the output from the R710, and the fencing still fails:

### ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
root@172.17.7.227's password:
/admin1-> show -display targets
/admin1
targets
hdwr1
profiles1
system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1->
Connection timed out

Praveen

The output has to contain properties - we are waiting for 'EnabledState' - to get the status of the machine.

IIRC, Vinny started vacation today.
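To illustrate why the short R910 output makes the agent time out, here is a rough sketch of extracting 'EnabledState' from "show -display properties" output. It is not the parser from the attached agent: the function names are hypothetical, it is written for Python 3, and it only recognises properties of the form "Name = N (Label)" as seen in the R710 output in this report.

```python
import re

# Matches e.g. "EnabledState = 2 (Enabled)" or "OperationalStatus[0] = 2 (OK)".
_PROP_RE = re.compile(r"\b([A-Za-z_]\w*(?:\[\d+\])?)\s*=\s*(\d+)\s*\(([^)]*)\)")


def parse_numeric_properties(output):
    """Map property name -> (numeric code, label) for 'N (Label)' values."""
    return {name: (int(code), label)
            for name, code, label in _PROP_RE.findall(output)}


def enabled_state(output):
    """Return (code, label) for EnabledState, or None if it is missing."""
    return parse_numeric_properties(output).get("EnabledState")


if __name__ == "__main__":
    r710 = """/admin1/system1
properties
ElementName = Computer System
EnabledState = 2 (Enabled)
powerstate = 2 (On)"""
    r910 = "/admin1/system1\n/admin1->"
    print(enabled_state(r710))
    print(enabled_state(r910))
```

With the R910 output there is no 'EnabledState' line at all, so a caller waiting for it has nothing to match, whatever the timeout values are.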
Praveen -- Any luck in making the system available to Marek via remote access? (Devel freezes tomorrow :-/). Thanks!

Unfortunately, the week of 12/21 - 12/25 is a holiday for DELL (America). So, I asked the idrac team from DELL (Bangalore, India) to help with this issue. Praveen

I did get a confirmation from the Bangalore iDRAC team that the iDRAC 1.36 (build 16) doesn't have all the fixes. A newer version of the Mccave firmware was showing all the required output as follows:

/admin1-> show -display properties /admin1/system1
/admin1/system1
properties
ElementName = Computer System
EnabledState = 2 (Enabled)
HealthState = 10 (Degraded/Warning)
OperationalStatus[0] = 2 (OK)
OperationalStatus[1] = 3 (Degraded)
RequestedState = 0 (Unknown)
powerstate = 2 (On)

Currently the engineer is working on checking fence_idrac with the latest firmware. Will update here as soon as I have anything new. Praveen

Hey Marek, Now that everyone is back from their vacation, we can get things moving faster. Please let me know if I can help you with anything for closing this issue. Thank you Praveen K Paladugu

Hi Praveen, can you check if our problem from #69 was solved with the new firmware? Thanks

Praveen -- Ping...

Sorry for the delay... The new firmware seems to fix all the problems. Everything works fine with the firmware version 1.35(build 39), which is the latest available firmware for Mccave. Please let me know if there is anything else I need to check. I will make sure I won't delay my responses like this time. Could you please tell me what version of RHEL this fencing agent is targeted for? Thank you Praveen

Marek confirmed that fence_ipmilan works properly for iDrac fence devices and that enabling ipmi on the iDrac does not disable other drac functionality. So this bug is just TestOnly.

I tried to check if fence_ipmilan works with idrac6.
Following are the outputs from my attempts:

1) This is before I turned on IPMI over LAN in idrac (telnet enabled)

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o on -v
Powering on machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Failed

2) This is after I turned on IPMI over LAN in idrac (telnet enabled)

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o on -v
Powering on machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Done

3) Enabling "IPMI over LAN" in idrac (telnet disabled)

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o off -v
Powering off machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Done

Enabling/disabling telnet has no impact on fencing. But in order for fence_ipmilan to work, "IPMI over LAN" had to be enabled. Is this the expected behavior?

Yes, in order for IPMI to work on the iDRAC, the setting "IPMI over LAN" has to be enabled. According to the current iDRAC user's guide, this setting is disabled by default: http://support.dell.com/support/edocs/software/smdrac3/idrac/idrac22modular/en/ug/pdf/ug.pdf

I tested fence_ipmilan on the following configurations and everything works fine:
1) 1850 - DRAC4/I (rack server)
2) 2950 - DRAC5 (rack server)
3) R710 - iDRAC6 (rack server)
4) M600 - iDRAC6 (blade server)

NOTE: I tested these configurations without any load. I mean, I just booted the servers and ran the commands during the POST itself. If fence_drac is going to be removed from RHEL, I guess DELL CMC is not going to be supported as a fencing agent? Thanks Praveen

I tested idrac6 Modular (M710), idrac Enterprise (using dedicated NIC) and idrac Express (using shared NIC) with fence_ipmilan. Everything works fine. Thank you Praveen

~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~

RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.

I installed the 5.5 Beta build and noticed that this issue is still not fixed. When I add a fencing configuration of type "Dell iDRAC" to a node, "fence-idrac" is still the agent mentioned in the cluster.conf file. I thought this was supposed to be "fence-ipmilan". Please correct me if I got this wrong. After filling in the fencing details, I fenced a node and nothing happened. I get a message like "Unable to retrieve batch 911501329 status from 12.2.4.6:11111: fence_node:failed" in /var/log/messages (after enabling debugging in luci). While filling in the fencing details, I didn't check the "Use SSH" checkbox in Conga, just in case this detail matters. Praveen

Hi Praveen, so the problem is in the web GUI, not in the fence agent? If the answer is yes, then it will be better to open a new bug with the correct component.

The GUI is the one that seems to be at fault here. I will create a bug for the same.
I verified that the fence_ipmilan agent works fine by itself. I verified that fence_ipmilan works on R710, M710 and 2950 servers, covering different form factors and generations. Praveen

I have created a new bug, #572514, against the behaviour of Conga while working with Dell iDRAC. Please check the same. Is any more information required to turn off the "needinfo?" flag on this bug? Praveen

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0266.html
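For reference, the verbose fence_ipmilan runs quoted in this report show the agent spawning ipmitool three times for a power change: a status query, the power action, and another status query. The sketch below only rebuilds those argument lists as they appear in the log. The helper names ipmitool_argv and power_sequence are hypothetical, and nothing here is taken from the fence_ipmilan source.

```python
def ipmitool_argv(host, user, password, action, ipmitool="/usr/bin/ipmitool"):
    """Build the ipmitool argument list seen in the verbose log above."""
    if action not in ("status", "on", "off"):
        raise ValueError(f"unsupported power action: {action!r}")
    return [ipmitool, "-I", "lan", "-H", host, "-U", user, "-P", password,
            "-v", "chassis", "power", action]


def power_sequence(host, user, password, action):
    """Return the status / action / status command triple from the log."""
    return [ipmitool_argv(host, user, password, step)
            for step in ("status", action, "status")]


if __name__ == "__main__":
    # Address and credentials are the ones quoted in the log above.
    for argv in power_sequence("172.17.5.232", "root", "calvin", "on"):
        print(" ".join(argv))
```

Note that a password passed with -P is visible in the process list while the command runs. As the thread points out, none of these commands succeed unless "IPMI over LAN" is enabled on the iDRAC.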