Created attachment 340447 [details]
idrac cli

Description of problem:
Dell's new servers use an "iDRAC" that has an entirely new interface. On the new blade servers these are configurable and can be manipulated directly instead of via the Dell CMC. It is possible that a customer might want to use the CMC as the primary fencing method and the iDRAC as a secondary method.

The new PowerEdge non-blade servers carry the iDRAC6, which responds to the new iDRAC interface as well as the old DRAC5 style of "racadm" commands. However, iDRAC6 uses a different prompt, so the "-c" parameter needs to be passed.

Version-Release number of selected component (if applicable):
RHEL 5.3
cman-2.0.98-1.el5

How reproducible:
Every time

Steps to Reproduce:
1. Obtain a Dell server with an iDRAC
2. Attempt to configure fencing using either fence_drac or fence_drac5
3. Fencing fails

Actual results:
The iDRAC on Dell blades uses the "SM-CLP" interface and does not respond to racadm CLI commands. I can manually SSH to it and type "reset system1" to power-cycle it. The iDRAC6 in PowerEdge servers responds to both SM-CLP and racadm commands, so this works, but it needs the "-c" prompt defined.

Expected results:
The iDRAC SM-CLP CLI needs to be supported to fence Dell blade servers directly (outside of the Dell CMC).

Additional info:
I also tried using the code posted in comment #5 and comment #6 in bug 466788.

Dell's documentation on iDRAC:
http://support.dell.com/support/edocs/software/smdrac3/idrac/idrac10mono/en/ug/html/racugc1e.htm#wp46449

See attachment for CLI output.
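For illustration only, the SM-CLP verbs mentioned in this thread ("reset system1" here, "start"/"stop -f" in the preliminary agent below) map onto fence actions roughly like this sketch; clp_command is a hypothetical helper name, not part of any shipped agent:

```python
# Hypothetical sketch: mapping fence-agent actions to the SM-CLP commands an
# iDRAC accepts over SSH.  The verbs are taken from this bug's comments; the
# default target path is an assumption and differs per machine (see below).

def clp_command(action, target="/admin1/system1"):
    """Return the SM-CLP command line for a given fencing action."""
    commands = {
        "on": "start {0}",      # power the managed system on
        "off": "stop -f {0}",   # forced power-off
        "reboot": "reset {0}",  # power-cycle, e.g. "reset system1"
        "status": "show -display properties {0}",
    }
    return commands[action].format(target)
```

The thread below shows the target path is not fixed ( /admin1/system1 on some machines, /admin1/system2 on others), which is why it is parameterized here.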
Changing '-c' to also support this device is not a problem; we will just replace the string with a list of strings (as used in other agents). I can write a new fence agent for iDRAC, but it will be easier if I have access to a machine (a few hours of direct access should be enough). Is that possible?
Unfortunately it is not. I have to go physically on-site and sign in to be able to test anything. I will be out of town this week, but I can test something for you next week. Also, Praveen Paladugu (cc'd on this bug) is an on-site employee and could test something.
Mine is the enterprise one.
Copy & paste from the system page:

Main System Chassis
System Information
Description            PowerEdge R710
BIOS Version           1.1.4
Service Tag            6PC9C4J
Host Name              drac-chywoon
OS Name
Auto Recovery
Recovery Action        None
Initial Countdown      15
Present Countdown      15
[Back to Top]

Remote Access Controller
RAC Information
Name                   iDRAC
Product Information    Integrated Dell Remote Access Controller 6 - Enterprise
Date/Time              Sat Oct 24 16:07:31 2009
Firmware Version       1.20.01
Firmware Updated       Tue Oct 13 15:38:12 2009
Hardware Version       0.01
MAC Address            00:22:19:35:5E:90
IPv4 Information
Enabled                Yes
IP Address             10.44.1.18
Subnet Mask            255.255.255.0
Gateway                10.44.1.1
DHCP Enabled           No
IPv6 Information
Enabled                No
IP Address 1           ::
Prefix Length          64
IP Gateway             ::
Link Local Address     ::
IP Address 2           ::
Auto Config            Yes
After upgrading the firmware to 2.10 on the M710, the original fence agent (racadm) works as expected. SM-CLP also contains enough information and could be used, but I usually end up with 'maximum smash connection reached'. Is it possible to kill such connections? On the R710 (Steve's machine), the path to the managed system is /admin1/system2. Is it the same on all machines (according to the manual it should be /admin1/system1)? I can use 'show system*' but I can't use 'start system*' (it is possible to remember the number, but IMHO there is a better solution).
Created attachment 367157 [details]
Fence agent for iDrac on R* series

Preliminary version of the fence agent, with its fencing library, for iDRAC on the R series. You will need the pexpect package. If /admin1/system2 is not the id for your system, you will have to modify it. Timeout problems can be solved by using the timeout options; look at the help (the defaults work for me).

usage: python fence_idrac.py -o reboot -a drac-ch -l root -p root -x
Dell would participate in testing this feature in RHEL 5.5
Narayanan D: Hi, can you look at the preliminary agent (comment #26)? It works as expected on our test machine, but it is possible that it does not cover something (it is a generalization based on only two machines).
Narayanan -- Just sent you an email. We need Dell's immediate testing feedback as requested in comments #32 and #26. If we don't receive this feedback by the end of the week, I'm afraid this fix will miss the next RHEL and RHEV updates. Thanks!
Created attachment 377033 [details]
capture from script command

Attached is the capture from the script command. After changing "/admin1/system2" -> "/admin1/system1" in fence_idrac.py, everything worked fine. Following are the details of the R710 I used for testing:
---------------------
Description            PowerEdge R710
BIOS Version           1.2.6
Service Tag            8HDPBK1
Host Name              vse-8-254.vmware.lab
OS Name                VMware ESX Server
Auto Recovery
Recovery Action        None
Initial Countdown      15
Present Countdown      15
Embedded NIC MAC Addresses
NIC1  Ethernet 00:24:e8:67:99:15  iSCSI 00:24:e8:67:99:16
NIC2  Ethernet 00:24:e8:67:99:17  iSCSI 00:24:e8:67:99:18
NIC3  Ethernet 00:24:e8:67:99:19  iSCSI 00:24:e8:67:99:1a
NIC4  Ethernet 00:24:e8:67:99:1b  iSCSI 00:24:e8:67:99:1c
[Back to Top]

Remote Access Controller
RAC Information
Name                   iDRAC6
Product Information    Integrated Dell Remote Access Controller 6 - Enterprise
Date/Time              Tue Dec 8 21:37:38 2009
Firmware Version       1.30 (Build 23)
Firmware Updated       Thu Dec 3 21:40:35 2009
Hardware Version       0.01
MAC Address            00:24:E8:67:99:1D
IPv4 Information
IPv4 Enabled           Yes
IP Address             172.17.5.232
Subnet Mask            255.255.0.0
Gateway                172.17.1.150
DHCP Enabled           Yes
Use DHCP to obtain DNS server addresses   No
Preferred DNS Server   0.0.0.0
Alternate DNS Server   0.0.0.0
IPv6 Information
IPv6 Enabled           No
IP Address 1           ::
IP Gateway             ::
Link Local Address     ::
Autoconfig Enabled     Yes
Use DHCPv6 to obtain DNS Server Addresses No
Preferred DNS Server   ::
Alternate DNS Server   ::
------------------------------------------------------------------------------
Everything is working fine on an R910 as well.
Praveen: Does the R910 also need the system2 -> system1 modification? Is there a direct way to determine whether we should work with /admin1/system2 or /admin1/system1? I'm not able to find such info in the Dell documentation. If there is no direct way, we can set the default target to /admin1/system1 and make it configurable via the command line.
Marek, Yes, the R910 also required the "/admin1/system2" -> "/admin1/system1" change. Also, in the file I attached in comment #36, I ran the command "show". This helped me understand that I was supposed to change "/admin1/system2" -> "/admin1/system1" for fence_idrac to work. Can't that command be used in the script to determine whether "system2" or "system1" has to be used?
Marek, Could you please advise on how I can test the fencing of blade servers? The fence_idrac attached doesn't seem to accept "module_name" as input. I would like to test the fencing on blades as well. Praveen
Not sure, I have never seen them - perhaps we only need to change the target. Please post me the output.
I logged into the iDRAC of the R710 with ssh and ran the command "show". Following are the details:
----------------------------------------------------------------------
[user1 idrac]$ ssh root@172.17.5.232
root@172.17.5.232's password:
/admin1-> show
/admin1
    properties
        ElementName = SM CLP Admin Domain
    associations
        systemcomponent :
            GroupComponent = /admin1
            PartComponent = /admin1/system1
        serviceaffectselement :
            AffectedElement = /admin1
            AffectingElement = /admin1/system1/sp1/clpsvc1
            AssignedSequence = 0
            ElementEffects = NULL
            OtherElementEffectsDescriptions = NULL
        owningcollectionelement :
            OwnedElement = /admin1/hdwr1
            OwningElement = /admin1
        rolelimitedtotarget :
            DefiningRole = /admin1/system1/sp1/rolesvc3/role1
            TargetElement = /admin1
        rolelimitedtotarget :
            DefiningRole = /admin1/system1/sp1/rolesvc3/role2
            TargetElement = /admin1
        rolelimitedtotarget :
            DefiningRole = /admin1/system1/sp1/rolesvc3/role3
            TargetElement = /admin1
        owningcollectionelement :
            OwnedElement = /admin1/profiles1
            OwningElement = /admin1
        elementconformstoprofile :
            ConformantStandard = /admin1/profiles1/profile3
            ManagedElement = /admin1
    targets
        hdwr1
        profiles1
        system1
    verbs
        cd
        show
        help
        version
/admin1-> exit
--------------------------------------------------------------------
Please let me know if you need any more details.
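The targets section of that output is enough to answer the system1-vs-system2 question automatically (which is what the later attachment in comment #49 does). As an illustration only, under the assumption that the output looks like the transcript above, a minimal parsing sketch could be (detect_system_target is a hypothetical name, not the agent's real code):

```python
import re

# Sketch, not the shipped agent: pick the managed-system target out of the
# "show" output's targets section instead of hard-coding /admin1/system1
# or /admin1/system2.

def detect_system_target(show_output):
    """Return '/admin1/systemN' parsed from the targets section, or None."""
    in_targets = False
    for line in show_output.splitlines():
        token = line.strip()
        if token == "targets":
            in_targets = True
        elif in_targets:
            if re.match(r"system\d+$", token):
                return "/admin1/" + token
            if token in ("verbs", "properties", "associations"):
                in_targets = False  # left the targets section
    return None
```

On the R710 transcript above this would yield /admin1/system1; on Marek's machine it would yield /admin1/system2.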
Where should the module name be entered? Ideally, please post a procedure that will power the machine on/off. Thanks
Marek, In one of your previous posts (comment #25) above, you mentioned you were able to fence an M710 blade server. How did you do that? Are you using the CMC, or are you using the iDRAC of the blade directly? If you are using the CMC, you have to provide the module name (which points to a particular blade in the chassis). On the CMC command prompt, you have to run the following command to power-cycle a blade plugged into slot 4:

racadm serveraction -m server-4 powercycle

Other commands include "poweron", "powerdown", "hardreset", "graceshutdown", and "powerstatus". In the above command, the "-m" option points to the module name. If you are using the iDRAC on the blade directly, you have to ssh to the iDRAC and run commands like "reset system1" to reset the server. So, my question comes down to: what was used when you mentioned racadm works fine on the M710 in comment #25? And what script was used to test that? Praveen
M710 with new firmware + fence_drac5 (it uses racadm), so the CMC was used.
Praveen, remember we did this when I worked onsite there, which is why I started this bug? I documented it here: http://linux.dell.com/wiki/index.php/Products/HA/DellRedHatHALinuxCluster/Cluster#Additional_Configuration_for_DRAC_Fencing I listed instructions on manually changing the fence agent to fence_drac5. Also, bug 466788 is where fencing a blade through the CMC was discussed.
I get it now, I was assuming fence_idrac is going to be used for blades as well. So, fence_idrac is for servers with idrac and fence_drac5 is for servers with drac and blade servers. And fence_drac5 uses CMC for fencing blade servers. Hope I got this correct. Marek, Please let me know if you need any more testing for this fix to be pushed to the next update of RHEL. Thank you Praveen
Created attachment 377710 [details]
Fence agent for iDrac on R* series

Fence agent that automatically determines the target (tested on /admin1/system2). Please retest on other systems. Only the target detection was added, so nothing else should change.
Hey Marek, Could you please check the attachment in comment #49? I still see the lines

if options["-o"] == "on":
    conn.send("start /admin1/system2\n")
else:
    conn.send("stop -f /admin1/system2\n")

in fence_idrac.py. The target seems to be hard-coded. When I tried this version of fence_idrac, I get timeout errors. Following is the output:

[user1]# ./fence_idrac.py -o on -a 172.17.6.243 -l root -p calvin -x
Failed: Timed out waiting to power ON

Thank you
Praveen
Created attachment 378224 [details]
Fence agent for iDrac on R* series

You are right, that was the wrong version. I believe the correct one is there now.
Hey Marek, This agent worked fine without any changes. I tested it on an R710 and an R910. Everything went fine on the R710. But on the R910, when I tried "reboot" from fence_idrac, at times it would go down but never come back up. This could be because the server I have is still a PT-level server; there could be something wrong with the iDRAC of these servers. At times the "reboot" command also worked fine on the R910. Thank you Praveen
Hi Praveen, Can you send me verbose output from that R910? If it works sometimes, perhaps we just have to set better defaults for the timeouts.
Marek, Sorry for the delay. I upgraded the R910 server with new parts and the server didn't boot for a while. I finally got it to boot yesterday. I will get the verbose output today. Praveen
Hey Marek, There surely seems to be a problem with fence_idrac.py. When I try to ssh to the server by myself, I am able to log in. But when I try to "reboot" the server using fence_idrac.py, I am unable to log in. Following are the details of my testing:

[praveen@vse-r610 idrac]$ ssh root@172.17.7.227
root@172.17.7.227's password:
/admin1-> exit
CLP Session terminated
Connection to 172.17.7.227 closed.
[praveen@vse-r610 idrac]$ ssh root@172.17.7.227
root@172.17.7.227's password:
/admin1-> exit
CLP Session terminated
Connection to 172.17.7.227 closed.
[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
Unable to connect/login to fencing device
[praveen@vse-r610 idrac]$ exit
exit
The problem I reported above seems to be with the RSA keys. After removing the key for 172.17.7.227 (the iDRAC of the R910) from my known_hosts, I got a prompt asking me to confirm it is OK to continue the ssh when I ran fence_idrac.py the first time.

[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
"""""PRAVEEN>> I SAW THE POP UP HERE"""""""""
Unable to connect/login to fencing device
[praveen@vse-r610 idrac]$ ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
The authenticity of host '172.17.7.227 (172.17.7.227)' can't be established.
RSA key fingerprint is 3d:40:69:85:42:91:5d:19:eb:5e:79:01:c6:9c:73:70.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '172.17.7.227' (RSA) to the list of known hosts.
root@172.17.7.227's password:
/admin1-> show -display targets /admin1
targets
    hdwr1
    profiles1
    system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1-> Connection timed out
[praveen@vse-r610 idrac]$

Maybe the timeout values have to be changed, as you mentioned.
Praveen
I tried increasing all the timeouts in fence_idrac.py, but that doesn't solve the problem:

./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin --login-timeout 10 --power-timeout 10 --shell-timeout 10 --power-wait 10 -x
root@172.17.7.227's password:
/admin1-> show -display targets /admin1
targets
    hdwr1
    profiles1
    system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1-> Connection timed out
Praveen: Thanks; from your output we can be pretty sure that it is not a timing issue. IMHO the problem is that 'show -display properties ...' does not return the correct output. In comment #58 we can see that the output is just the target name, but on my machine it is:

/admin1/system2
    properties
        ElementName = Computer System
        EnabledState = 2 (Enabled)
...

The version of my system according to 'version':

SM CLP Version: 1.0.2
SM ME Addressing Version: 1.0.0b

Do you think upgrading the firmware on your machine will help? Can you run 'show -display properties /admin1/system1' on your system directly, so we know that it is not a parsing problem in the fence agent? Thanks,
Praveen: Can I have access to that problematic machine?
R910 output of show -display properties /admin1/system1:

cmdstat
    status : 3
    status_tag : COMMAND EXECUTION FAILED
job
    job_id : 2
joberr
    errtype : 9
    errtype_desc : Unavailable Resource Error
    cimstat : 6
    cimstat_desc : CIM_ERR_NOT_FOUND
    severity : 2
    severity_desc : Low

R710 output:

/admin1-> show -display properties /admin1/system1
/admin1/system1
    properties
        ElementName = Computer System
        EnabledState = 2 (Enabled)
        HealthState = 5 (OK)
        OperationalStatus[0] = 2 (OK)
        OperationalStatus[1] = 3 (Degraded)
        RequestedState = 0 (Unknown)
        powerstate = 2 (On)
/admin1->

I can see that something is wrong with the R910 output. I will change the firmware version and check if that helps.
Praveen
Unfortunately, I cannot give you access to the failing machine today. I am not sure how to do that either. Praveen
Praveen, I was given VPN access to a lab in Round Rock 5 for another project, you may wish to speak with Dustin Orrick (ISV partner manager), and see if you might be able to take that machine to his lab and provide VPN access. I could also come onsite and assist if you need Marek, I am very close to Dell.
Hey Vinny, Thanks for the contact. I contacted Dustin; he is checking if we can do this today.

Marek, I updated the firmware from 1.35 (build 16) to 1.35 (build 17) and the following is the output from the R910:

/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1-> show -display properties /admin1/system1
/admin1/system1

Is that an acceptable output? It is much shorter than the output from the R710, and the fencing still fails:

### ./fence_idrac.py -v -o reboot -a 172.17.7.227 -l root -p calvin -x
root@172.17.7.227's password:
/admin1-> show -display targets /admin1
targets
    hdwr1
    profiles1
    system1
/admin1-> show -display properties /admin1/system1
/admin1/system1
/admin1-> Connection timed out

Praveen
The output has to contain the properties - we are waiting for 'EnabledState' to get the status of the machine.
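To make the failure mode concrete: a healthy iDRAC prints properties such as "EnabledState = 2 (Enabled)" and "powerstate = 2 (On)", while the broken R910 firmware prints only the target name, so status parsing finds nothing. As an illustrative sketch only (parse_power_state is a hypothetical name, not the agent's real code):

```python
import re

# Sketch: extract the power state from the output of
# 'show -display properties /admin1/systemN'.  Returns None when the
# properties section is missing, as in the broken R910 output above.

def parse_power_state(properties_output):
    """Return 'on', 'off', or None if no powerstate property was printed."""
    match = re.search(r"powerstate\s*=\s*\d+\s*\((\w+)\)", properties_output)
    if match is None:
        return None  # no properties section: power status is unknown
    return match.group(1).lower()
```

On the R710 output from comment above this yields 'on'; on the R910 output it yields None, which is exactly why the agent times out waiting for a status.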
IIRC, Vinny started vacation today. Praveen -- Any luck in making the system available to Marek via remote access? (Devel freezes tomorrow :-/). Thanks!
Unfortunately, the week of 12/21-12/25 is a holiday for Dell (Americas). So I asked the iDRAC team from Dell (Bangalore, India) to help with this issue. Praveen
I did get confirmation from the Bangalore iDRAC team that iDRAC 1.36 (build 16) doesn't have all the fixes. A newer version of the Mccave firmware was showing all the required output, as follows:

/admin1-> show -display properties /admin1/system1
/admin1/system1
    properties
        ElementName = Computer System
        EnabledState = 2 (Enabled)
        HealthState = 10 (Degraded/Warning)
        OperationalStatus[0] = 2 (OK)
        OperationalStatus[1] = 3 (Degraded)
        RequestedState = 0 (Unknown)
        powerstate = 2 (On)

Currently the engineer is working on checking fence_idrac with the latest firmware. Will update here as soon as I know anything new.
Praveen
Hey Marek, Now that everyone is back from their vacation, we can get things moving faster. Please let me know if I can help you with anything, for closing this issue. Thank you Praveen K Paladugu
Hi Praveen, can you check whether our problem from comment #69 was solved with the new firmware? Thanks
Praveen -- Ping...
Sorry for the delay... The new firmware seems to fix all the problems. Everything works fine with firmware version 1.35 (build 39), which is the latest available firmware for Mccave. Please let me know if there is anything else I need to check; I will make sure I don't delay my responses like this time. Could you please tell me which version of RHEL this fencing agent is targeted for? Thank you Praveen
Marek confirmed that fence_ipmilan works properly for iDrac fence devices and enabling ipmi on the iDrac does not disable other drac functionality. So this bug is just TestOnly.
I tried to check if fence_ipmilan works with the iDRAC6. Following are the outputs from my attempts:

1) Before turning on IPMI over LAN in the iDRAC (telnet enabled):

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o on -v
Powering on machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Failed

2) After turning on IPMI over LAN in the iDRAC (telnet enabled):

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o on -v
Powering on machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power on'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Done

3) With "IPMI over LAN" enabled in the iDRAC (telnet disabled):

#### fence_ipmilan -a 172.17.5.232 -p calvin -l root -o off -v
Powering off machine @ IPMI:172.17.5.232...Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H '172.17.5.232' -U 'root' -P 'calvin' -v chassis power status'...
Done

Enabling/disabling telnet has no impact on fencing, but in order for fence_ipmilan to work, "IPMI over LAN" had to be enabled. Is this the expected behavior?
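For reference, the ipmitool command lines spawned in the transcripts above follow one fixed shape; based purely on those transcripts, they could be assembled like this sketch (build_ipmitool_cmd is an illustrative name, not fence_ipmilan's actual internals):

```python
# Hypothetical sketch of the ipmitool argv fence_ipmilan spawns, inferred
# from the "Spawning: ..." lines above; not taken from the agent's source.

def build_ipmitool_cmd(host, user, password, action):
    """Return the ipmitool argv for a chassis power action ('status', 'on', 'off')."""
    return ["/usr/bin/ipmitool", "-I", "lan",
            "-H", host, "-U", user, "-P", password,
            "-v", "chassis", "power", action]
```

Each fence operation in the transcripts runs a "status" first, then the actual "on"/"off", then "status" again to confirm.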
Yes, in order for IPMI to work on the iDRAC, the setting "IPMI over LAN" has to be enabled. According to the current iDRAC user's guide, this setting is disabled by default: http://support.dell.com/support/edocs/software/smdrac3/idrac/idrac22modular/en/ug/pdf/ug.pdf
I tested fence_ipmilan on the following configurations and everything works fine:

1) 1850 - DRAC4/I (rack server)
2) 2950 - DRAC5 (rack server)
3) R710 - iDRAC6 (rack server)
4) M600 - iDRAC6 (blade server)

NOTE: I tested these configurations without any load. I mean, I just booted the servers and ran the commands during POST itself.

If fence_drac is going to be removed from RHEL, I guess the Dell CMC is not going to be supported as a fencing agent?
Thanks
Praveen
I tested idrac6 Modular(M710), idrac Enterprise (Using dedicated NIC) and idrac Express (Using Shared NIC) with fence_ipmilan. Everything works fine. Thank you Praveen
~~ Attention Customers and Partners - RHEL 5.5 Beta is now available on RHN ~~ RHEL 5.5 Beta has been released! There should be a fix present in this release that addresses your request. Please test and report back results here, by March 3rd 2010 (2010-03-03) or sooner. Upon successful verification of this request, post your results and update the Verified field in Bugzilla with the appropriate value. If you encounter any issues while testing, please describe them and set this bug into NEED_INFO. If you encounter new defects or have additional patch(es) to request for inclusion, please clone this bug per each request and escalate through your support representative.
I installed the 5.5 Beta build and noticed that this issue is still not fixed. When I add a fencing configuration of type "Dell iDRAC" to a node, I still see that "fence_idrac" is the agent mentioned in the cluster.conf file. I thought this is supposed to be "fence_ipmilan"; please correct me if I got this wrong. After filling in the fencing details, I fenced a node and nothing happened. I get a message like "Unable to retrieve batch 911501329 status from 12.2.4.6:11111: fence_node:failed" in /var/log/messages (after enabling debugging in luci). While filling in the fencing details, I didn't check the "Use SSH" checkbox in Conga, just in case this detail matters. Praveen
Hi Praveen, so the problem is in the web GUI, not in the fence agent? If the answer is yes, then it would be better to open a new bug with the correct component.
The GUI is the one that seems to be at fault here. I will create a bug for the same. I verified that the fence_ipmilan agent works fine by itself. I verified that fence_ipmilan works on R710, M710 and 2950 servers, covering different form factors and generations. Praveen
I have created a new bug, #572514, against the behaviour of Conga while working with Dell iDRAC. Please check it. Is any more information required to turn off the "needinfo?" flag on this bug? Praveen
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0266.html