Bug 1551403
Summary: FCoE is not initiated on boot with lldp enabled
Product: [oVirt] vdsm
Component: Services
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 4.19.41
Target Milestone: ovirt-4.3.0
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Story Points: ---
Cloned To: 1636254 (view as bug list)
Last Closed: 2019-03-13 16:37:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: Storage
Cloudforms Team: ---
Reporter: Mark Keir <mkeir>
Assignee: Eyal Shenitzky <eshenitz>
QA Contact: Avihai <aefrat>
CC: aefrat, bugs, danken, dholler, dougsland, ebenahar, gveitmic, mkeir, tnisan
Flags: rule-engine: ovirt-4.3+, ylavi: exception+
Bug Blocks: 1636254
Description
Mark Keir
2018-03-05 04:52:59 UTC
Do you have the fcoe_before_network_setup hook installed? Can you verify it is executable? Can you please attach vdsm's logs?

[root@rhevh-13 ~]# find /usr/share/vdsm -name fcoe_before_network_setup.py
[root@rhevh-13 ~]# yum search fcoe
Loaded plugins: imgbased-persist, package_upload, product-id, search-disabled-repos, subscription-manager
========================= N/S matched: fcoe =========================
vdsm-hook-fcoe.noarch : Hook to enable FCoE support
fcoe-utils.x86_64 : Fibre Channel over Ethernet utilities

  Name and summary matches only, use "search all" for everything.
[root@rhevh-13 ~]# rpm -q vdsm-hook-fcoe
vdsm-hook-fcoe-4.19.45-1.el7ev.noarch

If the reference is to https://github.com/oVirt/vdsm/blob/master/vdsm_hooks/fcoe/fcoe_before_network_setup.py, no, I don't find that on the system.

Hey Mark,
It should go here: /usr/libexec/vdsm/hooks/before_network_setup/50_fcoe

[root@rhevh-13 ~]# ls -la /usr/libexec/vdsm/hooks/before_network_setup/50_fcoe
-rwxr-xr-x. 1 root root 6555 Jan 16 11:10 /usr/libexec/vdsm/hooks/before_network_setup/50_fcoe

Hi Mark,
Can you please provide all the relevant logs?

The related log from the time of the first system registration is found at https://drive.google.com/file/d/1_2o_Gztq6TFEgHX4oDOQvXddREfd3aaI/view?usp=sharing

BR
Mark

PS. Apologies for the delay, I was on PTO.

(In reply to Mark Keir from comment #6)
> The related log from the time of the first system registration is found at
> https://drive.google.com/file/d/1_2o_Gztq6TFEgHX4oDOQvXddREfd3aaI/view?usp=sharing
>
> BR
> Mark
>
> PS. Apologies for the delay, I was on PTO.

Can I use your env to investigate this issue? Can you please supply the env details?

What form of access would you require?

This is our production cluster in BOS, home to Errata, Beaker, Brew, Gerrit etc.
https://rhvm.infra.prod.eng.bos.redhat.com

I am about to rebuild another system and go through the setup process. Is there any additional data capture I can do to trap better information for this issue?

BR
Mark

This is an additional set of information: https://drive.google.com/open?id=1DD2ozokNqDprGXiL4OoDWX0npt3YYYbX

If you download this and run it in a browser, you can see a recording from just after imaging and initial fcoe setup, through registration to RHVM, with a reboot sequence and check afterwards.

(In reply to Mark Keir from comment #8)
> What form of access would you require?
>
> This is our production cluster in BOS, home to Errata, Beaker, Brew, Gerrit
> etc.
>
> https://rhvm.infra.prod.eng.bos.redhat.com
>
> I am about to rebuild another system and go through the setup process. Is
> there any additional data capture I can do to trap better information for
> this issue?
>
> BR
> Mark

Thanks. I need access to the environment as admin, and I also need access to the engine machine and the VDSM machine. I will contact you directly by email for details around access credentials and conditions of use.

(In reply to Mark Keir from comment #0)

Germano, please note that Mark is not using RHV's fcoe hook. He is configuring his /etc/fcoe/* with his own Ansible playbook:

> Set up FCoE via process in
> https://gitlab.infra.prod.eng.rdu2.redhat.com/ansible-roles/rhvm-server/tree/master/rhvh-fcoe
> (derived from
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/fcoe-config)

I don't believe that this is related to your bug, Mark, but I must comment that I see there that you are writing cfg files. That should not be done if the interfaces are to be managed by RHV (as RHV is going to rewrite them).

Does your Ansible playbook work fine when applied to non-RHV hosts? Do you see failures to start networking during boot time?

Created attachment 1428701 [details]
rhevh12 journalctl
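The hook check discussed earlier in this thread (is the vdsm-hook-fcoe script installed and executable at the expected path?) can be repeated with a small script. This is a sketch only: the `check_hook` helper name is ours, while the hook path is the one given in the comments above.

```shell
#!/bin/sh
# Sketch: verify the vdsm FCoE before_network_setup hook is installed
# and executable. The path is taken from this thread; the helper name
# check_hook is hypothetical.
check_hook() {
    # $1: hook path; prints "present" if it exists and is executable,
    # "missing" otherwise
    if [ -x "$1" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

check_hook /usr/libexec/vdsm/hooks/before_network_setup/50_fcoe
```

On a host where the hook is absent, reinstalling the vdsm-hook-fcoe package should restore it.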
(In reply to Dan Kenigsberg from comment #12)
> (In reply to Mark Keir from comment #0)
>
> Germano, please note that Mark is not using RHV's fcoe hook. He is
> configuring his /etc/fcoe/* with his own Ansible playbook:
>
> > Set up FCoE via process in
> > https://gitlab.infra.prod.eng.rdu2.redhat.com/ansible-roles/rhvm-server/tree/master/rhvh-fcoe
> > (derived from
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/fcoe-config)
>
> I don't believe that this is related to your bug, Mark, but I must comment
> that I see there that you are writing cfg files. That should not be done if
> the interfaces are to be managed by RHV (as RHV is going to rewrite them).
>
> Does your Ansible playbook work fine when applied to non-RHV hosts? Do you
> see failures to start networking during boot time?

I don't believe we have any standard RHEL7 systems using FCoE. I will ask.

This is chicken and egg, Dan: we have to get an FCoE connection to storage in order to register and activate the machines, as they don't see the storage domains otherwise. We have no choice except to create this setup manually. It is the same setup, on the same hosts, that ran RHEVH 3.x (documented in https://docs.engineering.redhat.com/pages/viewpage.action?pageId=42939508) and worked for several years.

Hey Mark,
Are the 'lldpad' and 'fcoe' services active after restarting the host?

Eyal,
You have full access to the RHVH host. Are you unable to test yourself? If you no longer need access to conduct independent investigations, please advise and I will take the production resources back.
Mark

The lldpad and fcoe services are started after reboot. Below is from a system upgraded to RHVH 4.2.5 today.
[root@rhevh-16 ~]# rpm -qa 'redhat-virtualization-host-image*'
redhat-virtualization-host-image-update-4.2-20180724.0.el7_5.noarch
redhat-virtualization-host-image-update-placeholder-4.2-5.0.el7.noarch
[root@rhevh-16 ~]# hostnamectl status
   Static hostname: rhevh-16.infra.prod.eng.bos.redhat.com
         Icon name: computer-server
           Chassis: server
        Machine ID: ae2a206dbb4f45a0834f61141455fd63
           Boot ID: 5cbcba1406754b018e8398347f815019
  Operating System: Red Hat Virtualization Host 4.2.5 (el7.5)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:beta:hypervisor
            Kernel: Linux 3.10.0-862.9.1.el7.x86_64
      Architecture: x86-64
[root@rhevh-16 ~]# systemctl status lldpad fcoe
● lldpad.service - Link Layer Discovery Protocol Agent Daemon.
   Loaded: loaded (/usr/lib/systemd/system/lldpad.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-01 04:49:57 UTC; 4min 47s ago
 Main PID: 3272 (lldpad)
    Tasks: 1
   Memory: 372.0K
   CGroup: /system.slice/lldpad.service
           └─3272 /usr/sbin/lldpad -t

Aug 01 04:49:57 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Started Link Layer Discovery Protocol Agent Daemon..
Aug 01 04:49:57 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Starting Link Layer Discovery Protocol Agent Daemon....
Aug 01 04:52:12 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available
Aug 01 04:52:13 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available
Aug 01 04:52:13 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available
Aug 01 04:52:18 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available
Aug 01 04:52:19 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available
Aug 01 04:52:19 rhevh-16.infra.prod.eng.bos.redhat.com lldpad[3272]: recvfrom(Event interface): No buffer space available

● fcoe.service - Open-FCoE Inititator.
   Loaded: loaded (/usr/lib/systemd/system/fcoe.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-01 04:49:57 UTC; 4min 47s ago
  Process: 3369 ExecStart=/usr/sbin/fcoemon $FCOEMON_OPTS (code=exited, status=0/SUCCESS)
  Process: 3357 ExecStartPre=/sbin/modprobe -qa $SUPPORTED_DRIVERS (code=exited, status=0/SUCCESS)
 Main PID: 3380 (fcoemon)
    Tasks: 1
   Memory: 144.0K
   CGroup: /system.slice/fcoe.service
           └─3380 /usr/sbin/fcoemon --syslog

Aug 01 04:49:57 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Starting Open-FCoE Inititator....
Aug 01 04:49:57 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Started Open-FCoE Inititator..
Aug 01 04:52:14 rhevh-16.infra.prod.eng.bos.redhat.com fcoemon[3380]: fip_send_vlan_request: error 100 Network is down
Aug 01 04:52:14 rhevh-16.infra.prod.eng.bos.redhat.com fcoemon[3380]: fip_send_vlan_request: sendmsg error
Aug 01 04:52:14 rhevh-16.infra.prod.eng.bos.redhat.com fcoemon[3380]: fip_send_vlan_request: error 100 Network is down
Aug 01 04:52:14 rhevh-16.infra.prod.eng.bos.redhat.com fcoemon[3380]: fip_send_vlan_request: sendmsg error
[root@rhevh-16 ~]# fcoeadm -i
    Description:      NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
    Revision:         10
    Manufacturer:     Broadcom Limited
    Serial Number:    000E1EB71040
    Driver:           bnx2x 1.712.30-0
    Number of Ports:  1

        Symbolic Name:     bnx2fc (QLogic BCM57810) v2.11.8 over p2p1_4.1002-fco
        OS Device Name:    host15
        Node Name:         0x200018fb7b731258
        Port Name:         0x200118fb7b731258
        Fabric Name:       0x100050eb1a292694
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2048 bytes
        FC-ID (Port ID):   0x011001
        State:             Online

    Description:      NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
    Revision:         10
    Manufacturer:     Broadcom Limited
    Serial Number:    000E1EB71040
    Driver:           bnx2x 1.712.30-0
    Number of Ports:  1

        Symbolic Name:     bnx2fc (QLogic BCM57810) v2.11.8 over p2p2_4.1002-fco
        OS Device Name:    host16
        Node Name:         0x200018fb7b73125b
        Port Name:         0x200118fb7b73125b
        Fabric Name:       0x100050eb1a292a94
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2048 bytes
        FC-ID (Port ID):   0x011001
        State:             Online

[root@rhevh-16 ~]# systemctl reboot
Connection to rhevh-16.infra.prod.eng.bos.redhat.com closed by remote host.
Connection to rhevh-16.infra.prod.eng.bos.redhat.com closed.
[mkeir@mkeir ~]$ ssh root.prod.eng.bos.redhat.com
root.prod.eng.bos.redhat.com's password:
Last login: Wed Aug  1 04:51:12 2018 from dhcp-40-233.bne.redhat.com

node status: OK
See `nodectl check` for more information

Admin Console: https://10.19.220.21:9090/

[root@rhevh-16 ~]# hostnamectl status
   Static hostname: rhevh-16.infra.prod.eng.bos.redhat.com
         Icon name: computer-server
           Chassis: server
        Machine ID: ae2a206dbb4f45a0834f61141455fd63
           Boot ID: 5725b22ebd8d4274a7c30b67f1aefbb6
  Operating System: Red Hat Virtualization Host 4.2.5 (el7.5)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:7.5:beta:hypervisor
            Kernel: Linux 3.10.0-862.9.1.el7.x86_64
      Architecture: x86-64
[root@rhevh-16 ~]# systemctl status lldpad fcoe
● lldpad.service - Link Layer Discovery Protocol Agent Daemon.
   Loaded: loaded (/usr/lib/systemd/system/lldpad.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-01 05:00:01 UTC; 39s ago
 Main PID: 3783 (lldpad)
    Tasks: 1
   Memory: 348.0K
   CGroup: /system.slice/lldpad.service
           └─3783 /usr/sbin/lldpad -t

Aug 01 05:00:01 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Started Link Layer Discovery Protocol Agent Daemon..
Aug 01 05:00:01 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Starting Link Layer Discovery Protocol Agent Daemon....

● fcoe.service - Open-FCoE Inititator.
   Loaded: loaded (/usr/lib/systemd/system/fcoe.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2018-08-01 05:00:01 UTC; 39s ago
  Process: 3851 ExecStart=/usr/sbin/fcoemon $FCOEMON_OPTS (code=exited, status=0/SUCCESS)
  Process: 3849 ExecStartPre=/sbin/modprobe -qa $SUPPORTED_DRIVERS (code=exited, status=0/SUCCESS)
 Main PID: 3853 (fcoemon)
    Tasks: 1
   Memory: 128.0K
   CGroup: /system.slice/fcoe.service
           └─3853 /usr/sbin/fcoemon --syslog

Aug 01 05:00:01 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Starting Open-FCoE Inititator....
Aug 01 05:00:01 rhevh-16.infra.prod.eng.bos.redhat.com systemd[1]: Started Open-FCoE Inititator..
[root@rhevh-16 ~]# fcoeadm -i
fcoeadm: No action was taken
Try 'fcoeadm --help' for more information.
[root@rhevh-16 ~]# systemctl restart network
[root@rhevh-16 ~]# fcoeadm -i
    Description:      NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
    Revision:         10
    Manufacturer:     Broadcom Limited
    Serial Number:    000E1EB71040
    Driver:           bnx2x 1.712.30-0
    Number of Ports:  1

        Symbolic Name:     bnx2fc (QLogic BCM57810) v2.11.8 over p2p1_4.1002-fco
        OS Device Name:    host15
        Node Name:         0x200018fb7b731258
        Port Name:         0x200118fb7b731258
        Fabric Name:       0x100050eb1a292694
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2048 bytes
        FC-ID (Port ID):   0x011001
        State:             Online

    Description:      NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function
    Revision:         10
    Manufacturer:     Broadcom Limited
    Serial Number:    000E1EB71040
    Driver:           bnx2x 1.712.30-0
    Number of Ports:  1

        Symbolic Name:     bnx2fc (QLogic BCM57810) v2.11.8 over p2p2_4.1002-fco
        OS Device Name:    host16
        Node Name:         0x200018fb7b73125b
        Port Name:         0x200118fb7b73125b
        Fabric Name:       0x100050eb1a292a94
        Speed:             10 Gbit
        Supported Speed:   1 Gbit, 10 Gbit
        MaxFrameSize:      2048 bytes
        FC-ID (Port ID):   0x011001
        State:             Online

There is no FCoE environment available to test and investigate the flow. Returning the status to NEW until a setup is available.

There is a workaround for the similar bug 1636254, which indicates that bug 1623904 is related. This would mean that this bug can be reproduced by creating a bond on a Broadcom Limited NetXtreme II BCM57800 or BCM57810 with lldpad adminStatus=rx, maybe even without FCoE.

Created attachment 1521301 [details]
validation that disabling lldp on the interface resolves the initial problem
Created attachment 1521302 [details]
validation that disabling lldp on the interface resolves the initial problem
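Regarding the reproduction hint above (a bond on a Broadcom NetXtreme II BCM57800/BCM57810 with lldpad adminStatus=rx): the affected hosts can be spotted by finding interfaces bound to the bnx2x driver via sysfs. A sketch under stated assumptions — the `ifaces_with_driver` helper is hypothetical, not part of vdsm or fcoe-utils:

```shell
#!/bin/sh
# Sketch (hypothetical helper): list network interfaces bound to a given
# kernel driver, e.g. bnx2x for the Broadcom NetXtreme II adapters whose
# on-chip DCBX/LLDP client is discussed in this bug.
ifaces_with_driver() {
    # $1: sysfs net directory (normally /sys/class/net), $2: driver name
    for d in "$1"/*; do
        # skip virtual interfaces (lo, bridges, ...) with no backing device
        [ -e "$d/device/driver" ] || continue
        drv=$(basename "$(readlink -f "$d/device/driver")")
        [ "$drv" = "$2" ] && basename "$d"
    done
    return 0
}

ifaces_with_driver /sys/class/net bnx2x
```

On a host without such an adapter the script simply prints nothing.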
Elad, from my point of view attachment 1521301 [details] shows the validation for this bug. Do you agree?
I suppose that disabling lldp has its implications. As I'm not an expert, I suggest checking what the effect of disabling it is and validating this against different switch vendors. Anyway, please change the scope of this bug to be 'FCoE is not initiated on boot with lldp enabled'.

(In reply to Elad from comment #23)
> I suppose that disabling lldp has its implications. As I'm not an expert, I

This is a valuable thought. Do you already have an idea which implications disabling lldp could have?

According to https://www.kernel.org/doc/Documentation/scsi/bnx2fc.txt:

> ** Broadcom FCoE capable devices implement a DCBX/LLDP client on-chip. Only one
> LLDP client is allowed per interface. For proper operation all host software
> based DCBX/LLDP clients (e.g. lldpad) must be disabled. To disable lldpad on a
> given interface, run the following command:
>
> lldptool set-lldp -i <interface_name> adminStatus=disabled

manually disabling the LLDP receiving is the correct way. So lldp would not be disabled on the interface, because it is handled in hardware instead of software.

> suggest to check what's the effect of disabling it and to validate this
> against different switch vendors.

What effects would you check and how would you validate?

> Anyway, please change the scope of this bug to be 'FCoE is not initiated on
> boot with lldp enabled'.

Done.

> What effects would you check and how would you validate?
Nothing that I'm aware of from the storage side.
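The bnx2fc documentation quoted above boils down to one lldptool invocation per Broadcom interface. A minimal sketch follows; the `disable_sw_lldp_cmd` helper is ours and only prints the command (so the sketch is safe to run on machines without lldpad installed), and `eno6` is just an example interface name:

```shell
#!/bin/sh
# Sketch: construct the lldptool command from the bnx2fc doc, which
# disables the host software LLDP agent on one interface so the
# adapter's on-chip DCBX/LLDP client can operate.
disable_sw_lldp_cmd() {
    # $1: interface name; print the command instead of executing it
    printf 'lldptool set-lldp -i %s adminStatus=disabled\n' "$1"
}

disable_sw_lldp_cmd eno6
```

To apply it for real, pipe the output to `sh` (or drop the `printf` wrapper) on a host where lldpad is running.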
Verification done on ovirt-engine 4.3.2-0.1. The scenario is taken from prior comment #29:

[root@green-vdse ~]# lldptool set-lldp -i eno6 adminStatus=disabled
adminStatus = disabled
[root@green-vdse ~]# lldptool get-lldp -i eno6 adminStatus
adminStatus=disabled
[root@green-vdse ~]# systemctl restart vdsmd

Checking that adminStatus=disabled remains across a vdsm restart and a reboot:

[root@green-vdse ~]# lldptool get-lldp -i eno6 adminStatus
adminStatus=disabled

This bugzilla is included in the oVirt 4.3.0 release, published on February 4th 2019.

Since the problem described in this bug report should be resolved in the oVirt 4.3.0 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.