Bug 2073754

Summary: OCP 4.10 deployment IPv6 Address not included in node InternalIP list.
Product: OpenShift Container Platform Reporter: Greg Kopels <gkopels>
Component: oc Assignee: Nobody <nobody>
oc sub component: oc QA Contact: zhou ying <yinzhou>
Status: CLOSED WONTFIX Docs Contact:
Severity: high    
Priority: unspecified CC: bgalvani, bzvonar, dornelas, elevin, grajaiya, jligon, keyoung, mfojtik, miabbott, mrussell, nobody, nstielau, smilner, spresti, thaller
Version: 4.10 Keywords: Reopened, TestBlocker
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-10 16:14:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Greg Kopels 2022-04-10 07:56:03 UTC
Environment Setup:
- hybrid cluster 4.10 with 2 bare metal worker nodes
- Dual stack configured on the lab router: IPv6 RA, with a dnsmasq server for
  static IPv6 address allocation.

Description of problem:
When installing OCP version 4.10 with the above environment both worker nodes receive dualstack IP addresses but only one worker includes the IPv6 address in the InternalIP list. 


Version-Release number of selected component (if applicable):
On 4.10.5, 4.10.6

How reproducible:
Install version 4.10 on a dual stack environment. 

Steps to Reproduce:
1. Install OCP version 4.10 on a dual stack environment
2. Verify both worker nodes receive an IPv4 and IPv6 address
3. Verify that the second worker doesn't include IPv6 in the InternalIP list with the oc describe node command.

Actual results:
Worker Node 1:
Addresses:
  InternalIP:  10.46.56.13
  InternalIP:  2620:52:0:2e38::113
  Hostname:  

Worker Node 2:
Addresses:
  InternalIP:  10.46.56.14
  Hostname:    helix14.lab.eng.tlv2.redhat.com

Expected results:
Expected to see on both nodes IPv4 and IPv6 addresses in the InternalIP list.

Additional info:
When installing with OCP version 4.9.26, it works as expected.
Last week I was able to install with 4.10.5. This week it is not working.
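
The check in step 3 can also be scripted. The following is a minimal sketch, assuming a node list shaped like the output of `oc get nodes -o json`; the function names and the sample node names are hypothetical and only mirror the "Actual results" above.

```python
import ipaddress


def internal_ip_families(node):
    """Return the set of IP versions (4 and/or 6) among a node's InternalIP entries."""
    families = set()
    for entry in node.get("status", {}).get("addresses", []):
        if entry.get("type") == "InternalIP":
            families.add(ipaddress.ip_address(entry["address"]).version)
    return families


def nodes_missing_dual_stack(node_list):
    """Map node name -> sorted list of IP versions absent from its InternalIP list."""
    missing = {}
    for node in node_list.get("items", []):
        gap = {4, 6} - internal_ip_families(node)
        if gap:
            missing[node["metadata"]["name"]] = sorted(gap)
    return missing


# Sample data mirroring the two workers above (node names are illustrative).
sample = {
    "items": [
        {
            "metadata": {"name": "worker-1"},
            "status": {"addresses": [
                {"type": "InternalIP", "address": "10.46.56.13"},
                {"type": "InternalIP", "address": "2620:52:0:2e38::113"},
            ]},
        },
        {
            "metadata": {"name": "worker-2"},
            "status": {"addresses": [
                {"type": "InternalIP", "address": "10.46.56.14"},
                {"type": "Hostname", "address": "helix14.lab.eng.tlv2.redhat.com"},
            ]},
        },
    ]
}

print(nodes_missing_dual_stack(sample))  # {'worker-2': [6]}
```

An empty result would mean every node reports both address families as InternalIP.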

Comment 8 Steven Presti 2022-04-27 17:29:44 UTC
Hello, would it be possible to get the full journal of the broken node that includes the address assignment?

Comment 9 Steven Presti 2022-04-27 18:46:20 UTC
@thaller @bgalvani I wanted to reach out to both of you for your expertise on NetworkManager.

Comment 10 Thomas Haller 2022-04-28 08:47:22 UTC
yes, it's probably related to the linked bugs.

I have trouble understanding the log from comment 3.

Would it be possible to collect complete `level=TRACE` logs of NetworkManager? See https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/main/contrib/fedora/rpm/NetworkManager.conf#L27 for hints about logging. Also, the logfile only spans a few seconds. Does this show the relevant part? Consider attaching the entire log (of the past minutes), possibly with additional hints, such as which IP addresses you would expect and on which interface.

Comment 11 Greg Kopels 2022-05-02 08:06:30 UTC
Hi I will need to run a redeploy of the cluster.  I will run it at the end of the day and update tomorrow.

Comment 16 Greg Kopels 2022-05-11 12:58:33 UTC
Added full journals for both nodes

Comment 17 Micah Abbott 2022-06-03 13:47:53 UTC
@thaller Reporter has provided requested journals, etc.  Could you have another look and see if there is enough data collected to perform additional triage?

Comment 21 Thomas Haller 2022-06-16 21:13:58 UTC
Hi,

Btw, the attached files from comment 14 and 15 are not at `level=TRACE`. That might be useful...

But I am confused. Could you give some guidance as to what is happening?
What does it mean that the "InternalIP" ipv6 address is missing? Should this reflect the actually configured addresses on the interface? What does `ip addr` say at this point?

In the non-working log, we see towards the end

  [1652271045.8020] dhcp6 (br-ex): state changed unknown -> bound, address=2620:52:0:2e38::114

so it would seem that this interface should be up with the expected(?) IPv6 address.
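
To find such lines in a longer journal, a small filter may help. This is only a sketch: the pattern is modeled on the single log line quoted above, other NetworkManager versions may word the message differently, and the function name is hypothetical.

```python
import re

# Pattern modeled on the one dhcp6 line quoted in this comment (an assumption
# about the general message format, not a documented NetworkManager contract).
DHCP6_BOUND = re.compile(
    r"dhcp6 \((?P<iface>[^)]+)\): state changed \S+ -> bound, "
    r"address=(?P<addr>\S+)"
)


def dhcp6_bound_addresses(log_text):
    """Return (interface, address) pairs for every DHCPv6 'bound' transition."""
    return [(m.group("iface"), m.group("addr"))
            for m in DHCP6_BOUND.finditer(log_text)]


line = ("  [1652271045.8020] dhcp6 (br-ex): state changed unknown -> bound, "
        "address=2620:52:0:2e38::114")
print(dhcp6_bound_addresses(line))  # [('br-ex', '2620:52:0:2e38::114')]
```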

Comment 22 Greg Kopels 2022-06-22 05:56:45 UTC
Hi,

As a reminder: we have a hybrid cluster with two BM workers. The br-ex main interfaces are configured with dnsmasq. Both workers receive IPv4 and IPv6 addresses.
However, when I run oc describe node on the workers, only one of them has both IPv4 and IPv6 addresses as InternalIP.

Worker0
Addresses:
  InternalIP:  10.46.56.13
  InternalIP:  2620:52:0:2e38::113
  Hostname:    helix13.lab.eng.tlv2.redhat.com

Worker1
Addresses:
  InternalIP:  10.46.56.14
  Hostname:    helix14.lab.eng.tlv2.redhat.com

Worker1 had an IPv6 address 2620:52:0:2e38::114 on the br-ex interface.

Not sure I answered your question.
Feel free to ping me on Slack

Greg

Comment 23 Thomas Haller 2022-06-22 07:24:07 UTC
Could you collect a complete `level=TRACE` log that shows the boot? Otherwise, detailed information about IP addresses is not logged, and it cannot be seen why an IP address might be missing. Debug logging can be enabled by setting `rd.debug` on the kernel command line and booting. Is there a difficulty reproducing the issue?

Comment 24 Greg Kopels 2022-06-27 11:13:22 UTC
Hi as soon as I have a free cluster I will rerun the deployment.  Can you send me a doc on how to correctly run `level=TRACE` ?
Thanks

Comment 25 Thomas Haller 2022-06-27 13:49:18 UTC
(In reply to Greg Kopels from comment #24)
> Hi as soon as I have a free cluster I will rerun the deployment.  Can you
> send me a doc on how to correctly run `level=TRACE` ?
> Thanks

This is NM in initrd, is that right? Then pass `rd.debug` on the kernel command line. That is documented in `man dracut.cmdline`.


Alternatively, how `level=TRACE` works is documented in `man NetworkManager.conf` and (more to the point) see the example at https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/main/contrib/fedora/rpm/NetworkManager.conf#L27 . That is mainly useful if you enable debug logging in real-root.
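
For the real-root case, the `[logging]` settings from that example can be rendered as a small configuration snippet. This is a sketch only: `level=TRACE` and `domains=ALL` come from the linked example file, while the helper name is hypothetical, and where the resulting text is placed (for instance a drop-in under `/etc/NetworkManager/conf.d/`) and how NetworkManager is restarted on the node are left as assumptions.

```python
import configparser
import io


def render_trace_logging_conf():
    """Render a [logging] section enabling TRACE logging for all domains."""
    parser = configparser.ConfigParser()
    parser["logging"] = {"level": "TRACE", "domains": "ALL"}
    buf = io.StringIO()
    # Write "key=value" without spaces, matching the style of the linked example.
    parser.write(buf, space_around_delimiters=False)
    return buf.getvalue()


print(render_trace_logging_conf())
```

The comments in the linked example also warn that the logging daemon may rate-limit verbose output (see `man journald.conf`), so rate limiting is worth checking before collecting the logs.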

Comment 26 Greg Kopels 2022-07-06 13:07:27 UTC
Hi in attempting to deploy a dual stack cluster with OCP 4.11 we are hitting a new bug blocking us from further investigation of this bz.
 https://bugzilla.redhat.com/show_bug.cgi?id=2102158

Comment 27 Greg Kopels 2022-07-06 15:29:36 UTC
Sorry, please ignore the above comment; I made an incorrect reference to the 4.11 bug. We currently don't have a free cluster to deploy 4.10 dualstack. I believe one will be free tomorrow or by Thursday, and then I will supply you with the trace logs. Thanks

Comment 28 Greg Kopels 2022-07-12 13:43:55 UTC
I will have a cluster today to start deploying a 4.10 dualstack.  I will reach out to Thomas during deployment.

Comment 30 Timothée Ravier 2022-07-27 11:22:19 UTC
You're talking about the output of an `oc node ...` command here, but what are the IP addresses on the nodes? Can you give us the output of `ip a`?

Comment 31 Greg Kopels 2022-08-11 07:35:05 UTC
Hi, I am still blocked from deploying a dualstack cluster by bz https://bugzilla.redhat.com/show_bug.cgi?id=2102158

Comment 32 Timothée Ravier 2022-08-11 10:36:18 UTC
Sure. However we still need some info here to make progress so I'm keeping the NEEDINFO.

Comment 33 RHCOS Bug Triage 2022-08-25 14:14:01 UTC
We are unable to make progress on this bug without the requested information, so the bug is now being closed. If the problem persists, please provide the requested information and reopen the bug.

Comment 34 elevin 2023-01-05 16:20:30 UTC
(In reply to Timothée Ravier from comment #30)
> You're talking about the output of an `oc node ...` command here, but what
> are the IP addresses on the nodes? Can you give us the output of `ip a`?

PROBLEMATIC NODE:
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 34:48:ed:f3:88:c4 brd ff:ff:ff:ff:ff:ff
    inet 10.46.56.13/24 brd 10.46.56.255 scope global dynamic noprefixroute br-ex
       valid_lft 2689sec preferred_lft 2689sec
    inet 10.46.56.72/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:2e38::113/128 scope global dynamic noprefixroute 
       valid_lft 2459sec preferred_lft 2459sec
    inet6 fe80::3648:edff:fef3:88c4/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
------------------

but (`oc node ...`):
status:
  addresses:
  - address: 10.46.56.13
    type: InternalIP
  - address: helix13.lab.eng.tlv2.redhat.com
    type: Hostname
====================================================

ANOTHER NODE:
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 34:48:ed:f3:e2:2c brd ff:ff:ff:ff:ff:ff
    inet 10.46.56.14/24 brd 10.46.56.255 scope global dynamic noprefixroute br-ex
       valid_lft 2135sec preferred_lft 2135sec
    inet6 2620:52:0:2e38::114/128 scope global dynamic noprefixroute 
       valid_lft 2233sec preferred_lft 2233sec
    inet6 fe80::3648:edff:fef3:e22c/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

-----------------
`oc node ...`:
status:
  addresses:
  - address: 10.46.56.14
    type: InternalIP
  - address: 2620:52:0:2e38::114
    type: InternalIP
  - address: helix14.lab.eng.tlv2.redhat.com
    type: Hostname
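
The comparison made by hand above (global-scope addresses on br-ex versus the InternalIP entries of the node object) can be sketched as follows. It assumes `ip a` text formatted like the output pasted in this comment; the function names are hypothetical, and the check compares address families rather than individual addresses, since not every address on the interface is expected to be an InternalIP.

```python
import ipaddress
import re

# Matches lines such as "inet6 2620:52:0:2e38::113/128 scope global dynamic ..."
GLOBAL_ADDR = re.compile(
    r"^[ \t]*inet6?[ \t]+(?P<addr>[0-9A-Fa-f:.]+)/\d+\b.*\bscope global\b",
    re.MULTILINE,
)


def global_families(ip_a_text):
    """Return the IP versions that have at least one global-scope address."""
    return {ipaddress.ip_address(m.group("addr")).version
            for m in GLOBAL_ADDR.finditer(ip_a_text)}


def families_missing_from_node(ip_a_text, internal_ips):
    """Return IP versions configured on the host but absent from InternalIP."""
    reported = {ipaddress.ip_address(a).version for a in internal_ips}
    return sorted(global_families(ip_a_text) - reported)


# Excerpt of the "PROBLEMATIC NODE" output above.
problematic = """\
    inet 10.46.56.13/24 brd 10.46.56.255 scope global dynamic noprefixroute br-ex
    inet 10.46.56.72/32 scope global br-ex
    inet6 2620:52:0:2e38::113/128 scope global dynamic noprefixroute
    inet6 fe80::3648:edff:fef3:88c4/64 scope link noprefixroute
"""

print(families_missing_from_node(problematic, ["10.46.56.13"]))  # [6]
```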

Comment 36 Timothée Ravier 2023-01-09 13:58:51 UTC
So if I understand correctly, this is a problem with the output of `oc`, not the IP address set on the node itself. Redirecting to the `oc` team.

Comment 37 Greg Kopels 2023-02-07 15:47:50 UTC
I am rerunning the test with latest 4.10 OCP

Comment 41 Greg Kopels 2023-02-07 19:23:13 UTC
OCP 4.10.47
Still the same issue:
Deployed a dual stack cluster

Worker 0
oc describe node helix13.lab.eng.tlv2.redhat.com
Annotations:        k8s.ovn.org/host-addresses: ["10.46.56.13","2620:52:0:2e38::113"]

* But internal IP address shows only IPv4
Addresses:
  InternalIP:  10.46.56.13
  Hostname:    helix13.lab.eng.tlv2.redhat.com

Worker 1
 oc describe node helix14.lab.eng.tlv2.redhat.com

Annotations:        k8s.ovn.org/host-addresses: ["10.46.56.14","10.46.56.72","2620:52:0:2e38::114"]
Addresses:
  InternalIP:  10.46.56.14
  Hostname:    helix14.lab.eng.tlv2.redhat.com