Conversation

@ventifus
Collaborator

@ventifus ventifus commented Dec 5, 2025

Derive all information from NetworkManager directly to avoid setting dnsmasq's name server to the node IP in some circumstances

Which issue this PR addresses:

Fixes ARO-23104

What this PR does / why we need it:

In some circumstances, dnsmasq.service's ExecStopPost fails to fire and we end up consuming our own NetworkManager configuration, so resolv.conf.dnsmasq contains the node's IP instead of the upstream DNS servers' IPs. To avoid that, we now read the DNS servers directly from NetworkManager's configuration as obtained by DHCP. As a result, we no longer need dnsmasq's ExecStopPost to delete the NetworkManager drop-in, and re-executing aro-dnsmasq-pre.sh is idempotent.
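As a rough illustration of the approach described above (not the actual contents of aro-dnsmasq-pre.sh), the sketch below derives upstream name servers purely from DHCP lease data as NetworkManager reports it. The function name `dhcp_dns_to_resolv` is hypothetical, and the input is assumed to look like the output of `nmcli -f DHCP4 device show <iface>`, where DNS servers appear in a `domain_name_servers = ...` option line.

```shell
#!/bin/bash
# Hypothetical sketch; the real aro-dnsmasq-pre.sh may work differently.

# Reads DHCP4 option lines on stdin (assumed format:
#   "DHCP4.OPTION[n]:   domain_name_servers = <ip> [<ip> ...]")
# and prints one "nameserver <ip>" line per DHCP-provided DNS server.
# The output depends only on the DHCP data, never on a previously
# generated file, so running it repeatedly yields the same result.
dhcp_dns_to_resolv() {
    awk -F' = ' '
        /domain_name_servers/ {
            n = split($2, ips, " ")
            for (i = 1; i <= n; i++) print "nameserver " ips[i]
        }'
}
```

On a node, this might be fed with something like `nmcli -f DHCP4 device show eth0 | dhcp_dns_to_resolv`, with the result written to resolv.conf.dnsmasq. The interface name `eth0` is an assumption. Because nothing is read back from dnsmasq's own drop-in, a missed ExecStopPost cannot leak the node IP into the result.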

Test plan for issue:

  • Install cluster with local RP
  • Scale up worker nodes
  • Canary install of cluster with UDR / invalid DNS

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

@ventifus
Collaborator Author

ventifus commented Dec 5, 2025

/azp run ci

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@mociarain mociarain left a comment

This looks like a good change, and based on the context in the JIRA I think it's good to merge as is, but I wonder if it's worth testing the operator change for "real". AFAIK @ehvs wrote this process up recently. I tried to find it in the wiki but gave up. Do you have the link?

I'll approve anyway but wdyt?

@ventifus
Collaborator Author

ventifus commented Dec 8, 2025

The best test for this is to install a UDR cluster with invalid DNS on its VNet. This requires a working gateway, so it will likely need Canary.

Contributor

@kimorris27 kimorris27 left a comment

+1 to Maitiu. I think it could merge as-is, but testing definitely won't hurt.

Collaborator

@hlipsig hlipsig left a comment

LGTM. Is a test in canary the only way, or can we install a dev cluster into production and validate that way? We can do a hive-less cluster install via local RP, and I know we can do it with a locally built ARO operator as well. Amber's done that in the past.

@ventifus
Collaborator Author

We need to test with egress lockdown enabled and a working gateway; I don't think we can do that in an environment less than canary.

5 participants