@olivierlemasle
Contributor

Proposed Commit Message

CloudStack: fix data-server DNS resolution

CloudStack DNS resolution should be done against 
the DNS search domain (with the final dot, DNS
resolution does not work with e.g. Fedora 34)

LP: #1942232

Additional Context

Test Steps

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@TheRealFalcon
Contributor

Hey @olivierlemasle , thanks for this PR. I have some questions over at https://bugs.launchpad.net/cloud-init/+bug/1942232 . I'll leave this PR open while we have a conversation over there.

@olivierlemasle
Contributor Author

Hi @onitake @joschi36,
As you recently updated CloudStack and cloud-init documentation regarding the CloudStack meta-data service (@onitake) and you added the data-server. DNS request to cloud-init (@joschi36), could you please look at my bug report in https://bugs.launchpad.net/cloud-init/+bug/1942232 ?

With Fedora 34, something changed compared to Fedora 33, and the data-server. request fails (data-server or the complete FQDN are still ok). Can you reproduce the issue? Why did you use data-server. and not data-server?

@onitake
Contributor

onitake commented Sep 2, 2021

This is a bit odd. I remember that we specifically had to add a dot, so data-server would not be searched in the network/VPC domain. Maybe this has changed on the CloudStack side?

But maybe I'm mistaken.

@TheRealFalcon
Contributor

It's hard to know the "right" solution without more context, but is there any harm in trying the FQDN first, and then trying the relative address if that fails?
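
The fallback suggested here can be sketched in Python as follows. This is only an illustration, not the code that was merged: the helper name `resolve_with_fallback` and the injectable `resolver` parameter are assumptions made for this example; only `socket.getaddrinfo` and `socket.gaierror` come from the standard library.

```python
import socket


def resolve_with_fallback(names, port=80, resolver=socket.getaddrinfo):
    """Return (name, addrinfo) for the first name in `names` that resolves.

    Hypothetical helper: names are tried in order, and a socket.gaierror
    for one name simply moves on to the next candidate.
    """
    last_error = None
    for name in names:
        try:
            return name, resolver(name, port)
        except socket.gaierror as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("no names given")
    # Every candidate failed: re-raise the last resolution error.
    raise last_error
```

Calling `resolve_with_fallback(["data-server.", "data-server"])` would try the absolute name first and only then the relative one, which is subject to the DNS search domain.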

@TheRealFalcon
Contributor

Given the discussion on https://bugs.launchpad.net/cloud-init/+bug/1942232 , it seems it was never intended for data-server to be a FQDN, so I'm ok merging this change. Thanks everybody!

@onitake
Contributor

onitake commented Sep 8, 2021

I don't fully agree. Yes, the justification given by @smoser for adding the dot didn't seem conclusive, but I'd prefer a clarification first.

@onitake
Contributor

onitake commented Sep 8, 2021

On another note: If the trailing dot in cloud-init is removed, the CloudStack documentation needs to be updated as well.

For reference: apache/cloudstack-documentation#132

@TheRealFalcon
Contributor

I see. I mistook your comments in launchpad as justification for removing the dot. We can wait to see if @smoser has any further information.

@onitake
Contributor

onitake commented Sep 9, 2021

@weizhouapache What do you think? You commented on the Launchpad issue that the dot should be removed.

IMHO, there is a risk that the wrong DNS server could be asked about the host configuration. Imagine for example someone writing Google's DNS servers and a custom domain into their VM's /etc/resolv.conf. This would cause the DNS request to never reach the virtual router, and consequently cloud-init not getting the correct data-server address. If someone impersonates the data-server and introduces a suitable record into public DNS, you'd have a serious security issue. Obviously, resolving data-server. wouldn't work either in this case, but at least there would be no risk.
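
To make this concern concrete, the sketch below lists the names a resolver would typically try for a given `resolv.conf`. It is a simplified model, assuming the default `ndots:1` behaviour, where a name without dots is tried with each search domain first and then as-is, and a name with a trailing dot is never searched. The helper names `parse_resolv_conf` and `candidate_names` are hypothetical, and `example.com` stands in for the custom domain.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword in ("search", "domain") and args:
            # Simplification: the last search/domain line wins.
            search = args
    return nameservers, search


def candidate_names(name, search):
    """List the absolute names that would be queried for `name`, in order."""
    if name.endswith("."):
        # Already absolute: the search list is not applied.
        return [name]
    return [f"{name}.{domain.rstrip('.')}." for domain in search] + [name + "."]
```

With `nameserver 8.8.8.8` and `search example.com`, the relative name `data-server` would first be sent to the public server as `data-server.example.com.`, which is the scenario described above.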

Perhaps this should be solved on the virtual router instead? If it always returns a correct result for data-server. (with the dot), then this wouldn't be an issue at all.

@olivierlemasle
Contributor Author

FYI, I did some additional tests in another environment (original tests in the Launchpad ticket):

  • CloudStack version: 4.15
  • VM instance: Ubuntu 20.04
  • both VPC and non-VPC networks

Again, data-server. was not resolved:

ubuntu@test:~$ host data-server.
Host data-server not found: 2(SERVFAIL)

ubuntu@test:~$ host data-server
data-server.cs2cloud.internal has address 10.0.100.1

ubuntu@test:~$ python3
Python 3.8.10 (default, Jun  2 2021, 10:49:15) 
>>> from socket import getaddrinfo
>>> getaddrinfo("data-server.", 80)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

>>> getaddrinfo("data-server", 80)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('10.0.100.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('10.0.100.1', 80)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('10.0.100.1', 80))]
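
The three tuples returned for `data-server` all carry the same address, one per socket type. A small helper, sketched here under the hypothetical name `unique_addresses`, reduces such a result to the distinct IP addresses; the sample data is copied from the output above.

```python
import socket


def unique_addresses(addrinfo):
    """Return the distinct host addresses in a getaddrinfo() result, in order."""
    seen = []
    for _family, _type, _proto, _canonname, sockaddr in addrinfo:
        host = sockaddr[0]  # sockaddr is (host, port) for AF_INET
        if host not in seen:
            seen.append(host)
    return seen


# Result of getaddrinfo("data-server", 80) as reported in the test above.
sample = [
    (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.100.1", 80)),
    (socket.AF_INET, socket.SOCK_DGRAM, 17, "", ("10.0.100.1", 80)),
    (socket.AF_INET, socket.SOCK_RAW, 0, "", ("10.0.100.1", 80)),
]
```

Here `unique_addresses(sample)` yields the single address `10.0.100.1`.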

@weizhouapache
Contributor

same results in ubuntu 18.04

(but it works in centos 5.5 created from default template in cloudstack)

@onitake
I tried adding "data-server." to /etc/hosts and reloading dnsmasq, but 'data-server.' still cannot be resolved in the Ubuntu 18.04 VM.

@TheRealFalcon
Contributor

Is there any reason we shouldn't do my previous suggestion?

trying the FQDN first, and then try the relative address if that fails?

@github-actions

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Sep 28, 2021
Contributor

@weizhouapache weizhouapache left a comment

code lgtm

@github-actions github-actions bot closed this Oct 6, 2021
@TheRealFalcon TheRealFalcon reopened this Oct 8, 2021
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Oct 8, 2021
@TheRealFalcon
Contributor

I don't think we want to let this one die just yet. @onitake , you had reservations about taking the relative approach. Do you still have them? If so, can you explain what they are? Would they be satisfied by my suggestion to try the name with the trailing dot first and then fall back to the relative name?

@weizhouapache
Contributor

@TheRealFalcon @onitake @olivierlemasle

I just had an idea in cloudstack. I will test it and update you next week.

@weizhouapache
Contributor

weizhouapache commented Oct 11, 2021

@TheRealFalcon @onitake @olivierlemasle

small update 1:
it seems to be caused by systemd-resolved.
If I add "nameserver 10.x.x.x" to /etc/resolv.conf, or make /etc/resolv.conf a symbolic link to /run/systemd/resolve/resolv.conf, the domain "data-server." can be resolved.

I am not going to fix it in cloudstack, as it is a cloudstack issue in my opinion.

refer to https://askubuntu.com/questions/1068131/ubuntu-18-04-local-domain-dns-lookup-not-working

update 2: no idea whether this is related to the DNSSEC NTA list below

# systemd-resolve --status
Global
         DNS Servers: 10.1.1.1
                      10.0.32.1
                      8.8.8.8
          DNS Domain: cs2cloud.cloud
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 2 (eth0)
      Current Scopes: none
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
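
One way to check whether a guest is in the situation described in update 1 is to see whether its `/etc/resolv.conf` lists only the systemd-resolved stub listener. The sketch below assumes the stub address `127.0.0.53` used by systemd-resolved; the function names are hypothetical, and the file content is passed in as a string so that nothing is read from the machine.

```python
STUB_ADDRESS = "127.0.0.53"  # systemd-resolved local stub listener


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_resolved_stub(resolv_conf_text):
    """True if every configured nameserver is the systemd-resolved stub."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and all(s == STUB_ADDRESS for s in servers)
```

On a guest, this could be called as `uses_resolved_stub(open("/etc/resolv.conf").read())`; a `True` result means queries go through systemd-resolved rather than directly to the virtual router.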

@TheRealFalcon
Contributor

@weizhouapache , "I am not going to fix it in cloudstack, as it is a cloudstack issue in my opinion." I'm not sure I understand what you mean by this. Was that first "cloudstack" supposed to say "cloud-init", or did you mean to say it's not a cloudstack issue?

@weizhouapache
Contributor

@TheRealFalcon
I think it is not an issue with the DNS server (it runs inside the CloudStack virtual router and is implemented by dnsmasq), so there is no need to change CloudStack.
The issue should be fixed in cloud-init, in the VM configuration (/etc/resolv.conf), or in systemd-resolved.

@TheRealFalcon
Contributor

I'm going to go ahead and merge this. We got feedback and evidence from multiple people that the current solution doesn't work but that this proposed solution does work. Additionally, I have asked for feedback multiple times about any possible reservations and/or workarounds and have not heard anything.

@TheRealFalcon TheRealFalcon merged commit 62c2a56 into canonical:main Oct 18, 2021
@weizhouapache
Contributor

@TheRealFalcon
thanks for merging !

@olivierlemasle
Contributor Author

Thank you @TheRealFalcon for merging this!

@olivierlemasle olivierlemasle deleted the fix-cs-dataserver branch October 18, 2021 16:54
@synergiator

synergiator commented Nov 2, 2021

side question: what should we expect regarding backward compatibility here, given that this has been tested only with ACS 4.15?

@weizhouapache
Contributor

Hi @synergiator
from my understanding,
if addrinfo = getaddrinfo("data-server.", 80) works, addrinfo = getaddrinfo("data-server", 80) works as well.
if addrinfo = getaddrinfo("data-server.", 80) does not work in some guest os, addrinfo = getaddrinfo("data-server", 80) still works.
I did not see any backward compatibility issue.
Please correct me if I am wrong.

@synergiator

synergiator commented Nov 4, 2021

thank you for the explanation - indeed the implementation seems to be independent of the ACS version.
p.s. I now see some discussion on the Apache CloudStack user mailing list where OS compatibility might indeed be a topic, but my question was about ACS backward compatibility.
