K3S Series Articles - 5G IoT Gateway Device POD Access Error DNS 'I/O Timeout' Analysis and Resolution

This article was last updated on February 7, 2024.

Introduction

Overview of the problem

On 2022-06-06, K3S Server was installed on a 5G IoT gateway device, but the Pods on it could not reach Internet addresses. The CoreDNS log showed the following:

...
[ERROR] plugin/errors: 2 update.traefik.io. A: read udp 10.42.0.3:38545->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 update.traefik.io. AAAA: read udp 10.42.0.3:38990->8.8.8.8:53: i/o timeout
...

That is, DNS queries are forwarded to the 8.8.8.8 DNS server and time out.
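
A quick way to confirm that this upstream is simply unreachable from the gateway's network is to query 8.8.8.8 directly from the Node. This is a minimal diagnostic sketch, assuming dig (from dnsutils/bind-utils) is installed; it was not part of the original troubleshooting capture:

# Query Google's public DNS directly with a short timeout.
# A timeout here means this network does not allow reaching 8.8.8.8:53 at all,
# so any resolv.conf pointing there is useless.
$ dig @8.8.8.8 update.traefik.io +time=2 +tries=1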

As a result, Pods that need network access at startup cannot start normally and repeatedly go into CrashLoopBackOff, as shown below:

Pods that need network access at startup in CrashLoopBackOff

However, the same addresses can be accessed normally directly from the Node, as shown below:

Normal access directly from the Node

Environment information

  1. Hardware: 5G IoT gateway
  2. Network:
    1. External (Internet) access: a USB 5G network card that must be brought up by a dialer program, which in turn invokes the system's dhcp/dnsmasq/resolvconf, etc.
    2. Intranet access: WLAN network card
  3. Software: K3S Server v1.21.7+k3s1, dnsmasq, etc.

Analysis

Detailed network configuration

Check and analyze step by step (outputs shown below):

  1. Check /etc/resolv.conf: it is configured with nameserver 127.0.0.1
  2. netstat shows that local port 53 is indeed being listened on
  3. This usually means a local DNS server or cache is running; check whether a dnsmasq process exists, and it does
  4. The upstream resolv.conf used by dnsmasq is /run/dnsmasq/resolv.conf
$ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 127.0.0.1

$ netstat -anpl|grep 53
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:53 0.0.0.0:* LISTEN -
tcp6 0 0 :::53 :::* LISTEN -
udp 0 0 0.0.0.0:53 0.0.0.0:* -
udp6 0 0 :::53 :::* -

$ ps -ef|grep dnsmasq
dnsmasq 912 1 0 6 月 06 ? 00:00:00 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=...

$ systemctl status dnsmasq.service
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-06-06 17:21:31 CST; 16h ago
Main PID: 912 (dnsmasq)
Tasks: 1 (limit: 4242)
Memory: 1.1M
CGroup: /system.slice/dnsmasq.service
└─912 /usr/sbin/dnsmasq -x /run/dnsmasq/dnsmasq.pid -u dnsmasq -r /run/dnsmasq/resolv.conf -7 /etc/dnsmasq.d,.dpkg-dist,.dpkg-old,.dpkg-new --local-service --trust-anchor=...

6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: started, version 2.80 cachesize 150
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: compile time options: IPv6 GNU-getopt DBus i18n IDN DHCP DHCPv6 no-Lua TFTP conntrack ipset auth DNSSEC loop-detect inotify dumpfile
6 月 06 17:21:31 orangebox-7eb3 dnsmasq-dhcp[912]: DHCP, IP range 192.168.51.100 -- 192.168.51.200, lease time 3d
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: read /etc/hosts - 8 addresses
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[912]: no servers found in /run/dnsmasq/resolv.conf, will retry
6 月 06 17:21:31 orangebox-7eb3 dnsmasq[928]: Too few arguments.
6 月 06 17:21:31 orangebox-7eb3 systemd[1]: Started dnsmasq - A lightweight DHCP and caching DNS server.
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: reading /run/dnsmasq/resolv.conf
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: using nameserver 222.66.251.8#53
6 月 06 17:22:18 orangebox-7eb3 dnsmasq[912]: using nameserver 116.236.159.8#53

$ cat /run/dnsmasq/resolv.conf
# Generated by resolvconf
nameserver 222.66.251.8
nameserver 116.236.159.8
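
To double-check that this local dnsmasq chain (127.0.0.1 → dnsmasq → the carrier's DNS servers) actually resolves names from the Node, a quick query against the local listener can be run. This is a sketch, assuming dig is available on the gateway; it was not part of the original capture:

# Query the local dnsmasq listener, which forwards to the carrier's DNS.
# A successful answer here confirms that Node-level resolution is healthy.
$ dig @127.0.0.1 www.baidu.com +time=2 +tries=1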

CoreDNS analysis

Here is the strange part: I could not find anywhere that DNS 8.8.8.8 is configured, yet the log shows queries being forwarded to it:

[ERROR] plugin/errors: 2 update.traefik.io. A: read udp 10.42.0.3:38545->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 update.traefik.io. AAAA: read udp 10.42.0.3:38990->8.8.8.8:53: i/o timeout

Let's take a look at the CoreDNS configuration (K3S deploys CoreDNS via a manifest located at /var/lib/rancher/k3s/server/manifests/coredns.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        hosts /etc/coredns/NodeHosts {
          ttl 60
          reload 15s
          fallthrough
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }

There are 2 main configurations to focus on here:

  • forward . /etc/resolv.conf
  • loop
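
To confirm that the running cluster really uses this default Corefile (rather than a customized one), the live ConfigMap can be checked as well; a quick check, not in the original capture:

# Show the Corefile that CoreDNS is actually running with.
$ kubectl -n kube-system get configmap coredns -o yaml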

Common analysis process for CoreDNS issues

Check whether the DNS Pod is running properly - Result: yes:

# kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-7448499f4d-pbxk6 1/1 Running 1 15h

Check that the DNS service has the correct cluster-ip - Result: Yes:

# kubectl -n kube-system get svc -l k8s-app=kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 15h
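
A related check worth doing (not in the original capture) is whether the kube-dns Service actually has endpoints backing it:

# The Service should list the CoreDNS Pod IP(s) as its endpoints.
$ kubectl -n kube-system get endpoints kube-dns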

Check whether domain names can be resolved.
First, an internal domain name - Result: cannot be resolved:

$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup kubernetes.default
Server: 10.43.0.10
Address: 10.43.0.10:53

;; connection timed out; no servers could be reached

Then try an external domain name - Result: cannot be resolved:

$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup www.baidu.com
Server: 10.43.0.10
Address: 10.43.0.10:53

;; connection timed out; no servers could be reached

Check the nameserver configuration in the Pod's resolv.conf - Result: it is indeed 8.8.8.8:

$ kubectl run -i --restart=Never --rm test-${RANDOM} --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"dnsPolicy":"Default"}}' -- sh -c 'cat /etc/resolv.conf'
nameserver 8.8.8.8
pod "test-7517" deleted

In summary:
/etc/resolv.conf inside the Pods is configured with nameserver 8.8.8.8, and this is what causes the problem.
However, nothing at the Node OS level is configured with nameserver 8.8.8.8. So the suspicion is that Kubernetes, the Kubelet, CoreDNS, or the CRI has some mechanism that automatically falls back to nameserver 8.8.8.8 when the DNS configuration looks abnormal.

So, to solve the problem, we still need to find out where the DNS configuration goes wrong.

Container network DNS service

I did not find concrete evidence of such DNS behavior in Kubernetes, the Kubelet, CoreDNS, or the CRI (the CRI for K3S is containerd), but I did find the following description in the official Docker documentation:

📚️ Reference:
If the container cannot reach any of the IP addresses you specify, Google's public DNS server 8.8.8.8 is added, so that your container can resolve internet domains.

My speculation is that Kubernetes, the Kubelet, CoreDNS, or the CRI may have a similar mechanism.
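
One place worth checking is K3s itself: to my knowledge, some K3s versions generate a fallback resolv.conf for the kubelet when the host's resolv.conf contains only loopback addresses. The following is a hedged check based on that assumption and on the default data directory; the path is not confirmed by the original troubleshooting session:

# Assumption: K3s may have written its own fallback resolv.conf under the
# agent data dir when it saw only 127.0.0.1 in /etc/resolv.conf.
# If this file exists and contains 8.8.8.8, it explains what the Pods see.
$ cat /var/lib/rancher/k3s/agent/etc/resolv.conf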

From this analysis we know that the root cause is still a DNS configuration problem. Since the CoreDNS configuration is the default, the most likely cause is that /etc/resolv.conf being configured as nameserver 127.0.0.1 triggered this behavior.

Root cause analysis

Root cause: /etc/resolv.conf on the Node being configured as nameserver 127.0.0.1 caused this problem.

The official documentation for CoreDNS makes this clear:

📚️ Reference:

loop | CoreDNS Docs
When the CoreDNS log contains the message Loop ... detected ..., it means the loop detection plugin has detected an infinite forwarding loop in one of the upstream DNS servers. This is a fatal error, because operating with an infinite loop will consume memory and CPU until the host eventually runs out of memory and dies.
Forwarding loops are typically caused by the following:
Most commonly, CoreDNS forwards requests directly to itself, for example via a loopback address such as 127.0.0.1, ::1 or 127.0.0.53.
To resolve this issue, review the forwards in your Corefile for the zone in which the loop was detected. Make sure they do not forward to a local address or to another DNS server that forwards requests back to CoreDNS. If forward is using a file (for example /etc/resolv.conf), make sure that file does not contain local addresses.

As you can see, our CoreDNS configuration contains forward . /etc/resolv.conf, and /etc/resolv.conf on the Node is nameserver 127.0.0.1. This matches exactly the "infinite forwarding loop" fatal error described above.


Workaround

📚️ Reference:

loop | CoreDNS Docs
There are 3 official solutions:

  1. Add --resolv-conf to the kubelet, pointing directly to the "real" resolv.conf, usually /run/systemd/resolve/resolv.conf
  2. Disable the local DNS cache on the Node
  3. A quick and dirty method: modify the Corefile and replace forward . /etc/resolv.conf with forward . 8.8.8.8 (or another reachable DNS address), as sketched below
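
For completeness, option 3 would look roughly like the following. This is only a sketch; note that K3s re-applies its bundled manifests, so a lasting change would likely have to be made in /var/lib/rancher/k3s/server/manifests/coredns.yaml rather than in the live ConfigMap:

# Edit the live Corefile (may be overwritten by the bundled K3s manifest):
$ kubectl -n kube-system edit configmap coredns
# ...then change:
#   forward . /etc/resolv.conf
# to a reachable upstream, e.g.:
#   forward . <a DNS address reachable from this network>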

In view of the above methods, we analyze them one by one:

  1. ✔️ Feasible: add --resolv-conf to the kubelet, pointing directly to the "real" resolv.conf. As analyzed above, our "real" resolv.conf is /run/dnsmasq/resolv.conf
  2. ❌ Not feasible: disable the local DNS cache on the Node. This is a special case for the 5G IoT gateway: its dialer mechanism relies on dnsmasq, so dnsmasq cannot be disabled
  3. ❌ Not feasible: the dirty method. The DNS servers obtained by the 5G gateway are not fixed and can change at any time, so we cannot hard-code forward . <fixed DNS address>

In summary, the solution is as follows:
Add the following flag to the K3S service: --resolv-conf /run/dnsmasq/resolv.conf

After adding, it is as follows:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
KillMode=process
Delegate=yes
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-iface wlan0 --write-kubeconfig-mode "644" --disable-cloud-controller --resolv-conf /run/dnsmasq/resolv.conf

Then execute the following commands to reload and restart:

systemctl daemon-reload
systemctl stop k3s.service
k3s-killall.sh
systemctl start k3s.service

After this, everything returns to normal.
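
To verify, the earlier checks can be repeated; both lookups should now succeed (this re-uses the same test commands as above):

# DNS inside the cluster should now resolve internal and external names.
$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup kubernetes.default
$ kubectl run -it --rm --restart=Never busybox --image=busybox -- nslookup www.baidu.com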

If it needs to be resolved at installation time, the workaround is as follows:

  • Using the k3s-ansible script: in group_vars, additionally add the --resolv-conf parameter: extra_server_args: '--resolv-conf /run/dnsmasq/resolv.conf'
  • Using the official k3s install script: see K3s Server Configuration Reference | Rancher documentation; add the parameter --resolv-conf /run/dnsmasq/resolv.conf, or use the environment variable K3S_RESOLV_CONF=/run/dnsmasq/resolv.conf (see the sketch below)
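
A minimal sketch of the official-script variant, assuming the same flannel interface as in the unit file above; adjust the flags to your environment:

# Pass the flag through INSTALL_K3S_EXEC at install time...
$ curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-iface wlan0 --resolv-conf /run/dnsmasq/resolv.conf" sh -

# ...or via the K3S_RESOLV_CONF environment variable, which the install
# script writes into the k3s.service.env file.
$ curl -sfL https://get.k3s.io | K3S_RESOLV_CONF=/run/dnsmasq/resolv.conf sh -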

🎉🎉🎉

📚️ Reference documentation

