Missing nsswitch.conf prevents Calico from starting

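Interviewers often ask which incidents I have handled. Without preparation I tend to stall, answer poorly, or simply blank out. So over the next few posts I will write up the incidents I have dealt with. This first post covers Calico failing to start because /etc/nsswitch.conf was missing from the alpine image.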

Error log

2018-08-26 15:06:04.447 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
bird: Mesh_10_112_35_117: State changed to start
2018-08-26 15:06:05.448 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address

Troubleshooting

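My first thought was to report a bug, so I opened an issue on GitHub. Before the author replied, I searched the documentation and set the FELIX_HEALTHHOST variable to spec.nodeName (in my environment spec.nodeName is the host IP). On the surface, this fixed the problem.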

225a226,230
>             # FELIX_HEALTHHOST
>             - name: FELIX_HEALTHHOST
>               valueFrom:
>                 fieldRef:
>                   fieldPath: spec.nodeName
283c288
<               host: localhost
---
>               #host: 127.0.0.1
288,292c293,301
<             exec:
<               command:
<               - /bin/calico-node
<               - -bird-ready
<               - -felix-ready
---
>             httpGet:
>               path: /readiness
>               port: 9099
>             #  host: 127.0.0.1
>             #exec:
>             #  command:
>             #  - /bin/calico-node
>             #  - -bird-ready
>             #  - -felix-ready
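The httpGet probe in the diff above boils down to a single HTTP request. The following is a minimal sketch of such a check, not Calico's or the kubelet's actual code: the names statusReady and checkReady and the 2-second timeout are assumptions made for illustration, while the port 9099 and the /readiness path come from the manifest. The 200-399 success range follows the documented behaviour of Kubernetes HTTP probes.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// defaultTimeout is an assumed value chosen for this sketch.
const defaultTimeout = 2 * time.Second

// statusReady mirrors the rule used by Kubernetes HTTP probes:
// any status code from 200 up to (but not including) 400 is a success.
func statusReady(code int) bool {
	return code >= 200 && code < 400
}

// checkReady (hypothetical helper) sends one GET request to url and
// reports whether the response status counts as ready.
func checkReady(url string, timeout time.Duration) (bool, error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return statusReady(resp.StatusCode), nil
}

func main() {
	// Use a literal IP by default so the check itself does not depend
	// on how "localhost" is resolved.
	url := "http://127.0.0.1:9099/readiness"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	ready, err := checkReady(url, defaultTimeout)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("ready:", ready)
}
```

Run it on the node, optionally passing a different URL as the first argument, for example the host IP that FELIX_HEALTHHOST was set to.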

The author's reply

calico-node -felix-ready should just be doing an http get under the covers. It's mainly so we can wrap multiple liveness checks into a single command.
I haven't seen a need to set HEALTHHOST explicitly before, it's a bit odd. It should default to localhost per this: https://github.com/projectcalico/felix/blob/2a2fedd7e2831db07d4f36d2ddc928df783e19bb/config/config_params.go
What does localhost resolve to on this machine?

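Following the author's reply, I turned to investigating how localhost was being resolved. Searching online turned up the root cause: a newer alpine release had removed /etc/nsswitch.conf, so Go programs resolved names through DNS directly and the entries in /etc/hosts were not used. As a result, localhost no longer resolved to 127.0.0.1, which would explain why Felix tried to bind its health endpoint to 115.18.44.218 in the log above.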

A simple program to verify this:

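package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Resolve "localhost" through Go's net package, the same way any Go program does.
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Err: %s\n", err)
		os.Exit(1)
	}

	// Print every address returned; a correctly configured host prints 127.0.0.1.
	for _, addr := range addrs {
		fmt.Printf("--%s\n", addr)
	}
}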

Run it

# without nsswitch.conf (hosts line commented out: #hosts:      files dns myhostname)
[root@repo tmp]# go run dns.go 
--115.9.3.123
# with nsswitch.conf
[root@repo tmp]# vim /etc/nsswitch.conf
[root@repo tmp]# 
[root@repo tmp]# go run dns.go 
--127.0.0.1
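The manual comparison above can also be turned into an automatic check. The sketch below is one possible way to do that, assuming that "every address returned for localhost must be a loopback address" is the right criterion; the helper name allLoopback is hypothetical.

```go
package main

import (
	"fmt"
	"net"
)

// allLoopback (hypothetical helper) reports whether addrs is non-empty
// and every entry parses as a loopback IP (127.0.0.0/8 or ::1).
func allLoopback(addrs []string) bool {
	if len(addrs) == 0 {
		return false
	}
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || !ip.IsLoopback() {
			return false
		}
	}
	return true
}

func main() {
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	if allLoopback(addrs) {
		fmt.Println("ok: localhost resolves to loopback:", addrs)
	} else {
		fmt.Println("warning: localhost resolves to non-loopback address(es):", addrs)
	}
}
```

With the first result from the transcript (115.9.3.123) this check would print the warning, and with 127.0.0.1 it would print ok.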

Solution

Add /etc/nsswitch.conf back to the image:

hosts:      files dns myhostname
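To confirm that an image carries this fix, the hosts line can be inspected programmatically. The following is a simplified sketch, not Go's or glibc's actual nsswitch parser: it strips comments, skips bracketed actions such as [NOTFOUND=return] when they appear as a single token, and only checks that files is listed before dns. The function names hostsSources and filesBeforeDNS are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hostsSources (hypothetical helper) returns the lookup sources listed on
// the "hosts:" line of nsswitch.conf content, in order. It returns nil if
// there is no active (uncommented) hosts line.
func hostsSources(conf string) []string {
	for _, line := range strings.Split(conf, "\n") {
		// Drop comments, which start with '#'.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "hosts:") {
			continue
		}
		var sources []string
		for _, f := range strings.Fields(line[len("hosts:"):]) {
			// Skip action items such as [NOTFOUND=return].
			if !strings.HasPrefix(f, "[") {
				sources = append(sources, f)
			}
		}
		return sources
	}
	return nil
}

// filesBeforeDNS reports whether /etc/hosts ("files") is consulted
// before DNS according to the given nsswitch.conf content.
func filesBeforeDNS(conf string) bool {
	for _, s := range hostsSources(conf) {
		switch s {
		case "files":
			return true
		case "dns":
			return false
		}
	}
	return false
}

func main() {
	path := "/etc/nsswitch.conf"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Println("cannot read nsswitch.conf:", err)
		return
	}
	fmt.Println("hosts sources:", hostsSources(string(data)))
	fmt.Println("files before dns:", filesBeforeDNS(string(data)))
}
```

Running it inside the container, or against a file extracted from the image, shows whether the hosts line is present and ordered as intended.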

The latest Calico release has already fixed this issue.

