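Interviewers often ask which incidents I have handled. Because I never prepared for the question, I tend to get stuck, answer badly, or simply fail to recall anything worth mentioning. The next few posts therefore summarize incidents I have dealt with myself. This first one covers Calico failing to start because /etc/nsswitch.conf was missing from the alpine image.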
Error log
2018-08-26 15:06:04.447 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
bird: Mesh_10_112_35_117: State changed to start
2018-08-26 15:06:05.448 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
Troubleshooting
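My first thought was to report a bug, so I opened an Issue on GitHub. Before the author replied, I went through the documentation and set the FELIX_HEALTHHOST variable to spec.nodeName (in my environment spec.nodeName is the host IP), which fixed the problem on the surface.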
225a226,230
> # FELIX_HEALTHHOST
> - name: FELIX_HEALTHHOST
> valueFrom:
> fieldRef:
> fieldPath: spec.nodeName
283c288
< host: localhost
---
> #host: 127.0.0.1
288,292c293,301
< exec:
< command:
< - /bin/calico-node
< - -bird-ready
< - -felix-ready
---
> httpGet:
> path: /readiness
> port: 9099
> # host: 127.0.0.1
> #exec:
> # command:
> # - /bin/calico-node
> # - -bird-ready
> # - -felix-ready
Author's reply
calico-node -felix-ready should just be doing an http get under the covers. It's mainly so we can wrap multiple liveness checks into a single command.
I haven't seen a need to set HEALTHHOST explicitly before, it's a bit odd. It should default to localhost per this: https://github.com/projectcalico/felix/blob/2a2fedd7e2831db07d4f36d2ddc928df783e19bb/config/config_params.go
What does localhost resolve to on this machine?
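Following the author's reply, I turned to investigating how localhost was being resolved. A web search turned up the root cause: a newer alpine release had removed /etc/nsswitch.conf, so Go programs went straight to DNS to resolve names. The entries in /etc/hosts were not used, and localhost therefore could not be resolved correctly to 127.0.0.1.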
A simple program to verify this:
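package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Resolve "localhost" the same way any Go program on this host would.
	ns, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Err: %s\n", err.Error())
		return
	}
	// Print every address returned by the lookup.
	for _, n := range ns {
		fmt.Fprintf(os.Stdout, "--%s\n", n)
	}
}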
Run it:
# without nsswitch.conf (#hosts: files dns myhostname)
[root@repo tmp]# go run dns.go
--115.9.3.123
# with nsswitch.conf
[root@repo tmp]# vim /etc/nsswitch.conf
[root@repo tmp]#
[root@repo tmp]# go run dns.go
--127.0.0.1
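This lookup result connects back to the error in the log. A listener can only bind to an address configured on a local interface, so once localhost resolves to an external IP, binding to it fails. The following is a rough sketch of that effect, not Felix's actual code; `tryListen` is a hypothetical helper, and 192.0.2.1 is a documentation-only address used as a stand-in for an IP the machine does not own:

```go
package main

import (
	"fmt"
	"net"
)

// tryListen resolves host with net.LookupHost and then tries to bind a TCP
// listener on the first returned address. Port "0" lets the OS pick a free
// port, so only the address itself is being exercised.
func tryListen(host string) (string, error) {
	addrs, err := net.LookupHost(host)
	if err != nil {
		return "", err
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("no addresses for %q", host)
	}
	ln, err := net.Listen("tcp", net.JoinHostPort(addrs[0], "0"))
	if err != nil {
		return addrs[0], err
	}
	defer ln.Close()
	return addrs[0], nil
}

func main() {
	// 127.0.0.1 is local. 192.0.2.1 (TEST-NET-1) is assumed not to be
	// configured on any interface of the machine running this sketch.
	for _, host := range []string{"127.0.0.1", "192.0.2.1"} {
		addr, err := tryListen(host)
		fmt.Printf("%s -> %s: err=%v\n", host, addr, err)
	}
}
```

On a typical Linux host the second attempt is expected to fail with the same "cannot assign requested address" message seen in the Calico log, although the exact text depends on the operating system and its network settings.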
Fix
Add /etc/nsswitch.conf back to the image:
hosts: files dns myhostname
The latest version of Calico has already fixed this issue.