My Debugging Journey: Container Child Processes
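A problem that has reportedly been lingering for a long time was handed over to me, with nothing but the following description:
Investigate: in Docker, the main process receives SIGTERM on kill and is force-killed after more than 10s, and the child process cannot receive SIGTERM, so the child cannot exit gracefully (options: tini or self-developed, needs research). This may be the cause of data loss when the component restarts.
There is no reproduction scenario, so it looks like I have to build one myself. "May cause data loss" means this is still only a guess. tini was identified as a possible solution, but the colleagues I asked seemed to think it does not work... So once again a vague, seemingly intractable problem has landed on my desk.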
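First, reproduce the scenario. The conditions are:
The container runs more than one process; here a script launches the application process
docker stop force-kills the container after 10s (its default grace period)
Build a sample program first. The directory layout is: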
├── Dockerfile
├── go.mod
├── main.go
└── test.sh
main.go
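package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := http.Server{
		Addr: ":8000",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("hello world"))
		}),
	}

	// ctx is cancelled as soon as the process receives SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	go func() {
		log.Println("Server is running")
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-ctx.Done()
	log.Println("received SIGTERM")

	// ctx is already cancelled at this point, so Shutdown gets a fresh context;
	// with the cancelled one it would return at once without draining connections.
	// The 5s timeout is an arbitrary value below Docker's 10s grace period.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(shutdownCtx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}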
Dockerfile
FROM golang:1.17 as builder
WORKDIR /workspace
ADD test.sh .
RUN chmod +x test.sh
COPY go.mod go.mod
RUN go env -w GO111MODULE=on && \
    go env -w GOPROXY=https://goproxy.cn,direct && \
    go mod download
COPY main.go main.go
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o app main.go
FROM alpine
COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh
ENTRYPOINT ["/bin/sh", "/test.sh"]
test.sh
#!/bin/sh
./app
Start the container with docker run, then docker stop it
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9398c6c38f55 bt "/bin/sh /test.sh" 37 seconds ago Up 36 seconds mystifying_elbakyan
❯ docker stop 9398c6c38f55
9398c6c38f55
took 10s
❯
Exactly 10s, no more and no less, which matches the description. Now look at docker events:
❯ docker events
2023-06-07T22:54:12.597708119+08:00 container kill 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan, signal=15)
2023-06-07T22:54:22.620442554+08:00 container kill 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan, signal=9)
2023-06-07T22:54:22.900211893+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4, name=bridge, type=bridge)
2023-06-07T22:54:22.910430519+08:00 container stop 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan)
2023-06-07T22:54:22.914132154+08:00 container die 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (exitCode=137, image=bt, name=mystifying_elbakyan)
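Using tini
What is tini
tini (a container init) is a minimal init system that runs inside the container. It starts a single child process, waits for it to exit, reaps zombies, and forwards signals.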
The modified Dockerfile (final stage)
FROM alpine
COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["/test.sh"]
Run it
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a9d595a31a26 bt1 "/sbin/tini -- /test…" 12 seconds ago Up 11 seconds brave_lamarr
❯ docker stop a9d595a31a26
a9d595a31a26
The corresponding docker events
2023-06-07T23:13:27.961297294+08:00 container kill a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (image=bt1, name=brave_lamarr, signal=15)
2023-06-07T23:13:28.231211949+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8, name=bridge, type=bridge)
2023-06-07T23:13:28.241647255+08:00 container stop a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (image=bt1, name=brave_lamarr)
2023-06-07T23:13:28.245237103+08:00 container die a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (exitCode=143, image=bt1, name=brave_lamarr)
However, standard output did not show the received SIGTERM line I was hoping for.
This suggests that SIGTERM may never have reached the app process.
dumb-init
A minimal init system for Linux containers
Clearly this is a competitor to tini. Let's try it out first.
Dockerfile
FROM alpine
COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh
RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["/test.sh"]
Run it
❯ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a8013b4f5450 bt2 "/usr/local/bin/dumb…" 10 seconds ago Up 9 seconds agitated_tesla
❯ docker stop a8013b4f5450
a8013b4f5450
docker events
2023-06-07T23:19:03.818896256+08:00 container kill a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (image=bt2, name=agitated_tesla, signal=15)
2023-06-07T23:19:04.126668505+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f, name=bridge, type=bridge)
2023-06-07T23:19:04.136470132+08:00 container stop a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (image=bt2, name=agitated_tesla)
2023-06-07T23:19:04.138676021+08:00 container die a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (exitCode=143, image=bt2, name=agitated_tesla)
Standard output
2023/06/07 15:18:47 Server is running
2023/06/07 15:19:03 received SIGTERM
Everything lines up. Judging by the result, dumb-init meets my needs.
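Summary
Given what I currently know, I still have plenty of blind spots. For example: why is the target process in Docker not PID 1? Why is there a need to manage multiple processes inside one container? I do understand that these are real scenarios, some of them containerization for its own sake, and that they only look odd because of my own limited perspective.
On the other hand, the multi-process management tools for containers that I found do not all work quite as described. Or was my experiment set up wrongly? That deserves further thought. I consider this an unfinished topic that ought to have a follow-up.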