My Troubleshooting Journey: Child Processes in Containers


A reportedly long-standing issue was handed over to me, with nothing more than the following description:

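Investigate: in Docker, the main process is sent SIGTERM on kill and is force-killed after more than 10s, while child processes never receive SIGTERM and therefore cannot exit gracefully (options: tini or an in-house solution, needs research). This may be the cause of the data loss seen when components restart.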

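There is no reproduction scenario, so it looks like I have to build one myself. "May cause data loss" means that part is still a guess. tini was identified as a possible solution, but the colleagues I asked seemed to think it does not work... So once again a murky, seemingly intractable problem has landed on my desk.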

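Let's reproduce the scenario first. The conditions are:

  1. The container runs more than one process; here a script launches the application process

  2. The container is force-killed after 10s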

First, build a sample program. The directory structure is:

├── Dockerfile
├── go.mod
├── main.go
└── test.sh

main.go

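package main

import (
  "context"
  "log"
  "net/http"
  "os/signal"
  "syscall"
  "time"
)

func main() {
  srv := http.Server{
    Addr: ":8000",
    Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
      w.Write([]byte("hello world"))
    }),
  }

  // ctx is cancelled as soon as the process receives SIGTERM.
  ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
  defer stop()

  go func() {
    log.Println("Server is running")
    if err := srv.ListenAndServe(); err != http.ErrServerClosed {
      log.Fatal(err)
    }
  }()

  // Block until SIGTERM arrives.
  <-ctx.Done()
  log.Println("received SIGTERM")

  // ctx is already cancelled at this point, so Shutdown needs a fresh
  // context; otherwise it would not wait for in-flight requests.
  // The 5s limit is an arbitrary value chosen to stay under the 10s kill.
  shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
  defer cancel()
  if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("shutdown: %v", err)
  }
}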

Dockerfile

FROM golang:1.17 as builder

WORKDIR /workspace

ADD test.sh .
RUN chmod +x test.sh

COPY go.mod go.mod

RUN go env -w GO111MODULE=on && \
    go env -w GOPROXY=https://goproxy.cn,direct && \
    go mod download

COPY main.go main.go

RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -o app main.go

FROM alpine

COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh

ENTRYPOINT ["/bin/sh", "/test.sh"]

test.sh

#!/bin/sh
./app

Start the container with docker run, then docker stop it:

❯ docker ps
CONTAINER ID   IMAGE     COMMAND              CREATED          STATUS          PORTS     NAMES
9398c6c38f55   bt        "/bin/sh /test.sh"   37 seconds ago   Up 36 seconds             mystifying_elbakyan

❯ docker stop 9398c6c38f55
9398c6c38f55

 took 10s
❯

Exactly 10s, no more and no less, which matches the description. Now let's look at docker events:

❯ docker events
2023-06-07T22:54:12.597708119+08:00 container kill 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan, signal=15)
2023-06-07T22:54:22.620442554+08:00 container kill 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan, signal=9)
2023-06-07T22:54:22.900211893+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4, name=bridge, type=bridge)
2023-06-07T22:54:22.910430519+08:00 container stop 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (image=bt, name=mystifying_elbakyan)
2023-06-07T22:54:22.914132154+08:00 container die 9398c6c38f55a7f85285453b7743ba198919f66508c11c9951b38eeda530b9d4 (exitCode=137, image=bt, name=mystifying_elbakyan)

Using tini

What is tini

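tini is a minimal init system for containers. It runs inside the container, spawns a single child process, waits for that child to exit, and meanwhile reaps zombies and forwards signals.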

The updated Dockerfile (final stage only; the builder stage is unchanged)

FROM alpine

COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh

RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]

CMD ["/test.sh"]

Run it:

❯ docker ps
CONTAINER ID   IMAGE     COMMAND                   CREATED          STATUS          PORTS     NAMES
a9d595a31a26   bt1       "/sbin/tini -- /test…"   12 seconds ago   Up 11 seconds             brave_lamarr

❯ docker stop a9d595a31a26
a9d595a31a26

The corresponding docker events:

2023-06-07T23:13:27.961297294+08:00 container kill a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (image=bt1, name=brave_lamarr, signal=15)
2023-06-07T23:13:28.231211949+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8, name=bridge, type=bridge)
2023-06-07T23:13:28.241647255+08:00 container stop a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (image=bt1, name=brave_lamarr)
2023-06-07T23:13:28.245237103+08:00 container die a9d595a31a26e6273061167b47cf66c44bf7b8c872906a756d05ef0968abb3f8 (exitCode=143, image=bt1, name=brave_lamarr)

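The container now stops right away, but standard output never shows the "received SIGTERM" line I was hoping for, which suggests SIGTERM may not have reached the app process at all.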
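
The exitCode=143 in the events fits the usual convention that a code above 128 means 128 plus the number of the signal that ended the process: 143 is 128+15 (SIGTERM), and the earlier 137 is 128+9 (SIGKILL). A small Go helper makes the arithmetic explicit. The name describeExit is hypothetical, and the lookup table covers only the two signals seen in this post.

```go
package main

import "fmt"

// describeExit interprets an exit code using the "128 + signal number"
// convention. Only the signals that appear in this post are named.
func describeExit(code int) string {
	names := map[int]string{9: "SIGKILL", 15: "SIGTERM"}
	if code > 128 && code <= 128+64 {
		sig := code - 128
		if name, ok := names[sig]; ok {
			return fmt.Sprintf("terminated by %s (signal %d)", name, sig)
		}
		return fmt.Sprintf("terminated by signal %d", sig)
	}
	return fmt.Sprintf("exited with status %d", code)
}

func main() {
	for _, code := range []int{137, 143, 0} {
		fmt.Printf("%d: %s\n", code, describeExit(code))
	}
}
```

The exit code alone only says that some process was ended by SIGTERM. It does not say that the app process itself received or handled the signal, which is why the missing log line matters.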

dumb-init

A minimal init system for Linux containers

It is clearly a competitor to tini. Let's experiment first and ask questions later.

Dockerfile

FROM alpine

COPY --from=builder /workspace/app /app
COPY --from=builder /workspace/test.sh /test.sh

RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]

CMD ["/test.sh"]

Run it:

❯ docker ps
CONTAINER ID   IMAGE     COMMAND                   CREATED          STATUS         PORTS     NAMES
a8013b4f5450   bt2       "/usr/local/bin/dumb…"   10 seconds ago   Up 9 seconds             agitated_tesla

❯ docker stop a8013b4f5450
a8013b4f5450

docker events:

2023-06-07T23:19:03.818896256+08:00 container kill a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (image=bt2, name=agitated_tesla, signal=15)
2023-06-07T23:19:04.126668505+08:00 network disconnect be46b9b50890526a33927cb6b14776ec1c6b34721e56bbe5cd7776f4698e17bd (container=a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f, name=bridge, type=bridge)
2023-06-07T23:19:04.136470132+08:00 container stop a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (image=bt2, name=agitated_tesla)
2023-06-07T23:19:04.138676021+08:00 container die a8013b4f5450b650bab01b37dce0f91da528bf051909de7a06edab03b4267b2f (exitCode=143, image=bt2, name=agitated_tesla)

Standard output:

2023/06/07 15:18:47 Server is running
2023/06/07 15:19:03 received SIGTERM

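Everything lines up: the container stops immediately and the app logs the signal. Judging by the result, dumb-init meets my needs.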

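Summary

Given what I know so far, I still have plenty of blind spots. For example: why is the target process in the container not PID 1? Why is there a need to manage multiple processes inside a container at all? Still, I understand that these are real-world scenarios, some of them the result of containerizing for containerization's sake, and that they only look odd because of my own limited perspective.

On the other hand, the multi-process management tools for containers that are available do not all work quite as advertised. Or was my experimental setup wrong? That is worth thinking about further. I consider this an unfinished topic that deserves a follow-up.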

References: