Fixing a SaaS App Deployment Timeout on BlueKing (蓝鲸智云)

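While deploying the BlueKing platform on a cluster today I ran into plenty of problems, such as modules failing to install and the deployment script hanging while installing Docker remotely. None of those matter much, since they can all be worked around by deploying manually. The problem I hit when deploying the SaaS services, though, was a less common one: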

[root@bkce01 install]# ./bk_install saas-o bk_iam

                     Deploy official saas bk_iam                    
2022-02-10 18:24:06 36   INFO   uploading file /data/src/official_saas/bk_iam_V1.4.24-bkofficial.tar.gz, url:http://paas.service.consul:80/saas/upload0/bk_iam/, headers: {'X-APP-CODE': 'bk_paas', 'X-APP-TOKEN': '6d1da7d2-69e1-42d1-bbf1-c90fbc7c1270'} ...
2022-02-10 18:24:06 37   INFO   http://paas.service.consul:80/saas/upload0/bk_iam/
2022-02-10 18:24:07 39   INFO   b'\n    <script>\n    window.parent.document.getElementById("import_msg").innerHTML=\n    "<span class="\'text-success\'"><i></i> \xe4\xb8\x8a\xe4\xbc\xa0\xe6\x88\x90\xe5\x8a\x9f</span>";\n    window.parent.document.getElementById("to_deploy").innerHTML="<a class="\'btn" href> \xe5\x8e\xbb\xe9\x83\xa8\xe7\xbd\xb2 </a>"\n    </script>\n    '
2022-02-10 18:24:07 210  INFO   query saas_version_id: 1
2022-02-10 18:24:07 212  INFO   start deploy app:bk_iam url: http://paas.service.consul:80/saas/release/online0/1/
2022-02-10 18:24:07 53   INFO   start deploy bk_iam
2022-02-10 18:24:08 62   INFO   resposne: {'msg': 'SaaS App正式部署事件提交成功!', 'event_id': 'fcced63c-beba-47a0-a9c7-f9cd7b7b2885', 'app_code': 'bk_iam', 'result': True}
2022-02-10 18:24:08 216  INFO   checking deploy result...
2022-02-10 18:24:10 74   INFO   check deploy result. retry 0
2022-02-10 18:24:10 83   ERROR  deploy failed: timeout

如果以上步骤没有报错, 已经完成 蓝鲸SaaS(bk_iam) 的部署

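The deployment kept timing out, and the log is not detailed enough to make troubleshooting easy. Below I walk through the troubleshooting steps and the fix.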

1. Check the SaaS service logs

The SaaS app won't install and the error message is this terse, so the next step has to be digging into the detailed logs.

The PaasAgent error log on the APPO host shows the following:

[root@bkce02 install]# tail -100f /data/bkce/logs/paasagent/agent.log
...
2022/02/10 18:24:09 jobrun.go:240:                -------- create app container --------               

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 381   mount files/directories to docker container

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 140   mount directories/files:

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 144     - /etc/yum.repos.d

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 144     - /etc/yum

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 144     - /etc/pki

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 144     - /etc/yum.conf

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 159   developer_code: bkoffical

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 160   crypt_version: 1

2022/02/10 18:24:09 jobrun.go:240: docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/create?name=bk_iam-1644488648": dial unix /var/run/docker.sock: connect: permission denied.

2022/02/10 18:24:09 jobrun.go:240: See 'docker run --help'.

2022/02/10 18:24:09 jobrun.go:240: [10.147.118.32]20220210-182409 105   create app container failed.[JOB FAILURE]

2022/02/10 18:24:09 jobrun.go:265: error waiting for Cmd exit status 1
...

Clearly, the cause is that the docker command is being run by a user with no permission to access the Docker daemon socket.
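Because the agent log is long, it can help to filter it down to the lines that signal a failure. The following is a minimal sketch, not part of BlueKing: `find_agent_errors` is a hypothetical helper, and its keyword list is an assumption drawn only from the excerpt above.

```shell
#!/usr/bin/env bash
# Print only the log lines that look like failures, prefixed with line numbers.
# The patterns are an assumption based on the excerpt above; extend as needed.
# With no file argument, grep reads from standard input.
find_agent_errors() {
    grep -nE 'permission denied|JOB FAILURE|exit status [1-9]' "$@"
}

# Usage on the APPO host (log path taken from the excerpt above):
#   find_agent_errors /data/bkce/logs/paasagent/agent.log
```

Run against the log shown above, this would surface the "permission denied" line and the "[JOB FAILURE]" line without the surrounding mount messages.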

2. Verify permissions for remote command execution

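Of course, running docker commands locally as root works fine, so the next thing to verify is whether running them remotely from the control host causes a problem: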

[root@bkce01 install]# pcmd -m appo 'docker run hello-world'
[1] 18:40:49 [SUCCESS] 10.147.118.32

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Clearly, remote execution works too. So what causes docker to be denied permission when the SaaS app is deployed?

3. Analyze the execution permissions

Since docker runs successfully as root both remotely and locally, only one possibility remains: the docker command issued during deployment is not run as root.

Looking through /etc/passwd, sure enough, there sits an account named blueking:

[root@bkce02 install]# more /etc/passwd
...
blueking:x:10000:10000:BlueKing EE User:/home/blueking:/bin/bash
...

Switching to the blueking account and running a docker command confirms that it has no permission:

[root@bkce02 install]# su - blueking
[blueking@bkce02 ~]$ docker ps
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/json": dial unix /var/run/docker.sock: connect: permission denied

I can't yet confirm that the deployment actually runs under this account, but my gut says it is the culprit.

4. Fix the problem

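Since the blueking account was created by the deployment script, it should already have been added to the docker group. Yet it still can't run docker commands, so the problem is probably not as simple as a missing group membership: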

[blueking@bkce02 ~]$ groups
blueking docker
[blueking@bkce02 ~]$

Checking as blueking confirms that it is indeed in the docker group. So where else could the problem be?
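The same membership check can be scripted without switching accounts. This is a small sketch; `user_in_group` is a made-up helper name, and it relies only on the standard `id -nG` output (group names separated by spaces).

```shell
#!/usr/bin/env bash
# Succeed if the given user belongs to the given group.
# id -nG prints the user's group names separated by spaces;
# grep -xF matches one whole line literally.
user_in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Usage, run as root on the APPO host:
#   user_in_group blueking docker && echo "blueking is in docker"
```

Given the `groups` output above, the check succeeds for blueking and docker, which is exactly why the group membership can be ruled out as the cause.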

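If the account is in the docker group and still can't connect to /var/run/docker.sock, then the problem most likely lies in the permissions on the socket file itself. A look at the file confirms it: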

[root@bkce02 install]# ls -l /var/run/docker.sock
srw-rw----. 1 root root 0 Feb 10 16:57 /var/run/docker.sock

A normal docker.sock is owned by root:docker, but for some reason mine had ended up as root:root. No wonder membership in the docker group granted no access.
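The ownership check can also be written as a small function, which is handy when several APPO hosts need checking. This is a sketch under assumptions: `check_sock_group` is a hypothetical name, and it depends on GNU stat (`stat -c '%G'`), as found on typical Linux hosts.

```shell
#!/usr/bin/env bash
# Compare the group that owns a path with the expected group.
# Returns 0 on a match, 1 on a mismatch, 2 if the path cannot be read.
check_sock_group() {
    local path=$1 want=$2
    local got
    got=$(stat -c '%G' "$path") || return 2
    if [ "$got" = "$want" ]; then
        echo "ok: $path group is $got"
    else
        echo "mismatch: $path group is $got, expected $want"
        return 1
    fi
}

# Usage on the APPO host:
#   check_sock_group /var/run/docker.sock docker
```

The socket file is recreated whenever the Docker daemon starts, so it may be worth rerunning this check after a restart to see whether the ownership has reverted.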

The fix is simple, just change the ownership back:

[root@bkce02 install]# chown root:docker /var/run/docker.sock
[root@bkce02 install]# su - blueking
Last login: Thu Feb 10 18:59:10 CST 2022 on pts/2
[blueking@bkce02 ~]$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

As you can see, blueking gained access immediately after the change. All that remains is to rerun the deployment script:

[root@bkce01 install]# ./bk_install saas-o bk_iam

                     Deploy official saas bk_iam
2022-02-10 19:05:30 36   INFO   uploading file /data/src/official_saas/bk_iam_V1.4.24-bkofficial.tar.gz, url:http://paas.service.consul:80/saas/upload0/bk_iam/, headers: {'X-APP-CODE': 'bk_paas', 'X-APP-TOKEN': '6d1da7d2-69e1-42d1-bbf1-c90fbc7c1270'} ...
2022-02-10 19:05:30 37   INFO   http://paas.service.consul:80/saas/upload0/bk_iam/
2022-02-10 19:05:31 39   INFO   b'\n    <script>\n    window.parent.document.getElementById("import_msg").innerHTML=\n    " \xe4\xb8\x8a\xe4\xbc\xa0\xe6\x88\x90\xe5\x8a\x9f";\n    window.parent.document.getElementById("to_deploy").innerHTML=" \xe5\x8e\xbb\xe9\x83\xa8\xe7\xbd\xb2 "\n    </script>\n    '
2022-02-10 19:05:31 210  INFO   query saas_version_id: 1
2022-02-10 19:05:31 212  INFO   start deploy app:bk_iam url: http://paas.service.consul:80/saas/release/online0/1/
2022-02-10 19:05:31 53   INFO   start deploy bk_iam
2022-02-10 19:05:31 62   INFO   resposne: {'msg': 'SaaS App正式部署事件提交成功!', 'event_id': 'cc7ab210-50a2-46b6-bc6d-85903345c6f6', 'app_code': 'bk_iam', 'result': True}
2022-02-10 19:05:31 216  INFO   checking deploy result...
2022-02-10 19:05:33 74   INFO   check deploy result. retry 0
2022-02-10 19:05:35 74   INFO   check deploy result. retry 1
2022-02-10 19:05:37 74   INFO   check deploy result. retry 2
2022-02-10 19:05:39 74   INFO   check deploy result. retry 3
2022-02-10 19:05:41 74   INFO   check deploy result. retry 4
2022-02-10 19:05:43 74   INFO   check deploy result. retry 5
2022-02-10 19:05:45 74   INFO   check deploy result. retry 6
2022-02-10 19:05:47 74   INFO   check deploy result. retry 7
2022-02-10 19:05:49 74   INFO   check deploy result. retry 8
2022-02-10 19:05:51 74   INFO   check deploy result. retry 9
2022-02-10 19:05:53 74   INFO   check deploy result. retry 10
2022-02-10 19:05:56 74   INFO   check deploy result. retry 11
2022-02-10 19:05:58 74   INFO   check deploy result. retry 12
2022-02-10 19:06:00 74   INFO   check deploy result. retry 13
2022-02-10 19:06:02 74   INFO   check deploy result. retry 14
2022-02-10 19:06:04 74   INFO   check deploy result. retry 15
2022-02-10 19:06:04 80   INFO   bk_iam have been deployed successfully
[10.147.118.31]20220210-190604 177   SaaS application bk_iam has been deployed successfully

如果以上步骤没有报错, 已经完成 蓝鲸SaaS(bk_iam) 的部署

Installed successfully. My gut did not lie to me. (smug.gif)

PS: This has nothing to do with the xfs filesystem. My disks are formatted as xfs too, and deployment works fine.