A K8S Log Collection Solution in Practice
Flow Diagram
+-------------------+      +---------------------+      +------------------+  rotation script  +---------+
|                   |      |                     |      |                  |                   |         |
| Fluentd collector +----->+ FluentBit forwarder +--+-->+   Fluentd sink   +------------------>+ AWS S3  |
|                   |      |                     |  |   |                  |                   |         |
+-------------------+      +---------------------+  |   +------------------+                   +---------+
                                                     |
                                                     | http
                                                     |   +-----------------------------+
                                                     |   |                             |
                                                     +-->+ logExporter (log analysis)  |
                                                         |                             |
                                                         +-----------------------------+
Log Collection
Log collection is performed by Fluentd running inside the k8s cluster. Fluentd is deployed as a DaemonSet; it automatically tails the log files of every container under /var/log/containers and forwards their contents.
We did not use FluentBit for collection because Fluentd already has a ready-made configuration that can be used directly, sparing us from working out a new configuration and stepping into the same pitfalls again; if a ready-made FluentBit configuration for collecting k8s cluster logs appears later, it can simply be deployed in place of Fluentd.
Reference configuration:
# YAML source:
# https://github.com/fluent/fluentd-kubernetes-daemonset/blob/master/fluentd-daemonset-forward.yaml
# Image source and version reference:
# https://hub.docker.com/r/fluent/fluentd-kubernetes-daemonset/
# The file has been modified as follows:
# 1. Renamed the file and added the v1 suffix
# 2. Set the required environment variables
# 3. Pinned the image version to v1.11.5-debian-forward-1.0
# 4. Overrode the config files to shorten the collection interval; changed parts are marked with "# modification start"
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-daemonset-fluentconf
  namespace: kube-system
data:
  fluent.conf: |
    # AUTOMATICALLY GENERATED
    # DO NOT EDIT THIS FILE DIRECTLY, USE /templates/conf/fluent.conf.erb

    @include "#{ENV['FLUENTD_SYSTEMD_CONF'] || 'systemd'}.conf"
    @include "#{ENV['FLUENTD_PROMETHEUS_CONF'] || 'prometheus'}.conf"
    @include kubernetes.conf
    @include conf.d/*.conf

    <match **>
      @type forward
      @id out_fwd
      @log_level info
      <server>
        host "#{ENV['FLUENT_FOWARD_HOST']}"
        port "#{ENV['FLUENT_FOWARD_PORT']}"
      </server>
      # modification start
      <buffer>
        @type memory
        flush_at_shutdown true
        flush_mode interval
        flush_interval 1s
        flush_thread_count 4
        flush_thread_interval 0.1
        flush_thread_burst_interval 0.1
      </buffer>
      # modification end
    </match>
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-daemonset-kubernetesconf
  namespace: kube-system
data:
  kubernetes.conf: |
    # AUTOMATICALLY GENERATED
    # DO NOT EDIT THIS FILE DIRECTLY, USE /templates/conf/kubernetes.conf.erb

    <label @FLUENT_LOG>
      <match fluent.**>
        @type null
        @id ignore_fluent_logs
      </match>
    </label>

    <source>
      @type tail
      @id in_tail_container_logs
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag "#{ENV['FLUENT_CONTAINER_TAIL_TAG'] || 'kubernetes.*'}"
      exclude_path "#{ENV['FLUENT_CONTAINER_TAIL_EXCLUDE_PATH'] || use_default}"
      read_from_head true
      # modification start
      refresh_interval 10
      multiline_flush_interval 5
      # modification end
      <parse>
        @type "#{ENV['FLUENT_CONTAINER_TAIL_PARSER_TYPE'] || 'json'}"
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_minion
      path /var/log/salt/minion
      pos_file /var/log/fluentd-salt.pos
      tag salt
      <parse>
        @type regexp
        expression /^(?<time>[^ ]* [^ ,]*)[^\[]*\[[^\]]*\]\[(?<severity>[^ \]]*) *\] (?<message>.*)$/
        time_format %Y-%m-%d %H:%M:%S
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_startupscript
      path /var/log/startupscript.log
      pos_file /var/log/fluentd-startupscript.log.pos
      tag startupscript
      <parse>
        @type syslog
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_docker
      path /var/log/docker.log
      pos_file /var/log/fluentd-docker.log.pos
      tag docker
      <parse>
        @type regexp
        expression /^time="(?<time>[^)]*)" level=(?<severity>[^ ]*) msg="(?<message>[^"]*)"( err="(?<error>[^"]*)")?( statusCode=($<status_code>\d+))?/
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_etcd
      path /var/log/etcd.log
      pos_file /var/log/fluentd-etcd.log.pos
      tag etcd
      <parse>
        @type none
      </parse>
    </source>
    <source>
      @type tail
      @id in_tail_kubelet
      multiline_flush_interval 5s
      path /var/log/kubelet.log
      pos_file /var/log/fluentd-kubelet.log.pos
      tag kubelet
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_kube_proxy
      multiline_flush_interval 5s
      path /var/log/kube-proxy.log
      pos_file /var/log/fluentd-kube-proxy.log.pos
      tag kube-proxy
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_kube_apiserver
      multiline_flush_interval 5s
      path /var/log/kube-apiserver.log
      pos_file /var/log/fluentd-kube-apiserver.log.pos
      tag kube-apiserver
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_kube_controller_manager
      multiline_flush_interval 5s
      path /var/log/kube-controller-manager.log
      pos_file /var/log/fluentd-kube-controller-manager.log.pos
      tag kube-controller-manager
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_kube_scheduler
      multiline_flush_interval 5s
      path /var/log/kube-scheduler.log
      pos_file /var/log/fluentd-kube-scheduler.log.pos
      tag kube-scheduler
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_rescheduler
      multiline_flush_interval 5s
      path /var/log/rescheduler.log
      pos_file /var/log/fluentd-rescheduler.log.pos
      tag rescheduler
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_glbc
      multiline_flush_interval 5s
      path /var/log/glbc.log
      pos_file /var/log/fluentd-glbc.log.pos
      tag glbc
      <parse>
        @type kubernetes
      </parse>
    </source>

    <source>
      @type tail
      @id in_tail_cluster_autoscaler
      multiline_flush_interval 5s
      path /var/log/cluster-autoscaler.log
      pos_file /var/log/fluentd-cluster-autoscaler.log.pos
      tag cluster-autoscaler
      <parse>
        @type kubernetes
      </parse>
    </source>
    # Example:
    # 2017-02-09T00:15:57.992775796Z AUDIT: id="90c73c7c-97d6-4b65-9461-f94606ff825f" ip="104.132.1.72" method="GET" user="kubecfg" as="<self>" asgroups="<lookup>" namespace="default" uri="/api/v1/namespaces/default/pods"
    # 2017-02-09T00:15:57.993528822Z AUDIT: id="90c73c7c-97d6-4b65-9461-f94606ff825f" response="200"
    <source>
      @type tail
      @id in_tail_kube_apiserver_audit
      multiline_flush_interval 5s
      path /var/log/kubernetes/kube-apiserver-audit.log
      pos_file /var/log/kube-apiserver-audit.log.pos
      tag kube-apiserver-audit
      <parse>
        @type multiline
        format_firstline /^\S+\s+AUDIT:/
        # Fields must be explicitly captured by name to be parsed into the record.
        # Fields may not always be present, and order may change, so this just looks
        # for a list of key="\"quoted\" value" pairs separated by spaces.
        # Unknown fields are ignored.
        # Note: We can't separate query/response lines as format1/format2 because
        # they don't always come one after the other for a given query.
        format1 /^(?<time>\S+) AUDIT:(?: (?:id="(?<id>(?:[^"\\]|\\.)*)"|ip="(?<ip>(?:[^"\\]|\\.)*)"|method="(?<method>(?:[^"\\]|\\.)*)"|user="(?<user>(?:[^"\\]|\\.)*)"|groups="(?<groups>(?:[^"\\]|\\.)*)"|as="(?<as>(?:[^"\\]|\\.)*)"|asgroups="(?<asgroups>(?:[^"\\]|\\.)*)"|namespace="(?<namespace>(?:[^"\\]|\\.)*)"|uri="(?<uri>(?:[^"\\]|\\.)*)"|response="(?<response>(?:[^"\\]|\\.)*)"|\w+="(?:[^"\\]|\\.)*"))*/
        time_format %Y-%m-%dT%T.%L%Z
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
      @id filter_kube_metadata
      kubernetes_url "#{ENV['FLUENT_FILTER_KUBERNETES_URL'] || 'https://' + ENV.fetch('KUBERNETES_SERVICE_HOST') + ':' + ENV.fetch('KUBERNETES_SERVICE_PORT') + '/api'}"
      verify_ssl "#{ENV['KUBERNETES_VERIFY_SSL'] || true}"
      ca_file "#{ENV['KUBERNETES_CA_FILE']}"
      skip_labels "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_LABELS'] || 'false'}"
      skip_container_metadata "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_CONTAINER_METADATA'] || 'false'}"
      skip_master_url "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_MASTER_URL'] || 'false'}"
      skip_namespace_metadata "#{ENV['FLUENT_KUBERNETES_METADATA_SKIP_NAMESPACE_METADATA'] || 'false'}"
    </filter>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
    version: v1
    servicename: fluentdcollector
    clustername: pf-beta-b1k
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-logging
      version: v1
      servicename: fluentdcollector
      clustername: pf-beta-b1k
  template:
    metadata:
      labels:
        k8s-app: fluentd-logging
        version: v1
        servicename: fluentdcollector
        clustername: pf-beta-b1k
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.11.5-debian-forward-1.0
        env:
        - name: FLUENT_FOWARD_HOST
          value: "FLUENT_FOWARD_HOST"
        - name: FLUENT_FOWARD_PORT
          value: "FLUENT_FOWARD_PORT"
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluentdetckubernetesconf
          mountPath: /fluentd/etc/kubernetes.conf
          subPath: kubernetes.conf
        - name: fluentdetcfluentconf
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluentdetckubernetesconf
        configMap:
          name: fluentd-daemonset-kubernetesconf
      - name: fluentdetcfluentconf
        configMap:
          name: fluentd-daemonset-fluentconf
Log Forwarding
All collected logs are delivered to a single forwarding tier. Forwarding is handled by FluentBit, because FluentBit has higher throughput and a lower resource footprint.
The forwarding tier is responsible only for forwarding, including fan-out to multiple destinations, and contains as little log-processing logic as possible.
If performance becomes an issue, several forwarder replicas can be deployed behind a layer-4 load balancer.
If a new log-processing pipeline is needed later, it is added simply by adding another forwarding target to this configuration, which keeps the individual processing pipelines isolated from one another.
Reference configuration file:
[SERVICE]
# Flush
# =====
# Set an interval of seconds before to flush records to a destination
Flush 1
# Daemon
# ======
# Instruct Fluent Bit to run in foreground or background mode.
Daemon Off
# Log_Level
# =========
# Set the verbosity level of the service, values can be:
#
# - error
# - warning
# - info
# - debug
# - trace
#
# By default 'info' is set, that means it includes 'error' and 'warning'.
Log_Level info
# Parsers_File
# ============
# Specify an optional 'Parsers' configuration file
Parsers_File parsers.conf
Plugins_File plugins.conf
# HTTP Server
# ===========
# Enable/Disable the built-in HTTP Server for metrics
HTTP_Server Off
HTTP_Listen 0.0.0.0
HTTP_Port 2020
# [INPUT]
# Name cpu
# Tag cpu.local
# # Interval Sec
# # ====
# # Read interval (sec) Default: 1
# Interval_Sec 1
# [OUTPUT]
# Name stdout
# Match *
[INPUT]
Name forward
Listen 0.0.0.0
Port 24224
Buffer_Chunk_Size 2MB
Buffer_Max_Size 64MB
# Test message generator (for debugging)
# [INPUT]
# Name dummy
# Tag docker.fluentbitcollector
# Dummy {"log":"[Error ] this is dummy\n","stream":"stderr","attrs":{"tag":"docker.test.fluentbitforwoarder"},"time":"2020-11-07T10:59:53.399975037Z"}
# Rate 1
# [OUTPUT]
# Name stdout
# Match *
# Forward logs to the fluentd storage tier
[OUTPUT]
Name forward
Match *
Host fluentd.host.addr
Port 24225
# Require_ack_response True
# Forward logs to the logexporter analysis service
[OUTPUT]
Name http
Match *
Host logexporter.host.addr
Port 12203
URI /logs
Format json
Log Storage
We did not adopt elasticsearch or a similar stack for storage. Instead, the logs are aggregated by a single Fluentd instance and written to a local disk directory, split into one file per day; a rotation script that runs daily then syncs the older log files to AWS S3 and deletes them from the local disk (a sketch of such a script follows the Fluentd configuration below).
Reference configuration:
<source>
@type forward
@id input1
@label @mainstream
port 24225
</source>
<filter **>
@type stdout
</filter>
<label @mainstream>
# Container logs collected from docker & swarm
<match docker.**>
@type copy
<store>
@type file
@id output_docker
path /data/docker/fluentd/log/fluentd.${$.attrs.tag}.*.log
append true
<format>
@type single_value
add_newline false
message_key log
</format>
<buffer time, $.attrs.tag>
@type memory
flush_at_shutdown true
flush_thread_count 4
flush_thread_interval 0.1
flush_thread_burst_interval 0.1
# flush_mode immediate
flush_mode interval
flush_interval 1s
</buffer>
</store>
# <store>
# @type stdout
# </store>
</match>
# k8s container logs
<match kubernetes.**>
@type copy
<store>
@type file
@id output_k8s1
path /data/docker/fluentd/log/fluentd.k8s.${$.kubernetes.labels.clustername}.${$.kubernetes.labels.servicename}.*.log
append true
<format>
@type single_value
add_newline false
message_key log
</format>
<buffer time, $.kubernetes.labels.servicename, $.kubernetes.labels.clustername>
@type memory
flush_at_shutdown true
flush_thread_count 4
flush_thread_interval 0.1
flush_thread_burst_interval 0.1
# flush_mode immediate
flush_mode interval
flush_interval 1s
</buffer>
</store>
# <store>
# @type stdout
# </store>
</match>
# kubelet logs
<match kubelet>
@type copy
<store>
@type file
@id output_kubelet
path /data/docker/fluentd/log/fluentd.${tag}.${$._HOSTNAME}.*.log
append true
<format>
@type single_value
add_newline true
message_key MESSAGE
</format>
<buffer time, tag, $._HOSTNAME>
@type memory
flush_at_shutdown true
flush_thread_count 4
flush_thread_interval 0.1
flush_thread_burst_interval 0.1
# flush_mode immediate
flush_mode interval
flush_interval 1s
</buffer>
</store>
# <store>
# @type stdout
# </store>
</match>
# Default output rule
<match **>
@type file
@id output1
path /data/docker/fluentd/log/data.*.log
append true
<buffer time>
@type memory
flush_at_shutdown true
flush_thread_count 4
flush_thread_interval 0.1
flush_thread_burst_interval 0.1
# flush_mode immediate
flush_mode interval
flush_interval 1s
</buffer>
</match>
</label>
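The rotation script itself is not shown here. As a rough illustration of the steps it performs (upload the previous days' files to S3, then delete the local copies), here is a minimal sketch in Go using the AWS SDK; the bucket name, the one-day cutoff, and the flat object-key layout are hypothetical assumptions, while the log directory matches the Fluentd output path above. In practice the same job can just as well be a small shell script around the aws CLI, run from cron once a day.

// rotate.go -- a minimal sketch of the daily rotation job described above.
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

const (
	logDir = "/data/docker/fluentd/log" // matches the Fluentd output path above
	bucket = "my-log-archive-bucket"    // hypothetical bucket name
)

func main() {
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)
	cutoff := time.Now().Add(-24 * time.Hour) // hypothetical cutoff: anything older than a day

	entries, err := filepath.Glob(filepath.Join(logDir, "*.log"))
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range entries {
		info, err := os.Stat(path)
		if err != nil || info.ModTime().After(cutoff) {
			continue // skip files that are still being written to today
		}
		f, err := os.Open(path)
		if err != nil {
			log.Printf("open %s: %v", path, err)
			continue
		}
		// Upload the old log file, then remove the local copy.
		_, err = uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(filepath.Base(path)),
			Body:   f,
		})
		f.Close()
		if err != nil {
			log.Printf("upload %s: %v", path, err)
			continue
		}
		if err := os.Remove(path); err != nil {
			log.Printf("remove %s: %v", path, err)
		}
	}
}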
Log Analysis
Our log analysis component is written in golang. It accepts HTTP requests, scans the incoming logs for error-level markers, writes the matching error logs to a database, and feeds the resulting metrics to prometheus for alerting.
The collected logs are not only our own business logs; they may also include logs from other infrastructure components, so developing the analysis component ourselves gives us more flexibility and a better fit for our own requirements.
The error logs stored in the database can be queried quickly through a front-end tool we built ourselves; we can even hook up a DingTalk bot so that, once an alert fires, the related error logs are posted automatically to the on-call group for the engineer on duty to handle.
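As a rough sketch of how such a component can be wired together (not the actual logExporter implementation), the Go program below listens on the /logs endpoint targeted by the Fluent Bit http output above, counts lines carrying an error-level marker per service, and exposes the counters on /metrics for Prometheus to scrape and alert on. The record field names, the "[Error]" marker, and the metric name are hypothetical assumptions; writing matched lines to the database and the DingTalk notification are left out.

// logexporter_sketch.go -- a minimal sketch of the analysis service described above.
package main

import (
	"encoding/json"
	"net/http"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric name; registered with the default Prometheus registry.
var errorLines = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "logexporter_error_lines_total",
	Help: "Number of log lines containing an error-level marker.",
}, []string{"service"})

func logsHandler(w http.ResponseWriter, r *http.Request) {
	// Fluent Bit's http output with "Format json" posts a JSON array of records.
	var records []map[string]interface{}
	if err := json.NewDecoder(r.Body).Decode(&records); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	for _, rec := range records {
		line, _ := rec["log"].(string)
		if !strings.Contains(line, "[Error]") { // hypothetical error-level marker
			continue
		}
		// Pull the service name out of the kubernetes metadata added by the
		// kubernetes_metadata filter (see the collector configuration above).
		service := "unknown"
		if kube, ok := rec["kubernetes"].(map[string]interface{}); ok {
			if labels, ok := kube["labels"].(map[string]interface{}); ok {
				if s, ok := labels["servicename"].(string); ok {
					service = s
				}
			}
		}
		errorLines.WithLabelValues(service).Inc()
		// TODO: persist the matched line to the database for later lookup.
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/logs", logsHandler)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus for alerting
	http.ListenAndServe(":12203", nil)          // port taken from the Fluent Bit output above
}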