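Our company's URL monitoring used to be handled by Zabbix web scenarios. The upside is that Zabbix triggers are very easy to use. The downsides are:
- Integration with other systems is not convenient or flexible enough
- Item names have a length limit, so a full URL cannot be displayed
- The page content of a failed request cannot be recorded (you are only told that the match failed)
So I decided to try telegraf and write a URL monitoring plugin for the job. Because the data is submitted to influxdb, which accepts arbitrary tags and fields (metrics), it is easy to record the URL, the app the URL belongs to, the data returned on failure, and even the threshold match result.
Alerting uses kapacitor from the TICK stack, but this tool is not as simple and intuitive as Zabbix triggers. It takes some effort to understand, and a poor setup leads to frequent alerts. These are the problems I ran into at first:
- Without a sliding window (window), alerts fire very frequently
- With a sliding window, a single failed point in the window triggers an alert immediately, but OK is only triggered once the window contains nothing but normal points
- Using .all() improves item 2, but as soon as one normal point enters the window the OK state is triggered immediately (Zabbix has a hysteresis feature that lets you require n consecutive successes before going from Problem to OK)
- Using an influxqlnode to compute the mean discards the original fields
The alerting behavior I need: when every point in the window is abnormal, trigger the Problem state; when every point in the window has recovered, trigger the OK state.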
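To make the target behavior concrete, here is a minimal Python sketch of the Zabbix-style rule mentioned above: switch to Problem only after n consecutive failures and back to OK only after n consecutive successes. The function name `consecutive_hysteresis` and the choice n = 4 are assumptions for illustration; this is not kapacitor code.

```python
def consecutive_hysteresis(points, n):
    """Return (index, state) transitions for a 0/1 series.

    0 means the check failed, 1 means it passed. The state only flips
    after n consecutive identical results, otherwise it is held.
    """
    state = "OK"
    run_value, run_len = None, 0
    transitions = []
    for i, p in enumerate(points):
        # Track the length of the current run of identical values.
        if p == run_value:
            run_len += 1
        else:
            run_value, run_len = p, 1
        if run_len >= n:
            new_state = "OK" if p == 1 else "PROBLEM"
            if new_state != state:
                transitions.append((i, new_state))
                state = new_state
    return transitions


series = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
print(consecutive_hysteresis(series, 4))
```

With this series the single 0 near the end never produces a Problem, and the recovery is only reported after four passing checks in a row.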
Below are my test results from the last few days (based on versions 0.13.1 and 1.0-rc2) and my current solution.
Test data
#!/bin/bash
influx="http://localhost:8086/write?db=test"
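# Longer sequence with a sustained failure; disabled because the assignment below would overwrite it anyway
# i1=(1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0)
i1=(1 1 0 1 0 1 1 1 1 1)
# Write one point per second; code_match is 1 when the check passed and 0 when it failed
for id in "${i1[@]}"; do
    curl -i -XPOST "$influx" --data-binary "ka,app=cmdb code_match=$id,http_code=200"
    sleep 1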
done
Simple sequence test
TICKscript:
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
origin
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.crit(lambda: "code_match" == 0)
.log('/tmp/alerts.log')
Test data
i1=(1 1 0 1 0 1 1 1 1 1)
Result
[root@repo ~]# cat /tmp/alerts.log |jq . |grep "level"
"level": "CRITICAL",
"level": "OK",
"level": "CRITICAL",
"level": "OK",
Conclusion:
- Every point is matched in order, so the state changes frequently.
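The flapping can be reproduced offline with a short Python sketch. It is a simplified model that records only level changes, which is enough for this test series; the function name `per_point_levels` is made up for illustration.

```python
def per_point_levels(points):
    """Evaluate the crit condition (code_match == 0) on every single point
    and record each change of alert level."""
    level = "OK"
    events = []
    for p in points:
        new_level = "CRITICAL" if p == 0 else "OK"
        if new_level != level:
            events.append(new_level)
            level = new_level
    return events


print(per_point_levels([1, 1, 0, 1, 0, 1, 1, 1, 1, 1]))
# ['CRITICAL', 'OK', 'CRITICAL', 'OK']
```

Two isolated failures are enough to produce four notifications, which matches the four levels in the log above.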
Test with a window
TICKscript:
var win = 5s
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
origin
|window()
.period(win)
.every(1s)
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.crit(lambda: "code_match" == 0)
.log('/tmp/alerts.log')
Test data
i1=(1 1 0 1 0 1 1 1 1 1)
Result
1,1,0 "CRITICAL" 0
1,1,0,1 "CRITICAL" 0
1,0,1,0 "CRITICAL" 0
0,1,0,1 "CRITICAL" 0
1,0,1,1 "CRITICAL" 2034708173
0,1,1,1 "CRITICAL" 2034708173
1,1,1,1 "OK" 7107346394
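Conclusion:
- As soon as one 0 enters the window, the alert stays in the crit state, and the duration does not change until that 0 has left the window
- OK is only triggered once the window contains nothing but 1s
- The reported outage time is 7s, counted from the first 0. The actual outage is arguably only the 3s covered by the sequence
0 1 0
so the actual outage time is roughly 7 - win + 1 (with win in seconds, 7 - 5 + 1 = 3).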
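The window behavior can be approximated in Python. This sketch assumes a count-based trailing window of up to four points, because the results above show four points per window for a 5s period with one point per second; the function name `any_match_events` is hypothetical, and durations are not modeled.

```python
from collections import deque


def any_match_events(points, size):
    """Alert on a sliding window where a single failing point is enough.

    Returns (window, level) for every evaluation that is CRITICAL,
    plus the evaluation where the level goes back to OK.
    """
    window = deque(maxlen=size)
    level = "OK"
    events = []
    for p in points:
        window.append(p)
        new_level = "CRITICAL" if 0 in window else "OK"
        if new_level == "CRITICAL" or new_level != level:
            events.append((tuple(window), new_level))
        level = new_level
    return events


for window, level in any_match_events([1, 1, 0, 1, 0, 1, 1, 1, 1, 1], 4):
    print(",".join(map(str, window)), level)
```

The sketch yields the same seven window contents and levels as the test: six CRITICAL evaluations followed by one OK once the last 0 has left the window.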
all() test
TICKscript:
var win = 5s
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
origin
|window()
.period(win)
.every(1s)
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.all()
.crit(lambda: "code_match" == 0)
.log('/tmp/alerts.log')
Test data
i1=(1 1 0 1 0 1 1 1 1 1)
Result
No alert
Conclusion:
- The alert only fires when every point in a window satisfies the condition
Test data 2
i1=(1 0 0 0 0 1 1 1 1 1 0 1 1)
Result
0,0,0,0 "CRITICAL" 0
0,0,0,1 "OK" 1022011392
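Conclusion:
- The alert only fires when every point in a window satisfies the condition
- After the alert has fired, a single 1 entering the window immediately ends the crit state
- The duration is 1s, which is less than the actual outage time. It should be
1s + win - 1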
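The same kind of offline model shows why .all() recovers too early. Again this is only a sketch with an assumed count-based window of up to four points and a made-up function name `all_match_events`.

```python
from collections import deque


def all_match_events(points, size):
    """Alert only when every point in the trailing window fails (value 0)."""
    window = deque(maxlen=size)
    level = "OK"
    events = []
    for p in points:
        window.append(p)
        new_level = "CRITICAL" if all(v == 0 for v in window) else "OK"
        if new_level == "CRITICAL" or new_level != level:
            events.append((tuple(window), new_level))
        level = new_level
    return events


print(all_match_events([1, 1, 0, 1, 0, 1, 1, 1, 1, 1], 4))
print(all_match_events([1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1], 4))
```

The first series never fills a window with failures, so nothing fires. In the second series the first passing point breaks the all-zero condition, so OK follows one evaluation after CRITICAL.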
influxql test
TICKscript:
var win = 5s
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
origin
|window()
.period(win)
.every(1s)
|stddev('code_match')
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.all()
.crit(lambda: "stddev" > 0)
.log('/tmp/alerts.log')
Conclusion:
- The original fields are discarded, which does not meet the requirement, so I abandoned this approach
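The field loss is easy to picture with a toy aggregation in Python. The helper `stddev_node` is hypothetical and uses the sample standard deviation as an assumption; the point is only that the output carries the aggregate and nothing else.

```python
from statistics import stdev


def stddev_node(window, field):
    """Collapse a window of points into one point that holds only the aggregate."""
    return {"stddev": stdev(p[field] for p in window)}


window = [
    {"code_match": 0, "http_code": 200},
    {"code_match": 0, "http_code": 200},
    {"code_match": 1, "http_code": 200},
    {"code_match": 1, "http_code": 200},
]
out = stddev_node(window, "code_match")
print(sorted(out))  # ['stddev']
```

The http_code field is no longer available to the alert message or the log once the window has been reduced this way.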
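kapacitor 1.0 reset expressions
I filed an issue on GitHub (#863). After getting a reply I tried the reset expressions (they feel a lot like flapping), but they still require an influxqlnode, so the original fields still disappear.
joinNode
I kept experimenting on my own, and in the end a joinNode let me use an influxqlnode computation while still keeping the fields.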
var win = 5s
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
|window()
.period(win)
.every(1s)
var code_match = origin
|mean('code_match')
var http_code = origin
|last('http_code')
code_match
|join(http_code)
.as('code_match', 'http_code')
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.stateChangesOnly()
.crit(lambda: "code_match.mean" == 0)
.critReset(lambda: "code_match.mean" == 1)
.log('/tmp/alerts.log')
Test data
i1=(1 0 0 0 0 1 1 1 1 1 0 1 1)
The result shows that no fields are missing and the duration is more accurate as well. Perfect.
{
"data": {
"series": [
{
"values": [
[
"2016-09-04T16:16:35.686926673Z",
0,
0,
200
]
],
"columns": [
"time",
"code_match.mean",
"http_code.code_match",
"http_code.last"
],
"tags": {
"app": "cmdb"
},
"name": "ka"
}
]
},
"level": "CRITICAL",
"duration": 0,
"time": "2016-09-04T16:16:35.686926673Z",
"details": "{"Name":"ka","TaskName":"test","Group":"app=cmdb","Tags":{"app":"cmdb"},"ID":"HTTP_CODE:cmdb","Fields":{"code_match.mean":0,"http_code.code_match":0,"http_code.last":200},"Level":"CRITICAL","Time":"2016-09-04T16:16:35.686926673Z","Message":""}\n",
"message": "",
"id": "HTTP_CODE:cmdb"
}
{
"data": {
"series": [
{
"values": [
[
"2016-09-04T16:16:39.83037438Z",
1,
1,
200
]
],
"columns": [
"time",
"code_match.mean",
"http_code.code_match",
"http_code.last"
],
"tags": {
"app": "cmdb"
},
"name": "ka"
}
]
},
"level": "OK",
"duration": 4143447707,
"time": "2016-09-04T16:16:39.83037438Z",
"details": "{"Name":"ka","TaskName":"test","Group":"app=cmdb","Tags":{"app":"cmdb"},"ID":"HTTP_CODE:cmdb","Fields":{"code_match.mean":1,"http_code.code_match":1,"http_code.last":200},"Level":"OK","Time":"2016-09-04T16:16:39.83037438Z","Message":""}\n",
"message": "",
"id": "HTTP_CODE:cmdb"
}
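The join approach can be modeled in Python as two aggregates over the same window merged into one point, followed by a crit condition and a separate reset condition. This is a sketch under assumptions: a count-based window of up to four points, a hypothetical function name `join_alert_events`, and a reset that leads straight back to OK because no other level is defined.

```python
from collections import deque
from statistics import mean


def join_alert_events(points, size):
    """Return (level, fields) for each state change.

    fields merges mean(code_match) and last(http_code) from the same window,
    so the alert keeps http_code next to the aggregate.
    """
    window = deque(maxlen=size)
    level = "OK"
    events = []
    for point in points:
        window.append(point)
        fields = {
            "code_match.mean": mean(p["code_match"] for p in window),
            "http_code.last": window[-1]["http_code"],
        }
        if fields["code_match.mean"] == 0:      # crit: every point failed
            new_level = "CRITICAL"
        elif fields["code_match.mean"] == 1:    # critReset: every point passed
            new_level = "OK"
        else:                                   # mixed window: hold the level
            new_level = level
        if new_level != level:
            events.append((new_level, fields))
            level = new_level
    return events


series = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
points = [{"code_match": v, "http_code": 200} for v in series]
for level, fields in join_alert_events(points, 4):
    print(level, fields)
```

In this model CRITICAL is raised at the fifth point and OK at the ninth, four points later, which is in line with the roughly 4.1s duration in the log above.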
Using fields as variables
TICKscript:
var win = 5s
var origin = stream
|from()
.database('test')
.retentionPolicy('default')
.measurement('ka')
.groupBy('app')
|window()
.period(win)
.every(1s)
origin
|alert()
.id('HTTP_CODE:{{ index .Tags "app" }}')
.message('')
.stateChangesOnly()
.all()
.crit(lambda: "code_match" < "count")
.log('/tmp/alerts.log')
Test data and script:
i1=(1 0 0 0 0 1 1 1 1 1 0 1 1)
# Each point carries its own threshold in the count field
for id in "${i1[@]}"; do
    curl -i -XPOST "$influx" --data-binary "ka,app=cmdb code_match=$id,http_code=200,count=0.5"
    sleep 1
done
Result:
{
"data": {
"series": [
{
"values": [
[
"2016-09-05T06:24:31.532711141Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:32.621932436Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:33.64890981Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:34.686454948Z",
0,
0.5,
200
]
],
"columns": [
"time",
"code_match",
"count",
"http_code"
],
"tags": {
"app": "cmdb"
},
"name": "ka"
}
]
},
"level": "CRITICAL",
"duration": 0,
"time": "2016-09-05T06:24:35.686454948Z",
"details": "{"Name":"ka","TaskName":"test","Group":"app=cmdb","Tags":{"app":"cmdb"},"ID":"HTTP_CODE:cmdb","Fields":{"code_match":0,"count":0.5,"http_code":200},"Level":"CRITICAL","Time":"2016-09-05T06:24:35.686454948Z","Message":""}\n",
"message": "",
"id": "HTTP_CODE:cmdb"
}
{
"data": {
"series": [
{
"values": [
[
"2016-09-05T06:24:32.621932436Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:33.64890981Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:34.686454948Z",
0,
0.5,
200
],
[
"2016-09-05T06:24:35.705047313Z",
1,
0.5,
200
]
],
"columns": [
"time",
"code_match",
"count",
"http_code"
],
"tags": {
"app": "cmdb"
},
"name": "ka"
}
]
},
"level": "OK",
"duration": 1018592365,
"time": "2016-09-05T06:24:36.705047313Z",
"details": "{"Name":"ka","TaskName":"test","Group":"app=cmdb","Tags":{"app":"cmdb"},"ID":"HTTP_CODE:cmdb","Fields":{"code_match":0,"count":0.5,"http_code":200},"Level":"OK","Time":"2016-09-05T06:24:36.705047313Z","Message":""}\n",
"message": "",
"id": "HTTP_CODE:cmdb"
}
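Conclusion:
- Fields can be compared with each other numerically
- This can be used to define different thresholds per tag. For example, when monitoring for abnormal traffic, the user can define by what factor the traffic has to grow before it counts as abnormal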
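A small Python model of the field-to-field comparison, assuming a count-based window of four points that must be full before it is evaluated. The function name `threshold_events` is made up; the field names code_match and count are the ones used in the test script.

```python
from collections import deque


def threshold_events(points, size):
    """CRITICAL when every point in a full window has code_match < count,
    where count is the threshold shipped inside the same point."""
    window = deque(maxlen=size)
    level = "OK"
    events = []
    for point in points:
        window.append(point)
        breached = len(window) == size and all(
            p["code_match"] < p["count"] for p in window
        )
        new_level = "CRITICAL" if breached else "OK"
        if new_level != level:
            events.append((new_level, [p["code_match"] for p in window]))
            level = new_level
    return events


series = [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
points = [{"code_match": v, "count": 0.5, "http_code": 200} for v in series]
print(threshold_events(points, 4))
```

Because the threshold travels with the data, each app could send a different count value without any change to the alert rule.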
References
1. https://docs.influxdata.com/kapacitor/v0.13/
2. https://groups.google.com/forum/m/#!topic/influxdb/qg9on1deA_8
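A question for the author: can kapacitor do day-over-day alerting, that is, compare the data at this moment today with the data at the same moment yesterday and alert if the difference is below or above some value?
It should be possible.
kapacitor really is too much trouble.
It is indeed less intuitive than Zabbix triggers.
Great post! I run a QQ group for discussing Influxdata (TICK) questions. Feel free to join if you are interested: 100630295
Pay attention to the DOT source shown by kapacitor show. You can render it as an image, which is very useful for debugging. When several alertNodes sit under one stream they depend on each other, alert1->alert2->alert3..., meaning alert2 can only fire after alert1 has fired; alert2 cannot fire on its own.
The joinNode feels like it creates a new stream (sampled from the window of the original stream).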