Error Handling in Pacemaker Resource Agents
Published: 2019-06-25


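1. Introduction
Pacemaker controls resources by calling the operations (such as start and stop) that each resource agent provides. When one of these operations fails, Pacemaker applies different error handling depending on which operation was executed and what type of error was returned.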

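2. Error types

Pacemaker classifies errors into three types: soft, hard and fatal. The latter two indicate environment or configuration problems that cannot be repaired automatically without manual intervention. Ordinary failures, such as a crashed service process or an unreachable network, generally use OCF_ERR_GENERIC as the return value, and OCF_ERR_GENERIC belongs to the soft type.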


B.3. How are OCF Return Codes Interpreted?

The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and recovery action is initiated.
There are three types of failure recovery:

Table B.3. Types of recovery performed by the cluster

| Type | Description | Action Taken by the Cluster |
|------|-------------|-----------------------------|
| soft | A transient error occurred | Restart the resource or move it to a new location |
| hard | A non-transient error that may be specific to the current node occurred | Move the resource elsewhere and prevent it from being retried on the current node |
| fatal | A non-transient error that will be common to all cluster nodes (eg. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |

Assuming an action is considered to have failed, the following table outlines the different OCF return codes and the type of recovery the cluster will initiate when it is received.

B.4. OCF Return Codes

Table B.4. OCF Return Codes and their Recovery Types

| RC | OCF Alias | Description | RT |
|----|-----------|-------------|----|
| 0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
| 1 | OCF_ERR_GENERIC | Generic "there was a problem" error code. | soft |
| 2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine. Eg. refers to a location/tool not found on the node. | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
| 4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
| 5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
| 6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid. Eg. required parameters are missing. | fatal |
| 7 | OCF_NOT_RUNNING | The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action. | N/A |
| 8 | OCF_RUNNING_MASTER | The resource is running in Master mode. | soft |
| 9 | OCF_FAILED_MASTER | The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
| other | NA | Custom error code. | soft |

Although counterintuitive, even actions that return 0 (aka. OCF_SUCCESS) can be considered to have failed.
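The mapping in Table B.4 can be written down as a small lookup, which also makes the last remark concrete: a result counts as a failure whenever it differs from the expected value, even if it is 0. This is only an illustrative sketch. The names `OCF_RETURN_CODES`, `recovery_type` and `is_failure` are made up for this example and are not part of Pacemaker, and the "monitor on a master expects 8" case is an assumed scenario.

```python
# Return codes and recovery types copied from Table B.4.
OCF_RETURN_CODES = {
    0: ("OCF_SUCCESS", "soft"),
    1: ("OCF_ERR_GENERIC", "soft"),
    2: ("OCF_ERR_ARGS", "hard"),
    3: ("OCF_ERR_UNIMPLEMENTED", "hard"),
    4: ("OCF_ERR_PERM", "hard"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
    7: ("OCF_NOT_RUNNING", "N/A"),
    8: ("OCF_RUNNING_MASTER", "soft"),
    9: ("OCF_FAILED_MASTER", "soft"),
}


def recovery_type(rc):
    """Recovery type for a failed action; codes not in the table count as 'soft'."""
    _alias, rtype = OCF_RETURN_CODES.get(rc, ("custom", "soft"))
    return rtype


def is_failure(rc, expected_rc):
    """An action has failed when its result differs from the expected one."""
    return rc != expected_rc


# Assumed scenario: a monitor on a master instance is expected to return 8,
# so a plain 0 (OCF_SUCCESS) is treated as a failure.
print(is_failure(0, expected_rc=8), recovery_type(0))
print(recovery_type(6))
```

Keeping the table as data rather than as a chain of conditionals makes it easy to compare against the manual row by row.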

3. Error handling

Each operation of a resource has an on-fail property that controls how a failure of that operation is handled.


Table 5.3. Properties of an Operation

| Field | Description |
|-------|-------------|
| id | Your name for the action. Must be unique. |
| name | The action to perform. Common values: monitor, start, stop |
| interval | How frequently (in seconds) to perform the operation. Default value: 0, meaning never. |
| timeout | How long to wait before declaring the action has failed. |
| on-fail | The action to take if this action ever fails. Allowed values: ignore - Pretend the resource did not fail; block - Don’t perform any further operations on the resource; stop - Stop the resource and do not start it elsewhere; restart - Stop the resource and start it again (possibly on a different node); fence - STONITH the node on which the resource failed; standby - Move all resources away from the node on which the resource failed. The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop. |
| enabled | If false, the operation is treated as if it does not exist. Allowed values: true, false |
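As a rough illustration of how these fields fit together, the sketch below builds an op element of the kind found in the CIB and rejects on-fail values that are not in the table above. The helper `make_op` and the id `dummy-monitor-10s` are hypothetical names chosen for this example, and the interval and timeout values are arbitrary.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values, taken from Table 5.3.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def make_op(op_id, name, interval, timeout, on_fail):
    """Build an <op> element, validating on-fail against the documented values."""
    if on_fail not in ALLOWED_ON_FAIL:
        raise ValueError("unsupported on-fail value: %r" % on_fail)
    return ET.Element("op", {
        "id": op_id,
        "name": name,
        "interval": interval,
        "timeout": timeout,
        "on-fail": on_fail,
    })


# Example: a recurring monitor that restarts the resource on failure.
op = make_op("dummy-monitor-10s", "monitor", "10s", "20s", "restart")
print(ET.tostring(op, encoding="unicode"))
```

Validating the value before it reaches the cluster is a design choice for the sketch only; Pacemaker performs its own validation when the configuration is loaded.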


 

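However, testing revealed two problems, or rather bugs.

Issue 1:

  On the older Pacemaker (1.1.7), the behavior never changes no matter how on-fail is set; the default behavior always applies. On the more recent Pacemaker 1.1.14 this problem does not occur and on-fail takes effect.

Issue 2:

  I made each operation of the Resource Agent return OCF_ERR_GENERIC in turn and observed how the resource manager reacted. The default on-fail behavior turned out not to match the manual, which says "The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop." The observed behavior is listed below. It is more reasonable than the documented behavior, so this can be regarded as a bug in the Pacemaker manual.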

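| Operation | Error handling | Corresponding on-fail value |
|-----------|----------------|-----------------------------|
| start | (1) Set fail-count=1000000. (2) Call stop on this node. (3) Start the resource on another node. | restart |
| stop | (1) Set fail-count=1000000. (2) Block further operations on the resource; it enters the unmanaged FAILED state, for example: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
| monitor | (1) Set fail-count+=1. (2) Call stop, start and monitor in turn on this node. If monitor still fails, repeat stop, start, monitor until fail-count reaches migration-threshold, then leave the resource stopped. | restart |
| promote | (1) Set fail-count+=1. (2) Call demote, stop and start in turn on this node. (3) Call promote on another node to promote the resource there to master. | restart |
| demote | (1) Set fail-count+=1. (2) Call stop, start and demote in turn on this node. If demote still fails, repeat stop, start, demote until fail-count reaches migration-threshold, then leave the resource stopped. | restart |
| notify | Ignored | ignore |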
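The monitor row can be modelled with a few lines of code. The sketch below is a simplified model of the observed behavior, not Pacemaker's actual scheduler: the function `simulate_monitor_failures` and its arguments are invented for this illustration, and each entry in `monitor_results` stands for one monitor call (True meaning success).

```python
def simulate_monitor_failures(migration_threshold, monitor_results):
    """Replay monitor results and return (fail_count, final_state, action_log)."""
    fail_count = 0
    log = []
    for ok in monitor_results:
        log.append("monitor:ok" if ok else "monitor:fail")
        if ok:
            # The recovery worked; the resource stays on this node.
            return fail_count, "started", log
        fail_count += 1
        log.append("stop")
        if fail_count >= migration_threshold:
            # Threshold reached: the resource is left stopped on this node.
            return fail_count, "stopped", log
        log.append("start")
    return fail_count, "started", log


# A monitor that never recovers, with migration-threshold=3.
count, state, actions = simulate_monitor_failures(3, [False] * 10)
print(count, state)
print(actions)
```

With the default migration-threshold of INFINITY the loop in this model never reaches the "stopped" branch, which is the endless stop, start, monitor cycle described in the table.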

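Note 1: A timeout is handled in the same way as OCF_ERR_GENERIC.

Note 2: Pacemaker does not call the post-stop notify for a resource that is already stopped.

Note 3: Test environments: Pacemaker 1.1.7-6 + CentOS 6.3 and Pacemaker 1.1.14 + CentOS 6.3.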

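4. Takeaways

The test results above suggest a few guidelines for Resource Agent authors:

  1. Do not let the stop operation return an error unless it is truly necessary.
  2. Keep the checks in monitor and start consistent: a monitor executed immediately after a successful start should never fail, otherwise a loop may result.
  3. A demote executed after a successful restart should not fail, otherwise a loop may result.
  4. Setting migration-threshold to a fairly small value (the default is INFINITY, i.e. 1000000) also reduces the impact of points 2 and 3.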
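Guidelines 1 and 2 can be sketched with a toy agent in the spirit of the Dummy agent, which tracks its state with a file. This is a minimal sketch under stated assumptions, not a complete OCF agent: it omits meta-data and validation, `STATE_FILE` is a hypothetical path, and `run_action` is an invented dispatcher.

```python
import os
import tempfile

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Hypothetical state file standing in for the real service.
STATE_FILE = os.path.join(tempfile.gettempdir(), "sketch-agent.state")


def monitor(state_file=STATE_FILE):
    """Running if and only if the state file exists."""
    return OCF_SUCCESS if os.path.exists(state_file) else OCF_NOT_RUNNING


def start(state_file=STATE_FILE):
    """Start, then report success only if monitor agrees (guideline 2)."""
    with open(state_file, "w") as handle:
        handle.write("running\n")
    return OCF_SUCCESS if monitor(state_file) == OCF_SUCCESS else OCF_ERR_GENERIC


def stop(state_file=STATE_FILE):
    """Idempotent stop: an already stopped resource is a success (guideline 1)."""
    try:
        os.remove(state_file)
    except FileNotFoundError:
        pass  # nothing to stop, which is not an error
    return OCF_SUCCESS if monitor(state_file) == OCF_NOT_RUNNING else OCF_ERR_GENERIC


def run_action(action, state_file=STATE_FILE):
    """Dispatch an action name to its handler."""
    handlers = {"start": start, "stop": stop, "monitor": monitor}
    if action not in handlers:
        return OCF_ERR_UNIMPLEMENTED
    return handlers[action](state_file)
```

Because start and stop both decide their result by calling monitor, the three operations cannot disagree about whether the resource is running, which is what prevents the stop, start, monitor loop.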

Reposted from: http://gijoo.baihongyu.com/
