Q-Learning Based Failure Detection and Self-Recovery Algorithm for Multi-Robot Domains

Task allocation is an essential part of multi-robot coordination research and plays a significant role in achieving the desired system performance. Uncertainties inherent in the working environments of multi-robot systems are the major hurdle for perfect coordination. When learning-based task allocation approaches are used, robots first learn about their working environment and then benefit from this experience in future task allocation. These approaches provide useful solutions as long as environmental conditions remain unchanged. If permanent changes in the environment characteristics or a failure in the multi-robot system occur, as in disaster response, the previously learned information becomes invalid. At this point, the most important mission is to detect the failure and to return the system to its initial learning state. For this purpose, a Q-learning based failure detection and self-recovery algorithm is proposed in this study. In this approach, the multi-robot system checks whether the variations are permanent and, if required, returns to the learning state. It thus provides a dynamic task allocation procedure that copes well with unforeseen situations. The experimental results verify that the proposed algorithm offers efficient solutions for the multi-robot task allocation problem even in cases of systemic failure.


I. INTRODUCTION
In recent years, multi-robot systems (MRS) have attracted growing interest in many areas, ranging from small indoor applications such as home or office service and museum guiding to more complex and sometimes dangerous fields such as search-and-rescue, fire fighting, underwater research, and mining. MRS provide concurrent processing and faster task execution, distributed sensing and acting, and a system architecture that is robust against failures [1]. The key to benefiting from these advantages and reaching the desired system performance is precise and accurate multi-robot coordination [2]. In most real-life applications, not all tasks can be accomplished because the number of robots and their capabilities are limited [3]. This shows both the effect of efficient coordination mechanisms on system performance and the need for them.
Multi-robot task allocation (MRTA) forms a basis for multi-robot coordination studies. MRTA is defined as the assignment of tasks to suitable robots in an appropriate order with the aim of optimizing system performance [4]. The auction protocol is one of the strategies given in the literature for solving MRTA problems [5]. In auction-based MRTA, tasks are treated as items to be sold and are announced to the robots, which act as bidders. Each robot sends a bid representing the cost or profit of the task for itself. In mobile robot studies, bid values are generally calculated in terms of distance or time [6]. The winning robot is determined from the bid values by maximizing utility or minimizing cost for the overall system [7], [8]. System coordination is thus realized in a centralized manner, although each robot has its own decision-making mechanism. This is the major advantage of auction-based strategies [9].
Manuscript received 26 May, 2018; accepted 8 November, 2018.
Most existing task allocation solutions are proposed for applications that do not contain any uncertainty [10], whereas in real applications robots have difficulty obtaining complete information about their working environment because of various ambiguities [11]. In many cases, information about the order and frequency in which tasks appear is not accessible because the working environment is partially observable and dynamic. Moreover, robots cannot predict their teammates' behaviour because each has an independent decision-making mechanism. This is why a perfect plan for system coordination is not possible [12]. To examine this problem, the kinds of uncertainties and their origins are investigated in [10], which proposes a task allocation approach based on interval data and applies it to various levels of uncertainty in search-and-rescue tasks in disaster cases, a good example of dynamic environments. It is claimed that on-line task allocation methods are much more successful than off-line methods against the non-modelled characteristics of dynamic environments, such as multi-robot patrolling tasks [13]. Auction-based task allocation approaches are efficient owing to their dynamic structure [13], [14].
In the studies mentioned above, the proposed approaches either use instant decisions or actions of robots [13] or require the uncertainties to be modelled [10], [14]. However, this is not feasible in real applications because of the nondeterministic features of environments, especially in disaster areas [15], [16]. To ensure optimized system coordination, robots must adapt to changing environmental conditions and rearrange their decisions and actions. This becomes possible only if robots are equipped with learning abilities [12]. In this way, MRS provide an adaptive and more reliable system architecture against unpredictable situations [2]. A learning-based behaviour selection approach for noisy and dynamic MRS environments is studied in [17], and successful results are obtained. An effective use of reinforcement learning for fire disaster response, which is a good example of a dynamic task allocation problem, is examined in [18]. Reference [15] applies a learning-based approach to MRTA problems and tries to reason about the future through task commitment in oversubscribed domains, such as fire-fighting disasters. In another study, robots learn opportunity costs, which are used as bid values in the auction process, in underwater exploration, a dynamic and unknown environment [19].
In most real-time MRS applications, tasks arise at unpredictable time steps during execution and are assigned to robots instantaneously. Especially in disaster-like environments, tasks must be done as quickly as possible, although robots first have to clear many hurdles. These temporal and ordering constraints are described as time-extended task allocation, which is added to Gerkey and Mataric's [4] classical MRTA taxonomy in [16]. Similarly, to overcome time constraints, task allocation is achieved by a rescheduling procedure in a time-extended manner [15], [20].
In this study, an auction-based instantaneous task allocation approach is used. Tasks appear in a random sequence and at unpredictable time steps during execution, and they have to be assigned to the robots immediately. This is possible only if the robots are not busy with another task at that moment. When there is a hierarchical order among tasks, a crucial problem arises in achieving the desired system performance. For example, a high-order task that is announced while all robots capable of it are executing low-order tasks cannot be assigned to any robot. To solve this problem, a learning-based MRTA approach similar to [3], [12] is used. In this approach, robots use their past experience in future task allocation by learning to reason about task sequences. For this purpose, Q-learning is preferred; it is widely used for MRS because it does not require an environment model and is easy to apply, especially in dynamic environments [21].
The learning-based MRTA approach used here improves system performance as long as no major change occurs in the environmental conditions. In addition, it tolerates small environmental changes owing to its learning ability [3]. However, in the case of a failure in the characteristics of the working environment or in the structure of the MRS, the previously learned information becomes invalid. In a disaster such as an earthquake [10] or a fire [15], great modifications occur in the sequence and ordering of the tasks [22], [23]. Moreover, a catastrophic system failure, i.e. some faulty robots going out of order permanently, causes an irretrievable decrease in system performance [13]. Detecting such failure cases and adapting the robots' decision-making and acting mechanisms is a major problem for real-time MRS applications.
In this study, a Q-learning based Failure Detection and Self-Recovery (FDSR) algorithm is proposed to overcome the problem mentioned above. In the scenario designed as the application environment, an extensive disruption in system characteristics occurs during execution, i.e. changes in the priority and ordering of tasks and in their occurrence frequency. The FDSR algorithm detects the failure cases and recovers the system to a reliable state, which means that robots repeat the learning process under the new conditions. The novelty of this paper is that the proposed algorithm provides an adaptive task allocation procedure for a dynamic system structure and substantially improves system performance even in disaster cases by detecting systemic or environmental failures.
The paper is organized as follows. Section II gives brief information about Q-learning theory. In Section III, the problem examined in this study is stated. In Section IV, the proposed FDSR algorithm is presented. The application environment, experimental results, and analysis are given in Section V. The paper ends with conclusions in Section VI.

II. Q-LEARNING THEORY
Reinforcement learning (RL) is a machine learning approach that maps situations to actions by using reward signals. It does not need any supervisory information or any input-output relationship [24]. The environment transits to the next state in response to the agent's current action and sends a reward signal to the agent. This reward signal represents how the action affects the environment. RL approaches are widely used in MRS applications because they work through trial and error with no system model requirement and are suitable for dynamic environments [25]. ... of action $a_k$ [24]. The aim of the agent is to maximize the discounted sum of the expected reward at each step. The long-term total gain at step $k$, $Q^h(s, a)$, is given in (1)

$$Q^h(s, a) = E\left\{ \sum_{j=0}^{\infty} \gamma^{j} r_{k+j+1} \,\middle|\, s_k = s,\ a_k = a,\ h \right\}, \qquad (1)$$

where $h$ is the action policy, $r_{k+1}$ is the reward received for action $a_k$, and $\gamma \in [0, 1)$ is the discount factor. The optimal Q-value is

$$Q^*(s, a) = \max_{h} Q^h(s, a). \qquad (2)$$

According to (2), the agent obtains the optimal Q-value, $Q^*$, and then specifies the action policy resulting in $Q^*$ [27]. The Q-learning algorithm is an RL approach that calculates the optimal Q-values for each state-action pair iteratively, as in (3)

$$Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \alpha \left[ r_{k+1} + \gamma \max_{a'} Q_k(s_{k+1}, a') - Q_k(s_k, a_k) \right], \qquad (3)$$

where $\alpha$ is the learning rate. In a Markov decision process (MDP) environment, the learned Q-values converge to the optimal $Q^*$ values with probability 1 as long as each state-action pair is visited infinitely many times and the learning rate $\alpha$ is diminished gradually at each step [29].
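As a minimal sketch of the iterative update referred to as (3), the standard tabular Q-learning rule can be written in a few lines of Python. The state and action names, the reward value, and the settings `alpha=0.1` and `gamma=0.9` are illustrative assumptions and are not taken from this study.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning step and return the new Q(state, action)."""
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference error between the target and the current estimate.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical two-action example: all Q-values start at zero.
q_table = defaultdict(float)
actions = ["bid", "wait"]
first_value = q_update(q_table, "idle", "bid", 1.0, "busy", actions)
second_value = q_update(q_table, "idle", "bid", 1.0, "idle", actions)
```

A fixed `alpha` is used here for brevity; the convergence condition quoted above would require a learning rate that decreases over the iterations.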

III. PROBLEM STATEMENT
In most MRS applications, tasks appear in a random sequence and at unpredictable time steps during execution. This is the main reason why task ordering and sharing among the robots cannot be planned before the system starts to work. Tasks can only be assigned to robots that are not busy when the tasks are announced. This means that some announced tasks cannot be executed if none of the robots is available. As a result, the desired system performance is not achieved, especially when the unperformed tasks have a priority such as emergency or sensitivity. As a solution to this problem, a learning-based task allocation method is proposed and successful results are obtained in [12], [15].
The learning process of the Q-learning algorithm is realized by repeating (3) infinitely many times for each state-action pair. However, in real applications, the optimal $Q^*$ values are reached in a finite number of iterations. For a state-action pair $(s, a)$, the learned Q-value at iteration $k$ is represented by $Q(s_k, a_k) = Q_k$. The normalized absolute error (NAE) value, $e_n(k)$, is defined in (4). The NAE value is 1 at the start of the learning process and gradually decreases. This means that the learned Q-values approach the optimal $Q^*$ values after enough iterations and the NAE value gets close to zero. There exists an iteration $k_L$ at which the NAE value is small enough for learning to be regarded as complete. In most Q-learning applications, the learning process, whether offline or not, is stopped at iteration $k_L$ and the learned information is used afterwards. This approach provides efficient solutions as long as the characteristics of the working environment remain the same [12]. In some cases, permanent variations in the characteristics of the environment may occur during execution, such as changes in the number of tasks, their occurrence frequencies, or their priority levels. In addition, some robots may go out of order. In such a situation, the prior experience of the robots becomes invalid. It is of great importance to detect these changes and to adapt the system to the new conditions. For this purpose, the Q-learning based Failure Detection and Self-Recovery (FDSR) algorithm is proposed in this study. The FDSR algorithm detects changes in the environment, then reorganizes the MRS and restarts the learning process if these changes are permanent. A system that is robust against environmental changes is thus obtained. A detailed explanation of FDSR is given in the next section.

IV. THE PROPOSED FDSR ALGORITHM
Robots don't have any knowledge about the working environment at the beginning, and each one learns its own state-action pairs. In the FDSR algorithm, the learning process goes on during execution, either actively or passively; the robots continue to learn after the optimal $Q^*$ values are reached.
According to the NAE values calculated at each step, robots choose one of three behaviours: essential learning behaviour, hidden learning behaviour, and failure detection behaviour.
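The NAE values that drive the behaviour selection could be tracked as in the following sketch. The exact NAE expression is not reproduced in this text, so the form used here, the change between successive Q-values divided by the magnitude of the current Q-value, is an assumption chosen only because it equals 1 at the first update from a zero initial value and then decays. The threshold `0.01` and the toy Q-value sequence are likewise hypothetical.

```python
def nae(q_curr, q_prev, eps=1e-12):
    """Assumed NAE form: |Q_k - Q_{k-1}| / |Q_k| (not taken from the paper)."""
    return abs(q_curr - q_prev) / max(abs(q_curr), eps)


def first_converged_iteration(q_values, threshold):
    """Return the first iteration k whose NAE is below threshold, or None.

    q_values[0] is the initial value Q_0, q_values[k] the value after k updates.
    """
    for k in range(1, len(q_values)):
        if nae(q_values[k], q_values[k - 1]) < threshold:
            return k
    return None


# Toy sequence: Q moves halfway towards a fixed target of 1.0 at every step.
q_values = [0.0]
for _ in range(10):
    q_values.append(q_values[-1] + 0.5 * (1.0 - q_values[-1]))

errors = [nae(q_values[k], q_values[k - 1]) for k in range(1, len(q_values))]
k_L = first_converged_iteration(q_values, threshold=0.01)
```

In this toy run the error starts at 1 and shrinks at every step, which mirrors the qualitative behaviour described for the NAE value.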

A. Behaviour-1. Essential Learning Behaviour
Robots are initially in essential learning behaviour. This means that robots do not yet have any knowledge about the working environment; the learning process has just begun. The usual bidding strategy is valid: a robot bids for the tasks in its own task list unless it is busy with another task at that time. This behaviour is active until the condition in (5) is met at iteration $k_L$, where robots consider themselves experienced enough.
The optimal $Q^*$ values are reached and are set as $Q^* = Q_{k_L}$. At this point, essential learning behaviour ends and the robot switches to hidden learning behaviour.
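The usual bidding strategy of this behaviour can be illustrated with a small auction sketch. The `Robot` class, the one-dimensional positions, the distance-based bid, and the robot-to-task assignments are all hypothetical choices made for illustration; they are not the cost model or the task lists of this study.

```python
from dataclasses import dataclass


@dataclass
class Robot:
    name: str
    task_list: frozenset  # task types this robot is able to execute
    position: float       # hypothetical 1-D position used for the bid
    busy: bool = False


def run_auction(task_type, task_position, robots):
    """Return the name of the lowest-cost bidder, or None if nobody bids."""
    # A robot bids only for tasks in its own task list and only when idle.
    bids = {
        r.name: abs(r.position - task_position)
        for r in robots
        if task_type in r.task_list and not r.busy
    }
    if not bids:
        return None
    return min(bids, key=bids.get)


robots = [
    Robot("R1", frozenset({"T1", "T2"}), position=0.0),
    Robot("R2", frozenset({"T1"}), position=5.0),
    Robot("R3", frozenset({"T2"}), position=2.0, busy=True),
]
winner_t1 = run_auction("T1", 4.0, robots)   # R2 is closer than R1
winner_t2 = run_auction("T2", 2.0, robots)   # R3 is busy, so only R1 bids
winner_t3 = run_auction("T3", 0.0, robots)   # no capable robot, task is dropped
```

The last call shows the situation that motivates the learning-based approach: an announced task stays unassigned when no capable robot is free.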

B. Behaviour-2. Hidden Learning Behaviour
In hidden learning behaviour, robots continue to calculate Q-values and the related NAE values although the learning process is completed. The optimal $Q^*$ values themselves are not updated in this behaviour. As long as the environmental characteristics remain the same, the Q-values stay in a close neighbourhood of the $Q^*$ values and the NAE is nearly zero. Robots in this behaviour bid according to the learned values when a task is announced.
If the NAE value increases, robots notice that an unexpected variation has occurred in the characteristics of the working environment. At iteration $k_F$, which satisfies the condition in (6), robots conclude that something has gone wrong and transit to failure detection behaviour.

C. Behaviour-3. Failure Detection Behaviour
The aim of robots in failure detection behaviour is to determine the status of the changes in environmental conditions. In this behaviour, the NAE value is determined with reference to the $Q^*$ values obtained at iteration $k_L$, as shown in (7).
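The switching among the three behaviours can be sketched as a small state machine. Conditions (5) to (7) are not reproduced in this text, so the thresholds `learn_eps` and `fail_eps` and the `patience` rule (a change counts as permanent when the NAE stays high for several consecutive steps) are assumptions introduced only to make the example concrete.

```python
from enum import Enum


class Behaviour(Enum):
    ESSENTIAL_LEARNING = 1
    HIDDEN_LEARNING = 2
    FAILURE_DETECTION = 3


class BehaviourSwitch:
    """Hypothetical NAE-driven switch; thresholds and patience are assumed."""

    def __init__(self, learn_eps=0.01, fail_eps=0.2, patience=3):
        self.learn_eps = learn_eps
        self.fail_eps = fail_eps
        self.patience = patience
        self.behaviour = Behaviour.ESSENTIAL_LEARNING
        self.high_error_steps = 0

    def step(self, nae):
        if self.behaviour is Behaviour.ESSENTIAL_LEARNING:
            # Learning is regarded as complete once the NAE is small enough.
            if nae < self.learn_eps:
                self.behaviour = Behaviour.HIDDEN_LEARNING
        elif self.behaviour is Behaviour.HIDDEN_LEARNING:
            # A large NAE signals an unexpected change in the environment.
            if nae > self.fail_eps:
                self.behaviour = Behaviour.FAILURE_DETECTION
                self.high_error_steps = 1
        else:
            if nae > self.fail_eps:
                self.high_error_steps += 1
                if self.high_error_steps >= self.patience:
                    # Change judged permanent: restart the learning process.
                    self.behaviour = Behaviour.ESSENTIAL_LEARNING
                    self.high_error_steps = 0
            else:
                # Change judged temporary: keep the learned values.
                self.behaviour = Behaviour.HIDDEN_LEARNING
                self.high_error_steps = 0
        return self.behaviour


switch = BehaviourSwitch()
nae_trace = [1.0, 0.3, 0.005, 0.004, 0.5, 0.02, 0.6, 0.7, 0.8]
visited = [switch.step(e).name for e in nae_trace]
```

In this trace the first spike (0.5) is followed by a low value and is treated as temporary, whereas the later run of three high values sends the robot back to essential learning.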

V. APPLICATION AND RESULTS
The proposed algorithm is realized on a heterogeneous MRS with six robots ($R_1$, $R_2$, $R_3$, $R_4$, $R_5$, $R_6$) capable of executing five different tasks ($T_1$, $T_2$, $T_3$, $T_4$, $T_5$). Each task has two priority degrees: low priority and high priority. High-priority tasks have a high degree of importance, emergency, or sensitivity. If a robot is able to do a task, it can perform both the low-priority and the high-priority version of that task. Robots and related tasks are shown in Table I.

Table I. Robots ($R_1$ to $R_6$) and related tasks.

To represent the working environment, two different scenarios are defined. The first scenario represents the starting configuration and the second one exemplifies the environment after permanent changes occur. In the first scenario, all task types are equally probable and each has low-priority and high-priority tasks in a ratio of 65 % to 35 %. In the second scenario, the tasks do not occur with equal probability: the percentages of tasks $T_1$ to $T_5$ become 25 %, 20 %, 25 %, 20 %, and 10 %, respectively. In addition, the percentages of low-priority and high-priority tasks become 55 % and 45 % for $T_2$ and 50 % and 50 % for $T_3$. The first scenario is valid at the beginning, and the second scenario becomes active after one third of the working duration.
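The two scenarios can be turned into a simple task generator, sketched below. The probabilities are the ones stated above; the assumption that $T_1$, $T_4$, and $T_5$ keep the 35 % high-priority share in the second scenario, the run length of 300 steps, and the random seed are illustrative choices not given in the text.

```python
import random

TASKS = ["T1", "T2", "T3", "T4", "T5"]

# First scenario: equal task probabilities, 35 % high priority for every task.
SCENARIO_1 = {
    "type_probs": [0.20, 0.20, 0.20, 0.20, 0.20],
    "high_prob": {t: 0.35 for t in TASKS},
}
# Second scenario: values for T1, T4, T5 priorities are assumed unchanged.
SCENARIO_2 = {
    "type_probs": [0.25, 0.20, 0.25, 0.20, 0.10],
    "high_prob": {"T1": 0.35, "T2": 0.45, "T3": 0.50, "T4": 0.35, "T5": 0.35},
}


def active_scenario(step, total_steps):
    """The second scenario becomes active after one third of the run."""
    return SCENARIO_1 if 3 * step < total_steps else SCENARIO_2


def sample_task(step, total_steps, rng):
    """Draw a (task type, priority) pair for the given time step."""
    scenario = active_scenario(step, total_steps)
    task = rng.choices(TASKS, weights=scenario["type_probs"], k=1)[0]
    priority = "high" if rng.random() < scenario["high_prob"][task] else "low"
    return task, priority


rng = random.Random(0)  # fixed seed so the run is repeatable
stream = [sample_task(k, 300, rng) for k in range(300)]
```

Feeding such a stream to an allocation method makes the permanent change at one third of the run reproducible across the compared methods.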
The main purpose is to raise the number of completed high-priority tasks while keeping the total number of completed tasks as high as possible. The Assigned Task Ratio (ATR) is used as the performance criterion for the proposed algorithm. ATR is defined as the number of assigned tasks as a percentage of the number of all announced tasks. It is assumed that all tasks assigned to the robots are finished. To show the effectiveness of the proposed algorithm, experiments are carried out for three methods named no-L, only-QL, and FDSR. The first method, no-L, represents the no-learning case with the usual bidding strategy. The second method, only-QL, uses a Q-learning based MRTA similar to [12].
FDSR is the proposed approach in this study. The results for low-priority and high-priority tasks of each task type are given separately in Fig. 1 for these three methods. The graphs in Fig. 1 show that the ATR of high-priority tasks is generally higher than that of low-priority tasks for all methods owing to the auction strategy used. The only-QL method learns about the working environment at the beginning and then stops. Because the learned values no longer suit the environment characteristics after the failure and the robots continue to follow their prior experience, the ATR of all tasks decreases. This points out that inappropriate learning causes undesired results. The FDSR method aims to find out the environmental changes and to determine whether they are permanent. If they are permanent, FDSR recovers the system to a reasonable start state, i.e. it cancels the previously learned values and restarts the learning process for the new environmental conditions. The success and efficiency of the FDSR algorithm can be observed in the graphs in Fig. 1: the ATR values for the low-priority and high-priority tasks of all task types are higher than those of the other two methods.
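The ATR criterion used in this comparison can be computed per task type and priority as in the sketch below. The log format, a sequence of `(task, priority, assigned)` tuples, and the sample entries are hypothetical and do not correspond to the experimental data of this study.

```python
from collections import Counter


def assigned_task_ratio(log):
    """Return {(task, priority): ATR in percent} for an announcement log.

    Each log entry is (task_type, priority, was_assigned).
    """
    announced = Counter()
    assigned = Counter()
    for task, priority, was_assigned in log:
        announced[(task, priority)] += 1
        if was_assigned:
            assigned[(task, priority)] += 1
    # ATR = assigned tasks as a percentage of all announced tasks.
    return {key: 100.0 * assigned[key] / announced[key] for key in announced}


# Made-up log used only to exercise the function.
sample_log = [
    ("T1", "high", True),
    ("T1", "high", True),
    ("T1", "high", False),
    ("T1", "high", True),
    ("T1", "low", True),
    ("T1", "low", False),
]
atr = assigned_task_ratio(sample_log)
```

Because all assigned tasks are assumed to be finished, counting assignments is sufficient and no separate completion record is needed.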

VI. CONCLUSIONS
In this study, a Q-learning based failure detection and self-recovery (FDSR) algorithm is proposed for task allocation problems in dynamic multi-robot domains. The aim of this algorithm is to detect environmental changes and to recover the system to a reliable state when these changes are permanent, as in the case of a disaster. The proposed algorithm defines three behaviours: essential learning behaviour, hidden learning behaviour, and failure detection behaviour. The results of the FDSR algorithm are compared with those of the no-L and only-QL algorithms. Experimental results indicate that the algorithm provides efficient solutions for achieving the desired system performance, in terms of assigned task ratio, when permanent changes occur in the environment characteristics.