Error handling in BPMN
Unlike UML, BPMN supports handling erroneous situations in a way that is clearly separated from the normal flow of business tasks. A typical way to take advantage of this is to attach a boundary event to an element in which something can go wrong, as in the example below:
If the desired room becomes unavailable while the task ‘Process booking’ is executing, the BPM engine diverts the process flow to the lower part of the diagram, which results in the end user being notified about the situation.
Error events in BPMN are always interrupting: when an error is detected, the current flow stops and the process continues along the path that originates from the error event. In our case, when an error occurs, the process never completes the task ‘Process booking’ successfully and thus never reaches the ‘Booking confirmed’ state.
Error detection is not restricted to a single task. BPMN also allows errors to be caught on subprocesses, as shown below:
In this case, if any activity in the “Process booking” subprocess raises an error, control transfers to the BPMN error event, just as it did in the previous example. The business logic in a real subprocess will likely be much more complex, which only increases the value of being able to handle errors within a bounded area.
This behavior is very useful, because a dedicated error flow makes the process diagram much clearer and more readable. In most cases this model is the way to go, but there are situations in which a different approach is more beneficial.
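For readers who think in code, an error boundary event on a subprocess behaves much like a try/catch block around a group of calls. The sketch below is only an analogy in plain Java, not Camunda or BPMN API code; the class, exception and method names are hypothetical.

```java
import java.util.List;

final class SubprocessErrorScope {

    // Hypothetical stand-in for a BPMN error raised by an activity.
    static final class RoomUnavailableException extends RuntimeException {
        RoomUnavailableException(String message) {
            super(message);
        }
    }

    // Plays the role of the "Process booking" subprocess: the steps run in
    // order, and the first error abandons the remaining steps and switches
    // to the error path, like an interrupting error boundary event.
    static String processBooking(List<Runnable> steps) {
        try {
            for (Runnable step : steps) {
                step.run();
            }
            return "Booking confirmed";
        } catch (RoomUnavailableException e) {
            return "Customer notified: " + e.getMessage();
        }
    }
}
```

Whichever step throws, the error is handled in one place, and the steps after the failing one never run.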
Error handling in Camunda BPM Engine
Camunda BPM is an environment in which processes described in BPMN can be modelled and executed. These processes can be fully automated, with no human interaction, or they can take input from end users through automatically generated forms or dedicated front-end applications built on Camunda’s rich REST API. Camunda provides a BPM engine as well as several applications that let users interact with the process flow and manage its technical aspects.
Unsurprisingly, the BPMN error handling mechanisms described above are available when building a Camunda-based process, but they are not all that Camunda has to offer in this area.
One commonly used error handling feature is the Camunda engine’s ability to automatically retry tasks that failed during execution. This mechanism is available by default in any modelled process, and the number of retries can be set by a process administrator on a per-activity basis. It gives a task that hit a recoverable error another chance to succeed, and thus lets the process complete.
This behavior should be enough to deal with typical transient errors, such as network delays, a late response from a web service, or an OptimisticLockException. It is not enough if the underlying cause takes longer to find and repair, or if fixing it requires activities outside the process’s scope, such as a manual action by an employee. Additionally, in the Camunda Community version the time between retries is currently fixed, although one could argue that a better model would use intervals that grow with each iteration. That would give a better chance of correcting the source of the error before the retry count is exhausted and the process stops.
Restarting a task is also possible from Camunda Cockpit, a back-office application that gives process administrators insight into how deployed processes behave, the data they have gathered, and error details. To restart a task manually, find the process instance, locate the failed task within it, and click the restart button. This can be troublesome when you have a very large number of active instances, as Cockpit’s UI doesn’t support listing all incidents and restarting them in bulk. However, Camunda’s REST API lets you query for that list and restart the tasks, so it’s not impossible; you just need to go the extra mile.
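To make the difference between fixed and growing intervals concrete, here is a minimal sketch in plain Java that only computes wait times. It is not Camunda configuration or API; the class name, the doubling rule and the cap are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class RetrySchedule {

    // Fixed model: the same wait before every retry.
    static List<Duration> fixed(Duration interval, int retries) {
        List<Duration> waits = new ArrayList<>();
        for (int i = 0; i < retries; i++) {
            waits.add(interval);
        }
        return waits;
    }

    // Growing model (an assumption for illustration): the wait doubles
    // after each retry and is capped at maxInterval.
    static List<Duration> growing(Duration first, int retries, Duration maxInterval) {
        List<Duration> waits = new ArrayList<>();
        Duration next = first.compareTo(maxInterval) > 0 ? maxInterval : first;
        for (int i = 0; i < retries; i++) {
            waits.add(next);
            Duration doubled = next.multipliedBy(2);
            next = doubled.compareTo(maxInterval) > 0 ? maxInterval : doubled;
        }
        return waits;
    }

    // Total time that passes before the last retry is attempted.
    static Duration total(List<Duration> waits) {
        Duration sum = Duration.ZERO;
        for (Duration wait : waits) {
            sum = sum.plus(wait);
        }
        return sum;
    }

    public static void main(String[] args) {
        Duration fiveMinutes = Duration.ofMinutes(5);
        System.out.println(total(fixed(fiveMinutes, 5)));
        System.out.println(total(growing(fiveMinutes, 5, Duration.ofHours(1))));
    }
}
```

With five retries starting at five minutes, the fixed schedule is exhausted after 25 minutes, while the doubling schedule capped at one hour (5, 10, 20, 40 and 60 minutes) stretches the same five retries over 2 hours and 15 minutes.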
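As a rough sketch of that extra mile, the two requests involved could be built as below with Java’s built-in HTTP client types. The endpoint paths (`/incident` filtered by `incidentType=failedJob`, and `/job/{id}/retries`) reflect my understanding of the Camunda 7 REST API and should be checked against the reference for your version. The base URL and the class and method names are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class IncidentRestRequests {

    // Assumed base URL of the engine's REST API; adjust to your deployment.
    static final String BASE_URL = "http://localhost:8080/engine-rest";

    // Builds a request that lists incidents caused by jobs that ran out of retries.
    static HttpRequest listFailedJobIncidents() {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/incident?incidentType=failedJob"))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Builds a request that gives a failed job new retries, which makes the
    // engine pick the job up again.
    static HttpRequest setJobRetries(String jobId, int retries) {
        String body = "{\"retries\": " + retries + "}";
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/job/" + jobId + "/retries"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

Each request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. For incidents of this type, the job id to pass to `setJobRetries` should be available in each incident’s `configuration` field, which is again worth verifying for your Camunda version.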
When is the default not enough?
In a recent project I was involved in, we modelled the business process in BPMN to be executed by the Camunda engine. The process was the back end of a service offered to end users of a financial institution, through which they could obtain financing for a purchase. All of the user tasks in the case below were implemented as custom views in Java and Angular rather than as generic Camunda forms.
At a certain point the process involved a series of service tasks that checked the submitted data against antifraud services and internal and external databases, created and updated records about the application, and so on. Each of those activities could fail. Some failures could be overcome relatively easily, while others (especially those depending on external services) could take minutes or hours to resolve, well beyond the automatic retry limit. A simplified version of the initial diagram is shown below:
There were a few principles we had to keep in mind while designing the error flow for this part of the diagram:
- A task failure cannot end the process, since that would result in the loss of a client and revenue
- The customer should be informed about any unusual status of the application, in particular about an error occurring and what they should do next
- When a task fails, a back-office operator should be able to restart it once the problem is resolved
Principle #2 was easy to achieve: all we had to do was add an error path with a message for the client to every service task, or add it as a boundary event:
However, with errors modelled this way, the customer would not be able to continue the process once the task’s retry count was exhausted (#1). A task can only be restarted manually if it is the current task, and by catching the error event we leave the service task (remember, an error event is always interrupting) to perform the activities on the error handling path.
To keep the process in a state from which its flow could be restarted, we need to keep the failed task active and, at the same time, follow another path corresponding to the error flow. BPMN currently offers no way to do this with an error event, so we had to use another option. Luckily, not all BPMN events are interrupting; most of them have a non-interrupting counterpart, which you can identify by its dashed rather than solid border. We decided to use a message event, since it was easy to implement custom behavior that sends a message when an error occurs. In the end, the diagram looked roughly like the one below:
This way, all of the principles were satisfied:
- When a service task fails, it no longer ends the process or leaves the failed task; the customer can still finish the process successfully regardless of the errors that occurred
- The customer is informed of the abnormal status and knows what to expect when it occurs
- The failed task can be restarted manually, because the flow did not leave it
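The difference between the two kinds of boundary event can be summed up in a toy model of which activities remain active after the event fires. This is a plain Java illustration of the reasoning above, not the Camunda API or engine code; all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundaryEventModel {

    // Returns the activities that are active after a boundary event
    // attached to failedTask has fired.
    static List<String> activeAfterEvent(String failedTask, String handlerTask, boolean interrupting) {
        List<String> active = new ArrayList<>();
        if (!interrupting) {
            // Non-interrupting: the failed task keeps its place in the flow.
            active.add(failedTask);
        }
        // In both cases the flow continues along the event's outgoing path.
        active.add(handlerTask);
        return active;
    }

    // A task can only be restarted manually while it is still active.
    static boolean canRestart(List<String> active, String task) {
        return active.contains(task);
    }
}
```

With `interrupting` set to `true` (the error event), only the handler task is active and the failed task can no longer be restarted. With `false` (the non-interrupting message event), both are active, which is what lets the operator restart the failed task while the customer is being informed.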