Over a year of designing services and moving several of them from Azure Cloud Service to Service Fabric taught me a few things that are important to keep in mind when creating or refactoring microservices hosted in a Service Fabric environment. Don’t forget that Service Fabric patterns are tied to .NET, which has gone through a massive paradigm shift. You must at least be up to date with asynchronous programming to write solid services.
I once experienced a failure of the whole cluster because I underestimated one important detail. The service had run for half a year without any outages. Then it suddenly started to oscillate (one part of the system slowed down, followed by a domino effect) and finally shut down (the logs indicated that the code was no longer executing). The Azure Portal notified me: “Your cluster version has expired. Go to ‘Fabric upgrades’ to upgrade to a supported version.”
This was a surprise, because my cluster was set to automatic upgrade mode. Its version was stuck at 5.5.216.0, although the latest available version at that time was 5.7.198.9494.
Switching to manual mode and upgrading to the latest version by hand did not work either.
Later I found out that every upgrade attempt had been rolled back because of this warning:
This warning means that the CancellationToken passed as an argument to the RunAsync method is being ignored. (The warning applies to stateful and stateless reliable services; actor services follow the single-entry pattern.) Cancellation is so important because Service Fabric moves your services away from a node that is being prepared for an upgrade. If cancellation takes a very long time, the cancellation time multiplied by the number of upgrade domains may exceed the time limit for the cluster upgrade, and the upgrade attempt fails.
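A back-of-the-envelope version of that reasoning is sketched below. It assumes, as the paragraph does, that each upgrade domain waits roughly as long as the service takes to cancel; the class, method names and numbers are hypothetical, and a real upgrade also spends time on health checks and node restarts, so this is only a lower bound.

```csharp
using System;

static class UpgradeEstimate
{
    // Hypothetical helper: rough lower bound on the cluster upgrade duration when
    // every upgrade domain has to wait for a slow RunAsync cancellation.
    public static TimeSpan EstimateMinimumDuration(TimeSpan cancellationTime, int upgradeDomainCount) =>
        TimeSpan.FromTicks(cancellationTime.Ticks * upgradeDomainCount);

    // True when the cancellation time alone already exceeds the upgrade time limit.
    public static bool ExceedsUpgradeTimeout(
        TimeSpan cancellationTime, int upgradeDomainCount, TimeSpan upgradeTimeout) =>
        EstimateMinimumDuration(cancellationTime, upgradeDomainCount) > upgradeTimeout;
}
```

For example, with 5 upgrade domains and a service that needs 15 minutes to stop, the estimate is 75 minutes, so a hypothetical 60-minute limit is exceeded before any other upgrade step is counted.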
Service Fabric dynamically balances your services among cluster nodes according to load metrics such as memory and CPU usage. This mechanism is also paralyzed when a service freezes on a node. Another consequence is that Monitored Upgrades get blocked: when the current version of the service cannot be shut down, it cannot be replaced by a newer version.
The programmer’s job is to write the code so that the CancellationToken is propagated to every awaitable call that accepts one. (When communicating over HTTP, use HttpClient, because neither HttpWebRequest nor WebClient accepts a CancellationToken as a parameter.)
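A minimal sketch of propagating the token, assuming a polling loop similar to a RunAsync body. `PollingWorker`, `SlowHandler` and the URL are hypothetical; `SlowHandler` only exists so the example runs without a real server. `HttpClient.GetAsync(string, CancellationToken)` and `Task.Delay(TimeSpan, CancellationToken)` are standard .NET overloads.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a slow remote endpoint so the sketch runs offline.
class SlowHandler : HttpMessageHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Task.Delay observes the token and throws TaskCanceledException when it fires.
        await Task.Delay(TimeSpan.FromMinutes(5), cancellationToken);
        return new HttpResponseMessage(HttpStatusCode.OK);
    }
}

static class PollingWorker
{
    // Same shape as a reliable service's RunAsync(CancellationToken); this is a plain
    // static method, not the Service Fabric base class.
    public static async Task RunAsync(HttpClient client, CancellationToken cancellationToken)
    {
        while (true)
        {
            // Pass the token to every awaitable call that accepts one.
            using (HttpResponseMessage response =
                await client.GetAsync("http://localhost/status", cancellationToken))
            {
                // ... process the response ...
            }
            await Task.Delay(TimeSpan.FromSeconds(10), cancellationToken);
        }
    }
}
```

When the token is cancelled, the pending `GetAsync` or `Task.Delay` throws an `OperationCanceledException` (concretely a `TaskCanceledException`) right away, so the method stops within milliseconds instead of waiting for the request or the delay to complete.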
The CancellationToken.ThrowIfCancellationRequested method can be useful too, for example in the body of long-running loops. It does not matter whether the service terminates by throwing an exception or by returning from the RunAsync method; both options are correct.
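Both termination styles are sketched below on a hypothetical CPU-bound job (`BatchProcessor` and its methods are illustrative names). The first checks the token with ThrowIfCancellationRequested on every iteration; the second stops by returning normally once IsCancellationRequested is set.

```csharp
using System;
using System.Threading;

static class BatchProcessor
{
    // No awaitable call inside the loop can observe the token,
    // so the loop checks it explicitly on every iteration.
    public static long SumOfSquares(int count, CancellationToken cancellationToken)
    {
        long total = 0;
        for (int i = 0; i < count; i++)
        {
            // Throws OperationCanceledException as soon as cancellation is requested.
            cancellationToken.ThrowIfCancellationRequested();
            total += (long)i * i;
        }
        return total;
    }

    // Alternative: stop by returning normally instead of throwing.
    public static long SumOfSquaresUntilCancelled(int count, CancellationToken cancellationToken)
    {
        long total = 0;
        for (int i = 0; i < count && !cancellationToken.IsCancellationRequested; i++)
        {
            total += (long)i * i;
        }
        return total;
    }
}
```

Checking once per iteration is cheap; for very tight loops you may prefer to check every N iterations, as long as the time between checks stays short.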
When cancellation is requested, ThrowIfCancellationRequested, like most token-aware .NET APIs, throws an OperationCanceledException. If you log exceptions in a catch clause, you may want to exclude this kind of exception. There are many ways to do it, for example:
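try
{
    // ... work that observes the token ...
    cancellationToken.ThrowIfCancellationRequested();
    // ...
}
catch (Exception ex) when (!cancellationToken.IsCancellationRequested)
{
    // Log ex here. The filter skips this block once cancellation has been
    // requested, so the OperationCanceledException propagates unlogged.
}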
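To see the filter at work, the sketch below wraps it in a hypothetical `RunStep` helper and uses an in-memory `Log` list instead of a real logger (both are assumptions for illustration). A cancellation propagates without being logged, while a genuine failure is logged and rethrown.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class FilteredLogging
{
    // Hypothetical in-memory log standing in for a real logger.
    public static readonly List<string> Log = new List<string>();

    // Runs one unit of work and logs failures, except those raised after cancellation.
    public static void RunStep(Action<CancellationToken> work, CancellationToken cancellationToken)
    {
        try
        {
            work(cancellationToken);
        }
        catch (Exception ex) when (!cancellationToken.IsCancellationRequested)
        {
            // Only reached while cancellation has NOT been requested.
            Log.Add(ex.GetType().Name + ": " + ex.Message);
            throw;
        }
    }
}
```

Note that the filter tests the token, not the exception type: once cancellation has been requested, any exception skips the logging block, not only OperationCanceledException. If you want to keep logging genuine failures during shutdown, filtering on the type, for example `when (!(ex is OperationCanceledException))`, is an alternative.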