Dajbych.net


Keep your service running forever by designing an instant shutdown


Over a year of designing and moving several services from Azure Cloud Service to Service Fabric taught me a few things that are important to keep in mind when creating or refactoring microservices hosted in a Service Fabric environment. Don’t forget that Service Fabric patterns are tied to .NET, which has gone through a massive paradigm shift. You must be up to date at least with asynchronous programming to be able to write solid services.

I have experienced one failure of the whole cluster because I underestimated one important detail. The service ran for half a year without any outages. Then it suddenly started to oscillate (a slowdown of one part of the system followed by a domino effect) and finally shut down (the logging indicated that the code was no longer executing). The Azure Portal notified me: Your cluster version has expired. Go to ‘Fabric upgrades’ to upgrade to a supported version.


It was a surprise because my cluster was set to automatic upgrade mode. The cluster was stuck at version 5.5.216.0, although the latest available version at that time was 5.7.198.9494.


My attempt to upgrade to the latest version by switching to manual mode was not successful.


Later I found out that every upgrade attempt had been rolled back because of a warning raised by Service Fabric.

This warning means that the CancellationToken provided as an argument of the RunAsync method is being ignored. (This warning is relevant to stateful and stateless reliable services. The actor service follows the single entry pattern.) Cancellation is so important because Service Fabric moves your services away from a node that is being prepared for an upgrade. When cancellation takes a very long time, the cancellation time multiplied by the upgrade domain count may exceed the time limit for an environment upgrade. The upgrade attempt then fails.
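The multiplication above can be sketched as a back-of-the-envelope check. This is a deliberately simplified model of the statement in the previous paragraph, not how Service Fabric actually accounts for upgrade time; the class name, the method name, and all the numbers are made up for illustration.

```csharp
using System;

public static class UpgradeBudget
{
    // Hypothetical helper: true when waiting for a slow shutdown in every
    // upgrade domain alone would use up more than the allowed upgrade time.
    public static bool ExceedsLimit(TimeSpan cancellationTime, int upgradeDomainCount, TimeSpan upgradeTimeLimit)
    {
        // Total time spent waiting = time per domain * number of domains.
        TimeSpan total = TimeSpan.FromTicks(cancellationTime.Ticks * upgradeDomainCount);
        return total > upgradeTimeLimit;
    }
}
```

With assumed values of 15 minutes per domain, 5 upgrade domains, and a 60-minute limit, the wait adds up to 75 minutes, so the check returns true. A service that stops within a few seconds stays far below any reasonable limit.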

Service Fabric dynamically balances your services among cluster nodes according to their memory and computing characteristics. This mechanism is also paralyzed when a service freezes on a node. Another consequence is that monitored upgrades are blocked: when the current version of the service cannot be shut down, it cannot be replaced by a newer version.

The programmer’s mission is to write the program so that the CancellationToken is propagated to every awaitable call that accepts one. (When you are communicating over HTTP, you should use HttpClient, because neither HttpWebRequest nor WebClient accepts a CancellationToken as a parameter.)
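A minimal sketch of such propagation might look like the following. To stay self-contained it uses a plain static method as a stand-in for the RunAsync override of a reliable service; the Worker class, the endpoint parameter, and the 30-second polling interval are assumptions made only for this example.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class Worker
{
    private static readonly HttpClient Http = new HttpClient();

    // Stand-in for the body of RunAsync: polls a (hypothetical) endpoint until cancelled.
    public static async Task RunAsync(Uri endpoint, CancellationToken cancellationToken)
    {
        while (true)
        {
            // HttpClient accepts the token, so a pending request is abandoned on cancellation.
            using (HttpResponseMessage response = await Http.GetAsync(endpoint, cancellationToken))
            {
                response.EnsureSuccessStatusCode();
            }

            // Task.Delay accepts the token too, so the wait ends as soon as cancellation is requested.
            await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken);
        }
    }
}
```

Both awaited calls throw an OperationCanceledException (or the derived TaskCanceledException) once the token is cancelled, which ends the loop without any explicit check.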

Sometimes you may find the CancellationToken.ThrowIfCancellationRequested method useful, for example in the body of long-running loops. It does not matter whether the service terminates by throwing the resulting OperationCanceledException or by returning from the RunAsync method. Both options are correct.
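For a loop that contains no awaitable calls, the check could be placed as in this small sketch. The Batch class and the SumSquares method are hypothetical and stand in for any long-running, CPU-bound work.

```csharp
using System;
using System.Threading;

public static class Batch
{
    // Hypothetical CPU-bound work: sums the squares of 0 .. count-1.
    public static long SumSquares(int count, CancellationToken cancellationToken)
    {
        long sum = 0;
        for (int i = 0; i < count; i++)
        {
            // Throws OperationCanceledException as soon as cancellation has been requested.
            cancellationToken.ThrowIfCancellationRequested();
            sum += (long)i * i;
        }
        return sum;
    }
}
```

Checking on every iteration is cheap. In a very tight loop you could check only every few thousand iterations, as long as the service still reacts to cancellation quickly.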

When cancellation is requested, an OperationCanceledException is thrown. When you log exceptions in a catch clause, you may want to exclude this kind of exception. You can do it in many ways, for example like this:

    try {
        ...
        cancellationToken.ThrowIfCancellationRequested();
        ...
    } catch (Exception ex) when (!cancellationToken.IsCancellationRequested) {
        ...
    }