6 Fault-Tolerant Examples

This chapter shows how to use the failure model to build robust distributed applications. We first present basic fault-tolerant versions of common language operations. Then we present fault-tolerant versions of the server examples. We conclude with a bigger example: reliable objects with recovery.

6.1 A fault-tolerant hello server

Let's take a fresh look at the hello server. How can we make it resistant to distribution faults? First we specify the client and server behavior. The server should continue working even though there is a problem with a particular client. The client should be informed in finite time of a server problem by means of a new exception, serverError.

We show how to rewrite this example with the basic failure model. In this model, the system raises exceptions when it tries to do operations on entities that have problems related to distribution. All these exceptions are of the form system(dp(conditions:FS ......) where FS is the list of actual fault states as defined before. By default, the system will raise exceptions only on the fault states tempFail and permFail.

Assume that we have two new abstractions:

We first show how to use these abstractions before defining them in the basic model. With these abstractions, we can write the client and the server almost exactly in the same way as in the non-fault-tolerant case. Let's first write the server:

declare Str Prt Srv in  
{NewPort Str Prt}
thread   
   {ForAll Str
    proc {$ S}
       try 
          S="Hello world" 
       catch system(dp(......then skip end 
    end}
end 
 
proc {Srv X}
   {SafeSend Prt X}
end 
                                         
{Pickle.save {Connection.offerUnlimited Srv}  
             "/usr/staff/pvr/public_html/hw"}

This server does one distributed operation, namely the binding S="Hello world". We wrap this binding to catch any distributed exception that occurs. This allows the server to ignore clients with problems and to continue working.

Here's the client:

declare Srv
 
try X in 
   try 
      Srv={Connection.take {Pickle.load "http://www.info.ucl.ac.be/~pvr/hw"}}
   catch _ then raise serverError end 
   end 
    
   {Srv X}
   {SafeWait X infinity}
   {Browse X}
catch serverError then 
   {Browse 'Server down'}
end

This client does two distributed operations, namely a send (inside Srv), which is replaced by SafeSend, and a wait, which is replaced by SafeWait. If there is a problem sending the message or receiving the reply, then the exception serverError is raised. This example also raises an exception if there is any problem during the startup phase, that is during Connection.take and Pickle.load.

6.1.1 Definition of SafeSend and SafeWait

We define SafeSend and SafeWait in the basic model. To make things easier to read, we use the two utility functions FOneOf and FSomeOf, which are defined just afterwards. SafeSend is defined as follows:

declare 
proc {SafeSend Prt X}
   try 
      {Send Prt X}
   catch system(dp(conditions:FS ......then 
      if {FOneOf permFail FS} then 
         raise serverError end 
      elseif {FOneOf tempFail FS} then 
         {Delay 100} {SafeSend Prt X}
      else skip end 
   end 
end

This raises a serverError if there is a permanent server failure and retries indefinitely each 100 ms if there is a temporary failure.

SafeWait is defined as follows:

declare 
local 
   proc {InnerSafeWait X Time}
      try 
         cond {Wait X} then skip 
         [] {Wait Time} then raise serverError end 
         end 
      catch system(dp(conditions:FS ......then 
         if {FSomeOf [permFail remoteProblem(permSome)] FS} then 
            raise serverError end 
         if {FSomeOf [tempFail remoteProblem(tempSome)] FS} then 
            {Delay 100} {InnerSafeWait X Time}
         else skip end 
      end 
   end 
in 
   proc {SafeWait X TimeOut}
   Time in 
      if TimeOut\=infinity then 
         thread {Delay TimeOut} Time=done end 
      end 
      {Fault.enable X 'thread'(this)
         [permFail remoteProblem(permSome) tempFail remoteProblem(tempSome)] _}
      {InnerSafeWait X Time}
   end 
end

This raises a serverError if there is a permanent server failure and retries each 100 ms if there is a temporary failure. The client and the server are the only two sites on which X exists. Therefore remoteProblem(permFail:_ ...) means that the server has crashed.

To keep the client from blocking indefinitely, it must time out. We need a time-out since otherwise a client will be stuck when the server drops it like a hot potato. The duration of the time-out is an argument to SafeWait.

6.1.2 Definition of FOneOf and FSomeOf

In the above example and later on in this chapter (e.g., in Section 6.2.3), we use the utility functions FOneOf and FSomeOf to simplify checking for fault states. We specify these functions as follows.

The call {FOneOf permFail AFS} is true if the fault state permFail occurs in the set of actual fault states AFS. Extra information in AFS is not taken into account in the membership check. The function FOneOf is defined as follows:

declare 
fun {FOneOf F AFS}
   case AFS of nil then false 
   [] AF2|AFS2 then 
      case F#AF2
      of permFail#permFail(...then true 
      [] tempFail#tempFail(...then true 
      [] remoteProblem(I)#remoteProblem(I ...then true 
      else {FOneOf F AFS2}
      end 
   end 
end

The call {FSomeOf [permFail remoteProblem(permSome)] AFS} is true if either permFail or remoteProblem(permSome) (or both) occurs in the set AFS. Just like for FOneOf, extra information in AFS is not taken into account in the membership check. The function FSomeOf is defined as follows:

declare 
fun {FSomeOf FS AFS}
   case FS of nil then false 
   [] F2|FS2 then 
      {FOneOf F2 AFS} orelse {FSomeOf FS2 AFS}
   end 
end

6.2 Fault-tolerant stationary objects

To be useful in practice, stationary objects must have well-defined behavior when there are faults. We propose the following specification for the stationary object (the "server") and a caller (the "client"):

We present two quite different ways of implementing this specification, one based on guards (Section 6.2.2) and the other based on exceptions (Section 6.2.3). The guard-based technique is the shortest and simplest to understand. The exception-based technique is similar to what one would do in standard languages such as Java.

But first let's see how easy it is to create and use a remote stationary object.

6.2.1 Using fault-tolerant stationary objects

We show how to use Remote and NewSafeStat to create a remote stationary object. First, we need a class--let's define a simple class Counter that implements a counter.

declare 
class Counter 
   attr i
   meth init i<-end 
   meth get(X) X=@end 
   meth inc i<-@i+end 
end

Then we define a functor that creates an instance of Counter with NewSafeStat. Note that the object is not created yet. It will be created later, when the functor is applied.

declare 
F=functor 
  import Fault
  export statObj:StatObj
  define 
     {Fault.defaultEnable nil _}
     StatObj={NewSafeStat Counter init}
  end

Do not forget the "import Fault" clause! If it's left out, the system will try to use the local Fault on the remote site. This raises an exception since Fault is sited (technically, it is a resource, see Section 2.1.5). The import Fault clause ensures that installing the functor uses the Fault of the installation site.

It may seem overkill to use a functor just to create a single object. But the idea of functors goes much beyond this. With import, functors can specify which resources to use on the remote site. This makes functors a basic building block for mobile computations (and mobile agents).

Now let's create a remote site and make an instance of Counter called StatObj. The class Remote.manager gives several ways to create a remote site; this example uses the option fork:sh, which just creates another process on the same machine. The process is accessible through the module manager MM, which allows to install functors on the remote site (with the method "apply").

declare 
MM={New Remote.manager init(fork:sh)}
StatObj={MM apply(F $)}.statObj

Finally, let's call the object. We've put the object calls inside a try just to demonstrate the fault-tolerance. The simplest way to see it work is to kill the remote process and to call the object again. It also works if the remote process is killed during an object call, of course.

try 
   {StatObj inc}
   {StatObj inc}
   {Show {StatObj get($)}}
catch X then 
   {Show X}
end

6.2.2 Guard-based fault tolerance

The simplest way to implement fault-tolerant stationary objects is to use a guard. A guard watches over a computation, and if there is a distribution fault, then it gracefully terminates the computation. To be precise, we introduce the procedure Guard with the following specification:

  • {Guard E FS S1 S2} guards entity E for fault states FS during statement S1, replacing S1 by S2 if a fault is detected during S1. That is, it first executes S1. If there is no fault, then S1 completes normally. If there is a fault on E in FS, then it interrupts S1 as soon as a faulty operation is attempted on any entity. It then executes statement S2. S1 must not raise any distribution exceptions. The application is responsible for cleaning up from the partial work done in S1. Guards are defined in Section ``Definition of Guard''.

With the procedure Guard, we define NewSafeStat as follows. Note that this definition is almost identical to the definition of NewStat in Section 3.2.3. The only difference is that all distributed operations are put in guards.

<Guard-based stationary object>=
proc {MakeStat PO ?StatP}
   S P={NewPort S}
   N={NewName}
in 
   % Client interface to server:
   
<Client side> 
   % Server implementation:
   
<Server side> 
end 
 
proc {NewSafeStat Class Init Object}
   Object={MakeStat {New Class Init}}
end 

The client raises an exception if there is a problem with the server:

<Client side>=
proc {StatP M}
in 
   {Fault.enable R 'thread'(this) nil _}
   {Guard P [permFail]
    proc {$}  
       {Send P M#R}  
       if R==then skip else raise R end end 
    end 
    proc {$raise remoteObjectError end end}
end

The server terminates the client request gracefully if there is a problem with a client:

<Server side>=
thread 
   {ForAll S
    proc{$ M#R}
       thread RL in 
          try {PO M} RL=N catch X then RL=X end 
          {Guard R [permFail remoteProblem(permSome)]
           proc {$} R=RL end 
           proc {$skip end}
       end 
   end}
end

There is a minor point related to the default enabled exceptions. This example calls Fault.enable before Guard to guarantee that no exceptions are raised on R. This can be changed by using Fault.defaultEnable at startup time for each site.

Definition of Guard

Guards allow to replace a statement S1 by another statement S2 if there is a fault. See Section 6.2.2 for a precise specification. The procedure {Guard E FS S1 S2} first disables all exception raising on E. Then it executes S1 with a local watcher W (see Section ``Definition of LocalWatcher''). If the watcher is invoked during S1, then S1 is interrupted and the exception N is raised. This causes S2 to be executed. The unforgeable and unique name N occurs nowhere else in the system.

declare 
proc {Guard E FS S1 S2}
   N={NewName}
   T={Thread.this}
   proc {W E FS} {Thread.injectException T N} end 
in 
   {Fault.enable E 'thread'(T) nil _}
   try 
      {LocalWatcher E FS W S1}
   catch X then 
      if X==then 
         {S2}
      else 
         raise X end 
      end 
   end 
end

Definition of LocalWatcher

A local watcher is a watcher that is installed only during the execution of a statement. When the statement finishes or raises an exception, then the watcher is removed. The procedure LocalWatcher defines a local watcher according to the following specification:

  • {LocalWatcher E FS W S} watches entity E for fault states FS with watcher W during the execution of S. That is, it installs the watcher, then executes S, and then removes the watcher when execution leaves S.

declare 
proc {LocalWatcher E FS W S}
   {Fault.installWatcher E FS W _}
   try 
      {S}
   finally 
      {Fault.deInstallWatcher E W _}
   end 
end

6.2.3 Exception-based fault tolerance

We show how to implement NewSafeStat by means of exceptions only, i.e., using the basic failure model. First New makes an instance of the object and then MakeStat makes it stationary. In MakeStat, we distinguish four parts. The first two implement the client interface to the server.

<Exception-based stationary object>=
declare 
proc {MakeStat PO ?StatP}
   S P={NewPort S}
   N={NewName}
   EndLoop TryToBind
in 
   % Client interface to server:
   
<Client call to the server> 
   
<Client synchronizes with the server> 
   % Server implementation:
   
<Main server loop> 
   
<Server synchronizes with the client> 
end

proc {NewSafeStat Class Init ?Object}
   Object={MakeStat {New Class Init}}
end

First the client sends its message to the server together with a synchronizing variable. This variable is used to signal to the client that the server has finished the object call. The variable passes an exception back to the client if there was one. If there is a permanent failure of the send, then raise remoteObjectError. If there is a temporary failure of the send, then wait 100 ms and try again.

<Client call to the server>=
proc {StatP M}
      R in 
      try 
         {Send P M#R}
      catch system(dp(conditions:FS ......then 
         if {FOneOf permFail FS} then 
            raise remoteObjectError end 
         elseif {FOneOf tempFail FS} then 
            {Delay 100}
            {StatP M}
         else skip end 
      end 
      {EndLoop R}
   end

Then the client waits for the server to bind the synchronizing variable. If there is a permanent failure, then raise the exception. If there is a temporary failure, then wait 100 ms and try again.

<Client synchronizes with the server>=
proc {EndLoop R}
      {Fault.enable R 'thread'(this)  
         [permFail remoteProblem(permSome) tempFail remoteProblem(tempSome)] _}
      try 
         if R==then skip else raise R end end 
      catch system(dp(conditions:FS ......then 
         if {FSomeOf [permFail remoteProblem(permSome)] FS} then 
            raise remoteObjectError end 
         elseif {FSomeOf [tempFail remoteProblem(tempSome)] FS} then 
            {Delay 100} {EndLoop R}
         else skip end 
      end 
   end

The following two parts implement the server. The server runs in its own thread and creates a new thread for each client call. The server is less tenacious on temporary failures than the client: it tries once every 2000 ms and gives up after 10 tries.

<Main server loop>=
thread 
      {ForAll S
       proc {$ M#R}
          thread 
             try 
                {PO M}
                {TryToBind 10 R N}
             catch X then 
                try 
                   {TryToBind 10 R X}
                catch Y then skip end 
             end 
          end 
       end}
   end

<Server synchronizes with the client>=
proc {TryToBind Count R N}
      if Count==then skip 
      else 
         try 
            R=N
         catch system(dp(conditions:FS ......then 
            if {FOneOf tempFail FS} then 
               {Delay 2000}
               {TryToBind Count-1 R N}
            else skip end 
         end 
      end 
   end

6.3 A fault-tolerant broadcast channel

We can use the fault-tolerant stationary object (see Section 6.2) to define a simple open fault-tolerant broadcast channel. This is a useful abstraction; for example it can be used as the heart of a chat tool such as IRC. The service has a client/server structure and is aware of permanent crashes of clients or the server. In case of a client crash, the system continues to work. In case of a server crash, the service will no longer be available. Clients receive notification of this.

Users access the broadcast service through a local client. The user creates the client by using a procedure given by the server. The client is accessed as an object. It has a method sendMessage for broadcasting a message. When the client receives a message or is notified of a client or server crash, it informs the user by calling a user-defined procedure with one argument. The following events are possible:

We give an example of how the broadcast channel is used, and we follow this by showing its implementation. We first show how to use and implement a non-fault-tolerant broadcast channel, and then we show the small extensions needed for it to detect client and server crashes.

6.3.1 Sample use (no fault tolerance)

First we create the channel server. To connect with clients, the server offers a ticket with unlimited connection ability. The ticket is available through a publicly-accessible URL.

local  
   S={NewStat ChannelServer init(S)}
in 
   {Pickle.save {Connection.offerUnlimited S}
                "/usr/staff/pvr/public_html/chat"}
end

A client can be created on another site. We first define on the client's site a procedure HandleIncomingMessage that will handle incoming messages from the broadcast channel. Then we access to the channel by its URL. Finally, we create a local client and give it our handler procedure.

local  
   proc {HandleIncomingMessage M}
      {Show {VirtualString.toString
         case M  
         of message(From Content) then From#' : '#Content
         [] registered(UserID)    then UserID#' joined us' 
         [] unregistered(UserID)  then UserID#' left us' 
         end}}
   end 
   S={Connection.take {Pickle.load "http://www.info.ucl.ac.be/~pvr/chat"}}
   MakeNewClient={S getMakeNewClient($)}
   C={MakeNewClient HandleIncomingMessage 'myNameAsID'}
in 
   {For 1 1000 1
      proc {$ I}
         {C sendMessage('hello'#I)}
         {Delay 800}
      end}
 
   {C close}
end

In this example we send 1000 messages of the form 'hello'#I, where I takes successive values from 1 to 1000. Then we close the client.

A nice property of this channel abstraction is that the client site only needs to know the channel's URL and its interface. All this can be stored in Ascii form and transmitted to the client at any time. In particular, the syntax of the interfaces, i.e., the messages understood by user, client, and server, is defined completely by a simple Ascii list of the message names and their number of arguments. The client site does not need to know any program code. When a client is created through a call to MakeNewClient, then at that time the client code is transferred from the channel server to the client site.

6.3.2 Definition (no fault tolerance)

Since a fault-tolerant stationary object has well-defined behavior in the case of a permanent crash, we can show the service's implementation in two steps. First, we show how it is written without taking fault tolerance into account. Second, we complete the example by adding fault handling code. This is easy; it amounts to catching the remoteObjectError exception for each remote method call (client to server and server to client).

The client and server are stationary objects with the following structure:

<Client and server classes>=
class ChannelClient 
   feat 
      server selfStatic usrMsgHandler userID
   
<Client interface to user> 
   
<Client interface to server> 
end 
 
local 
   
<Concurrent ForAll procedure> 
in 
   class ChannelServer 
      prop locking
      feat selfStatic
      attr clientList
      meth init(S)
         lock 
            self.selfStatic=S
            clientList<-nil
         end 
      end 
      
<Server's getMakeClient method> 
      
<Server interface to client> 
   end 
end

Client definition

The client provides two methods to the server. The first, put, for receiving broadcasted message from registered clients. The second, init, for the client initialization (remember that a client is created using a procedure defined by the server).

<Client interface to server>=
meth put(Msg)
   {self.usrMsgHandler Msg}
end 
 
meth init(Server SelfReference UsrMsgHandler UserID)
   self.server=Server
   self.selfStatic=SelfReference
   self.usrMsgHandler=UsrMsgHandler
   self.userID=UserID
   {self.server register(self.selfStatic self.userID)}
end

The client keeps a reference to the server, to itself for unregistering, to the user-defined handler procedure, and to its user identification.

A user accesses the broadcast channel only through a client. The client provides the user with a method for sending a message through the channel and a method for leaving the channel.

<Client interface to user>=
meth sendMessage(Msg)
   {self.server broadcast(self.userID Msg)}
end 
 
meth close 
   {self.server unregister(self.selfStatic self.userID)}
end

Server definition

The server's getMakeClient method returns a reference to a procedure that creates clients:

<Server's getMakeClient method>=
meth getMakeNewClient(MakeNewClient)
   proc {MakeNewClient UserMessageHandler UserID StaticClientObj}
      StaticClientObj={NewStat   
         ChannelClient  
         init(self.selfStatic StaticClientObj
              UserID UserMessageHandler)}
   end 
end

The server uses a concurrent ForAll procedure that starts all sends concurrently and waits until they are all finished. This is important for implementing broadcasts. With the concurrent ForAll, the total time for the broadcast is the maximum of all client round-trip times, instead of the sum, if the broadcast would sequentially send to each client and wait for an acknowledgement before continuing. Concurrent broadcast is efficient in Mozart due to its extremely lightweight threads.

<Concurrent ForAll procedure>=
proc {ConcurrentForAll Ls P}
   Sync
   proc {LoopConcurrentForAll Ls PrevSync FinalSync}
      case Ls
      of L|Ls2 then 
      NewSync in 
         thread {P L} PrevSync=NewSync end 
         {LoopConcurrentForAll Ls2 NewSync FinalSync}
      [] nil then 
         PrevSync=FinalSync
      end 
   end 
in 
   {LoopConcurrentForAll Ls unit Sync}
   {Wait Sync}
end

The server provides three methods for the client, namely register, unregister, and broadcast. A client can register to the broadcast channel by calling the register method and unregister by calling the unregister method. Note that clients are identified uniquely by references to the client object Client, and not by the client's user ID UserID. This means that the channel will work correctly even if there are clients with the same user ID. The users may get confused, but the channel will not.

A client can broadcast a message on the channel by calling the broadcast method. The server will concurrently forward the message to all registered clients. The broadcast call will block until the message has reached all the clients.

<Server interface to client>=
meth register(Client UserID)
CL in 
   lock 
      CL=@clientList
      clientList <- c(ref:Client id:UserID)|@clientList
   end 
   {ConcurrentForAll CL
      proc {$ Element} {Element.ref put(registered(UserID))} end}
end 
 
meth unregister(Client UserID)
CL in 
   lock 
      clientList <- 
         {List.filter @clientList
            fun {$ Element} Element.ref\=Client end}
      CL=@clientList
   end 
   {ConcurrentForAll CL
      proc {$ Element} {Element.ref put(unregistered(UserID))} end}
end 
 
meth broadcast(SenderID Msg)
   {ConcurrentForAll @clientList
      proc {$ Element} {Element.ref put(message(SenderID Msg))} end}
end

6.3.3 Sample use (with fault tolerance)

The fault-tolerant channel can be used in exactly the same way as the non-fault-tolerant version. The only difference is that the user-defined handler procedure can receive two extra messages, permClient and permServer, to indicate client and server crashes:

proc {UserMessageHandler Msg}
   {Show {VirtualString.toString
      case Msg
      of message(From Content) then From#' : '#Content
      [] registered(UserID)    then UserID#' joined us' 
      [] unregistered(UserID)  then UserID#' left us' 
      [] permClient(UserID)    then UserID#' has crashed' 
      [] permServer            then 'Server has crashed' 
      end}}
end

6.3.4 Definition (with fault tolerance)

The non-fault-tolerant version of Section 6.3.2 is easily extended to detect client and server crashes. First, the server and all clients must be created by calling NewSafeStat instead of NewStat. This means creating the server as follows:

S={NewSafeStat ChannelServer init(S)}

This makes the channel server a fault-tolerant stationary object. In addition, several small extensions to the client and server definitions are needed. This section gives these extensions.

Client definition

This definition extends the definition given in Section 6.3.2. We assume that the server has been created with NewSafeStat. Two changes are needed to the client. First, the client can detect a server crash by catching the remoteObjectError exception. Second, the server can detect a client crash in the same way, when it calls the client's selfStatic reference. Both of these changes can be done by redefining the values of self.server and self.selfStatic at the client.

meth init(Server SelfReference UsrMsgHandler UserID)
   self.server =
      proc {$ Msg}
         try 
            {Server broadcast(self.userID Msg)}
         catch remoteObjectError then 
            {self.usrMsgHandler permServ}
         end 
      end 
   self.selfStatic =
      proc {$ Msg}
         try 
            {SelfReference Msg}
         catch remoteObjectError then 
            {Server unregister(self.selfStatic self.userId)}
            {Server broadcastCrashEvent(UserID)}
         end 
      end 
   self.usrMsgHandler=UsrMsgHandler
   self.userID=UserID
   {self.server register(self.selfStatic self.userID)}
end

Server definition

The server has the new method broadcastCrashEvent.

meth broadcastCrashEvent(CrashID)
   {ConcurrentForAll @clientList
      proc {$ Element}
         {Element.ref put(permClient(CrashID))}
      end}
end

In the old method getMakeNewClient, the procedure MakeNewClient has to be changed to call NewSafeStat instead of NewStat:

meth getMakeNewClient(MakeNewClient)
   proc {MakeNewClient UserMessageHandler UserID StaticClientObj}
      StaticClientObj={NewSafeStat
         ChannelClient
         init(self.selfStatic StaticClientObj
              UserID UserMessageHandler)}
   end 
end


Peter Van Roy, Seif Haridi and Per Brand
Version 1.1.0 (20000207)