ROBUST COLLABORATIVE MULTICAST SERVICE FOR AIRBORNE COMMAND AND CONTROL ENVIRONMENT
by user
Comments
Transcript
ROBUST COLLABORATIVE MULTICAST SERVICE FOR AIRBORNE COMMAND AND CONTROL ENVIRONMENT
ROBUST COLLABORATIVE MULTICAST SERVICE FOR AIRBORNE COMMAND AND CONTROL ENVIRONMENT∗ Roger I. Khazan, Scott M. Lewandowski, Clifford J. Weinstein, Stephen A. Goulet, Steven J. Rak, Prasad Ramanan†, Thomas M. Parks‡, Michael C. Hamler§ MIT Lincoln Laboratory, Information Systems Technology Group 244 Wood Street, Lexington, MA 02420-9108 Phone: (781)981-7620; Email: {rkh, scl, cjw, goulet, rak} @ll.mit.edu ABSTRACT RCM (Robust Collaborative Multicast) is a communication service designed to support collaborative applications operating in dynamic, mission-critical environments. RCM implements a set of well-specified message ordering and reliability properties that balance two conflicting goals: a) providing low-latency, highly-available, reliable communication service, and b) guaranteeing global consistency in how different participants perceive their communication. Both of these goals are important for collaborative applications. In this paper, we describe RCM, its modular and flexible design, and a collection of simple, light-weight protocols that implement it. We also report on several experiments with an RCM prototype in a test-bed environment. INTRODUCTION Collaborative applications, such as textual chat and shared white-boards, have been becoming important tools for supporting military operations [1]. However, mainstream, commercial implementations of these applications were not designed to provide the properties required for their use in military operations. What is even more important, technologies underlying these implementations (that is, algorithms, protocols, and architectures) were not designed to operate in dynamic, unstable network environments. For example, during the Joint Expeditionary Force Experiment (JEFX) 2002, a centralized, single-server chat application failed to tolerate link outages and resulted in poor availability, much lower than that of underlying communication links [3][4]. In order for collaborative applications to be useful, users must have consistent perceptions of their collaboration. This typically requires that different users receive all mes∗ This research was sponsored by the United States Air Force under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are not necessarily endorsed by the US Government. Dr. Khazan is the point-of-contact author. All authors are U.S. citizens. † Mr. Ramanan was a research assistant. He completed an MIT M.Eng. thesis under the supervision of Dr. Khazan and Dr. Weinstein in Summer, 2003. ‡ Dr. Parks participated in this project as a visiting scientist at MIT LL in Summer, 2003. Computer Science Dept., Colgate Univ., Hamilton, NY 13346. § Mr. Hamler was a research assistant. He completed an MIT M.Eng.. thesis under the supervision of Dr. Khazan and Dr. Weinstein in Summer, 2004. sages according to the same global order. One way to ensure globally ordered delivery of messages is to order them at a single location, using a single server. This is the approach typically used in existing collaborative applications, but it is obviously not appropriate for deployment in dynamic, mission critical environments. An alternative is to use protocols designed to implement globally ordered broadcast in a distributed setting using multiple servers. Many such protocols exist [2], but they rely on the servers’ abilities to continuously communicate and synchronize with one another. In an environment with frequent connectivity and link quality fluctuations, these protocols may result in prohibitive levels of overhead and sometimes, during network partitions, in blocking the service until partitions repair. Such behavior is unacceptable for real-time collaborative applications supporting military operations. In this paper, we present RCM (Robust Collaborative Multicast), a communication service designed to support collaborative applications operating in dynamic, missioncritical environments. RCM implements a set of wellspecified message ordering and reliability properties that balance two conflicting goals: a) providing low-latency, highly-available, reliable communication service, and b) guaranteeing global consistency in how different participants perceive their communication. Both of these goals are important for collaborative applications. The balance between these goals is accomplished by the IGO (Intermittent Global Order) multicast protocol. This protocol guarantees globally ordered delivery of messages during the times when the underlying network connectivity is stable and satisfies certain minimal levels of quality of service. However, during periods of instability, when participants disconnect or are poorly connected, the protocol may deviate from the global order: In particular, messages sent by participants that remain well connected are still delivered in the same order, but the messages sent by disconnected or poorly connected participants may be missing or may be delivered late. When this happens, RCM informs its clients about the reduction in the quality of service. The upshot of relaxing the ordering for messages sent by disconnected or poorly connected participants is that the protocol does not wait for these messages and is able to provide reliable, low-latency delivery of messages sent by other participants. 1 of 9 Note that IGO is more powerful than “best-effort” global order because the conditions under which the global order is guaranteed are well-defined, and because the clients are informed when these conditions do and do not hold. fect connectivity, as it was configured to weather outages of up to 60 sec. in duration. RCM design was guided by and successfully realizes several goals: The environment is comprised of several sites, which can either be airborne, like a Command and Control aircraft, or on the ground, like a Combined Air Operations Center. Each site hosts a number of communication end-points, such as chat application clients; see Figure 1. A reasonable upper-bound on the number of sites is ten, and that for communication end-points at each site is a hundred. • Connection-awareness: RCM informs its clients about changes in their connectivity to each other. The connectivity information is presented in terms of logical connection statuses, which correspond to RCM’s ability (or inability) to communicate with certain quality of service levels. • Weathering transient connectivity problems: RCM can be configured to tolerate connectivity problems (e.g., outages and congestions) up to a certain duration. It guarantees reliable communication among participants regardless of such transient problems. • • Intermittent global order: RCM strikes a balance between conflicting needs of collaborative applications by guaranteeing global ordering during stable times, and relaxing the ordering for the sake of low latency and high availability during problematic times. RCM realizes an important idea that one or several problematic participants should not “spoil” everyone else’s communication (which they would do with standard global order multicast protocols). Robustness: RCM is able to remain available regardless of connectivity changes, and in particular, regardless of periods of high dynamism. RCM is designed as a modular, flexible system; it is made up of well-defined protocol modules. The protocols implementing RCM reliability and ordering properties are decoupled from the protocols handling low-level connections and communication; the former comprise Protocol Stack and the latter – Connection Manager. Protocol Stack is insulated from low-level communication problems and, instead, makes use of high-level (logical) connectivity information provided by Connection Manager. We implemented an RCM prototype in Java. Our experiments in a test environment demonstrate RCM’s ability to maintain robust, highly-available service under adverse, dynamic network conditions. As an example, in one 12hour experiment involving four RCM servers, we configured each link to be available 73% of the time, and the service to tolerate all link outages up to 60 sec. in duration. The results were the following: RCM remained operational throughout the experiment and provided logical connectivity for ~91% of the time. An analysis of link outages showed that ~9% of the time links were down due to outages lasting more than 60 sec. Thus, RCM achieved per- ENVIRONMENT MODEL Intra-site net Intra-site net Inter-site network Intra-site net Figure 1. Airborne command and control environment. Within each site, the communication end-points are connected by a fast and reliable communication network, which is similar to a typical local-area network (LAN). The sites themselves are also connected by a communication network. However, this network is significantly worse than within sites. The inter-site network utilizes different types of links (line-of-sight and beyond-line-of-site) some of which are unreliable and may exhibit variable performance. Because of link outages and restorations, the intersite network may partition and merge dynamically. Both intra-site and inter-site networks implement standard IPbased communication protocols, such as TCP and UDP. ARCHITECTURE FOR ROBUST CHAT Our project was motivated by the need to create a robust chat application appropriate for deployment in an airborne command and control environment. In its simplest form, this application should support upto a hundred users at each site collaborating in multiple chat rooms by posting text messages there. Thus, messages originated by a given user typically need to be broadcast to many users located at different sites. The environment model suggests a natural architecture for a robust chat application; see Figure 2. Each site has a single chat server and a collection of chat clients located at communication end-points. When a user sends a message to a group of users, the user’s chat client transmits the message to its local chat server via the reliable and stable intra-site network. The chat server then transmits the mes- 2 of 9 sage to other chat servers using a communication service designed for the inter-site environment. Once the chat servers receive the message, they transmit it to the chat clients of the appropriate local users. Chat Server RCM Server Dynamic Wireless Network Chat Server RCM Server to have a single RCM server at each site, and to have each server serve a single RCM client – a Chat server in our case (the chat server itself has many clients); see Figure 2. This design decision makes sense in the context of our target application and the target deployment environment. If necessary in the future, multiple RCM clients can be supported with a multiplexer/demultiplexer layer on top of RCM. As a result of this design decision, client addresses and server addresses are specified as sites. For simplicity, we often say “site” instead of “client” or “server”. We now describe the interface and semantics of the RCM service. Interface Chat Server RCM Server The service has a simple interface consisting of the following methods: • send(message): A method that is used by clients to send messages to sets of clients. The source and destinations are specified in the message as source and destinations fields. We will say that “a message is sent to a site”, when the site is included as one of the message’s destinations. • receive(message): A blocking method that is used by clients to receive messages sent to them; (this method is typically invoked by a dedicated receive thread at the client). Naturally, when a message is received by a client, the destinations field includes the client as one of its destination sites, and the source field contains the sender of the message. • setConnectionStatus(site, status): A callback method that is provided by clients in order for them to be notified about changes in connectivity with other clients. The site parameter specifies the location of a remote client. The status parameter takes one of three logical values: connected, suspected, or disconnected. These correspond to RCM’s ability (or inability) to communicate with the remote site at pre-defined quality of service levels, such as latency and bandwidth. The definitions of different QoS levels are applicationand environment-specific. Figure 2. Decentralized, scalable architecture for Robust Chat. Each site has a chat server providing chat services to its local chat clients. Chat servers use the RCM (Robust Collaborative Multicast) service to communicate with each other. This architecture has a number of strong points. It is: • Decentralized: Individual sites and groups of sites can operate independently from other sites if they are disconnected from them. This is necessary for achieving robustness and high-availability. • Scalable in the number of users: Because chat servers collect messages from and distribute messages to their users, only a single copy of each message is sent to each site, regardless of how many users that site has. • Modular: This architecture separates the chat application from the underlying inter-site communication service. This has two implications: First, the architecture can be utilized for other collaborative applications. Second, it allows for different implementations of inter-site communication services to be plugged in to the architecture. • Tailored to the environment: The architecture respects the heterogeneity inherent to the environment and brings attention to the one thing that is the true challenge here: inter-site communication. The focus of this paper is RCM, an inter-site communication service designed to support robust collaboration in this environment. RCM SPECIFICATION RCM is a generic multicast service that allows its clients to communicate messages to sets of clients. Messages are addressed by clients’ locations. We made a design decision Clients are allowed to send messages to a site only when the site is connected or suspected, but not when it is disconnected. In the current design, the set of all sites in the system is known a priori, at the time RCM servers are initialized. In the future, the service can be extended with a protocol supporting join requests from new servers and leave requests from existing servers. Semantics RCM provides a collection of properties that is tailored to the needs of collaborative applications operating in dy- 3 of 9 service starts routing messages between A and C through B. According to the definition, set {A, B} is connected, but it is not stable (because B can communicate with a site outside the set). Set {A, B, C} is neither connected nor stable. namic, distributed environments. In this paper, for the sake of brevity, we give intuitive, but informal, definitions of these properties; precise definitions will appear in [6]. • • Connection-awareness: Clients of RCM are notified about changes in connectivity to other clients via the setConnectionStatus(site,status) method. RCM guarantees that if two sites become connected and are able to communicate with adequate quality of service for a sufficiently long period of time, RCM will notify the clients at these sites that they are connected. Likewise, if connection quality between two sites falls below a pre-defined level and remains such for a sufficiently long period of time, RCM will inform the clients at these sites that they are disconnected. As was mentioned before, RCM can be configured to weather connectivity problems of up to certain duration (e.g., 1 minute); this duration can be specified by the client application. The suspected status is used during such problematic periods to give an early notification to clients about connectivity problems that RCM is attempting to weather. Reliability of message delivery: RCM guarantees that sites sending messages to each other will receive them as long as they do not see each other as being disconnected. What is important is that reliability is guaranteed not only during the times when sites are connected, but also when their connections are suspected (as long as they are not disconnected later). RCM clients rely on this property to know that their messages will be delivered to their destinations regardless of transient problems. • FIFO ordering of message delivery: RCM guarantees that messages are delivered in the order in which they are sent by their senders. This is a local persender ordering property, not a global one. • Intermittent Global Order of message delivery: RCM balances two conflicting needs of collaborative applications: low-latency and global order. In order to keep message latency small it guarantees global order only during the times when the environment satisfies certain conditions – when it is “stable”. More precisely, define a set of sites to be connected if every pair of sites is connected, that is, if the sites can communicate with a certain quality level. A set of sites is stable if it is connected, and if every site in this set is disconnected from every site outside the set. To exemplify the difference, consider three sites, A, B, and C, with A connected to B, and B connected to C, but A and C disconnected from each other. Such intransitivity may happen in dynamic networks, typically only for a short time, until the underlying communication The Intermittent Global Order property guarantees that all sites in a stable network component receive the same messages in the same order; that is, each site receives the same sequence of messages. When a network component is not stable, different sites may differ in the sequences of messages they receive: The sites that remain connected, will still receive their messages in the same order, but the sequences of messages that these sites receive may differ in the messages sent by other sites. Some of these messages may be received late, in which case, RCM informs the clients about the lateness of these messages and about their proper places in the sequences of messages received. • Robustness: RCM guarantees to remain operational even during periods of high dynamism of the inter-site network. Sites that remain connected will be able to communicate. Moreover, once the network stabilizes and remains in that state for a sufficiently long period of time, the connectivity provided by RCM will match the connectivity available in the network. RCM DESIGN The RCM service is provided by servers located at the sites. Each site has a single RCM server and, as was mentioned above, a single RCM client. When a client submits a message to its RCM server, the server sends the message to the servers located at the destination sites; the servers then deliver the message to their clients. Note that, in order to satisfy message ordering properties, RCM servers may need to delay delivering messages to their clients; this will be discussed below. An RCM server is made up of two components: Connection Manager and Protocol Stack; see Figure 3. Connection Manager encapsulates all connection-related activities and translates low-level connectivity information into the high-level, logical connection statuses: connected, suspected, and disconnected. Protocol Stack encapsulates all messaging-related protocols, which together implement the reliability and ordering properties of RCM. The protocols are shielded from all low-level connection issues (such as for example socket exceptions); instead, they deal only with the logical connection statuses. This decoupling of the protocols handling low-level connectivity and communication from the protocols implementing reliability and ordering properties is an important design feature of RCM. 4 of 9 • Chat Server RCM Server mcast Connection Manager ProtocolStack igo Monitoring TCP connections for liveness and quality: Connection Manager listens for socket channel events, keeping track of liveness and quality of TCP connections. It sends a heartbeat message to a remote site if no messages are sent there for some time (determined by heartbeat timeout). p2p s2s s2s socket s2s s2s Dynamic wireless network (TCP-like service) Figure 3. Architecture of a Robust Collaborative Multicast server. The server consists of two components, Connection Manager and Protocol Stack, and relies on a TCP-like underlying service to communicate with other RCM servers. Figure 4. Low-level connection states and transitions implemented by Connection Manager. RCM operates atop TCP, thereby taking advantage of the facts that TCP implements reliable FIFO communication within individual connection instances, and that TCP has flow-control. We are assuming that TCP (or a TCP-like service) is available as part of the inter-site communication infrastructure and that it is optimized for operation in the airborne environment. This assumption is valid in our specific target environment, and it is also reasonable in general, as numerous R&D projects are aimed at developing standard communication services for dynamic, wireless networks. One example is the 4xComm project of MIT Lincoln Laboratory. The 4xComm system leverages multiple types of links available in airborne command and control environments to do intelligent routing and bandwidth allocation [11]. Figure 4 shows a state machine that governs how Con- nection Manager handles each TCP connection. There are three states: TCP_Connected, TCP_Silent, and TCP_Disconnected. The TCP_Connected state is entered when a TCP connection is established and the state machine remains in this state while the connection satisfies certain predefined quality levels. If the connection does not satisfy these quality levels, for example, if no messages are received for some time (liveness timeout), the state machine transitions to TCP_Silent. If a socket exception occurs either in TCP_Connected or TCP_Silent states, Connection Manager closes the socket and transitions to TCP_Disconnected; There, it periodically attempts to establish a new connection with the remote site. Also, while in the TCP_Silent state, Connection Manager has an option of closing the socket after some time (silence timeout). Connection Manager Connection Manager implements the following connection-related activities: • Establishing and re-establishing connections to other sites: Whenever an RCM server is not connected to another RCM server, it attempts to establish a connection with that server once every reconnect period. When Connection Manager succeeds in establishing a TCP connection, the socket associated with the TCP connection is passed to a socket channel. These are the channels at the bottom of Protocol Stack that implement socket-level communication to individual sites, i.e., reading messages from and writing them to the socket. All socket exceptions are passed to Connection Manager and typically result in the socket being closed. Connection Manager then attempts to reestablish a connection, and if it succeeds, it passes the new socket to the appropriate socket channel. • 5 of 9 Informing its RCM server about connectivity changes: Connection Manager translates low-level connection events into the three logical connection statuses: connected, suspected, and disconnected, which it passes up to its server. Figure 5 shows the state machine that governs this process. Transitions are labeled with the states of the state machine handling TCP connections. Thus, when a TCP connection with a remote site enters the TCP_Connected state, Connection Manager notifies the server that the site has connected. If the TCP connection then goes to TCP_Silent or TCP_Disconnected, the server is notified that the connection is suspected. Starting from this point, if a live connection with the remote site is not established within certain time (suspect timeout), Connection Manager notifies its server that the site has disconnected. The value of suspect timeout is the maximal duration of transient problems that RCM will weather. TCP connection fails, it loses all messages that were written into the socket but did not make it to the remote site. This is why Robust P2P is needed. When the server learns about a new connection status, it propagates this information to its client and to the protocol stack. The client and the protocol stack are unaware of sockets and their exceptions; they see only the logical connection statuses produced by Connection Manager. 2. Robust Point-to-Point (P2P) channel: These channels lie on top of socket channels. They provide reliable FIFO communication to remote sites as long as their connectivity statuses remain connected or suspected. The basic idea is to buffer messages while they remain unacknowledged and retransmit them in situations when the underlying socket channel (TCP) may have lost them. Here is the protocol in a nutshell: Figure 5. Logical connection states are passed to the RCM server, and then later to its client. Transitions occur as a result of low-level connection changes. Protocol Stack Protocol Stack implements the reliability and ordering properties of RCM. It is comprised of four types of channels. A channel is a stackable abstraction for point-to-point and multicast protocols, where upper channels use underlying channels to send and receive messages. The protocol stack and the channel abstraction is a generic and flexible framework for experimenting with different protocol implementations and different protocol combinations. We now describe the four types of channels as they are currently implemented in RCM. 1. Socket channel: A socket channel implements socketbased communication with a remote site; it uses a socket that it gets from Connection Manager. The interface of socket channel includes the send(message) method, which writes message to the socket, or returns false if it has no socket . The channel also has a receive thread that reads messages from the socket and delivers them to the upper channel. Recall that Connection Manager uses socket channels to send heartbeats, and also listens for all send and receive events generated by the channels. Socket exceptions are propagated to Connection Manager and typically result in the socket being closed. When Connection Manager succeeds in reestablishing a connection with the remote site, it initializes the appropriate socket channel with the new socket. Note that, while TCP implements reliable FIFO communication within individual connections, it is not able to tolerate long-lasting link outages (seconds or minutes). If a On the sending side, the channel buffers all messages that it sends until they are acknowledged by the remote site. When connectivity with the remote site is suspected, the channel suspends sending messages, and just buffers them. When the remote site becomes connected again, the channel restarts sending buffered messages, starting from the first unacknowledged message. On the receiving side, the channel monitors sequence numbers of the messages received from the underlying socket channel. It delivers messages to the channel above if their sequence numbers monotonically increase. Otherwise, it simply ignores them because they are duplicates of already delivered messages. The reason duplicate messages may be received is because all unacknowledged messages are retransmitted by the sending side after a reconnection; however, some of these messages may have been successfully received before. The receiving side also periodically acknowledges the sequence number of the last message it delivers to the channel above; this allows the remote site to remove from its buffer all messages with smaller sequence numbers. If the channel learns that the remote site has disconnected, it no longer has to provide reliability for the buffered messages, and can delete them from its buffers. Even though these messages are allowed to be deleted, our implementation does not delete them for some time if there is enough memory to hold them. Correctness of this protocol relies on the reliability and FIFO ordering properties of TCP and the fact that after a reconnection the protocol sends buffered messages starting with the first unacknowledged message. Thus, the receiving side receives overlapping FIFO sequences and it uses message sequence numbers to ignore the overlap. This is a simple, straightforward protocol that works well for low sending rates and short messages; otherwise the overlapping regions may become large, leading to inefficient use of bandwidth. A more efficient protocol can eliminate the overlaps by having servers synchronize and retransmit only the messages that were lost. 6 of 9 3. Multicast channel: The third layer from the bottom is a multicast channel, implemented as a simple pass-through multiplexer/demultiplexer. When the channel receives a send(message) request from the channel above (IGO in this case), it copies this request to the Robust P2P channels that correspond to the destination sites. When it receives a message from an underlying Robust P2P channel, it delivers the message up to the channel above. In general, implementing multicast as a combination of point-to-point transmissions, one for each destination, can be inefficient. This is because point-to-point transmissions to different destinations may occur on the same physical links, resulting in redundant communication. However, we decided to go with the simple solution because there are several factors that contribute to communication redundancy being small in our case: First, the number of destination sites in the targeted environment is small. Second, many sites are connected by dedicated physical links, like TCDL. Third, the messages generated by the target application and by RCM are typically small. A standard way of implementing multicast more efficiently is with a multicast tree. Such an implementation can be easily plugged into the protocol stack instead of the simple multicast channel, if this becomes necessary in the future. 4. Intermittent Global Order channel: The top layer of the protocol stack implements the intermittent global ordering property, which guarantees that messages are delivered in the same order to different destination sites, as long as the sites form a stable set (see RCM Specification). However, during instabilities, message streams delivered to different sites may be different; in particular, they may differ in the messages sent by problematic sites. IGO realizes the idea that a single site or a group of sites with slow or broken links should not “spoil” (delay or block) everyone’s communication (which they would do with standard global order multicast protocols). or broken link results in a delay or blocking of everyone’s messages. Our IGO protocol is a simple modification of the standard logical time protocol: In deciding whether a message can be delivered, the protocol considers the logical times of only those sites to which it is connected, and does not consider the logical times of suspected or disconnected sites. As a result, messages from suspected or disconnected sites may arrive and be delivered late. Late messages are marked as such and the client is informed about their proper global order positions. The upshot of allowing late messages from suspected and disconnected sites is that the protocol does not wait for these messages and keeps ordering and delivering messages from other sites. EXPERIMENTAL RESULTS We implemented an RCM prototype in Java, paying particular attention to making the low-level design and implementation correspond to the high-level design. We have also put together a four site test-bed network and have been conducting experiments, testing each of the five properties of RCM: connection-awareness, reliability, FIFO, IGO, and robustness. In this paper we report results on the following experiment setup: A group of four sites is connected via 56kbps links. The client at each site broadcasts 2 messages per second to the four sites. Each experiment lasts one hour and alternates between an interval with all links being up and an interval with some links being down, creating two two-site partitions; the durations of the two intervals are fixed for each experiment. RCM is configured with a 1 sec. heartbeat timeout, 5 sec. liveness timeout, 60 sec. suspect timeout, and 3 sec. reconnection period (see the Connection Manager section). Connection awareness and availability connected We have designed four different IGO protocols as simple modifications of standard global order protocols, in particular, the protocols based on Lamport’s logical time, sequencer server, real time, and logical/real time hybrid [2]. The first two are described and experimentally compared in [10]; the last two are the subject of a future publication. In this paper we describe the logical time protocol; it has been formally modeled and verified by Matlon [9]. In the standard logical time protocol [8], in order for a site to deliver messages to its client it must be able to continuously learn about the logical times of all other sites. Otherwise, it cannot deliver messages; this is because, if it does, it may later receive a message that should have been delivered first. Hence, in this protocol, even a single slow Link disconnected 2 connected suspected RCM Time (sec) 0 80 100 120 140 160 180 200 Figure 6. Link availability vs RCM logical connectivity. 7 of 9 • Connection-awareness and availability: Figure 6 shows an RCM client’s view of its connectivity to another site in an experiment with 30 sec. up and 10 sec. down intervals. RCM implements connection-awareness: Within ~6 sec. after a link outage occurs, RCM notifies the client that the link is suspected, and then reverses this suspicion within ~3 seconds after the link comes back up. Throughout the experiment, RCM maintains 100% logical link availability, i.e., it never tells the client that the remote site has disconnected and allows the client to communicate with the remote site. • Reliability: Figure 7 demonstrates RCM’s ability to tolerate transient link outages. The figure reports on a set of one-hour long experiments with different durations of the “down” interval, from 0 to 60 seconds. In all experiments, Robust P2P was able to successfully tolerate link outages and provided 100% reliability. Reliability characteristics of three P2P protocols during transient link outages ing period at each site, and agrees with the theoretical performance property of the logical time protocol. The latency of several seconds is appropriate for our text chat application. (Other IGO protocols, such as those based on a sequencer and those based on real-time yield smaller latencies; the latency properties of these protocols do not depend on the sending periods but on network latencies.) When a down interval begins, RCM temporarily stops delivering messages because IGO does not receive timestamps from the two disconnected sites and, hence, cannot establish the global ordering of messages. Then, when IGO is notified that the connections to the two sites are suspected, it goes on ordering and delivering messages without waiting for these sites. About 6 sec. after each down interval starts, messages are delivered in a burst, with latencies at most 6 seconds. Then, for the remainder of the down interval, only the messages of the two connected sites are delivered, with the average delivery rate of ~4 msgs/sec. Messages received vs sent 100% 95% 90% TCP 85% TCP + Suspected Robust P2P 80% 75% 70% 65% 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 Duration of link outages (sec) Figure 7. Robust P2P, unlike TCP, tolerates link outages. Figure 7 also highlights the importance of the Robust P2P protocol by showing the reliability measurements for two simpler options: TCP and TCP+Suspected. TCP failed during outages lasting more than several seconds, resulting in more than 30% of messages being lost. TCP+Suspected is a simple point-to-point channel that stops sending messages when it learns that the link is suspected, and instead buffers them until connection is reestablished. This protocol loses far fewer messages than TCP. Nevertheless, it still loses messages that are sent while the link is down but not yet suspected. • Intermittent Global Order: Figure 8 shows latency and data rate measurements at a single RCM server in an experiment with 30 sec. up and 10 sec. down intervals. During the up intervals, the average delivery rate is 8 msgs/sec, which matches the sending rate of 2 msgs/sec. at each of the four sites. The average latency during the up intervals is 0.5 sec. – this corresponds to the average send- Figure 8. Delivery rate and latency at an RCM server, in an experiment with 10 sec. down and 30 sec. up intervals. When an up interval begins, and RCM reestablishes connection with the disconnected sites, Robust P2P transfers messages that were sent during the down interval, and IGO orders and delivers them; some of these messages are “late” according to the global order. This second burst of messages is greater than the first one, with latencies going as high as 14 sec. After these messages are delivered, the rate and latency drop back to normal levels. • Robustness: We have also tested robustness of RCM with experiments that alternate between a stable interval and an interval of high dynamism (links going up and down and servers restarting). We confirmed that, within several seconds after stable periods begin, RCM successfully (re)establishes connections with the connected sites. 8 of 9 CONCLUSIONS REFERENCES This paper presented RCM, a multicast service for supporting collaborative applications in dynamic, mission critical environments. RCM is connection-aware, highlyavailable, and robust. It guarantees low-latency, reliability, and global order during stable periods, when sites are wellconnected, and relaxes the global order property in favor of low-latency and reliability when connectivity problems occur. RCM has a number of parameters that can be tailored to the application needs and its environment. For example, RCM can be configured to weather transient problems of certain durations. [1] A. Butler, War Planners Talk In 'Chat Rooms' to exchange targeting data, tips, in Inside the Air Force. April 25, 2003. As part of this project, Hamler implemented a prototype chat application, called Robust Chat, consisting of chat servers and chat clients [5]. Chat servers operate on top of RCM servers; the connection between chat clients and chat servers uses non-blocking IO, which was chosen to allow the application to scale well as the number of clients grows. Robust Chat is tailored to the airborne command and control environment: It features multiple chat rooms, connection-awareness, and site-awareness. Future versions will include missing message and late message notifications, access control mechanisms, and message acknowledgments. Recently, Robust Chat was used by the MIT Lincoln Laboratory team participating at the Joint Expeditionary Force Experiment (JEFX) 2004. According to the team, the application operated robustly in highly dynamic network conditions and helped the team maintain situation awareness and coordinate activities between the ground and airborne sites. RCM is not just a service with the protocols described in this paper. Rather, it is a platform for experimentation with different protocols and different algorithms. We have demonstrated that relatively simple, light-weight protocols can be used to achieve a useful combination of properties. We are currently refining and optimizing these protocols, as well as exploring new ones. We are also planning on considering applicability of RCM and its founding ideas to other collaborative applications, and to other dynamic, mission-critical environments. [2] X. Défago, A. Schiper, and P. Urbán. Totally ordered broadcast and multicast algorithms: A comprehensive survey. Technical Report DSC/2000/036, ÉPFL, Switzerland, September 2000. [3] S.T. Dougherty, JEFX 2002 ends with positive results, in Air Force Link. August 9, 2002. [4] D. A. Fulghum, Tests of MC2A Reveal Problems and Promise, in Aviation Week & Space Technology. 2002. [5] M. Hamler. A Robust Multi-Server Chat Application for Dynamic Distributed Networks. MIT Masters of Engineering thesis, August 2004. [6] R. Khazan, S. Lewandowski, C. Weinstein, et al. Robust Collaboration in Airborne Command and Control Environment. MIT Lincoln Laboratory Technical Report, in preparation. [7] R. Khazan and N. Lynch. An Algorithm for an Intermittently Atomic Data Service Based on Group Communication. Proc. of the International Workshop on Large-Scale Group Communication, pp. 25-30, October 2003. [8] L. Lamport, Time, Clocks, and the Ordering of events in a distributed system. Communications of the ACM, 21(7): p. 558-565, 1978. [9] C. Matlon. A specification and verification of Intermittent Global Order Broadcast. MIT Masters of Engineering thesis, May 2004. [10] P. Ramanan. Robust chat for airborne command and control. MIT Masters of Engineering thesis. August 2003. [11] D. Van Hook, S. McGarry, J. Calvin. The 4xComm Architecture Concept for Wireless Airborne Networks, MILCOM 2003. Acknowledgements: We thank Lincoln Laboratory staff, especially Stephen McGarry, David Kettner, Joseph Cooley, Alen Peacock, Lenny Veytser, Orton Huang, and Paul Christensen, for their assistance with testing and experiments. We also thank Nancy Lynch, Catherine Matlon, and Simon Zaslavsky for suggestions and interesting discussions related to different parts of this project. 9 of 9