GASNet Portals-conduit documentation
Michael Welcome <mlwelcome@lbl.gov>
$Revision: 1.18 $

User Information:
-----------------
The GASNet Portals conduit is being developed exclusively for the 
Cray XT3/XT4 and follow-on Portals based systems.  The design and implementation is 
based on the Portals 3.3 Message Passing Interface (Revision 1.0) developed at 
Sandia National Laboratory and the University of New Mexico.  These should
be only a few modifications required to port this implementation to non-Cray
Portals systems.

This implementation runs under both the Cray Catamount and Compute Node Linux (CNL)
operating systems.  Also, note that earlier implementations of this conduit
were hybrid implementations, using Portals directly for Put and Get operations
and the MPI-conduit for the active message layer.  This version of the conduit
is implemented entirely over Portals and has no MPI dependencies.

This implementation will only work under the GASNET_SEGMENT_FAST environment, although
modification to GASNET_SEGMENT_EVERYTHING are possible using the GASNet FireHose
algorithm.
In addition, this implementation is restricted as follows:
  Under Catamount:  GASNET_SEQ  (no threading environment under Catamount)
  Under CNL:        GASNET_SEQ and GASNET_PAR.  

Note:  One could use GASNET_PARSYNC by simply defining GASNET_PAR but there 
       are probably several efficiencies to be gained by combing through the 
       uses of locks to see which would be required in this environment and
       which are not needed.


Some notes on building GASNet for the Cray XT3/XT4
--------------------------------------------------

* Since the XT3/XT4 requires using a cross-compiler, there is a special cross
  configuration script located in 
      $GASNET_SRC_DIR/other/contrib/cross-configure-crayxt-catamount
  for a Catamount system OR
      $GASNET_SRC_DIR/other/contrib/cross-configure-crayxt-linux
  for CNL.

   Instructions for building gasnet are:
   cd $GASNET_SRC_DIR
   cp other/contrib/cross-configure-crayxt-<OS> .
   ./Bootstrap
   vi ./cross-configure-crayxt-<OS>
      to modify any of the configure-time options, such as turning on DEBUG or TRACE mode.
   ./cross-configure-crayxt-<OS>
   make

* Special notes on configuring for GASNET-PAR under CNL
  You must be sure to use the correct version of the pthreads, one that works
  with Portals.  At the time of this writing, 09/01/2007, the only Portals-compatible
  pthreads library available on CNL is NPTL, which is located
  in /usr/lib64/nptl.  The header files are in /usr/include/nptl.
  Unfortunately, the pthread.h header file in /usr/include/nptl is buggy and needs
  to be modified.  The easiest thing to do is copy the /usr/include/nptl directory
  to a local location, say $HOME/nptl.  Modify the pthread.h file as follows:

  Change:
	extern int __sigsetjmp (struct __jmp_buf_tag __env[1], int __savemask) __THROW;
  To:
	extern int __sigsetjmp (struct __jmp_buf_tag *__env, int __savemask) __THROW;
  Change:
  #define PTHREAD_MUTEX_INITIALIZER {}
  #define PTHREAD_COND_INITIALIZER {}
  To:
  #define PTHREAD_MUTEX_INITIALIZER { { 0 } }
  #define PTHREAD_COND_INITIALIZER { { 0 } }

  Assuming no significant changes to pthread.h, this patching can be done automatically
  (and non-destructively) during configure by using the configure argument: 
    --with-pthreads-patch=other/contrib/crayxt-nptl-pthread-patch

  Finally, the following to EXTRA_CONFIGURE_ARGS in cross-configure-crayxt-linux ensure
  the use of NPTL libraries and headers:
    --with-pthreads-include=/usr/include/nptl --with-pthreads-lib=/usr/lib64/nptl 
  these settings are on by default in cross-configure-crayxt-linux.

Recognized environment variables:
---------------------------------
GASNET_PORTAL_MSG_LIMIT=N
  This value limits the total number of messages in-flight from this node.
  Given that the Cray seastar has 256 available slots in the DMA engine, any more
  in-flight messages requires a more expensive reliability algorithm to be
  used at the low levels.  This value only limits the number of INITIATED
  messages, there is no control over the number of incoming messages, which
  also consume DMA slots.  Earlier tests showed that performance would drop
  if this value got much larger than about 300.  The default value is 250.

GASNET_PORTAL_NUM_TMPMD=N
  Specifies the number of temporary memory descriptors that can be allocated
  at any time.  The default value is 1024.
  Temporary MDs are allocated when the payload of a Long lies outside the
  GASNet segment and is too large to "pack" into a Request chunk together
  with the AM arguments and metadata.  See also GASNET_PORTAL_PACKED_LONG.
  If firehose is disabled Temporary MDs are also allocated
  for Put or Get operations with local addresses (src for Put, dest for get)
  outside the GASNet segment and length too large for a Request chunk (bounce
  buffer).  See also GASNET_PORTAL_{PUT,GET}_BOUNCE_LIMIT.

GASNET_PORTAL_PACKED_LONG=boolean (0, 1, y, n, true, false)
  If FALSE, this turns off the ability to pack short-payload AM Long messages
  into a Request Send Buffer chunk.  A normal AM Long request or reply will
  send two messages, a header and a data put.  For small AM Long messages,
  using PACKED longs has better performance.
  The default value is TRUE, allowing PACKED AM Longs.

GASNET_PORTAL_FLOW_CONTROL=boolean
  if TRUE, then active message flow control is enabled.  This is the default.
  FALSE disables flow control, but the application may fail if a node receives too
  many AM requests such that its receive buffer overflows.

GASNET_PORTAL_DYNAMIC_CREDITS=boolean
  If TRUE, then dynamic credit management is enabled.  This is the default.
  With dynamic credit management, a node dynamically distributes its credits to
  the nodes it has been communicating with recently.  
  FALSE disables this feature.

GASNET_PORTAL_SB_CHUNKS=N
  Specifies to allocate this number of memory chunks to the Request Send Buffer
  memory descriptor.  Each active message request must allocate a chunk for its
  use.  The chunk is held until the corresponding AM Reply arrives.
  These chunks are also used as bounce buffers for small Put and Get operations.

GASNET_PORTAL_RPL_CHUNKS=N
  Specifies to allocate this number of memory chunks to the Request Reply Buffer
  memory descriptor.  Each active message reply must allocate a chunk for its use.
  The chunk is freed when the AM Reply is known to have been sent off-node.

GASNET_PORTAL_SAFE_LIMIT=N
  This value defines the maximum number of events that are allowed to be processed
  on the SAFE event queue, per polling call.  
  The default value is 12.

GASNET_PORTAL_AM_LIMIT=N
  This value defines the maximum number of events that are allowed to be processed
  on the AM event queue, per polling call.  
  The default value is 8.

GASNET_PORTAL_SYS_LIMIT=N
  This value defines the maximum number of events that are allowed to be processed
  on the SYS event queue, per polling call.  
  The default value is 0, meaning there is no limit.  That is, all events are 
  processed per polling call.

GASNET_PORTAL_MAX_CRED_PER_NODE=N
  This value limits the total number of credits that can be allocated to a 
  remote node.
  The default value is 400.

GASNET_PORTAL_ACCEL=boolean  /* if built with -DGASNETC_USE_SANDIA_ACCEL */
  This value can only be used on a Sandia based Catamount system with
  accelerated portals.  In this environment, the majority of the portals
  processing is done on the seastar chip, rather than on the opteron.
  If TRUE, (and GASNet is built with the -DGASNETC_USE_SANDIA_ACCEL flag)
  then accelerated mode is enabled.
  The default is FALSE.

GASNET_PORTAL_CREDIT_STATS=1 
  will write credit stats to a file.  This allows a developer to get a sense of
  how the flow-control credits are distributed.  It should not be used in performance
  runs.

GASNET_PORTAL_EPOCH_DURATION=N
  The number of AMRequest operations a node receives before it ends its credit Epoch.
  The default value is 1024.
  See the section on dynamic credit management (below).

GASNET_PORTAL_CRED_PER_NODE=N
  The number of flow_control credits issued to each remote node at startup.
  By default, this is determined via a table lookup based on the number of
  nodes in the job.
  
GASNET_PORTAL_BANKED_CREDITS=N
  The number of flow_control credits that are put into the descressionary bankpool
  on each node at startup.
  By default, this is determined via a table lookup based on the number of
  nodes in the job.

GASNET_PORTAL_AMRECV_SPACE=N
  The number of bytes to allocate for the ReqRB space.
  By default, this is determined via a table lookup based on the number of
  nodes in the job.

GASNET_PORTAL_PUT_BOUNCE_LIMIT=N
  The largest transfer size for which a Put with out-of-segment local
  address is copied through pre-pinned bounce buffers.  For out-of-segment
  transfers of larger size, dynamic memory registration (TMPMD or firehose)
  is used.
  On Catamount the current default is 1k.
  On CNL the current default is 0 (or 2k if firehose is disabled).
  The maximum is the size of a ReqSB chunk.

GASNET_PORTAL_GET_BOUNCE_LIMIT=N
  The largest transfer size for which a Get with out-of-segment local
  address is copied through pre-pinned bounce buffers.  For out-of-segment
  transfers of larger size, dynamic memory registration (TMPMD or firehose)
  is used.
  The current defaults are the same as for GASNET_PORTAL_PUT_BOUNCE_LIMIT.
  The maximum is sizeof(void*) less than the size of a ReqSB chunk.
  If the value of this parameter equals the chunk size, then the required
  sizeof(void*) is silently deducted.

GASNET_PORTAL_PUTGET_BOUNCE_LIMIT=N
  For compatibility with past releases, if GASNET_PORTAL_PUT_BOUNCE_LIMIT
  or GASNET_PORTAL_GET_BOUNCE_LIMIT are unset, then this variable is used.
  The defaults are the same as for GASNET_PORTAL_PUT_BOUNCE_LIMIT.

GASNET_PHYSMEM_PINNABLE_RATIO=DOUBLE (default=0.75)
  The fraction of physical memory that can be made pinnable under CNL.
  This value should be reduced if a job dies at startup with a message like
     [NID 63]Apid 294285: initiated application termination
     [NID 00063] Apid 294285: OOM killer terminated this process.
  or
     [NID 18]Apid 49229: initiated application termination
     Application 49229 exit signals: Killed
  This parameter has no effect under Catamount.

* All the standard GASNet environment variables (see top-level README)

* The GASNET_EXITTIMEOUT family of environment variables (see top-level README)


Optional compile-time settings:
------------------------------

* All the compile-time settings from extended-ref (see the extended-ref README)

Known problems:
---------------

* See the Berkeley UPC Bugzilla server for details on known bugs.

*  On CNL, if we try to pin more memory than the OS will allow, the job is killed.
   Therefore there is really no way (that we know of) to determine the exact maximum
   pinnable memory (and therefore maximal GASNET_SEGMENT_FAST segment size)
   under CNL without dire consequences - furthermore the maximal value appears
   to be site-specific.  So portals-conduit uses a heuristic of assuming that some
   fraction of physical memory (controlled by GASNET_PHYSMEM_PINNABLE_RATIO)
   may be pinned.  At startup an attempt is made to pin the lesser of this amount
   and GASNET_MAX_SEGSIZE.  If this attempt fails, the application will die with
   an error message like:

     [NID 63]Apid 294285: initiated application termination
     [NID 00063] Apid 294285: OOM killer terminated this process.
   or
     [NID 18]Apid 49229: initiated application termination
     Application 49229 exit signals: Killed

   Programs encountering this error are recommended to reduce their GASNet
   segment size demands, either at the GASNet client level (eg
   UPC_SHARED_HEAP_SIZE), or by setting environment variable GASNET_MAX_SEGSIZE
   to a smaller value to impose a segment limit.  The default value for
   GASNET_MAX_SEGSIZE can be established at configure time using:
   --with-segment-mmap-max=XGB


==============================================================================

A Brief Overview of Portals:
---------------------------
All GASNet communication operations (puts, gets, active messages, barriers, etc)
are implemented in terms of Portals PtlPut, PtlPutRegion, PtlGet and PtlGetRegion
operations.

Portals Put and Get operations are RDMA operations between a local and remote "Memory
Descriptor".  A Memory Descriptor represents a pinned region of memory that is endowed 
with various properties, such the type of events that will be generated on the MD, the set of
operations that can be performed on the MD, and how the memory is managed ("Locally"
or "Remotely").  In addition, Portals MDs can be Free Floating, or on an ordered list attached
to a Portals Table Entry.  MDs that are attached to a Portals Table Entry are accessable 
to remote nodes as the target of a Put or source of a Get RDMA operation.  Free Floating
MDs can only be used as the source of a Put or the destination of a Get operation.
Each MD attached to a Portals Table Entry has a set of Match-Bits and Ignore-Bits (64 bits
wide) that are used to determine if it is the target of an imcoming message.

In addition, an MD may be associated with a local Portals Event Queue (EQ).  Depending 
on how the MD is configured, communication operations that operate on the MD will generate
events on the associated EQ.  The types of events are:
PTL_EVENT_SEND_END   - a locally initiated Put operation has been sent
PTL_EVENT_ACK        - a locally initiated Put operation has reached its remote destination MD
PTL_EVENT_PUT_END    - a remotely initiated Put operation has completed on a local MD
PTL_EVENT_GET_END    - a remotely initiated Get operation has completed on a local MD
PTL_EVENT_REPLY_END  - a locally initiated Get operation has completed on a local MD
PTL_EVENT_UNLINK     - a "Locally Managed" MD has filled and has been unlinked from its 
                       Portals Table List.

Portals Put and Get operations are non-blocking and do not return a handle.  As an example,
a PtlPutRegion option takes the following arguments:
  src_md:        A local memory descriptor, the source of the data put.
  local_offset:  The byte offset within src_md where the source message starts
  msg_bytes:     The length (in bytes) of the Put operation
  ack_req:       Should an ACK be generated when the data reaches the remote MD?
  target_id:     The process ID of the target node
  pt_index:      The Portals Table Index to target on the remote node
  ac_index:      The Access-Control table index on the remote node
  match_bits:    The 64-bit match-bits used to select the remote MD for this operation
  remote_offset: The byte offset in the remote MD where the data should be placed.
  hdr_data:      64-bits of immediate data that should be sent with the message.

The operation works as follows (and not necessairly in this strict order):
(1) The data in the src_md of length msg_bytes starting at local_offset is sent to
    the target_id node along with the match_bits and hdr_data.  
(2) When the data is safely copied out of src_md, and if src_md is configured to generate
    such an event, a PTL_EVENT_SEND_END event is generated on the EQ associated with the
    src_md (if one exists).
(3) When the message arrives at the remote node, Portals searches the list of MDs attached
    to the Portals Table at index pt_index.  For each MD, it masks the message match-bits
    with the MD's ignore-bits then compares it with the MDs match-bits.  If they match,
    it checks if the MD accepts Put opertions.  Finally, if the MD is "Locally Managed"
    and there is enough space to fit the message, the MD is selected and the data is
    delivered.  If any of these conditions fail, the next MD on the list is examined.
    When an MD is selected:
(4) If the sender requested an ACK, it is delivered back to the EQ of the src_md.
(5) If the selected target_md is configured to generate PTL_EVENT_PUT_END events and
    if the target_md is associated with an EQ, such an event is generated for this operation.

Several notes:
* If the target MD is configured to be "Locally Managed", then each incoming message 
  bound for that MD is placed, in sequence, after the previous.  In this case, the
  remote_offset of the PtlPut operation is ignored.  When the MD fills it will no longer
  accept incoming data.  The MD can be configured to be automatically unlinked from the
  Portals Table list and can generate a PTL_EVENT_UNLINK event on its MD.
* If the target MD is not configured as "Locally Managed" then the incoming data is
  placed at the remote_offset specified in the PtlPut operation.
* Get operations do not have the equivalent of the hdr_data (64-bits of immediate data).

When events are generated, the event structure contains the type of the event, a copy
of the match_bits used along with the hdr_data (if available) along with other data.
Since Portals Put/Get operations do not return a handle, the user must determine which 
operation the event was associated with via the context contained in the event structure.
Fortunately, the GASNet implementation uses only a handful of MDs so that most of the
64-bits in the match-bits can be used to hide information (masked off by the MDs
ignore-bits) to associate the event with the corresponding GASNet operation.

Full details of the Portals software interface can be found in:
"The Portals 3.3 Message Passing Interface, Revision 1.0" authored by Ron Brightwell,
Arthur B. Maccabe, Rolf Riesen and Trammell Hudson in May of 2003.

Memory Descriptors uses in the GASNet Implementation
----------------------------------------------------
RAR    - covers the local GASNet memory segment.  It has no EQ but will generated
         ACK events if requested.  It is linked on the Portals Table and is intended
	 as the target of remote GASNet Put/Get operations.
RARSRC - Also covers the local GASNet memory segment.  It is free floating and used as
         the Src MD of a Put or dest MD of a Get when the data happens to reside in
	 the GASNet segment and events are needed.  It is also used as the target of
	 AMLong Reply operations.  It generates events on the SAFE_EQ event queue.
RARAM  - Also covers the local GASNet memory segment and is associated with the AM_EQ
         event queue.  It is used as the target of AMLong Request operations.
	 This MD is also linked on the Portals Table.
ReqSB  - This MD covers a local send buffer that is divided into "chunks", that get
         allocated and freed for use by various GASNet communication operations.  It
         is associated with the SAFE_EQ.  It is used as the source of AM Requests, the
         destination of AM Reply operations, and as a pre-pinned bounce-buffer for
	 small Put and Get operations.  This MD is linked through the Portals Table.
RplSB  - This free-floating MD covers a local send buffer, also divided into 
         allocatable "chunks".  It is associated with the SAFE_EQ and used as the 
         source of AM Reply operations.
ReqRB  - These MDs are the only "Locally Managed" MDs used by GASNet.  They are associated
         with the AM_EQ and are used as the target of AM Request operations.  They are
	 linked on the Portals Table.
CB     - This MD is always linked at the end of the chain of ReqRB MDs and has the
         same match bits as ReqRB.  If an operation makes it to this MD, it represents
         a ReqRB overflow and results in a fatal error.  Its called the "catch-basin" MD.
TMPMD  - These MDs are temporary and free floating.  They are used as the source of large
         Put and destinations of large Get operations where the local data is not in
         the GASNet segment (RAR) and is too big to fit into a ReqSB bounce buffer.
SYSMD  - This memory descriptor is associated with a zero-length buffer and SYS_EQ 
         event queue.  This MD is used to send zero-length out-of-band system messages
         between the nodes.  All the information in the messages is contained in the
	 hdr_data and unused portions of the match-bits.

Event Queues used in the GASNet Implementation
----------------------------------------------
SAFE_EQ - The events on this queue can be polled and processed at any time.  They
          represent the completion of operations and will generatlly result in the
	  release of resources.  The operations associated with these events will
	  never require the acquisition of additional resources.
AM_EQ   - This event queue is only associated with the ReqRB buffers.  Events on this
          queue will result in the execution of AM Request handlers that will require
	  sending AM Reply messages.  This EQ can only be polled when resources are
	  guaranteed to be available to complete the AM Reply message.
SYS_EQ  - The event queue used to send/recv out-of-band system messages between nodes.
          It is used during program termination and in the dynamic re-distribution of
          flow-control credits.

The System Event Queue
----------------------
The SYS_EQ is used for out-of-band management messages sent between nodes.
The messages are short and all the information can be packed into the unused 
match-bits and hdr_data of the underlying PtlPut operations.  In order to limit
the size of the event queue, each node can have at most one outstanding message
to any other node at a time.  For this reason, each message must have a corresponding
reply or acknowledgement.
The SYS_EQ is used for clean shutdown operations and to implement the dynamic
credit re-distribution algorithm.
When a thread polls the network, it processes all available events on the SYS_EQ
before attempting to process any events on either the SAFE_EQ or the AM_EQ.

The Implemention of GASNet Put and Get Operations
-------------------------------------------------
All GASNet Put operations send data to the RAR of a remote node and all GASNet Get
operations pull data from the RAR of a remote node.  The remote node does not
have to be informed of the delivery (or theaft) of this data so the GASNet operations
target the RAR MD, which has no EQ associated with it.  However, in the case of
a GASNet Put operation, the underlying PtlPut operation requests an ACK event to
be sent back to the sender when the data has arrived.

Each GASNet Put or Get operation is associated with a GASNet handle, a polymorphic
typed object that records the state of the operation.  The objects can be referenced
by either a standard (64 bit) pointer, or a compact 24-bit representation that can
be converted to a pointer to the object.  

Consider the implementation of a non-blocking GASNet Put operation:

gasnet_handle_t gasnet_put_nb_bulk(void *dest, gasnet_node_t node, void *src, size_t nbytes);

* poll until a send ticket can be obtained.

* We know the destination node and virtual memory address of where the data is to be put.
  This region must lie within the RAR of the remote node.  We know the starting address of
  this RAR since this information was exchanged at job startup.  We know the RAR is 
  covered by the RAR MD memory descriptor and we can compute the (remote) offset 
  from the start of this MD.

* There are several cases for determining the MD of the source memory region:
  (1) It lives within the local RAR.  If so, use the local RARSRC as the source
      MD and compute its offset within this MD.
  (2) It does not live within the local RAR, but the message size is small enough to
      fit into a ReqSB chunk.  If so, attempt to allocate a chunk and copy the
      data into this "bounce buffer".  The local offset is the offset of this
      chunk within the ReqSB.
  (3) It does not live in the local RAR, it is too big to be copied though a bounce
      buffer, or its small enough but no ReqSB chunks were available.  In this case,
      construct a TmpMD to cover the source region.  Set the local offset to zero.

* Allocate a gasnet_handle_t object and encode its 24-bit representation into a portion of
  the 64-bit MATCH_BITS that will be ignored by the remote memory descriptors
  (the upper 60 bits).
  The handle is marked "IN_FLIGHT".

* Issue the PtlPutRegion operation from the selected local MD and local_offset to the
  remote node, specifying the RAR_PTE portals table entry and MATCH_BITS as specified 
  above.  Request that an ACK event be delivered when the data has been written to the
  remote memory.

* return the gasnet_handle_t object to the client.

At some point later in time, the data will be sent to the remote node, generating a 
PTL_EVENT_SEND_END event on the local source MD.  In addition, when the data has been 
delivered to the target memory, an PTL_EVENT_ACK event will delivered to the source MD.
At some point in time, the client or runtime layer will poll the SAFE_EQ to process
some of the outstanding events.
For this Put operation, the events will cause the following action:

** The SEND_END event will be ignored by all memory descriptors for this operation.
** The ACK event will cause the following actions:
   - the 24 bit representation of the gasnet_handle_t object will be extracted from
     the MATCH_BITS in the event structure.  The 64-bit pointer to the object will be
     generate from the 24-bit representation.  The operation will be marked "DONE".
   - If the event occurred on the local RAR (case (1) above), no action is taken.
     If the event occurred on the ReqSB_MD (case (2)), this was a copy through a 
     ReqSB bounce buffer.  The chunk is freed for re-use.
     If the event occurred on a TEMP_MD, the memory descriptor is "unlinked" (unpinned).

Finally, the next time the client or the runtime layer calls gasnet_wait_syncnb() or
gasnet_try_syncnb() with this handle, the handle is freed and the operation is complete.

GASNet Get operations are handled in a similar manner.  
All blocking Put/Get operations are implemented in terms of non-blocking operations.
They poll the network until the handles indicate the operations are complete before
returning from the call.

The Implemention of GASNet Active Messages
------------------------------------------
Before an active message request can be issues the sending thread must acquire
several resources.  If it cannot obtain these resources immediately, it must
poll on the event queues until the resources become available.
The required resources are:
* N flow-control send credits.  The value of N is dependent on the size of the AM message
  header.  One credits corresponds to 256 bytes of data.  All AMShort messages require
  1 credit.  AMMedium and AMLong messages may require up to 4 (or 8) credits.
  4 if GASNet is configured with a chunksize of 1024 bytes and 8 if it is configured
  with a chunksize of 2048 bytes.  Note that if the data payload of an AMLong will
  fit inside a Chunk, the message will be send with the data payload packed in with
  the header (otherwise, only the args are sent in the header and the data is sent
  in a seperate message).
* N send tickets.  AMShort, AMMedium and packed AMLong requests will require N=1 
  send tickets.  Non-packed AMLong message will require N=2 send tickets.  Send tickets
  are used to limit the total number of outstanding messages sent off-node at any one
  time.  The Cray XT seastar has 256 DMA slots and attempting to use more than this
  at any time can cause additional protocol overhead, decreasing performance.
* One ReqSB chunk.  This memory chunk will be used to send the AM Request header
  message, which (along with the unused portions of the match-bits and hdr_data)
  will include the Request handler to run on the remote node, the arguments to that
  handler, the offset of this ReqSB chunk, and the data payload for AMMedium and packed
  AMLong requests.  The ReqSB offset will be needed by the remote node since it will
  send the corresponding AMReply header message back to this chunk.
* In the case of a non-packed AMLong message the sending thread may also have to
  allocate a TMPMD ticket.  This will only be required if the data payload 
  is not already in an existing MD (such as the RAR).  We limit the number of outstanding
  TmpMDs in use any any time, although the total allowed is rather liberal.

Once the thread has acquired all the necessary resources, it packs the Request handler
id, the ReqSB offset, the handler arguments, the credits used (and additional requested)
and, in the case of the AMMedium and packed AMLong, the data payload into the 
match-bits, hdr_data and ReqSB chunk.  It then issues a PtlPutRegion to the target
node with the match-bits set to select the ReqRB MD.  In the case of a non-packed
AMLong request, the data payload is sent to the remote RARAM MD at the desired
location.

On the remote node, before a thread can poll the AM_EQ, it also must acquire certain
resources.  It must gather 2 send tickets, one TmpMD ticket and a RplSB chunk.
If it cant get these, it must wait until the next polling opportunity to try again.
Once these reqources have been obtained, it can poll the AM_EQ to receive an event
(if any exist).  The event will correspond to either a PUT_END operation on the
ReqRB MD, indicating the receipt of an AMRequest header message, or a PUT_END event
on the RARAM MD, indicating the receipt of the data segment associated with an
AMLong Request.  In the case of an arriving AMShort or AMMedium Request, the handler
id, args and data payload are unpacked from the message and the requested handler is
executed.  However, before doing this, it constructs a token that contains the ID
of the calling node, the ReqSB offset of the sender, the RepSB offset it just obtained,
and credit info into the token.  If the Request handler issues an AMReply operation, the
AMReply notes this by setting a bit in the token.  When the handler completes, the 
calling thread checks for this bit.  If it was not set, then the handler did not issue
a reply so the thread issues an AMReply Short with a no-op handler.  The reply message
servers to (1) return send credits and (2) cause the ReqSB chunk to be deallocated.
If the handler issues an AMLong Reply, that cannot be packed into the RplSB chunk, then
two messages are sent: the header is sent back to the originating chunk of the ReqSB
and the data payload is sent to the specified offset of the RAR.

When the AMReply header arrives on the originating node, an event is generated on
the SAFE_EQ for the ReqSB MD.  In response to this event, the receiving thread
unpacks the data from the match-bits, hdr_data and header payload and executes the
requested Reply handler.  Once complete, the ReqSB chunk is deallocated.  Also,
the Reply message will contain credit update values, allowing future AMRequests to
be sent to that target.

There is an additional complication in the case of non-packed AMLong Request and 
Reply operations.  Since there are two messages, the Request/Reply handler cannot
be executed until both messages have arrived.  To resolve this problem, the sender
computes a unique "Long ID" or LID and includes it in both the header and data messages
(as part of the match-bits or hdr_data).  The receiving thread checks a local cache
to see if a message with that LID from that source node has arrived yet.  If not, it
adds an object to the LID cache with the metadata contained in the message and returns
a NULL object.  If, when it checks the LID cache, it finds an object with that LID 
and source node ID, it removes the object from the cache and returns it.  The call
that get a non-null object back from the LID cache means that both header and payload
have arrived and the AMLong Request handler function can then be executed.

Refresh of ReqRB buffers
------------------------
At startup, a number of ReqRB buffers are allocated and chained, one after the other,
from the same Portals Table Entry with the same match-bits and ignore-bits (with the
CB at the end of the list).  As AM Requests arrive, they are placed, in order,
in the first ReqRB buffer.  The ReqRBs are configured so that if the remaining
available space in the buffer is less than one ReqSB chunk, then the next AMRequest
will be placed in the next ReqSB in the chain.  As soon as all the AM Requests
contained in the buffer have been processed, the buffer is unlinked from the 
match-list and re-linked as an empty buffer at the end of the list (just before CB).

Credit Management for GASNet Active Messages
--------------------------------------------
The only messages that require flow-control in the GASNet Portals-conduit are
AMRequest messages.  Regardless of how large the ReqRB buffers are, if all
nodes issue AMRequests to a single node (say X), quickly enough, then the
ReqRB buffers on X will be exhausted before X can reply and relink the buffers.
In this case, the incoming messages will fall into the CB MD.  In this case, the
message body is dropped and an event is generated on the CB.  When X processes this
event, it is a fatal error and the application is killed.
In order to prevent this, and in an attempt to limit the size of the ReqRB space
this conduit implements a credit-based flow control algorithm that dynamically
attempts to migrate credits to nodes that are frequent communication partners.
See the section at the end of this document for a description of the algorithm.

Notes on Bootstrapping at Startup
---------------------------------
Like every other GASNet communication operation in the portals-conduit, gasnet barriers
are implemented using Portals Put (or Get) operations.  However, we cannot issue
a Portals Put operation to a remote node until we are sure that node has allocated
and configured all of its Portals resources (MDs, EQs, etc).  So, at startup we need
an external mechanism to act as a barrier such that after the barrier, we can be sure
that all nodes have thier resources initialized.  This mechanism will be system dependent.
On the Cray XT series, we use the "cnos_barrier()" function.  However, before using
this function, each node must call "cnos_barrier_init()" and "cnos_register_ptlid()"
In addition, we get the total number of nodes in the job and our rank in the job
from the "cnos_get_size()" and "cnos_get_rank()" runtime functions.  We also
get a mapping of job ranks to portals_id addresses using the "cnos_get_nidpid_map()"
function.  See gasnetc_init_portals_network() in gasnet_portals.c for details.

Notes on the Job Shutdown Process
---------------------------------
Job shutdown is a complicated affair.  For best behaviour, all applications should
have a barrier before calling gasnet_exit to terminate the job.  However, any node
can call gasnet_exit at any time (after gasnet_init) and the result should be to 
cleanly terminate all the nodes in the job.  So, the first thread of a gasnet node
to enter gasnet_exit will send shutdown messages to all the other nodes in the job
(using the SYS queue) and wait, for a limited time, for them to reply that they 
too are shutting down.  If they all reply, the shutdown is clean.
If not all reply, and a reasonable time limit passes, the thread will reset the 
signal handler for SIGINT to its default value and raise a SIGINT signal.
This causes the process to terminate and, as a side effect on the Cray XT, causes the
job launcher to kill all the other processes in the job.

Note that at startup, GASNet does redefine the handler for SIGINT.  Upon receipt of
SIGINT, the node calls gasnet_exit.  Unfortunately, this has the effect of running
the shutdown process, including sending Portals messages to other nodes, in a signal
handler context.  If the signal arrived when the thread was in a non-reentrant 
Portals call, or holding a lock required to send Portals messages, the result could
be deadlock (on that node) or the results of Portals operations could be undefined.
Recall that on the Cray XT, all communication and I/O are performed using Portals
operations.  In practice, this has not been an issue.  However, a better way to deal
with this situation would be to set a global "shutdown" flag that is checked frequently
(such as each time we poll).  If a thread sees that the flag is set, it should call
gasnet_exit() in this safe context.  

=======================================================================================
Scheme for credit balancing and re-distribution

In Portals Conduit:
- 1 credit is equivalent to 256 bytes of ReqRB space + one event structure (128 bytes) = 384 bytes.

- Initially, total number of credits is computed based on size of ReqRB space:
  credit_memsize = (num_ReqRB_buffers-1)*(chunks_per_buffer-1)*sizeof_chunk
  total credits = credit_memsize/256
  Must allocate AM_EQ with this number of event structures.  Once allocated, cannot increase
  size of EQ dynamically, at least not easily.
  NOTE: we can only allocate credits for num_ReqRB_buffers-1, since we cannot recycle a buffer 
  until all AM Requests contained in the buffer have been processed.  Credits are returned to
  the requester with the AM Reply message, so the sender of earlier messages may already have
  received the credits for the space it used in this buffer but the buffer has yet to be recycled.
  Furthermore, we cant count the last chunk in each buffer, since our mechanism for recycling the
  chunks insures that it is unlinked when the space remaining in the buffer is less than one chunk
  in size.
  The values of num_ReqRB_buffers and chunks_per_buffer will have to be adjusted at job
  bootstrap time depending on the value of gasneti_nodes and a reasonable value for
  the number of credits per node, etc.  These will have to be adjustable by runtime
  environment vars.  See table at end of this note for possible default values.

- A percentage of the total credits is banked for re-distribution.  The remainder are 
  distributed to the Nodes-1 other gasnet_nodes.
  (Note: no initial communication necessary, all nodes compute the same values).

- In order for Node X to send an AM Request to Node Y, it must have 
  the required number of credits before it can send.  If not, it must
  poll the network until the credits are available:
  * An AMShort is 1 credit
  * An AMMedium is 1-4 credits, depending on the payload size.
  * An AMLong is 2 credits: 1 for header (small) and 1 for data Put which accounts
    for the event consumed by remote RARAM PUT_END.
  * An AMLongPacked is 1-4 credits, depending on the payload size.

- AMReply messages do not consume credits since:
  * Short, Medium and Long headers are sent back to the ReqSB chunk the Request
    was issued from.  
  * Number of AM's in-flight is controlled => AMLong Data PUT_END event in RARSRC
    is already accounted for in size of SAFE_EQ.

- AMReply messages carry a credit_update value back to the requester.  This is generally
  the number of credits the requester used to send the AMRequest.  It may be more, see below.

In UPC, main use of AMs is:
  * global memory allocation, which only requires one AM per node even if all going to same target.
  * Lock management.  Again, 1 AM should do this.
  * collectives: usually tree based so log(nodes) communicating partners which is small (< 15 nodes).
    Also, only a few AMs per collective (usually).
  * VIS: Point-to-point, but packed message bodies can be large.
    Example: - 32x32x32 grid of doubles per physical variable
             - 5 variables per cell
             - Ghost zones are 4 cells wide.
    There are 6 faces, assume only face values need to be communicated, no edge or corner values.
    For example, the left face boundary region is 4x32x32 doubles for each variable.
    This is 32KB/variable.  AMMedium messages are limited to less than 1KB so this requires
    about 32 AMMediums per variable per face.  Each AMMedium is 4 credits and there are 5
    variables per cell so this requires 32*4*5 = 640 credits for the owner of the patch to 
    send the 160 AMMediums to us.  Note that if the remote node is attentive to the network
    and can process the AM Requests quickly, it can return credits in the corresponding AM Replys
    and the actual number of credits needed to send these messages without stalling may be lower.
    Ignoring this possibility and granting 640 credits is equivalent to 240KB of buffer/event 
    space on my node.  The 6 senders require a total of 1.4MB of space on my node, which is 
    very reasonable.  However if we have to allocate this to ALL the nodes in the job, a 
    1024 node job would need 240MB of buffer space and a 10,000 node job would require 2.4GB.  
    Our goal would be to grant fewer credits to each node to constrain the local buffer footprint
    and dynamically redistribute the credits so that frequent communication partners have the
    credits they need.

Goals of the algorithm:
  1) The algorithm should be demand driven.  If there is no use for dynamic re-distribution
     no credits should be moved.  
  2) The overhead should be as low as possible.
  3) If the communication pattern is static, the credits should migrate to the nodes 
     needing them and not move afterwards.
  4) We should have parameters that control the rate at which credits are migrated.  Migrating
     too quickly could cause credits to oscillate around the system.  Dampening factors 
     can help prevent this at the expense of building up the amount of credits over time.
     The goal should be to reach a steady state in a 'reasonable' amount of time.

Limits:
  * Each remote node MUST be allocated a MINIMUM of 4 credits.  This is required for send a 
    full-sized AM Medium.  
    NOTE: For a 10,000 node job, this would require 15MB of ReqRB and event queue space on
          each node.  Allocating 6 credits per node would require 22 MB of space, but would 
          allow each node to redistribute a maximum of 20,000 credits, which would be sufficient 
          to grant 200 additional credits to 100 frequent communicating partners.  
          If tree based communication algorithms are used for collectives, ln(10000) is less 
          than 17 frequent partners.

  * There is no point to giving a remote node more than 2^16 = 64*1024 credits.  This would
    be equivalent to 24MB of ReqRB + event space for that node alone.  Thus, to keep metadata
    overhead small, per-node credit variables can be of type uint16_t.

Basic algorithm overview:
   - When a node X stalls due to lack of credits trying to send an AM to a remote node Y,
     it requests additional credits.  If Y has them banked, it increases its loan to X, if
     the loan is not increased.
   - When a nodes BankedCredits gets low, it will issue CREDIT_REVOKE requests to nodes
     that communicate with it infrequently.  The target nodes determine how many, if any,
     credits to return to the requester.  The returned credits are banked for future use.
   - state variables that record recent send/credit activity decay over time so that decisions
     on how many credits to revoke or request are made with current information, not old information.

Algorithm Details:
 * Each node allocates its ReqRB and AM event queue structures resulting in TotalCredits
   credits.  A portion of these are banked for discretionary use, the rest are loaned, 
   or distributed evenly to all other nodes during job initialization.
   By default, each node computes the the number of credits_per_node and BankedCredits by interpolating
   values from a table, based on the number of GASNet nodes in the job.  From this.
     TotalCredits_est = credits_per_node*(num_nodes-1) + BankedCredits
   Since each credit is equivalent to 256 bytes of ReqRB space, the number of ReqRB buffers
   (each of a fixed, pre-determined size) needed to hold TotalCredits_est worth of ReqRB space
   is allocated.  Since ReqRB buffers are a fixed size, the total number of credits available 
   is greater than TotalCredits_est with the additional going into the BankedCredits pool.

    gasneti_semaphore_t BankedCredits;    /* discretionary use credits */

   The default values of credits_per_node and BankedCredits can be changed at run-time
   via environment variables.

 * We use the prefix 'Loan' for variables that represent a loan of our credits to a remote node.
   We use the prefix 'Send' for variables that represent credits that have been loaned to us
   and that enable us to send AMRequests to the corresponding remote node.

 * Runtime on each node is divided into epochs.  Certain state information is maintained on
   credit use during an epoch, and these values are reduced by a factor, or decayed, at the
   end of each epoch.  We use this to maintain temporal use information that decays to zero
   after a few epochs without use. 

 * Each node maintains a small amount of "connection state" for each other in the job.  The 
   state is maintained in the conn_state[] array, indexed by node number.

    #define GASNETC_MIN_CREDITS 4
    uint16_t  LoanCredits;     /* number of my credits allocated to this remote node */
    uint16_t  SendCredits;     /* number of remote node credits allocated to me */

    uint16_t  SendStalls;      /* number of times we stall on credits trying to send AMs.  Decayed */
    uint16_t  SendInuse;       /* current number of SEND credits in flight to them.                */
    uint16_t  SendMax;         /* max number of SEND credits in flight this epoch          Decayed */
    uint16_t  SendRevoked;     /* number of SEND credits I give back this epoch.           Decayed */

    uint16_t  LoanRequested;   /* number of additional LOAN credits requested by them.     Decayed */
    uint16_t  LoanGiven;       /* number of additional LOAN credits given to them.         Decayed */

    dll_link_t link;           /* 32 bit structure used as link in doubly linked list. */
    uint8_t   flags;           /* used mainly in credit management */

    NOTE:   node[X]:conn_state[Y].LoanCredits == node[Y]:conn_state[X].SendCredits
         except during times when messages containing additional credits are in-flight.

    NOTE:   SendCredits only changes when additional credits are allocated to us.  We can
            reduce the number of vars in the state by eliminating SendInuse and just decrementing
            SendCredits when they are in use.  In this case, we would record SendMin rather
            than SendMax to record how many "extra" credits (above active use) we have.
            Unfortunately, we would not be able to Decay SendMin in the same way as other
            values, since in this case its value would have to increase, and we no longer
            know the total number of credits we have been granted so dont know the max value
            that SendMin could take.  

    NOTE:   SendStalls and LoanRequested could be replaced with flag values, but then would not 
            decay slowly but instantly at end of epoch.

    NOTE:   The link field allows the individual records to be threaded on a list of nodes
            to be scavenged for un-used credits when the bank pool is depleted.  Only nodes
            that have been lent more than the minimum number of credits are on the list.
            For small problems, it will usually be the case that all nodes are always on
            the list.  For Large problems, the list should be significantly smaller than
            the total number of nodes.

   Some flags values are:
   #define GASNETC_SYS_MSG_INFLIGHT           0x02U
   #define GASNETC_CREDIT_REVOKE_ZERO_REPLY   0x04U
   #define GASNETC_REACHED_EPOCH              0x08U

   NOTE: We must update these fields as atomic operations.  The locks would be held for only short
         periods of time so we can possibly use gasneti_spinlock_t, but they might not be fair.
         These cannot be used in interrupt handler code.

   NOTE: conn_state ~ 40 bytes per node or less than 400 KB for 10000 nodes.

 * Several (global) control knobs are set:
   int    gasnetc_revoke_limit = XXX;           /* Max number of credits a node can revoke in an epoch */
   int    gasnetc_lender_limit = XXX;           /* Max number of credits a node can lend another within an epoch */
   int    gasnetc_max_cpn = XXX;                /* The max number of credits that will be lent to a node */
   int    gasnetc_epoch_duration = XXX;         /* The number of AM Requests a node will receive before
                                                 * aging LOAN variables and starting a new epoch. */

 * Other global variables:
   gasneti_mutex_t gasnetc_epoch_lock;          /* control access to epoch update */
   gasneti_mutex_t gasnetc_scavenge_lock;       /* control access to/updates of link fields for scavenge list */
   uint16_t gasnetc_scavenge_list;              /* head of list of nodes to scavenge for credits */
   /* GASNETC_CREDIT_DECAY is used to decay the credit values at the end of an epoch. */
   #define GASNETC_CREDIT_DECAY(var) var = ((var)>>2)

 * Each node Y will end its own epoch after it receives gasnetc_epoch_duration AM Requests.  
   At this point, it will decay the LOAN values (LoanGiven, LoanRequested) of all its 
   conn_state[x] records.
   It also sets the GASNETC_REACHED_EPOCH bit in conn_state[X].flag for each state record.
   The next time Y sends an AMRequest or Reply, or SYS queue Revoke Request message 
   to a node X, it will clear the GASNETC_REACHED_EPOCH bit in conn_state[X].flags
   and include a tag in the message informing the target to decay its SEND variables in
   its conn_state[Y] record.

 * When Node X wants to send an AM Request to node Y, it computes the credits required (ncredit)
   then checks if it has enough credits:

      credits_avail = conn_state[Y].SendCredits - conn_state[Y].SendInuse;
      stalled = false
      if (ncredit < credits_avail) {
         stalled = true;
         nextra = credit - credits_avail;
         conn_state[Y].SendStalls++;
         do { poll_the_network } until(ncredit >= credits_avail);
      }
      conn_state[Y].SendInuse += ncredit;
      conn_state[Y].SendMax = MAX(conn_state[Y].SendMax,conn_state[Y].SendInuse);
      Construct a credit_byte = EXXXNNNN where each character represents a bit:
           E = 1          If sender ended an epoch, otherwise 0.
         XXX = need       The number of credits it was stalled waiting for.
        NNNN = ncredit    The number of credits required to send the message.
      The credit_byte is included in the header of the message.
   
 * When Node Y processes the AMRequest from node X:

      It extracts the credit_byte and examines the fields: EXXXNNNN
         If E == 1 then it decays the SEND vars of conn_state[Y].
         set nextra = XXX,   the number of additional credits the node would like to have.
         set ncredit = NNNN, the number of credits required to send the Request.
          
      It executes the requested GASNet handler, then the AMReplyXXXX code.  The return
      message will also contain a credit_byte: EXXXNNNN, in which:

          if (conn_state[Y].flags & GASNETC_REACHED_EPOCH) {
             return_E = 1;  /* we reached our end of epoch since last message */
             conn_state[Y].flags &= ~GASNETC_REACHED_EPOCH;  /* clear the bit */
          }
          return_credits = ncredit;  /* always return the number of credits the sender used */
          return_nextra = 0;
          if (nextra > 0) {
             conn_state[X].LoanRequested++;
             if ((nextra <= BankedCredits) && (conn_state[X].LoanGiven <= gasnetc_lender_limit)) {
               BankedCredits-= nextra;
               conn_state[X].LoanGiven += nextra;
               return_nextra = nextra;
             }
          }
          Construct credit_byte  EXXXNNNN with:
             E = return_E
             XXX = return_extra
             NNNN = return_credits
          Send AM Reply message.
          if (BankedCredits is low) scavenge_for_credits();

 * When Node X processes AM Reply from node Y:

       It extracts the credit_byte field EXXXNNNN = {E}{nextra}{ncredit}
       if (E == 1) {
           Decay SEND variables in conn_state[Y].
       }
       conn_state[Y].SendInuse   -= ncredit;  /* the original number have been returned */
       conn_state[Y].SendCredits += nextra;   /* and possibly extra credits have been loaned to us */
            
 * When a lender is running low on BankedCredits, it calls scavenge_for_credits(), which
   uses a special out-of-band system queue to issue credit_revoke() messages to nodes that
   have not been sending it AMRequests in the recent past.  
   scavenge_for_credits() will first lock the gasnetc_scavenge_lock to insure no other thread
   can add or remove nodes from the list.
   It walks the gasnetc_scavenge_list looking for nodes to give up or revoke credits.  
   As it walks the list, it will skip by a node on the list if any of the following reasons:
     * node == gasneti_mynode    We dont need credits to send to ourself.
     * trylock to obtain the state lock fails (another thread holding it).
     * An out-of-band system message is in progress.  We can only send one system message
       to a node at a time.
     * LoanCredits <= GASNETC_MIN_CREDITS.  In general, this will not happen, since only
          nodes with more than the minimal number of credits will be put on the list.
	  However, there is a small race condition between when the LoanCredits is
	  decremented and when the node gets removed from the list.
     * a previous revoke request resulted in a zero reply during this epoch.
       No sense trying again.
     * If the node requested additional credits during this epoch, dont ask it to give up any.

   Otherwise, send a SYS_CREDIT_REVOKE message to the node.
   Continue this search until the list wraps or until gasnetc_num_scavenge targets are hit.
   NOTE: The head of the list: gasnetc_scavenge_list will be set to point to where we left
   off insuring that nodes on the list are targeted in a fair manner.

   When Y sends an out-of-band SYS_CREDIT_REVOKE  message to node X, it sets 
     conn_state[X].flags |= GASNETC_SYS_MSG_INFLIGHT;
   and will not send another system message to X until this bit is cleared by the reply message.
   Note: if con_state[X].flags & GASNETC_REACHED_EPOCH then the system message will also send
         a bit indicating that X should decay its SEND variables to Y.

 * When node X receives a SYS_CREDIT_REVOKE messaage from Y on the system queue, it will
   evaluate whether it has SendCredits that can be returned to the lender, but first:

      if the message contained a flag indicating that Y reached an end of epoch, decay
      our SEND variables in conn_state[Y].

      navail   = conn_state[Y].SendCredits - MAX(MIN_CREDITS,conn_state[Y].SendMax);
      nreturn  = MIN(navail, MAX(0,EPOCH_REVOKE_LIMIT - conn_state[Y].SendRevoked));
    
        /* Since SendMax is the max number of credits we used recently, navail is
         * the number we could return.  However, if this is a large number, we should
         * probably limit the amount we return by EPOCH_REVOKE_LIMIT, at least this epoch,
         * since if we have a large number of extra credits, that means we used them
         * in the past and may need them again.  On the other hand, if navail is small,
         * then we may be in the "each node of 10,000 gets 6 credits" situation and
         * all have to the extra credits.  We can possibly have other run-time variables
         * to indicate we are in a low-overall-credit-per-node situation to change
         * this behavior */

      if (conn_state[Y].SendStalls > 0) ncredit = 0;
          since we recently stalled sending to the requester and so are low on credits.
      if (conn_state[Y].SendRevoked >= gasnetc_revoke_limit) ncredit = 0;
          since we have already given up enough SendCredits recently
      conn_state[Y].SendCredits -= ncredit;
      conn_state[Y].SendRevoked += ncredit;
      end_epoch = 0;
      if (conn_state[Y].flags & GASNETC_REACHED_EPOCH) {
          /* Inform Y we have reached an epoch and it should decay send vars */
          conn_state[Y].flags &= ~GASNETC_REACHED_EPOCH;
          end_epoch = 1;
      }
      issue a SYS_CREDIT_RETURN message back to the calling node Y with ncredit credits
      and including the value of end_epoch.

 * When node Y receives a SYS_CREDIT_RETURN message with arguments (ncredit, end_epoch)
   from node X it will:

      conn_state[X].flags &= ~SYS_MSG_INFLIGHT;  
           to indicate the REVOKE message is complete
      if (end_epoch) decay SEND vars in conn_state[X].
      if (ncredit > 0) {
         conn_state[X].LoanCredits -= ncredit;   /* reduce the amount of their loan */
         BankedCredits += ncredit;               /* bank the credits for later distribution */
      } else {
         conn_state[X].flags |= SYS_CREDIT_REVOKE_ZERO_REPLY;
         /* this will prevent us from hitting up node X again this epoch */
      }

 * Epoch:  Used to decay influence of usage variables over time.

      When polling network, if epoch time has expired, call start_epoch();
      Insure only one thread enters start_epoch() per epoch by using lock and 
      detecting checking start_time.
      
#define DECAY(val) val = (uint16_t)((double)(val) * (1.0-gasnetc_decay_rate))
      void start_epoch()
	  lock(gasnetc_epoch_lock);
          if (current_time - gasnetc_epoch_starttime < gasnetc_epoch_duration) {
             /* another thread already did this, just exit */
             unlock(gasnetc_epoch_lock);
             return;
          }
          gasnetc_epoch_starttime = current_time;
          for (node = 0; node < gasneti_nodes; node++) {
             state = &conn_state[node];
             LOCK(state);     /* spinlock */
             DECAY(state->SendStalls);
             DECAY(state->SendRevoked);
             DECAY(state->SendMax);
             DECAY(state->LoanRequested);
             DECAY(state->LoanGiven);
             if (state->flags & SYS_MSG_INFLIGHT) {
                 state->flags = SYS_MSG_INFLIGHT;
             } else {
                 state->flags = 0;
             }
             UNLOCK(state);
          }
          unlock(gasnetc_epoch_lock);
          return;
      }

Additional Notes:
* The state variables LoanRequested and SendStalls have accumulated values but are only
  used to see if, respectively, a loan was requested or a stall occurred.  These could
  be replaced by flag values.  If so, they would only last from when the time they were
  set to the end of the epoch (similar to REVOKE_ZERO_REPLY).  As variables, they will
  be decayed, but not necessarily set to zero at the start of a new epoch.  Could try
  both approaches to see if it matters.


=======================================================================================
Firehose configuration

  The following environment variables control the resources used by the
  "firehose" [ref 1] dynamic registration library.  By default firehose
  will use 1024 regions of maximum length 128KB to manage dynamic pinning
  of local out-of-segment memory.

  + GASNET_USE_FIREHOSE
    This gives a boolean: "0" to disable or "1" to enable the use
    of the firehose dynamic pinning library.
    The GASNet segment is registered (pinned) with portals at
    initialization time, because pinning is required for RDMA.  However,
    GASNet allows for local addresses (source of a PUT or destination
    of a GET) to lie outside of the GASNet segment.  So, to perform RDMA
    GETs and PUTs, portals-conduit must either copy out-of-segment
    transfers though preregistered bounce buffers, or dynamically register
    memory.  By default firehose is used to manage registration of
    out-of-segment memory.  (default is ON).
    Setting this environment variable to "0" (or "no") will disable use
    of firehose, forcing the use of bounce buffers or non-cached dynamic
    memory registration for out-of-segment transfers.  This will result
    in lower peak bandwidth for large PUTs and GETs, with little or no
    affect on small message latency.

  + GASNET_FIREHOSE_M
  + GASNET_FIREHOSE_R
    These parameters are not presently used and should not be set.

  + GASNET_FIREHOSE_MAXVICTIM_M
    This gives the amount of memory to place in the victim (local) pool.
    The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively, with "M" assumed if no suffix is given.
    The default is
    GASNET_FIREHOSE_MAXVICTIM_R * GASNET_FIREHOSE_MAXREGION_SIZE.

  + GASNET_FIREHOSE_MAXREGION_SIZE
    This gives the maximum size of a single dynamically pinned region,
    should be a multiple of the pagesize, and preferably a power of two.
    The suffixes "K", "M" and "G" are interpreted as Kilobytes, Megabytes
    and Gigabytes respectively, with "M" assumed if no suffix is given.
    The current default is 128k.

  + GASNET_FIREHOSE_MAXVICTIM_R
    This gives the number of pinned regions to allocate for the management
    of the victim (local) pool.  Values will be truncated if larger than
    the default of (GASNET_FIREHOSE_MAXVICTIM_M /
    GASNET_FIREHOSE_MAXREGION_SIZE).

  + GASNET_FIREHOSE_VERBOSE
    This gives a boolean: "0" to disable or "1" to enable the output of
    internal information of use to the developers.  You may be asked
    to run with this environment variable set if you report a bug that
    appears related to the firehose algorithm. 

[1]	Bell, Bonachea. "A New DMA Registration Strategy for Pinning-Based
	High Performance Networks", Workshop on Communication Architecture
	for Clusters (CAC'03), 2003.
	Also at  http://upc.lbl.gov/publications/
