GASNet elan-conduit documentation
Dan Bonachea <bonachea@cs.berkeley.edu>
$Revision: 1.6 $

User Information:
-----------------

Recognized environment variables:
---------------------------------

* All the standard GASNet environment variables (see top-level README)

* GASNET_NETWORKDEPTH - depth of libelan put queues to allocate
  Use larger values to prevent stalls in put_nb(i) initiation operations
  when queueing up many non-blocking puts in a short period of time.
  Note that larger values increase memory consumption on the elan NIC, 
  and specifying a value which is too large can lead to elan memory exhaustion.

* GASNET_NBI_THROTTLE - max number of outstanding nbi operations to issue before
  stalling to reap some completions. Defaults to GASNET_NETWORKDEPTH.

* GASNET_BARRIER=ELANFAST (default) - use elan hardware support for barrier.
  Additionally, assume that all threads agree on whether a given barrier is
  named or anonymous, which allows a higher-performance barrier.

* GASNET_BARRIER=ELANSLOW - use libelan hardware collectives for barrier,
  but comply strictly with barrier semantics.
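
The defaulting rule for GASNET_NBI_THROTTLE can be illustrated with a small
sketch. This is not the conduit's code: env_int_or() and
resolve_nbi_throttle() are hypothetical helpers, and builtin_depth stands in
for the conduit's built-in queue depth, whose value this document does not
state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not a GASNet function): parse a positive integer from
 * the named environment variable, returning `fallback` when the variable is
 * unset, empty, malformed, or not positive. */
static long env_int_or(const char *name, long fallback) {
    assert(name != NULL);
    const char *s = getenv(name);
    if (s == NULL || *s == '\0') return fallback;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (*end != '\0' || v <= 0) return fallback;
    return v;
}

/* GASNET_NBI_THROTTLE falls back to the effective GASNET_NETWORKDEPTH.
 * `builtin_depth` is an assumed stand-in for the conduit's own default. */
static long resolve_nbi_throttle(long builtin_depth) {
    long depth = env_int_or("GASNET_NETWORKDEPTH", builtin_depth);
    return env_int_or("GASNET_NBI_THROTTLE", depth);
}
```

With neither variable set, resolve_nbi_throttle(d) simply returns d; setting
only GASNET_NETWORKDEPTH changes both values.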

Optional compile-time settings:
-------------------------------

* All the compile-time settings from extended-ref (see the extended-ref README)

* GASNETC_PREALLOC_AMLONG_BOUNCEBUF(1/0) - enable optimization to preallocate
  bounce buffers for AMLongs that need them (disabling this may cause failures
  under elan out-of-memory conditions)

* GASNETC_USE_SIGNALING_EXIT(1/0) - enable the use of RMS-based signals for
  non-collective terminations via gasnet_exit(). Turning this off may result in
  job hangs and orphaned nodes during non-collective calls to gasnet_exit().

* GASNETE_USE_ELAN_PUTGET(1/0) - setting to zero makes all put/gets use AM
  rather than elanlib put/gets

Known problems:
---------------

* Dual-rail doesn't work on older versions of elan3/libelan due to bugs in
  libelan. If you run into trouble on dual-rail hardware, try disabling
  dual-rail operation.

* On some older versions of libelan/elan3, libelan queues can cause AM messages 
  between GASNet nodes sharing an elan NIC to deadlock - if you see deadlocks,
  try running only a single GASNet process per elan NIC (and use pthreads to 
  get concurrency on SMP nodes).

* See the Berkeley UPC Bugzilla server for details on known bugs.

Future work:
------------

* Fix the above bugs

* Implement GASNet split-phase barriers entirely on the elan NIC thread
  processor

===============================================================================

Design Overview:
----------------

elan-conduit targets the Quadrics libelan software, a low-level interface which
is implemented on top of the hardware-level elan3/4 library, and is intended to
provide portability to new versions of the elan hardware.

The core API is implemented using the elan queue abstraction in elanlib (for
short, long and sufficiently small medium AMs), and the elanlib TPORTS API (a
generalized message send/recv mechanism with optional tag-matching which is used
to implement MPI) is used for larger medium messages. All AM handlers currently
run synchronously within explicit or implicit polling operations.

The extended API is implemented as an extension of the ref-extended API
implementation (similar eop/iop handling). All of the put/get operations use
elanlib put/get operations (which translate to purely one-sided, zero-copy RDMA
operations) in the common case. Some uncommon cases require the use of
bounce-buffers on the initiator node, and a few very exceptional cases are
handled by falling back to AM-based put/gets. When stats/tracing are enabled,
the conduit keeps detailed information about which elanlib mechanism was used
for each put/get operation. Barriers are implemented using the elanlib hardware
barrier (for anonymous barriers), or a hardware broadcast/hardware barrier
combination for named barriers. The elanlib collective operations are all
blocking, so barrier_notify actually blocks during these calls (but the hardware
barrier is so fast that it's probably worthwhile). Future work includes
implementing hardware split-phase collectives on the elan NIC thread processor.

Detailed design of core and extended implementation:
----------------------------------------------------

Basic design of the core implementation:
=======================================
(the canonical version is in gasnet_core_reqrep.c)

All Shorts/All Longs/Mediums <= GASNETC_ELAN_MAX_QUEUEMSG(320):
  sent using an elan queue of length LIBELAN_TPORT_NSLOTS
  Longs use a blocking elan_put before queuing to ensure ordering  
    use a bounce-buffer if > GASNETC_ELAN_SMALLPUTSZ and not elan-mapped
  AMPoll checks for incoming queue entries 
  All mediums are argument-padded to ensure payload alignment on the receiver

Mediums > GASNETC_ELAN_MAX_QUEUEMSG(320):
  sent using a tport message in a pre-allocated buffer
  Keep tport Tx buffers in a FIFO of length LIBELAN_TPORT_NSLOTS - 
    poll for Tx completion starting at oldest Tx buffer whenever we need one
    may spin-poll during a send if all Tx buffers occupied
  Keep a FIFO of posted tport Rx bufs, which are guaranteed to arrive in order
    AMPoll checks the head for completion
  every tport buffer has a dedicated descriptor (gasnetc_bufdesc_t)
    holds ELAN_EVENT for pending Tx/Rx
    pointer to the buffer (gasnetc_buf_t) and possibly a system Rx buffer
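
The Tx buffer FIFO described above can be sketched as a small ring. This is a
simplified illustration, not the code in gasnet_core_reqrep.c: NSLOTS stands in
for LIBELAN_TPORT_NSLOTS (its real value is not given here), tx_fifo_t is a
hypothetical type, and the tx_done_fn callback stands in for testing the
ELAN_EVENT held in each gasnetc_bufdesc_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4  /* assumed stand-in for LIBELAN_TPORT_NSLOTS */

typedef struct {
    int oldest;  /* index of the oldest outstanding Tx buffer */
    int count;   /* number of Tx buffers currently in flight  */
} tx_fifo_t;

/* Stand-in for checking whether the send on a slot has completed. */
typedef bool (*tx_done_fn)(int slot);

/* Simulated completion predicates, used only to exercise the sketch. */
static bool never_done(int slot)  { (void)slot; return false; }
static bool always_done(int slot) { (void)slot; return true;  }

/* Reap completed sends starting at the oldest buffer, then hand out the next
 * free slot. Returns -1 when every buffer is still occupied, which is the
 * case where a real sender would spin-poll and retry. */
static int tx_acquire(tx_fifo_t *f, tx_done_fn done) {
    assert(f != NULL && done != NULL);
    while (f->count > 0 && done(f->oldest)) {
        f->oldest = (f->oldest + 1) % NSLOTS;
        f->count--;
    }
    if (f->count == NSLOTS) return -1;
    int slot = (f->oldest + f->count) % NSLOTS;
    f->count++;
    return slot;
}
```

Because completions are only reaped from the oldest end, a single slow send
holds back reuse of the newer buffers behind it, which keeps the bookkeeping
to two integers.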

AMPoll handles up to GASNETC_MAX_RECVMSGS_PER_POLL messages from 
  either the queue or tport (giving precedence to the queue)

All loopback AM messages are run synchronously inside the request/reply
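
The routing and polling rules above can be sketched as follows. The enum and
function names are hypothetical, msg_bytes is assumed to be the size compared
against GASNETC_ELAN_MAX_QUEUEMSG, and "precedence to the queue" is interpreted
here as checking the queue first on every iteration; the canonical behavior is
whatever gasnet_core_reqrep.c does.

```c
#include <assert.h>
#include <stddef.h>

#define GASNETC_ELAN_MAX_QUEUEMSG 320  /* value quoted in the outline above */

typedef enum { AM_SHORT, AM_MEDIUM, AM_LONG } am_category_t;  /* hypothetical */
typedef enum { VIA_QUEUE, VIA_TPORT } am_transport_t;         /* hypothetical */

/* Only mediums larger than the queue message limit travel over tport;
 * all shorts, all longs, and small mediums use the elan queue. */
static am_transport_t choose_transport(am_category_t cat, size_t msg_bytes) {
    if (cat == AM_MEDIUM && msg_bytes > GASNETC_ELAN_MAX_QUEUEMSG)
        return VIA_TPORT;
    return VIA_QUEUE;
}

/* Simulated poll: handle at most max_msgs pending messages, preferring the
 * queue over tport. The counters stand in for the real receive structures;
 * max_msgs plays the role of GASNETC_MAX_RECVMSGS_PER_POLL. */
static int poll_once(int *queue_pending, int *tport_pending, int max_msgs) {
    assert(queue_pending != NULL && tport_pending != NULL && max_msgs >= 0);
    int handled = 0;
    while (handled < max_msgs) {
        if (*queue_pending > 0)      (*queue_pending)--;
        else if (*tport_pending > 0) (*tport_pending)--;
        else break;
        handled++;
    }
    return handled;
}
```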


Basic design of the extended implementation:
===========================================
(the canonical version is in gasnet_extended.c)

gasnet_handle_t - can be either a (gasnete_op_t *) or an (ELAN_EVENT *),
  as flagged by LSB
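
A minimal sketch of LSB pointer tagging, for readers unfamiliar with the
technique. The types below are stand-ins, and the choice that a set bit means
"ELAN_EVENT" is an assumption made for illustration; this document does not
say which of the two pointer kinds carries the flag.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int placeholder; } op_t;     /* stand-in for gasnete_op_t    */
typedef struct { int placeholder; } event_t;  /* stand-in for ELAN_EVENT      */
typedef uintptr_t handle_t;                   /* stand-in for gasnet_handle_t */

/* Both pointer kinds must be at least 2-byte aligned so the low bit is free. */
static handle_t handle_from_op(op_t *op) {
    assert(((uintptr_t)op & 1u) == 0);
    return (uintptr_t)op;
}

static handle_t handle_from_event(event_t *ev) {
    assert(((uintptr_t)ev & 1u) == 0);
    return (uintptr_t)ev | 1u;  /* assumed convention: LSB set marks an event */
}

static int handle_is_event(handle_t h) { return (int)(h & 1u); }

static op_t *handle_op(handle_t h) {
    assert(!handle_is_event(h));
    return (op_t *)h;
}

static event_t *handle_event(handle_t h) {
    assert(handle_is_event(h));
    return (event_t *)(h & ~(uintptr_t)1);  /* strip the tag bit */
}
```

The tag costs no extra storage and lets a sync operation dispatch on the handle
kind with a single bit test.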

eops - marked with the operation type - 
  always AM-based op or put/get w/ bouncebuf
iops - completion counters for AM-based ops
  linked list of put/get eops that need bounce-bufs
  pgctrl objects for holding put/get ELAN_EVENT's for nbi

if !GASNETE_USE_ELAN_PUTGET,
  all put/gets are done using AM (extended-ref)
  GASNETE_USE_LONG_GETS works just like in extended-ref otherwise...

get_nb:
  if elan-addressable dest
    use a simple elan_get
  else if < GASNETE_MAX_COPYBUFFER_SZ and mem available
    use an elan bouncebuf dest with an eop
  else 
    use AM ref-ext med or long (if larger than 1 long, use get_nbi)

put_nb:
  if bulk and elan-addressable src or < GASNETC_ELAN_SMALLPUTSZ (128)
    use a simple elan_put
  else if < GASNETE_MAX_COPYBUFFER_SZ and mem available
    copy to elan bouncebuf with an eop
  else 
    use AM ref-ext med or long (if larger than 1 long, use put_nbi)
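
The get_nb and put_nb decision trees above can be written as pure functions.
This is a sketch under stated assumptions: xfer_path_t and both function names
are hypothetical, the put condition is read as "(bulk and elan-addressable src)
or small", and max_copybuf is passed in because the value of
GASNETE_MAX_COPYBUFFER_SZ is not given here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GASNETC_ELAN_SMALLPUTSZ 128  /* value quoted in the outline above */

typedef enum { PATH_ELAN_DIRECT, PATH_BOUNCEBUF, PATH_AM } xfer_path_t;

/* get_nb: a direct elan_get needs an elan-addressable destination. */
static xfer_path_t choose_get_path(bool dest_addressable, size_t nbytes,
                                   size_t max_copybuf, bool mem_available) {
    assert(max_copybuf > 0);
    if (dest_addressable) return PATH_ELAN_DIRECT;
    if (nbytes < max_copybuf && mem_available) return PATH_BOUNCEBUF;
    return PATH_AM;  /* AM-based fallback (extended-ref) */
}

/* put_nb: small puts go direct regardless of the source mapping (assumed
 * precedence: the size test is OR-ed with "bulk and addressable"). */
static xfer_path_t choose_put_path(bool bulk, bool src_addressable,
                                   size_t nbytes, size_t max_copybuf,
                                   bool mem_available) {
    assert(max_copybuf > 0);
    if ((bulk && src_addressable) || nbytes < GASNETC_ELAN_SMALLPUTSZ)
        return PATH_ELAN_DIRECT;
    if (nbytes < max_copybuf && mem_available) return PATH_BOUNCEBUF;
    return PATH_AM;  /* AM-based fallback (extended-ref) */
}
```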

get_nbi:
  if elan-addressable dest
    use a simple elan_get and add ELAN_EVENT to pgctrl 
      (spin-poll if more than GASNETE_DEFAULT_NBI_THROTTLE(256) outstanding) 
  else if < GASNETE_MAX_COPYBUFFER_SZ and mem available
    use an elan bouncebuf dest with an eop and add to nbi linked-list
  else 
    use AM ref-ext

put_nbi:
  if bulk and elan-addressable src or < GASNETC_ELAN_SMALLPUTSZ (128)
    use a simple elan_put and add ELAN_EVENT to pgctrl 
      (spin-poll if more than GASNETE_DEFAULT_NBI_THROTTLE(256) outstanding) 
  else if < GASNETE_MAX_COPYBUFFER_SZ and mem available
    copy to an elan bouncebuf src with an eop and add to nbi linked-list
  else 
    use AM ref-ext
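
The nbi throttle for the direct elan path can be sketched as a counter that
stalls the initiator until completions are reaped. pgctrl_t here is a
simplified stand-in for the real pgctrl objects, the reap callback stands in
for polling the stored ELAN_EVENTs, and stalling at "outstanding >= throttle"
is an assumption about the exact boundary.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int outstanding;  /* elan put/gets issued but not yet reaped */
    int throttle;     /* limit, e.g. taken from GASNET_NBI_THROTTLE */
} pgctrl_t;

/* Stand-in for polling events; returns how many completions were reaped. */
typedef int (*reap_fn)(void);

/* Simulated reaper that always finds exactly one completed operation. */
static int reap_one(void) { return 1; }

/* Account for one new nbi operation, spin-polling first if the throttle has
 * been reached. Returns the number of spin iterations, for illustration. */
static int nbi_issue(pgctrl_t *pg, reap_fn reap) {
    assert(pg != NULL && reap != NULL && pg->throttle > 0);
    int spins = 0;
    while (pg->outstanding >= pg->throttle) {
        pg->outstanding -= reap();
        spins++;
    }
    pg->outstanding++;
    return spins;
}
```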

barrier:
  if GASNET_BARRIER != ELANFAST && GASNET_BARRIER != ELANSLOW
    use AM (extended ref)
  else
    register a poll callback function at startup to ensure polling 
     during hardware barrier
    if GASNET_BARRIER==ELANFAST and barrier anonymous
      mismatchers report to all nodes
      hardware elan barrier
    else
      hardware broadcast id from zero
      if node detects mismatch, report to all nodes
      hardware elan barrier
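
The barrier selection logic above reduces to a three-way choice, sketched
below. The enum and function names are hypothetical, exact case-sensitive
matching of the setting is an assumption, and the sketch only selects a
strategy; it does not model mismatch reporting or the poll callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    BARRIER_AM_REF,            /* AM-based barrier from extended-ref       */
    BARRIER_HW_ONLY,           /* hardware elan barrier alone              */
    BARRIER_HW_BCAST_THEN_HW   /* hardware broadcast of id, then barrier   */
} barrier_path_t;

/* ELANFAST is the documented default when GASNET_BARRIER is unset. */
static const char *barrier_setting_or_default(const char *env_value) {
    return env_value != NULL ? env_value : "ELANFAST";
}

static barrier_path_t choose_barrier_path(const char *setting, bool anonymous) {
    assert(setting != NULL);
    bool fast = strcmp(setting, "ELANFAST") == 0;
    bool slow = strcmp(setting, "ELANSLOW") == 0;
    if (!fast && !slow) return BARRIER_AM_REF;
    if (fast && anonymous) return BARRIER_HW_ONLY;
    return BARRIER_HW_BCAST_THEN_HW;
}
```

Note that ELANSLOW always takes the broadcast path, even for anonymous
barriers, since it cannot assume all threads agree on the barrier kind.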

