DFSA - Direct File-System Access:

The first generation of openMosix brought large performance improvements
to CPU-bound jobs ("number crunchers"), but could not help I/O-bound tasks,
which may need to communicate with their home-node on every system-call
and are therefore better off remaining there.

The second generation of openMosix includes DFSA, whereby the more common
system-calls can (under certain conditions) be performed directly on the
caller's current node, making migration more beneficial, and therefore more
likely, for I/O-oriented (or mixed I/O and CPU) tasks as well.

DFSA operates over suitable, cluster-wide shared file-systems that fulfill
certain requirements.  The only file-system to currently fulfill those
requirements is the openMosix File-System (MFS).

To use DFSA without violating access permissions, the permission scheme
(user and group IDs) must be identical, or at least compatible, throughout
the openMosix cluster.
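
One way to check the permission scheme is to compare the password files
of two nodes.  The following is only a sketch: "compare_passwd" is a
hypothetical helper, and it assumes that copies of /etc/passwd from the two
nodes are available as local files.  Users that exist on only one node are
ignored, on the assumption that they count as "compatible".

```shell
#!/bin/sh
# compare_passwd FILE_A FILE_B
# Report users whose UID or GID differs between two passwd-format files.
# Hypothetical helper, not part of openMosix.
compare_passwd() {
    awk -F: '
        # First file: remember the UID and GID of every user.
        NR == FNR { uid[$1] = $3; gid[$1] = $4; next }
        # Second file: complain about users whose IDs differ.
        ($1 in uid) && (uid[$1] != $3 || gid[$1] != $4) {
            printf "%s: %s:%s vs %s:%s\n", $1, uid[$1], gid[$1], $3, $4
            bad = 1
        }
        # Exit status 1 if any mismatch was found, 0 otherwise.
        END { exit bad }
    ' "$1" "$2"
}
```

For example, "compare_passwd passwd.node1 passwd.node2" (both file names
are placeholders) prints one line per mismatching user and returns a
non-zero status if there is any.  The same idea applies to /etc/group.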

Each partition that is to operate in DFSA mode must be assigned a unique
DFSA index, currently in the range of 1-8, that must be identical on all
the nodes in the openMosix cluster.

To request a particular partition to operate in DFSA mode, mount (or remount)
it with the "-odfsa={n}" argument  (1 <= n <= 8).

You should do the same on all the nodes in the cluster, either at about the
same time or before openMosix is configured.  Failing to assign all DFSA
mount-points on some of the nodes is not fatal, but may result in serious
performance degradation.  Simultaneous use of the same index for different
partitions, however, is likely to cause various faults.
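
Before mounting, the intended assignments can be checked for conflicts.
The sketch below assumes a hypothetical plan format of one line per
assignment, "node index partition"; neither the format nor the
"check_dfsa_plan" helper is part of openMosix.  It reports indexes outside
1-8, an index used for two different partitions, and a partition given two
different indexes.

```shell
#!/bin/sh
# check_dfsa_plan [FILE...]
# Read "node index partition" lines (stdin if no file is given) and
# report assignments that break the rules described above.
check_dfsa_plan() {
    awk '
        # Skip blank lines and comments.
        NF == 0 || $1 ~ /^#/ { next }
        {
            node = $1; idx = $2; part = $3
            if (idx !~ /^[1-8]$/) {
                printf "ERROR: node %s: index %s is not in 1-8\n", node, idx
                bad = 1
                next
            }
            if ((idx in owner) && owner[idx] != part) {
                printf "ERROR: index %s used for both %s and %s\n", idx, owner[idx], part
                bad = 1
            } else if ((part in index_of) && index_of[part] != idx) {
                printf "ERROR: %s has both index %s and index %s\n", part, index_of[part], idx
                bad = 1
            } else {
                owner[idx] = part
                index_of[part] = idx
            }
        }
        END { exit bad }
    ' "$@"
}

# Example: the same index assigned to two different partitions.
check_dfsa_plan <<'EOF' || echo "plan rejected"
node1 1 /mfs
node2 1 /data
EOF
```

The check does not detect nodes that are missing an assignment, since it
has no list of the cluster's nodes to compare against.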

To disassociate a partition from DFSA, run:

mount -o remount {mount-point} -odfsa=0.
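
The assignment and removal commands above can be wrapped in a small
helper that rejects an out-of-range index before anything is mounted.
This is a sketch only: "dfsa_mount_cmd" is a hypothetical name, and the
helper prints the command (in the same form as shown above) instead of
running it, so that the output can be reviewed or sent to every node.

```shell
#!/bin/sh
# dfsa_mount_cmd MOUNT_POINT INDEX
# Print the remount command for a DFSA index; 0 disassociates the
# partition, 1-8 assign an index.  Hypothetical helper.
dfsa_mount_cmd() {
    mount_point=$1
    index=$2
    # Accept exactly one digit in the range 0..8.
    case $index in
        [0-8]) ;;
        *) echo "DFSA index must be 0 (off) or 1-8: $index" >&2
           return 1 ;;
    esac
    printf 'mount -o remount %s -odfsa=%s\n' "$mount_point" "$index"
}

dfsa_mount_cmd /mfs 1    # assign index 1 to /mfs
dfsa_mount_cmd /mfs 0    # disassociate /mfs from DFSA
```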

You may also designate symbolic-links to operate in DFSA mode: this is
equivalent to a declaration that the given links are identical on all
nodes and point to the same partition.  It saves remote processes that use
those link(s) the need to contact their home-node on every use in order to
read those links.  To declare a symbolic-link as identical, type:

echo {symbolic-link} > /proc/hpc/admin/dfsalinks,

where the symbolic-link must be an absolute-pathname, pointing at an existing
file (or directory or another symbolic-link) on an already-mounted partition
that is capable of DFSA (but it is not required to be already associated with
a DFSA index).

To remove a symbolic-link declaration, type:

echo -{symbolic-link} > /proc/hpc/admin/dfsalinks.

If you intend to re-define a declared symbolic link, you must first remove
its declaration, then re-declare it after the change is made.
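
The order of the three steps can be sketched as follows.
"dfsa_relink_cmds" is a hypothetical helper that only prints the commands;
the new target /mfs/3/tmp is an invented example.  Since a declared link
is supposed to be identical on all nodes, the printed commands would
presumably have to be run on every node.

```shell
#!/bin/sh
DFSA_LINKS_FILE=/proc/hpc/admin/dfsalinks

# dfsa_relink_cmds LINK NEW_TARGET
# Print the steps for re-pointing a declared symbolic link.
dfsa_relink_cmds() {
    link=$1
    new_target=$2
    case $link in
        /*) ;;
        *) echo "link must be an absolute path-name: $link" >&2
           return 1 ;;
    esac
    # 1. Remove the existing declaration.
    printf 'echo -%s > %s\n' "$link" "$DFSA_LINKS_FILE"
    # 2. Change the link itself (-n: replace the link, do not follow it).
    printf 'ln -sfn %s %s\n' "$new_target" "$link"
    # 3. Declare the link again.
    printf 'echo %s > %s\n' "$link" "$DFSA_LINKS_FILE"
}

dfsa_relink_cmds /mfstmp /mfs/3/tmp
```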

To see a list of all currently-declared symbolic links, type:

cat /proc/hpc/admin/dfsalinks.

To cancel all symbolic-link declarations, type:

echo - > /proc/hpc/admin/dfsalinks.

The number of declared symbolic-links is currently limited to 8 and their
path-name length is limited to 128 characters.
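
The conditions on a declared link (absolute path-name, existing target,
at most 128 characters, at most 8 declarations) can be checked before
writing to /proc/hpc/admin/dfsalinks.  The sketch below makes two
assumptions that the text does not confirm: that reading the file yields
one declared link per line, and that the 128-character limit can be
measured with the shell's string length.  "can_declare_link" is a
hypothetical name.  Whether the partition is capable of DFSA is not
checked.

```shell
#!/bin/sh
# can_declare_link LINK [LIST_FILE]
# Succeed if LINK looks acceptable for declaration.  LIST_FILE defaults
# to the real dfsalinks file; passing another file allows a dry run.
can_declare_link() {
    link=$1
    list=${2:-/proc/hpc/admin/dfsalinks}
    max_links=8
    max_len=128

    case $link in
        /*) ;;
        *) echo "not an absolute path-name: $link" >&2; return 1 ;;
    esac
    if [ "${#link}" -gt "$max_len" ]; then
        echo "path-name is longer than $max_len characters" >&2; return 1
    fi
    # Must be a symbolic link whose target currently exists.
    if [ ! -L "$link" ] || [ ! -e "$link" ]; then
        echo "not a symbolic link to an existing target: $link" >&2; return 1
    fi
    # Count the links already declared (assumed: one per non-empty line).
    count=0
    if [ -r "$list" ]; then
        count=$(grep -c . "$list" || true)
    fi
    if [ "$count" -ge "$max_links" ]; then
        echo "$count links already declared (limit is $max_links)" >&2; return 1
    fi
    return 0
}
```

A typical use would be
"can_declare_link /mfstmp && echo /mfstmp > /proc/hpc/admin/dfsalinks".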

Requirements for a compliant file-system:
-----------------------------------------
1) All operations on the file-system must be synchronous, in the sense that
   there is at most one buffer/inode cache throughout the cluster.
   (On client-server file-systems, this usually means that the whole cache
   is maintained on the server - however, a sophisticated server may "lend"
   the cache of particular inodes to particular clients at any given time.
   On shared-hardware file-systems, this probably requires either a hardware
   invalidation signal or a new version to be marked on each inode after each
   modification.)

2) The time-stamps on files and between files of the same file-system must be
   consistent and advancing (unless the clock is deliberately set backwards),
   regardless of the node from which modifications are made.

3) The file-system must populate the following two new super-block methods:
   a) "identify":
       Given a "dentry", encapsulate identifying information about it into
       a finite, rather-small structure, in a way that is sufficient to be
       able to re-establish that open file/directory on another node.
   b) "reconstruct":
       Given only a mount-structure ("vfsmnt") and information that was
       provided by "identify", produce a live new "dentry".

   Also, while not enforced by DFSA itself, it is highly recommended to
   populate the following new inode-method, so that the getcwd system-call
   works correctly on a shared file-system regardless of where the call is
   made from:
   c) "checkpath":
       Given a "dentry", ensure that following its path via the "dcache"
       truly reflects its current position on the shared file-system and,
       if not, make the necessary fixes by adjusting the "dentry" within
       the directory cache.  The "dcache" of shared file-systems cannot be
       trusted, since processes running on other nodes can move (or remove)
       a directory at any time.

4) The file-system must ensure that files/directories are not cleared when
   unlinked, for as long as any process in the cluster still holds them open.
   There are several possible techniques to achieve this, but given the
   distributed nature of the file-system, some form of garbage-collection
   is probably also called for.

Which system-calls are supported:
---------------------------------
The following system-calls are normally supported and usually run directly
on the node where the process currently runs, while all other calls, as
well as hard cases, still need to go via the home-node:

	read, readv, write, writev, readahead
	lseek, llseek
	open, creat, close
	dup, dup2, fcntl/fcntl64 (F_DUPFD,F_GETFL,F_SETFL)
	getdents, getdents64, old_readdir
	fsync, fdatasync
	chdir, fchdir, getcwd
	stat, stat64, newstat, lstat, lstat64, newlstat,
	fstat, fstat64, newfstat
	access
	truncate, truncate64, ftruncate, ftruncate64
	chmod, chown, chown16, lchown, lchown16, fchmod, fchown, fchown16
	utime, utimes
	symlink, readlink
	mkdir, rmdir
	link, unlink, rename
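
As a rough aid when reading a system-call trace, the list above can be
turned into a name lookup.  "dfsa_candidate" is a hypothetical helper; it
only compares names, so it cannot tell which "fcntl" command was used
(only F_DUPFD, F_GETFL and F_SETFL qualify), nor whether a given call is
one of the hard cases.

```shell
#!/bin/sh
# dfsa_candidate NAME
# Succeed if NAME appears in the list of system-calls that DFSA can
# normally run directly; everything else goes via the home-node.
dfsa_candidate() {
    case $1 in
        read|readv|write|writev|readahead|lseek|llseek|\
        open|creat|close|dup|dup2|fcntl|fcntl64|\
        getdents|getdents64|old_readdir|fsync|fdatasync|\
        chdir|fchdir|getcwd|\
        stat|stat64|newstat|lstat|lstat64|newlstat|\
        fstat|fstat64|newfstat|access|\
        truncate|truncate64|ftruncate|ftruncate64|\
        chmod|chown|chown16|lchown|lchown16|fchmod|fchown|fchown16|\
        utime|utimes|symlink|readlink|mkdir|rmdir|\
        link|unlink|rename)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

for call in read rename mmap ioctl; do
    if dfsa_candidate "$call"; then
        echo "$call: may run directly"
    else
        echo "$call: via home-node"
    fi
done
```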

Examples of hard cases:
* If not all nodes have the same mounted DFSA partitions, or they do -
  but with different mount-flags.
* If the calling process is being traced.
* If the process has a non-standard root-directory.
* If the calling process has an emulating personality that causes it
  to use an alternate root (but this is currently not relevant for the
  i386 architecture).
* If the calling process shares either its files or current directory
  as a result of the "clone" system-call.
* Operations occurring during re-configuration of DFSA on either the
  home-node or the node where the process runs.
* Operations involving special files (i.e. other than regular files,
  directories or symbolic-links).
* Operations on files that were jointly opened and are still shared with
  other related processes.
* dup2, where the second file-descriptor is an already open non-DFSA file
  (that requires closing).
* chdir/fchdir when the previous directory is non-DFSA.
* link/rename that fail due to an attempt to link across devices.
* open/dup/dup2/fcntl(F_DUPFD) that requires an allowable increase in
  the maximal file-descriptor index (initially 1023!).
* Use of path-names that leave the DFSA partition, as demonstrated by
  the following example:
	"/mfs" is a DFSA file-system
	/mfstmp is a symbolic link to /mfs/2/tmp, and is declared in
		/proc/hpc/admin/dfsalinks.
	/mtmp is a symbolic link to /mfstmp, and is declared in
		/proc/hpc/admin/dfsalinks.
	/mfs2 is a symbolic link to /mfs/2, but is not declared.
	on node #2, /fie is a symbolic link to "/tmp/foo".
  then the following are accepted as simple cases (and identical):
	/mfs/2/tmp/foo
	//mfs//2/tmp/foo
	/./mfs/2/tmp/foo
	/mfstmp/foo
	/mtmp/foo
	/mfs/2/fie
	mfs/2/tmp/foo  (when in the root directory)
	
  but not the following:
	/tmp/../mfs/tmp/foo
		(the kernel is not allowed to assume that each node has an
		 accessible "/tmp" directory!)
	/mfs/2/../../mfs/2/tmp/foo
		(the second ".." steps out of the "/mfs" DFSA partition)
	/mfs2/tmp/foo
		(/mfs2 is not declared, hence no assurance was provided
		that it is identical on all nodes)
	mfstmp/foo (or mtmp/foo) when in the root directory
		(just a difficult case to recognize)

* When the home-node DEPUTY has pending requests for the process (such as
  signals, requests for "ps" information, requests to migrate or consider
  migration, etc.).
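
The path-name example among the hard cases above can be approximated by
a purely lexical check.  This is a sketch of the reasoning, not the
kernel's algorithm: "dfsa_simple_path" and the two variables are
hypothetical, the declared links are hard-coded from the example, and
undeclared symbolic links inside the partition (such as "/fie" on node #2)
are not resolved at all.  Relative path-names are only modelled for the
case where the current directory is "/".

```shell
#!/bin/sh
DFSA_MOUNT_NAME=mfs                    # "/mfs" is the DFSA file-system
# Declared links as name:depth, where depth is how far below /mfs the
# target lies (/mfstmp and /mtmp both lead to /mfs/2/tmp, depth 2).
DFSA_DECLARED_LINKS="mfstmp:2 mtmp:2"

# dfsa_simple_path PATH [CWD]
# Succeed if PATH starts inside the partition and never leaves it.
# The body is a subshell, so IFS changes and "exit" stay local.
dfsa_simple_path() (
    path=$1
    cwd=${2:-/}
    case $path in
        /*) absolute=1 ;;
        *)  absolute=0
            [ "$cwd" = / ] || exit 1 ;;   # other directories: not modelled
    esac

    # Split the path on "/" into the positional parameters.
    set -f
    IFS=/
    set -- $path
    unset IFS
    set +f

    depth=-1                           # -1: partition not entered yet
    for comp in "$@"; do
        case $comp in
            ''|.) continue ;;          # "//" and "/./" change nothing
        esac
        if [ "$depth" -lt 0 ]; then
            if [ "$comp" = "$DFSA_MOUNT_NAME" ]; then
                depth=0
            elif [ "$absolute" -eq 1 ]; then
                # Declared links are only recognized in absolute form.
                for entry in $DFSA_DECLARED_LINKS; do
                    if [ "$comp" = "${entry%%:*}" ]; then
                        depth=${entry##*:}
                    fi
                done
            fi
            [ "$depth" -ge 0 ] || exit 1   # starts outside the partition
        elif [ "$comp" = .. ]; then
            depth=$((depth - 1))
            [ "$depth" -ge 0 ] || exit 1   # ".." stepped out of /mfs
        else
            depth=$((depth + 1))
        fi
    done
    [ "$depth" -ge 0 ]
)

for p in /mfs/2/tmp/foo /mtmp/foo /mfs/2/../../mfs/2/tmp/foo /mfs2/tmp/foo; do
    if dfsa_simple_path "$p"; then
        echo "simple: $p"
    else
        echo "hard:   $p"
    fi
done
```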

Deviations from normal Linux/Unix/POSIX behavior:
-------------------------------------------------
It was impossible to maintain 100% compatibility on DFSA file-systems,
but the deviations are kept to a minimum:

* A process that has been sent a signal may continue to run a few DFSA
  system-calls before it actually receives and handles the signal.
  (In contrast, any POSIX process that receives a signal may possibly
   complete the next system-call, but cannot issue any new ones beyond that.)

* Simultaneous mapping and I/O on the same DFSA file can produce
  unpredictable results, as follows:
  1) Execution (and library and all other file-mappings) is not always
     protected against other process(es) modifying the file: either the
     writing process or the executing/mapping process may fail to receive the
     "ETXTBSY" error.
  2) The "MS_INVALIDATE" flag of "msync" may fail to ensure that previous
     "write"(s) to a mapped DFSA file are discarded.
  3) When a process modifies memory that is mapped as "MAP_SHARED" to a DFSA
     file, but has not yet written it back (using "msync", "munmap", "exec"
     or "exit"), it is possible that another process that reads that file as
     it migrates will first see some of the changes but later (as opposed to
     normal behavior) see the old values (or some of them) again.
