Things to do in librsync/rproxy

> rproxy seems to issue spurious syslog broadcasts!!  Track down where and why.

> generic test driver for header.c

> why are we dropping the last byte from .debs?

Do we need to SO_LINGER or explicitly close to make sure the client
receives all the data?

> clean dead prototypes out of header files

> selectively encode based on MIME type

 Is anything other than HTML going to be useful?  Perhaps all of text/*?

> perhaps use mapptr for our IO as well

especially for passing data through?

> would nonblocking io improve liveness?

can we ever for example be blocked waiting for input when there is
useful input we could send?

> perhaps children should call _exit rather than exit()?  or vice versa?

> change to a standard HTTP library?

> logging

  > show content length in logs

  > show actual bytes sent/recv

  > write statistics to their own log file

> example startup scripts

Should do chroot if possible.  One each for upstream and downstream,
with default port numbers and directories -- perhaps take these from
autoconf?  (Like, /var/cache/rproxy and so on.)  This would be very
friendly. 

> localized error messages

Look at the Accept-Language header, and if we have it send back a
message in the appropriate language.  Also we need to set the
character set.

> optionally log all data

Obviously this will be expensive, but it will be just right for
tracking down tricky encoding problems.  Basically we just need to
catch the before- and after-image of the cache body and signature, and
the complete request and response both upstream and downstream.  I
reckon we should create a directory somewhere for each request and put
each into a separate file in there.  Something like
/var/log/rproxy-trace/$time.$pid/.

> support HTTP CONNECT 

Apparently this is convenient for some people, because some secure web
sites object to queries coming through different proxies.  Do we care
enough to do this before merging with Squid?

> organization

 > .debs and .rpms

 > Run the regression suite on some other machine

Make sure there are no hidden dependencies on things installed on sanguine.

> copy requests to separate files so that we can replay them

> versioning

Client should say what protocol version it can accept.  

  Accept-Encoding: rsync-0.5

or maybe

  Accept-Encoding: rsync/0.5

if that's allowed.

> headers in Lynx

According to Neale, Lynx seems to think rsync is a Content-Encoding.
Is there something wrong or is Lynx just confused?

> logging

DONE include content-type

 > include result statuscode

 > script to digest logfiles

 > time to do various things

Open socket, finish streaming, etc.

 > write to log file

> clean up librsync code

 > change to zero copy

 > use literal chunks larger than 256 bytes

> profile

From here on in, try to only change the code in response to test case
failures or profiling results; or when rewriting under the constraint
of a test case.

> transparent proxying

rusty knows about this.  The IP stack will redirect a connection into
our local port.  On receiving a connection, we need to check where the
connection was _trying_ to get to, and we use that as the destination
host address.

> don't send signature on empty requests

The Accept-Encoding header should be sufficient in itself; the server
knows the signature of an empty file.

Perhaps this is not worth worrying about; it will be irrelevant when
we switch to server-side signatures.

> error pages

If an error occurs while transferring a page then display a reasonable
error message to the client.  This is already partly done.

 > more detailed DNS messages

 > error on timeout

 > explain HTTP error code

> IO

 > EPIPE / SIGPIPE

What's the right way to deal with these?

If the client drops a connection then we probably want to stop trying
right away.  Similarly if the server times out. 

But it seems like we sometimes get EPIPE when the connection really is
not broken!  Is that true?

 > timeouts

Are the default TCP timeouts too long?  Can we change them with a
sockopt, or must we do our own counter?

 > stream content up

At the moment we load everything into a big memory buffer, which is a
bit ugly.
 
 > ditch stdio

Go straight to fds; but make sure we use large chunks.

grep FILE *.[ch]

 > any tcp / socket options to improve performance?

> server-generated signatures

At end or intermingled?

 > how to store on client

In tdb?

> reset stats after each request

Better yet, put stats somewhere where they're inherently reset.

> interrupted transfers

If a transfer is interrupted, then it _might_ be worth keeping the
partial result.  However, if we already have a complete older copy of
that page, the partial one is not worth keeping.  Is that a reasonable
approach?

Perhaps it's not even good to resume requests?  No, I think it would
be -- after all, we guarantee to send the request to the origin server
and to return the full result to the client.

> documentation

 > Diagrams in documentation to explain operation modes

Cool!  This works in linuxdoc sgml.

 > Posters/slides for ApacheCon

 > split docs into several parts

 > protocol document

 > make automake understand documentation

 > http://www.ozemail.com.au/~bod/help2man.tar.gz

> cache directory

 > hashing

Rather than storing directly as the URL, consider hashing the URL and
storing that way.  

 > don't walk the directory

Too slow!  Now fixed.

 > more subdirectories

Squid uses two levels of directories, each of a customizable size.
This should keep the number of files per directory more manageable.
Also we can avoid having to create directories while processing
requests.

 > trimming

How should we trim the cache?  To start with,

 # find /var/cache/rproxy -atime +7 |xargs rm

would do, but we might need a little script to trim it down to a given
size.

This won't do if there are also records in the database.

Decoders can't assume they're allowed to use the cache file until they
have it open.  Holding it open protects against it being removed from
under them.

 > size limits

If a file's getting too big, then we will want to discard it rather
than let it fill the cache filesystem.  In general we won't know this
until it gets too big.  This means closing the cache file descriptor,
removing the file, and then proceeding without teeing received data
into the cache.

 > including parameters

Can we do this?  Probably not.  This means some sites like c2.com will
always suck.

 > concatenating requests

Wow!  Suppose we keep *several* cached values for each page, up to a
certain limit, and let the signature pick blocks from any of them.

 > set-associativity

Perhaps we should allow multiple signatures for any request -- perhaps
to encode some of the URL parameters.  Would there be any point, or
any reasonable way to implement this?

> processing upstream responses

 > response code

Only cache the result if it was a 200 and actually contained content.

 > no upstream

If there's no upstream proxy then we should rewrite the request to
contain just an absolute path, not a full URL.

 > multiple upstreams?

Probably easier just to merge with Squid.
 
> sendfile

Consider using sendfile rather than stream if it's available.

> submit tridge's autoconf tests to the maintainer

> bugs

 > if the host doesn't resolve, explain that.

> security

 > don't allow connections to well-known ports

e.g. telnet, etc.

 > chroot / setuid

Command-line options to chroot and drop privileges, to make things secure.

> enable-debug

We could have a configure-time option to turn off debugging symbols,
but at this stage of the program's life it's probably simpler all
around just to always build them in.

> test proxy

 > mirroring

What about mirroring through wget a whole tree of files, then doing a
diff across the directory trees to make sure they're the same.

DONE error-handling

If there is a problem in the response, then generate an error page and
send it back to the user.  Also set the status code.

We want something like abort_request that does the right thing.  On
the other hand it's probably OK to assume that we can exit if
something goes horribly wrong -- we're most likely to be either a
child or an inetd handler.

> re-entrancy

Something breaks on requests after the first if rproxy is run with
DONT_FORK defined.  I guess something is being reused without being
reinitialized.  This is no good, and does not bode well for
integrating it into Apache: when it goes in there it will have to be
pretty clean.

> test.py reveals a memory leak somewhere in librsync

Is this fixed now?

> test cases

 > posting a body

 > funny headers

 > encoded upload

> better IO abstraction

We should be able to do zero-copy IO, or we should at least not do any
unnecessary copies while reading in.  The input routine should return
to us a pointer to the data in its own buffer.  This is an
optimization, but a nice one.

> header handling

 > headers over several lines?

Is this allowed?  Yes -- RFC 2616 s4.2 allows continuation lines that
begin with a space or tab.

> measurement

 > increase in request due to signature

> zlib-encode if we can't rsync-encode

Hmm.  Better yet, keep them completely separate so that we can choose
which to use for any given response.

> http 

 > parsing

Interpret LWS properly -- it can be tabs as well as spaces, and can be
more than one character.
 
 > headers

Can be continued over several lines, etc.

 > cache control

In general we can ignore this, because we're not really a cache.

However, this indicates sensitive data that mustn't get into
nonvolatile storage:

  Cache-Control: no-store

 > proxies

Must detect requests that address ourselves as the proxy and fail
them, so that we can't loop (RFC 2616 s5.1).

 > encoding

Must decode ``% HEX HEX'' URIs.

 > message length

Certain responses must have no body: the response to a HEAD, a 304, etc.

See rfc2616 s4.4: fall back through transfer-encoding, etc.

 > HTTP CONNECT

Do we need to support this?  For the time being we just fail cleanly.

 > chunked encoding

It seems like this is the preferred way for HTTP/1.1 clients to do
other encodings, because we otherwise have to close the connection to
end the encoding.  Hmm.

 > Range

What do we do with these?  The best option might be to encode them but
not update the cache, because that would tend to shorten the cache
file.  Will that work?


Local variables:
mode: outline
outline-regexp: " *>"
End:
