Nolce (Netscape's Off Line Cache Explorer) is a Linux program that lets you browse Netscape Navigator cache files off-line by restoring their original names and adjusting their links.
Introduction
Everyone who uses Netscape Navigator, on any platform, probably knows that it saves every file downloaded from the Internet to the local hard disk, unless the user has disabled this option. Every HTML file, every image, and every downloaded document is normally stored under the directory $HOME/.netscape/cache.
You might like to view those downloaded documents off-line as well, read them at leisure, and possibly save them together with their images. But this isn't possible, because files stored in the cache have their names changed: an original main.html may become cache33BAD64001B0829.html.
Besides, they are stored under the cache directory in subdirectories like 00, 01, ... with no regard for the relative positions of the files. So even if you could guess which cached file corresponds to the document you want, you would see it without any images and with all its links broken. Saving a document from Netscape after receiving it doesn't save the related images and links, so you can see only the text of the document when you are off-line.
You might think that in this situation Netscape could retrieve the missing images from the cache, but it doesn't: before using a cached file, it tries to connect to the original site to check whether the remote file is more recent than the local one. Since this check isn't possible when you're off-line, the local file isn't used.
With the Gold version of Navigator, it's possible to save a document together with its images by opening it in the editor and saving from there. But this isn't a good solution, because it takes too much time and work: you must repeat it for every page you want to save.
Usage and what the program does
The file index.db under the Netscape cache directory contains the information needed to associate cached files with their original names, sizes, creation dates, file types and so on. Netscape creates it when the first documents are cached.
Nolce must not run while Netscape is running, because index.db may be damaged if two programs open it at the same time. To avoid problems, nolce uses and recognizes the same lock file as Netscape, so when one of the two programs is running, the other knows that it can't use the cache. The lock file is a symbolic link called lock, created in the directory $HOME/.netscape.
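The locking idea described above can be sketched in a few lines of Python. This is only an illustration, not nolce's actual code (nolce is written in C): the function names and the link target string `owner` are hypothetical, and what Netscape really stores in its lock link is not stated here. The sketch relies on the fact that creating a symbolic link fails if the name already exists, so testing and taking the lock happen in one step.

```python
import os
import tempfile


def try_lock(netscape_dir, owner="nolce"):
    """Try to take the lock; return True on success, False if it is held."""
    lock_path = os.path.join(netscape_dir, "lock")
    try:
        # symlink() fails if the name already exists, even when the
        # existing link points to nothing, so check and creation are atomic.
        os.symlink(owner, lock_path)  # "owner" is a made-up link target
        return True
    except FileExistsError:
        return False


def release_lock(netscape_dir):
    """Remove the lock link (works on a dangling link too)."""
    os.remove(os.path.join(netscape_dir, "lock"))


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()      # stands in for $HOME/.netscape
    print(try_lock(demo_dir))          # first program gets the lock
    print(try_lock(demo_dir))          # second program is refused
    release_lock(demo_dir)
```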
With that information nolce can copy the files into a new directory structure under dest_dir (default is $HOME/cached) which reflects the directory structure of each file's original site, restoring their real names. For example, if 00/cache33BAD64001B0829.html corresponds to a URL like http://www.rai.it/raiuno/aree.html, the program creates the directory www.rai.it, then under it the directory raiuno, and finally copies cache33BAD64001B0829.html to aree.html inside it.
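The mapping from a URL to a local path in the example above can be sketched as follows. This is a rough Python illustration under stated assumptions, not nolce's real code: the function name `url_to_local_path` is hypothetical, and the sketch leaves out details such as the separate http and ftp subdirectories and the handling of unusual characters.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_local_path(url, dest_dir):
    """Return the path under dest_dir that mirrors the URL's host and path."""
    parts = urlsplit(url)
    # The host name becomes the top directory; the URL path supplies the
    # subdirectories and the restored file name.
    relative = parts.path.lstrip("/")
    return posixpath.join(dest_dir, parts.netloc, relative)


print(url_to_local_path("http://www.rai.it/raiuno/aree.html",
                        "/home/user/cached"))
# prints /home/user/cached/www.rai.it/raiuno/aree.html
```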
Nolce also creates symbolic links for images and other documents. For example, for an image like http://www.rai.it/raiuno/images/backgr.gif, a link called backgr.gif is created in the directory images under raiuno.
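The difference between copied HTML files and linked images can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than nolce's implementation: the function name `materialize` and its `is_html` flag are hypothetical, and the real program decides the file type from the cache index.

```python
import os
import shutil
import tempfile


def materialize(cache_file, target, is_html):
    """Place a cached file at its restored location under dest_dir."""
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if is_html:
        # HTML is copied, so the copy can be edited without touching the cache.
        shutil.copyfile(cache_file, target)
    else:
        # Images and other documents only need a link back to the cache.
        os.symlink(os.path.abspath(cache_file), target)


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    cached = os.path.join(work, "cache0001.gif")   # made-up cache file name
    with open(cached, "wb") as f:
        f.write(b"GIF89a")
    link = os.path.join(work, "www.rai.it", "raiuno", "images", "backgr.gif")
    materialize(cached, link, is_html=False)
    print(os.path.islink(link))
```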
A summary file is created as an HTML file, so once the program finishes you can easily see which HTML documents it retrieved and browse them. When viewing retrieved documents, links shown in italics are links to other cached files, so you can follow them off-line too. Note that some fixed fonts may render italics as bold.
Copied HTML files are slightly modified when necessary; this is discussed in the section How it works. However, it's important to stress that nolce doesn't change the original Netscape cache in any way, so it continues to work normally. Since version 1.5, nolce can also process caches generated by Netscape for Windows, with the option -p.
Let's now talk about how to use nolce. First of all, you can get a short help text by launching it with --help. Some notes on its parameters and options follow.
The n_hours parameter is very useful when you want to process only the files downloaded during the last connection.
dest_dir is the directory under which the directory structures will be created. The program distinguishes between http:// and ftp:// documents, putting the former under a subdirectory http of dest_dir and the latter under ftp.
summary_file is always created in dest_dir, even if you supply an absolute path. If the summary file exists, it is not overwritten; new entries are appended to it.
summary_file contains an entry for every HTML file processed.
summary_file.
-m option, missing images are kept.
sub_string (options -g or -G) is case sensitive.
-p must be used if the cache to be processed was generated by Netscape for Windows. In this case the name of the index file is assumed to be fat.db and file names are all converted to lower case, as DOS file names appear when viewed from Linux.
nolce -smc /cache is the same as nolce -s -m -c /cache or nolce smc/cache.
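The equivalence between grouped and separate options can be checked with a getopt-style parser. The Python sketch below is only an illustration: the option string "smc:" is an assumption (it supposes that -s and -m are plain switches and that -c takes /cache as its argument), and nolce's real option table may differ. The dash-less form shown above is not covered.

```python
import getopt

# Assumed option string: -s and -m are switches, -c takes an argument.
OPTSTRING = "smc:"


def parse(argv):
    """Parse an argument list the way a getopt-based program would."""
    opts, rest = getopt.getopt(argv, OPTSTRING)
    return opts, rest


grouped = parse(["-smc", "/cache"])
separate = parse(["-s", "-m", "-c", "/cache"])
print(grouped == separate)
```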
i.
With previous versions of this program I noticed that Netscape doesn't keep in the cache HTML files whose modification time it couldn't determine, even though the related images are saved. Sometimes the percentage of such files is low, but sometimes it's about 50% of all files, so this can be a serious problem; it can, however, be worked around with a small trick. Netscape does at first save those files and register them in the cache index, but when it exits it checks whether there are HTML files whose modification time it doesn't know, and deletes them. So the only way to keep these files is to kill Netscape abruptly when you finish browsing.
When we close Netscape with Ctrl-C from the shell or, worse, by choosing `Exit' from its menu, the browser has all the time it needs to do the cache cleaning we want to avoid. But if we kill it with the SIGKILL signal, its execution ends immediately, because there is no way to catch or handle that signal.
The command to give is:
    kill -s 9 `pidof netscape`
where `pidof netscape` is a way to obtain the process identifier of Netscape (see also the command ps), or:
    kill -s 9 PID
where PID is the process ID of your Netscape.
When the browser is killed with SIGKILL it can't delete its lock file, so you must then run:
    rm $HOME/.netscape/lock
A simple shell script can automate this procedure. For example, in a single-user environment, create somewhere in your path a file called (for example) nk with this content:
    #!/bin/sh
    # Kill Netscape at once, so it cannot clean the cache on exit.
    kill -s 9 `pidof netscape`
    # Netscape could not remove its lock file, so remove it here.
    rm $HOME/.netscape/lock
then run chmod +x on it and you're done.
Note that if you kill Netscape to keep at-risk documents, nolce must be launched before the next run of Netscape, at the end of which the browser will do the cache cleaning it couldn't do in the previous run.
ii.
You may not find everything you expect in the cache. Documents and images that were not completely downloaded may not be saved. In any case, it's better to press the STOP button before leaving a page that has not finished loading. Some images, typically counters provided at run time by cgi-bin servers, aren't saved at all.
index.db rather than the modification time of the file. This way is faster and better, because when a document already in the cache is revisited, the new date is registered in the index, while the file timestamp isn't changed.
dest_dir and adjusts their links) these files also, even though they won't appear in the summary file. If you don't want this, use the option -f. This option is also useful in conjunction with -g and -G.
If neither the -w nor the -W option is given, pages are displayed in the same window as the summary, but taking up the entire space, including that of the other two frames. With -W the document is shown only in the list frame, which makes it easy to select other domains and other documents. Finally, with -w, another browser window is created for viewing documents. Normally this window is created once and then, unless the user closes it, reused every time a document is selected.
By selecting Lists & domains or Simple List from the status frame, you can return immediately to the index of processed pages. In the first case the default layout (domains + list) is used, while in the second the list area takes all the space below the status frame.
dest_dir, because it must be modified to make its links point to local files, and we want to leave the files in the cache untouched so that Netscape can continue using them.
-p option.
dest_dir are only symbolic links to files in the cache, so to view retrieved pages correctly, the (DOS) partition containing the cache must be mounted under the same directory as when nolce was executed.
msdos rather than vfat, because in the first case access is faster and file names aren't case sensitive.
Installation
This software is available in a package containing both source and binary versions. It can be obtained at ftp://sunsite.unc.edu/pub/Linux/apps/www/plugins, at http://www.aspide.it/freeweb/giustrov/nolce.html and at ftp://194.243.202.167/giustrov.
To use this program, you must have the DB library installed. It is needed to read the records stored in the index.db file. In practice you need libdb.so to run the compiled version, and also the db include files to compile the program. On Linux, with the Slackware and Red Hat distributions, the library should be present by default. For the include files, with Red Hat you must install a package called db-devel or similar. For Slackware, they are in libc.tgz, so they aren't a problem.
To compile, cd to the src subdirectory and run make. Run make install to compile and copy the executable to /usr/bin, the man page to /usr/man/man1 and the documentation to /usr/doc/nolce. If you don't have a compiler installed, or if you want to use the precompiled version, run the install.sh script from the top nolce directory. If the standard destinations don't suit you, modify them in the Makefile or in install.sh.
Compatibility
I have tested the program under Linux only, with Netscape Navigator 3.01 and 4.0b5. It probably works with version 2.0 too, since the present cache format was introduced with that release. It should also work on other Unix systems, if their Netscape indexes its cache in the same way as the Linux version, that is, with a DB hash file named index.db under $HOME/.netscape/cache. If the name is different, it's easy to change the value of CACHE_FILE in the defines section of the source file. As for the language, I use only code conforming to the ANSI C or POSIX standards, so if your system supports them, there should be no problems.
As far as I know, the following circumstances may cause problems or errors when compiling nolce:
make correctly defines the variable CC as your site's compiler name (e.g. cc or gcc), and the variable LEX as your lex program. Every make should ensure this, but if yours doesn't, define them by hand.
-lfl library, and it's provided in the variable LDFLAGS of the Makefile.
LFLAGS variable.
yylex() function, called in the process_html_file of main.c. Input and output files are supplied to yylex through the extern variables yyin and yyout. This probably doesn't conform to the original AT&T lex but, as far as I know, it conforms to the POSIX specification for lex and, above all, it's almost the only way that works with flex.
yytext as a char pointer, while other lex implementations may define it as a char array. If this is your case, you must compile main.c with the -Darray option, which can be done by setting the variable DEFINES of the Makefile.
Some people have encountered problems with nolce which disappeared when they used the precompiled version. If you discover a bug, e.g. an abnormal exit of the program with a Segmentation Fault error, please let me know. Send me an e-mail with a brief description of the circumstances under which the error happened, the command line options and, above all, the core file generated by the program. Shells let you decide whether to obtain a core dump after an abnormal termination of a program. With bash, see the command ulimit. For the core file to be useful to me, it must be generated by a program compiled with debug info: add the option -g3 to CFLAGS in the Makefile. If you have libg installed, also add -lg to LDFLAGS.
How it works
i. INDEX.HTML
Many URLs, e.g. http://home.netscape.com, don't contain an HTML file name. In this situation the server provides a default HTML file, usually index.html, and nolce appends the same name to these URLs. It may happen that an HTML file contains a link to such a URL with the file name given explicitly. If this name is different from index.html, the link doesn't work.
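Appending the default name can be sketched as below. This Python fragment is an illustration only: the function name `add_default_name` is hypothetical, and the rule used to recognise a URL without a file name (an empty path or a path ending in a slash) is an assumption; how nolce actually makes this decision is not described here.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_NAME = "index.html"


def add_default_name(url):
    """Append index.html when the URL path names no file (assumed rule)."""
    parts = urlsplit(url)
    path = parts.path
    if path == "":
        path = "/" + DEFAULT_NAME
    elif path.endswith("/"):
        path = path + DEFAULT_NAME
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(add_default_name("http://home.netscape.com"))
# prints http://home.netscape.com/index.html
```

A URL such as http://www.rai.it/raiuno, where raiuno is a directory but the trailing slash is missing, cannot be recognised by this simple rule.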
ii. LINKS
The main work nolce does is changing links in HTML files so that they point to local files. There are various types of links (imagine you're browsing the document http://www.aaaa.com/bbb/index.html):
HREF="ccc/image.gif". In this case the browser loads the file image.gif from the directory ccc located alongside the current document, that is inside bbb.
HREF="http://www.aaaa.com/ccc/image.gif". In this case Netscape will always try to obtain the document from the net, so nolce transforms the link into something like "../ccc/image.gif".
HREF="/ccc/image.gif". These links must be interpreted as http://www.aaaa.com/ccc/image.gif, regardless of the directory the HTML file is in.
If a link points to a document present in the cache, it is changed to a relative link; otherwise it is turned into an absolute link.
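The link rule just described can be sketched in Python. This is a simplified illustration under stated assumptions, not the lex-based code nolce actually uses: the names `local_path` and `rewrite_link` are hypothetical, the set `cached_urls` stands in for the cache index, and local files are assumed to live under a directory named after the host.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Path of the local copy, relative to dest_dir: host + URL path."""
    parts = urlsplit(url)
    return parts.netloc + parts.path


def rewrite_link(doc_url, href, cached_urls):
    """Return the link to write into the local copy of doc_url."""
    # urljoin resolves relative, absolute and server-relative links alike.
    target = urljoin(doc_url, href)
    if target in cached_urls:
        # Cached target: point to it relative to the document's directory.
        doc_dir = posixpath.dirname(local_path(doc_url))
        return posixpath.relpath(local_path(target), start=doc_dir)
    # Not cached: keep a full URL so the browser fetches it from the net.
    return target


doc = "http://www.aaaa.com/bbb/index.html"
cached = {"http://www.aaaa.com/ccc/image.gif"}
print(rewrite_link(doc, "/ccc/image.gif", cached))
# prints ../ccc/image.gif
```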
iii. LEX
If your lex program is GNU flex, you can pass it the flag -Cf (put it in the variable LFLAGS of the Makefile). This makes the program bigger, but execution is 10-15% faster.
iv. MISCELLANEOUS
In nolce.h there are some defines which can be customized.
<h3>Link</h3>, the italics aren't shown.
`?'. Mainly for this reason, when creating directories, strange characters like `?', `=', `(' and so on are replaced with an underscore.
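The character substitution can be sketched in one regular expression. The Python fragment below is only an illustration: the function name `sanitize_component` is hypothetical, and the exact character set is an assumption, since only `?', `=' and `(' are named above.

```python
import re

# Assumed set of "strange" characters; only ?, = and ( are named in the text.
UNSAFE = re.compile(r"[?=()&]")


def sanitize_component(name):
    """Replace characters unsuitable for a directory name with underscores."""
    return UNSAFE.sub("_", name)


print(sanitize_component("show.cgi?id=3"))
# prints show.cgi_id_3
```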
Contacting the author
For any question, bug report or comment, send e-mail to g.trovato@usa.net
My home page is http://members.tripod.com/~giustrov
The nolce web page is http://www.aspide.it/freeweb/giustrov/nolce.html
LICENCE