[gollem] Fwd: [PEAR-DEV] File_Repository Class Proposal

Jan Schneider jan@horde.org
Thu, 19 Sep 2002 20:33:44 +0200


Perhaps something for the VFS_File driver?

----- Forwarded message from mikemc-php@contactdesigns.com -----
    Date: Thu, 19 Sep 2002 11:15:49 -0700
    From: Mike McCallister <mikemc-php@contactdesigns.com>
Reply-To: Mike McCallister <mikemc-php@contactdesigns.com>
 Subject: [PEAR-DEV] File_Repository Class Proposal
      To: pear-dev@lists.php.net

Greetings,

I constantly feel guilty for using PEAR code all the time without having
contributed more than bugfixes.  So here is a File_Repository class you
guys can have if anyone is interested (if not, you can't blame me for
not trying ;).  I'd have to Pearify it of course, add PEAR error
handling and clean it up a little (although it is reasonably clean).
Currently, it does not autogrow the repository: the repository has to be
initialized first.  Anyway, let me know if there is any interest.  Here
are the docs from the top of the class file (they should give you an idea
of what it does):

* This class is designed to deal with a large store of files that need to be
* accessed quickly.  The majority of modern filesystems do not deal well
* with accessing files in a directory where there are many files (say
* 5000+).  Why?  Well, basically, when the OS wants to get a file handle, it
* asks the directory which inode that file lives on.  When a directory has
* inode information on MANY files, it can take longer (sometimes much
* longer) to find the inode information for a single file.  In reality
* there is a bit more to it than that, but that is the basic idea.  Some
* filesystems (e.g. ReiserFS, http://www.reiserfs.org/) don't have this
* problem, as they use more advanced algorithms for looking up inode data
* (usually some form of balanced tree).  So if you are running one of these
* cool filesystems, this class will do you NO GOOD WHATSOEVER as far as
* speed is concerned.  This class is used to make file access quick on a
* filesystem regardless of what kind of filesystem it is.  How does it do
* this?  Quite simply, it creates a directory hierarchy (usually with a
* depth of 2).  Each level has 62 directories in it (A-Za-z0-9).  Therefore
* a 2 level hierarchy has 62^2 or 3844 leaf directories.  Since file access
* is still reasonably fast on directories with fewer than, say, 500 files, a
* two level directory hierarchy (aka repository) can store 500*3844 or
* 1,922,000 files and still have speedy file access.  I don't recommend a
* three level repository EVER (62^3 or 238,328 directories) - if you have
* this many files, you NEED to change your filesystem.  Of course, you can
* still use this class on the nifty filesystems as a means to organize the
* files - this can be a good thing, since it can be very annoying to "ls"
* in a directory only to have 2 million files returned ;)
*
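
The capacity figures above can be checked with a few lines of PHP. This is only a sketch of the arithmetic: `repositoryCapacity()` is a hypothetical helper, not part of the proposed class, and the 500 files per directory limit is the rule of thumb from the docs, not a measured value.

```php
<?php
// Hypothetical helper (not part of the proposed class): number of leaf
// directories and the file capacity for a hierarchy of the given depth.
function repositoryCapacity(int $depth, int $filesPerDir = 500, int $fanout = 62): array
{
    // Each level multiplies the directory count by the fanout (A-Za-z0-9 = 62).
    $leafDirs = $fanout ** $depth;

    return ['leaf_dirs' => $leafDirs, 'max_files' => $leafDirs * $filesPerDir];
}

foreach ([1, 2, 3] as $depth) {
    $c = repositoryCapacity($depth);
    printf("depth %d: %d leaf directories, room for %d files\n",
        $depth, $c['leaf_dirs'], $c['max_files']);
}
```

For depth 2 this gives 3844 leaf directories and 1,922,000 files; for depth 3 it gives 238,328 leaf directories.
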
* So what does this class do for you?  Well, first it can create the
* repository for you, which is good because creating 3844 directories by
* hand could really suck.  Next, it gives you three important methods for
* accessing/updating files in the repository: open(), store(), and
* retrieve().  These methods abstract away the fact that there is a
* directory hierarchy.  In other words, by using them, it is as if you were
* manipulating files in a single directory.
*
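
Creating the hierarchy could look roughly like the sketch below. The function name `createRepository()`, its parameters and the 0755 mode are assumptions made for illustration; the actual class may initialize the repository differently.

```php
<?php
// Hypothetical sketch: create every leaf directory of a repository.
// $alphabet defaults to A-Za-z0-9 as described in the docs; it is a
// parameter here only so that small test repositories can be built.
function createRepository(string $root, int $depth = 2, ?string $alphabet = null): int
{
    $alphabet = $alphabet ?? implode('', array_merge(range('A', 'Z'), range('a', 'z'), range('0', '9')));
    $chars = str_split($alphabet);

    // Build the list of leaf paths level by level.
    $paths = [$root];
    for ($level = 0; $level < $depth; $level++) {
        $next = [];
        foreach ($paths as $path) {
            foreach ($chars as $ch) {
                $next[] = $path . DIRECTORY_SEPARATOR . $ch;
            }
        }
        $paths = $next;
    }

    // mkdir() with the recursive flag also creates the intermediate levels.
    $created = 0;
    foreach ($paths as $path) {
        if (!is_dir($path) && mkdir($path, 0755, true)) {
            $created++;
        }
    }

    return $created; // number of leaf directories newly created
}
```

One caveat with this layout: on a case-insensitive filesystem "A" and "a" name the same directory, so fewer than 62 distinct directories exist per level there.
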
* This class was written specifically for two primary uses (although others
* exist wherever you have the need to store A LOT of files): a more robust
* session file storage and retrieval system for sites with medium to high
* traffic, AND storing files associated with database records.  WHY store
* files associated with database records when you can just store them as a
* BLOB type with that record?  First, the filesystem is a much more
* convenient way to store files (i.e. you don't need SQL to access a file).
* Second, on servers that are running many databases that get accessed
* frequently (e.g. our web servers), returning BLOBs can wipe out your
* query/index memory cache, negatively affecting other databases on the
* same server.
*
* It is important to understand how we store/organize files in the
* repository.  It has one key strength and one key weakness (with a
* workaround).  By default, it stores a file under directories named after
* its leading characters, one character per level, up to the max levels
* (2).  So a file named "test.txt" would be stored in /root/t/e.  This is a
* good thing because it is easy for developers to track down which
* directories contain which files, since it makes intuitive sense.  This
* way of organizing files also has a weakness - storage location is ONLY
* based on the first N characters.  So if each file always started with
* "prefix", all files would end up in /root/p/r and there would be no
* advantage to using this class.  There are two ways around this problem.
* The first is to set auto_md5 to TRUE.  This will prefix each filename
* with a 32 character MD5 digest, for example:
* dd18bf3a8e0a2a3e53e2661c7fb53534_test.txt.  The digest is sufficiently
* random that you will get an even spread over the repository, and each
* digest is effectively unique to the filename and will always be the same
* for the same filename.  While this will give you the best spread, it is
* not convenient for looking files up, since you have to compute the digest
* to figure out what the first two characters are.  The other way is to set
* prefix_seed to something other than 0.  In the case of all files starting
* with "prefix", a file called "prefix_test.txt" would be stored as if it
* were test.txt by setting prefix_seed to 7.  Of course, this is really
* only useful when all files have a common prefix.  The MD5 method is the
* best way to store the files, as it is not affected by common names,
* prefixes or patterns in filenames.
*
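
The three placement rules described above (leading characters, auto_md5, prefix_seed) can be sketched as a single function. Everything here is an assumption made for illustration: the name `repositoryPath()`, the parameter names, the choice to keep the original filename when prefix_seed is used, and the decision to reject names whose leading characters fall outside A-Za-z0-9 (the docs do not say how the class handles those).

```php
<?php
// Hypothetical sketch of the placement rules; not the actual class code.
function repositoryPath(
    string $root,
    string $filename,
    int $levels = 2,
    bool $autoMd5 = false,
    int $prefixSeed = 0
): string {
    // With auto_md5 the stored name gets the 32 character digest as a prefix.
    $stored = $autoMd5 ? md5($filename) . '_' . $filename : $filename;

    // The directory is chosen from the digest, or from the name with the
    // first $prefixSeed characters skipped.
    $key  = $autoMd5 ? $stored : (string) substr($filename, $prefixSeed);
    $lead = (string) substr($key, 0, $levels);

    // Assumption: only A-Za-z0-9 directories exist, so reject anything else.
    if (strlen($lead) < $levels || !preg_match('/^[A-Za-z0-9]+$/', $lead)) {
        throw new InvalidArgumentException("Cannot place '$filename' in the repository");
    }

    return $root . '/' . implode('/', str_split($lead)) . '/' . $stored;
}

echo repositoryPath('/root', 'test.txt'), "\n";                   // /root/t/e/test.txt
echo repositoryPath('/root', 'prefix_test.txt', 2, false, 7), "\n"; // /root/t/e/prefix_test.txt
```

Worth keeping in mind: a hexadecimal MD5 digest only contains the characters 0-9 and a-f, so with two levels the digest prefix can reach at most 16*16 = 256 of the 3844 directories. The spread across those 256 is even, but the rest of the hierarchy stays empty.
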
* When you specify root during object initialization, it MUST NOT HAVE A
* TRAILING dir_delim and it MUST BE ABSOLUTE.  If either requirement is not
* met, this class will not function correctly.
*
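
A caller could check the root requirement before constructing the object. The sketch below assumes POSIX-style paths with "/" as dir_delim; `isValidRoot()` is a hypothetical helper and not a method of the proposed class.

```php
<?php
// Hypothetical check: root must be absolute and must not end in dir_delim.
// Assumes POSIX-style paths; Windows drive letters are not handled.
function isValidRoot(string $root, string $dirDelim = '/'): bool
{
    if ($root === '') {
        return false;
    }
    $isAbsolute  = substr($root, 0, strlen($dirDelim)) === $dirDelim;
    $hasTrailing = substr($root, -strlen($dirDelim)) === $dirDelim;

    return $isAbsolute && !$hasTrailing;
}

var_dump(isValidRoot('/path/to/dir'));  // bool(true)
var_dump(isValidRoot('/path/to/dir/')); // bool(false)
```
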
* Usage:
*
* $rep = new CDS_File_Repository(array('root' => '/path/to/dir'));
* $data = $rep->retrieve(array('filename' => 'test.txt'));

----- End of forwarded message -----


Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft