irfpy.util.filepairv

A caching system for data files, with small effort.

Code author: Yoshifumi Futaana

This module provides an easy extension of the caching system for your reading function.

Example use cases of this functionality are:

  • You have a huge data file, which takes a lot of time to load.

  • You have a data file, which should be post-processed after reading. The post processing takes time.

The caching system has only a single interface, filepair(). It works as follows.

  1. For the first call, the “master” file is read (and post-processed). The result will be saved to a “cache” file.

  2. From the second call on, the “cache” file is read and its contents are returned. Exceptions are:

    • The master file is newer than the cache file

    • The cache file is expired (you can set the expiration time at the first call)

    • The cache file has a lower version than the given version (you can set the cache file version at the first call)

    • The refresh option is explicitly selected by the user.

All this processing is done behind the scenes automatically, so the user does not need to care about the caching.

Quick start

1. Assume you have a function read_file() that may take a long time, and a post-processing function process() that may also take time. What you should do is make a wrapper function that takes only the filename and returns an object.

For example,

def read_file_and_process(filename):   # Argument is filename
    data = read_file(filename)         # Your original function. It may take long.
    data2 = process(data)              # Your post processing. It may take long.
    return data2                       # Return is the object to be used.
2. Then you just call filepair().

data = irfpy.util.filepairv.filepair('master_file.dat',        # The master file name
                                     'cache_file.dat',         # The cache file name
                                     converter=read_file_and_process)  # Converter

As this is the first call, the data is read from ‘master_file.dat’ with the given function read_file_and_process. Then, the obtained data is pickled to ‘cache_file.dat’.

It is equivalent to

data = read_file_and_process('master_file.dat')
pickle.dump(data, open('cache_file.dat', 'wb'))

3. The second call with the same syntax reads from the cache file (equivalent to data = pickle.load(open('cache_file.dat', 'rb'))).

data = irfpy.util.filepairv.filepair('master_file.dat',        # The master file name
                                     'cache_file.dat',         # The cache file name
                                     converter=read_file_and_process)  # Converter

Detailed description

The filepair() function takes two file names, a version, and a function that controls how the “master” file is read and post-processed.

Let’s prepare the master data file first.

>>> import tempfile
>>> fd, master_filename = tempfile.mkstemp()   # fd is the open file descriptor
>>> fp = open(master_filename, 'w')
>>> b = fp.write('1 3 5\n')   # b is the number of byte written.
>>> b = fp.write('2 7 -2\n')
>>> fp.close()

Then, you may read the data file with numpy’s loadtxt.

>>> import numpy as np
>>> dat = np.loadtxt(master_filename)
>>> print(dat.sum())
16.0

Ok, let’s try to use the filepair module.

>>> pickle_filename = master_filename + 'pickle.gz'
>>> print(os.path.exists(pickle_filename))   # Pickle file not existing.
False
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)   # This case, file is read from master file.
>>> print(dat.sum())
16.0
>>> print(os.path.exists(pickle_filename))
True
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)  # This case, file is read from pickle file.
>>> print(dat.sum())
16.0

Here converter is a function (callable) that takes a file name as an argument and returns a data object (in this case a (2, 3)-shaped np.array). Of course, users can define the converter as they want.
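As an illustration, a custom converter that also post-processes the data could look like the following. (This is a hypothetical example, not part of irfpy; the function name clipped_loadtxt is made up here.)

```python
import os
import tempfile

import numpy as np

def clipped_loadtxt(filename):
    # Hypothetical converter: read a whitespace-separated table and,
    # as a post-processing step, clip negative values to zero.
    data = np.loadtxt(filename)
    return np.clip(data, 0.0, None)

# Demonstrate on a small temporary file like the one above.
fd, name = tempfile.mkstemp()
with os.fdopen(fd, 'w') as fp:
    fp.write('1 3 5\n2 7 -2\n')
print(clipped_loadtxt(name).sum())   # 18.0: the -2 entry is clipped to 0
os.remove(name)
```

Such a callable can then be passed as converter= to filepair(), just like np.loadtxt above.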

Which file is read at runtime depends on the existence of the pickle file, the version, and the time stamps.

  • If the pickle file exists and is readable, its time stamp is newer than the master file’s, its expiration time has not been reached, and the version embedded in it equals the given version, the data is read from the pickle file.

  • Otherwise, data is read from the master file.

  • If the refresh keyword is set to True (default is False), the data is always read from the master file.
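These rules can be summarized in a small sketch. (This only illustrates the decision rules above, not the actual irfpy implementation; in particular, the cached version is passed in directly here, whereas filepair() reads it from the cache file itself.)

```python
import os
import time

def use_cache(master, cache, cached_version, version,
              expire=float('inf'), refresh=False):
    # Return True when the cache file may be used instead of the master.
    if refresh:
        return False                      # user forces a re-read
    if not (os.path.exists(cache) and os.access(cache, os.R_OK)):
        return False                      # no readable cache file
    if os.path.getmtime(cache) <= os.path.getmtime(master):
        return False                      # master is newer than the cache
    if time.time() - os.path.getmtime(cache) > expire:
        return False                      # cache has expired
    return cached_version == version      # versions must match exactly
```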

If the data is read from the master file, the filepair() function tries to write the obtained data into the given pickle_filename.
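Judging from the faked cache file constructed in the advanced example further down, the on-disk cache format appears to be a header dict holding the version, followed by the data object. A read/write pair along those lines might look like this (a sketch under that assumption, not the actual irfpy code):

```python
import gzip
import pickle

def write_cache(obj, pickle_filename, version):
    # Sketch: dump a version header first, then the data object,
    # mirroring the two pickle.dump() calls shown in this document.
    with gzip.open(pickle_filename, 'wb') as fp:
        pickle.dump({'filepair_version': version}, fp)
        pickle.dump(obj, fp)

def read_cache(pickle_filename, version):
    # Read the header; return the data only if the version matches.
    with gzip.open(pickle_filename, 'rb') as fp:
        header = pickle.load(fp)
        if header.get('filepair_version') != version:
            return None
        return pickle.load(fp)
```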

Below is an advanced example. (One can skip reading it.)

To check if the data is really from the pickle file, a faked pickle file is prepared as follows.

>>> fp = gzip.open(pickle_filename, 'wb')
>>> dummy_data = np.array([[1, 3, 5], [8, 2, 9.]])
>>> pickle.dump({'filepair_version': 3}, fp)
>>> pickle.dump(dummy_data, fp)
>>> fp.close()

Now try to read the data. The below result shows that filepair returns the faked pickle file contents.

>>> dat = filepair(master_filename, pickle_filename, version=3, converter=np.loadtxt)
>>> print(dat.sum())   # If dat is from master, 16 is returned. If from pickle, 28 is returned.
28.0

This exercise tells us an important fact: the consistency between the master and pickle files is NOT checked at runtime. This is natural: to check the consistency, one would have to read the master file, which defeats the purpose of caching.

The master file can be forcibly re-read as follows.

>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, refresh=True, version=3)
>>> print(dat.sum())
16.0

The above command re-writes the pickle file, since the data is read from the master file. Thus, the pickle file contents are now back in sync with the master file.

>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)
>>> print(dat.sum())
16.0

Note

The following is just to remove the original data file in the doctest.

>>> if os.path.exists(master_filename): os.remove(master_filename)
>>> if os.path.exists(pickle_filename): os.remove(pickle_filename)
irfpy.util.filepairv.filepair(master_filename, pickle_filename, version=0, converter=None, refresh=False, expire=inf, compresslevel=9)

Read the data either from the master or the cache file, caching the data as needed.

Parameters
  • master_filename – Master file name.

  • pickle_filename – Cache file name. If it ends with ‘.gz’ or ‘.bz2’, gzip or bzip2 compression is used.

  • version – Version number of the cached data file. If the version number in the data file is different from the given version, the data is loaded from the master file and the cache file is recreated.

  • converter – A function to read the file. It must take exactly one string argument (the filename) and return a pickle-able object. If not given, numpy.loadtxt is used.

  • refresh – If True, the data is always read from the master file, and the obtained object is dumped into the pickle file.

  • expire – Seconds after which the pickle file is considered expired. Default is infinity. If the pickle file is older than the specified time, it is regenerated from the master file.

  • compresslevel – Gzip / bzip2 compression level for caching. Only valid if the cache file name (pickle_filename) ends with “.gz” or “.bz2”.
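The extension-based choice of compression for pickle_filename can be sketched as follows. (An illustration of the behaviour described above; the actual irfpy code may differ.)

```python
import bz2
import gzip

def open_cache(pickle_filename, mode, compresslevel=9):
    # Choose the file opener from the cache file's extension:
    # '.gz' -> gzip, '.bz2' -> bzip2, anything else -> plain file.
    if pickle_filename.endswith('.gz'):
        return gzip.open(pickle_filename, mode, compresslevel=compresslevel)
    if pickle_filename.endswith('.bz2'):
        return bz2.open(pickle_filename, mode, compresslevel=compresslevel)
    return open(pickle_filename, mode)
```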