irfpy.util.filepairv

Local cache system for huge data files

Code author: Yoshifumi Futaana

This module provides a local caching system for (huge) data files.

Example use cases of this functionality are:

  • You have a huge data file that takes a long time to load, so you do not want to load it several times.

  • You have a data file that must be post-processed after reading. The post-processing takes time, so the processed result should be cached.

Note

This functionality is similar to the memory functionality in joblib (joblib.Memory). See https://joblib.readthedocs.io/en/latest/generated/joblib.Memory.html

The caching system consists of a single function, irfpy.util.filepairv.filepair().

It works as follows.

  1. On the first call, the “master” file is read and post-processed if needed. The result is cached to the “cache” file.

    • If saving to the cache file fails (e.g., because of wrong permissions), the failure is ignored silently: no exception is raised and no error message is shown.

  2. From the second call and afterward, the “cache” file is read.

  3. Even if the cache file exists, the master file is read and post-processed again in any of the following cases:

    • The master file is newer than the cache file

    • The cache file is expired (you can set the expiration time at the first call)

    • The cache file has a lower version than the given version (you can set the cache file version at the first call)

    • The refresh option is chosen explicitly by the user.

All of this processing is done automatically in the background.
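The steps above can be sketched in a few lines of standard-library Python. This is only a minimal stand-in under stated assumptions, not the actual implementation of filepair(): the name cached_load, the header key 'version', and the exact comparison rules (for example, how equal time stamps are treated) are hypothetical.

```python
import math
import os
import pickle
import time


def cached_load(master, cache, converter, version=0, refresh=False, expire=math.inf):
    """Illustrative stand-in for the caching flow; NOT irfpy's actual code."""
    if not refresh and os.path.exists(cache):
        try:
            cache_mtime = os.path.getmtime(cache)
            # The cache is usable only if the master file is not newer ...
            fresh = cache_mtime >= os.path.getmtime(master)
            # ... and the cache has not yet reached its expiration time.
            alive = (time.time() - cache_mtime) <= expire
            if fresh and alive:
                with open(cache, 'rb') as fp:
                    header = pickle.load(fp)          # hypothetical header layout
                    if header.get('version') == version:
                        return pickle.load(fp)        # step 2: read from the cache
        except (OSError, pickle.PickleError, EOFError, AttributeError):
            pass  # unreadable cache: fall through and read the master file

    data = converter(master)                          # steps 1 and 3: read the master
    try:
        with open(cache, 'wb') as fp:
            pickle.dump({'version': version}, fp)
            pickle.dump(data, fp)
    except OSError:
        pass  # saving failed (e.g., wrong permission): ignored silently
    return data
```

Calling cached_load twice with the same arguments invokes the converter only once. Changing version or passing refresh=True forces the master file to be read again.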

Quick start

Assume you have a function read_file() that takes a long time to read a file. In addition, you have a post-processing function process() that may also take time.

The developer should implement a wrapper function that takes only the file name and returns the contents, as follows.

def read_file_and_process(filename):   # Argument is filename
    data = read_file(filename)         # Your read function.
    data2 = process(data)              # Your post processing.
    return data2                       # Return is the object to be used.

Then the user calls filepair() with read_file_and_process as the converter.

data = irfpy.util.filepairv.filepair('master_file.dat',  # The master file name
                                     'cache_file.dat',   # The cache file name
                                     converter=read_file_and_process)  # Converter

As this is the first call, the data is read from ‘master_file.dat’ with the given function read_file_and_process. Then, the obtained data is pickled to ‘cache_file.dat’.

Conceptually, it is equivalent to

import pickle
data = read_file_and_process('master_file.dat')
with open('cache_file.dat', 'wb') as fp:
    pickle.dump(data, fp)

On the second call with the same arguments, the data is read from the cache file (conceptually equivalent to data = pickle.load(open('cache_file.dat', 'rb'))).

data = irfpy.util.filepairv.filepair('master_file.dat',   # The master file name
                                     'cache_file.dat',    # The cache file name
                                     converter=read_file_and_process)  # Converter

Detailed description

The filepair() function takes two file names, a version, and a function that controls how the “master” file is read and post-processed.

Let’s prepare the master data file first.

>>> import tempfile
>>> lvl, master_filename = tempfile.mkstemp()
>>> fp = open(master_filename, 'w')
>>> b = fp.write('1 3 5\n')   # b is the number of characters written.
>>> b = fp.write('2 7 -2\n')
>>> fp.close()

Then, you may read the data file with numpy’s loadtxt.

>>> import numpy as np
>>> dat = np.loadtxt(master_filename)
>>> print(dat.sum())
16.0

Ok, let’s try to use the filepair module.

>>> import os
>>> from irfpy.util.filepairv import filepair
>>> pickle_filename = master_filename + 'pickle.gz'
>>> print(os.path.exists(pickle_filename))   # The pickle file does not exist yet.
False
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)   # In this case, the data is read from the master file.
>>> print(dat.sum())
16.0
>>> print(os.path.exists(pickle_filename))
True
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)  # In this case, the data is read from the pickle file.
>>> print(dat.sum())
16.0

Here converter is a function (callable) that takes a file name as its argument and returns a data object (in this case, a (2,3)-shaped np.array). Users can of course define the converter as they like.
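Because the converter receives only the file name, any extra reading options have to be bound beforehand. One way to do this is functools.partial, as in the following sketch. The reader read_table, its options, and the file names in the final comment are hypothetical and serve only as illustration.

```python
import functools


def read_table(filename, skip_header=0, delimiter=None):
    """Hypothetical reader: parse a text table into a list of rows of floats."""
    rows = []
    with open(filename) as fp:
        for i, line in enumerate(fp):
            if i < skip_header or not line.strip():
                continue  # skip header lines and blank lines
            rows.append([float(v) for v in line.split(delimiter)])
    return rows


# Bind the extra options so that only the file name remains as an argument.
read_csv_with_header = functools.partial(read_table, skip_header=1, delimiter=',')

# The bound function can then be given as the converter (not executed here):
# data = filepair('master.csv', 'master.csv.pickle.gz', converter=read_csv_with_header)
```

The returned list of lists is picklable, as required for the object returned by a converter.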

Which file is read at runtime is decided by the existence of the pickle file, the version, and the time stamps.

  • If the pickle file exists and is readable, its time stamp is newer than that of the master file, its expiration time has not been reached, and the version embedded in it is the same as the given version, the data is read from the pickle file.

  • Otherwise, data is read from the master file.

  • If the refresh keyword is set to True (default is False), the data is always read from the master file.

If the data is read from the master file, the filepair() function tries to write the obtained data into the given pickle_filename.

Below is an advanced example. (One can skip reading…)

To check if the data is really from the pickle file, a faked pickle file is prepared as follows.

>>> import gzip
>>> import pickle
>>> fp = gzip.open(pickle_filename, 'wb')
>>> dummy_data = np.array([[1, 3, 5], [8, 2, 9.]])
>>> pickle.dump({'filepair_version': 3}, fp)
>>> pickle.dump(dummy_data, fp)
>>> fp.close()

Now try to read the data. The result below shows that filepair returns the contents of the faked pickle file.

>>> dat = filepair(master_filename, pickle_filename, version=3, converter=np.loadtxt)
>>> print(dat.sum())   # If dat is from master, 16 is returned. If from pickle, 28 is returned.
28.0

This exercise tells us an important fact: the consistency between the master and pickle files is NOT checked at runtime. This is natural: to check the consistency, one would have to read the master file, which would defeat the purpose of the cache.

The master file can be forcibly re-read as follows.
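To see which version a cache file claims to carry without loading the data, one could read only the first pickled object. The sketch below assumes the layout used for the faked file above (a header dictionary followed by the data, gzip-compressed). Real cache files written by filepair() may contain more, and the helper name read_cache_header is hypothetical.

```python
import gzip
import pickle


def read_cache_header(pickle_filename):
    """Return the first pickled object of a gzip-compressed cache file.

    Assumes the layout of the faked file: a header dict, then the data.
    """
    with gzip.open(pickle_filename, 'rb') as fp:
        return pickle.load(fp)   # only the header is unpickled; the data stays unread
```

For a file written exactly like the faked file above, this returns {'filepair_version': 3}.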

>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, refresh=True, version=3)
>>> print(dat.sum())
16.0

The above command rewrites the pickle file, since the data is read from the master file. Thus, the contents of the pickle file are consistent with the master file again.

>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)
>>> print(dat.sum())
16.0

Note

The following just removes the temporary data files created in this doctest.

>>> if os.path.exists(master_filename): os.remove(master_filename)
>>> if os.path.exists(pickle_filename): os.remove(pickle_filename)
irfpy.util.filepairv.filepair(master_filename, pickle_filename, version=0, converter=None, refresh=False, expire=inf, compresslevel=9)

Read the data from either the master file or the cache file. Data read from the master file is also cached.

Parameters:
  • master_filename – Master file name.

  • pickle_filename – Cache file name. If it ends with ‘.gz’ or ‘.bz2’, gzip or bzip2 compression is used.

  • version – Version number of the cached data file. If the version number in the data file is different from the given version, the data is loaded from the master file and the cache file is recreated.

  • converter – A function to read the file. It must take exactly one string argument (the file name) and return a picklable object. If not given, ‘numpy.loadtxt’ is used.

  • refresh – If True, the data is always read from the master file, and the obtained object is dumped into the pickle file.

  • expire – Time in seconds after which the pickle file is considered expired. Default is infinity. If the pickle file is older than the specified time, it is regenerated.

  • compresslevel – Gzip / Bzip2 compression level for caching. It only takes effect if the cache file name (pickle_filename) ends with “.gz” or “.bz2”.
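The suffix-based choice of compression described for pickle_filename and compresslevel could be implemented along the following lines. This is a sketch using only the standard library. The helper name open_cache is hypothetical, and how filepair() actually selects the opener internally is an assumption.

```python
import bz2
import gzip


def open_cache(pickle_filename, mode='rb', compresslevel=9):
    """Hypothetical helper: choose an opener from the file name suffix."""
    if pickle_filename.endswith('.gz'):
        return gzip.open(pickle_filename, mode, compresslevel=compresslevel)
    if pickle_filename.endswith('.bz2'):
        return bz2.open(pickle_filename, mode, compresslevel=compresslevel)
    # Any other suffix: plain file, so compresslevel has no effect.
    return open(pickle_filename, mode)
```

Both gzip.open and bz2.open accept compression levels from 1 to 9. A higher level gives smaller cache files at the cost of slower writing.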