r''' Local cache system for huge data files
.. codeauthor:: Yoshifumi Futaana
This module provides the functionality of local caching system
for (huge) data files.
Example use cases of this functionality are:
- You have a huge data file, which takes a lot of time to load.
Thus, you do not want to load the file several times.
- You have a data file, which should be post-processed after reading.
The post processing takes time, so that the results of the
data file shall be cached.
.. note::

    This functionality is similar to the memory caching
    functionality in joblib (``joblib.Memory``).
    See https://joblib.readthedocs.io/en/latest/generated/joblib.Memory.html
The caching system consists of a single function,
:func:`irfpy.util.filepairv.filepair`.
It works as follows.
1. On the first call, the "master" file is read and post-processed if needed.
   The result is cached to the "cache" file.

   - If saving to the cache file fails (e.g., insufficient permissions),
     nothing happens: no exception is raised and no error message is shown.

2. From the second call onward, the "cache" file is read.
3. Even if the cache file exists, the master file is re-read and post-processed when:

   - the master file is newer than the cache file,
   - the cache file has expired (you can set the expiration time at the first call),
   - the cache file has a lower version than the given version
     (you can set the cache file version at the first call), or
   - the ``refresh`` option is explicitly chosen by the user.

All of this processing is done automatically in the background.
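The flow above can be sketched, in simplified form, as follows. This is an illustrative sketch only, not the module's actual implementation; the function name ``cached_read`` is hypothetical, and the real :func:`filepair` additionally supports compressed cache files.

```python
import os
import pickle
import time

def cached_read(master, cache, converter, version=0, refresh=False, expire=float('inf')):
    """Illustrative sketch: read from the cache when it is valid,
    otherwise read the master file and (re)create the cache."""
    if not refresh and os.path.exists(cache):
        cache_mtime = os.stat(cache).st_mtime
        fresh = (os.stat(master).st_mtime <= cache_mtime
                 and time.time() < cache_mtime + expire)
        if fresh:
            try:
                with open(cache, 'rb') as fp:
                    meta = pickle.load(fp)              # {'filepair_version': ...}
                    if meta.get('filepair_version') == version:
                        return pickle.load(fp)          # cached data
            except Exception:
                pass                                    # fall through to the master file
    data = converter(master)                            # the slow read + post-processing
    try:
        with open(cache, 'wb') as fp:
            pickle.dump({'filepair_version': version}, fp)
            pickle.dump(data, fp)
    except OSError:
        pass                                            # cache write failures are silent
    return data
```

Note how a cache write failure is swallowed silently, matching the behavior described above.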
**Quick start**
Assume you have a function ``read_file()`` that takes a long time to run.
In addition, you have a post-processing function ``process()``
that may also take time.
What the developer should do is implement a wrapper function
that takes the filename (only) and returns the contents, as follows.
.. code-block:: python

    def read_file_and_process(filename):   # Argument is the filename
        data = read_file(filename)         # Your read function
        data2 = process(data)              # Your post-processing
        return data2                       # Return the object to be used
Then the user calls the ``read_file_and_process`` function wrapped by :func:`filepair`.
.. code-block:: python

    data = irfpy.util.filepairv.filepair('master_file.dat',   # The master file name
                                         'cache_file.dat',    # The cache file name
                                         converter=read_file_and_process)   # Converter
As this is the first call, the data is read from ``master_file.dat`` with the given
function ``read_file_and_process``.
The obtained data is then pickled to ``cache_file.dat``.
This is equivalent to:
.. code-block:: python

    data = read_file_and_process('master_file.dat')
    pickle.dump(data, open('cache_file.dat', 'wb'))
A second call with the same syntax reads from the cache file
(equivalent to ``data = pickle.load(open('cache_file.dat', 'rb'))``).
.. code-block:: python

    data = irfpy.util.filepairv.filepair('master_file.dat',   # The master file name
                                         'cache_file.dat',    # The cache file name
                                         converter=read_file_and_process)   # Converter
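For reference, the cache file itself simply holds two pickled objects: a small metadata dictionary carrying the cache version, followed by the data. A minimal sketch of that layout, here with gzip compression (used when the cache name ends in ``.gz``); the file name and data are placeholders:

```python
import gzip
import os
import pickle
import tempfile

cache_name = os.path.join(tempfile.mkdtemp(), 'cache_file.dat.gz')
data = [[1, 3, 5], [2, 7, -2]]                 # whatever the converter returned

# Write: metadata first, then the data (two consecutive pickle.dump calls).
with gzip.open(cache_name, 'wb') as fp:
    pickle.dump({'filepair_version': 3}, fp)   # metadata with the cache version
    pickle.dump(data, fp)

# Read: load the metadata, check the version, then load the data.
with gzip.open(cache_name, 'rb') as fp:
    meta = pickle.load(fp)
    assert meta['filepair_version'] == 3       # version check before trusting the cache
    restored = pickle.load(fp)

print(restored == data)    # True
```

The same two-dump layout is used for uncompressed and bzip2-compressed caches; only the file opener differs.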
**Detailed description**
The :func:`filepair` function takes two file names, a version, and a function
that controls how the "master" file is read and post-processed.
Let's prepare the master data file first.
>>> import tempfile
>>> lvl, master_filename = tempfile.mkstemp()
>>> fp = open(master_filename, 'w')
>>> b = fp.write('1 3 5\n')  # b is the number of bytes written.
>>> b = fp.write('2 7 -2\n')
>>> fp.close()
Then, you may read the data file with ``numpy``'s ``loadtxt``.
>>> import numpy as np
>>> dat = np.loadtxt(master_filename)
>>> print(dat.sum())
16.0
Ok, let's try to use the :func:`filepair` function.
>>> pickle_filename = master_filename + 'pickle.gz'
>>> print(os.path.exists(pickle_filename)) # Pickle file not existing.
False
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3) # This case, file is read from master file.
>>> print(dat.sum())
16.0
>>> print(os.path.exists(pickle_filename))
True
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3) # This case, file is read from pickle file.
>>> print(dat.sum())
16.0
Here, ``converter`` is a function (callable) that takes a file name as an argument
and returns a data object (in this case, a (2, 3)-shaped ``np.array``).
Of course, users can define the converter as they want.
Which file is read at runtime depends on the
existence of the pickle file, the version, and the time stamps.
- If the pickle file exists and is readable, the time stamp of
  the pickle file is newer than that of the master file,
  the pickle file has not expired, and
  the version embedded in the pickle file is the same as the given version,
  then the data is read from the pickle file.
- Otherwise, the data is read from the master file.
- If the ``refresh`` keyword is set to True (the default is False),
  the data is always read from the master file.
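The time-stamp part of this decision boils down to comparing file modification times against the current time. A minimal sketch of such a predicate (the helper name is hypothetical, not part of the module):

```python
import os
import time

def cache_is_fresh(master, cache, expire=float('inf')):
    """True if the cache exists, is at least as new as the master file,
    and has not passed its expiration time (in seconds)."""
    if not os.path.exists(cache):
        return False
    master_mtime = os.stat(master).st_mtime
    cache_mtime = os.stat(cache).st_mtime
    return master_mtime <= cache_mtime and time.time() < cache_mtime + expire
```

The version check is separate: it is performed only after the cache file has been opened, by comparing the embedded version number against the given one.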
If the data is read from the master file, the :func:`filepair` function tries to
write the obtained data to the given ``pickle_filename``.
Below is an advanced example. (One can skip reading it.)
To check whether the data really comes from the pickle file, a fake pickle file
is prepared as follows.
>>> fp = gzip.open(pickle_filename, 'wb')
>>> dummy_data = np.array([[1, 3, 5], [8, 2, 9.]])
>>> pickle.dump({'filepair_version': 3}, fp)
>>> pickle.dump(dummy_data, fp)
>>> fp.close()
Now try to read the data.
The result below shows that :func:`filepair` returns the fake pickle file's contents.
>>> dat = filepair(master_filename, pickle_filename, version=3, converter=np.loadtxt)
>>> print(dat.sum()) # If dat is from master, 16 is returned. If from pickle, 28 is returned.
28.0
This exercise tells us an important fact:
the consistency between the master and pickle files is NOT checked at runtime.
This is natural: to check consistency, one would have to read the
master file, which defeats the purpose of caching.
The master file can be forcibly re-read as follows.
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, refresh=True, version=3)
>>> print(dat.sum())
16.0
The above command rewrites the pickle file, since the data was read from the master file.
Thus, the pickle file contents are now consistent with the master file again.
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)
>>> print(dat.sum())
16.0
.. note::

    The following just removes the files created in this doctest.
>>> if os.path.exists(master_filename): os.remove(master_filename)
>>> if os.path.exists(pickle_filename): os.remove(pickle_filename)
'''
import os
import pickle
import gzip
import bz2
import time
import logging
_logger = logging.getLogger(__name__)
import numpy as np
from irfpy.util import exception as ex
def _read_from_pickle(pickle_filename, version):
    if pickle_filename.endswith('.gz'):
        fp = gzip.open(pickle_filename, 'rb')
    elif pickle_filename.endswith('.bz2'):
        fp = bz2.open(pickle_filename, 'rb')
    else:
        fp = open(pickle_filename, 'rb')
    try:
        meta = pickle.load(fp)
        if meta['filepair_version'] != version:
            raise ex.IrfpyException('Version incompatible.')
        dat = pickle.load(fp)
    finally:
        fp.close()
    return dat
def _read_from_master(master_filename, converter):
    return converter(master_filename)

def _pickle_to(pickle_filename, meta, dat, compresslevel=9):
    if pickle_filename.endswith('.gz'):
        fp = gzip.open(pickle_filename, 'wb', compresslevel=compresslevel)
    elif pickle_filename.endswith('.bz2'):
        fp = bz2.open(pickle_filename, 'wb', compresslevel=compresslevel)
    else:
        fp = open(pickle_filename, 'wb')
    pickle.dump(meta, fp)
    pickle.dump(dat, fp)
    fp.close()
def filepair(master_filename, pickle_filename, version=0, converter=None, refresh=False, expire=np.inf, compresslevel=9):
""" Read the data either from master or cache files. Also caches the data.
:param master_filename: Master file name.
:param pickle_filename: Cache file name. If it ends with '.gz' or '.bz2',
gzip or bzip2 compression is used.
:keyword version: Version number of the cached data file. If the version
number in the data file is different from the given version,
the data is loaded from the master file and the cache file is recreated.
:keyword converter: A function to read the file. It must take
exactly one argument of string (filename), returning a pickle-able object.
If not given, 'numpy.loadtxt' is used.
:keyword refresh: If True, the data is always read from master file, and
dump the obtained object into a pickle file.
:keyword expire: Seconds the pickle file is considered as expired.
Default is infinity.
If the pickle file is older than the specified time, pickled file
is re-produced.
:keyword compresslevel: Gzip / Bzip2 compress level for caching.
It is only valid if the cache filename (``pickle_filename``)
ends with ".gz" or ".bz2".
"""
    if converter is None:
        converter = np.loadtxt

    ### First, try to read from the pickle file. If successful, return its contents.
    if (not refresh) and os.path.exists(pickle_filename):
        master_mtime = os.stat(master_filename).st_mtime
        pickle_mtime = os.stat(pickle_filename).st_mtime
        _logger.debug('Master: %f' % master_mtime)
        _logger.debug('Pickle: %f' % pickle_mtime)
        _logger.debug('Pickle (expire): %f' % (pickle_mtime + expire))
        _logger.debug('Present: %f' % time.time())
        if master_mtime <= pickle_mtime and time.time() < pickle_mtime + expire:
            _logger.info(' ... Loading from pickle file ({}).'.format(pickle_filename))
            try:
                t0 = time.time()
                dat = _read_from_pickle(pickle_filename, version)
                _logger.info(' ... Done. %f sec' % (time.time() - t0))
                return dat
            except KeyboardInterrupt:
                raise
            except Exception as e:
                _logger.info('Pickle file loading failed (permission? version incompatibility?)')
                _logger.info(e)
                _logger.info('But no problem. Continue to the master file ({}).'.format(master_filename))

    ### Otherwise, read the data from the master file.
    _logger.info(' ... Loading from master file ({}).'.format(master_filename))
    t0 = time.time()
    dat = _read_from_master(master_filename, converter)
    _logger.info(' ... Done. %f sec' % (time.time() - t0))

    meta = {'filepair_version': version}

    ### Save the data to a pickle file for the next reading. If this fails, just return.
    try:
        _logger.info(' ... Writing to pickle file ({}).'.format(pickle_filename))
        t0 = time.time()
        _pickle_to(pickle_filename, meta, dat, compresslevel=compresslevel)
        _logger.info(' ... Done. %f sec' % (time.time() - t0))
    except KeyboardInterrupt:
        raise
    except Exception as e:
        _logger.info('Failed to write the pickle file (%s), but no problem.' % str(e))
        if os.path.exists(pickle_filename):
            try:
                os.remove(pickle_filename)
            except OSError:
                pass

    return dat