r''' Local cache system for huge data files
.. codeauthor:: Yoshifumi Futaana
This module provides the functionality of local caching system
for (huge) data files.
Example use cases of this functionality are:
- You have a huge data file, which takes a lot of time to load.
Thus, you do not want to load the file several times.
- You have a data file, which should be post-processed after reading.
The post processing takes time, so that the results of the
data file shall be cached.
.. note::

    This functionality is similar to the memory caching
    functionality in joblib (``joblib.Memory``).
    See https://joblib.readthedocs.io/en/latest/generated/joblib.Memory.html
The caching system consists of a single function,
:func:`irfpy.util.filepairv.filepair`.
It works as follows.
1. On the first call, the "master" file is read and post-processed if needed.
   The result is cached to the "cache" file.

   - If saving to the cache file fails (e.g., insufficient permissions),
     nothing happens: no exception is raised and no error message is shown.

2. From the second call onward, the "cache" file is read.
3. Even if the cache file exists, the master file is re-read and post-processed when:

   - the master file is newer than the cache file,
   - the cache file has expired (you can set the expiration time at the first call),
   - the cache file has a lower version than the given version
     (you can set the cache file version at the first call), or
   - the ``refresh`` option is explicitly chosen by the user.

All of this processing is done automatically in the background.
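The flow above can be sketched, in simplified form, as follows. This is an illustrative sketch only, not the module's actual implementation; the function name ``cached_read`` is hypothetical, and the real :func:`filepair` additionally supports compressed cache files.

```python
import os
import pickle
import time

def cached_read(master, cache, converter, version=0, refresh=False, expire=float('inf')):
    """Illustrative sketch: read from the cache when it is valid,
    otherwise read the master file and (re)create the cache."""
    if not refresh and os.path.exists(cache):
        cache_mtime = os.stat(cache).st_mtime
        fresh = (os.stat(master).st_mtime <= cache_mtime
                 and time.time() < cache_mtime + expire)
        if fresh:
            try:
                with open(cache, 'rb') as fp:
                    meta = pickle.load(fp)              # {'filepair_version': ...}
                    if meta.get('filepair_version') == version:
                        return pickle.load(fp)          # cached data
            except Exception:
                pass                                    # fall through to the master file
    data = converter(master)                            # the slow read + post-processing
    try:
        with open(cache, 'wb') as fp:
            pickle.dump({'filepair_version': version}, fp)
            pickle.dump(data, fp)
    except OSError:
        pass                                            # cache write failures are silent
    return data
```

Note how a cache write failure is swallowed silently, matching the behavior described above.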
**Quick start**
Assume you have a function ``read_file()`` that takes a long time to run.
In addition, you have a post-processing function ``process()``
that may also take time.
What the developer should do is implement a wrapper function
that takes the filename (only) and returns the contents, as follows.
.. code-block:: python

    def read_file_and_process(filename):   # Argument is the filename
        data = read_file(filename)         # Your read function
        data2 = process(data)              # Your post-processing
        return data2                       # Return the object to be used
Then the user calls the ``read_file_and_process`` function wrapped by :func:`filepair`.
.. code-block:: python

    data = irfpy.util.filepairv.filepair('master_file.dat',   # The master file name
                                         'cache_file.dat',    # The cache file name
                                         converter=read_file_and_process)   # Converter
As this is the first call, the data is read from ``master_file.dat`` with the given
function ``read_file_and_process``.
The obtained data is then pickled to ``cache_file.dat``.
This is equivalent to:
.. code-block:: python

    data = read_file_and_process('master_file.dat')
    pickle.dump(data, open('cache_file.dat', 'wb'))
A second call with the same syntax reads from the cache file
(equivalent to ``data = pickle.load(open('cache_file.dat', 'rb'))``).
.. code-block:: python

    data = irfpy.util.filepairv.filepair('master_file.dat',   # The master file name
                                         'cache_file.dat',    # The cache file name
                                         converter=read_file_and_process)   # Converter
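For reference, the cache file itself simply holds two pickled objects: a small metadata dictionary carrying the cache version, followed by the data. A minimal sketch of that layout, here with gzip compression (used when the cache name ends in ``.gz``); the file name and data are placeholders:

```python
import gzip
import os
import pickle
import tempfile

cache_name = os.path.join(tempfile.mkdtemp(), 'cache_file.dat.gz')
data = [[1, 3, 5], [2, 7, -2]]                 # whatever the converter returned

# Write: metadata first, then the data (two consecutive pickle.dump calls).
with gzip.open(cache_name, 'wb') as fp:
    pickle.dump({'filepair_version': 3}, fp)   # metadata with the cache version
    pickle.dump(data, fp)

# Read: load the metadata, check the version, then load the data.
with gzip.open(cache_name, 'rb') as fp:
    meta = pickle.load(fp)
    assert meta['filepair_version'] == 3       # version check before trusting the cache
    restored = pickle.load(fp)

print(restored == data)    # True
```

The same two-dump layout is used for uncompressed and bzip2-compressed caches; only the file opener differs.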
**Detailed description**
The :func:`filepair` function takes two file names, a version, and a function
that controls how the "master" file is read and post-processed.
Let's prepare the master data file first.
>>> import tempfile
>>> lvl, master_filename = tempfile.mkstemp()
>>> fp = open(master_filename, 'w')
>>> b = fp.write('1 3 5\n')  # b is the number of bytes written.
>>> b = fp.write('2 7 -2\n')
>>> fp.close()
Then, you may read the data file with ``numpy``'s ``loadtxt``.
>>> import numpy as np
>>> dat = np.loadtxt(master_filename)
>>> print(dat.sum())
16.0
Ok, let's try to use the :func:`filepair` function.
>>> pickle_filename = master_filename + 'pickle.gz'
>>> print(os.path.exists(pickle_filename)) # Pickle file not existing.
False
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3) # This case, file is read from master file.
>>> print(dat.sum())
16.0
>>> print(os.path.exists(pickle_filename))
True
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3) # This case, file is read from pickle file.
>>> print(dat.sum())
16.0
Here, ``converter`` is a function (callable) that takes a file name as an argument
and returns a data object (in this case, a (2, 3)-shaped ``np.array``).
Of course, users can define the converter as they want.
Which file is read at runtime depends on the
existence of the pickle file, the version, and the time stamps.
- If the pickle file exists and is readable, the time stamp of
  the pickle file is newer than that of the master file,
  the pickle file has not expired, and
  the version embedded in the pickle file is the same as the given version,
  then the data is read from the pickle file.
- Otherwise, the data is read from the master file.
- If the ``refresh`` keyword is set to True (the default is False),
  the data is always read from the master file.
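The time-stamp part of this decision boils down to comparing file modification times against the current time. A minimal sketch of such a predicate (the helper name is hypothetical, not part of the module):

```python
import os
import time

def cache_is_fresh(master, cache, expire=float('inf')):
    """True if the cache exists, is at least as new as the master file,
    and has not passed its expiration time (in seconds)."""
    if not os.path.exists(cache):
        return False
    master_mtime = os.stat(master).st_mtime
    cache_mtime = os.stat(cache).st_mtime
    return master_mtime <= cache_mtime and time.time() < cache_mtime + expire
```

The version check is separate: it is performed only after the cache file has been opened, by comparing the embedded version number against the given one.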
If the data is read from the master file, the :func:`filepair` function tries to
write the obtained data to the given ``pickle_filename``.
Below is an advanced example. (One can skip reading it.)
To check whether the data really comes from the pickle file, a fake pickle file
is prepared as follows.
>>> fp = gzip.open(pickle_filename, 'wb')
>>> dummy_data = np.array([[1, 3, 5], [8, 2, 9.]])
>>> pickle.dump({'filepair_version': 3}, fp)
>>> pickle.dump(dummy_data, fp)
>>> fp.close()
Now try to read the data.
The result below shows that :func:`filepair` returns the fake pickle file's contents.
>>> dat = filepair(master_filename, pickle_filename, version=3, converter=np.loadtxt)
>>> print(dat.sum()) # If dat is from master, 16 is returned. If from pickle, 28 is returned.
28.0
This exercise tells us an important fact:
the consistency between the master and pickle files is NOT checked at runtime.
This is natural: to check consistency, one would have to read the
master file, which defeats the purpose of caching.
The master file can be forcibly re-read as follows.
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, refresh=True, version=3)
>>> print(dat.sum())
16.0
The above command rewrites the pickle file, since the data was read from the master file.
Thus, the pickle file contents are now consistent with the master file again.
>>> dat = filepair(master_filename, pickle_filename, converter=np.loadtxt, version=3)
>>> print(dat.sum())
16.0
.. note::

    The following just removes the files created in this doctest.
>>> if os.path.exists(master_filename): os.remove(master_filename)
>>> if os.path.exists(pickle_filename): os.remove(pickle_filename)
'''
import os
import pickle
import gzip
import bz2
import time
import logging
_logger = logging.getLogger(__name__)
import numpy as np
from irfpy.util import exception as ex
def _read_from_pickle(pickle_filename, version):
    if pickle_filename.endswith('.gz'):
        fp = gzip.open(pickle_filename, 'rb')
    elif pickle_filename.endswith('.bz2'):
        fp = bz2.open(pickle_filename, 'rb')
    else:
        fp = open(pickle_filename, 'rb')
    try:
        meta = pickle.load(fp)
        if meta['filepair_version'] != version:
            raise ex.IrfpyException('Version incompatible.')
        dat = pickle.load(fp)
    finally:
        fp.close()
    return dat
def _read_from_master(master_filename, converter):
    return converter(master_filename)

def _pickle_to(pickle_filename, meta, dat, compresslevel=9):
    if pickle_filename.endswith('.gz'):
        fp = gzip.open(pickle_filename, 'wb', compresslevel=compresslevel)
    elif pickle_filename.endswith('.bz2'):
        fp = bz2.open(pickle_filename, 'wb', compresslevel=compresslevel)
    else:
        fp = open(pickle_filename, 'wb')
    pickle.dump(meta, fp)
    pickle.dump(dat, fp)
    fp.close()
def filepair(master_filename, pickle_filename, version=0, converter=None, refresh=False, expire=np.inf, compresslevel=9):
""" Read the data either from master or cache files. Also caches the data.
:param master_filename: Master file name.
:param pickle_filename: Cache file name. If it ends with '.gz' or '.bz2',
gzip or bzip2 compression is used.
:keyword version: Version number of the cached data file. If the version
number in the data file is different from the given version,
the data is loaded from the master file and the cache file is recreated.
:keyword converter: A function to read the file. It must take
exactly one argument of string (filename), returning a pickle-able object.
If not given, 'numpy.loadtxt' is used.
:keyword refresh: If True, the data is always read from master file, and
dump the obtained object into a pickle file.
:keyword expire: Seconds the pickle file is considered as expired.
Default is infinity.
If the pickle file is older than the specified time, pickled file
is re-produced.
:keyword compresslevel: Gzip / Bzip2 compress level for caching.
It is only valid if the cache filename (``pickle_filename``)
ends with ".gz" or ".bz2".
"""
    if converter is None:
        converter = np.loadtxt

    ### First, try to read from the pickle file. If successful, return its contents.
    if (not refresh) and os.path.exists(pickle_filename):
        master_mtime = os.stat(master_filename).st_mtime
        pickle_mtime = os.stat(pickle_filename).st_mtime
        _logger.debug('Master: %f' % master_mtime)
        _logger.debug('Pickle: %f' % pickle_mtime)
        _logger.debug('Pickle (expire): %f' % (pickle_mtime + expire))
        _logger.debug('Present: %f' % time.time())
        if master_mtime <= pickle_mtime and time.time() < pickle_mtime + expire:
            _logger.info(' ... Loading from pickle file ({}).'.format(pickle_filename))
            try:
                t0 = time.time()
                dat = _read_from_pickle(pickle_filename, version)
                _logger.info(' ... Done. %f sec' % (time.time() - t0))
                return dat
            except KeyboardInterrupt:
                raise
            except Exception as e:
                _logger.info('Pickle file loading failed (permission? version incompatibility?)')
                _logger.info(e)
                _logger.info('But no problem. Continue to the master file ({}).'.format(master_filename))

    ### Otherwise, read the data from the master file.
    _logger.info(' ... Loading from master file ({}).'.format(master_filename))
    t0 = time.time()
    dat = _read_from_master(master_filename, converter)
    _logger.info(' ... Done. %f sec' % (time.time() - t0))

    meta = {'filepair_version': version}

    ### Save the data to a pickle file for the next reading. If this fails, just return.
    try:
        _logger.info(' ... Writing to pickle file ({}).'.format(pickle_filename))
        t0 = time.time()
        _pickle_to(pickle_filename, meta, dat, compresslevel=compresslevel)
        _logger.info(' ... Done. %f sec' % (time.time() - t0))
    except KeyboardInterrupt:
        raise
    except Exception as e:
        _logger.info('Failed to write the pickle file (%s), but no problem.' % str(e))
        if os.path.exists(pickle_filename):
            try:
                os.remove(pickle_filename)
            except OSError:
                pass

    return dat