irfpy.util.timeseriesdb

This module provides an implementation of a time-series database index.

Code author: Yoshifumi Futaana

A database ordered as a time series is frequently needed. The main use case is a dataset stored under a specific folder.

For example, there is a database like:

rootfolder -+- 200501 -+- 20050101.dat    # data from 2005-01-01 00:00:00
            |          +- 20050102.dat    # data from 2005-01-02 00:00:00
            |          +- 20050103.dat    # data from 2005-01-03 00:00:00
            |          +- ...
            |
            +- 200502 -+- 20050215.dat    # data from 2005-02-15 00:00:00
            |          +- 20050216.dat    # data from 2005-02-16 00:00:00
            |          +- 20050217.dat    # data from 2005-02-17 00:00:00
            |          +- ...
            |
            +- ...

The timeseriesdb module connects times to file names, so that the user can find the file name for a given time. The DB class implements such a database. If you do not know the exact start time of each data file, use the FazzyDB class instead.

Usage is as follows. Use the append() method to register a file name together with the start time of the data file.

>>> import datetime
>>> from irfpy.util.timeseriesdb import DB
>>> db = DB()    # Instantiate the DB object.
>>> db.append('rootfolder/200501/20050101.dat', datetime.datetime(2005, 1, 1))
>>> db.append('rootfolder/200501/20050102.dat', datetime.datetime(2005, 1, 2))
>>> db.append('rootfolder/200501/20050103.dat', datetime.datetime(2005, 1, 3))
>>> db.append('rootfolder/200502/20050215.dat', datetime.datetime(2005, 2, 15))
>>> db.append('rootfolder/200502/20050216.dat', datetime.datetime(2005, 2, 16))
>>> db.append('rootfolder/200502/20050217.dat', datetime.datetime(2005, 2, 17))
>>> print(db.get(datetime.datetime(2005, 1, 1, 12, 0, 0)))
rootfolder/200501/20050101.dat
>>> print(db.get(datetime.datetime(2005, 1, 3, 0, 0, 0)))
rootfolder/200501/20050103.dat
>>> print(db.get(datetime.datetime(2005, 5, 3, 0, 0, 0)))  # later than last data
rootfolder/200502/20050217.dat
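
For a tree like the one in this example, the start time can often be read from the file name itself. The following is a minimal sketch, assuming that every file is named YYYYMMDD.dat and starts at 00:00:00 of that day. The helper name start_time_from_path is hypothetical and not part of irfpy.

```python
import datetime
import pathlib


def start_time_from_path(path):
    """Parse the start time from a file named YYYYMMDD.dat (assumed layout)."""
    stem = pathlib.PurePosixPath(path).stem      # e.g. '20050101'
    return datetime.datetime.strptime(stem, '%Y%m%d')


paths = [
    'rootfolder/200501/20050101.dat',
    'rootfolder/200501/20050102.dat',
    'rootfolder/200502/20050215.dat',
]

# (filename, start time) pairs, in the argument order that append() takes.
pairs = [(p, start_time_from_path(p)) for p in paths]
for filename, start in pairs:
    print(filename, start)
```

Each pair can then be registered with db.append(filename, starttime).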

Another example is as follows.

rootfolder -+- orb0001  # Data from 2004-01-05 15:30:00
            +- orb0002  # Data from 2004-01-05 18:30:00
            +- orb0003  # Data from 2004-01-06 00:30:00
            +- orb0004  # Data from 2004-01-06 09:30:00
            +- orb0011  # Data from 2004-01-12 11:30:00
            |           #      no data or files for orb0005 to orb0010.

>>> db = DB()
>>> db.append('rootfolder/orb0001', datetime.datetime(2004, 1, 5, 15, 30))
>>> db.append('rootfolder/orb0002', datetime.datetime(2004, 1, 5, 18, 30))
>>> db.append('rootfolder/orb0003', datetime.datetime(2004, 1, 6, 0, 30))
>>> db.append('rootfolder/orb0004', datetime.datetime(2004, 1, 6, 9, 30))
>>> db.append('rootfolder/orb0011', datetime.datetime(2004, 1, 12, 11, 30))
>>> db.append('rootfolder/orb0012', datetime.datetime(2004, 1, 12, 21, 30))
>>> # If you give a time before the start of the data, DataNotInDbError is raised.
>>> print(db.get(datetime.datetime(2004, 1, 1)))
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: message
>>> print(db.get(datetime.datetime(2004, 1, 5, 15, 30)))
rootfolder/orb0001
>>> # Data at 2004-01-10T00:00:00 are most likely not in the dataset,
>>> # but orb0004 is still returned, because the epoch lies between
>>> # the start times of orb0004 and orb0011.
>>> print(db.get(datetime.datetime(2004, 1, 10, 0, 0)))
rootfolder/orb0004
>>> # Because the database does not store end times,
>>> # the last file is always returned for any time after the last start time.
>>> print(db.get(datetime.datetime(2100, 1, 1, 0, 0, 0)))
rootfolder/orb0012
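
The behaviour in the two examples is consistent with a lookup that returns the file with the latest start time not later than the requested time. The following is a minimal, self-contained sketch of that rule using the standard bisect module. It is an illustration only, not the irfpy implementation: the function name lookup is hypothetical, and LookupError stands in for DataNotInDbError.

```python
import bisect
import datetime


def lookup(entries, t):
    """Return the filename whose start time is the latest one <= t.

    entries: list of (starttime, filename) tuples.
    """
    ordered = sorted(entries)
    starts = [start for start, _ in ordered]
    i = bisect.bisect_right(starts, t) - 1
    if i < 0:
        raise LookupError('%s is before the first start time' % t)
    return ordered[i][1]


entries = [
    (datetime.datetime(2004, 1, 6, 9, 30), 'rootfolder/orb0004'),
    (datetime.datetime(2004, 1, 12, 11, 30), 'rootfolder/orb0011'),
    (datetime.datetime(2004, 1, 12, 21, 30), 'rootfolder/orb0012'),
]

# A time inside the gap maps to the file that started before the gap.
in_gap = lookup(entries, datetime.datetime(2004, 1, 10))
# A time after the last start maps to the last file, since no end time is stored.
after_end = lookup(entries, datetime.datetime(2100, 1, 1))
print(in_gap)
print(after_end)
```

Because only start times are stored, such a lookup cannot distinguish a data gap from real coverage. The caller has to check the returned file for the requested time.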

See How to load tree-structured data for more information.

Fazzy database

Sometimes the files in the database do not provide exact start times, because obtaining them is computationally expensive. In such cases, a “fazzy database” may be a better solution than surveying all the files.

To use the fazzy database strategy, one must still know the chronological order of the files and estimate the start time of every file. Rough estimates are usually fine. They can be derived from the file name or from other resources (the orbit number, for example). Even “evenly-distributed” times may be acceptable.
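
As an illustration of the “evenly-distributed” option, the sketch below assigns equally spaced guessed start times to a chronologically ordered list of files. The helper name evenly_spaced_guesses and the chosen time span are assumptions made for this example only.

```python
import datetime


def evenly_spaced_guesses(filenames, first, last):
    """Pair each file with a guessed start time spread evenly over [first, last].

    filenames must already be in chronological order.
    """
    n = len(filenames)
    if n == 0:
        return []
    if n == 1:
        return [(filenames[0], first)]
    step = (last - first) / (n - 1)          # timedelta between two guesses
    return [(name, first + i * step) for i, name in enumerate(filenames)]


guesses = evenly_spaced_guesses(
    ['orb0001', 'orb0002', 'orb0003'],
    datetime.datetime(2004, 1, 5),
    datetime.datetime(2004, 1, 7),
)
for name, guess in guesses:
    print(name, guess)
```

The resulting pairs can be appended to a DB object, which then serves as the guessed database.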

See FazzyDB for a sample.

Warning

Before version 4.2.6a3, DB and FazzyDB had a critical bug. It was reported in https://gitlab.irf.se/irfpy/util/issues/2 and fixed in commit ac33aade.

exception irfpy.util.timeseriesdb.DataNotInDbError(value)[source]

Bases: Exception

exception irfpy.util.timeseriesdb.DuplicatedError(value)[source]

Bases: Exception

class irfpy.util.timeseriesdb.DB[source]

Bases: object

Implementation of the timeseries database.

logger = <Logger irfpy.util.timeseriesdb.DB (DEBUG)>
append(filename, starttime)[source]

Append the file to the database together with its start time. Appending a file name that is already registered raises DuplicatedError.

>>> db = DB()
>>> db.append('file1', datetime.datetime(2009, 1, 1, 0, 0, 0))
>>> try:
...    db.append('file1', datetime.datetime(2009, 2, 1, 0, 0, 0))
...    print("!!!! Should not reach here")
... except DuplicatedError as e:
...    print("Duplicated error correctly caught!")
Duplicated error correctly caught!
remove(filename)[source]

Remove the file from the database.

Parameters

filename – The file name.

Returns

None

If the filename does not exist, ValueError is raised.

t0()[source]

Return the earliest start time in the database.

Note that the “last time” cannot be identified, because the database stores only start times.

>>> db = DB()
>>> db.append('a', datetime.datetime(2009, 1, 10))
>>> print(db.t0())
2009-01-10 00:00:00
>>> db.append('b', datetime.datetime(2008, 1, 25, 12))
>>> print(db.t0())
2008-01-25 12:00:00
>>> db.append('c', datetime.datetime(2012, 1, 25, 12))
>>> print(db.t0())
2008-01-25 12:00:00
get(t)[source]

Return the filename that contains the data of the specified time.

If the specified time is before the DB start time, DataNotInDbError is raised.

First, load the sample database.

>>> db = DB._get_sample_database()

2004-01-05T15:30:00 is contained in orb0001.

>>> t0 = datetime.datetime(2004, 1, 5, 15, 30)
>>> print(db.get(t0))
rootfolder/orb0001

2004-01-05T17:00:00 is also in orb0001.

>>> t1 = datetime.datetime(2004, 1, 5, 17)
>>> print(db.get(t1))
rootfolder/orb0001

2004-01-05T00:00:00 is before the start of the database.

>>> t2 = datetime.datetime(2004, 1, 5, 0)
>>> try:
...     print(db.get(t2))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception caught")
Exception caught

Common sense says that 2010-01-01T00:00:00 is not covered by this database, but the last file is returned.

>>> t3 = datetime.datetime(2010, 1, 1)
>>> print(db.get(t3))
rootfolder/orb0012
getfiles(t0, t1)[source]

Return the filenames that cover the specified range.

Parameters
  • t0 – Start. (datetime.datetime)

  • t1 – End. (datetime.datetime)

Returns

Tuple of the file names.

>>> db = DB()
>>> db.append('orb0001', datetime.datetime(2004, 1, 5, 15, 30))
>>> db.append('orb0002', datetime.datetime(2004, 1, 5, 18, 30))
>>> db.append('orb0003', datetime.datetime(2004, 1, 6, 0, 30))
>>> db.append('orb0004', datetime.datetime(2004, 1, 6, 9, 30))
>>> db.append('orb0005', datetime.datetime(2004, 1, 6, 18, 30))
>>> db.append('orb0006', datetime.datetime(2004, 1, 7, 3, 30))
>>> db.append('orb0011', datetime.datetime(2004, 1, 12, 11, 30))
>>> db.append('orb0012', datetime.datetime(2004, 1, 12, 21, 30))
>>> print(db.getfiles(datetime.datetime(2004, 1, 6), datetime.datetime(2004, 1, 7)))
('orb0002', 'orb0003', 'orb0004', 'orb0005')
>>> print(db.getfiles(datetime.datetime(2004, 1, 1), datetime.datetime(2004, 1, 6)))
('orb0001', 'orb0002')
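
The two results in this example can be reproduced with a simple range rule: start from the file that contains t0 (or the first file if t0 is earlier) and include every file whose start time is not later than t1. The sketch below implements that rule with the standard bisect module. It is not the irfpy implementation, the name files_in_range is hypothetical, and the treatment of a file starting exactly at t1 (included here) is an assumption.

```python
import bisect
import datetime

D = datetime.datetime


def files_in_range(entries, t0, t1):
    """Return the filenames overlapping [t0, t1].

    entries: list of (starttime, filename) tuples.
    """
    ordered = sorted(entries)
    starts = [start for start, _ in ordered]
    first = max(bisect.bisect_right(starts, t0) - 1, 0)   # file containing t0
    last = bisect.bisect_right(starts, t1)                # one past last file
    return tuple(name for _, name in ordered[first:last])


entries = [
    (D(2004, 1, 5, 15, 30), 'orb0001'),
    (D(2004, 1, 5, 18, 30), 'orb0002'),
    (D(2004, 1, 6, 0, 30), 'orb0003'),
    (D(2004, 1, 6, 9, 30), 'orb0004'),
    (D(2004, 1, 6, 18, 30), 'orb0005'),
    (D(2004, 1, 7, 3, 30), 'orb0006'),
    (D(2004, 1, 12, 11, 30), 'orb0011'),
    (D(2004, 1, 12, 21, 30), 'orb0012'),
]

one_day = files_in_range(entries, D(2004, 1, 6), D(2004, 1, 7))
from_before = files_in_range(entries, D(2004, 1, 1), D(2004, 1, 6))
print(one_day)
print(from_before)
```

Note that orb0002 appears in the first result even though it starts on the previous day, because it is the file that contains t0.
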
nextof(filename)[source]

Return the file that follows the given filename.

If the given filename is not found in the database, ValueError is raised.

If the given file is the last file, DataNotInDbError is raised.

>>> db = DB._get_sample_database()
>>> print(db.nextof('rootfolder/orb0001'))
rootfolder/orb0002
>>> try:
...     print(db.nextof('rootfolder/orb0005'))
...     print("!!!! Should not reach here")
... except ValueError as e:
...     print("Exception correctly caught")
Exception correctly caught
>>> try:
...     print(db.nextof('rootfolder/orb0012'))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception correctly caught")
Exception correctly caught
previousof(filename)[source]

Return the file that precedes the given filename.

If the given filename is not found in the database, KeyError is raised.

If the given file is the first file, DataNotInDbError is raised.

>>> db = DB._get_sample_database()
>>> try:
...     print(db.previousof('rootfolder/orb0001'))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception correctly caught")
Exception correctly caught
gettime(filename)[source]

Return the registered start time for the given filename.

>>> db = DB._get_sample_database()
>>> print(db.gettime('rootfolder/orb0004'))
2004-01-06 09:30:00
clear()[source]
print_all()[source]
print_invall()[source]
class irfpy.util.timeseriesdb.FazzyDB(guessed_db, func_getstart)[source]

Bases: object

Time series database with a fazzy start time definition.

Sample of fazzy database:

Suppose you have a dataset consisting of file0, file1, …, file4, and that you cannot know the exact start times of these data files without surveying all of them, which may take time.

Now, assume you have guessed the start times as follows:

file0   1996-01-01
file1   1997-01-01
file2   1998-01-01
file3   1999-01-01
file4   2000-01-01

Using the guessed times, first create a DB object.

>>> guessdb = DB()
>>> guessdb.append('file0', datetime.datetime(1996, 1, 1))
>>> guessdb.append('file1', datetime.datetime(1997, 1, 1))
>>> guessdb.append('file2', datetime.datetime(1998, 1, 1))
>>> guessdb.append('file3', datetime.datetime(1999, 1, 1))
>>> guessdb.append('file4', datetime.datetime(2000, 1, 1))

However, suppose the real start times are actually as follows:

file0   1996-06-01
file1   1996-08-01
file2   1996-10-01
file3   1999-08-01
file4   1999-10-01

Note that none of the real start times are known a priori. Suppose that, to obtain them, you need to call the following function.

>>> def real_start(filename):    # Emulating the start time retrieval function
...     st={'file0': datetime.datetime(1996, 6, 1),
...         'file1': datetime.datetime(1996, 8, 1),
...         'file2': datetime.datetime(1996,10, 1),
...         'file3': datetime.datetime(1999, 8, 1),
...         'file4': datetime.datetime(1999,10, 1),}
...     # sleep(100)  # Very heavy processing :)
...     return st[filename]

The preparation is now complete. Create the FazzyDB object.

>>> fdb = FazzyDB(guessdb, real_start)
>>> print(len(fdb))
5
>>> fdb.get(datetime.datetime(1999, 3, 1))
'file2'
>>> fdb.get(datetime.datetime(1996, 6, 1))
'file0'
>>> fdb.get(datetime.datetime(1996, 7, 31))
'file0'
>>> fdb.get(datetime.datetime(1996, 8, 1))
'file1'
>>> fdb.get(datetime.datetime(1996, 10, 1))
'file2'
>>> fdb.get(datetime.datetime(1999, 9, 9))
'file3'
>>> fdb.get(datetime.datetime(1999, 10, 1))
'file4'
>>> fdb.get(datetime.datetime(1999, 10, 21))
'file4'
>>> fdb.get(datetime.datetime(2050, 10, 1))
'file4'
>>> fdb.get(datetime.datetime(1996, 3, 1))   
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: 'This is the first file in the DB (file0).'
>>> fdb.get(datetime.datetime(1990, 3, 1))   
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: 'This is the first file in the DB (file0).'

Getting the first time

>>> print(fdb.t0())
1996-06-01 00:00:00

Getting files

>>> print(fdb.getfiles(datetime.datetime(1996, 1, 1), datetime.datetime(1997, 1, 1)))
('file0', 'file1', 'file2')
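
One way to obtain the results in this sample is to take the guessed database as a starting point and then correct the candidate by evaluating real start times only for nearby files. The sketch below follows that idea. It is an assumption about a workable strategy, not a description of how FazzyDB is implemented; the name fuzzy_get is hypothetical, and LookupError stands in for DataNotInDbError.

```python
import bisect
import datetime

D = datetime.datetime

files = ['file0', 'file1', 'file2', 'file3', 'file4']   # chronological order
guessed = [D(1996, 1, 1), D(1997, 1, 1), D(1998, 1, 1), D(1999, 1, 1), D(2000, 1, 1)]
real = {'file0': D(1996, 6, 1), 'file1': D(1996, 8, 1), 'file2': D(1996, 10, 1),
        'file3': D(1999, 8, 1), 'file4': D(1999, 10, 1)}


def real_start(filename):
    """Stand-in for the expensive start time retrieval."""
    return real[filename]


def fuzzy_get(files, guessed, real_start, t):
    """Return the file containing t, starting the search from the guess."""
    i = max(bisect.bisect_right(guessed, t) - 1, 0)      # initial candidate
    # Step back while the candidate really starts after t.
    while i > 0 and real_start(files[i]) > t:
        i -= 1
    # Step forward while the next file really starts at or before t.
    while i + 1 < len(files) and real_start(files[i + 1]) <= t:
        i += 1
    if real_start(files[i]) > t:
        raise LookupError('%s is before the first file' % t)
    return files[i]


print(fuzzy_get(files, guessed, real_start, D(1999, 3, 1)))
print(fuzzy_get(files, guessed, real_start, D(1996, 8, 1)))
```

The better the guesses, the fewer real start times have to be evaluated before the candidate settles.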

Create fazzy DB.

Parameters
  • guessed_db – Guessed database (DB object)

  • func_getstart – A function that takes a filename and returns its start time.

logger = <Logger irfpy.util.timeseriesdb.FazzyDB (DEBUG)>
get_filename_from_database(t)[source]

Return the filename from the guessed database.

get_exactstart(filename)[source]

Return the exact start time.

The exact start time is obtained either by actually evaluating the data or from the cache.

The func_getstart function given to __init__ is used.
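
The benefit of caching can be illustrated with the standard functools.lru_cache decorator: the expensive function runs only once per filename. This is a sketch of the general technique, not the cache used inside FazzyDB; the names real_start, cached_start and calls are made up for the example.

```python
import datetime
import functools

calls = []   # records every real (expensive) evaluation


def real_start(filename):
    """Stand-in for an expensive start time retrieval."""
    calls.append(filename)
    table = {'file0': datetime.datetime(1996, 6, 1),
             'file1': datetime.datetime(1996, 8, 1)}
    return table[filename]


# Wrap the expensive function so that repeated requests hit the cache.
cached_start = functools.lru_cache(maxsize=None)(real_start)

first = cached_start('file1')     # evaluated by real_start
second = cached_start('file1')    # served from the cache
print(first, second, len(calls))
```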

get(t)[source]

Return the filename that contains the data of the specified time.

getfiles(t0, t1)[source]

Return the filenames that cover the specified range.

Parameters
  • t0 – Start. (datetime.datetime)

  • t1 – End. (datetime.datetime)

Returns

Tuple of the file names.

nextof(filename)[source]
previousof(filename)[source]
t0()[source]
gettime(filename)[source]