irfpy.util.timeseriesdb
This module implements a time-series index for file-based databases.
Code author: Yoshifumi Futaana
Databases organized in time-series order are needed frequently. The main use case is a dataset stored under a specific folder.
For example, consider a database laid out like this:
rootfolder -+- 200501 -+- 20050101.dat   # data from 2005-01-01 00:00:00
            |          +- 20050102.dat   # data from 2005-01-02 00:00:00
            |          +- 20050103.dat   # data from 2005-01-03 00:00:00
            |          +- ...
            |
            +- 200502 -+- 20050215.dat   # data from 2005-02-15 00:00:00
            |          +- 20050216.dat   # data from 2005-02-16 00:00:00
            |          +- 20050217.dat   # data from 2005-02-17 00:00:00
            |          +- ...
            |
            +- ...
The timeseriesdb module connects times to file names, so that the user can find the file name for a given time. The DB class provides an implementation of such a database.
In case you do not know the exact start time of the data file, you can use the FazzyDB class.
Usage is as follows. Use the append() method to associate each file name with the start time of the data file.
>>> import datetime
>>> db = DB()   # Instantiate the DB object.
>>> db.append('rootfolder/200501/20050101.dat', datetime.datetime(2005, 1, 1))
>>> db.append('rootfolder/200501/20050102.dat', datetime.datetime(2005, 1, 2))
>>> db.append('rootfolder/200501/20050103.dat', datetime.datetime(2005, 1, 3))
>>> db.append('rootfolder/200502/20050215.dat', datetime.datetime(2005, 2, 15))
>>> db.append('rootfolder/200502/20050216.dat', datetime.datetime(2005, 2, 16))
>>> db.append('rootfolder/200502/20050217.dat', datetime.datetime(2005, 2, 17))
>>> print(db.get(datetime.datetime(2005, 1, 1, 12, 0, 0)))
rootfolder/200501/20050101.dat
>>> print(db.get(datetime.datetime(2005, 1, 3, 0, 0, 0)))
rootfolder/200501/20050103.dat
>>> print(db.get(datetime.datetime(2005, 5, 3, 0, 0, 0))) # later than last data
rootfolder/200502/20050217.dat
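When the start time is encoded in the file name, as in the layout above, the (file name, start time) pairs can be derived rather than typed by hand. The following is a minimal sketch under that assumption; `start_time_from_name` is a hypothetical helper written for this illustration and is not part of irfpy.

```python
import datetime
from pathlib import PurePosixPath


def start_time_from_name(filename):
    """Parse the YYYYMMDD stem of a path like 'rootfolder/200501/20050101.dat'.

    Hypothetical helper: assumes the file stem is exactly a YYYYMMDD date.
    """
    stem = PurePosixPath(filename).stem
    return datetime.datetime.strptime(stem, "%Y%m%d")


filenames = [
    "rootfolder/200501/20050101.dat",
    "rootfolder/200501/20050102.dat",
    "rootfolder/200502/20050215.dat",
]

# Each pair could then be registered with db.append(name, start).
pairs = [(name, start_time_from_name(name)) for name in filenames]
for name, start in pairs:
    print(name, start)
```

A file whose name does not follow the assumed pattern makes `strptime` raise `ValueError`, which is usually preferable to registering a wrong start time silently.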
Another example is as follows.
rootfolder -+- orb0001   # Data from 2004-01-05 15:30:00
            +- orb0002   # Data from 2004-01-05 18:30:00
            +- orb0003   # Data from 2004-01-06 00:30:00
            +- orb0004   # Data from 2004-01-06 09:30:00
            +- orb0011   # Data from 2004-01-12 11:30:00
            |            # No data or files for orbits 5 to 10.
            +- orb0012   # Data from 2004-01-12 21:30:00
>>> db = DB()
>>> db.append('rootfolder/orb0001', datetime.datetime(2004, 1, 5, 15, 30))
>>> db.append('rootfolder/orb0002', datetime.datetime(2004, 1, 5, 18, 30))
>>> db.append('rootfolder/orb0003', datetime.datetime(2004, 1, 6, 0, 30))
>>> db.append('rootfolder/orb0004', datetime.datetime(2004, 1, 6, 9, 30))
>>> db.append('rootfolder/orb0011', datetime.datetime(2004, 1, 12, 11, 30))
>>> db.append('rootfolder/orb0012', datetime.datetime(2004, 1, 12, 21, 30))
>>> # If the given time is before the start of the data, DataNotInDbError is raised.
>>> print(db.get(datetime.datetime(2004, 1, 1)))
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: message
>>> print(db.get(datetime.datetime(2004, 1, 5, 15, 30)))
rootfolder/orb0001
>>> # The data at 2004-01-10T00:00:00 is most likely not in the dataset,
>>> # but orb0004 is still returned, because the epoch lies between
>>> # the start times of orb0004 and orb0011.
>>> print(db.get(datetime.datetime(2004, 1, 10, 0, 0)))
rootfolder/orb0004
>>> # Because end times are not stored in the database,
>>> # the last file is always returned for any time later than the database coverage.
>>> print(db.get(datetime.datetime(2100, 1, 1, 0, 0, 0)))
rootfolder/orb0012
See How to load tree-structured data for more information.
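The lookup behavior shown in the two examples (an error before the first start time, the preceding file inside a gap, the last file for any later time) is what a bisection over sorted start times produces. The sketch below illustrates that idea only; `StartTimeIndex` is a hypothetical stand-in, it raises the built-in `LookupError` instead of `DataNotInDbError`, and it is not claimed to be how DB is implemented.

```python
import bisect
import datetime


class StartTimeIndex:
    """Hypothetical stand-in for illustration; not the irfpy DB class."""

    def __init__(self):
        self._starts = []  # start times, kept sorted
        self._names = []   # file names, parallel to _starts

    def append(self, filename, starttime):
        # Insert while keeping both lists sorted by start time.
        i = bisect.bisect_right(self._starts, starttime)
        self._starts.insert(i, starttime)
        self._names.insert(i, filename)

    def get(self, t):
        # Index of the last file whose start time is <= t.
        i = bisect.bisect_right(self._starts, t) - 1
        if i < 0:
            raise LookupError(f"{t} is before the first start time")
        return self._names[i]


index = StartTimeIndex()
index.append("rootfolder/orb0001", datetime.datetime(2004, 1, 5, 15, 30))
index.append("rootfolder/orb0004", datetime.datetime(2004, 1, 6, 9, 30))
index.append("rootfolder/orb0011", datetime.datetime(2004, 1, 12, 11, 30))
index.append("rootfolder/orb0012", datetime.datetime(2004, 1, 12, 21, 30))

print(index.get(datetime.datetime(2004, 1, 10)))  # falls in the gap
print(index.get(datetime.datetime(2100, 1, 1)))   # later than all starts
```

Because only start times are stored, the index cannot tell a gap from real coverage; that limitation is inherent to the data model rather than to the search method.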
Fazzy database
Sometimes the files in the database do not provide exact start times, because obtaining the exact start time is computationally expensive. In such cases, a “fazzy database” may be a better solution than surveying all the files.
To use the fazzy database strategy, one must still know the chronological order of the files and estimate a start time for every file. Rough estimates of the start times are usually fine. They can be derived from the file name or from other resources (the orbit number, for example). Even “evenly-distributed” times may be acceptable.
See FazzyDB for a sample.
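To make the “evenly-distributed” option concrete, the sketch below spreads guessed start times uniformly over an assumed overall interval for a list of files already in chronological order. `evenly_spaced_guesses` is a hypothetical helper, and the interval endpoints are illustrative values, not taken from any real dataset.

```python
import datetime


def evenly_spaced_guesses(filenames, first, last):
    """Assign evenly spaced guessed start times to chronologically ordered files.

    Hypothetical helper: `first` and `last` are the guessed start times of
    the first and the last file.
    """
    n = len(filenames)
    if n == 0:
        return []
    if n == 1:
        return [(filenames[0], first)]
    step = (last - first) / (n - 1)  # a datetime.timedelta
    return [(name, first + i * step) for i, name in enumerate(filenames)]


guesses = evenly_spaced_guesses(
    ["file0", "file1", "file2", "file3", "file4"],
    datetime.datetime(1996, 1, 1),
    datetime.datetime(2000, 1, 1),
)
for name, guessed_start in guesses:
    print(name, guessed_start)  # candidates for guessdb.append(name, guessed_start)
```

The guesses only need to preserve the chronological order of the files; their absolute accuracy matters less, as stated above.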
Warning
In versions before 4.2.6a3, DB and FazzyDB had a critical error. The bug was reported in https://gitlab.irf.se/irfpy/util/issues/2 and has been fixed by commit ac33aade.
- class irfpy.util.timeseriesdb.DB
Bases: object
Implementation of the timeseries database.
- logger = <Logger irfpy.util.timeseriesdb.DB (DEBUG)>
- append(filename, starttime)
Append the file to the database together with its start time. As the example shows, appending a file name that is already registered raises DuplicatedError.
>>> db = DB()
>>> db.append('file1', datetime.datetime(2009, 1, 1, 0, 0, 0))
>>> try:
...     db.append('file1', datetime.datetime(2009, 2, 1, 0, 0, 0))
...     print("!!!! Should not reach here")
... except DuplicatedError as e:
...     print("Duplicated error correctly caught!")
Duplicated error correctly caught!
- remove(filename)
Remove the file from the database.
- Parameters:
filename – The file name.
- Returns:
None
If the filename does not exist, ValueError is raised.
- t0()
Return the earliest start time in the database.
Note that the “last time” cannot be identified, because the database stores only start times.
>>> db = DB()
>>> db.append('a', datetime.datetime(2009, 1, 10))
>>> print(db.t0())
2009-01-10 00:00:00
>>> db.append('b', datetime.datetime(2008, 1, 25, 12))
>>> print(db.t0())
2008-01-25 12:00:00
>>> db.append('c', datetime.datetime(2012, 1, 25, 12))
>>> print(db.t0())
2008-01-25 12:00:00
- get(t)
Return the filename that contains the data of the specified time. If the specified time is before the DB start time, DataNotInDbError is raised.
First, load the sample database.
>>> db = DB._get_sample_database()
2004-01-05T15:30:00 is contained in orb0001.
>>> t0 = datetime.datetime(2004, 1, 5, 15, 30)
>>> print(db.get(t0))
rootfolder/orb0001
2004-01-05T17:00:00 is also in orb0001.
>>> t1 = datetime.datetime(2004, 1, 5, 17)
>>> print(db.get(t1))
rootfolder/orb0001
2004-01-05T00:00:00 is before the database.
>>> t2 = datetime.datetime(2004, 1, 5, 0)
>>> try:
...     print(db.get(t2))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception caught")
Exception caught
2010-01-01T00:00:00 is, by common sense, not included in this database, but the last file is returned.
>>> t3 = datetime.datetime(2010, 1, 1)
>>> print(db.get(t3))
rootfolder/orb0012
- getfiles(t0, t1)
Return the filenames that cover the specified time range.
- Parameters:
t0 – Start. (datetime.datetime)
t1 – End. (datetime.datetime)
- Returns:
Tuple of the file names.
>>> db = DB()
>>> db.append('orb0001', datetime.datetime(2004, 1, 5, 15, 30))
>>> db.append('orb0002', datetime.datetime(2004, 1, 5, 18, 30))
>>> db.append('orb0003', datetime.datetime(2004, 1, 6, 0, 30))
>>> db.append('orb0004', datetime.datetime(2004, 1, 6, 9, 30))
>>> db.append('orb0005', datetime.datetime(2004, 1, 6, 18, 30))
>>> db.append('orb0006', datetime.datetime(2004, 1, 7, 3, 30))
>>> db.append('orb0011', datetime.datetime(2004, 1, 12, 11, 30))
>>> db.append('orb0012', datetime.datetime(2004, 1, 12, 21, 30))
>>> print(db.getfiles(datetime.datetime(2004, 1, 6), datetime.datetime(2004, 1, 7)))
('orb0002', 'orb0003', 'orb0004', 'orb0005')
>>> print(db.getfiles(datetime.datetime(2004, 1, 1), datetime.datetime(2004, 1, 6)))
('orb0001', 'orb0002')
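A range query of this kind can be pictured as two bisections over the sorted start times: one to find the file that is active at the start of the range, and one to find the last file that starts within it. The sketch below reproduces the two results of the example with a hypothetical free function `files_covering`. How the real method treats a file that starts exactly at the end of the range is not stated here, so including it is an assumption of this sketch.

```python
import bisect
import datetime


def files_covering(starts, names, t0, t1):
    """Return names of files that may hold data in [t0, t1].

    Hypothetical helper: `starts` must be sorted and parallel to `names`.
    A file starting exactly at t1 is included (an assumption).
    """
    # The file active at t0 is the last one starting at or before t0.
    lo = max(bisect.bisect_right(starts, t0) - 1, 0)
    # Every file starting at or before t1 is included.
    hi = bisect.bisect_right(starts, t1)
    return tuple(names[lo:hi])


names = ["orb0001", "orb0002", "orb0003", "orb0004",
         "orb0005", "orb0006", "orb0011", "orb0012"]
starts = [
    datetime.datetime(2004, 1, 5, 15, 30),
    datetime.datetime(2004, 1, 5, 18, 30),
    datetime.datetime(2004, 1, 6, 0, 30),
    datetime.datetime(2004, 1, 6, 9, 30),
    datetime.datetime(2004, 1, 6, 18, 30),
    datetime.datetime(2004, 1, 7, 3, 30),
    datetime.datetime(2004, 1, 12, 11, 30),
    datetime.datetime(2004, 1, 12, 21, 30),
]

print(files_covering(starts, names,
                     datetime.datetime(2004, 1, 6), datetime.datetime(2004, 1, 7)))
```

Note that orb0002 appears in the result even though it starts before the range, because it is the file active at the start of the range.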
- nextof(filename)
Return the file that follows the given filename.
If the given filename is not found in the database, ValueError is raised.
If the given file is the last file, DataNotInDbError is raised.
>>> db = DB._get_sample_database()
>>> print(db.nextof('rootfolder/orb0001'))
rootfolder/orb0002
>>> try:
...     print(db.nextof('rootfolder/orb0005'))
...     print("!!!! Should not reach here")
... except ValueError as e:
...     print("Exception correctly caught")
Exception correctly caught
>>> try:
...     print(db.nextof('rootfolder/orb0012'))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception correctly caught")
Exception correctly caught
- previousof(filename)
Return the file that precedes the given filename.
If the given filename is not found in the database, KeyError is raised.
If the given file is the first file, DataNotInDbError is raised.
>>> db = DB._get_sample_database()
>>> try:
...     print(db.previousof('rootfolder/orb0001'))
...     print("!!!! Should not reach here")
... except DataNotInDbError as e:
...     print("Exception correctly caught")
Exception correctly caught
- class irfpy.util.timeseriesdb.FazzyDB(guessed_db, func_getstart)
Bases: object
Time series database with a fazzy start time definition.
Sample of fazzy database:
Suppose you have a dataset of file0, file1, … file4, and you cannot know the exact start times of these data files without surveying all of them, which may take time.
Now, assume you have guessed the start times as follows:
file0  1996-01-01
file1  1997-01-01
file2  1998-01-01
file3  1999-01-01
file4  2000-01-01
Using the guessed times, first create a DB object.
>>> guessdb = DB()
>>> guessdb.append('file0', datetime.datetime(1996, 1, 1))
>>> guessdb.append('file1', datetime.datetime(1997, 1, 1))
>>> guessdb.append('file2', datetime.datetime(1998, 1, 1))
>>> guessdb.append('file3', datetime.datetime(1999, 1, 1))
>>> guessdb.append('file4', datetime.datetime(2000, 1, 1))
However, suppose the real start times are actually as follows:
file0  1996-06-01
file1  1996-08-01
file2  1996-10-01
file3  1999-08-01
file4  1999-10-01
Note that the real start times of the files are not known a priori. Suppose that obtaining them requires calling the following function.
>>> def real_start(filename):    # Emulating the start time retrieval function
...     st = {'file0': datetime.datetime(1996, 6, 1),
...           'file1': datetime.datetime(1996, 8, 1),
...           'file2': datetime.datetime(1996, 10, 1),
...           'file3': datetime.datetime(1999, 8, 1),
...           'file4': datetime.datetime(1999, 10, 1),}
...     # sleep(100)   # Very heavy processing :)
...     return st[filename]
The preparation is now complete. Create a FazzyDB object.
>>> fdb = FazzyDB(guessdb, real_start)
>>> print(len(fdb))
5
>>> fdb.get(datetime.datetime(1999, 3, 1))
'file2'
>>> fdb.get(datetime.datetime(1996, 6, 1))
'file0'
>>> fdb.get(datetime.datetime(1996, 7, 31))
'file0'
>>> fdb.get(datetime.datetime(1996, 8, 1))
'file1'
>>> fdb.get(datetime.datetime(1996, 10, 1))
'file2'
>>> fdb.get(datetime.datetime(1999, 9, 9))
'file3'
>>> fdb.get(datetime.datetime(1999, 10, 1))
'file4'
>>> fdb.get(datetime.datetime(1999, 10, 21))
'file4'
>>> fdb.get(datetime.datetime(2050, 10, 1))
'file4'
>>> fdb.get(datetime.datetime(1996, 3, 1))
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: 'This is the first file in the DB (file0).'
>>> fdb.get(datetime.datetime(1990, 3, 1))
Traceback (most recent call last):
irfpy.util.timeseriesdb.DataNotInDbError: 'This is the first file in the DB (file0).'
Getting the first time
>>> print(fdb.t0())
1996-06-01 00:00:00
Getting files
>>> print(fdb.getfiles(datetime.datetime(1996, 1, 1), datetime.datetime(1997, 1, 1)))
('file0', 'file1', 'file2')
Create fazzy DB.
- Parameters:
guessed_db – Guessed database (DB object)
func_getstart – A function that takes a file name and returns its start time.
- logger = <Logger irfpy.util.timeseriesdb.FazzyDB (DEBUG)>
- get_exactstart(filename)
Return the exact start time of the file.
The exact start time is obtained either by actually evaluating the data or from the cache.
The function given to __init__ (func_getstart) is used.
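The combination of an expensive start-time function and a cache can be illustrated with a small self-contained sketch. It performs a binary search over the chronologically ordered files and evaluates the exact start time only for the files the search visits, remembering each result. This is one possible strategy under stated assumptions, not a description of FazzyDB's internals: `make_lookup` is hypothetical, it uses only the order of the files (not the guessed times), and it raises `LookupError` rather than `DataNotInDbError`.

```python
import datetime
import functools

calls = []  # records every real (uncached) evaluation, for demonstration

REAL_STARTS = {
    "file0": datetime.datetime(1996, 6, 1),
    "file1": datetime.datetime(1996, 8, 1),
    "file2": datetime.datetime(1996, 10, 1),
    "file3": datetime.datetime(1999, 8, 1),
    "file4": datetime.datetime(1999, 10, 1),
}


def real_start(filename):
    """Emulates an expensive start-time retrieval."""
    calls.append(filename)
    return REAL_STARTS[filename]


def make_lookup(ordered_files, exact_start):
    """Build a lookup over files given in chronological order (hypothetical)."""
    cached_start = functools.lru_cache(maxsize=None)(exact_start)

    def get(t):
        lo, hi = 0, len(ordered_files)
        # Invariant: files before lo start at or before t,
        # files from hi onward start after t.
        while lo < hi:
            mid = (lo + hi) // 2
            if cached_start(ordered_files[mid]) <= t:
                lo = mid + 1
            else:
                hi = mid
        if lo == 0:
            raise LookupError(f"{t} is before the first file")
        return ordered_files[lo - 1]

    return get


get = make_lookup(["file0", "file1", "file2", "file3", "file4"], real_start)
print(get(datetime.datetime(1999, 3, 1)))
print(get(datetime.datetime(1996, 8, 1)))
print(len(calls), "exact start evaluations so far")
```

Each file's exact start time is evaluated at most once, so repeated lookups become cheap after the first few queries.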