irfpy.util.datacenter
The data center: easy implementation of, and access to, time-series datasets spread over multiple files.
This document is mainly for developers of a DataCenter. Users of an already-implemented DataCenter should refer to the document Data center, a new coherent way to access data.
For space plasma data analysis, most datasets have the following characteristics.
Data is time-tagged (time series data)
Data is split over multiple files
Each data file contains data with time-tags
Data contents are arbitrary (vector, matrix, or multi-dimensional values)
In this module, for easy data handling, a base class (BaseDataCenter) is prepared to contain multiple files as data, with coherent accessors. The accessors support:
- Users specify an instantaneous time. Then the user gets a single data point.
- Users specify a range of time. Then the user gets an array of data.
- Users specify a range of time. Then the user gets an iterator.
All the methods above return a pair of the observed time and the data.
What is the tagged time?
Time specification in space plasma data is not always straightforward. The time-tag in a data file can have at least three meanings:
The tagged time is the start of the observation
The tagged time is the middle (or the exact instant) of the observation
The tagged time is the end of the observation
Thus, two different definitions of time specification are available: “strict” and “wide”.
Assume you have a time series with a series of tagged times (Ti).
“strict”: The strict time range (t0, t1) gives you a subset of the original (Ti) with:
- Ti included if t0 <= Ti <= t1
- Ti excluded if Ti < t0
- Ti excluded if Ti > t1
“wide”: The wide time range (t0, t1) gives you a subset of the original (Ti) with:
- Ti included if Ti <= t0 < T(i+1) or T(i-1) < t1 <= Ti, in addition to the strict range.
If the specified time t is exactly at one of the time-tags, there is no difference between “strict” and “wide”; “wide” then behaves the same as “strict”. If the given time is not exactly at a time-tag, “wide” returns one more data point than “strict”.
See the example in SampleDataCenter for more details.
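The two selection rules can be sketched with plain Python on a toy series of time tags. This is only an illustration of the definitions above, not irfpy's implementation:

```python
import bisect
from datetime import datetime, timedelta

# Hypothetical time tags every 10 minutes, and a query range whose
# boundaries fall between tags.
tags = [datetime(2017, 3, 30, 0, 0) + timedelta(minutes=10 * i) for i in range(6)]
t0 = datetime(2017, 3, 30, 0, 5)   # between tags[0] and tags[1]
t1 = datetime(2017, 3, 30, 0, 35)  # between tags[3] and tags[4]

# "strict": only tags inside [t0, t1], edges inclusive.
strict = [t for t in tags if t0 <= t <= t1]

# "wide": the strict selection plus one extra tag on each side when a
# boundary does not hit a tag exactly.
wide = list(strict)
if t0 not in tags:
    i = bisect.bisect_left(tags, t0)
    if i > 0:
        wide.insert(0, tags[i - 1])
if t1 not in tags:
    j = bisect.bisect_right(tags, t1)
    if j < len(tags):
        wide.append(tags[j])

print([t.strftime('%H:%M') for t in strict])  # ['00:10', '00:20', '00:30']
print([t.strftime('%H:%M') for t in wide])    # ['00:00', '00:10', '00:20', '00:30', '00:40']
```

When t0 and t1 coincide with tags, the two lists are identical, matching the behavior described above.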
For users
This module provides a skeleton (base abstract class) for a data center. Implementations for each instrument are available in each project.
The simplest use cases are described in the sample class, SampleDataCenter.
For developer
The BaseDataCenter gives a quick implementation of a dataset accessor. The developer should implement several abstract methods. See the description under BaseDataCenter for details.
- exception irfpy.util.datacenter.DataNotInDC(value)[source]¶
Bases:
IrfpyException
Exception raised when the data is not found.
- exception irfpy.util.datacenter.IrfpyWarningNoFileInDataCenter(*args, **kwds)[source]¶
Bases:
UserWarning
- class irfpy.util.datacenter.BaseDataCenter(cache_size=25, name='Unnamed DataCenter', copy=True)[source]¶
Bases:
object
Base class of the data center.
The BaseDataCenter provides a quick implementation of time series data spread over multiple files in various formats. The user-facing interface is described at SampleDataCenter. Here, how to extend the BaseDataCenter is depicted.
The needed implementation is the following three methods.
search_files(): Search the data files to store in the data center.
approximate_starttime(): Give an extremely quick way of obtaining a start time for each data file.
read_file(): Read a single data file and return the data as an irfpy.util.timeseries.ObjectSeries object.
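The three contracts can be sketched in standalone form. The sketch below uses plain functions on throw-away files rather than an actual BaseDataCenter subclass, and the file naming and layout are invented for illustration:

```python
import glob
import os
import tempfile
from datetime import datetime

# Create two tiny hypothetical data files: whitespace-separated
# "year month day hour minute" records.
base = tempfile.mkdtemp()
for day in (30, 31):
    with open(os.path.join(base, 'SAMPLE_2017_03_%02d.dat' % day), 'w') as f:
        f.write('2017 3 %d 0 0\n2017 3 %d 0 35\n' % (day, day))

def search_files():
    """The search_files() contract: a sorted list of data-file paths."""
    return sorted(glob.glob(os.path.join(base, 'SAMPLE_*.dat')))

def approximate_starttime(filename):
    """The approximate_starttime() contract: a cheap guess from the name alone."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return datetime.strptime(stem, 'SAMPLE_%Y_%m_%d')

def read_file(filename):
    """The read_file() contract: (tlist, dlist) of equal length."""
    tlist, dlist = [], []
    with open(filename) as f:
        for line in f:
            fields = line.split()
            tlist.append(datetime(*[int(x) for x in fields]))
            dlist.append(fields)
    return tlist, dlist

files = search_files()
print(approximate_starttime(files[0]))  # 2017-03-30 00:00:00
tlist, dlist = read_file(files[0])
print(tlist[1], dlist[1])
```

In a real subclass these would be methods overriding the abstract ones, and read_file() would return the library's ObjectSeries-compatible pair.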
Changed in version v4.5: The data returned is, by default, a copy of the original. At the cost of a slight overhead in processing time and memory, the returned data is safe for users to manipulate. Users can override this default behavior with the copy keyword at call time.
Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is disregarded from the database at the time of the time check. Note that the total performance loss should be minimal, since the read data is stored in a cache, so the data center does not re-read the file. The cache size can be changed at data center creation, or via the set_cachesize() method.
Initializer.
- Parameters:
cache_size – Size of the ring cache.
name – The name of the data center.
copy – Boolean: if True, the returned data is a deep copy; if False, a reference. Returning a copy is safer, since the original data is preserved; returning a reference may be faster, but post-processing can then destroy the original data. Therefore, it is recommended to always set True. The copy value can be overridden in each method as necessary.
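The trade-off behind the copy parameter can be illustrated with plain Python (a hypothetical stand-in, not irfpy code): handing out a reference lets user-side edits corrupt the cached original, while a deep copy keeps it intact.

```python
import copy

# Hypothetical cached data, standing in for what a data center would hold.
cached = [[1, 2], [3, 4]]

# copy=False analogue: hand out a reference; user edits leak into the cache.
ref = cached
ref[0][0] = 99
print(cached[0][0])  # 99 -- the cached original is now corrupted

# copy=True analogue: hand out a deep copy; the cache stays intact.
cached = [[1, 2], [3, 4]]
safe = copy.deepcopy(cached)
safe[0][0] = -1
print(cached[0][0])  # 1 -- unchanged
```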
- set_cachesize(cache_size)[source]¶
Set the size of the cache.
You can change the data cache size (used by read_file()). The existing cache will be disregarded.
- Parameters:
cache_size – A size of the data cache
- abstract search_files()[source]¶
Search the data files, returning a list of data file.
This method searches the data files under the base_folder. It should return a list/tuple of the data file names (usually full paths). This method is called only once, when __init__() is called.
- Returns:
A list/tuple of the data files. Each entry should be a full path (or a path relative to the current directory), and the list should be sorted from earlier data to later data.
- abstract approximate_starttime(filename)[source]¶
Start time should be guessed for each file.
A guessed start time should be returned. It may be very approximate, but the order of the guessed start times must match the order of the exact start times. This method must be very fast, because it is called for every file in the database (i.e., all the files returned by the search_files() method). A practical suggestion for implementation is to guess the time from the filename.
- Parameters:
filename – A string, filename.
- Returns:
An approximate, guessed start time of the file
- Return type:
datetime.datetime
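For instance, with file names like those of the sample data set (DATACENTER_SAMPLE_DATA_2017_03_31.dat), the guess can be a single strptime call; the format string below matches that naming pattern and would have to be adapted to a real data set.

```python
from datetime import datetime

def approximate_starttime(filename):
    """Guess the start time from the file name alone (fast, no file I/O)."""
    return datetime.strptime(filename, 'DATACENTER_SAMPLE_DATA_%Y_%m_%d.dat')

print(approximate_starttime('DATACENTER_SAMPLE_DATA_2017_03_31.dat'))
# 2017-03-31 00:00:00
```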
- exact_starttime(filename)[source]¶
From a file name, the precise start time should be obtained.
The exact start time should be returned.
Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is disregarded from the database at the time of the time check.
For developer
Here, the exact start time is the start time of the data contents. For example, consider a data file “2013-10-18-12-00-00.dat”, and assume that the first data record in the file is for 2013-10-18T12:00:02. In this case, the returned time should be the latter, i.e., 2013-10-18T12:00:02. Therefore, the start time can be obtained from the contents loaded by the read_file() method, with some specific error handling (zero-size data, or a corrupted data file).
Implementation of this method is not recommended.
- Parameters:
filename – A string, filename.
- Returns:
The datetime expression of the exact start time of the file. None is allowed if the data file cannot define the exact start time; in the None case, the file is dropped from the data center.
- abstract read_file(filename)[source]¶
Read the file, returning the contents as a tuple of size 2, (tlist, dlist).
This method is abstract, meaning that the developer of the data center should implement it. See SampleDataCenter for more details. The implementation should follow these rules:
The returned value is a tuple of size 2.
- The first element is a tuple/list specifying the times (each element a datetime.datetime object).
- The second element is a tuple/list specifying the data, in any format.
- Both elements must have the same length.
If the given file is corrupted or empty, two empty tuples should be returned (i.e., return (), ()). In this case, return None from the exact_starttime() method.
- Parameters:
filename – File name
- Returns:
The contents of the data file
- Return type:
tuple
- nearest(t, copy=None)[source]¶
Return the nearest data in the data center.
- Parameters:
t – Time as a datetime object
copy – If True, the returned data is a new copy. If False, a reference. If None, the default value (see __init__()) is used. If unsure, keep it as None (default).
- Returns:
A list, (t_obs, data)
- nearest_no_later(t, copy=None)[source]¶
Return the nearest data in the data center that is no later than the given time.
- Parameters:
t – Time as a datetime object
- Returns:
A list, (t_obs, data)
- nearest_earlier_or_at(t, copy=None)¶
Return the nearest data in the data center that is earlier than or at the given time.
- Parameters:
t – Time as a datetime object
- Returns:
A list, (t_obs, data)
- nearest_no_earlier(t, copy=None)[source]¶
Return the nearest data in the data center that is no earlier than the given time.
- Parameters:
t – Time as a datetime object
- Returns:
A list, (t_obs, data)
- nearest_later_or_at(t, copy=None)¶
Return the nearest data in the data center that is later than or at the given time.
- Parameters:
t – Time as a datetime object
- Returns:
A list, (t_obs, data)
- get_array_strict(t0, t1, copy=None)[source]¶
Return the array of data in the data center.
- Parameters:
t0 – Start time as a datetime object
t1 – Stop time as a datetime object
- Returns:
A list, (tlist_obs, data_list)
The times of observation (tlist_obs) are strictly between t0 and t1 (edges inclusive).
- get_array_wide(t0, t1, copy=None)[source]¶
Return the array of data in the data center.
- Parameters:
t0 – Start time as a datetime object
t1 – Stop time as a datetime object
- Returns:
A list, (tlist_obs, data_list)
The times of observation (tlist_obs) contain data outside the given range (a single data point added in each direction) in order to account for interpretations of the time tag.
- get_array_wide_start(t0, t1, copy=None)[source]¶
Return the array of data in the data center.
- Parameters:
t0 – Start time as a datetime object
t1 – Stop time as a datetime object
- Returns:
A list, (tlist_obs, data_list)
The times of observation (tlist_obs) may contain one data point before t0, since the start boundary is interpreted as “wide”; the stop boundary is “strict”.
- get_array_wide_stop(t0, t1, copy=None)[source]¶
Return the array of data in the data center.
- Parameters:
t0 – Start time as a datetime object
t1 – Stop time as a datetime object
- Returns:
A list, (tlist_obs, data_list)
The times of observation (tlist_obs) may contain one data point after t1, since the stop boundary is interpreted as “wide”; the start boundary is “strict”.
- iter(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), wide_start=False, wide_stop=False, copy=None)[source]¶
Return the iterator of data in the data center.
- Parameters:
t0 – Start time as a datetime object.
t1 – Stop time as a datetime object.
wide_start – Set True if the start time should be interpreted as “wide”. Default False (“strict”).
wide_stop – Set True if the stop time should be interpreted as “wide”. Default False (“strict”).
- Returns:
An iterator.
- iter_strict(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]¶
Return the iterator of data in the data center.
- Parameters:
t0 – Start time as a datetime object. Interpreted as “strict”.
t1 – Stop time as a datetime object. Interpreted as “strict”.
- Returns:
An iterator.
The times of observation are strictly between t0 and t1 (edges inclusive).
- iter_wide(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]¶
Return the iterator of data in the data center.
- Parameters:
t0 – Start time as a datetime object. Interpreted as “wide”.
t1 – Stop time as a datetime object. Interpreted as “wide”.
- Returns:
An iterator.
The times of observation (tlist_obs) contain data outside the given range (a single data point added in each direction) in order to account for interpretations of the time tag.
- irfpy.util.datacenter.create_sample_folder(sample_basedir=None)[source]¶
Create a sample folder structure. It is for development and testing.
It creates the following folder structure
<sample_basedir>/y2017/
<sample_basedir>/y2017/m03/
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_30.dat
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_31.dat
<sample_basedir>/y2017/m04/
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_01.dat
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_02.dat
- Parameters:
sample_basedir – The folder under which the sample files are created. Default is None, which means the folder is created by tempfile.mkdtemp().
- Returns:
The name of the sample directory. The user should remove the directory manually.
- class irfpy.util.datacenter.SampleDataCenter(sample_basedir, *args, **kwds)[source]¶
Bases:
BaseDataCenter
A sample of a data center implementation.
How to implement a data center
Only three methods are to be implemented.
search_files(): Search the data files to store in the data center.
approximate_starttime(): Give an extremely quick way of obtaining a start time for each data file.
read_file(): Read a single data file and return the data as an irfpy.util.timeseries.ObjectSeries object.
How to read the data from a data center
Preparation. The sample folder is created by create_sample_folder().
>>> sample_folder = create_sample_folder()
>>> print(sample_folder)
/tmp/tmpirg9wgix
Create the sample data center
>>> dc = SampleDataCenter(sample_folder)
Check if the data center correctly created.
>>> print(dc.t0())  # The time of the first data
2017-03-30 00:00:00
>>> print(dc.t1())  # The time of the last data
2017-04-02 23:40:00
>>> print(len(dc))  # The size of the data center, namely, the number of files contained
4
Get the data via the nearest methods.
>>> import datetime
>>> t0 = datetime.datetime(2017, 3, 30, 17, 15)
>>> tobs, dat = dc.nearest_no_later(t0)
>>> print(tobs, dat)
2017-03-30 16:55:00 ['2017' '3' '30' '16' '55']
>>> tobs, dat = dc.nearest_no_earlier(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
>>> tobs, dat = dc.nearest(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
Iterate the data
>>> for t, d in dc.iter_strict():
...     print(t, d)
2017-03-30 00:00:00 ['2017' '3' '30' '0' '0']
2017-03-30 00:35:00 ['2017' '3' '30' '0' '35']
...
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
...
2017-04-01 23:45:00 ['2017' '4' '1' '23' '45']
2017-04-02 00:20:00 ['2017' '4' '2' '0' '20']
...
2017-04-02 23:05:00 ['2017' '4' '2' '23' '5']
2017-04-02 23:40:00 ['2017' '4' '2' '23' '40']
>>> for t, d in dc.iter_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:20:00 ['2017' '3' '30' '23' '20']
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
Getting the data as array
>>> tlist, dlist = dc.get_array_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> from pprint import pprint
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)]
>>> tlist, dlist = dc.get_array_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 20),
 datetime.datetime(2017, 3, 30, 23, 55),
 datetime.datetime(2017, 3, 31, 0, 30),
 datetime.datetime(2017, 3, 31, 1, 5)]
Remove the sample folder.
>>> import shutil, os
>>> if os.path.isdir(sample_folder):
...     shutil.rmtree(sample_folder)
Initialize the data center.
- search_files()[source]¶
Search the files, returning a list of them.
This method returns a list of the file names. Usually, full path names are used.
- Returns:
A list of the file names.
- approximate_starttime(filename)[source]¶
Get the approximate start time.
- Parameters:
filename – The name of the file
- Returns:
Approximate start time
For the sample data, the filename corresponds to the start time of the file. The name is, for example, “DATACENTER_SAMPLE_DATA_2017_03_31.dat”.
- read_file(filename)[source]¶
Read the file, returning irfpy.util.timeseries.ObjectSeries data.
- Parameters:
filename – File name
- Returns:
The contents in the format (tlist, dlist).
For the sample data, the contents form a (6,)-shaped array.