irfpy.util.datacenter

The data center: easy implementation of, and access to, time-series datasets stored in multiple files.

This document is mainly for developers of a DataCenter. Users of an already-implemented DataCenter should refer to the document "Data center, a new coherent way to access data".

For space plasma data analysis, most datasets have the following characteristics.

  • Data is time-tagged (time series data)

  • Data are split across multiple files

  • Each data file contains data with time-tag

  • Data contents are arbitrary (vector, matrix, or multi-dimensional values)

In this module, for easy data handling, I prepared a base class (BaseDataCenter) that holds multiple files as a single dataset, with coherent accessors. The accessors support nearest-sample lookup (nearest() and its variants), array extraction over a time range (get_array() and its variants), and iteration (iter() and its variants).

All the accessor methods above return a pair of the observed time and the data.

What is the tagged-time?

Time specification in space plasma data is not always straightforward. The time-tag in a dataset can have at least three meanings:

  • Tagged time is for the start of observation

  • Tagged time is for the middle (or the exact instance) of observation

  • Tagged time is for the end of observation

Thus, two different definitions of time specification are available: “strict” and “wide”. Assume you have a time series data with a series of tagged time (Ti).

  • “strict”: The strict time range (t0, t1) gives you a subset of the original (Ti) with
      - Ti included if t0 <= Ti <= t1
      - Ti excluded if Ti < t0
      - Ti excluded if Ti > t1

  • “wide”: The wide time range (t0, t1) gives you, in addition to the strict subset,
      - Ti included if Ti <= t0 < T(i+1)
      - Ti included if T(i-1) < t1 <= Ti

If a given boundary time falls exactly on one of the time-tags, “strict” and “wide” behave identically at that boundary. If it does not, “wide” returns one more data point at that boundary than “strict” does.
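The two definitions can be sketched in plain Python over a sorted list of tagged times. This is illustrative only, not the actual datacenter implementation; the helper names are made up. The example times match the SampleDataCenter doctests below.

```python
# Sketch of "strict" vs "wide" time-range selection over sorted time-tags.
import bisect
import datetime

def select_strict(tags, t0, t1):
    """All tags Ti with t0 <= Ti <= t1."""
    return [ti for ti in tags if t0 <= ti <= t1]

def select_wide(tags, t0, t1):
    """The strict subset, extended by the last tag at or before t0
    and the first tag at or after t1."""
    i0 = max(bisect.bisect_right(tags, t0) - 1, 0)   # last index with Ti <= t0
    i1 = min(bisect.bisect_left(tags, t1), len(tags) - 1)  # first index with Ti >= t1
    return tags[i0:i1 + 1]

tags = [datetime.datetime(2017, 3, 30, 23, 20),
        datetime.datetime(2017, 3, 30, 23, 55),
        datetime.datetime(2017, 3, 31, 0, 30),
        datetime.datetime(2017, 3, 31, 1, 5)]
t0 = datetime.datetime(2017, 3, 30, 23, 45)   # not on a time-tag
t1 = datetime.datetime(2017, 3, 31, 0, 45)    # not on a time-tag
print(len(select_strict(tags, t0, t1)))  # 2 (23:55 and 00:30)
print(len(select_wide(tags, t0, t1)))    # 4 (one extra tag on each side)
```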

See the examples in SampleDataCenter for more details.

For users

This module provides a skeleton (an abstract base class) for a data center. Implementations for individual instruments are provided by each project.

The simplest use cases are described in the sample class, SampleDataCenter.

For developers

The BaseDataCenter gives a quick implementation of a dataset accessor. The developer only needs to implement several abstract methods.

See the description under BaseDataCenter for details.
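As an illustration, a minimal sketch of the methods a developer implements is shown below. The class name, folder layout, and file format (one record per line: an ISO time stamp followed by numeric values) are hypothetical; a real implementation would inherit from BaseDataCenter instead of object.

```python
# Hypothetical sketch of a BaseDataCenter subclass; in real code this would be
# "class AsciiDataCenter(BaseDataCenter)" and __init__ would call super().
import datetime
import glob
import os

class AsciiDataCenter:
    def __init__(self, base_folder):
        self.base_folder = base_folder

    def search_files(self):
        """Return a sorted list of data files (full paths)."""
        return sorted(glob.glob(os.path.join(self.base_folder, '*.dat')))

    def approximate_starttime(self, filename):
        """Guess the start time cheaply, from the file name only.
        Assumed naming: YYYY_MM_DD.dat."""
        stem = os.path.splitext(os.path.basename(filename))[0]
        y, m, d = (int(x) for x in stem.split('_'))
        return datetime.datetime(y, m, d)

    def read_file(self, filename):
        """Read one file; return (tlist, dlist) of equal length."""
        tlist, dlist = [], []
        with open(filename) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 2:
                    continue   # skip malformed lines
                tlist.append(datetime.datetime.fromisoformat(fields[0]))
                dlist.append([float(x) for x in fields[1:]])
        return tlist, dlist
```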

exception irfpy.util.datacenter.DataNotInDC(value)[source]

Bases: IrfpyException

Exception raised when the data is not found.

exception irfpy.util.datacenter.IrfpyWarningNoFileInDataCenter(*args, **kwds)[source]

Bases: UserWarning

class irfpy.util.datacenter.BaseDataCenter(cache_size=25, name='Unnamed DataCenter', copy=True)[source]

Bases: object

Base class of the data center.

The BaseDataCenter provides a quick implementation of time-series data spread over multiple files in various formats. The interface for users is described in SampleDataCenter.

Here, how to extend the BaseDataCenter is described.

Three methods need to be implemented: search_files(), approximate_starttime(), and read_file().

Changed in version v4.5: The data returned is, by default, a copy of the original data. At the cost of a slight overhead in processing time and memory, the returned data is safe for users to manipulate. Users can override this default behavior with the copy keyword at call time.

Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is dropped from the database at the time of the time check.

Note that the total loss of performance should be minimal, since read data are stored in a cache, so the datacenter does not read the same file repeatedly. The cache size can be changed at datacenter creation, or via the set_cachesize() method.
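The cache can be pictured as a bounded, first-in-first-out mapping from file name to parsed contents. The following is a minimal sketch of the idea, not the actual irfpy implementation:

```python
# Toy bounded FIFO cache: when full, the oldest entry is evicted.
from collections import OrderedDict

class RingCache:
    def __init__(self, size):
        self.size = size
        self._store = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a cache miss."""
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value
        if len(self._store) > self.size:
            self._store.popitem(last=False)  # evict the oldest entry
```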

Initializer.

Parameters:
  • cache_size – Size of the ring cache.

  • name – The name of the data center.

  • copy – Boolean specifying whether the returned data is a deep copy (True) or a reference (False). Returning a copy is safer, since the original data always stays intact. Returning a reference is possibly faster, but has the side effect that post-processing can destroy the original data. Therefore, it is recommended to always set True. The copy value can be overridden by each method as needed.
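The side effect described for copy=False can be illustrated with a toy example, using plain Python lists standing in for the datacenter's cached data (all names here are hypothetical):

```python
import copy as _copy

# Pretend this dict is the datacenter's internal cache of file contents.
cache = {'2017-03-30': [1.0, 2.0, 3.0]}

def get(key, copy=True):
    data = cache[key]
    return _copy.deepcopy(data) if copy else data

d = get('2017-03-30', copy=False)
d[0] = -999.0             # mutates the cached original!
safe = get('2017-03-30')  # now returns the corrupted values

d2 = get('2017-03-30', copy=True)
d2[0] = 0.0               # with a copy, the cache is unaffected
```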

set_cachesize(cache_size)[source]

Set the size of the cache.

You can change the size of the data cache used by read_file(). The existing cache will be discarded.

Parameters:

cache_size – A size of the data cache

print_status()[source]
abstract search_files()[source]

Search the data files, returning a list of data files.

This method searches for the data files under the base_folder. It should return a list/tuple of the data file names (usually full paths).

This method is called only once, when __init__() is called.

Returns:

A list/tuple of the data file names. Each should be a full path (or a path relative to the current directory), and the list should be sorted from the earliest data to the latest.

abstract approximate_starttime(filename)[source]

A start time should be guessed for each file.

A guessed start time should be returned. It is fine if it is very approximate, but the order of the guessed start times must match the order of the exact start times. This method must be very fast, because it is called for all the files in the database (i.e., all the files returned by the search_files() method).

A practical suggestion for implementation is to guess the time from the filename.

Parameters:

filename – A string, filename.

Returns:

An approximate, guessed start time of the file

Return type:

datetime.datetime
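Following the practical suggestion above, the start time can be guessed from the file name alone. A hedged sketch, using the sample naming DATACENTER_SAMPLE_DATA_YYYY_MM_DD.dat from this module's sample data (the helper name is made up):

```python
# Guess the start time from a file name such as
# "DATACENTER_SAMPLE_DATA_2017_03_31.dat".
import datetime
import os

def approximate_starttime(filename):
    stem = os.path.splitext(os.path.basename(filename))[0]
    datepart = stem[len('DATACENTER_SAMPLE_DATA_'):]   # e.g. '2017_03_31'
    return datetime.datetime.strptime(datepart, '%Y_%m_%d')

print(approximate_starttime('DATACENTER_SAMPLE_DATA_2017_03_31.dat'))
# 2017-03-31 00:00:00
```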

exact_starttime(filename)[source]

From a file name, the precise start time should be obtained.

The exact start time should be returned.

Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is dropped from the database at the time of the time check.

For developer

Here, the exact start time is indeed the start time of the data contents. For example, consider a data file “2013-10-18-12-00-00.dat”, and assume that the first data record in the file is for 2013-10-18T12:00:02. In this case, the returned time should be the latter, i.e., 2013-10-18T12:00:02. Therefore, the start time can be obtained from the contents loaded by the read_file() method, with some specific error handling (zero-size data, or a corrupted data file).

Re-implementation of this method is not recommended.

Parameters:

filename – A string, filename.

Returns:

The datetime expression of the exact start time of the file. None is allowed if the exact start time cannot be determined from the data file. In that case, the file is dropped from the data center.

abstract read_file(filename)[source]

Read the file, and return the contents as a tuple of size 2, (tlist, dlist).

This method is an abstract method, meaning that the developer of the data center should implement it. See SampleDataCenter for more details.

The implementation of this method should follow:

  • The returned value is a tuple of size 2.
      - The first element is a tuple/list specifying the times (each element a datetime.datetime object).
      - The second element is a tuple/list specifying the data, in any format.
      - The two elements must have the same length.

If the given file is corrupted or empty, two empty tuples should be returned (i.e., return (), ()). In this case, return None from the exact_starttime() method.

Parameters:

filename – File name

Returns:

The contents of the data file

Return type:

tuple

t0()[source]

Return the time-tag of the first data.

Returns:

The time-tag of the first data.

t1()[source]

Return the time-tag of the last data.

Returns:

The time tag of the last data

nearest(t, copy=None)[source]

Return the nearest data in the data center.

Parameters:
  • t – Time in datetime object

  • copy – If True, returned data is a new copy. If False, returned data is a reference. If None, the default value (see __init__()) is used. If unaware, keep it as None (default).

Returns:

A list, (t_obs, data)

nearest_no_later(t, copy=None)[source]

Return the nearest data in the data center that is not later than the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nearest_earlier_or_at(t, copy=None)

Return the nearest data in the data center that is earlier than, or at, the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

previousof(t, copy=None)[source]
nearest_no_earlier(t, copy=None)[source]

Return the nearest data in the data center that is not earlier than the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nearest_later_or_at(t, copy=None)

Return the nearest data in the data center that is later than, or at, the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nextof(t, copy=None)[source]
get_array(t0, t1, wide_start=False, wide_stop=False, copy=None)[source]
get_array_strict(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) are selected with the “strict” range, i.e., t0 <= t <= t1 (edges inclusive).

get_array_wide(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain data outside the given range (a single data point added in each direction), in order to account for the interpretations of the time tag.

get_array_wide_start(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain one data point earlier than t0, in order to account for the interpretations of the time tag; the stop side is treated as “strict”.

get_array_wide_stop(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain one data point later than t1, in order to account for the interpretations of the time tag; the start side is treated as “strict”.

iter(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), wide_start=False, wide_stop=False, copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object.

  • t1 – Stop time in datetime object.

  • wide_start – Set True if the start time should be interpreted as “wide”. Default, False (“strict”)

  • wide_stop – Set True if the stop time should be interpreted as “wide”. Default, False (“strict”)

Returns:

An iterator.

iter_strict(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object. This is defined as “strict”

  • t1 – Stop time in datetime object. This is defined as “strict”

Returns:

An iterator.

The times of the observation are selected with the “strict” range, i.e., t0 <= t <= t1 (edges inclusive).

iter_wide(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object. This is defined as “wide”.

  • t1 – Stop time in datetime object. This is defined as “wide”.

Returns:

An iterator.

The times of the observation (tlist_obs) may contain data outside the given range (a single data point added in each direction), in order to account for the interpretations of the time tag.

iter_wide_start(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time. This is defined as “wide”

  • t1 – Stop time. This is defined as “strict”

Returns:

An iterator.

iter_wide_stop(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time. This is defined as “strict”

  • t1 – Stop time. This is defined as “wide”

Returns:

An iterator.

irfpy.util.datacenter.create_sample_folder(sample_basedir=None)[source]

Create a sample folder structure. It is for development and testing.

It creates the following folder structure

<sample_basedir>/y2017/
<sample_basedir>/y2017/m03/
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_30.dat
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_31.dat
<sample_basedir>/y2017/m04/
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_01.dat
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_02.dat
Parameters:

sample_basedir – The folder under which the sample file is created. Default is None, which means that the folder is created by tempfile.mkdtemp()

Returns:

The name of the sample directory. User should remove the directory manually.

irfpy.util.datacenter.remove_sample_folder()[source]
class irfpy.util.datacenter.SampleDataCenter(sample_basedir, *args, **kwds)[source]

Bases: BaseDataCenter

A sample of a data center implementation.

How to implement a data center

Only three methods are to be implemented.

How to read the data from a data center

  1. Preparation. The sample folder is created by create_sample_folder().

>>> sample_folder = create_sample_folder()
>>> print(sample_folder)    
/tmp/tmpirg9wgix
  2. Create the sample data center

>>> dc = SampleDataCenter(sample_folder)

Check if the data center correctly created.

>>> print(dc.t0())  # The time of the first data
2017-03-30 00:00:00
>>> print(dc.t1())  # The time of the last data
2017-04-02 23:40:00
>>> print(len(dc))  # The size of data center, namely, the number of files contained.
4
  3. Get the data via the nearest methods.

>>> import datetime
>>> t0 = datetime.datetime(2017, 3, 30, 17, 15)
>>> tobs, dat = dc.nearest_no_later(t0)
>>> print(tobs, dat)
2017-03-30 16:55:00 ['2017' '3' '30' '16' '55']
>>> tobs, dat = dc.nearest_no_earlier(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
>>> tobs, dat = dc.nearest(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
  4. Iterate over the data

>>> for t, d in dc.iter_strict():     
...     print(t, d)
2017-03-30 00:00:00 ['2017' '3' '30' '0' '0']
2017-03-30 00:35:00 ['2017' '3' '30' '0' '35']
...
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
...
2017-04-01 23:45:00 ['2017' '4' '1' '23' '45']
2017-04-02 00:20:00 ['2017' '4' '2' '0' '20']
...
2017-04-02 23:05:00 ['2017' '4' '2' '23' '5']
2017-04-02 23:40:00 ['2017' '4' '2' '23' '40']
>>> for t, d in dc.iter_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:20:00 ['2017' '3' '30' '23' '20']
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
  5. Get the data as an array

>>> tlist, dlist = dc.get_array_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> from pprint import pprint
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)]
>>> tlist, dlist = dc.get_array_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 20),
 datetime.datetime(2017, 3, 30, 23, 55),
 datetime.datetime(2017, 3, 31, 0, 30),
 datetime.datetime(2017, 3, 31, 1, 5)]
  6. Remove the sample folder.

>>> import shutil, os
>>> if os.path.isdir(sample_folder):
...     shutil.rmtree(sample_folder)

Initialize the data center.

search_files()[source]

Search the files, returning a list of them.

The method will return a list of the file names. Usually, a full path name is used.

Returns:

A list of the file names.

approximate_starttime(filename)[source]

Get the approximate start time.

Parameters:

filename – The name of the file

Returns:

Approximate start time

For the sample data, the filename corresponds to the start time of the file. The name is, for example, “DATACENTER_SAMPLE_DATA_2017_03_31.dat”.

read_file(filename)[source]

Read the file, returning an irfpy.util.timeseries.ObjectSeries data.

Parameters:

filename – File name

Returns:

The contents in the format (tlist, dlist).

For the sample data, the contents are (6,)-shaped arrays.