irfpy.util.datacenter

The data center: easy implementation of, and access to, time-series datasets stored in multiple files.

This document is mainly for developers of a DataCenter. Users of an already-implemented DataCenter should refer to the document "Data center, a new coherent way to access data".

For space plasma data analysis, most datasets have the following characteristics.

  • Data is time-tagged (time series data)

  • Data are split across multiple files

  • Each data file contains data with time-tag

  • Data contents are arbitrary (vector, matrix, or multi-dimensional values)

In this module, for easy data handling, I prepared a base class (BaseDataCenter) that holds multiple files as a single dataset, with coherent accessors. The accessors support nearest-sample lookup (nearest() and its variants), array extraction over a time range (get_array() and its variants), and iteration (iter() and its variants).

All the accessor methods above return a pair of the observed time and the data.

What is the tagged-time?

Time specification in space plasma data is not always straightforward. The time-tag in a dataset can have at least three meanings:

  • Tagged time is for the start of observation

  • Tagged time is for the middle (or the exact instance) of observation

  • Tagged time is for the end of observation

Thus, two different definitions of time specification are available: “strict” and “wide”. Assume you have a time series data with a series of tagged time (Ti).

  • “strict”: The strict time range (t0, t1) gives you a subset of the original (Ti) with
      - Ti included if t0 <= Ti <= t1
      - Ti excluded if Ti < t0
      - Ti excluded if Ti > t1

  • “wide”: The wide time range (t0, t1) gives you, in addition to the strict subset,
      - Ti included if Ti <= t0 < T(i+1)
      - Ti included if T(i-1) < t1 <= Ti

If a given boundary time falls exactly on one of the time-tags, “strict” and “wide” behave identically at that boundary. If it does not, “wide” returns one more data point at that boundary than “strict” does.
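The two definitions can be sketched in plain Python over a sorted list of tagged times. This is illustrative only, not the actual datacenter implementation; the helper names are made up. The example times match the SampleDataCenter doctests below.

```python
# Sketch of "strict" vs "wide" time-range selection over sorted time-tags.
import bisect
import datetime

def select_strict(tags, t0, t1):
    """All tags Ti with t0 <= Ti <= t1."""
    return [ti for ti in tags if t0 <= ti <= t1]

def select_wide(tags, t0, t1):
    """The strict subset, extended by the last tag at or before t0
    and the first tag at or after t1."""
    i0 = max(bisect.bisect_right(tags, t0) - 1, 0)   # last index with Ti <= t0
    i1 = min(bisect.bisect_left(tags, t1), len(tags) - 1)  # first index with Ti >= t1
    return tags[i0:i1 + 1]

tags = [datetime.datetime(2017, 3, 30, 23, 20),
        datetime.datetime(2017, 3, 30, 23, 55),
        datetime.datetime(2017, 3, 31, 0, 30),
        datetime.datetime(2017, 3, 31, 1, 5)]
t0 = datetime.datetime(2017, 3, 30, 23, 45)   # not on a time-tag
t1 = datetime.datetime(2017, 3, 31, 0, 45)    # not on a time-tag
print(len(select_strict(tags, t0, t1)))  # 2 (23:55 and 00:30)
print(len(select_wide(tags, t0, t1)))    # 4 (one extra tag on each side)
```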

See the examples in SampleDataCenter for more details.

For users

This module provides a skeleton (an abstract base class) for a data center. Implementations for individual instruments are provided by each project.

The simplest use cases are described in the sample class, SampleDataCenter.

For developers

The BaseDataCenter gives a quick implementation of a dataset accessor. The developer only needs to implement several abstract methods.

See the description under BaseDataCenter for details.
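As an illustration, a minimal sketch of the methods a developer implements is shown below. The class name, folder layout, and file format (one record per line: an ISO time stamp followed by numeric values) are hypothetical; a real implementation would inherit from BaseDataCenter instead of object.

```python
# Hypothetical sketch of a BaseDataCenter subclass; in real code this would be
# "class AsciiDataCenter(BaseDataCenter)" and __init__ would call super().
import datetime
import glob
import os

class AsciiDataCenter:
    def __init__(self, base_folder):
        self.base_folder = base_folder

    def search_files(self):
        """Return a sorted list of data files (full paths)."""
        return sorted(glob.glob(os.path.join(self.base_folder, '*.dat')))

    def approximate_starttime(self, filename):
        """Guess the start time cheaply, from the file name only.
        Assumed naming: YYYY_MM_DD.dat."""
        stem = os.path.splitext(os.path.basename(filename))[0]
        y, m, d = (int(x) for x in stem.split('_'))
        return datetime.datetime(y, m, d)

    def read_file(self, filename):
        """Read one file; return (tlist, dlist) of equal length."""
        tlist, dlist = [], []
        with open(filename) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 2:
                    continue   # skip malformed lines
                tlist.append(datetime.datetime.fromisoformat(fields[0]))
                dlist.append([float(x) for x in fields[1:]])
        return tlist, dlist
```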

exception irfpy.util.datacenter.DataNotInDC(value)[source]

Bases: IrfpyException

Exception raised when the data is not found.

exception irfpy.util.datacenter.IrfpyWarningNoFileInDataCenter(*args, **kwds)[source]

Bases: UserWarning

class irfpy.util.datacenter.BaseDataCenter(cache_size=25, name='Unnamed DataCenter', copy=True)[source]

Bases: object

Base class of the data center.

The BaseDataCenter provides a quick implementation of time-series data spread over multiple files in various formats. The interface for users is described in SampleDataCenter.

Here, how to extend the BaseDataCenter is described.

Three methods need to be implemented: search_files(), approximate_starttime(), and read_file().

Changed in version v4.5: The data returned is, by default, a copy of the original data. At the cost of a slight overhead in processing time and memory, the returned data is safe for users to manipulate. Users can override this default behavior with the copy keyword at call time.

Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is dropped from the database at the time of the time check.

Note that the total loss of performance should be minimal, since read data are stored in a cache, so the datacenter does not read the same file repeatedly. The cache size can be changed at datacenter creation, or via the set_cachesize() method.
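The cache can be pictured as a bounded, first-in-first-out mapping from file name to parsed contents. The following is a minimal sketch of the idea, not the actual irfpy implementation:

```python
# Toy bounded FIFO cache: when full, the oldest entry is evicted.
from collections import OrderedDict

class RingCache:
    def __init__(self, size):
        self.size = size
        self._store = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a cache miss."""
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value
        if len(self._store) > self.size:
            self._store.popitem(last=False)  # evict the oldest entry
```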

Initializer.

Parameters:
  • cache_size – Size of the ring cache.

  • name – The name of the data center.

  • copy – Boolean specifying whether the returned data is a deep copy (True) or a reference (False). Returning a copy is safer, since the original data always stays intact. Returning a reference is possibly faster, but has the side effect that post-processing can destroy the original data. Therefore, it is recommended to always set True. The copy value can be overridden by each method as needed.
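The side effect described for copy=False can be illustrated with a toy example, using plain Python lists standing in for the datacenter's cached data (all names here are hypothetical):

```python
import copy as _copy

# Pretend this dict is the datacenter's internal cache of file contents.
cache = {'2017-03-30': [1.0, 2.0, 3.0]}

def get(key, copy=True):
    data = cache[key]
    return _copy.deepcopy(data) if copy else data

d = get('2017-03-30', copy=False)
d[0] = -999.0             # mutates the cached original!
safe = get('2017-03-30')  # now returns the corrupted values

d2 = get('2017-03-30', copy=True)
d2[0] = 0.0               # with a copy, the cache is unaffected
```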

set_cachesize(cache_size)[source]

Set the size of the cache.

You can change the size of the data cache used by read_file(). The existing cache will be discarded.

Parameters:

cache_size – A size of the data cache

print_status()[source]
abstract search_files()[source]

Search the data files, returning a list of data files.

This method searches for the data files under the base_folder. It should return a list/tuple of the data file names (usually full paths).

This method is called only once, when __init__() is called.

Returns:

A list/tuple of the data file names. Each should be a full path (or a path relative to the current directory), and the list should be sorted from the earliest data to the latest.

abstract approximate_starttime(filename)[source]

A start time should be guessed for each file.

A guessed start time should be returned. It is fine if it is very approximate, but the order of the guessed start times must match the order of the exact start times. This method must be very fast, because it is called for all the files in the database (i.e., all the files returned by the search_files() method).

A practical suggestion for implementation is to guess the time from the filename.

Parameters:

filename – A string, filename.

Returns:

An approximate, guessed start time of the file

Return type:

datetime.datetime
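Following the practical suggestion above, the start time can be guessed from the file name alone. A hedged sketch, using the sample naming DATACENTER_SAMPLE_DATA_YYYY_MM_DD.dat from this module's sample data (the helper name is made up):

```python
# Guess the start time from a file name such as
# "DATACENTER_SAMPLE_DATA_2017_03_31.dat".
import datetime
import os

def approximate_starttime(filename):
    stem = os.path.splitext(os.path.basename(filename))[0]
    datepart = stem[len('DATACENTER_SAMPLE_DATA_'):]   # e.g. '2017_03_31'
    return datetime.datetime.strptime(datepart, '%Y_%m_%d')

print(approximate_starttime('DATACENTER_SAMPLE_DATA_2017_03_31.dat'))
# 2017-03-31 00:00:00
```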

exact_starttime(filename)[source]

From a file name, the precise start time should be obtained.

The exact start time should be returned.

Changed in version v4.4.8a2: It is no longer recommended to re-implement exact_starttime(). This is because exact_starttime() should also judge whether the data file is properly formatted and can be interpreted. If the data file is not proper, the file is dropped from the database at the time of the time check.

For developer

Here, the exact start time is indeed the start time of the data contents. For example, consider a data file “2013-10-18-12-00-00.dat”, and assume that the first data record in the file is for 2013-10-18T12:00:02. In this case, the returned time should be the latter, i.e., 2013-10-18T12:00:02. Therefore, the start time can be obtained from the contents loaded by the read_file() method, with some specific error handling (zero-size data, or a corrupted data file).

Re-implementation of this method is not recommended.

Parameters:

filename – A string, filename.

Returns:

The datetime expression of the exact start time of the file. None is allowed if the exact start time cannot be determined from the data file. In that case, the file is dropped from the data center.

abstract read_file(filename)[source]

Read the file, and return the contents as a tuple of size 2, (tlist, dlist).

This method is an abstract method, meaning that the developer of the data center should implement it. See SampleDataCenter for more details.

The implementation of this method should follow:

  • The returned value is a tuple of size 2.
      - The first element is a tuple/list specifying the times (each element a datetime.datetime object).
      - The second element is a tuple/list specifying the data, in any format.
      - The two elements must have the same length.

If the given file is corrupted or empty, two empty tuples should be returned (i.e., return (), ()). In this case, return None from the exact_starttime() method.

Parameters:

filename – File name

Returns:

The contents of the data file

Return type:

tuple

t0()[source]

Return the time-tag of the first data.

Returns:

The time-tag of the first data.

t1()[source]

Return the time-tag of the last data.

Returns:

The time tag of the last data

nearest(t, copy=None)[source]

Return the nearest data in the data center.

Parameters:
  • t – Time in datetime object

  • copy – If True, returned data is a new copy. If False, returned data is a reference. If None, the default value (see __init__()) is used. If unaware, keep it as None (default).

Returns:

A list, (t_obs, data)

nearest_no_later(t, copy=None)[source]

Return the nearest data in the data center that is not later than the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nearest_earlier_or_at(t, copy=None)

Return the nearest data in the data center that is earlier than, or at, the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

previousof(t, copy=None)[source]
nearest_no_earlier(t, copy=None)[source]

Return the nearest data in the data center that is not earlier than the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nearest_later_or_at(t, copy=None)

Return the nearest data in the data center that is later than, or at, the given time.

Parameters:

t – Time in datetime object

Returns:

A list, (t_obs, data)

nextof(t, copy=None)[source]
get_array(t0, t1, wide_start=False, wide_stop=False, copy=None)[source]
get_array_strict(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) are selected with the “strict” range, i.e., t0 <= t <= t1 (edges inclusive).

get_array_wide(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain data outside the given range (a single data point added in each direction), in order to account for the interpretations of the time tag.

get_array_wide_start(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain one data point earlier than t0, in order to account for the interpretations of the time tag; the stop side is treated as “strict”.

get_array_wide_stop(t0, t1, copy=None)[source]

Return the array of data in the data center.

Parameters:
  • t0 – Start time in datetime object

  • t1 – Stop time in datetime object

Returns:

A list, (tlist_obs, data_list)

The times of the observation (tlist_obs) may contain one data point later than t1, in order to account for the interpretations of the time tag; the start side is treated as “strict”.

iter(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), wide_start=False, wide_stop=False, copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object.

  • t1 – Stop time in datetime object.

  • wide_start – Set True if the start time should be interpreted as “wide”. Default, False (“strict”)

  • wide_stop – Set True if the stop time should be interpreted as “wide”. Default, False (“strict”)

Returns:

An iterator.

iter_strict(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object. This is defined as “strict”

  • t1 – Stop time in datetime object. This is defined as “strict”

Returns:

An iterator.

The times of the observation are selected with the “strict” range, i.e., t0 <= t <= t1 (edges inclusive).

iter_wide(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time in datetime object. This is defined as “wide”.

  • t1 – Stop time in datetime object. This is defined as “wide”.

Returns:

An iterator.

The times of the observation (tlist_obs) may contain data outside the given range (a single data point added in each direction), in order to account for the interpretations of the time tag.

iter_wide_start(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time. This is defined as “wide”

  • t1 – Stop time. This is defined as “strict”

Returns:

An iterator.

iter_wide_stop(t0=datetime.datetime(1, 1, 1, 0, 0), t1=datetime.datetime(9999, 12, 31, 23, 59, 59, 999999), copy=None)[source]

Return the iterator of data in the data center.

Parameters:
  • t0 – Start time. This is defined as “strict”

  • t1 – Stop time. This is defined as “wide”

Returns:

An iterator.

irfpy.util.datacenter.create_sample_folder(sample_basedir=None)[source]

Create a sample folder structure. It is for development and testing.

It creates the following folder structure

<sample_basedir>/y2017/
<sample_basedir>/y2017/m03/
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_30.dat
<sample_basedir>/y2017/m03/DATACENTER_SAMPLE_DATA_2017_03_31.dat
<sample_basedir>/y2017/m04/
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_01.dat
<sample_basedir>/y2017/m04/DATACENTER_SAMPLE_DATA_2017_04_02.dat
Parameters:

sample_basedir – The folder under which the sample file is created. Default is None, which means that the folder is created by tempfile.mkdtemp()

Returns:

The name of the sample directory. User should remove the directory manually.

irfpy.util.datacenter.remove_sample_folder()[source]
class irfpy.util.datacenter.SampleDataCenter(sample_basedir, *args, **kwds)[source]

Bases: BaseDataCenter

A sample of a data center implementation.

How to implement a data center

Only three methods are to be implemented.

How to read the data from a data center

  1. Preparation. The sample folder is created by create_sample_folder().

>>> sample_folder = create_sample_folder()
>>> print(sample_folder)    
/tmp/tmpirg9wgix
  2. Create the sample data center

>>> dc = SampleDataCenter(sample_folder)

Check if the data center correctly created.

>>> print(dc.t0())  # The time of the first data
2017-03-30 00:00:00
>>> print(dc.t1())  # The time of the last data
2017-04-02 23:40:00
>>> print(len(dc))  # The size of data center, namely, the number of files contained.
4
  3. Get the data via the nearest methods.

>>> import datetime
>>> t0 = datetime.datetime(2017, 3, 30, 17, 15)
>>> tobs, dat = dc.nearest_no_later(t0)
>>> print(tobs, dat)
2017-03-30 16:55:00 ['2017' '3' '30' '16' '55']
>>> tobs, dat = dc.nearest_no_earlier(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
>>> tobs, dat = dc.nearest(t0)
>>> print(tobs, dat)
2017-03-30 17:30:00 ['2017' '3' '30' '17' '30']
  4. Iterate over the data

>>> for t, d in dc.iter_strict():     
...     print(t, d)
2017-03-30 00:00:00 ['2017' '3' '30' '0' '0']
2017-03-30 00:35:00 ['2017' '3' '30' '0' '35']
...
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
...
2017-04-01 23:45:00 ['2017' '4' '1' '23' '45']
2017-04-02 00:20:00 ['2017' '4' '2' '0' '20']
...
2017-04-02 23:05:00 ['2017' '4' '2' '23' '5']
2017-04-02 23:40:00 ['2017' '4' '2' '23' '40']
>>> for t, d in dc.iter_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45)):
...     print(t, d)
2017-03-30 23:20:00 ['2017' '3' '30' '23' '20']
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
>>> for t, d in dc.iter_wide(datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)):
...     print(t, d)
2017-03-30 23:55:00 ['2017' '3' '30' '23' '55']
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
  5. Get the data as an array

>>> tlist, dlist = dc.get_array_strict(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> from pprint import pprint
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 55), datetime.datetime(2017, 3, 31, 0, 30)]
>>> tlist, dlist = dc.get_array_wide(datetime.datetime(2017, 3, 30, 23, 45), datetime.datetime(2017, 3, 31, 0, 45))
>>> pprint(tlist)
[datetime.datetime(2017, 3, 30, 23, 20),
 datetime.datetime(2017, 3, 30, 23, 55),
 datetime.datetime(2017, 3, 31, 0, 30),
 datetime.datetime(2017, 3, 31, 1, 5)]
  6. Remove the sample folder.

>>> import shutil, os
>>> if os.path.isdir(sample_folder):
...     shutil.rmtree(sample_folder)

Initialize the data center.

search_files()[source]

Search the files, returning a list of them.

The method will return a list of the file names. Usually, a full path name is used.

Returns:

A list of the file names.

approximate_starttime(filename)[source]

Get the approximate start time.

Parameters:

filename – The name of the file

Returns:

Approximate start time

For the sample data, the filename corresponds to the start time of the file. The name is, for example, “DATACENTER_SAMPLE_DATA_2017_03_31.dat”.

read_file(filename)[source]

Read the file, returning an irfpy.util.timeseries.ObjectSeries data.

Parameters:

filename – File name

Returns:

The contents in the format (tlist, dlist).

For the sample data, the contents are (6,)-shaped arrays.