Data center, a new coherent way to access data

Introduction

Space physics data often has the following characteristics.

Data is time-tagged
Data is stored in multiple files, organized in a folder structure
Each data file contains multiple time-tagged records
Data at a specific time has a defined structure
The structure is usually consistent (e.g. vector, matrix, array, etc.)

For example, magnetic field data measured by a spacecraft satisfies all of the above conditions.

Data is time-tagged. Yes. Each measurement is a vector at a specific time.
Data is stored in multiple files. Yes. The data may be split into one file per day, for example, covering the whole mission lifetime.
Each data file contains multiple time-tagged records. Yes. One file (e.g., one day of data) contains 86400 records if the time resolution is 1 second.
Data at a specific time has a defined structure. Yes. The data at a specific time is a vector.
The structure is usually consistent (e.g. vector, matrix, array, etc.). Yes. The same vector structure applies throughout the mission.
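The arithmetic in the magnetometer example can be checked with a short sketch. It assumes one file per day and a fixed 1-second cadence; the helper `day_file_start` is hypothetical and is not part of irfpy.

```python
import datetime

# Assumption: fixed 1-second cadence, one file per day.
resolution = datetime.timedelta(seconds=1)
one_day = datetime.timedelta(days=1)

# Number of time-tagged records in one daily file.
records_per_file = one_day // resolution


def day_file_start(t):
    """Return the start time of the (hypothetical) daily file containing t."""
    return datetime.datetime(t.year, t.month, t.day)


t = datetime.datetime(2017, 4, 1, 10, 10)
file_start = day_file_start(t)
# Position of the record for time t within its daily file.
index_in_file = (t - file_start) // resolution

print(records_per_file)  # 86400
print(index_in_file)     # 36600
```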

To access such data easily, irfpy provides the Data Center functionality.

Added in version v4.4.7: The irfpy.util.datacenter module and the abstract base class irfpy.util.datacenter.BaseDataCenter were added for this purpose.

For user

Ideally, each dataset provides one or more data centers. For example, the irfpy.aspera project provides several data centers.

VEX/IMA provides irfpy.vima.rawdata.DataCenterCount2dEmulated
VEX/NPD provides irfpy.vnpd.vnpddata.DataCenterRawMode
VEX/MAG provides irfpy.vmag.scidata1s.DataCenterMag1s
and more…

How to prepare the data

How to prepare the data depends on the project. You can usually find the instructions in the project's documentation on this website, or ask the people responsible for the dataset about the setup.

A typical procedure is:

Prepare the dataset (for example, download it from a website)
Set the path of the downloaded dataset in .irfpyrc.
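As an illustration of the second step, the sketch below reads a dataset path from configuration text. It assumes that .irfpyrc is an INI-style file; the section name `myproject`, the option name `database`, and the path are placeholders rather than real irfpy settings, so check each project's documentation for the actual names.

```python
import configparser

# Assumption: .irfpyrc is an INI-style text file. The section "myproject"
# and the option "database" are placeholders, not real irfpy keys.
sample_rc = """
[myproject]
database = /data/myproject/v1
"""

parser = configparser.ConfigParser()
parser.read_string(sample_rc)

# Look up the dataset location that a data center would read from.
dataset_path = parser.get("myproject", "database")
print(dataset_path)  # /data/myproject/v1
```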

A sample dataset

In this tutorial, we use the irfpy.util.datacenter.SampleDataCenter class.

First, create a sample data folder.

>>> from irfpy.util import datacenter
>>> datacenter.create_sample_folder()

This creates a folder named _datacenter_sample in the current directory.

Next, create a sample data center object.

>>> dc = datacenter.SampleDataCenter()

You can print the status of the data center.

>>> dc.print_status()
# DataCenterStatus:
# - Time range:
#   - Start: 2017-03-30 00:00:00
#   - Stop:  2017-04-02 23:40:00
# - Cache:
#   - Size: 2
# - DB:
#   - Size: 4

How to read the data

General reading

All data centers support the following three typical reading schemes.

Get the data at a specific time

To get the data at a specific time, you can usually use the irfpy.util.datacenter.BaseDataCenter.nearest() method, or a similar one.
```
>>> import datetime
>>> t0 = datetime.datetime(2017, 4, 1, 10, 10)
>>> tobs, dat = dc.nearest(t0)
```
The data center returns a tuple with two elements. The first is the observation time, and the second is the data.

Let’s look at them.
```
>>> print(tobs)
2017-04-01 10:20:00
```
The nearest data was observed 10 minutes after the requested time.
```
>>> print(dat)
['2017' '4' '1' '10' '20']
```
The corresponding data is a five-element NumPy array.
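How a nearest-time lookup can work on sorted time tags is sketched below. This is not irfpy's implementation: the standalone `nearest` function is hypothetical, and the 35-minute sampling grid is an assumption inferred from the sample outputs in this tutorial.

```python
import bisect
import datetime


def nearest(times, t):
    """Return the element of the sorted list `times` closest to `t`."""
    i = bisect.bisect_left(times, t)
    # Only the neighbours just before and just after t can be the nearest.
    candidates = times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda tobs: abs(tobs - t))


# Assumption: samples every 35 minutes, starting at 2017-03-30 00:00.
start = datetime.datetime(2017, 3, 30)
step = datetime.timedelta(minutes=35)
times = [start + k * step for k in range(165)]

t0 = datetime.datetime(2017, 4, 1, 10, 10)
tobs = nearest(times, t0)
dat = [str(v) for v in (tobs.year, tobs.month, tobs.day, tobs.hour, tobs.minute)]

print(tobs)  # 2017-04-01 10:20:00
print(dat)   # ['2017', '4', '1', '10', '20']
```

The binary search keeps the lookup fast even for long time series, because only two candidates are compared after the insertion point is found.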

Get the data in a range of time

To get the data over a time range, you can use the irfpy.util.datacenter.BaseDataCenter.get_array() method, or a similar one.

>>> t0 = datetime.datetime(2017, 3, 31)
>>> t1 = datetime.datetime(2017, 4, 1)
>>> tobs, dat = dc.get_array(t0, t1)

These methods return a tuple with two elements. The first is a list of the sampled times, and the second is a list of the corresponding data.

>>> print(tobs)
[datetime.datetime(2017, 3, 31, 0, 30),
 datetime.datetime(2017, 3, 31, 1, 5),
 ...,
 datetime.datetime(2017, 3, 31, 23, 15),
 datetime.datetime(2017, 3, 31, 23, 50)]

>>> print(dat)
[array(['2017', '3', '31', '0', '30'], dtype='<U4'),
 array(['2017', '3', '31', '1', '5'], dtype='<U4'),
 ...
 array(['2017', '3', '31', '23', '15'], dtype='<U4'),
 array(['2017', '3', '31', '23', '50'], dtype='<U4')]
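The range selection can be pictured as slicing two parallel, time-sorted lists. The sketch below is a stand-in and not irfpy's code: `get_range` is a hypothetical helper, the 35-minute grid is inferred from the outputs above, and the half-open interval (start included, end excluded) is an assumption, since this page does not state how get_array() treats the end time.

```python
import bisect
import datetime


def get_range(times, data, t0, t1):
    """Return (times, data) restricted to t0 <= time < t1."""
    i0 = bisect.bisect_left(times, t0)
    i1 = bisect.bisect_left(times, t1)
    return times[i0:i1], data[i0:i1]


# Assumption: samples every 35 minutes, starting at 2017-03-30 00:00.
start = datetime.datetime(2017, 3, 30)
step = datetime.timedelta(minutes=35)
times = [start + k * step for k in range(165)]
data = [[str(v) for v in (t.year, t.month, t.day, t.hour, t.minute)]
        for t in times]

t0 = datetime.datetime(2017, 3, 31)
t1 = datetime.datetime(2017, 4, 1)
tobs, dat = get_range(times, data, t0, t1)

print(len(tobs))          # 41
print(tobs[0], dat[0])    # 2017-03-31 00:30:00 ['2017', '3', '31', '0', '30']
print(tobs[-1], dat[-1])  # 2017-03-31 23:50:00 ['2017', '3', '31', '23', '50']
```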

For-loop

You can use a for-loop (iteration) to go through the data one record at a time.

>>> t0 = datetime.datetime(2017, 3, 31)
>>> t1 = datetime.datetime(2017, 4, 1)
>>> for tobs, dat in dc.iter(t0, t1):
...     print(tobs, dat)
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
...
2017-03-31 23:15:00 ['2017' '3' '31' '23' '15']
2017-03-31 23:50:00 ['2017' '3' '31' '23' '50']

Note

Which should be used: the array approach (get_array()) or the for-loop approach (iter())?

The array approach is better in terms of

Intuitiveness

Interactivity

Time series analysis

The for-loop approach is better in terms of

Memory efficiency for long time periods

Thus, the array approach is suited to developing code on a short period of data, and the for-loop approach is suited to scripts that compute statistics over long periods.
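The trade-off can be illustrated with a generator that stands in for a data center. In this sketch `iter_samples` and its synthetic values (minutes since midnight) are hypothetical; the point is only that the array approach holds every record in memory, while the loop keeps just a running sum and a counter.

```python
import datetime


def iter_samples(t0, t1, step):
    """Yield (time, value) pairs one at a time; nothing is stored."""
    t = t0
    while t < t1:
        # Hypothetical measurement: minutes elapsed since midnight.
        yield t, t.hour * 60 + t.minute
        t += step


t0 = datetime.datetime(2017, 3, 31, 0, 30)
t1 = datetime.datetime(2017, 4, 1)
step = datetime.timedelta(minutes=35)

# Array approach: all records are held in memory, convenient for inspection.
tobs, dat = zip(*iter_samples(t0, t1, step))
mean_from_array = sum(dat) / len(dat)

# For-loop approach: only a running sum and a counter are kept.
total = 0
count = 0
for _, value in iter_samples(t0, t1, step):
    total += value
    count += 1
mean_from_loop = total / count

print(count, mean_from_array, mean_from_loop)  # 41 730.0 730.0
```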

Specific reading

Each data center can provide specific ways of reading or formatting the data. See the description of each data center.

For developer

Note

Changed in version v4.4.8: Overriding the method irfpy.util.datacenter.BaseDataCenter.exact_starttime() is no longer recommended. The method not only returns the exact start time, but also checks the validity of the data file. If the data file is invalid, the file is disregarded.

A sample implementation of a data center can be found in irfpy.util.datacenter.SampleDataCenter.
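To give a feel for what a file-based data center has to manage, here is a self-contained toy. It does not subclass BaseDataCenter, and none of its names (`ToyDataCenter`, `_read`, `cache_size`) come from irfpy; it only illustrates, under those assumptions, an index of file start times combined with a small cache of recently read files.

```python
import bisect
import datetime
from collections import OrderedDict


class ToyDataCenter:
    """Toy stand-in for a file-based data center (not irfpy's API)."""

    def __init__(self, files, cache_size=2):
        # files: dict mapping each file's start time to a loader callable.
        self._loaders = dict(files)
        self._starts = sorted(self._loaders)
        self._cache = OrderedDict()
        self._cache_size = cache_size

    def _read(self, start):
        """Load one file through a least-recently-used cache."""
        if start in self._cache:
            self._cache.move_to_end(start)
            return self._cache[start]
        contents = self._loaders[start]()
        self._cache[start] = contents
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # drop the oldest entry
        return contents

    def iter(self, t0, t1):
        """Yield (time, data) with t0 <= time < t1, reading file by file."""
        # Begin with the last file that starts at or before t0.
        i = max(bisect.bisect_right(self._starts, t0) - 1, 0)
        for start in self._starts[i:]:
            if start >= t1:
                break
            for t, d in self._read(start):
                if t0 <= t < t1:
                    yield t, d


# Assumption: four daily "files" on a 35-minute grid, held in memory.
step = datetime.timedelta(minutes=35)
grid = [datetime.datetime(2017, 3, 30) + k * step for k in range(165)]
files = {}
for t in grid:
    day = datetime.datetime(t.year, t.month, t.day)
    record = [str(v) for v in (t.year, t.month, t.day, t.hour, t.minute)]
    files.setdefault(day, []).append((t, record))

dc = ToyDataCenter({day: (lambda rows=rows: rows) for day, rows in files.items()})
result = list(dc.iter(datetime.datetime(2017, 3, 31), datetime.datetime(2017, 4, 1)))
print(len(result), result[0][0], result[-1][0])
```

Reading file by file means that only the files overlapping the requested range are loaded, and the cache bounds how many of them stay in memory at once.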
