Data center, a new coherent way to access data

Introduction

Space physics data is often organized with the following characteristics:

  • Data is time-tagged

  • Data is split across multiple files, arranged in a folder structure

  • Each data file contains multiple time-tagged records

  • Data at a specific time has a defined structure

  • The structure is usually coherent, i.e. the same for every record (e.g. vector, matrix, array)

For example, magnetic field data measured by a spacecraft satisfies the above conditions.

  • Data is time-tagged. Yes. Each measurement is a vector at a specific time.

  • Data is split across multiple files. Yes. The data may, for example, be stored in one file per day, covering the whole mission lifetime.

  • Each data file contains multiple time-tagged records. Yes. One file (e.g., one day of data) contains 86400 records if the time resolution is 1 second.

  • Data at a specific time has a defined structure. Yes. The data at a specific time is a vector.

  • The structure is usually coherent. Yes. The same vector structure applies throughout the mission.
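
As a rough illustration of this layout, the sketch below builds the content of one hypothetical daily file at 1-second resolution. The function name `make_one_day` and the zero-valued placeholder vectors are assumptions made for this example only; they are not part of irfpy and do not represent real measurements.

```python
import datetime


def make_one_day(day, resolution_s=1):
    """Return (times, vectors) mimicking one hypothetical daily data file."""
    start = datetime.datetime(day.year, day.month, day.day)
    n = 86400 // resolution_s  # number of records in one day
    times = [start + datetime.timedelta(seconds=i * resolution_s)
             for i in range(n)]
    # Placeholder 3-component vectors; a real file would hold measurements.
    vectors = [(0.0, 0.0, 0.0) for _ in range(n)]
    return times, vectors


times, vectors = make_one_day(datetime.date(2017, 4, 1))
print(len(times), times[0], times[-1])
```

Every record pairs one time tag with one vector of the same shape, which is the kind of structure a data center is meant to serve.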

To access such data easily, irfpy provides the Data Center functionality.

New in version v4.4.7: The irfpy.util.datacenter module and the abstract base class irfpy.util.datacenter.BaseDataCenter were added for this purpose.

For users

Ideally, each dataset provides one or more data centers. For example, the irfpy.aspera project provides several data centers.

  • VEX/IMA provides irfpy.vima.rawdata.DataCenterCount2dEmulated

  • VEX/NPD provides irfpy.vnpd.vnpddata.DataCenterRawMode

  • VEX/MAG provides irfpy.vmag.scidata1s.DataCenterMag1s

  • and more…

How to prepare the data

Preparation of the data depends on the project. Usually you can find the information somewhere on this website, or you can ask the people responsible for the setup.

The typical procedure is:

  1. Prepare the dataset (e.g. download it from a website).

  2. Set the path of the downloaded dataset in .irfpyrc.

A sample dataset

In this tutorial, we use the irfpy.util.datacenter.SampleDataCenter class.

First, create a sample data folder.

>>> from irfpy.util import datacenter
>>> datacenter.create_sample_folder()

A folder named _datacenter_sample then appears in the current directory.

Next, create a sample data center object.

>>> dc = datacenter.SampleDataCenter()

You can print out the status of the data center.

>>> dc.print_status()
# DataCenterStatus:
# - Time range:
#   - Start: 2017-03-30 00:00:00
#   - Stop:  2017-04-02 23:40:00
# - Cache:
#   - Size: 2
# - DB:
#   - Size: 4

How to read the data

General reading

All data centers support the following three typical reading schemes.

  1. Get the data at a specific time

    To get the data at a specific time, you would usually use the irfpy.util.datacenter.BaseDataCenter.nearest() method or a similar one.

    >>> import datetime
    >>> t0 = datetime.datetime(2017, 4, 1, 10, 10)
    >>> tobs, dat = dc.nearest(t0)
    

    The data center returns a tuple with two elements. The first is the observation time, and the second is the data.

    Let’s see them.

    >>> print(tobs)
    2017-04-01 10:20:00
    

    The nearest data was observed 10 minutes after the requested time.

    >>> print(dat)
    ['2017' '4' '1' '10' '20']
    

    The corresponding data is a five-element NumPy array.

  2. Get the data in a range of time

    To get the data over a time range, you may use the irfpy.util.datacenter.BaseDataCenter.get_array() method or a similar one.

    >>> t0 = datetime.datetime(2017, 3, 31)
    >>> t1 = datetime.datetime(2017, 4, 1)
    >>> tobs, dat = dc.get_array(t0, t1)
    

    These methods return a tuple with two elements. The first is a list of the sample times, and the second is a list of the corresponding data.

    >>> print(tobs)
    [datetime.datetime(2017, 3, 31, 0, 30),
     datetime.datetime(2017, 3, 31, 1, 5),
     ...,
     datetime.datetime(2017, 3, 31, 23, 15),
     datetime.datetime(2017, 3, 31, 23, 50)]
    
    >>> print(dat)
    [array(['2017', '3', '31', '0', '30'], dtype='<U4'),
     array(['2017', '3', '31', '1', '5'], dtype='<U4'),
     ...
     array(['2017', '3', '31', '23', '15'], dtype='<U4'),
     array(['2017', '3', '31', '23', '50'], dtype='<U4')]
    
  3. For-loop

    You can use a for-loop (or iteration) to go through the data one by one.

    >>> t0 = datetime.datetime(2017, 3, 31)
    >>> t1 = datetime.datetime(2017, 4, 1)
    >>> for tobs, dat in dc.iter(t0, t1):
    ...     print(tobs, dat)
    2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
    2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
    ...
    2017-03-31 23:15:00 ['2017' '3' '31' '23' '15']
    2017-03-31 23:50:00 ['2017' '3' '31' '23' '50']
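
To make the three reading schemes concrete without any dataset, here is a minimal in-memory stand-in. The class `ToyDataCenter`, its synthetic 30-minute records, and the half-open interval `t0 <= time < t1` are all assumptions for illustration; this is not how irfpy implements its data centers, only a sketch that mimics the same three calls.

```python
import bisect
import datetime


class ToyDataCenter:
    """Hypothetical in-memory stand-in; not part of irfpy."""

    def __init__(self, times, data):
        # Times must be sorted in ascending order, one record per time.
        self._times = list(times)
        self._data = list(data)

    def nearest(self, t):
        """Return the (time, data) pair closest to ``t``."""
        i = bisect.bisect_left(self._times, t)
        # Only the neighbours on either side of ``t`` can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        best = min(candidates, key=lambda j: abs(self._times[j] - t))
        return self._times[best], self._data[best]

    def iter(self, t0, t1):
        """Yield (time, data) pairs with t0 <= time < t1, one at a time."""
        start = bisect.bisect_left(self._times, t0)
        stop = bisect.bisect_left(self._times, t1)
        for j in range(start, stop):
            yield self._times[j], self._data[j]

    def get_array(self, t0, t1):
        """Return (list of times, list of data) with t0 <= time < t1."""
        pairs = list(self.iter(t0, t1))
        return [t for t, _ in pairs], [d for _, d in pairs]


# Synthetic records every 30 minutes over two days (made-up data).
start = datetime.datetime(2017, 3, 31)
times = [start + datetime.timedelta(minutes=30 * k) for k in range(96)]
data = [[t.year, t.month, t.day, t.hour, t.minute] for t in times]
toy = ToyDataCenter(times, data)

tobs, dat = toy.nearest(datetime.datetime(2017, 4, 1, 10, 10))
print(tobs, dat)
tlist, dlist = toy.get_array(start, datetime.datetime(2017, 4, 1))
print(len(tlist))
```

In this sketch `get_array()` is built on top of `iter()`, so the two differ only in whether the records are collected into lists or handed out one by one.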
    

Note

Which should be used, 2. array or 3. for-loop (iteration)?

The 2. array approach is better in terms of

  • Intuitiveness

  • Interactiveness

  • Time series analysis

The 3. for-loop approach is better in terms of

  • Memory efficiency for long time periods

Thus, the 2. array approach may be used while developing code on a short time period, and the 3. for-loop approach may be used in scripts that compute statistics over long periods.
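
The memory trade-off can be sketched with plain Python. The generator `records` below is a hypothetical stand-in for a data center iterator that yields made-up values; it is not an irfpy function. Both styles compute the same mean, but the loop keeps only a running sum and a counter.

```python
import datetime


def records(t0, t1, step_s=60):
    """Hypothetical record source: lazily yields (time, value) pairs."""
    t = t0
    k = 0
    while t < t1:
        yield t, float(k % 10)  # made-up values cycling through 0..9
        t += datetime.timedelta(seconds=step_s)
        k += 1


t0 = datetime.datetime(2017, 3, 31)
t1 = datetime.datetime(2017, 4, 1)

# Array style: every value is held in memory at once.
values = [v for _, v in records(t0, t1)]
mean_array = sum(values) / len(values)

# For-loop style: only a running sum and a counter are kept.
total = 0.0
count = 0
for _, v in records(t0, t1):
    total += v
    count += 1
mean_loop = total / count

print(count, mean_array, mean_loop)
```

For one day of data the difference is negligible; it starts to matter when the same script runs over months or years of records.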

Specific reading

Each data center can provide specific ways of reading or formatting the data. See each data center description.

For developers

Note

Changed in version v4.4.8: Overriding the method irfpy.util.datacenter.BaseDataCenter.exact_starttime() is no longer recommended. The method not only returns the exact start time, but also checks the validity of the data file. If the data file is invalid, the file is disregarded.

A sample implementation of a data center can be found in irfpy.util.datacenter.SampleDataCenter.
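
The idea that determining the start time doubles as a validity check can be pictured with the sketch below. Everything in it is hypothetical: the stand-alone function `exact_starttime`, the helper `build_index`, the file names, and the time-tag format are invented for illustration and do not reflect the actual irfpy implementation or its method signatures.

```python
import datetime


def exact_starttime(first_line):
    """Hypothetical parser: the first record's time tag is the start time.

    Raises ValueError when the time tag cannot be parsed.
    """
    return datetime.datetime.strptime(first_line, "%Y-%m-%dT%H:%M:%S")


def build_index(files):
    """Map start time to file name, skipping files that fail validation."""
    index = {}
    for name, first_line in files.items():
        try:
            index[exact_starttime(first_line)] = name
        except ValueError:
            # Invalid file: disregard it instead of aborting the whole scan.
            continue
    return index


# Made-up file names mapped to the first line of each file.
files = {
    "day1.dat": "2017-03-30T00:00:00",
    "day2.dat": "corrupted header",
    "day3.dat": "2017-04-01T00:10:00",
}
index = build_index(files)
print(sorted(index.values()))
```

Here the corrupted file never enters the index, so later reads do not need to handle it.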