.. _data_center:

==============================================
Data center, a new coherent way to access data
==============================================

Introduction
============

Space physics data is often organized with the following characteristics.

- The data is time-tagged.
- The data is split over multiple files, typically arranged in a folder structure.
- Each data file contains multiple time-tagged records.
- The data at a specific time has a defined structure.
- The structure is usually the same throughout the dataset (e.g. always a vector, a matrix, or an array).

For example, magnetic field data measured by a spacecraft satisfies the above conditions.

- The data is time-tagged. Yes: each measurement is a vector taken at a specific time.
- The data is split over multiple files. Yes: the data may, for example, be stored in one file per day, covering the whole mission lifetime.
- Each data file contains multiple time-tagged records. Yes: one file (e.g. one day of data) contains 86400 records if the time resolution is 1 second.
- The data at a specific time has a defined structure. Yes: the data at a specific time is a vector.
- The structure is usually the same throughout the dataset. Yes: the vector format applies throughout the mission.
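
As a rough illustration of the numbers in the example, the following sketch builds the time tags of one daily file at 1-second resolution and pairs each with a placeholder vector. The file-naming scheme ``mag_YYYYMMDD.dat`` and the helper names are hypothetical, chosen only for this example; they are not part of ``irfpy``.

```python
import datetime


def daily_filename(t):
    """Return the (hypothetical) name of the daily file that would hold time ``t``."""
    return "mag_{:%Y%m%d}.dat".format(t)


def sample_times(day, resolution_s=1):
    """Return the time tags of one daily file at the given resolution in seconds."""
    start = datetime.datetime(day.year, day.month, day.day)
    n = 86400 // resolution_s
    return [start + datetime.timedelta(seconds=i * resolution_s) for i in range(n)]


day = datetime.date(2017, 4, 1)
times = sample_times(day)

# Each time tag is paired with one 3-component vector (zeros as placeholders).
records = [(t, (0.0, 0.0, 0.0)) for t in times]

print(len(records))
print(daily_filename(times[-1]))
```

Every record has the same shape (a time and a three-component vector), which is the kind of regularity a data center relies on.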

To access such data easily, ``irfpy`` provides a ``Data Center`` functionality.

.. versionadded:: v4.4.7

        The :mod:`irfpy.util.datacenter` and the abstract base class
        :class:`irfpy.util.datacenter.BaseDataCenter` are added for this purpose.


For user
========

Ideally, each dataset provides one or more data centers. For example, the ``irfpy.aspera`` project provides several data centers.

- VEX/IMA provides :class:`irfpy.vima.rawdata.DataCenterCount2dEmulated`
- VEX/NPD provides :class:`irfpy.vnpd.vnpddata.DataCenterRawMode`
- VEX/MAG provides :class:`irfpy.vmag.scidata1s.DataCenterMag1s`
- and more...


How to prepare the data
-----------------------

How to prepare the data depends on the project.
Usually you can find the instructions elsewhere in this documentation;
otherwise, ask the people responsible for the project about the setup.

A typical procedure is:

1. Prepare the dataset (for example, by downloading it from a website).
2. Set the path to the downloaded dataset in ``.irfpyrc``.

A sample dataset
................

This tutorial uses the :class:`irfpy.util.datacenter.SampleDataCenter` class.

First, create a sample data folder.

        >>> from irfpy.util import datacenter
        >>> datacenter.create_sample_folder()

You can then find a folder ``_datacenter_sample`` in the current directory.

To create a sample data center object, instantiate the class:

        >>> dc = datacenter.SampleDataCenter()

You can print the status of the data center.

        >>> dc.print_status()
        # DataCenterStatus:
        # - Time range:
        #   - Start: 2017-03-30 00:00:00
        #   - Stop:  2017-04-02 23:40:00
        # - Cache:
        #   - Size: 2
        # - DB:
        #   - Size: 4


How to read the data
--------------------

General reading
...............

All data centers support the following three typical reading schemes.

1. Get the data at a specific time

   To get the data at a specific time, you usually use
   the :meth:`irfpy.util.datacenter.BaseDataCenter.nearest` method
   or a similar one.

        >>> import datetime
        >>> t0 = datetime.datetime(2017, 4, 1, 10, 10)
        >>> tobs, dat = dc.nearest(t0)

   The data center returns a tuple with two elements:
   the first is the time, and the second is the data.

   Let's see them.

        >>> print(tobs)
        2017-04-01 10:20:00

   The nearest data was observed 10 minutes after the requested time.

        >>> print(dat)
        ['2017' '4' '1' '10' '20']

   The corresponding data is a five-element NumPy array.


2. Get the data in a range of time

   To get the data over a time range, you may use the
   :meth:`irfpy.util.datacenter.BaseDataCenter.get_array` method or similar.

        >>> t0 = datetime.datetime(2017, 3, 31)
        >>> t1 = datetime.datetime(2017, 4, 1)
        >>> tobs, dat = dc.get_array(t0, t1)

   These methods return a tuple with two elements.
   The first one is a list of the sampled times, and the second one is
   a list of the corresponding data.

        >>> print(tobs)
        [datetime.datetime(2017, 3, 31, 0, 30),
         datetime.datetime(2017, 3, 31, 1, 5),
         ...,
         datetime.datetime(2017, 3, 31, 23, 15),
         datetime.datetime(2017, 3, 31, 23, 50)]

        >>> print(dat)
        [array(['2017', '3', '31', '0', '30'], dtype='<U4'),
         array(['2017', '3', '31', '1', '5'], dtype='<U4'),
         ...
         array(['2017', '3', '31', '23', '15'], dtype='<U4'),
         array(['2017', '3', '31', '23', '50'], dtype='<U4')]


3. For-loop

   You can use a for-loop (iteration) to go through the data one by one.

        >>> t0 = datetime.datetime(2017, 3, 31)
        >>> t1 = datetime.datetime(2017, 4, 1)
        >>> for tobs, dat in dc.iter(t0, t1):
        ...     print(tobs, dat)
        2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
        2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
        ...
        2017-03-31 23:15:00 ['2017' '3' '31' '23' '15']
        2017-03-31 23:50:00 ['2017' '3' '31' '23' '50']
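
To make the semantics of the three reading schemes concrete, here is a minimal, self-contained sketch of an in-memory stand-in. The class ``ToyDataCenter`` is hypothetical and is not part of ``irfpy``; it only mimics the method names ``nearest``, ``get_array`` and ``iter`` used above. Whether the end time ``t1`` is included is an assumption of this sketch, not a statement about ``BaseDataCenter``. The 35-minute cadence is chosen so that the time tags match the sample output shown above.

```python
import bisect
import datetime


class ToyDataCenter:
    """In-memory stand-in for a data center (hypothetical, not part of irfpy)."""

    def __init__(self, times, data):
        # The time tags must be sorted so that bisect can locate them.
        self._times = list(times)
        self._data = list(data)

    def nearest(self, t):
        """Return the (time, data) pair whose time tag is closest to ``t``."""
        i = bisect.bisect_left(self._times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        j = min(candidates, key=lambda k: abs(self._times[k] - t))
        return self._times[j], self._data[j]

    def iter(self, t0, t1):
        """Yield (time, data) pairs with t0 <= time <= t1, one at a time."""
        lo = bisect.bisect_left(self._times, t0)
        hi = bisect.bisect_right(self._times, t1)
        for j in range(lo, hi):
            yield self._times[j], self._data[j]

    def get_array(self, t0, t1):
        """Return two lists (times, data) covering the same range as ``iter``."""
        pairs = list(self.iter(t0, t1))
        return [p[0] for p in pairs], [p[1] for p in pairs]


# Toy dataset: one sample every 35 minutes, starting at 2017-03-30 00:00.
start = datetime.datetime(2017, 3, 30)
times = [start + datetime.timedelta(minutes=35 * i) for i in range(165)]
data = [(t.year, t.month, t.day, t.hour, t.minute) for t in times]
dc_toy = ToyDataCenter(times, data)

# Scheme 1: a single sample nearest to the requested time.
tobs, dat = dc_toy.nearest(datetime.datetime(2017, 4, 1, 10, 10))
print(tobs, dat)

# Scheme 2: all samples in a range, returned as two lists.
t0 = datetime.datetime(2017, 3, 31)
t1 = datetime.datetime(2017, 4, 1)
tlist, dlist = dc_toy.get_array(t0, t1)
print(len(tlist), tlist[0], tlist[-1])

# Scheme 3: the same range, visited one sample at a time.
n_iterated = sum(1 for _ in dc_toy.iter(t0, t1))
print(n_iterated)
```

In this sketch ``get_array`` is built on top of ``iter``, so both schemes return the same samples and differ only in whether they are held in memory all at once.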

.. note::

        Which should you use: approach 2 (array) or approach 3 (for-loop, i.e. iteration)?

        The array approach (2) is better in terms of

          - Intuitiveness
          - Interactivity
          - Time series analysis

        The for-loop approach (3) is better in terms of

          - Memory efficiency for long time periods

        Thus, the array approach may be used while developing code on a short
        time period, and the for-loop approach may be used in a script that computes statistics.
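
The memory trade-off between the two approaches can be sketched without any real dataset. In the example below, the generator ``iter_samples`` is a hypothetical stand-in for an iterator such as ``dc.iter(t0, t1)``; it is not part of ``irfpy``, and the synthetic values are for illustration only.

```python
import datetime


def iter_samples(t0, n, step_s=1):
    """Yield (time, value) pairs lazily; a hypothetical stand-in for ``dc.iter``."""
    for i in range(n):
        yield t0 + datetime.timedelta(seconds=i * step_s), float(i % 10)


t0 = datetime.datetime(2017, 3, 31)

# Array style: every sample is held in memory, which is convenient
# for interactive work and time series analysis.
pairs = list(iter_samples(t0, 1000))
values = [v for _, v in pairs]
mean_array = sum(values) / len(values)

# Iteration style: only running totals are kept, so memory use
# does not grow with the length of the period.
count = 0
total = 0.0
for _, v in iter_samples(t0, 1000):
    count += 1
    total += v
mean_iter = total / count

print(mean_array, mean_iter)
```

Both styles give the same mean; the iteration style simply never stores more than one sample at a time.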


Specific reading
................

Each data center can provide its own specific ways of reading or formatting the data.
See the documentation of each data center.


For developer
=============

.. note::

    .. versionchanged:: v4.4.8

        Overriding the method
        :meth:`irfpy.util.datacenter.BaseDataCenter.exact_starttime`
        is no longer recommended.
        The method not only returns the exact start time but also
        checks the validity of the data file.  If the data file is invalid,
        the file is disregarded.

A sample implementation of a data center can be found in :class:`irfpy.util.datacenter.SampleDataCenter`.
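
The idea that determining a file's start time can double as a validity check may be sketched as follows. This is a standalone illustration under stated assumptions, not the ``irfpy`` implementation: the in-memory ``FILES`` table, the function ``build_index``, and the use of ``ValueError`` to signal an invalid file are all hypothetical. Only the name ``exact_starttime`` is borrowed from the note above, and the function here is a free function rather than the real method.

```python
import datetime

# Hypothetical in-memory "files": name -> list of (time, data) records.
FILES = {
    "day1.dat": [(datetime.datetime(2017, 3, 30, 0, 0), "a"),
                 (datetime.datetime(2017, 3, 30, 12, 0), "b")],
    "broken.dat": [],  # An invalid file that holds no records.
    "day2.dat": [(datetime.datetime(2017, 3, 31, 0, 30), "c")],
}


def exact_starttime(name):
    """Return the first time tag in the file, or raise if the file is invalid."""
    records = FILES[name]
    if not records:
        raise ValueError("no valid record in {}".format(name))
    return records[0][0]


def build_index(names):
    """Return a time-sorted list of (start time, name), skipping invalid files."""
    index = []
    for name in names:
        try:
            index.append((exact_starttime(name), name))
        except ValueError:
            continue  # Disregard files that cannot be used.
    return sorted(index)


index = build_index(FILES)
print([name for _, name in index])
```

Because the index is built only from files whose start time could be determined, later lookups never touch the invalid file.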