Data center, a new coherent way to access data¶
Introduction¶
Space physics data is often organized with the following characteristics.
Data is time-tagged
Data is split across multiple files, organized in a folder structure
Each data file contains multiple time-tagged records
Data at a specific time has a defined structure
The structure is usually coherent (e.g. vector, matrix, array, etc.)
For example, magnetic field data measured by a spacecraft would satisfy the above conditions.
Data is time-tagged. Yes. Each measurement is a vector at a specific time.
Data is split across multiple files. Yes. For example, the data may be stored in one file per day, covering the whole mission lifetime.
Each data file contains multiple time-tagged records. Yes. One file (e.g., one day of data) contains 86400 records if the time resolution is 1 second.
Data at a specific time has a defined structure. Yes. The data at a specific time is a vector.
The structure is usually coherent (e.g. vector, matrix, array, etc.). Yes. The same structure (a vector) applies throughout the mission.
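The characteristics above can be illustrated with a small, self-contained sketch. It does not use irfpy; the file naming scheme (`mag_YYYYMMDD.dat`), the helper names, and the zero-valued vectors are hypothetical placeholders chosen only to show the shape of such a dataset.

```python
import datetime


def daily_filename(t, prefix="mag"):
    """Return the (hypothetical) daily file name that would hold time ``t``."""
    return "{}_{:%Y%m%d}.dat".format(prefix, t)


def time_tags_for_day(day, resolution_s=1):
    """Return the time tags of one daily file at the given time resolution."""
    start = datetime.datetime(day.year, day.month, day.day)
    n = 86400 // resolution_s
    return [start + datetime.timedelta(seconds=i * resolution_s) for i in range(n)]


tags = time_tags_for_day(datetime.date(2017, 4, 1))

# Each time tag carries one record with the same structure, here a 3-vector.
# The zeros are placeholders, not measurements.
records = [(t, (0.0, 0.0, 0.0)) for t in tags]

print(len(records))             # number of records in one daily file
print(daily_filename(tags[0]))  # file that would contain the first record
```

One daily file at 1 second resolution holds 86400 records, and every record has the same structure, which is what makes a uniform access interface possible.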
To access such data easily, irfpy provides the Data Center functionality.
New in version v4.4.7: The irfpy.util.datacenter module and the abstract base class irfpy.util.datacenter.BaseDataCenter were added for this purpose.
For user¶
Ideally, each dataset provides one or more data centers. For example, the irfpy.aspera project provides several data centers.
VEX/IMA provides irfpy.vima.rawdata.DataCenterCount2dEmulated
VEX/NPD provides irfpy.vnpd.vnpddata.DataCenterRawMode
VEX/MAG provides irfpy.vmag.scidata1s.DataCenterMag1s
and more…
How to prepare the data¶
Preparation of the data depends on the project. Usually you can find the information somewhere on this website, or you can ask the people responsible for the setup.
The typical procedure is:
Prepare the dataset (e.g., downloaded from a website)
Set the path of the downloaded dataset in .irfpyrc.
A sample dataset¶
In this tutorial, we use the irfpy.util.datacenter.SampleDataCenter class.
First, you can create a sample data folder.
>>> from irfpy.util import datacenter
>>> datacenter.create_sample_folder()
Then you can find a folder _datacenter_sample in the current directory.
To create a sample datacenter object,
>>> dc = datacenter.SampleDataCenter()
You can print out the status of the datacenter.
>>> dc.print_status()
# DataCenterStatus:
# - Time range:
# - Start: 2017-03-30 00:00:00
# - Stop: 2017-04-02 23:40:00
# - Cache:
# - Size: 2
# - DB:
# - Size: 4
How to read the data¶
General reading¶
All the data centers support the following three typical reading schemes.
Get the data at a specific time
To get the data at a specific time, you would usually use the irfpy.util.datacenter.BaseDataCenter.nearest() method, or similar.
>>> import datetime
>>> t0 = datetime.datetime(2017, 4, 1, 10, 10)
>>> tobs, dat = dc.nearest(t0)
The data center returns a tuple with two elements. The first one is the time, and the second one is the data.
Let’s see them.
>>> print(tobs)
2017-04-01 10:20:00
The nearest data was observed 10 minutes later.
>>> print(dat)
['2017' '4' '1' '10' '20']
The corresponding data is a five-element NumPy array.
Get the data in a range of time
To get the data in a time range, you may use the irfpy.util.datacenter.BaseDataCenter.get_array() method, or similar.
>>> t0 = datetime.datetime(2017, 3, 31)
>>> t1 = datetime.datetime(2017, 4, 1)
>>> tobs, dat = dc.get_array(t0, t1)
In these methods, the returned value is a tuple with two elements. The first one is a list of the times sampled, and the second one is a list of data.
>>> print(tobs)
[datetime.datetime(2017, 3, 31, 0, 30),
 datetime.datetime(2017, 3, 31, 1, 5),
 ...,
 datetime.datetime(2017, 3, 31, 23, 15),
 datetime.datetime(2017, 3, 31, 23, 50)]
>>> print(dat)
[array(['2017', '3', '31', '0', '30'], dtype='<U4'),
 array(['2017', '3', '31', '1', '5'], dtype='<U4'),
 ...
 array(['2017', '3', '31', '23', '15'], dtype='<U4'),
 array(['2017', '3', '31', '23', '50'], dtype='<U4')]
For-loop
You can use a for-loop (or iteration) to go through the data one by one.
>>> t0 = datetime.datetime(2017, 3, 31)
>>> t1 = datetime.datetime(2017, 4, 1)
>>> for tobs, dat in dc.iter(t0, t1):
...     print(tobs, dat)
2017-03-31 00:30:00 ['2017' '3' '31' '0' '30']
2017-03-31 01:05:00 ['2017' '3' '31' '1' '5']
...
2017-03-31 23:15:00 ['2017' '3' '31' '23' '15']
2017-03-31 23:50:00 ['2017' '3' '31' '23' '50']
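To make the three reading schemes concrete without any dataset, the following is a minimal in-memory stand-in. It is not irfpy's implementation: the class name `ToyDataCenter` is hypothetical, the records are made up, and the half-open time range (`t0 <= time < t1`) is an assumption of this sketch rather than documented behavior of the real data centers.

```python
import bisect
import datetime


class ToyDataCenter:
    """Hypothetical in-memory stand-in for a data center; not part of irfpy."""

    def __init__(self, times, data):
        # Keep the records sorted by time so that bisect can be used.
        pairs = sorted(zip(times, data), key=lambda p: p[0])
        self._times = [t for t, _ in pairs]
        self._data = [d for _, d in pairs]

    def nearest(self, t):
        """Return (time, data) of the record closest in time to ``t``."""
        if not self._times:
            raise ValueError("no data")
        i = bisect.bisect_left(self._times, t)
        # Candidates are the neighbors on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        best = min(candidates, key=lambda j: abs(self._times[j] - t))
        return self._times[best], self._data[best]

    def iter(self, t0, t1):
        """Yield (time, data) one by one for t0 <= time < t1 (assumed range)."""
        lo = bisect.bisect_left(self._times, t0)
        hi = bisect.bisect_left(self._times, t1)
        for j in range(lo, hi):
            yield self._times[j], self._data[j]

    def get_array(self, t0, t1):
        """Return (list of times, list of data) for the same range as iter()."""
        pairs = list(self.iter(t0, t1))
        return [t for t, _ in pairs], [d for _, d in pairs]


# Made-up records every 6 hours; the "data" is just the hour of day.
start = datetime.datetime(2017, 3, 31)
times = [start + datetime.timedelta(hours=6 * k) for k in range(8)]
toy = ToyDataCenter(times, [t.hour for t in times])

tobs, dat = toy.nearest(datetime.datetime(2017, 3, 31, 10, 10))
print(tobs, dat)
tlist, dlist = toy.get_array(start, start + datetime.timedelta(days=1))
print(len(tlist), dlist)
```

All three methods return the same (time, data) pairing; they differ only in whether one record, a full list, or a lazy sequence is handed back.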
Note
Which should be used, 2. array or 3. for-loop (iteration)?
The 2. array approach is better in terms of
Intuitiveness
Interactiveness
Time series analysis
The 3. for-loop approach is better in terms of
Memory efficiency for long time periods
Thus, the 2. array approach may be used for developing code on a short time period, and the 3. for-loop approach may be used for a script that computes statistics over a long period.
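The memory trade-off can be sketched with plain Python. Here `fake_iter` is a hypothetical generator standing in for a data center iterator, and its values are made up; the point is only that the iteration style never holds more than one record at a time, while the array style holds them all.

```python
import datetime


def fake_iter(t0, t1, step_s=60):
    """Hypothetical stand-in for a data center iterator: yields (time, value)."""
    t = t0
    k = 0
    while t < t1:
        yield t, float(k % 10)  # made-up value, not real data
        t += datetime.timedelta(seconds=step_s)
        k += 1


def running_mean(pairs):
    """Average the values of an iterable of (time, value) without storing them."""
    n = 0
    total = 0.0
    for _, value in pairs:
        n += 1
        total += value
    return total / n if n else float("nan")


t0 = datetime.datetime(2017, 3, 31)
t1 = datetime.datetime(2017, 4, 1)

# Array style: all records are held in memory at once.
tlist, vlist = zip(*fake_iter(t0, t1))
mean_array = sum(vlist) / len(vlist)

# Iteration style: only one record is alive at a time.
mean_iter = running_mean(fake_iter(t0, t1))

print(mean_array, mean_iter)
```

Both styles give the same statistic; the array style is convenient for interactive inspection, while the iteration style keeps memory use flat as the time range grows.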
Specific reading¶
Each data center can provide specific ways of reading or formatting the data. See each data center description.
For developer¶
Note
Changed in version v4.4.8: Overriding the method irfpy.util.datacenter.BaseDataCenter.exact_starttime() is no longer recommended.
The method not only returns the exact start time, but also checks the validity of the data file. If the data file is invalid, the file is disregarded.
A sample implementation of a data center can be found in irfpy.util.datacenter.SampleDataCenter.