.. _tutorial_datafile:

=========
Data file
=========

Reading and writing data files is a major part of data analysis and
numerical simulation.
``irfpy`` provides a simple data file template in the
:mod:`irfpy.util.datafile` module, which is one way to get an
easy-to-use data file.

Many data file formats are already defined and available
for ``python`` and ``numpy``/``scipy``,
in addition to ``CDF`` and ``netcdf``.

Even though many data formats are available,
using them is sometimes difficult.
Therefore, many people create a data file format for their own purposes.
This document describes an easy way of creating such a data file.

Importance
----------

My criteria for a "useful" data file are as follows.

- Human readable (Other-platform readable)
- Header and trailer for meta data
- Version controlled (Easy to change the format, particularly header)
- Easy to use (Intuitive!)

The following are also important, but less so than the above.

- File size  (You may use gzip to decrease the data file size)
- Fully self-describing
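
As a sketch of the gzip remark above: the standard ``gzip`` module can open
a compressed file in text mode, so the writing and reading code stays
line-oriented. The file name and the sample lines here are arbitrary
assumptions for illustration.

.. code-block:: py

        import gzip
        import os
        import tempfile

        # Arbitrary sample lines (assumed content, for illustration only)
        lines = ['2009-01-31T00:00:00 5.3 275.0 42.1',
                 '2009-01-31T01:00:00 5.4 235.0 31.8']

        path = os.path.join(tempfile.mkdtemp(), 'solarwind.dat.gz')

        # 'wt' and 'rt' open the compressed file in text mode
        with gzip.open(path, 'wt') as fp:
            for line in lines:
                fp.write(line + '\n')

        with gzip.open(path, 'rt') as fp:
            restored = [line.rstrip('\n') for line in fp]

        print(restored == lines)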


Simplest way
------------

The simplest way is to open a file, dump the data, and close the file.

The pros are

- Human readable
- Easy to use

The cons are

- Hard to change file format
- Hard to remember the data contents

A sample script to write the data looks like this:

.. code-block:: py

        >>> fp = open('solarwind.dat', 'w')
        >>> for t in tlist:
        ...     # One line per time step: time, density, velocity, temperature
        ...     print('%s %.1f %.1f %.1f' % (
        ...               t.strftime('%Y-%m-%dT%H:%M:%S'),
        ...               density[t], velocity[t], temperature[t]), file=fp)
        >>> fp.close()

and the resulting file looks like this:

::

        2009-01-31T00:00:00 5.3 275.0 42.1
        2009-01-31T01:00:00 5.4 235.0 31.8
        2009-01-31T02:00:00 3.3 278.3 45.9
        2009-01-31T03:00:00 7.2 285.7 44.3
        2009-01-31T04:00:00 6.3 285.2 50.1

This simple approach works well in many situations.
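
Reading such a file back is equally short. The following is a minimal sketch
using only the standard library; the function name ``read_solarwind`` and the
column meanings (time, density, velocity, temperature) are assumptions taken
from the example above, not part of ``irfpy``.

.. code-block:: py

        import datetime
        import io


        def read_solarwind(fp):
            """Parse lines of 'time density velocity temperature' into tuples."""
            records = []
            for line in fp:
                line = line.strip()
                if not line:
                    continue   # Skip empty lines
                tstr, dens, vel, temp = line.split()
                t = datetime.datetime.strptime(tstr, '%Y-%m-%dT%H:%M:%S')
                records.append((t, float(dens), float(vel), float(temp)))
            return records


        # An in-memory stand-in for open('solarwind.dat')
        sample = io.StringIO(
            '2009-01-31T00:00:00 5.3 275.0 42.1\n'
            '2009-01-31T01:00:00 5.4 235.0 31.8\n')
        records = read_solarwind(sample)
        print(records[0][1:])

Note that the reader has to know the column order in advance, which is the
"hard to remember the data contents" drawback listed above.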

``irfpy``-compatible data file
------------------------------

:mod:`irfpy.util.datafile` provides one solution that meets the criteria above.
:class:`irfpy.util.datafile.Datafile` is its implementation.

Creating data file
..................

It is simple: instantiate the ``Datafile`` class.

.. code-block:: py

        >>> import irfpy.util.datafile
        >>> df = irfpy.util.datafile.Datafile(version="1.0")

The data file now already has a version and a creation date.
These are generated automatically and
dumped as the header.


.. code-block:: py

        >>> print(df.dumps())
        ### HEADER : 2
        # VERSION : 1.0
        # CREATION_DATE : 2013-03-08T09:56:22
        ### DATA : 0
        ### TRAIL : 0

The :meth:`dumps` method dumps the data into a string.
The :meth:`dump` method dumps the data into a file.
This follows the convention of ``pickle``.

Now look at the dumped contents.
Lines starting with three hashes ("``###``") separate the header, data and trail sections.
The number after the colon (2 here) is the number of entries in that section.
So far only the default header entries are set.
Each header line starts with a single hash ("``#``").

.. note::

        Remember that every entry is line-oriented.
        Therefore, the "one line, one datum" principle should be followed.
        This means that the header, data and trail strings must not contain "\\n".
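
To illustrate the layout, the header block of such a dump can be parsed with
a few lines of plain Python. This is only a sketch of the format as shown
above, not the parser used by :mod:`irfpy.util.datafile`; the function name
``parse_header`` is hypothetical.

.. code-block:: py

        import collections


        def parse_header(text):
            """Return the header entries of a dumped data file as an OrderedDict."""
            lines = text.splitlines()
            # The first line looks like '### HEADER : 2'
            marker, count = lines[0].split(' : ')
            if marker != '### HEADER':
                raise ValueError('Not a header marker: %r' % lines[0])
            header = collections.OrderedDict()
            for line in lines[1:1 + int(count)]:
                # Each entry looks like '# KEY : VALUE'; drop the leading '# '
                key, value = line[2:].split(' : ', 1)
                header[key] = value
            return header


        dumped = '\n'.join([
            '### HEADER : 2',
            '# VERSION : 1.0',
            '# CREATION_DATE : 2013-03-08T09:56:22',
            '### DATA : 0',
            '### TRAIL : 0'])
        header = parse_header(dumped)
        print(header['VERSION'])

Because the entry count is written in the section marker, the sketch knows
how many lines to read without scanning for the next marker.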

You can then add header and trail entries as you like. For example,

.. code-block:: py

        >>> df.add_header('SPACECRAFT', 'MEX')
        >>> df.add_header('SENSOR', 'IMA')
        >>> df.add_trail('LICENSE', 'BSD')
        >>> print(df.dumps())
        ### HEADER : 4
        # VERSION : 1.0
        # CREATION_DATE : 2013-03-08T09:56:22
        # SPACECRAFT : MEX
        # SENSOR : IMA
        ### DATA : 0
        ### TRAIL : 1
        # LICENSE : BSD

OK, now four header entries and one trail entry are shown.
The data is still missing.

Add data into a data file
.........................

You may use :meth:`add_data` to add the data.

.. code-block:: py

        >>> df.add_data('FLUX', '0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3')
        >>> df.add_data('FLUX', '1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8')
        >>> df.add_data('FLUX', '2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0')

The first argument is the key ("FLUX") and the second is the data.
The data should be a **single-line string**.
Three entries are added under the data key "FLUX".
The contents look as follows.

.. code-block:: py

        >>> print(df.dumps())
        ### HEADER : 4
        # VERSION : 1.0
        # CREATION_DATE : 2013-03-08T09:56:22
        # SPACECRAFT : MEX
        # SENSOR : IMA
        ### DATA : 1
        ## DATA : FLUX : 3
        0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3
        1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8
        2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0
        ### TRAIL : 1
        # LICENSE : BSD

The data with the key "FLUX" has three entries.
You may add multiple datasets.
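
Because each data entry must be a single-line string, numeric values have to
be formatted before they are passed to :meth:`add_data`. A minimal sketch of
such a formatter follows; the helper name ``format_flux`` and the column
layout (index, time, then flux values) are assumptions modelled on the
example above.

.. code-block:: py

        import datetime


        def format_flux(index, time, values):
            """Format one record as a single-line string for a data entry."""
            return '%d %s  %s' % (index,
                                  time.strftime('%Y-%m-%dT%H:%M:%S'),
                                  ' '.join('%.1f' % v for v in values))


        line = format_flux(0, datetime.datetime(2009, 1, 3, 18, 5, 33),
                           [2.8, 1.9, 4.1, 7.3])
        print(line)
        # The string could then be passed on, e.g. df.add_data('FLUX', line)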

Now the data file is complete. You may dump it into ``tutdf_v1.0.txt``.

.. code-block:: py

        >>> with open('tutdf_v1.0.txt', 'w') as fp:   # Closed automatically
        ...     df.dump(fp)

Reading data
............

Reading the data is very simple. We will read the file
:download:`tutdf_v1.0.txt`.
Take a look at it with your preferred pager.

This file will be read by :mod:`irfpy.util.datafile`.

.. code-block:: py

        >>> import irfpy.util.datafile
        >>> df = irfpy.util.datafile.Datafile()
        >>> df.readfile(open('tutdf_v1.0.txt'))

Now the data is read and loaded into ``df``.

You can access the data by wrapping it in a
:class:`irfpy.util.datafile.DatafileReader`.

.. code-block:: py

        >>> dfr = irfpy.util.datafile.DatafileReader(df)
        >>> print(dfr.get_header('VERSION'))
        1.0
        >>> print(dfr.get_header('Version'))   # No such entry (keys are case-sensitive)
        None
        >>> print(dfr.get_data('FLUX')[2])
        2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0


You can also access the header, data, and trail directly via the
:attr:`header`, :attr:`data` and :attr:`trail` attributes of the
:class:`Datafile` object.  These attributes are :class:`OrderedDict` objects.
For ``data``, each value is a list of strings.

.. code-block:: py

        >>> print(df.header['VERSION'])
        1.0
        >>> print(df.data['FLUX'][2])
        2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0
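
Since every data entry is stored as a plain string, converting it to numbers
is left to the user. A minimal sketch, assuming the column layout of the
"FLUX" example (index, time, four flux values); the list ``flux_lines``
stands in for ``df.data['FLUX']`` and ``parse_flux`` is a hypothetical helper.

.. code-block:: py

        import datetime


        def parse_flux(line):
            """Split one FLUX entry into (index, time, list of flux values)."""
            tokens = line.split()   # Runs of whitespace count as one separator
            index = int(tokens[0])
            time = datetime.datetime.strptime(tokens[1], '%Y-%m-%dT%H:%M:%S')
            values = [float(tok) for tok in tokens[2:]]
            return index, time, values


        # Stand-in for df.data['FLUX']
        flux_lines = ['0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3',
                      '1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8',
                      '2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0']
        parsed = [parse_flux(line) for line in flux_lines]
        print(parsed[2][2])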

Use numpy.loadtxt or genfromtxt
...............................

One benefit is that the data file is compatible with ``loadtxt`` and ``genfromtxt``,
as long as there is only a single data block.

.. code-block:: py

        >>> import numpy as np
        >>> dat = np.genfromtxt(open('tutdf_v1.0.txt'), usecols=(0, 2, 3, 4, 5))
        >>> print(dat)
        [[ 0.   2.8  1.9  4.1  7.3]
         [ 1.   2.7  2.3  3.7  6.8]
         [ 2.   2.6  1.9  2.8  9. ]]

The :meth:`readheader` method reads only the header information.
This is a quick way of getting meta information on the file.


.. code-block:: py

        >>> import irfpy.util.datafile
        >>> df = irfpy.util.datafile.Datafile()
        >>> df.readheader(open('tutdf_v1.0.txt'))

.. note::

        So far :meth:`readtrail` assumes that the file pointer is already
        at the correct position, so it only works after :meth:`readdata`.


Use case: update the header
...........................

Updating the file format or contents is a common requirement.
Updating the header is a simple example.

Suppose you want to add "START_TIME" and "STOP_TIME" entries to the header,
and also remove the "SENSOR" entry.
Changing how the Datafile is written is very simple: just modify the code!
Remember to update the version number, say to "1.1".

.. code-block:: py

        >>> import irfpy.util.datafile
        >>> df = irfpy.util.datafile.Datafile(version="1.1")
        >>> df.add_header('SPACECRAFT', 'MEX')
        >>> df.add_header('START_TIME', '2009-01-03T18:05:00')
        >>> df.add_header('STOP_TIME', '2009-01-03T18:12:00')
        >>> df.add_data('FLUX', '0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3')
        >>> df.add_data('FLUX', '1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8')
        >>> df.add_data('FLUX', '2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0')
        >>> df.add_trail('LICENSE', 'BSD')
        >>> with open('tutdf_v1.1.txt', 'w') as fp:   # Closed automatically
        ...     df.dump(fp)

The output will be :download:`tutdf_v1.1.txt`.

For the reading part, I would use the following to handle both versions.

.. code-block:: py

        >>> # Read v1.0 file.
        >>> df_v10 = irfpy.util.datafile.Datafile()
        >>> df_v10.readheader(open('tutdf_v1.0.txt'))
        >>> dfr_v10 = irfpy.util.datafile.DatafileReader(df_v10, missing_return="UNKNOWN")

        >>> # Read v1.1 file.
        >>> df_v11 = irfpy.util.datafile.Datafile()
        >>> df_v11.readheader(open('tutdf_v1.1.txt'))
        >>> dfr_v11 = irfpy.util.datafile.DatafileReader(df_v11, missing_return="UNKNOWN")

        >>> print(dfr_v10.get_header('VERSION'))
        1.0
        >>> print(dfr_v10.get_header('START_TIME'))   # Only in v1.1
        UNKNOWN

        >>> print(dfr_v11.get_header('SPACECRAFT'))
        MEX
        >>> print(dfr_v11.get_header('SENSOR'))   # Only in v1.0
        UNKNOWN
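
The same fallback idea can be expressed with plain dictionaries, which may be
handy once the header has been copied out of the ``Datafile`` object.
The sketch below uses ordinary ``dict`` objects as stand-ins for the two
headers; ``describe`` is a hypothetical helper and is not part of ``irfpy``.

.. code-block:: py

        def describe(header, missing='UNKNOWN'):
            """Summarize a header from either format version in a uniform way."""
            keys = ('VERSION', 'SPACECRAFT', 'SENSOR', 'START_TIME', 'STOP_TIME')
            # dict.get returns the fallback for keys absent in that version
            return dict((key, header.get(key, missing)) for key in keys)


        # Stand-ins for the headers of the v1.0 and v1.1 files
        header_v10 = {'VERSION': '1.0', 'SPACECRAFT': 'MEX', 'SENSOR': 'IMA'}
        header_v11 = {'VERSION': '1.1', 'SPACECRAFT': 'MEX',
                      'START_TIME': '2009-01-03T18:05:00',
                      'STOP_TIME': '2009-01-03T18:12:00'}
        summary_v10 = describe(header_v10)
        summary_v11 = describe(header_v11)
        print(summary_v10['START_TIME'], summary_v11['SENSOR'])

Code written against the summary then never has to branch on the version
number; it only has to cope with the "UNKNOWN" placeholder.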