Data file

Reading and writing data files is a major part of data analysis and numerical simulation. irfpy provides a simple data file template in the irfpy.util.datafile module, which is one solution for an easy-to-use data file.

Many data file formats are already defined and available for Python and numpy/scipy, in addition to CDF and netCDF.

Even though so many formats are available, they are sometimes difficult to use. Therefore, many people create a data file format for their own purpose. This document describes an easy way of creating such a data file.

Importance

My criteria for a “useful” data file are as follows.

  • Human readable (Other-platform readable)

  • Header and trailer for meta data

  • Version controlled (Easy to change the format, particularly header)

  • Easy to use (Intuitive!)

The following are also important, but less so than the above.

  • File size (You may use gzip to decrease the data file size)

  • Fully self-described
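
As a minimal sketch of the gzip remark above, the standard-library gzip module can write and read a text data file transparently (Python 3). The file name solarwind.dat.gz and the sample lines are assumptions for illustration only.

```python
import gzip
import os
import tempfile

# Hypothetical sample lines; any line-oriented text works the same way.
lines = [
    "2009-01-31T00:00:00 5.3 275.0 42.1",
    "2009-01-31T01:00:00 5.4 235.0 31.8",
]

# Work inside a temporary directory that is removed when the block ends.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "solarwind.dat.gz")

    # Modes "wt" and "rt" make gzip.open behave like a text-mode open().
    with gzip.open(path, "wt") as fp:
        for line in lines:
            fp.write(line + "\n")

    with gzip.open(path, "rt") as fp:
        restored = [line.rstrip("\n") for line in fp]

# True when the compressed file gave back exactly what was written.
print(restored == lines)
```

The price is that the compressed file is no longer directly human readable; it has to be viewed through a tool such as zcat or zless.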

Simplest way

The simplest way is to open a file, dump the data, and close the file.

The pros are

  • Human readable

  • Easy to use

The cons are

  • Hard to change file format

  • Hard to remember the data contents

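A sample session that writes the data looks like this.

>>> # Assumes tlist is a list of datetime objects and that density,
>>> # velocity and temperature are dicts keyed by those datetimes.
>>> with open('solarwind.dat', 'w') as fp:
...     for t in tlist:
...         print('%s %.1f %.1f %.1f' % (
...                   t.strftime('%Y-%m-%dT%H:%M:%S'),
...                   density[t], velocity[t], temperature[t]),
...               file=fp)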

The resulting file looks like this.

2009-01-31T00:00:00 5.3 275.0 42.1
2009-01-31T01:00:00 5.4 235.0 31.8
2009-01-31T02:00:00 3.3 278.3 45.9
2009-01-31T03:00:00 7.2 285.7 44.3
2009-01-31T04:00:00 6.3 285.2 50.1

This simple approach works well in many situations.
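
Reading such a file back is equally short. The sketch below, written for Python 3 with only the standard library, parses the sample values from an in-memory buffer. The helper name parse_line and the in-memory buffer are assumptions for illustration; the column order (time, density, velocity, temperature) is taken from the writing example above.

```python
import datetime
import io

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_line(line):
    """Split one record into (time, density, velocity, temperature)."""
    tstr, n, v, temp = line.split()
    t = datetime.datetime.strptime(tstr, TIME_FORMAT)
    return t, float(n), float(v), float(temp)


# In-memory stand-in for solarwind.dat.
buf = io.StringIO(
    "2009-01-31T00:00:00 5.3 275.0 42.1\n"
    "2009-01-31T01:00:00 5.4 235.0 31.8\n"
)

# Skip blank lines, parse everything else.
records = [parse_line(line) for line in buf if line.strip()]
for t, n, v, temp in records:
    print(t.isoformat(), n, v, temp)
```

Note that the reader has to know the column order in advance. This is exactly the “hard to remember the data contents” drawback listed above.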

irfpy-compatible data file

irfpy.util.datafile provides one solution that meets these criteria. It is implemented by the irfpy.util.datafile.Datafile class.

Creating data file

Creating a data file is simple: instantiate the Datafile class.

>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile(version="1.0")

Now the data file already has a version and a creation date. These are generated automatically and dumped as header entries.

>>> print(df.dumps())
### HEADER : 2
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
### DATA : 0
### TRAIL : 0

The dumps() method dumps the data into a string, and the dump() method dumps it into a file. This follows the naming convention of the pickle module.

Now look at the dumped contents. Lines starting with three hashes (“###”) separate the header, data and trail sections. The number after the colon (2 here) is the number of entries in that section. So far only the default header entries are set. Each header line starts with a single hash (“#”).

Note

Remember that every entry is line-oriented, so the “one line, one entry” principle should be followed. This means that a header, data or trail string must not contain “\n”.
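
The line-oriented rule can be checked before a value is added. A minimal sketch, assuming a hypothetical helper check_single_line that is not part of irfpy:

```python
def check_single_line(value):
    """Return value as a string, or raise ValueError if it spans several lines."""
    text = str(value)
    if "\n" in text or "\r" in text:
        raise ValueError("entry must not contain a line break: %r" % text)
    return text


# A normal value passes through unchanged.
print(check_single_line("MEX"))

# A multi-line value is rejected before it can corrupt the file layout.
try:
    check_single_line("first line\nsecond line")
except ValueError as err:
    print("rejected:", err)
```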

Then you can add header and trail entries as you like. For example,

>>> df.add_header('SPACECRAFT', 'MEX')
>>> df.add_header('SENSOR', 'IMA')
>>> df.add_trail('LICENSE', 'BSD')
>>> print(df.dumps())
### HEADER : 4
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
# SPACECRAFT : MEX
# SENSOR : IMA
### DATA : 0
### TRAIL : 1
# LICENSE : BSD

Now four header entries and one trail entry are shown. The data block is still empty.

Add data into a data file

You may use add_data() to add the data.

>>> df.add_data('FLUX', '0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3')
>>> df.add_data('FLUX', '1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8')
>>> df.add_data('FLUX', '2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0')

The first argument is the key (“FLUX”) and the second is the data, which should be a single-line string. Three entries are added under the data key “FLUX”. The contents now look as follows.

>>> print(df.dumps())
### HEADER : 4
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
# SPACECRAFT : MEX
# SENSOR : IMA
### DATA : 1
## DATA : FLUX : 3
0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3
1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8
2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0
### TRAIL : 1
# LICENSE : BSD

Data with the key “FLUX” has three entries. You may add multiple datasets.
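
To make the layout concrete, here is a minimal sketch of a writer and a parser for the text layout shown above, using only the standard library (Python 3). It is not the irfpy implementation: the names dumps_sketch and loads_sketch are hypothetical, the entry counts after the colons are written but not checked on reading, and a data row that itself starts with “###” or “## DATA : ” would confuse this parser.

```python
from collections import OrderedDict


def dumps_sketch(header, data, trail):
    """Serialize header/data/trail mappings into the layout shown above."""
    out = ["### HEADER : %d" % len(header)]
    out += ["# %s : %s" % (k, v) for k, v in header.items()]
    out.append("### DATA : %d" % len(data))
    for key, rows in data.items():
        out.append("## DATA : %s : %d" % (key, len(rows)))
        out.extend(rows)
    out.append("### TRAIL : %d" % len(trail))
    out += ["# %s : %s" % (k, v) for k, v in trail.items()]
    return "\n".join(out) + "\n"


def loads_sketch(text):
    """Parse text produced by dumps_sketch back into three mappings."""
    header, data, trail = OrderedDict(), OrderedDict(), OrderedDict()
    section = None   # "HEADER", "DATA" or "TRAIL"
    rows = None      # list receiving the rows of the current data block
    for line in text.splitlines():
        if line.startswith("### "):
            section = line[4:].split(" : ")[0]
        elif line.startswith("## DATA : "):
            key = line[len("## DATA : "):].rsplit(" : ", 1)[0]
            rows = data.setdefault(key, [])
        elif line.startswith("# ") and section in ("HEADER", "TRAIL"):
            key, value = line[2:].split(" : ", 1)
            (header if section == "HEADER" else trail)[key] = value
        elif section == "DATA" and rows is not None:
            rows.append(line)
    return header, data, trail


header = OrderedDict([("VERSION", "1.0"), ("SPACECRAFT", "MEX")])
data = OrderedDict([("FLUX", [
    "0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3",
    "1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8",
])])
trail = OrderedDict([("LICENSE", "BSD")])

text = dumps_sketch(header, data, trail)
print(text)
# True when parsing the text gives back the original three mappings.
print(loads_sketch(text) == (header, data, trail))
```

All values come back as strings, which matches the line-oriented, human-readable intent of the format.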

This completes the writing part of the tutorial. Dump the contents into tutdf_v1.0.txt.

>>> with open('tutdf_v1.0.txt', 'w') as fp:
...     df.dump(fp)

Reading data

Reading the data is very simple. Here we read the file tutdf_v1.0.txt; have a look at it with your preferred pager first.

The file is read with irfpy.util.datafile.

>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile()
>>> df.readfile(open('tutdf_v1.0.txt'))

Now the data is read and loaded into df.

You can access the contents by wrapping the object in irfpy.util.datafile.DatafileReader.

>>> dfr = irfpy.util.datafile.DatafileReader(df)
>>> print(dfr.get_header('VERSION'))
1.0
>>> print(dfr.get_header('Version'))   # Keys are case-sensitive; no such entry
None
>>> print(dfr.get_data('FLUX')[2])
2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0

You can also access the header, data and trail directly through the header, data and trail attributes of the Datafile object. These attributes are OrderedDict objects. For the data, each value is a list of strings.

>>> print(df.header['VERSION'])
1.0
>>> print(df.data['FLUX'][2])
2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0
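
Because each data entry is stored as a plain string, converting it to typed values is left to the user. A minimal sketch, assuming the FLUX row layout used in this tutorial (an index, a time stamp, then four numbers); the helper name parse_flux_row is hypothetical, and the physical meaning of the four numbers is not specified here.

```python
import datetime


def parse_flux_row(row):
    """Convert 'index time v0 v1 v2 v3' into (int, datetime, list of float)."""
    fields = row.split()            # any run of whitespace separates fields
    index = int(fields[0])
    time = datetime.datetime.strptime(fields[1], "%Y-%m-%dT%H:%M:%S")
    values = [float(x) for x in fields[2:]]
    return index, time, values


# Rows as they would be returned for the key 'FLUX' in this tutorial.
rows = [
    "0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3",
    "1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8",
    "2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0",
]

parsed = [parse_flux_row(row) for row in rows]
for index, time, values in parsed:
    print(index, time.isoformat(), values)
```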

Use numpy.loadtxt or genfromtxt

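One benefit is that the data file is compatible with numpy.loadtxt and numpy.genfromtxt, as long as it contains only a single data block. Both functions treat lines starting with “#” as comments by default, so the header, the trail and the section markers are skipped. The time stamp in the second column is not numeric, so it is excluded with usecols.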

>>> import numpy as np
>>> dat = np.genfromtxt('tutdf_v1.0.txt', usecols=(0, 2, 3, 4, 5))
>>> print(dat)
[[ 0.   2.8  1.9  4.1  7.3]
 [ 1.   2.7  2.3  3.7  6.8]
 [ 2.   2.6  1.9  2.8  9. ]]

readheader() reads only the header information. This is a quick way of getting the meta information of a file.

>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile()
>>> df.readheader(open('tutdf_v1.0.txt'))

Note

So far, readtrail() assumes that the file pointer is already positioned at the start of the trail, so it only works after readdata().
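
To illustrate why a header-only read is cheap, here is a minimal sketch that stops at the “### DATA” marker without touching the rest of the file. It uses only the standard library (Python 3); read_header_sketch is a hypothetical name and this is not the irfpy implementation.

```python
import io
from collections import OrderedDict


def read_header_sketch(fp):
    """Read header entries from fp and stop at the '### DATA' marker."""
    header = OrderedDict()
    for line in fp:
        line = line.rstrip("\n")
        if line.startswith("### DATA"):
            break                      # the data block is never parsed
        if line.startswith("### "):
            continue                   # the '### HEADER : N' marker itself
        if line.startswith("# "):
            key, value = line[2:].split(" : ", 1)
            header[key] = value
    return header


# In-memory stand-in for a small data file.
sample = io.StringIO(
    "### HEADER : 2\n"
    "# VERSION : 1.0\n"
    "# SPACECRAFT : MEX\n"
    "### DATA : 1\n"
    "## DATA : FLUX : 1\n"
    "0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3\n"
    "### TRAIL : 0\n"
)

header = read_header_sketch(sample)
for key, value in header.items():
    print(key, "=", value)
```

In this sketch the stream is left positioned right after the “### DATA” line. A reader that simply continues from the current position, as the note above describes for readtrail(), therefore depends on what was read before it.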

Use case: update the header

Updating the file format or contents is a common requirement. Updating the header is a simple case.

Suppose you want to add “START_TIME” and “STOP_TIME” entries to the header and remove the “SENSOR” entry. Changing the writing side is very simple: just modify the code. Remember to update the version number, say to “1.1”.

>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile(version="1.1")
>>> df.add_header('SPACECRAFT', 'MEX')
>>> df.add_header('START_TIME', '2009-01-03T18:05:00')
>>> df.add_header('STOP_TIME', '2009-01-03T18:12:00')
>>> df.add_data('FLUX', '0 2009-01-03T18:05:33  2.8 1.9 4.1 7.3')
>>> df.add_data('FLUX', '1 2009-01-03T18:08:45  2.7 2.3 3.7 6.8')
>>> df.add_data('FLUX', '2 2009-01-03T18:11:57  2.6 1.9 2.8 9.0')
>>> df.add_trail('LICENSE', 'BSD')
>>> with open('tutdf_v1.1.txt', 'w') as fp:
...     df.dump(fp)

The output is written to tutdf_v1.1.txt.

For the reading part, I would use the following to handle both versions.

>>> # Read v1.0 file.
>>> df_v10 = irfpy.util.datafile.Datafile()
>>> df_v10.readheader(open('tutdf_v1.0.txt'))
>>> dfr_v10 = irfpy.util.datafile.DatafileReader(df_v10, missing_return="UNKNOWN")

>>> # Read v1.1 file.
>>> df_v11 = irfpy.util.datafile.Datafile()
>>> df_v11.readheader(open('tutdf_v1.1.txt'))
>>> dfr_v11 = irfpy.util.datafile.DatafileReader(df_v11, missing_return="UNKNOWN")

>>> print(dfr_v10.get_header('VERSION'))
1.0
>>> print(dfr_v10.get_header('START_TIME'))   # Only in v1.1
UNKNOWN

>>> print(dfr_v11.get_header('SPACECRAFT'))
MEX
>>> print(dfr_v11.get_header('SENSOR'))   # Only in v1.0
UNKNOWN
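
The same idea can be expressed with plain dictionaries. The sketch below does not use irfpy; it only illustrates how a reader can tolerate both header versions, either by supplying a default for missing keys or by branching on the version. The function names get_header, version_tuple and time_range are assumptions for illustration, and the header contents are copied from the two tutorial files.

```python
def get_header(header, key, missing_return="UNKNOWN"):
    """Return header[key], or missing_return when the key is absent."""
    return header.get(key, missing_return)


def version_tuple(header):
    """Turn a version string such as '1.1' into a comparable tuple (1, 1)."""
    return tuple(int(part) for part in header["VERSION"].split("."))


def time_range(header):
    """Return (start, stop); both are 'UNKNOWN' for files older than v1.1."""
    if version_tuple(header) < (1, 1):
        return ("UNKNOWN", "UNKNOWN")
    return (get_header(header, "START_TIME"), get_header(header, "STOP_TIME"))


header_v10 = {"VERSION": "1.0", "SPACECRAFT": "MEX", "SENSOR": "IMA"}
header_v11 = {
    "VERSION": "1.1",
    "SPACECRAFT": "MEX",
    "START_TIME": "2009-01-03T18:05:00",
    "STOP_TIME": "2009-01-03T18:12:00",
}

for header in (header_v10, header_v11):
    print(get_header(header, "VERSION"),
          get_header(header, "SENSOR"),
          time_range(header))
```

Supplying a default keeps the reading code free of version checks for simple cases, while the explicit version comparison is useful when the meaning of an entry, and not only its presence, changes between versions.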