Data file¶
Reading and writing data files is a major part of data analysis and numerical simulations.
irfpy provides a simple data file template in the irfpy.util.datafile module, which is one solution for an easy-to-use data file.
For Python and numpy/scipy in general, many data file formats are defined and available, in addition to CDF and netCDF.
Even though so many data formats are available, using them is sometimes difficult. Therefore, many people create data files for their own purposes. This document describes an easy way of creating such a data file.
Importance¶
My criteria for a “useful” data file are as follows.
Human readable (Other-platform readable)
Header and trailer for meta data
Version controlled (Easy to change the format, particularly header)
Easy to use (Intuitive!)
The following are also important, but less so than the above.
File size (You may use gzip to decrease the data file size)
Fully self-described
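The gzip remark above can be sketched with the standard library alone. This is a minimal illustration, not something irfpy does for you; the file name and the sample lines are hypothetical.

```python
import gzip
import os
import tempfile

# Hypothetical sample contents; any line-oriented text works the same way.
lines = [
    "2009-01-31T00:00:00 5.3 275.0 42.1",
    "2009-01-31T01:00:00 5.4 235.0 31.8",
]

path = os.path.join(tempfile.mkdtemp(), "solarwind.dat.gz")

# 'wt' and 'rt' open the compressed file in text mode.
with gzip.open(path, "wt") as fp:
    for line in lines:
        fp.write(line + "\n")

with gzip.open(path, "rt") as fp:
    restored = [line.rstrip("\n") for line in fp]
```

The file stays line-oriented text after decompression, so the human-readable criterion is kept at the cost of one extra step.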
Simplest way¶
The simplest way is to open a file, dump the data, and close the file.
The pros are
Human readable
Easy to use
The cons are
Hard to change file format
Hard to remember the data contents
Sample code that writes the data looks like this:
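>>> # Assumes tlist is a list of datetime objects and that density,
>>> # velocity, and temperature are dicts keyed by those times.
>>> fp = open('solarwind.dat', 'w')
>>> for t in tlist:
...     print('%s %.1f %.1f %.1f' % (
...         t.strftime('%Y-%m-%dT%H:%M:%S'),
...         density[t], velocity[t], temperature[t]), file=fp)
>>> fp.close()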
The resulting file looks like this:
2009-01-31T00:00:00 5.3 275.0 42.1
2009-01-31T01:00:00 5.4 235.0 31.8
2009-01-31T02:00:00 3.3 278.3 45.9
2009-01-31T03:00:00 7.2 285.7 44.3
2009-01-31T04:00:00 6.3 285.2 50.1
This simple approach works in many situations.
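Reading such a file back shows why the contents are hard to remember: the reader has to hard-code the column order. The following is a minimal sketch that assumes the columns are time, density, velocity, and temperature, as in the writer above; it writes a small sample into a temporary directory first so that it runs on its own.

```python
import os
import tempfile
from datetime import datetime

# Sample contents in the same layout as the file shown above.
sample = (
    "2009-01-31T00:00:00 5.3 275.0 42.1\n"
    "2009-01-31T01:00:00 5.4 235.0 31.8\n"
    "2009-01-31T02:00:00 3.3 278.3 45.9\n"
)

path = os.path.join(tempfile.mkdtemp(), "solarwind.dat")
with open(path, "w") as fp:
    fp.write(sample)

records = []
with open(path) as fp:
    for line in fp:
        if not line.strip():
            continue  # skip blank lines
        # Nothing in the file says which column is which; the order is assumed here.
        tstr, dens, vel, temp = line.split()
        t = datetime.strptime(tstr, "%Y-%m-%dT%H:%M:%S")
        records.append((t, float(dens), float(vel), float(temp)))
```

If a column is added or reordered later, every reader written this way has to change, which is the "hard to change file format" drawback listed above.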
irfpy-compatible data file¶
irfpy.util.datafile provides one solution that meets my data file criteria. irfpy.util.datafile.Datafile is an implementation.
Creating data file¶
It is simple: instantiate the Datafile class.
>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile(version="1.0")
Now the data file already has a version and a creation date. These are generated automatically and dumped as a header.
>>> print(df.dumps())
### HEADER : 2
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
### DATA : 0
### TRAIL : 0
The dumps() method dumps the data into a string, and the dump() method dumps the data into a file. This follows the convention used by pickle.
Now look at the dumped contents.
Three hashes (“###”) separate the header, data, and trail sections.
The number after the colon (2 here) is the number of header entries.
At this point only the default header entries are set.
Each header line starts with a single hash (“#”).
Note
Remember that every entry is “line-oriented”, so the one-line-per-entry principle should be followed. This means that the strings for the header, data, and trail must not contain “\n”.
You can then add header and trail entries as you like. For example,
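One way to enforce the one-line rule before handing a value to the data file is a small guard function. The helper below is hypothetical and not part of irfpy; it only illustrates the check.

```python
def ensure_single_line(value):
    """Return value as a string, refusing anything that spans several lines.

    Hypothetical helper for illustration; irfpy does not provide it.
    """
    text = str(value)
    if "\n" in text or "\r" in text:
        raise ValueError("entry must be a single line: %r" % text)
    return text


ok = ensure_single_line("MEX")

try:
    ensure_single_line("line one\nline two")
    rejected = False
except ValueError:
    rejected = True
```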
>>> df.add_header('SPACECRAFT', 'MEX')
>>> df.add_header('SENSOR', 'IMA')
>>> df.add_trail('LICENSE', 'BSD')
>>> print(df.dumps())
### HEADER : 4
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
# SPACECRAFT : MEX
# SENSOR : IMA
### DATA : 0
### TRAIL : 1
# LICENSE : BSD
Now four header entries and one trail entry are shown. Only the data is still missing.
Add data into a data file¶
You may use add_data() to add the data.
>>> df.add_data('FLUX', '0 2009-01-03T18:05:33 2.8 1.9 4.1 7.3')
>>> df.add_data('FLUX', '1 2009-01-03T18:08:45 2.7 2.3 3.7 6.8')
>>> df.add_data('FLUX', '2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0')
The first argument is the key, “FLUX”, and the second is the data, which should be a single-line string. Three entries are added under the key “FLUX”. The contents look as follows.
>>> print(df.dumps())
### HEADER : 4
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
# SPACECRAFT : MEX
# SENSOR : IMA
### DATA : 1
## DATA : FLUX : 3
0 2009-01-03T18:05:33 2.8 1.9 4.1 7.3
1 2009-01-03T18:08:45 2.7 2.3 3.7 6.8
2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0
### TRAIL : 1
# LICENSE : BSD
The data with the key “FLUX” has three entries. You may add multiple datasets.
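Since each data entry must be one string, numeric records have to be formatted before they are added. A minimal sketch follows, assuming the layout used above (index, time stamp, then flux values with one decimal); the function name is hypothetical.

```python
from datetime import datetime


def format_flux_line(index, time, fluxes):
    """Join an index, a time stamp, and flux values into one space-separated line."""
    fields = [str(index), time.strftime("%Y-%m-%dT%H:%M:%S")]
    fields.extend("%.1f" % f for f in fluxes)
    return " ".join(fields)


line = format_flux_line(0, datetime(2009, 1, 3, 18, 5, 33), [2.8, 1.9, 4.1, 7.3])
```

The resulting string has the same form as the second argument passed to add_data() in the examples above.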
Now the tutorial data file is complete. You may dump it into tutdf_v1.0.txt.
>>> with open('tutdf_v1.0.txt', 'w') as fp:
...     df.dump(fp)
Reading data¶
Reading the data is very simple. We will read the file tutdf_v1.0.txt.
Try looking at it with your preferred pager.
The file is read with irfpy.util.datafile.
>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile()
>>> df.readfile(open('tutdf_v1.0.txt'))
Now the data is read and loaded into df.
You can access the data by wrapping df with irfpy.util.datafile.DatafileReader.
>>> dfr = irfpy.util.datafile.DatafileReader(df)
>>> print(dfr.get_header('VERSION'))
1.0
>>> print(dfr.get_header('Version'))  # Not in the entry
None
>>> print(dfr.get_data('FLUX')[2])
2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0
You can also access the header, data, and trail directly via the header, data, and trail attributes of the Datafile object. These attributes are OrderedDict objects.
For the data, each key holds a list of strings.
>>> print(df.header['VERSION'])
1.0
>>> print(df.data['FLUX'][2])
2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0
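Because the data entries come back as plain strings, converting them to typed values is left to the caller. The sketch below parses one such line; the column layout (index, time stamp, flux values) is an assumption taken from the tutorial data, and the function name is hypothetical.

```python
from datetime import datetime


def parse_flux_line(line):
    """Split one data line into (index, time, list of flux values)."""
    fields = line.split()
    index = int(fields[0])
    time = datetime.strptime(fields[1], "%Y-%m-%dT%H:%M:%S")
    fluxes = [float(x) for x in fields[2:]]
    return index, time, fluxes


index, time, fluxes = parse_flux_line("2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0")
```

Applied to every string in the list stored under a key, this gives a list of typed records.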
Use numpy.loadtxt or genfromtxt¶
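One benefit is that the data file is compatible with loadtxt and genfromtxt, provided that it contains only a single data block. Both functions skip lines starting with “#” as comments by default, so only the data lines are parsed.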
>>> import numpy as np
>>> dat = np.genfromtxt(open('tutdf_v1.0.txt'), usecols=(0, 2, 3, 4, 5))
>>> print(dat)
[[ 0. 2.8 1.9 4.1 7.3]
 [ 1. 2.7 2.3 3.7 6.8]
 [ 2. 2.6 1.9 2.8 9. ]]
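When a file holds more than one data block, the blocks would be mixed together if the whole file were passed to numpy. One possible workaround is to pick out the lines of a single block first. The sketch below is not irfpy code: it relies on the “## DATA : KEY : N” line seen in the dump above, and the two-block sample (including the “COUNT” key) is an assumed layout by analogy with the single-block file.

```python
def extract_block(lines, key):
    """Return the data lines of the block named key.

    Uses the entry count N from the '## DATA : KEY : N' line.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i, line in enumerate(lines):
        if not line.startswith("## "):
            continue  # section separators ('###') and data lines are skipped
        parts = [p.strip() for p in line.split(":")]
        if len(parts) == 3 and parts[0] == "## DATA" and parts[1] == key:
            count = int(parts[2])
            return lines[i + 1:i + 1 + count]
    raise KeyError(key)


# Assumed layout of a file with two data blocks.
sample = """\
### HEADER : 2
# VERSION : 1.0
# CREATION_DATE : 2013-03-08T09:56:22
### DATA : 2
## DATA : FLUX : 2
0 2009-01-03T18:05:33 2.8 1.9
1 2009-01-03T18:08:45 2.7 2.3
## DATA : COUNT : 2
0 10 20
1 11 21
### TRAIL : 0
"""

flux_lines = extract_block(sample.splitlines(), "FLUX")
count_lines = extract_block(sample.splitlines(), "COUNT")
```

The returned list of strings can then be given to numpy.loadtxt, which accepts a list of lines as well as a file.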
readheader() reads only the header information.
This is a quick way of getting the meta information on a file.
>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile()
>>> df.readheader(open('tutdf_v1.0.txt'))
Note
So far, readtrail() assumes that the file pointer is already at the correct position, so it only works after readdata().
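To illustrate why a header-only read is quick, and why the position of the file pointer matters, here is a minimal pure-Python sketch of such a reader. It is based only on the dump layout shown earlier and is not irfpy's implementation; the function name is hypothetical.

```python
import io
from collections import OrderedDict


def read_header(fp):
    """Read only the header section from an open text file."""
    first = fp.readline()
    label, count = [s.strip() for s in first.split(":", 1)]
    if label != "### HEADER":
        raise ValueError("not a header line: %r" % first)
    header = OrderedDict()
    for _ in range(int(count)):
        line = fp.readline()
        # Drop the leading '#', then split at the first ' : ' only,
        # because values such as time stamps contain colons themselves.
        key, value = line.lstrip("#").split(" : ", 1)
        header[key.strip()] = value.strip()
    return header


sample = io.StringIO(
    "### HEADER : 2\n"
    "# VERSION : 1.0\n"
    "# CREATION_DATE : 2013-03-08T09:56:22\n"
    "### DATA : 0\n"
    "### TRAIL : 0\n"
)
header = read_header(sample)
next_line = sample.readline()  # the pointer now sits at the data section
```

Only the first few lines are consumed, and the file pointer is left at the start of the data section. A reader for a later section that starts from the current position therefore depends on the earlier sections having been read first.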
Use case: update the header¶
Updating the file format or contents is a common need. Updating the header is a simple case of this.
Suppose you want to add “START_TIME” and “STOP_TIME” entries to the header, and also remove the “SENSOR” entry. Changing how the Datafile is written is very simple: just modify the code! Remember to update the version number, say to “1.1”.
>>> import irfpy.util.datafile
>>> df = irfpy.util.datafile.Datafile(version="1.1")
>>> df.add_header('SPACECRAFT', 'MEX')
>>> df.add_header('START_TIME', '2009-01-03T18:05:00')
>>> df.add_header('STOP_TIME', '2009-01-03T18:12:00')
>>> df.add_data('FLUX', '0 2009-01-03T18:05:33 2.8 1.9 4.1 7.3')
>>> df.add_data('FLUX', '1 2009-01-03T18:08:45 2.7 2.3 3.7 6.8')
>>> df.add_data('FLUX', '2 2009-01-03T18:11:57 2.6 1.9 2.8 9.0')
>>> df.add_trail('LICENSE', 'BSD')
>>> with open('tutdf_v1.1.txt', 'w') as fp:
...     df.dump(fp)
The output will be tutdf_v1.1.txt.
For the reading part, I would use the following to handle both versions.
>>> # Read v1.0 file.
>>> df_v10 = irfpy.util.datafile.Datafile()
>>> df_v10.readheader(open('tutdf_v1.0.txt'))
>>> dfr_v10 = irfpy.util.datafile.DatafileReader(df_v10, missing_return="UNKNOWN")
>>> # Read v1.1 file.
>>> df_v11 = irfpy.util.datafile.Datafile()
>>> df_v11.readheader(open('tutdf_v1.1.txt'))
>>> dfr_v11 = irfpy.util.datafile.DatafileReader(df_v11, missing_return="UNKNOWN")
>>> print(dfr_v10.get_header('VERSION'))
1.0
>>> print(dfr_v10.get_header('START_TIME'))  # Only in v1.1
UNKNOWN
>>> print(dfr_v11.get_header('SPACECRAFT'))
MEX
>>> print(dfr_v11.get_header('SENSOR'))  # Only in v1.0
UNKNOWN
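The same idea, a default value for entries that a given version lacks, can be tried out without irfpy. In the sketch below, plain dicts stand in for the headers of the two files, and dict.get() plays the role that missing_return plays above; the describe() function is hypothetical.

```python
# Plain dicts standing in for the headers of the two file versions.
header_v10 = {"VERSION": "1.0", "SPACECRAFT": "MEX", "SENSOR": "IMA"}
header_v11 = {
    "VERSION": "1.1",
    "SPACECRAFT": "MEX",
    "START_TIME": "2009-01-03T18:05:00",
    "STOP_TIME": "2009-01-03T18:12:00",
}


def describe(header):
    """Build a one-line summary that works for either version."""
    # Absent keys yield "UNKNOWN" instead of raising an error.
    sensor = header.get("SENSOR", "UNKNOWN")
    start = header.get("START_TIME", "UNKNOWN")
    return "v%s sensor=%s start=%s" % (header["VERSION"], sensor, start)


summary_v10 = describe(header_v10)
summary_v11 = describe(header_v11)
```

The reading code does not branch on the version number at all; it only needs a sensible fallback for each entry that may be absent.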