Monday, February 27, 2012

Read an ascii file (cont')


In this message we'll see other ways to do the same job than in the previous post: read an ascii file.

In the previous post we read a file of the form:

0.000000 0.000000
0.095200 0.095056
0.190400 0.189251
0.285599 0.281733
0.380799 0.371662
0.475999 0.458227
0.571199 0.540641
0.666398 0.618159
0.761598 0.690079
0.856798 0.755750
...
using 2 lists, one for each column. It can be easier to manage only one 2D variable containing the whole table. Easier to apply filter on it, for example.
We first read the whole file into a single variable:
f2 = open('datas.dat', 'r')
lines = f2.readlines()
f2.close()

One can alternatively use the with formulation, which close the file automatically once used:
with open('datas.dat', 'r') as f2:
    lines = f2.readlines()

lines is not usable as it is, as it contains the lines of the file in string form, without separating the 2 values...

We now define data as a list, which will contain the lines of the file in the right format:
data = []

Now, we loop on the lines to copy the values, but after splitting the line:
for line in lines:
    p = line.split()
    data.append(p)

One can do the 4 previous commands in a single one, using list comprehension:
data = [line.split() for line in lines]

Let's have a look at the first element of the list data:
print(data[0])
['0.000000', '0.000000']

Ok, the value of x and y are not anymore in a single variable, but to really use them, we have to:
  1. transform the list into a numpy array, to allow operations on the values.
  2. transform this into floating point.
The 2 operations can be performed one after the other:

data2 = np.array(data)
data2 = data2.astype(float)

or in a single command (notice the f in asfarray, to transform into float):
data2 = np.asfarray(data)

The numpy arrays have a lot of properties and method. One tells us the shape of the array:
print(data2.shape)
(100, 2)

To access the distinct columns, use:
x = data2[:,0]
y = data2[:,1]

As the array is a numpy object, one can perform operations on it:
z = (data2[:,0]**2 + data2[:,1]**2)**0.5

Numpy includes functions to read ascii files and return float arrays directly. The following is then very compact and efficient:
data3 = np.loadtxt('datas.dat')


An even more complete tool to read ascii file is the following:
data4 = np.genfromtxt('datas.dat')
It allows to read formated files, deal with undefined values, etc...
One can even ask for this function to automatically put the result in 2 different variables:
x, y = np.loadtxt('datas.dat', unpack = True)
x, y = np.genfromtxt('datas.dat', unpack = True)

Once you have the data2, data3, or data4, you can plot it (we'll see more on plots latter):
plt.plot(data2[:,0], data2[:,1])

A small comment on the order of the elements in arrays in Python:
There is two ways arrays can be stored: row- or column major. It has a direct impact on the way one has to loop on the arrays. IDL is like Fortran (column major) and Python is like C (row major). It means that in Python, as you move linearly through the memory of an array, the second dimension (rightmost) changes the fastest, while in IDL the first (leftmost) dimension changes the fastest.
Consequence on the loop order in Python:

for i in range(100):
   for j in range(2):
         ... data2[i,j] ...

Another comment on boundaries in arrays:
print(data2[0:4, 0])
[ 0.        0.0952    0.1904    0.285599]
Some would expect the answer to be a 5 elements sub-array. The last index of 0:4 is not included. The rule is that a[n:m] return the elements between n and m-1.

Something quite powerful: accessing the latest elements of an array:
print(data2[-1, :])
[ 9.424778  0.      ]
print(data2[-5:-1, :])
[[ 9.043979  0.371662]
 [ 9.139179  0.281733]
 [ 9.234378  0.189251]
 [ 9.329578  0.095056]]

The np.genfromtxt method allows to read formatted column tables, as for example the following file:
#  Line      Iobs    lambda  relat_error Obs_code
H  1  4861A 1.00000    4861. 0.08000  Anabel                               
H  1  6563A 2.8667     6563. 0.19467  Anabel                               
H  1  4340A 0.4933     4340. 0.03307  Anabel                               
H  1  4102A 0.2907     4102. 0.02229  Anabel                               
H  1  3970A 0.1800     3970. 0.01253  Anabel                               
N  2  6584A 2.1681     6584. 0.08686  Anabel                               
N  2 121.7m 0.00446 1217000. 0.20000  Liu                                  
O  1  6300A 0.0147     6300. 0.00325  Anabel                               
TOTL  2326A 0.07900    2326. 0.20000  Adams                                
C  2 157.6m 0.00856 1576000. 0.20000  Liu                                  
O  1 63.17m 0.13647  631700. 0.10000  Liu                                  
O  1 145.5m 0.00446 1455000. 0.200    Liu                                  
TOTL  3727A 0.77609    3727. 0.200    Torres-Peimbert                      
S II  4070A 0.06174    4070. 0.200    Torres-Peimbert                      
S II  4078A 0.06174    4078. 0.200    Torres-Peimbert                      

read using this command:
obs  = np.genfromtxt('observations.dat', skip_header=1,
                     dtype=["a11","float","float","float","a2","a2"],
                     delimiter=[11,8,9,8,2,2],
                     names = ['label', 'i_obs', 'lambda', 'e_obs', 'na', 'observer'],
                     usecols = (0,1,2,3,5)
                     )
The skip_header tells not to read the first line. The dtype describes the type of the 6 columns. delimiter is the list of the sizes (in characters) of each column. The names are used to identify the columns. The usecols is used to specify which columns to output (here I don't need the 5th column, made of 2 spaces before the observer name)
The data are accessible using for example:
obs['i_obs']


To learn more in arrays indexing:
To learn more about list comprehension:
More about np.loadtxt, np.genfromtxt:

No comments:

Post a Comment