Python-Astro: Read an ascii file (cont')

In this message we'll see other ways to do the same job than in the previous post: read an ascii file.

In the previous post we read a file of the form:

0.000000 0.000000
0.095200 0.095056
0.190400 0.189251
0.285599 0.281733
0.380799 0.371662
0.475999 0.458227
0.571199 0.540641
0.666398 0.618159
0.761598 0.690079
0.856798 0.755750
...
using 2 lists, one for each column. It can be easier to manage only one 2D variable containing the whole table. Easier to apply filter on it, for example.
We first read the whole file into a single variable:

f2 = open('datas.dat', 'r')
lines = f2.readlines()
f2.close()

One can alternatively use the with formulation, which close the file automatically once used:

with open('datas.dat', 'r') as f2:

lines = f2.readlines()

lines is not usable as it is, as it contains the lines of the file in string form, without separating the 2 values...

We now define data as a list, which will contain the lines of the file in the right format:
data = []

Now, we loop on the lines to copy the values, but after splitting the line:
for line in lines:
p = line.split()
data.append(p)

One can do the 4 previous commands in a single one, using list comprehension:
data = [line.split() for line in lines]

Let's have a look at the first element of the list data:

print(data[0])

['0.000000', '0.000000']

Ok, the value of x and y are not anymore in a single variable, but to really use them, we have to:

transform the list into a numpy array, to allow operations on the values.
transform this into floating point.

The 2 operations can be performed one after the other:

data2 = np.array(data)
data2 = data2.astype(float)

or in a single command (notice the f in asfarray, to transform into float):
data2 = np.asfarray(data)

The numpy arrays have a lot of properties and method. One tells us the shape of the array:
print(data2.shape)
(100, 2)

To access the distinct columns, use:

x = data2[:,0]

y = data2[:,1]

As the array is a numpy object, one can perform operations on it:

z = (data2[:,0]**2 + data2[:,1]**2)**0.5

Numpy includes functions to read ascii files and return float arrays directly. The following is then very compact and efficient:
data3 = np.loadtxt('datas.dat')

An even more complete tool to read ascii file is the following:
data4 = np.genfromtxt('datas.dat')
It allows to read formated files, deal with undefined values, etc...
One can even ask for this function to automatically put the result in 2 different variables:

x, y = np.loadtxt('datas.dat', unpack = True)

x, y = np.genfromtxt('datas.dat', unpack = True)

Once you have the data2, data3, or data4, you can plot it (we'll see more on plots latter):
plt.plot(data2[:,0], data2[:,1])

A small comment on the order of the elements in arrays in Python:
There is two ways arrays can be stored: row- or column major. It has a direct impact on the way one has to loop on the arrays. IDL is like Fortran (column major) and Python is like C (row major). It means that in Python, as you move linearly through the memory of an array, the second dimension (rightmost) changes the fastest, while in IDL the first (leftmost) dimension changes the fastest.
Consequence on the loop order in Python:

for i in range(100):
   for j in range(2):

         ... data2[i,j] ...

Another comment on boundaries in arrays:

print(data2[0:4, 0])

[ 0. 0.0952 0.1904 0.285599]
Some would expect the answer to be a 5 elements sub-array. The last index of 0:4 is not included. The rule is that a[n:m] return the elements between n and m-1.

Something quite powerful: accessing the latest elements of an array:

print(data2[-1, :])

[ 9.424778 0. ]

print(data2[-5:-1, :])

[[ 9.043979 0.371662]
[ 9.139179 0.281733]
[ 9.234378 0.189251]
[ 9.329578 0.095056]]

The np.genfromtxt method allows to read formatted column tables, as for example the following file:
# Line      Iobs    lambda relat_error Obs_code
H 1 4861A 1.00000    4861. 0.08000 Anabel
H 1 6563A 2.8667     6563. 0.19467 Anabel
H 1 4340A 0.4933     4340. 0.03307 Anabel
H 1 4102A 0.2907     4102. 0.02229 Anabel
H 1 3970A 0.1800     3970. 0.01253 Anabel
N 2 6584A 2.1681     6584. 0.08686 Anabel
N 2 121.7m 0.00446 1217000. 0.20000 Liu
O 1 6300A 0.0147     6300. 0.00325 Anabel
TOTL 2326A 0.07900    2326. 0.20000 Adams
C 2 157.6m 0.00856 1576000. 0.20000 Liu
O 1 63.17m 0.13647 631700. 0.10000 Liu
O 1 145.5m 0.00446 1455000. 0.200    Liu
TOTL 3727A 0.77609    3727. 0.200    Torres-Peimbert
S II 4070A 0.06174    4070. 0.200    Torres-Peimbert
S II 4078A 0.06174    4078. 0.200    Torres-Peimbert

read using this command:
obs = np.genfromtxt('observations.dat', skip_header=1,
                     dtype=["a11","float","float","float","a2","a2"],
                     delimiter=[11,8,9,8,2,2],
                     names = ['label', 'i_obs', 'lambda', 'e_obs', 'na', 'observer'],
                     usecols = (0,1,2,3,5)
                     )
The skip_header tells not to read the first line. The dtype describes the type of the 6 columns. delimiter is the list of the sizes (in characters) of each column. The names are used to identify the columns. The usecols is used to specify which columns to output (here I don't need the 5th column, made of 2 spaces before the observer name)
The data are accessible using for example:
obs['i_obs']

To learn more in arrays indexing:

http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

To learn more about list comprehension:

More about np.loadtxt, np.genfromtxt:

Python-Astro

Monday, February 27, 2012

Read an ascii file (cont')

No comments:

Post a Comment