Saturday, May 5, 2012

Playing with arrays: slicing, sorting, filtering, where function, etc.

You can also read the more recent post on Numpy here: http://python-astro.blogspot.mx/2014/08/introduction-to-numpy.html


Most of the data we have to manage are in form of arrays. That's why it's quite always necessary to import the numpy library to easily deal with tables, arrays, images, cubes, etc...
In this post I will discuss some of the methods and functions one may have to use in this context.

In Python, one can use lists, tuples and dictionaries to put different elements together. They even can contain elements of different types. But nothing better than numpy arrays to really "play" with the data, extract subsets, combine them using arithmetic operations.

We already discuss a little some array commands in a previous post: read-ascii-file-cont. Here we are more complete, but not exhaustive, as numpy is a whole world... Have a look at the reference guide: http://docs.scipy.org/doc/numpy/reference/ and other references at the end of this post.

Create arrays

a = np.array([1, 2, 3, 7, 5])
a
array([1, 2, 3, 7, 5])

numpy's arrays cannot contain values of different types, the values are transformed if necessary, from integer into real, and from real into string:
b = np.array([1, 2, 3, 4.])
b
array([ 1.,  2.,  3.,  4.])
c = np.array([1, 2, 3., '4'])
c
array(['1', '2', '3', '4'],
      dtype='|S1')

shapes and sizes of the arrays are obtained using method and functions:
print c.shape
(4,)
print len(c)
4

Different ways to create arrays:
a = np.arange(1, 10, 0.5)
print a
[ 1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   5.5  6.   6.5  7.   7.5  8.
  8.5  9.   9.5]

TAKE CARE, THE LATEST ELEMENT IS NOT WHAT YOU MAY EXPECT...
It's because the first element in Python is indexed by 0. The simple use of np.arange is :
a = np.arange(10)
print a
[0 1 2 3 4 5 6 7 8 9]
a starts at 0, and has 10 elements, that's ok.

Easy to create linearly spaced arrays:
c = np.linspace(0, 1, 4)
print c
[ 0.          0.33333333  0.66666667  1.        ]
or log spaced:

c = np.logspace(0, 1, 4)
c
array([  1.        ,   2.15443469,   4.64158883,  10.        ])
which is actually the same as :
c = 10**np.logspace(0, 1, 4)

To create arrays of 0. or 1.:
a = np.zeros(10)
b = np.ones((5, 4))
Here b is a 2D array.
b.shape
(5, 4)

You can use a list (or a tuple, or an array) and replicate it:
a = np.array([1, 2, 3])
b = np.tile(a, 5)
print b
[1 2 3 1 2 3 1 2 3 1 2 3 1 2 3]
c = np.tile(a, (5, 1))
print c
[[1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]]

You can also create 2D arrays from 1D vectors:
x = np.linspace(-1, 1, 100)
y = x
X, Y = np.meshgrid(x, y) #this create 2D arrays containing x and y for each pixel
print X.shape
(100, 100)
Outer products are obtain using... np.outer:
a = np.array([1, 2, 3])
b = np.array([1, 10, 100])
np.outer(a, b)
array([[  1,  10, 100],
       [  2,  20, 200],
       [  3,  30, 300]])
np.outer(b, a)
array([[  1,   2,   3],
       [ 10,  20,  30],
       [100, 200, 300]])

Broadcasting

Python numpy is able to add dimensions to arrays to perform operations:
a = np.array([1, 2, 3, 4])
b = np.ones((6, 4))
a * b
array([[ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.]])
Nice page on this:
http://www.scipy.org/EricsBroadcastingDoc

Indexing and Slicing

The access to elements of arrays is done using []:
a = np.arange(10)
a[5]
0
a[-1]  # last elements
9

In case of N-dims arrays, one can extract slides using :
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 10, 100, 1000])
c = np.outer(a, b)
c.shape
(5, 4)
print c 
[[   1   10  100 1000]
 [   2   20  200 2000]
 [   3   30  300 3000]
 [   4   40  400 4000]
 [   5   50  500 5000]]
print c[:,2]
[100 200 300 400 500]
print c[-1,:]
[   5   50  500 5000]
print c[-1, -1]
5000

There is a lot of methods in the array object, the best page to learn more is there: http://scipy-lectures.github.com/intro/numpy/numpy.html#the-numpy-array-object

Filtering

Sometimes we need to extract elements from arrays following some criteria. There is basically two ways to do this: defining a set of indices where the condition is completed (a la WHERE in IDL), or defining a boolean mask.

Let's first define a 2D array made of 10 times 1000 random values:
a = np.random.random((10, 1000))

We want to extract the values where the 2nd and the 4th 1000-elements vectors are greater than 0.5.

Using the numpy.where function:
w1 = np.where((a[1,:] > 0.5) & (a[3,:] > 0.5))
as the result is a tuple of indices, and in this case the dimension of the result is 1, better extract on the fly the array of indices from the tuple:
w2 = np.where((a[1,:] > 0.5) & (a[3,:] > 0.5))[0]
len(w2)
248  # your result may differ form this value, but must be close to 250.

We can now reduce the initial array to the desired values:
b = a[:, w2]
b.shape
(10, 248)

One can also build an array of boolean:
mask = (a[1,:] > 0.5) & (a[3,:] > 0.5)
which is actually similar to
mask2 =  np.where((a[1,:] > 0.5) & (a[3,:] > 0.5), True, False)

It can be used the same way as before:
b = a[:, mask]
b.shape
(10, 248)

These latter cases result in a 1000-element array, filled with True and False. To know the number of True values, just sum the array:
mask.sum()
248  # your result may differ form this value, but must be close to 250.

One big advantage of the mask technique is that you can combine different masks:
mask1 = a[1,:] > 0.5
mask2 = a[3,:] > 0.5
mask_total = mask1 & mask2
b = a[:, mask_total]

Sorting

a = np.array([1,2,45,2,1,46,7,-1])
b1 = np.sort(a)
ib = np.argsort(a)
b2 = a[ib]
print b1
[-1  1  1  2  2  7 45 46]
print b2 
[-1  1  1  2  2  7 45 46]

More on numpy arrays

http://scipy.org/Numpy_Example_List_With_Doc
http://scipy-lectures.github.com/intro/numpy/numpy.html#the-numpy-array-object

Sunday, April 8, 2012

Plotting


Don't miss the update: "2014 Python lecture on Matplotlib" on this blog: http://python-astro.blogspot.mx/2014/09/2014-python-lecture.html

Plotting is certainly one of the most common task one would do in Python (at least for astronomers).
There is various libraries to draw plots in Python, but the mostly used and powerful is perhaps matplotlib.
The plotting part is pyplot, so before any use, one must import it, with the (widely used) alias plt:
import matplotlib.pyplot as plt

The best to see how it works is to dive into the website where there is a lot of examples of plots with the code used to generate them: http://matplotlib.sourceforge.net/, especially the gallery: http://matplotlib.sourceforge.net/gallery.html
In the following we will present a few examples to help you to start.

X-Y plot

Let's say you have 2 vectors x and y and wanna plot y vs. x. Here are some examples of how to do this.
First generate x and y (don't forget to import numpy as np):
x = np.linspace(0., 4 * np.pi, 100)
y = np.sin(x)
plt.plot(x, y)
plt.show()

This latest command is not always necessary, if you run ipython with the --pylab option.
As you can see, python automatically define the axis ranges and draw the plot using a default blue color.
If you want to overplot another function, just call plt.plot again:
plt.plot(x, y**2)
The figure can be cleaned before another plot using
plt.clf()
You may need to draw various figure in different windows at the same time. Every call to
plt.figure()
will start a new figure where the next plot(s) will be drawn. The figures (each one in a separate window) are identified with numbers, the first one being 1.
When calling plt.figure(N), a new figure is created if figure N doesn't exist, and focus will be on figure N otherwise.

The color of the line is defined using c or color keyword:
plt.plot(x, y**3, color = 'red')
plt.plot(x, y**4, c='b')
Abbreviation Color
b blue
g green
r red
c cyan
m magenta
y yellow
k black
w white


Symbols and line style can also be easily defined:
plt.plot(x, y**3, color='r', marker='o', linestyle=':')
Symbol Description
- solid line
-- dashed line
-. dash-dot line
: dotted line
. points
, pixels
o circle symbols
^ triangle up symbols
v triangle down symbols
< triangle left symbols
> triangle right symbols
s square symbols
+ plus symbols
x cross symbols
D diamond symbols
d thin diamond symbols
1 tripod down symbols
2 tripod up symbols
3 tripod left symbols
4 tripod right symbols
h hexagon symbols
H rotated hexagon symbols
p pentagon symbols
| vertical line symbols
_ horizontal line symbols
steps use gnuplot style ‘steps’ # kwarg only

Other useful line properties:

Property Value
alpha alpha transparency on 0-1 scale
antialiased True or False - use antialised rendering
color matplotlib color arg
data_clipping whether to use numeric to clip data
label string optionally used for legend
linestyle one of - : -. -
linewidth float, the line width in points
marker one of + , o . s v x > <, etc
markeredgewidth line width around the marker symbol
markeredgecolor edge color if a marker is used
markerfacecolor face color if a marker is used
markersize size of the marker in points

If you need more control on the markers, better use scatter. For example, if one need ot change the size of the symbol according to a function:
plt.scatter(x, y**3, c=abs(y), marker='s', s=10+abs(y)*100, edgecolors='none')

The labels are set after the plot is done:
plt.xlabel('X')
plt.ylabel('Some trig function')
LaTex fans are welcome:
plt.title(r'Example of LaTex $\alpha_{\beta}$') #Notice the r before the string


One can change the axis ranges afterward:
plt.xlim((0, 10))
plt.ylim((-2, 2))

Multiple plots

To draw multiple plots of the same figure:
plt.subplot(N_y, N_x, N)

for i in np.arange(9):
    plt.subplot(3,3,i+1)
    plt.plot(x, y**i)
    plt.ylim((-2, 2))

log plots

using plt.semilogx, plt.semilogy and plt.loglog
plt.loglog(x)
plt.grid(True, which='minor')

Contours

Contour plots are done with plt.contour and plt.contourf.
x = np.linspace(-1, 1, 100)
y = x
X, Y = np.meshgrid(x, y) #this create 2D arrays containing x and y for each pixel
dist =  (X**2 + Y**2)**0.5
plt.contourf(X, Y, dist)
plt.colorbar()
CS = plt.contour(X, Y, dist, colors ='black', linewidths = 4)
plt.clabel(CS) #to print label on each contour



Saving the plot

The result of the plot can be save in PDF, EPS, JPG, BMP format, using:
plt.savefig('fig1.pdf')

More on plotting:

http://scipy-lectures.github.com/intro/matplotlib/matplotlib.html


Sunday, March 25, 2012

Play with FITS files

The same lecture is now in the Notebook format, see there: http://python-astro.blogspot.mx/2014/09/interacting-with-files-reading-writing.html


What is the FITS format?

The FITS format is the most popular way to save and interchange astronomical data. The files are organized in units each of which contains a human readable header and a data. This structure is refereed as HDUs (Header/DATA Unit).

A FITS file can contain one or more HDUs, the first of which is called "primary" and the rest are called "extensions". The primary HDU usually contains 1D spectrum, 2D image or 3D data cube, although any dimension from 0 to 999 are possible. The data are 1, 2 or 4 bytes integers or 4 or 8 bytes real numbers.

The extensions can contain or arrays as in the primary HDU or ascii tables or binary tables.
If a FITS file contains only tables, it primary HDU does not contain data, but only header.

Both headers and data in a FITS file are organized in blocs of 2880 bytes. The header contain 80 bytes lines each of which consists of a keyword of 8 bytes followed in most of the cases by '= ' in the position 9 and 10 and then the value of the keyword. The rest of the line is a comment string beginning with '/'. Each header begins with the following lines

SIMPLE  =                    T / file conforms to FITS standard
BITPIX  =                   16 / number of bits per data pixel
NAXIS   =                    2 / number of data axes
NAXIS1  =                  440 / length of data axis 1
NAXIS2  =                  300 / length of data axis 2

which defines the format of the file as standard FITS, the data format and the dimensions of the stored data.

One block of 2880 bytes contains 36 lines of 80 characters per line. The header can have several blocks of 36 lines. The last block is identified by the presence of the keyword 'END' The next 2880 bytes block contains the first part of the data. The empty lines after 'END' keyword are filled with blanks and the unused bytes from the end of the data to the end of the 2880 bytes block are filled with NULLs.

Full description of the FITS format can be found at http://fits.gsfc.nasa.gov/fits_primer.html

PyFITS

As one can see, reading FITS files is not as simple as reading columns of data. Fortunately, there is a python package called PyFITS which makes the task much easier.

The package can be downloaded from

http://www.stsci.edu/institute/software_hardware/pyfits

The documentation can be found on the same web page.

Although it is possible to download and install the package it is much easier to install it with

easy_install pyfits
or
pip pyfits

Information on easy_install: HERE and on pip: HERE.
I did that on both Mac and Linux machines and it working nicely.

How to read FITS file

First, as usual, one has to import the pyfits package

import pyfits

then the file is read with

hdulist = pyfits.open('n10017o.fits')

where I used one of my FITS files from San Pedro Martir echelle spectrograph. The file can be downloaded from HERE.

The result hdulist is a list of HDU objects. In the case of a simple file, there is only one primary HDU so the list contains only one element

len(hdulist)
1

The information on what the file contains can be obtained by calling the info() method:

hdulist.info()
Filename: n10017o.fits
No.    Name         Type      Cards   Dimensions   Format
0    PRIMARY     PrimaryHDU      62   (2154, 2048)   int16


The table said that there is only a primary HDU which contains 2154 X 2048 image with data stored in 2 bytes (16 bits) integers.

As described above, the HDU (header/data unit) contains header and data. The header is a dictionary. To see what keywords were used in the header one can do:

hdulist[0].header.keys()
['SIMPLE',
 'BITPIX',
 'NAXIS',
 'NAXIS1',
 'NAXIS2',
 'EXTEND',
 'COMMENT',
 'BZERO',
 'BSCALE',
 'EXPTIME',
 'DETECTOR',
 'ORIGIN',
 'OBSERVAT',
 'TELESCOP',
 'LATITUDE',
 'LONGITUD',
 'ALTITUD',
 'SECONDAR',
 'TIMEZONE',
 'OBSERVER',
 'OBJECT',
 'INSTRUME',
 'GAINMODE',
 'FILTER',
 'IMGTYPE',
 'EQUINOX',
 'ST',
 'UT',
 'JD',
 'DATE-OBS',
 'CCDSUM',
 'RA',
 'DEC',
 'AH',
 'AIRMASS',
 'TMMIRROR',
 'TSMIRROR',
 'TAIR',
 'XTEMP',
 'HUMIDITY',
 'ATMOSBAR',
 'WIND',
 'WDATE',
 'DATE',
 'NAMPS',
 'CCDNAMPS',
 'AMPNAME',
 'CREATOR',
 'VERSION',
 'HISTORY']

and to get the value of a given keyword

hdulist[0].header['OBJECT'] 
'BD +59 363'

The header can be printed as it apears in the file by

print hdulist[0].header.ascardlist()
SIMPLE  =                    T / conforms to FITS standard                     
BITPIX  =                   16 / array data type                               
NAXIS   =                    2 / number of array dimensions                    
NAXIS1  =                 2154 / length of data axis 1                         
NAXIS2  =                 2048 / length of data axis 2                         
EXTEND  =                    T                                                 
COMMENT FITS (Flexible Image Transport System) format is defined in 'Astronomy 
COMMENT and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H  
BZERO   =                32768 / BZERO                                         
BSCALE  =                    1 / BSCALE                                        
EXPTIME =                  0.0 / Integration Time, sec.                        
DETECTOR= 'e2vm2 E2V-4240'     / CCD Type                                      
ORIGIN  = 'UNAM    '           / OAN SPM, IA-UNAM                              
OBSERVAT= 'SPM     '           / Observatory                                   
TELESCOP= '2.12m   '           / Telescope                                     
LATITUDE= '+31:02:39'          / Latitude                                      
LONGITUD= '115:27:49'          / Longitud                                      
ALTITUD =                 2800 / altitud                                       
SECONDAR=                   -1 / F/ Secondary type                             
TIMEZONE=                    8 / Time Zone                                     
OBSERVER= 'Leonid  '           / Observer's Name                               
OBJECT  = 'BD +59 363'         / Object                                        
INSTRUME= 'Echelle '           / Instrument                                    
GAINMODE=                    1 / Gain factor in the CCD                        
FILTER  = 'None    '           / Filter                                        
IMGTYPE = 'zero    '           / Image Type                                    
EQUINOX =               2011.7 / Equinox                                       
ST      = '02:19:51.2'         / Sideral Time                                  
UT      = '11:28:28'           / Universal Time                                
JD      =            2455803.5 / Julian Date                                   
DATE-OBS= '2011-08-30'         / Observation Date UTM                          
CCDSUM  = '1 1     '           / Binning [ Cols:Rows ]                         
RA      = ' 02:13:22.2'        / Right Ascension                               
DEC     = ' 52''14''44.0'      / Declination                                   
AH      = '  00:06:28.0'       / Hour Angle                                    
AIRMASS =                1.073 / Airmass                                       
TMMIRROR=                    0 / Primary Mirror Temperature (celsius degree)   
TSMIRROR=                    0 / Secundary Mirror Temperature (celsius degree) 
TAIR    =                    0 / Internal Telescope Air Temperature (celsius deg
XTEMP   =                 14.6 / Exterior Temperature (celsius degree)         
HUMIDITY=                 44.0 / % external Humidity                           
ATMOSBAR=                731.7 / Atmosferic Presure in mb                      
WIND    = 'S at 24.1 km/h'     / Wind Direction                                
WDATE   = '11:28:10, 08/30/11' / Weather Acquisition Date (Local time)         
DATE    = '2011-08-30T11:28:29' / file creation date (YYYY-MM-DDThh:mm:ss UT)  
NAMPS   =                    1 / Number of Amplifiers                          
CCDNAMPS=                    1 / Number of amplifiers used                     
AMPNAME = '1 Channel'          / Amplifier name                                
CREATOR = 'Python Oan ccds'    / Name of the software task that created the file
VERSION = '4.12D   '           / Application Software Version                  
COMMENT Visit our weather site http://www.astrossp.unam.mx/weather15           
COMMENT for complete meteorological data of your observation night             
HISTORY bin2fits V1.0                                                          
HISTORY Programmer: Enrique Colorado [ colorado@astrosen.unam.mx ]             
HISTORY Observatorio Astronomico Nacional -UNAM                                
HISTORY V1.00 By Arturo Nunez and Colorado >Ported to Python using pyfits      
HISTORY V0.50 By E. Colorado >Added interior mirrors temperatures              
HISTORY V0.49 By E. Colorado >Added BIASSEC parameter                          
HISTORY V0.48 By E. Colorado >Aditional info for autofocus calculations        
HISTORY V0.4 By E. Colorado >Now we include timezone, and remove lat. sign     
HISTORY V0.3 By E. Colorado >Now we include weather data                       
HISTORY V0.2 By E. Colorado >General OAN Working Release                       

The data in the file are accessible with

data = hdulist[0].data

and can be seen with [don't forget to import matplotlib.pyplot as plt before running this]:

plt.imshow(data)

A column from the data can be plotted with

plt.plot(data[:,1000])

where I am plotting the column number 1000. In the same way a line from the data is plotted with:

plt.plot(data[1000,:])

The data are numpy object so all manipulations are available.

Some more on displaying images

The main matplotlib function to display images is imshow. It has several parameters of which te most important are the one which controls the color scheme and the ones which control the dynamic range.

plt.imshow(data, cmap=cm.gray, vmin=1000, vmax=10000)

cmap controls the pseudocolor map (same as color= in IDL). The predefined color maps can be seen on http://matplotlib.sourceforge.net/examples/pylab_examples/show_colormaps.html

vmin and vmax controls the range of values which are mapped in [0,1] range where 0 is black (the darkest color) and 1 is white (the lightest color).

origin = 'upper' | 'lower', place the pixel with coordinates 0,0 at the upper or lower corner of the plot

extent = (xmin,xmax,ymin,ymax) - This parameter defines the numbers written on the axes. No changes on the image.

Using FITS tables

For this example I'll use a spectrum obtain with the high dispersion camera on board of IUE.
The file is opened as usual:
hdulist = pyfits.open('swp04345.mxhi')

Download the file from THERE.

but now hdulist has 2 elements (2 header/data units):

len(hdulist)
2

We can see that the primary header has dimension (), son does not contain any data. The data are in the extension.

hdulist.info()
Filename: swp04345.mxhi
No.    Name         Type      Cards   Dimensions   Format
0    PRIMARY     PrimaryHDU     421   ()           uint8
1    MEHI        BinTableHDU     61   60R x 17C    [1B, 1I, 1D, 1I, 1D, 1E, 1E, 768E, 768E, 768E, 768I, 768E, 768E, 1I, 1I, 1E, 7E]

The first header contains the minimal infirmation:

print hdulist[0].header.ascardlist()[:5]
SIMPLE  =                    T / Standard FITS Format                          
BITPIX  =                    8 / Binary data                                   
NAXIS   =                    0 / Two-dimensional image                         
EXTEND  =                    T / Extensions are present                        
TELESCOP= 'IUE     '           / International Ultraviolet Explorer            

The number of axis is 0 which means there is no data block in the primary HDU. The header of the second HDU begins with the keyword XTENSION and with the specification of the data

print hdulist[1].header.ascardlist()[:5]
XTENSION= 'BINTABLE'           / Binary table extension                        
BITPIX  =                    8 / Binary data                                   
NAXIS   =                    2 / Two-dimensional table array                   
NAXIS1  =                16961 / Width of row in bytes                         
NAXIS2  =                   60 / Number of orders                              

To progress further we need to know what is in the table. As usual, the columns have names and type of the stored data. These information can be obtained using the column attribute of hdulist:

cols = hdulist[1].columns

cols.info
<bound method ColDefs.info of ColDefs(
    name = 'ORDER'; format = '1B'; unit = ' '
    name = 'NPOINTS'; format = '1I'; unit = ' '
    name = 'WAVELENGTH'; format = '1D'; unit = 'ANGSTROM'
    name = 'STARTPIX'; format = '1I'; unit = 'PIXEL'
    name = 'DELTAW'; format = '1D'; unit = 'ANGSTROM'
    name = 'SLIT HEIGHT'; format = '1E'; unit = 'PIXEL'
    name = 'LINE_FOUND'; format = '1E'; unit = 'PIXEL'
    name = 'NET'; format = '768E'; unit = 'FN'
    name = 'BACKGROUND'; format = '768E'; unit = 'FN'
    name = 'NOISE'; format = '768E'; unit = 'FN'
    name = 'QUALITY'; format = '768I'; unit = ' '
    name = 'RIPPLE'; format = '768E'; unit = 'FN'
    name = 'ABS_CAL'; format = '768E'; unit = 'ERG/CM2/S/A'
    name = 'START-BKG'; format = '1I'; unit = 'PIXEL'
    name = 'END-BKG'; format = '1I'; unit = 'PIXEL'
    name = 'SCALE_BKG'; format = '1E'; unit = ' '
    name = 'COEFF'; format = '7E'; unit = ' '
)>


the cols.info returns the names of the columns and the information of their format and units.

The data are available using (this example is NOT the right way of plotting the data, it's just an example) [and don't forget to import numpy as np to have np.arange working]:

data1 = hdulist[1].data
DTs =  data1.ABS_CAL
WLs = data1.WAVELENGTH
DWs = data1.DELTAW
for WL, DW, DT in zip(WLs, DWs, DTs):
    plot(WL + np.arange(len(DT)) * DW, DT)

Writing FITS files.

The creation of a FITS file pass through 4 steps.

1) Creation of numpy array with the data.

x = np.arange(100)
  
2) Creation of the HDU from the data.

hdu = pyfits.PrimaryHDU(x)
   
thus created, the hdu has its basic header and the data.

print hdu.header.ascardlist()
SIMPLE  =                    T / conforms to FITS standard                     
BITPIX  =                   64 / array data type                               
NAXIS   =                    1 / number of array dimensions                    
NAXIS1  =                  100                                                 
EXTEND  =                    T                                                 

3) Adding additional keywords to the header. The automatically created header contains only the required minimum of keywords. If additional keywords are needed they are added with

hdu.header.update('testkey',0.001,'some test value')

print hdu.header.ascardlist()
SIMPLE  =                    T / conforms to FITS standard                     
BITPIX  =                   64 / array data type                               
NAXIS   =                    1 / number of array dimensions                    
NAXIS1  =                  100                                                 
EXTEND  =                    T                                                 
TESTKEY =                0.001 / some test value                               

4) Once all the keywords are ready, the final HDU list have to be created and written to the file:

hdulist = pyfits.HDUList([hdu])
hdulist.writeto('new.fits')
hdulist.close()

Alternative way

Another way to deal with FITS tables is to use the ATpy library, look there:
http://atpy.github.com/index.html

References


To learn more on PyFITS: http://packages.python.org/pyfits/

Monday, February 27, 2012

Read an ascii file (cont')


In this message we'll see other ways to do the same job than in the previous post: read an ascii file.

In the previous post we read a file of the form:

0.000000 0.000000
0.095200 0.095056
0.190400 0.189251
0.285599 0.281733
0.380799 0.371662
0.475999 0.458227
0.571199 0.540641
0.666398 0.618159
0.761598 0.690079
0.856798 0.755750
...
using 2 lists, one for each column. It can be easier to manage only one 2D variable containing the whole table. Easier to apply filter on it, for example.
We first read the whole file into a single variable:
f2 = open('datas.dat', 'r')
lines = f2.readlines()
f2.close()

One can alternatively use the with formulation, which close the file automatically once used:
with open('datas.dat', 'r') as f2:
    lines = f2.readlines()

lines is not usable as it is, as it contains the lines of the file in string form, without separating the 2 values...

We now define data as a list, which will contain the lines of the file in the right format:
data = []

Now, we loop on the lines to copy the values, but after splitting the line:
for line in lines:
    p = line.split()
    data.append(p)

One can do the 4 previous commands in a single one, using list comprehension:
data = [line.split() for line in lines]

Let's have a look at the first element of the list data:
print(data[0])
['0.000000', '0.000000']

Ok, the value of x and y are not anymore in a single variable, but to really use them, we have to:
  1. transform the list into a numpy array, to allow operations on the values.
  2. transform this into floating point.
The 2 operations can be performed one after the other:

data2 = np.array(data)
data2 = data2.astype(float)

or in a single command (notice the f in asfarray, to transform into float):
data2 = np.asfarray(data)

The numpy arrays have a lot of properties and method. One tells us the shape of the array:
print(data2.shape)
(100, 2)

To access the distinct columns, use:
x = data2[:,0]
y = data2[:,1]

As the array is a numpy object, one can perform operations on it:
z = (data2[:,0]**2 + data2[:,1]**2)**0.5

Numpy includes functions to read ascii files and return float arrays directly. The following is then very compact and efficient:
data3 = np.loadtxt('datas.dat')


An even more complete tool to read ascii file is the following:
data4 = np.genfromtxt('datas.dat')
It allows to read formated files, deal with undefined values, etc...
One can even ask for this function to automatically put the result in 2 different variables:
x, y = np.loadtxt('datas.dat', unpack = True)
x, y = np.genfromtxt('datas.dat', unpack = True)

Once you have the data2, data3, or data4, you can plot it (we'll see more on plots latter):
plt.plot(data2[:,0], data2[:,1])

A small comment on the order of the elements in arrays in Python:
There is two ways arrays can be stored: row- or column major. It has a direct impact on the way one has to loop on the arrays. IDL is like Fortran (column major) and Python is like C (row major). It means that in Python, as you move linearly through the memory of an array, the second dimension (rightmost) changes the fastest, while in IDL the first (leftmost) dimension changes the fastest.
Consequence on the loop order in Python:

for i in range(100):
   for j in range(2):
         ... data2[i,j] ...

Another comment on boundaries in arrays:
print(data2[0:4, 0])
[ 0.        0.0952    0.1904    0.285599]
Some would expect the answer to be a 5 elements sub-array. The last index of 0:4 is not included. The rule is that a[n:m] return the elements between n and m-1.

Something quite powerful: accessing the latest elements of an array:
print(data2[-1, :])
[ 9.424778  0.      ]
print(data2[-5:-1, :])
[[ 9.043979  0.371662]
 [ 9.139179  0.281733]
 [ 9.234378  0.189251]
 [ 9.329578  0.095056]]

The np.genfromtxt method allows to read formatted column tables, as for example the following file:
#  Line      Iobs    lambda  relat_error Obs_code
H  1  4861A 1.00000    4861. 0.08000  Anabel                               
H  1  6563A 2.8667     6563. 0.19467  Anabel                               
H  1  4340A 0.4933     4340. 0.03307  Anabel                               
H  1  4102A 0.2907     4102. 0.02229  Anabel                               
H  1  3970A 0.1800     3970. 0.01253  Anabel                               
N  2  6584A 2.1681     6584. 0.08686  Anabel                               
N  2 121.7m 0.00446 1217000. 0.20000  Liu                                  
O  1  6300A 0.0147     6300. 0.00325  Anabel                               
TOTL  2326A 0.07900    2326. 0.20000  Adams                                
C  2 157.6m 0.00856 1576000. 0.20000  Liu                                  
O  1 63.17m 0.13647  631700. 0.10000  Liu                                  
O  1 145.5m 0.00446 1455000. 0.200    Liu                                  
TOTL  3727A 0.77609    3727. 0.200    Torres-Peimbert                      
S II  4070A 0.06174    4070. 0.200    Torres-Peimbert                      
S II  4078A 0.06174    4078. 0.200    Torres-Peimbert                      

read using this command:
obs  = np.genfromtxt('observations.dat', skip_header=1,
                     dtype=["a11","float","float","float","a2","a2"],
                     delimiter=[11,8,9,8,2,2],
                     names = ['label', 'i_obs', 'lambda', 'e_obs', 'na', 'observer'],
                     usecols = (0,1,2,3,5)
                     )
The skip_header tells not to read the first line. The dtype describes the type of the 6 columns. delimiter is the list of the sizes (in characters) of each column. The names are used to identify the columns. The usecols is used to specify which columns to output (here I don't need the 5th column, made of 2 spaces before the observer name)
The data are accessible using for example:
obs['i_obs']


To learn more in arrays indexing:
To learn more about list comprehension:
More about np.loadtxt, np.genfromtxt:

Saturday, February 25, 2012

Read a 2 columns file and plot the result

"Reading and writing files" is available in the 2014 version of the lecture here: http://python-astro.blogspot.mx/2014/09/interacting-with-files-reading-writing.html


In this second post, we will read the file we wrote in the previous one (datas.dat), and use the data to draw a plot.

We will now have to import the plotting package, which is part of matplotlib (we will also need numpy):

import matplotlib.pyplot as plt
import numpy as np
There is a lot of different ways to read a file, depending if it is ASCII or fits or Binary, if we want to extract only some columns, if we know the format of the data, etc... Even for a given data file, for example the one we did last post, I'll present various ways, to illustrate some functions of python.

We now open the file with the 'r' parameter, to specify that we will read it:
f2 = open('datas.dat', 'r')
'r' is optional.

We read the whole file in a single variable, we will deal with it latter. If the file is really big, and/or we only want a small part of it (we are looking for some line, or only one column is of interest) we may have to proceed differently:
lines = f2.readlines()

Now that all the datas are in the lines variable, we can close the file and forget it:
f2.close()

lines  contains the 200 values from the 2 vectors x and y, but we will have to extract this information, as it is now in an unusable form.
First we can ask for the type of the variable:
type(lines)
list
A list is... a list of objects. It can handle objects of different types, even other lists.

We can print the length of the list, it is the number of elements it contains:
len(lines)
100

We can print the first element:
print(lines[0])
0.000000 0.000000 
It seems it is the first line of our file, nice!

But it is a single string...:
print(type(lines[0]))
<type 'str'>

We will have to use a method of the object string to split the line into the 2 values:
print(lines[0].split())
['0.000000', '0.000000']
The result is another list, with 2 elements, each one being a string with the value of x and y corresponding to the line.

We can now access to the value of x, and transform it into a floating point value:
print(float(lines[0].split()[0]))
0.0

We will now loop on all the lines of the file (they are stored in the lines variable) and do the same.
We want to store the result in 2 lists. We first then need to define these lists, that will later be filled by the values of x and y.
x1 = []
y1 = []

The two empty lists are now ready. We can loop on files without to know the number of elements, python take care of this:
for line in lines:
    p = line.split()
    x1.append(float(p[0]))
    y1.append(float(p[1]))

We use p as an intermediate variable, to store the list of 2 strings (like ['0.000000', '0.000000'] for the first line).
append is a method applied to the lists x1 and y1, to insert a new element to the end of the list.

We now have the x and y vectors in the 2 variables x1 and y1.
We can transform the two lists into numpy vectors, to have more flexibility on them:
xv = np.array(x1)
yv = np.array(y1)
We can plot them:
plt.plot(xv, yv)

To see the plot, one last command must be sent:
plt.show()
A window should pop-up with a nice sin(x) plot!

All the previous commands can be stored in a file, called for example Ex2.py and execute by python.
From the shell:
LINUX> python Ex2.py
or from with python
import Ex2
or from ipython
%run Ex2

Here is the script of Ex2.py:

# Need to import the plotting package:
import matplotlib.pyplot as plt

# Read the file.
f2 = open('datas.dat', 'r')
# read the whole file into a single variable, which is a list of every row of the file.
lines = f2.readlines()
f2.close()

# initialize some variable to be lists:
x1 = []
y1 = []

# scan the rows of the file stored in lines, and put the values into some variables:
for line in lines:
    p = line.split()
    x1.append(float(p[0]))
    y1.append(float(p[1]))
   
xv = np.array(x1)
yv = np.array(y1)


# now, plot the data:
plt.plot(xv, yv)

plt.show()



To learn more on lists:
To learn more on string manipulations:
To learn more on plot (but we'll coma back to this latter):

Write vectors as an ASCII file

In this first post we will create 2 vectors and write them into a file.

Python can be used interactively, with classical call to python from the command line or by running iPython. In the following, I will mainly use iPython, which provides a lot of tools to help the user interacting with its variables and programs.

Use copy-paste to execute the commands found in this color in the following, in an interactive python or ipython shell. The outputs from python are in this color.

We can also put all the commands in a file and execute it from python (we'll come back to this latter).

As we want to use vectors and perform operations on them, we need an additional library called numpy.
To import this library, we will write at the very beginning of the session (or the file):
import numpy

Everything depending on numpy will be called using numpy. before its name (Notice the point after numpy). For example, if we need to create an array with 10 elements named a, we do:

a = numpy.arange(10)

Given that it's a little heavy to have to write numpy. a lot of times, we can alias this when importing the package, so that we only need to use np.:

import numpy as np
a = np.arange(10)

You can print out the value of a by just typing "a" and ENTER:
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 

The output can differ if you use iPython:
In [2]: a
Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One can also print a:
print a
[0 1 2 3 4 5 6 7 8 9]

We want to create 2 vectors: x and y = sin(x). We will first define some variables holding the parameters of the vectors:
x_min = 0.
x_max = 3 * np.pi
n_steps = 100

To generate an array of n_steps values ranging from x_min to x_max, use linspace:
x = np.linspace(x_min, x_max, n_steps)

x is a numpy array, which support arithmetic operations like :
y = x**2
It can be used as argument for functions who will apply on each element of the array and return an array of the result. The trigonometrical functions are also available from numpy, so we use np. to access it:
y = np.sin(x)

We now want to store the 2 vectors as 2 columns of an ASCII file.
We first open the file, named 'datas'. We tell python that we will write to it, using 'w':
f = open('datas.dat', 'w')
f is an object (everything in Python is an object ;-) ), and one can play with some function that apply to f or some variable related to f directly by calling them using f. (notice the point after f):
print f.name, f.mode
datas w
We have to loop on the different elements of x and y to write them. The indexes in Python start at 0. The function np.arange(n) returns a list of integer from 0 to n-1, exactly what we need to scan the values of the two arrays x and y.
The loop is classically defined by a for instruction. In Python, the start of a block (for, while, if, etc.. blobk) is delimited by :, and the block is defined by indentation:
for i in np.arange(n_steps): 
    f.write('{0:f} {1:f} \n'.format(x[i], y[i]))
f.close()

We see that the f.write is indented: this is the block (here a single line, but can be very bigger) on which the loop is run. The instruction to close the file f.close() is written at the beginning of the line, no more indentation, so the loop is terminated.
The format to write x[i] and y[i] is '{0:f} {1:f} \n'.
{0:f} is to tell python to write the first variable in format(x[i], y[i]), namely x[i], as floating points.
\n is to go to the next line. We will have a post dedicated to formats latter, you can already have a look at some links below.

One can copy-paste all the previous commands in a python session, or write all in a file, to be run directly.

The python program is there:

#! /usr/bin/env python
#Need to import numpy:

import numpy as np

# define some variables:
x_min = 0.
x_max = 3 * np.pi
n_steps = 100
# create a vector from x_min to x_max, with n_steps elements
x = np.linspace(x_min, x_max, n_steps)
# create a 2nd vector, as a function of x
y = np.sin(x)

# Write x and y into a file
f = open('datas.dat', 'w')
for i in np.arange(n_steps):
# could have been: for xx, yy in zip(x,y):
    f.write('{0:f} {1:f} \n'.format(x[i], y[i]))
f.close()

Save it somewhere with the name Ex1.py.
You can run it from the shell by:
LINUX> python Ex1.py

With the help of the first line #! /usr/bin/env python, once the Ex1.py file is set to executable (chmod a+x Ex1.py), it's possible to run it directly:
LINUX> ./Ex1.py

or from python:
import Ex1
Notice that import actually also executes any instruction in the imported file.

or from ipython
In [1]: %run Ex1

The file "datas.dat" should contains the x and y values, in 2 columns, like this:
0.000000 0.000000
0.095200 0.095056
0.190400 0.189251
0.285599 0.281733
0.380799 0.371662
0.475999 0.458227
0.571199 0.540641
0.666398 0.618159
0.761598 0.690079
0.856798 0.755750
...

To lear more on files:
To learn more on string format: