Monday, August 27, 2012

Decent scientific plots with matplotlib

I think matplotlib is poorly documented and too object-oriented to be immediately usable; it tries to be too many things at once. Besides, the default values and default behaviour are kooky sometimes.
For instance, histogram bar widths: they come out different for different datasets, and you cannot compare two distributions if that's the case. You have to resort to hacks like this:

  hist, bins = np.histogram(data, bins=10)
  width = bins[1] - bins[0]
And I think it's a bloody hack, calling the histogram method from some other module just to be able to call matplotlib's histogram plotting routine. Nevertheless, matplotlib is flexible and, since I already use Python to pull my data from databases, I decided to give it a try when I had to prepare some plots for a review poster.
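Back to the histogram widths: the cleaner way around the hack, I think, is to compute one set of bin edges and reuse it for both datasets, so the bars come out identical (a sketch with made-up data):

```python
import numpy as np

# two datasets with different ranges -- exactly the case where
# matplotlib's per-dataset binning gives differing bar widths
data1 = np.random.normal(0, 1, 500)
data2 = np.random.normal(1, 2, 500)

# one common set of bin edges covering both distributions
bins = np.linspace(min(data1.min(), data2.min()),
                   max(data1.max(), data2.max()), 11)
hist1, edges1 = np.histogram(data1, bins=bins)
hist2, edges2 = np.histogram(data2, bins=bins)
width = bins[1] - bins[0]  # now identical for both datasets
# plt.bar(bins[:-1], hist1, width=width) etc. would now line up
```

The same edges can also be passed straight to plt.hist(data, bins=bins).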
So, what makes an ugly plot look decent? First of all, ticks:

      xMajor_locator = plt.MultipleLocator(plotTitles.xMajorTicks)
      xMinor_locator = plt.MultipleLocator(plotTitles.xMinorTicks)
      yMajor_locator = plt.MultipleLocator(plotTitles.yMajorTicks)
      yMinor_locator = plt.MultipleLocator(plotTitles.yMinorTicks)
      ax.xaxis.set_major_locator(xMajor_locator)
      ax.xaxis.set_minor_locator(xMinor_locator)
      ax.yaxis.set_major_locator(yMajor_locator)
      ax.yaxis.set_minor_locator(yMinor_locator)
They have to be set separately for each axis, I pass them as parameters from a wrapper class. Then, hatched bars.
Some parameters, redefining matplotlib's defaults (these plots are for printouts, so font sizes are big):

params = {'backend': 'ps',
          'axes.labelsize': 10,
          'text.fontsize': 10,
          'legend.fontsize': 10,
          'xtick.labelsize': 8,
          'ytick.labelsize': 8,
          'text.usetex': True,
          'font.family': 'serif',  # 'font' alone is not a valid rcParams key
          'font.size': 16}
# ('ylabel.fontsize' turned out not to be a real rcParams key either --
# the y label size is set via plt.ylabel(..., fontsize=20) instead.)
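The dict by itself does nothing until it's pushed into matplotlib's rc settings; a minimal sketch of that (trimmed to keys I'm certain about, and leaving out text.usetex, which needs a LaTeX install):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt

params = {'axes.labelsize': 10,
          'legend.fontsize': 10,
          'xtick.labelsize': 8,
          'ytick.labelsize': 8,
          'font.family': 'serif',
          'font.size': 16}
plt.rcParams.update(params)  # every figure from here on uses these defaults
```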
If you'd like to have different symbols (markers) in a scatterplot, this is useful:

      markers = itertools.cycle('.^*')
      p1 = ax.plot(gd.data[0], gd.data[1], linestyle='none', marker=next(markers),
                   markersize=8, color=gd.colour, mec=gd.colour, alpha=1)
But the double symbols in the legend -- why is that? It makes no sense.
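My guess is that the legend simply draws two marker points per entry by default; passing numpoints=1 should get rid of the doubling. A minimal sketch:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9], linestyle='none', marker='^', label='data')
legend = ax.legend(numpoints=1)  # one marker per legend entry
```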
I would not claim the plots shown are publication-quality (just look at the slightly differing histogram widths), but they look way better than the default plots one gets with matplotlib.

RGB colours in matplotlib plots

LaTeX accepts un-normalised (like 0-65-122) rgb values as colour arguments in, say, .cls files:
\definecolor{AIPblue} {RGB}{0, 65, 122}
matplotlib insists on getting something like this:

#colour settings
ASColour = '#666666'
or a normalised (0-1) RGB tuple. It's easy to normalise, but it's also easy to convert the RGB values to hex.
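Both conversions fit in a couple of lines (the function names here are my own, and AIPblue is the LaTeX colour from above):

```python
def rgb_to_hex(r, g, b):
    """Convert un-normalised (0-255) RGB values to a hex colour string."""
    return '#%02x%02x%02x' % (r, g, b)

def rgb_to_tuple(r, g, b):
    """Or normalise to the (0-1) tuple matplotlib also accepts."""
    return (r / 255., g / 255., b / 255.)

AIPblue = rgb_to_hex(0, 65, 122)  # -> '#00417a'
```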

Forcing capital letters in Bibtex bibliography lists

Some bibliography styles eat the capital letters in article titles. To avoid this, one should include the letter in question in curly brackets, like {T}hat.

Saturday, August 25, 2012

Generating matplotlib plots via ssh on a remote server

If you're running a plotting script via ssh, you're likely to get an error message of this sort:

_tkinter.TclError: no display name and no $DISPLAY environment variable
One possible solution is to use the Agg backend.
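The catch is that the backend has to be selected before pyplot is imported; a minimal sketch:

```python
import matplotlib
matplotlib.use('Agg')   # renders to files, needs no X display
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
# written straight to disk -- works fine over ssh, no $DISPLAY needed
outfile = os.path.join(tempfile.gettempdir(), 'test_plot.png')
fig.savefig(outfile)
```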

Friday, August 24, 2012

Failed optimisation: numpy.roll(), masked arrays

I've spent yesterday trying to optimise the inpainting code -- it's quite slow, taking up to 10 minutes for 1 galaxy, and with 900 galaxies, I can't really afford that. Or I should get access to more computers, then prepare the working environment, copy files, install Python libraries, screw something up along the way. That's boring and teaches nothing new.
Besides, there's really some room for improvement: those loops and exceptions are ugly and probably avoidable, and looping over the array many times could be sped up with some numpy kung-fu.
I wrote a prototype that juggles with the masks, uses some boolean manipulations and numpy.roll(). Near the edges the algorithm reverts to traditional loops over indices.
However, it's still slow. I ran a profiler: the loops shifting the array take 0.5 s each, so I win nothing in the long run. Array operations are expensive, and when an array is indexed repeatedly within a loop, the code inevitably gets slow. But there's no way I can escape the recursion, because I have to pick up a new set of pixels every time.
I still have some ideas, but I want to have some time to analyse the results.

Thursday, August 23, 2012

Morning coffee read -- ps

A ps reference. This command was pretty mysterious to me, with its several types of option syntax.

Wednesday, August 22, 2012

The new inpainting script

It's ridiculous in places and still a bit slow, but I'm quite happy with the idea underneath it.
It maps the number of all finite (not NaNs, not infs, not lying beyond the array edges) neighbours of each pixel in the given array. Then it loops through the masked pixels which have the _largest_ number of adjacent neighbours available, picks the closest neighbours in a 5x5 matrix around each pixel, fills the masked pixel with a distance-weighted Gaussian average of said neighbours, updates the neighbour count, loops through the pixels which now have the largest number (out of 4) of neighbours, and so on.
I realised I should do the inpainting this way while riding my bicycle home from cbase. My previous implementation simply looped over indices along one axis, disregarding the number of pixels used for filling the given one. Hence the ugly flux shape deformations when a masked region is directly above or below the galaxy in question.
But now it seems to be fine! So finally I get to do some science, instead of coding 24 if's and exceptions in one function.
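The neighbour-count map at the core of the idea can be sketched roughly like this (a simplified 4-neighbour version; the real script also has to handle NaNs and infs):

```python
import numpy as np

def neighbour_count(mask):
    """For each pixel, count the unmasked pixels among its 4 direct
    neighbours; pixels beyond the array edges count as unavailable."""
    valid = (~mask).astype(int)
    # zero-pad so the edge pixels see 0 outside the array
    padded = np.pad(valid, 1, mode='constant')
    return (padded[:-2, 1:-1] + padded[2:, 1:-1] +
            padded[1:-1, :-2] + padded[1:-1, 2:])
```

The inpainting loop then always fills the masked pixels where this count is highest, and recomputes the map after each pass.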

randomize contents of a list

The new inpainting script left some ringing artefacts; I think they might come from the ordering of np.where() results (and the inpainting order it imposes). I've tried numpy.random.shuffle(). It's still running, so I have yet to see whether this worked.
Updated -- no, it didn't. The weird lines and gradients are the edge effects of rectangular mask shapes, which I didn't like from the beginning. On the other hand, given that the inpainted values vary by ~4 counts (the contrast is stretched in this image), I'm not going to worry about this.
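For reference, the shuffle itself (assuming, as in my case, the coordinates come from np.where() on the mask):

```python
import numpy as np

mask = np.array([[True, False],
                 [True, True]])
# np.where() returns coordinates in row-major order; zip them into
# (i, j) tuples and shuffle the list in place to break that ordering
coords = list(zip(*np.where(mask)))
np.random.shuffle(coords)
```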

Monday, August 20, 2012

SciPy: spatial binary tree

I just realised that scipy.spatial.KDTree works in _data_ space, not in _index_ space, which makes total sense now that I think of it. I should go get some breakfast, as I'm starting to overcomplicate things.
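To spell it out with a toy example: the tree stores coordinates, query() takes a coordinate, and the returned index is just the row number of the nearest stored point.

```python
import numpy as np
from scipy.spatial import KDTree

points = np.array([[0.0, 0.0],
                   [1.0, 1.0],
                   [5.0, 5.0]])
tree = KDTree(points)
# nearest stored point to the *data-space* location (0.9, 1.1)
dist, idx = tree.query([0.9, 1.1])
```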

Numpy: getting indices of n-dimensional array

Just found np.ndenumerate. I should absolutely read some numpy documentation -- there's lots of good stuff hidden there.
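It works like enumerate(), but yields ((i, j, ...), value) pairs for however many dimensions the array has:

```python
import numpy as np

a = np.array([[10, 20],
              [30, 40]])
# each item is a tuple: (index tuple, value at that index)
pairs = list(np.ndenumerate(a))
```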

Inpainting, optimisation, nearest neighbours in Numpy

After looking at my ~900 inpainted images and rejecting ~130 of them outright, I'd like to rewrite my inpainting code. It creates weird artifacts or distorted galaxy views under some circumstances, because it iterates _along_ the axes.
My current idea is to do this recursively: first fill (using the same Gaussian kernel) the pixels that have all 4 neighbouring pixels (out of 4 possible) available (not masked), then repeat that, then go to those that have 3 neighbours, rinse and repeat.
A similar, but possibly a better approach would be to include the diagonal neighbours in the calculation as well (those who have a common vertex, but do not lie side by side). Start with those that have 8 neighbours, (...).
Here's some food for thought: http://stackoverflow.com/questions/9132114/indexing-numpy-array-neighbours-efficiently, http://davescience.wordpress.com/2011/12/12/counting-neighbours-in-a-python-numpy-matrix/, http://stackoverflow.com/questions/5551286/filling-gaps-in-a-numpy-array.

Sunday, August 19, 2012

batch command line image resizing on remote server

I had generated the images on a remote server and wanted to copy them to my laptop to work on Sunday. However, the zipfile was 1.4 GB. I used parts of this shell script; luckily, ImageMagick was installed on the server.
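For the record, the same batch resize can be sketched in Python with PIL instead, in case ImageMagick isn't around (the directory name and target size below are just illustrative):

```python
import glob
import os
from PIL import Image

thumb_size = (512, 512)  # illustrative maximum dimensions
for name in glob.glob('img/output/filled/*.png'):  # hypothetical path
    im = Image.open(name)
    im.thumbnail(thumb_size)   # in-place resize, aspect ratio preserved
    root, ext = os.path.splitext(name)
    im.save(root + '_small' + ext)
```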

Batch conversion of fits files to .png images, dynamic range and histogram equalisation

My sample consists of almost a thousand galaxies. It's that inconvenient sample size where you would like to inspect each result visually, but doing so already sucks. First of all, I wanted to see if the inpainting code makes sense and leaves no strange artifacts. To avoid opening each .fits file manually with ds9, I scripted a quick loop to save all the fits images as .png files. It uses scipy.misc.imsave():

scipy.misc.imsave('img/output/filled/'+ID+'_imsave.jpg', outputImage)
However, I wanted to stretch the pixel intensity values and increase the dynamic range of the output images. This code was useful. It tries to make all intensities equally common and hence increases the contrast.
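The gist of histogram equalisation is just remapping intensities through their own cumulative distribution; a bare-bones numpy sketch (not the exact code I used):

```python
import numpy as np

def equalise(image, n_bins=256):
    """Map pixel intensities through their CDF, so that all intensity
    levels become roughly equally common (i.e. a ~flat histogram)."""
    hist, bin_edges = np.histogram(image.ravel(), bins=n_bins, density=True)
    cdf = hist.cumsum()
    cdf = cdf / cdf[-1]                      # normalise the CDF to [0, 1]
    flat = np.interp(image.ravel(), bin_edges[:-1], cdf)
    return flat.reshape(image.shape)
```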

Friday, August 17, 2012

matplotlib: missing x, y axis labels

There must be some reason why it was so, but I had

plt.ylabel = plotTitles.ylabel 

instead of

plt.ylabel(plotTitles.ylabel)

in my pyplot wrapper.

Getting a fits header keyword value from a list of files

I wanted to get the sky value estimates from the SDSS fits headers (fpC). There were 939 of them. It was amusing to see how many SDSS images of large galaxies didn't have the sky estimated -- I should probably look into why that was the case. I remember writing something similar before, but it was easier to quickly script it again.

#a rough excerpt from ellipse_photometry module. 
#loops through filenames, extracts fits header keyword values.

  sky = np.empty((939, 2))
  for i in range(0, 939):
    try:
      fileName = GalaxyParameters.getSDSSUrl(listFile, dataDir, i)
      head = pyfits.open(fileName)[0].header
      sky[i, 0] = i+1
      sky[i, 1] = head['SKY']
      print head['SKY']
    except IOError:
      # file not found: flag the galaxy with 0
      sky[i, 0] = i+1
      sky[i, 1] = 0
    except KeyError:
      # no SKY keyword in the header: flag it with -1
      sky[i, 0] = i+1
      sky[i, 1] = -1
  np.savetxt('SDSS_skies.txt', sky, fmt=['%i', '%10.5f'])

Drawing a diagonal (or any arbitrarily inclined and offset) line in Matplotlib

I wanted to overplot a simple 45 deg line on my data. As far as I know, Matplotlib doesn't accept equations, only sets of points, so here's a clumsy solution to it, based on this:

    import pylab as pl
    import numpy as np

    # n -- the number of sample points along the line
    # (with the imaginary step below, np.r_ behaves like linspace);
    # m -- the slope, b -- the offset.
    x1, x2, n, m, b = min(data), max(data), 1000, 1., 0.

    x = np.r_[x1:x2:n*1j] #http://docs.scipy.org/doc/numpy/reference/generated/numpy.r_.html
    pl.plot(x, m*x + b); pl.grid(); pl.show()


Thursday, August 16, 2012

Image intensity scaling

I wanted to make some images showing the output of the photometric analysis, i.e. the radii where the growth curves flattened, etc. scipy.misc.imsave() bytescales the image, so here's some documentation on image intensity scaling I'm currently digesting: ds9's 'scale --> 99%' works by outlier clipping.
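The clipping itself is nearly a one-liner with numpy (a sketch; the exact percentile bounds ds9 uses are my guess, I'm just picking the obvious symmetric ones):

```python
import numpy as np

def clip_outliers(image, lo=0.5, hi=99.5):
    """Clip pixel values to the [lo, hi] percentile range, so a few
    extreme pixels don't eat the whole dynamic range."""
    vmin, vmax = np.percentile(image, [lo, hi])
    return np.clip(image, vmin, vmax)
```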

Bulges and ellipticals

I read an interesting article on the train, Galaxy Bulges and Elliptical Galaxies, lecture notes by D. Gadotti. Dimitri had helped us a lot with the MICE decomposition.
He sums up a great deal of recent research regarding the structure of galaxies. A few main points:
  • the components that we call bulges are not necessarily the old, 'scaled-down' ellipticals we often hear about: there are classical bulges, disk-like bulges, and 'box-peanuts', which may coexist or overlap.
  • 'box-peanuts' are thought to be the inner parts of bars, not separate components.
  • classical bulges are kinematically hot, i.e. supported by their velocity dispersion; disk-like bulges are kinematically cold, i.e. supported by their rotation velocity.
  • there's an interesting section on photometric bulge recognition, and why the arbitrary cut at n=2 is what it is: arbitrary. I tend to believe that any classification has to be data-driven, so I'll keep the alternative suggestions in mind.
  • another interesting part is about the scaling relations, i.e. the fundamental plane and the F-J relation. I didn't know that the FP can be derived directly from the virial theorem!
  • the F-J and L-size relations seem to be nonlinear, according to the data.
The only thing I'm not sure about is the use of the 2D K-S test; an enlightening read otherwise.

Wednesday, August 15, 2012

Git -- non-fast-forward updates were rejected

I got the following error message:

To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes before pushing again.  See the 'Note about
fast-forwards' section of 'git push --help' for details.
Stackoverflow helped, as usual. As I am the sole developer of my can of worms, I just added the '-f' flag, which force-pushes and overwrites the remote history -- fine when nobody else pulls from the repository. It was nice to finally make the last commit of the photometry code I've been working on for the last 3 months.

Git -- fixing a detached head

'fixing' as in 're-attaching'. There's nothing intrinsically wrong with having a detached head.
I returned from the state of detached head this way.

'In Git commits form a directed acyclic graph where each commit points to one or more parent commits'.

Geek language, you're a thing of beauty.

Monday, August 13, 2012

Python: convert a numpy array to list

One more little trick I didn't know:


  # inputIndices needs a third dimension to hold the (i, j) pairs
  inputIndices = np.zeros((inputImage.shape[0], inputImage.shape[1], 2))
  for i in range(0, inputImage.shape[0]):
    for j in range(0, inputImage.shape[1]):
      inputIndices[i, j] = (i, j)
  inputIndices = inputIndices.tolist()

Python: removing duplicate elements from a list

Python has a lovely set type, which makes removing duplicate elements (in my case, tuples) extremely easy, e.g.:

out = [(12, 3), (11, 4), (12, 3), (44, -5)]
out = list(set(out))


Thursday, August 9, 2012

unix: show sizes of all files and directories, without subdirectories

simple but handy:
du -ah --max-depth=1

numpy array slices: inclusivity vs.exclusivity

I was banging my head on the desk for the last fifteen minutes: the moving-average code didn't work where it was supposed to. Look at this snippet:
fluxData[distance-5:distance, 3]
The idea was to average an array over a running 5 element width annulus, counting back from the distance'th element.
I normally know this, but numpy's array slicing does not include the last element, i.e. the distance'th row was excluded from the count.
For instance, fluxData[distance-4:distance+1, 3] looks stupid, but works, because the stop index is excluded, so the window runs from distance-4 up to and including distance.
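A two-line reminder of the convention:

```python
import numpy as np

a = np.arange(10)
# the stop index is excluded, so a[2:5] has three elements
window = a[2:5]
# a 5-element window that *includes* index `distance` needs stop+1
distance = 6
inclusive = a[distance - 4:distance + 1]  # rows 2..6, distance'th included
```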

Inpainting again

I'm running the inpainting script once more: it took 75 iterations to fix one file. That is, all night. I don't know why I haven't done this before, as this unfinished business was frustrating me throughout the holidays.