Computer specs for fast matplotlib and basemap processing

Hi all,

I am processing a lot of GRIB data from NOAA using matplotlib and
basemap. On my current laptop (P4 3 GHz, 512 MB RAM) the whole process takes
close to 3 hours... so it's time for a new machine, but still on a very tight
budget :)

My main question is what I should emphasize more: a quad-core processor
running 64-bit Vista/XP, or more memory and a fast hard drive, maybe even a
RAID array? Will Python, mpl, and basemap take full advantage of multiple
cores or will they use only one? Also, would they work in a 64-bit
environment, or would I be better off just sticking with 32-bit XP? Memory-wise,
it seems that on my current machine the app uses all the available RAM; how
much should I buy to make sure that all its needs are met?

Processor-wise, I see that both Intel and AMD have a plethora of options to
choose from... What would you recommend?

And the last question is about hard drives. From your experience, what
drives should I look at? Is a SCSI RAID still that much faster than a 10,000
rpm HDD? I've also seen that there are some 15,000 rpm drives that come with
their own controller; would they be worth the money, or should I just get a
10,000 rpm HDD and be done with it?

Thanks for any help; lately I haven't kept up with the technology and I
feel like a noob :(

Anton

···


antonv wrote:

Hi all,

I am processing a lot of GRIB data from NOAA using matplotlib and
basemap. On my current laptop (P4 3 GHz, 512 MB RAM) the whole process takes
close to 3 hours... so it's time for a new machine, but still on a very tight
budget :)

My main question is what I should emphasize more: a quad-core processor
running 64-bit Vista/XP, or more memory and a fast hard drive, maybe even a
RAID array? Will Python, mpl, and basemap take full advantage of multiple
cores or will they use only one? Also, would they work in a 64-bit
environment, or would I be better off just sticking with 32-bit XP? Memory-wise,
it seems that on my current machine the app uses all the available RAM; how
much should I buy to make sure that all its needs are met?

Just a few comments; I am sure others are more knowledgeable about most of this.

First, I think you need to try to figure out what the bottlenecks are. Can you monitor disk use, memory use, and cpu use? Is the disk maxed out and the cpu idle? If the disk is heavily used, is it swapping? From what you have said, it is impossible to tell whether the disk speed would make a difference, for example. My guess is that it is going to be low priority.

Second, as part of the above, you might review your code and see whether there are some very inefficient parts. How much time is spent in loops that could be vectorized? Are lists being used where arrays would be more efficient? In basemap, are you re-using instances where possible, or are you unnecessarily re-extracting coastlines, for example? Is it possible that you are running out of memory and then swapping because you are using pylab/pyplot and failing to close figures when you have finished with them?
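
For example, the pattern I mean is roughly this (a sketch only; the projection, data, and file names are made up):

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.basemap import Basemap

# build the Basemap instance once; the expensive coastline extraction happens here
m = Basemap(projection='cyl', llcrnrlon=-180, urcrnrlon=180,
            llcrnrlat=-90, urcrnrlat=90, resolution='c')

fields = [np.random.rand(90, 180) for _ in range(3)]   # stand-in for your decoded 2-D arrays

for i, data in enumerate(fields):
    fig = plt.figure()
    m.imshow(data)                     # re-use the same Basemap instance for every frame
    m.drawcoastlines()
    fig.savefig('frame_%03d.png' % i)
    plt.close(fig)                     # free the figure's memory before the next iteration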

If your budget is tight, I would be very surprised if SCSI would be cost-effective. Generally, SATA is the way to go these days.

I suspect there won't be much speed difference between 32-bit and 64-bit OS versions.

RAM: I expect 4GB will be both cheap and adequate.

To use multiple processors efficiently with matplotlib, you will need multiple processes; mpl and numpy do not automatically dispatch parts of a single job out to multiple processors. (I'm not sure what happens if you use threads--I think it will still be one job per processor--but the general advice is: don't use threads unless you really know what you are doing, really need them, and are willing to put in some heavy debugging time.) My guess is that your 3-hour job could easily be split up into independent jobs working on independent chunks of data, in which case such a split would give you a big speed-up with more processor cores, assuming the work is CPU-intensive; if it is disk-IO-bound, then the split won't help. Anyway, dual-core is pretty standard now, and you will want at least that. Quad might or might not help, as indicated above.
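
If the work does split cleanly by input file, multiprocessing.Pool is the simplest route; roughly like this (the function body and file list below are placeholders):

import glob
import multiprocessing

def process_one_file(filename):
    # decode one GRIB file and write its plots here (placeholder body)
    pass

if __name__ == '__main__':
    grib_files = glob.glob('*.grb')               # hypothetical input files
    pool = multiprocessing.Pool(processes=4)      # roughly one worker per core
    pool.map(process_one_file, grib_files)        # independent chunks, run in parallel
    pool.close()
    pool.join()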

Eric

···


antonv wrote:

Hi all,

I am processing a lot of GRIB data from NOAA using matplotlib and
basemap. On my current laptop (P4 3 GHz, 512 MB RAM) the whole process takes
close to 3 hours... so it's time for a new machine, but still on a very tight
budget :)

You should profile your application to see why it's taking so long. Maybe you just coded something in a slow way. Python is a great language, but if you don't know it well you might have programmed some parts in a way that takes orders of magnitude more time than other approaches. Even if your code is reasonably optimized, you should first find out why it's slow: has the computer run out of memory so that it is swapping? Is the CPU at 100%? I'd recommend you ask a local Python expert for some help.
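
For example, the standard-library profiler will tell you where the time goes; a sketch, assuming the whole run is wrapped in a function (main() is just a placeholder name):

import cProfile
import pstats

def main():
    pass   # placeholder for the real processing run

cProfile.run('main()', 'run.prof')                 # profile one full run and save the stats
stats = pstats.Stats('run.prof')
stats.sort_stats('cumulative').print_stats(20)     # show the 20 most expensive call chains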

JLS

I have a bit of programming experience and I am pretty sure my parts of
the code are well optimized. I made sure that the loop contains only what is
needed and that everything else is loaded beforehand.

The biggest bottleneck is unpacking GRIB files to CSV files using Degrib on
the command line. That operation usually takes around half an hour, using no
more than 50% of the processor, but it maxes out the memory usage and is
definitely hard-drive intensive, as it ends up writing over 4 GB of data. I
have also noticed that a lower-spec AMD desktop runs this faster than my P4
Intel laptop; my guess is that the laptop HDD is 5400 rpm and the desktop's
is 7200 rpm.

The next step is to take all those CSV files and make images from them. I
haven't dug too deep into what is happening there, but it seems to be the
other way around: it uses the CPU a lot more while keeping memory usage high
too.
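
(Schematically -- not my actual code, and the file names are made up -- that step boils down to something like:)

import numpy as np
import matplotlib.pyplot as plt

grid = np.loadtxt('field.csv', delimiter=',')   # hypothetical CSV holding one 2-D field
plt.imshow(grid)
plt.savefig('field.png')
plt.close()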

Thanks,
Anton

···


I do the same sort of processing, and use GDAL to read the GRIB (I think
grib2, whatever ECMWF provides) files directly into numpy arrays. It's as
easy as

from osgeo import gdal
import pylab

g = gdal.Open("my_grib_file.grib")
band = g.GetRasterBand(1)        # pick the band (GRIB message) you want; bands are 1-based
data = band.ReadAsArray()        # a 2-D numpy array
pylab.imshow(data)
pylab.show()
# ... further processing / plotting

It doesn't take long at all, unless your files are huge and are stored over a
slow and busy network. But then, there's little you can do about that!

J

···


--
RSU ■ Dept. of Geography ■ University College ■ Gower St, London WC1E 6BT UK
EMM ■ Dept. of Geography ■ King's College ■ Strand, London WC2R 2LS UK

antonv wrote:

I have a bit of programming experience and I am pretty sure my parts of
the code are well optimized. I made sure that the loop contains only what is
needed and that everything else is loaded beforehand.

The biggest bottleneck is unpacking GRIB files to CSV files using Degrib on
the command line. That operation usually takes around half an

Instead of going to csv files--which are *very* inefficient to write, store, and then read in again--why not convert directly to netcdf, and then read your data in from netcdf as needed for plotting? I suspect this will speed things up quite a bit. Numpy support for netcdf is very good. Of course, direct numpy-enabled access to the grib files might be even better, eliminating the translation phase entirely. Have you looked into the Nio module (http://www.pyngl.ucar.edu/Nio.shtml)?
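
As a rough sketch of the netcdf route (using the netCDF4 module; file and variable names are made up for illustration):

import numpy as np
from netCDF4 import Dataset

data = np.random.rand(181, 360)                 # stand-in for one decoded 2-D field

# write the field once...
nc = Dataset('field.nc', 'w')
nc.createDimension('lat', data.shape[0])
nc.createDimension('lon', data.shape[1])
var = nc.createVariable('field', 'f4', ('lat', 'lon'))
var[:] = data
nc.close()

# ...then read it back whenever you need to plot
nc = Dataset('field.nc')
field = nc.variables['field'][:]
nc.close()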

Eric

···


I know that using CSV files is very slow, but I have no knowledge of
working with the netCDF format and I was in a bit of a rush when I wrote
this. I will take another look at it. How would you translate a GRIB file
into netCDF? Are there any specific applications for this, or is it done
straight through numpy?

As for PyNGL, if I remember correctly I looked at it but it was not working
on Windows.

Thanks,
Anton



antonv wrote:

I know that using CSV files is very slow, but I have no knowledge of
working with the netCDF format and I was in a bit of a rush when I wrote
this. I will take another look at it. How would you translate a GRIB file
into netCDF? Are there any specific applications for this, or is it done
straight through numpy?

As for PyNGL, if I remember correctly I looked at it but it was not working
on Windows.

Thanks,
Anton
  
Anton: If these are GRIB version 2 files, another option is pygrib2 (http://code.google.com/p/pygrib2). I have made a Windows installer.

-Jeff

···


--
Jeffrey S. Whitaker Phone : (303)497-6313
Meteorologist FAX : (303)497-6449
NOAA/OAR/PSD R/PSD1 Email : Jeffrey.S.Whitaker@...259...
325 Broadway Office : Skaggs Research Cntr 1D-113
Boulder, CO, USA 80303-3328 Web : http://tinyurl.com/5telg

Wow Jeff! You've saved me again! I remember looking at it last year and thinking it would be awesome if there were a Windows installer for it!
I will install it and play with it tonight! Thanks a lot!

Anton

···


antonv wrote:

I know that using CSV files is very slow, but I have no knowledge of
working with the netCDF format and I was in a bit of a rush when I wrote
this. I will take another look at it. How would you translate a GRIB file
into netCDF? Are there any specific applications for this, or is it done
straight through numpy?

The program you are already using is said to convert grib2 to netcdf:
http://www.nws.noaa.gov/mdl/NDFD_GRIB2Decoder/
and there are several modules providing a netcdf interface for numpy. I like this one:
http://code.google.com/p/netcdf4-python/
and it is included in Enthought Python Distribution.

For GRIB to numpy, googling turned up pygrib2 (http://code.google.com/p/pygrib2)
as well as PyNIO. My guess is that pygrib2 will be exactly what you need. It is by Jeffrey Whitaker, the author of the above-mentioned netcdf4 interface as well as of basemap.
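
To give a feel for the numpy-direct route, here is a sketch using the pygrib interface that grew out of pygrib2; the exact pygrib2 calls may differ, so treat the names below as illustrative only:

import pygrib

grbs = pygrib.open('my_grib_file.grb2')
grb = grbs.select(name='Temperature')[0]   # pick one message; 'Temperature' is just an example
data = grb.values                          # 2-D numpy array, ready for basemap/matplotlib
lats, lons = grb.latlons()                 # matching coordinate arrays
grbs.close()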

As for PyNGL, if I remember correctly I looked at it but it was not working
on Windows.

Well, I recommend switching to Linux anyway, but that is another story.

Eric

Eric Firing wrote:

The biggest bottleneck is happening because I'm unpacking grib files to csv
files using Degrib in command line. That operation is usually around half an

disk speed -- you might want to try SATA RAID 0 (striping) -- I'd get a good hardware vendor's advice on maximizing your disk IO.

You can also multi-task that process easily, but if you're disk-bound, that won't help anyway.

Instead of going to csv files--which are *very* inefficient to write, store, and then read in again--why not convert directly to netcdf,

Or HDF, via PyTables. Or even direct binary numpy arrays, with either fromfile/tofile or, more robustly, with numpy.save and numpy.load.
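
For example (a minimal sketch; the file name and data are arbitrary):

import numpy as np

data = np.random.rand(181, 360)   # stand-in for one decoded field
np.save('field.npy', data)        # fast binary write; keeps dtype and shape
data2 = np.load('field.npy')      # fast binary read back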

direct numpy-enabled access to the grib files might be even better, eliminating the translation phase entirely. Have you looked into the Nio module (http://www.pyngl.ucar.edu/Nio.shtml)?

Also, I think GDAL supports GRIB, and can directly give you numpy arrays.

I have noticed also that on a lower spec AMD desktop this runs
faster than on my P4 Intel Laptop, my guess being that the laptop hdd is
5400 rpm and the desktop is 7200 rpm.

yup, those laptop hard drives are SLOW -- you could look into a solid-state drive, if you have some money to spend.

Next step is to take all those csv files and make images from them. For this
one I haven't dug too deep to see what is happening but it seems to be the
other way, using the cpu a lot more while keeping the memory usage high too.

multiple cores aren't going to help here, unless you run a few separate processes -- also, how much memory? All 64 bits will buy you is more memory, which you may or may not need.

Also, as for 64-bit Windows -- is numpy supported there yet? I'd make sure; there are issues, as there is no MinGW for 64-bit Windows.

antonv wrote:

I know that using the csv files is very slow but I have no knowledge of
working with the netcdf format and I was in a bit of a rush when I wrote
this. I will take a look again at it. How would you translate a grib in
netcdf?

See if degrib supports any binary formats (I know, I'm from NOAA, I should know...). Otherwise you could use the GDAL command-line tools to translate into some other binary format that may be easier to deal with. Though it looks like Jeff may have solved this problem for you (One NOAA, Jeff!)

-Chris

···

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@...259...