I am processing a lot of GRIB data from NOAA with the use of matplotlib and
Basemap. On my current laptop (P4 3 GHz, 512 MB RAM) the whole process takes
close to 3 hours... so it's time for a new machine, but still on a very tight
budget.
My main question is what should I emphasize more: a quad-core processor
running 64-bit Vista/XP, or more memory and a fast hard drive, even a
RAID array? Will Python, mpl, and Basemap take full advantage of multiple
cores, or will they use only one? Also, would they work in a 64-bit
environment, or would I be better off just sticking to XP32? Now, memory-wise,
it seems that on my current machine the app uses all the available RAM; how
much should I buy to make sure that all its needs are met?
Just a few comments; I am sure others are more knowledgeable about most of this.
First, I think you need to try to figure out what the bottlenecks are. Can you monitor disk use, memory use, and CPU use? Is the disk maxed out and the CPU idle? If the disk is heavily used, is it swapping? From what you have said, it is impossible to tell whether the disk speed would make a difference, for example. My guess is that it is going to be low priority.
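One easy way to see where the CPU time goes is the stdlib cProfile module. A minimal sketch below; process_grib_files is a made-up stand-in for whatever function drives your pipeline, and the loop inside it just gives the profiler something to measure:

```python
import cProfile
import io
import pstats

def process_grib_files():
    # Hypothetical stand-in for the real pipeline entry point;
    # replace with the function that actually reads and plots the data.
    total = 0
    for i in range(100000):
        total += i * i
    return total

# Profile one run and report the functions that consumed the most time.
profiler = cProfile.Profile()
profiler.enable()
process_grib_files()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # show only the top 5 entries
print(stream.getvalue())
```

If the top entries are your own loops, the vectorization points below apply; if the time is mostly in file I/O calls, a faster disk matters more than a faster CPU.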
Second, as part of the above, you might review your code and see whether there are some very inefficient parts. How much time is spent in loops that could be vectorized? Are lists being used where arrays would be more efficient? In basemap, are you re-using instances where possible, or are you unnecessarily re-extracting coastlines, for example? Is it possible that you are running out of memory and then swapping because you are using pylab/pyplot and failing to close figures when you have finished with them?
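To illustrate the loop-versus-vectorized point with a toy example (the array contents are made up; a real grid would come from your GRIB reader), compare element-by-element Python looping against a single NumPy operation. If you use pyplot, calling plt.close(fig) after saving each figure releases its memory in the same spirit:

```python
import numpy as np

# Toy data standing in for one grid of values from a GRIB field.
data = np.linspace(0.0, 1.0, 100_000)

def scale_loop(values, factor):
    # Pure-Python loop: one interpreted iteration per element.
    result = []
    for v in values:
        result.append(v * factor)
    return result

def scale_vectorized(values, factor):
    # Vectorized: the same arithmetic done in one C-level pass.
    return values * factor

loop_result = scale_loop(data, 2.0)
vec_result = scale_vectorized(data, 2.0)

# Both produce the same numbers; the vectorized version is typically
# orders of magnitude faster on arrays this size.
assert np.allclose(loop_result, vec_result)
```

The same idea applies to building coordinates, masking, and unit conversions: one array expression instead of a per-point loop.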
If your budget is tight, I would be very surprised if SCSI would be cost-effective. Generally, SATA is the way to go these days.
I suspect there won't be much speed difference between 32-bit and 64-bit OS versions.
RAM: I expect 4GB will be both cheap and adequate.
To use multiple processors efficiently with matplotlib, you will need multiple processes; mpl and numpy do not automatically dispatch parts of a single job out to multiple processors. (I'm not sure what happens if you use threads--I think it will still be one job per processor--but the general advice is, don't use threads unless you really know what you are doing, really need them, and are willing to put in some heavy debugging time.) My guess is that your 3-hour job could easily be split up into independent jobs working on independent chunks of data, in which case such a split would give you a big speed-up with more processor cores, assuming the work is CPU-intensive; if it is disk IO-bound, then the split won't help. Anyway, dual-core is pretty standard now, and you will want at least that. Quad might or might not help, as indicated above.
Processor-wise, I see that both Intel and AMD have a plethora of options to
choose from... What would you recommend?
And the last question is about hard drives. From your experience, which
drives should I look at? Is a SCSI RAID still that much faster than a 10,000
RPM HDD? I've also seen that there are some 15,000 RPM drives that come with a
controller; would they be worth the money, or should I just get a 10,000 RPM
HDD and be done?
Thanks for any help, as lately I haven't kept up with the technology and I
feel like a noob.