Experience with using h5py to do analytical work on big data in Python?

We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs. HDF5 advantages: data can be inspected conveniently using the h5view application, h5py/ipython and the h5* command-line tools; APIs are available for different platforms and languages; structure data using groups …
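As a rough illustration of the workflow described above, here is a minimal sketch of writing a dataset into an HDF5 group with h5py and reading back only a slice of it. The file name, group name, dataset name, array shape and chunk shape are all hypothetical and chosen only for the example; real datasets of hundreds of GBs would be written incrementally rather than from one in-memory array.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location, used only for this sketch.
h5_path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write: organise data under a group and store it in chunks.
with h5py.File(h5_path, "w") as f:
    grp = f.create_group("run_001")  # hypothetical group name
    data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
    grp.create_dataset("measurements", data=data, chunks=(100, 1000))

# Read: slicing a dataset loads only the requested rows from disk.
with h5py.File(h5_path, "r") as f:
    dset = f["run_001/measurements"]
    shape = dset.shape
    block = dset[200:300, :]  # NumPy array holding rows 200..299 only
    block_mean = float(block.mean())
```

Slicing the dataset object, rather than loading it whole, is what keeps memory use bounded when the file is much larger than RAM.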

Large, persistent DataFrame in pandas

Wes is of course right! I’m just chiming in to provide a slightly more complete code example. I had the same issue with a 129 MB file, which was solved by:

    import pandas as pd
    tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)  # gives TextFileReader, which is iterable with chunks of 1000 rows.
    df = pd.concat(tp, ignore_index=True)

…
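For anyone who wants to try the chunked-read approach end to end, here is a self-contained sketch. It builds a small stand-in CSV first (the column names and the 2,500-row size are made up for the example), then shows two ways to consume the chunks: concatenating them as in the answer, and aggregating chunk by chunk, which may help when even the concatenated frame would not fit in memory.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV; the path and columns are hypothetical.
csv_path = os.path.join(tempfile.mkdtemp(), "large_dataset.csv")
pd.DataFrame(
    {"id": range(2500), "value": [i * 0.5 for i in range(2500)]}
).to_csv(csv_path, index=False)

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(csv_path, chunksize=1000)

# Option 1: stitch the chunks together (the result must still fit in memory).
df = pd.concat(reader, ignore_index=True)

# Option 2: aggregate per chunk so only one chunk is resident at a time.
total = 0.0
n_rows = 0
for chunk in pd.read_csv(csv_path, chunksize=1000):
    total += chunk["value"].sum()
    n_rows += len(chunk)
```

Option 1 mirrors the answer above; Option 2 is an alternative for cases where only a summary of the data is needed.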

Anyone seen a meaningful SAS vs SATA comparison/benchmark?

The “SATA = 7.2K RPM, SAS = 10/15K RPM” mind-set is strong, and (in my opinion anyway) where most of the “SAS is faster than SATA” thinking comes from. There are some slight differences between SAS and SATA drives, notably in their on-board caching algorithms (NCQ vs. TCQ). However, the performance difference of equivalently specced …

SAS vs. Nearline/MDL SAS – What is the difference?

Marketing. 7.2K drives are slower and easier to produce, and they have higher error thresholds, which improves yields (and capacity). However, in terms of the I/O operations each discrete disk can support, the 7.2K drives are markedly less performant than their faster brethren. Therefore they get the ‘Nearline’ moniker, as they’ll hit I/O saturation much faster than …
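One contributor to the per-disk I/O gap is rotational latency, which follows directly from spindle speed: on average the platter has to turn half a revolution before the requested sector is under the head. The sketch below computes only that component; seek time and controller overhead also matter and are deliberately left out, so this is not a full IOPS estimate. The function name is made up for the example.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms: the time for half a revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    ms_per_revolution = 60_000.0 / rpm
    return 0.5 * ms_per_revolution


latencies = {rpm: avg_rotational_latency_ms(rpm) for rpm in (7_200, 10_000, 15_000)}
for rpm, ms in latencies.items():
    print(f"{rpm:>6} RPM: {ms:.2f} ms")
# prints:
#   7200 RPM: 4.17 ms
#  10000 RPM: 3.00 ms
#  15000 RPM: 2.00 ms
```

By this measure alone, a 7.2K drive waits roughly twice as long per random access as a 15K drive before any data moves.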

SAS vs Near-line SAS vs SATA

This has been covered here… See the related links on the right pane of this question. Right now, the market conditions are such that you should try to use SAS disks everywhere you can. Enterprise SAS disks are your fastest and most resilient rotating media, available at 10,000 and 15,000 RPM. Performance-optimized Nearline or Midline …