python - Pandas optimization
I wrote a function to process information with pandas. The profiling log of the function, produced with %prun, is posted at the bottom (only the top few lines). I want to optimize the code because I need to call this function more than 4,000 times, and it took 37.7 s to run the function once.
It seems the most time-consuming part is the nonzero method of numpy.ndarray. Since most of my operations are based on pandas, I wonder which functions in pandas rely heavily on this method?
My operations consisted of DataFrame slicing based on a DatetimeIndex using df.ix[], and DataFrame merges using pandas.merge().
I know it's hard to tell without posting the actual script, but the script is too long to be meaningful and the operations are ad hoc, so I can't rewrite it as a small script to post here.
16439731 function calls (16108083 primitive calls) in 37.766 seconds

Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     7461    3.712    0.000    3.712    0.000  {method 'nonzero' of 'numpy.ndarray' objects}
      244    1.731    0.007    5.434    0.022  index.py:1126(_partial_date_slice)
      122    1.655    0.014    1.655    0.014  {pandas.algos.inner_join_indexer_int64}
      610    1.578    0.003    1.578    0.003  {method 'factorize' of 'pandas.hashtable.int64factorizer' objects}
   118817    0.764    0.000    0.764    0.000  {method 'reduce' of 'numpy.ufunc' objects}
    22474    0.753    0.000    0.917    0.000  index.py:409(is_unique)
   353210    0.669    0.000    1.228    0.000  {numpy.core.multiarray.array}
  1577935    0.596    0.000    0.925    0.000  {isinstance}
     1221    0.511    0.000    0.516    0.000  index.py:402(is_monotonic)
      183    0.427    0.002    0.427    0.002  {pandas.algos.left_outer_join}
    34529    0.376    0.000    1.286    0.000  index.py:98(__new__)
    12356    0.358    0.000    0.358    0.000  {method 'take' of 'numpy.ndarray' objects}
     3812    0.352    0.000    0.352    0.000  {pandas.algos.take_2d_axis0_int64_int64}
      610    0.344    0.001    0.349    0.001  index.py:35(wrapper)
      981    0.334    0.000    0.335    0.000  {method 'copy' of 'numpy.ndarray' objects}
df.ix[] is a little unpredictable in that it is label-based but has an integer-position fallback. You should try using .loc[] instead. If you pass a single label, it returns a Series of the row at that index label. You can slice by passing a range. Instead of:
df.ix[begin_date:end_date]
try:
df.loc[begin_date:end_date]
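To make the label-based behavior concrete, here is a minimal sketch on made-up data. The DataFrame, the column name value, and the two dates are assumptions for illustration only and are not taken from the original script.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per day for January 2013.
idx = pd.date_range("2013-01-01", periods=31, freq="D")
df = pd.DataFrame({"value": np.arange(31)}, index=idx)

# Hypothetical slice bounds.
begin_date = pd.Timestamp("2013-01-05")
end_date = pd.Timestamp("2013-01-09")

# Label-based slice: both endpoints are included.
window = df.loc[begin_date:end_date]

# A single label returns the row at that label as a Series.
row = df.loc[begin_date]
```

Note that a .loc slice includes the end label, unlike ordinary Python slicing.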
Even faster is to use the integer-based slicing method .iloc[]. Since you're looping over the index anyway, add enumerate() to the loop and use the enumerate() values, i.e.:
df.iloc[4:9]
On my machine, .iloc tends to be about twice as fast as .loc.
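A rough sketch of the positional approach follows, again on made-up data. The enumerate() loop mirrors the suggestion above. The searchsorted() lookup is an addition of mine, not part of the answer, and it assumes the index is sorted. All names and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per day for January 2013 (sorted index).
idx = pd.date_range("2013-01-01", periods=31, freq="D")
df = pd.DataFrame({"value": np.arange(31)}, index=idx)

# Option 1: while looping over the index, enumerate() supplies the
# integer position, which can be fed straight to .iloc.
first_windows = {}
for i, ts in enumerate(df.index[:3]):
    first_windows[ts] = df.iloc[i:i + 5]  # five rows starting at position i

# Option 2 (assumption, not from the answer): translate date bounds to
# integer positions once with searchsorted, then slice by position.
begin_date = pd.Timestamp("2013-01-05")
end_date = pd.Timestamp("2013-01-09")
start = df.index.searchsorted(begin_date, side="left")
stop = df.index.searchsorted(end_date, side="right")

by_position = df.iloc[start:stop]
by_label = df.loc[begin_date:end_date]
```

Here by_position and by_label select the same rows. Whether the positional version is actually faster depends on the pandas version and the data, so it is worth timing both on the real workload.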
python numpy pandas