python - Pandas optimization -




I wrote a function to process information with pandas. The profiling log from running %prun on the function is posted at the bottom (only the top few lines). I want to optimize my code because I need to call the function more than 4,000 times, and it took 37.7 s to run the function once.

It seems the most time-consuming part is the nonzero method of numpy.ndarray. Since all of my operations are based on pandas, I wonder which pandas functions rely on this method heavily?

My operations consist mainly of DataFrame slicing based on a DatetimeIndex using df.ix[] and DataFrame merges using pandas.merge().

I know it's hard to tell without posting the actual script, but the script is too long to be meaningful and the operations are ad hoc, so I can't rewrite it into a small script to post here.

16439731 function calls (16108083 primitive calls) in 37.766 seconds

Ordered by: internal time

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   7461    3.712    0.000    3.712    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
    244    1.731    0.007    5.434    0.022 index.py:1126(_partial_date_slice)
    122    1.655    0.014    1.655    0.014 {pandas.algos.inner_join_indexer_int64}
    610    1.578    0.003    1.578    0.003 {method 'factorize' of 'pandas.hashtable.int64factorizer' objects}
 118817    0.764    0.000    0.764    0.000 {method 'reduce' of 'numpy.ufunc' objects}
  22474    0.753    0.000    0.917    0.000 index.py:409(is_unique)
 353210    0.669    0.000    1.228    0.000 {numpy.core.multiarray.array}
1577935    0.596    0.000    0.925    0.000 {isinstance}
   1221    0.511    0.000    0.516    0.000 index.py:402(is_monotonic)
    183    0.427    0.002    0.427    0.002 {pandas.algos.left_outer_join}
  34529    0.376    0.000    1.286    0.000 index.py:98(__new__)
  12356    0.358    0.000    0.358    0.000 {method 'take' of 'numpy.ndarray' objects}
   3812    0.352    0.000    0.352    0.000 {pandas.algos.take_2d_axis0_int64_int64}
    610    0.344    0.001    0.349    0.001 index.py:35(wrapper)
    981    0.334    0.000    0.335    0.000 {method 'copy' of 'numpy.ndarray' objects}

df.ix[] is a little unpredictable: it is primarily label-based but has an integer-position fallback. You should try using .loc[] instead. If you pass a single label, it returns a Series holding the row at that index label. You can also slice by passing a range. Instead of:

df.ix[begin_date:end_date]

try:

df.loc[begin_date:end_date]
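
A minimal sketch of the .loc[] behaviour described above, on made-up data: the frame, the column name value and the two dates are assumptions for illustration, not taken from the original script. It is also worth knowing that .ix was removed in pandas 1.0, so on a current install .loc[] is the only label-based option anyway.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: ten daily rows on a DatetimeIndex.
idx = pd.date_range("2013-01-01", periods=10, freq="D")
df = pd.DataFrame({"value": np.arange(10)}, index=idx)

begin_date = pd.Timestamp("2013-01-03")
end_date = pd.Timestamp("2013-01-06")

# Label-based slice: both endpoints are included.
window = df.loc[begin_date:end_date]

# A single label returns the matching row as a Series.
row = df.loc[begin_date]

print(window)
print(row)
```

Note that, unlike ordinary Python slices, a label slice with .loc[] includes the end label.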

Even faster is to use the integer-based slicing method .iloc[]. Since you're looping over the index anyway, you can add enumerate() to the loop and use the enumerate() values as positions, i.e.:

df.iloc[4:9]

On my machine, .iloc tends to be about twice as fast as .loc.
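
The following is a rough sketch of the positional approach, assuming a sorted DatetimeIndex. The sample frame, the dates and the trailing-window loop are hypothetical; using Index.searchsorted to turn labels into positions is one possible way to get the integers when you are not already looping, not something the original script is known to do. The timing lines only print whatever your machine measures, and the ratio will vary with data size and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data: one row per minute on a sorted DatetimeIndex.
idx = pd.date_range("2013-01-01", periods=100_000, freq="min")
df = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

begin_date = pd.Timestamp("2013-01-10 00:00")
end_date = pd.Timestamp("2013-01-20 00:00")

# Translate the labels to integer positions once (requires a sorted index).
start = df.index.searchsorted(begin_date, side="left")
stop = df.index.searchsorted(end_date, side="right")  # "right" keeps end_date itself

by_position = df.iloc[start:stop]
by_label = df.loc[begin_date:end_date]

# When already looping over the index, enumerate() supplies the position directly.
trailing_sums = []
for pos, timestamp in enumerate(df.index[:5]):
    window = df.iloc[max(0, pos - 2):pos + 1]  # current row plus up to two before it
    trailing_sums.append(int(window["value"].sum()))

# Compare the two slicing styles; results depend on the machine.
loc_seconds = timeit.timeit(lambda: df.loc[begin_date:end_date], number=200)
iloc_seconds = timeit.timeit(lambda: df.iloc[start:stop], number=200)
print(f".loc: {loc_seconds:.4f}s  .iloc: {iloc_seconds:.4f}s")
```

The positions stay valid only as long as the frame is not re-sorted or filtered, so compute them after any such step.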

python numpy pandas
