python - Pandas optimization -

I wrote a function to process information with pandas. The profiling log from running %prun on the function is posted at the bottom (only the top few lines). I want to optimize the code because I need to call the function more than 4,000 times, and a single run took 37.7 s.

It seems the most time-consuming part is the nonzero method of numpy.ndarray. Since most of my operations are based on pandas, I wonder which pandas functions rely heavily on this method?

My operations consist of DataFrame slicing based on a DatetimeIndex using df.ix[] and DataFrame merges using pandas.merge().

I know it's hard to tell without seeing the actual script, but the script is too long to be meaningful and the operations are ad hoc, so I can't reduce it to a small script to post here.

         16439731 function calls (16108083 primitive calls) in 37.766 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7461    3.712    0.000    3.712    0.000 {method 'nonzero' of 'numpy.ndarray' objects}
      244    1.731    0.007    5.434    0.022 index.py:1126(_partial_date_slice)
      122    1.655    0.014    1.655    0.014 {pandas.algos.inner_join_indexer_int64}
      610    1.578    0.003    1.578    0.003 {method 'factorize' of 'pandas.hashtable.int64factorizer' objects}
   118817    0.764    0.000    0.764    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    22474    0.753    0.000    0.917    0.000 index.py:409(is_unique)
   353210    0.669    0.000    1.228    0.000 {numpy.core.multiarray.array}
  1577935    0.596    0.000    0.925    0.000 {isinstance}
     1221    0.511    0.000    0.516    0.000 index.py:402(is_monotonic)
      183    0.427    0.002    0.427    0.002 {pandas.algos.left_outer_join}
    34529    0.376    0.000    1.286    0.000 index.py:98(__new__)
    12356    0.358    0.000    0.358    0.000 {method 'take' of 'numpy.ndarray' objects}
     3812    0.352    0.000    0.352    0.000 {pandas.algos.take_2d_axis0_int64_int64}
      610    0.344    0.001    0.349    0.001 index.py:35(wrapper)
      981    0.334    0.000    0.335    0.000 {method 'copy' of 'numpy.ndarray' objects}

df.ix[] is a little unpredictable in that it is label-based but falls back to integer positions. You should try using .loc[] instead. If you pass a single label, it returns a Series of the row at that index label. You can also slice by passing a range. Instead of:

df.ix[begin_date:end_date]

try:

df.loc[begin_date:end_date]
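As a rough illustration of the .loc[] behaviour described above, here is a minimal sketch on a made-up daily DataFrame. The column name `value` and the dates are assumptions for the example, not taken from the original script. Note also that .ix was deprecated in pandas 0.20 and removed in 1.0, so on current versions .loc is the only one of the two that still works.

```python
import pandas as pd

# Hypothetical data: one row per day over ten days.
idx = pd.date_range("2013-01-01", periods=10, freq="D")
df = pd.DataFrame({"value": range(10)}, index=idx)

begin_date = pd.Timestamp("2013-01-03")
end_date = pd.Timestamp("2013-01-06")

# Label-based slice on the DatetimeIndex; both endpoints are included.
window = df.loc[begin_date:end_date]

# A single label returns that row as a Series.
row = df.loc[begin_date]
```

Unlike ordinary Python slices, a label slice with .loc includes the end label, so `window` here covers four days.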

Even faster is to use the integer-based slicing method .iloc[]. Since you're looping over the index anyway, add enumerate() to the loop and use the positions it yields, i.e.:

df.iloc[4:9]

On my machine, .iloc tends to be about twice as fast as .loc.
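A sketch of how the positional approach might look in practice, on the same kind of made-up daily data (the column name `value`, the dates, and the 3-row window are illustrative assumptions). The enumerate() loop follows the suggestion above. Converting date bounds to positions with searchsorted() is an additional technique not mentioned in the answer, and it assumes the index is sorted.

```python
import pandas as pd

# Hypothetical data: one row per day over ten days.
idx = pd.date_range("2013-01-01", periods=10, freq="D")
df = pd.DataFrame({"value": range(10)}, index=idx)

# enumerate() yields the integer position alongside each index label,
# so the loop body can slice by position with .iloc.
window_sums = {}
for i, ts in enumerate(df.index):
    window_sums[ts] = df["value"].iloc[i:i + 3].sum()

# If only date bounds are known, they can be converted to positions once
# (assumes a sorted index) and reused with .iloc.
begin_date = pd.Timestamp("2013-01-03")
end_date = pd.Timestamp("2013-01-06")
start = df.index.searchsorted(begin_date, side="left")
stop = df.index.searchsorted(end_date, side="right")

by_position = df.iloc[start:stop]
by_label = df.loc[begin_date:end_date]  # same rows as by_position
```

Whether this gives the roughly twofold speed-up quoted above depends on the pandas version and the data, so it is worth timing both forms with %timeit on the real DataFrame.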

python numpy pandas
