python - Aggregating with groupby-apply on DataFrame index (DatetimeIndex)

I'm trying to cut down meteorological data using pandas 0.13.1. I have a big DataFrame of floats. Thanks to this answer, I have grouped the data into half-hour intervals efficiently. I am using groupby+apply instead of resample because I need to examine multiple columns.

    >>> winddata
                                sonic_ux  sonic_uy  sonic_uz
    timestamp
    2014-04-30 14:13:12.300000  0.322444  2.530129  0.347921
    2014-04-30 14:13:12.400000  0.357793  2.571811  0.360840
    2014-04-30 14:13:12.500000  0.469529  2.400510  0.193011
    2014-04-30 14:13:12.600000  0.298787  2.212599  0.404752
    2014-04-30 14:13:12.700000  0.259310  2.054919  0.066324
    2014-04-30 14:13:12.800000  0.342952  1.962965  0.070500
    2014-04-30 14:13:12.900000  0.434589  2.210533 -0.010147
                                     ...       ...       ...

    [4361447 rows x 3 columns]
    >>> winddata.dtypes
    sonic_ux    float64
    sonic_uy    float64
    sonic_uz    float64
    dtype: object
    >>> hhdata = winddata.groupby(TimeGrouper('30T')); hhdata
    <pandas.core.groupby.DataFrameGroupBy object at 0xb440790c>

I want to use math.atan2 on the 'ux/uy' columns and am having a problem applying the function. I get tracebacks about a missing attribute ndim:

    >>> hhdata.apply(lambda g: atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean()))
    Traceback (most recent call last):
          <<snip>>
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-i686.egg/pandas/tools/merge.py", line 989, in __init__
        if not 0 <= axis <= sample.ndim:
    AttributeError: 'float' object has no attribute 'ndim'
    >>>
    >>> hhdata.apply(lambda g: 42)
    Traceback (most recent call last):
          <<snip>>
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-i686.egg/pandas/tools/merge.py", line 989, in __init__
        if not 0 <= axis <= sample.ndim:
    AttributeError: 'int' object has no attribute 'ndim'

I can loop through the groupby object just fine. I can also wrap the result in a Series or DataFrame, but wrapping the values requires adding an index, which then gets tuple-ed with the original index. Following the advice of this answer to remove the duplicate index didn't work as expected. Since I can reproduce the problem and solution from that question, I wonder if it's behaving differently because I'm grouping on a DatetimeIndex index.

    >>> for name, g in hhdata:
    ...     print name, atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean()), '   wd'
    ...
    2014-04-30 14:00:00 0.13861912975    wd
    2014-04-30 14:30:00 0.511709085506    wd
    2014-04-30 15:00:00 -1.5088990774    wd
    2014-04-30 15:30:00 0.13200013186    wd
        <<snip>>
    >>> def winddir(g):
    ...     return pd.Series(atan2( np.mean(g['sonic_ux']), np.mean(g['sonic_uy']) ), name='wd')
    ...
    >>> hhdata.apply(winddir)
    2014-04-30 14:00:00  0    0.138619
    2014-04-30 14:30:00  0    0.511709
    2014-04-30 15:00:00  0   -1.508899
    2014-04-30 15:30:00  0    0.132000
    ...
    2014-05-05 14:00:00  0   -2.551593
    2014-05-05 14:30:00  0   -2.523250
    2014-05-05 15:00:00  0   -2.698828
    Name: wd, Length: 243, dtype: float64
    >>> hhdata.apply(winddir).index[0]
    (Timestamp('2014-04-30 14:00:00', tz=None), 0)
    >>> def winddir(g):
    ...     return pd.DataFrame({'wd':atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean())}, index=[g.name])
    ...
    >>> hhdata.apply(winddir)
                                                   wd
    2014-04-30 14:00:00 2014-04-30 14:00:00  0.138619
    2014-04-30 14:30:00 2014-04-30 14:30:00  0.511709
    2014-04-30 15:00:00 2014-04-30 15:00:00 -1.508899
    2014-04-30 15:30:00 2014-04-30 15:30:00  0.132000
                                                  ...

    [243 rows x 1 columns]
    >>> hhdata.apply(winddir).index[0]
    (Timestamp('2014-04-30 14:00:00', tz=None), Timestamp('2014-04-30 14:00:00', tz=None))
    >>>
    >>> tsfast.groupby(TimeGrouper('30T')).apply(lambda g:
    ...     Series({'wd': atan2(g.sonic_ux.mean(), g.sonic_uy.mean()),
    ...             'ws': np.sqrt(g.sonic_ux.mean()**2 + g.sonic_uy.mean()**2)}))
    2014-04-30 14:00:00  wd    0.138619
                         ws    1.304311
    2014-04-30 14:30:00  wd    0.511709
                         ws    0.143762
    2014-04-30 15:00:00  wd   -1.508899
                         ws    0.856643
    ...
    2014-05-05 14:30:00  wd   -2.523250
                         ws    3.317810
    2014-05-05 15:00:00  wd   -2.698828
                         ws    3.279520
    Length: 486, dtype: float64
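As a side note on the last output above: a Series stacked on a two-level index (group timestamp, then 'wd'/'ws') can be pivoted into one row per group with `unstack`. This is only a sketch. It rebuilds the first two groups by hand, using the values printed above, rather than rerunning the groupby, and the names `stacked` and `table` are made up for illustration.

```python
import pandas as pd

# First two groups copied from the stacked output above:
# outer level = group timestamp, inner level = 'wd' / 'ws'.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2014-04-30 14:00:00", "2014-04-30 14:30:00"]), ["wd", "ws"]]
)
stacked = pd.Series([0.138619, 1.304311, 0.511709, 0.143762], index=index)

# Pivot the inner level into columns: one row per group, one column per statistic.
table = stacked.unstack()
print(table)
```

The result is a DataFrame indexed by the group timestamps with columns 'wd' and 'ws', which is the shape the question is after.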

Edited: Notice the extra index column when a Series or DataFrame is returned? And that following the formula of the linked answer results in a hierarchical index?

My original question was: what kind of value should be returned by the applied function so that a groupby-apply operation results in a 1-column DataFrame or Series with a length equal to the number of groups and the group names (e.g. timestamps) used as index values?

After feedback and further investigation, I am now asking why grouping on an index behaves differently from grouping on a column. Observe that changing the DatetimeIndex to a column of string values, to accomplish the equivalent grouping to TimeGrouper('30T'), results in the behavior I was expecting:

    >>> winddata.index.name = 'wasindex'
    >>> data2 = winddata.reset_index()
    >>> def to_hh(x): # <-- big hammer
    ...     ts = x.isoformat()
    ...     return ts[:14] + ('30:00' if int(ts[14:16]) >= 30 else '00:00')
    ...
    >>> data2['ts'] = data2['wasindex'].apply(lambda x: to_hh(x))
    >>> wd = data2.groupby('ts').apply(lambda df: Series({'wd': np.arctan2(df.x.mean(), df.y.mean())}))
    >>> type(wd)
    pandas.core.frame.DataFrame
    >>> wd.columns
    Index([u'wd'], dtype=object)
    >>> wd.index
    Index([u'2014-04-30T14:00:00', u'2014-04-30T14:30:00', <<snip>> dtype=object)
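For comparison, the string slicing in `to_hh` can probably be avoided by flooring the timestamps to their half-hour bin. The sketch below assumes a pandas version much newer than 0.13.1, because `Series.dt.floor` did not exist then. The sample frame is invented for illustration; only the column names `wasindex`, `x`, `y` and `ts` follow the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 90 one-minute rows starting at 14:13 (not the real wind data).
idx = pd.date_range("2014-04-30 14:13:00", periods=90, freq="min", name="wasindex")
data2 = pd.DataFrame(
    {"x": np.linspace(-1.0, 1.0, 90), "y": np.full(90, 2.0)}, index=idx
).reset_index()

# Round each timestamp down to its half-hour bin instead of slicing ISO strings.
data2["ts"] = data2["wasindex"].dt.floor("30min")

# Group on the new column; each group returns a one-entry Series named 'wd'.
wd = data2.groupby("ts")[["x", "y"]].apply(
    lambda df: pd.Series({"wd": np.arctan2(df["x"].mean(), df["y"].mean())})
)
print(wd)
```

Here the group keys stay real timestamps rather than strings, so the result can still be sliced or resampled by time.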

    In [31]: pd.set_option('max_rows',10)

    In [32]: winddata = DataFrame({ 'x' : np.random.randn(n), 'y' : np.random.randn(n)+2, 'z' : np.random.randn(n) },pd.date_range('20140430 14:13:12',periods=n,freq='100ms'))

    In [33]: winddata
    Out[33]:
                                       x         y         z
    2014-04-30 14:13:12        -0.065350  0.567525  2.212534
    2014-04-30 14:13:12.100000 -0.436498  2.591799  2.424359
    2014-04-30 14:13:12.200000 -1.059038  3.120631 -0.645579
    2014-04-30 14:13:12.300000  1.973474  0.630424  0.966405
    2014-04-30 14:13:12.400000  0.575082  1.941845 -0.674695
    ...                              ...       ...       ...
    2014-05-05 15:22:16.200000  0.601962  0.027834 -0.101967
    2014-05-05 15:22:16.300000  0.741777  1.764745  0.991516
    2014-05-05 15:22:16.400000 -0.494253  1.765930  2.493000
    2014-05-05 15:22:16.500000 -2.643749  0.671604  0.275096
    2014-05-05 15:22:16.600000  0.676698  0.958903  0.946942

    [4361447 rows x 3 columns]

    In [34]: winddata.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 4361447 entries, 2014-04-30 14:13:12 to 2014-05-05 15:22:16.600000
    Freq: 100L
    Data columns (total 3 columns):
    x    float64
    y    float64
    z    float64
    dtypes: float64(3)

In pandas < 0.14.0, use pd.TimeGrouper in place of pd.Grouper.

    In [35]: g = winddata.groupby(pd.Grouper(freq='30T'))

    In [36]: results = DataFrame({'x' : g['x'].mean(), 'y' : g['y'].mean() })

    In [37]: results['wd'] = np.arctan2(results['x'],results['y'])

    In [38]: results['ws'] = np.sqrt(results['x']**2+results['y']**2)

    In [39]: results
    Out[39]:
                                 x         y        wd        ws
    2014-04-30 14:00:00  0.005060  1.986778  0.002547  1.986784
    2014-04-30 14:30:00  0.004922  2.015551  0.002442  2.015557
    2014-04-30 15:00:00 -0.004209  1.988889 -0.002116  1.988893
    2014-04-30 15:30:00  0.008410  2.003453  0.004198  2.003470
    2014-04-30 16:00:00  0.004027  1.997369  0.002016  1.997373
    ...                       ...       ...       ...       ...
    2014-05-05 13:00:00  0.006901  1.991252  0.003466  1.991264
    2014-05-05 13:30:00  0.005458  2.008731  0.002717  2.008739
    2014-05-05 14:00:00 -0.000805  2.000045 -0.000402  2.000045
    2014-05-05 14:30:00 -0.004556  1.997437 -0.002281  1.997443
    2014-05-05 15:00:00  0.003444  2.000182  0.001722  2.000185

    [243 rows x 4 columns]
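The same idea (aggregate the per-group means first, then compute direction and speed on the aggregated frame) can be tried on a small frame as a self-contained script. This is only a sketch. It assumes a pandas version where `pd.Grouper` is available (0.14.0 or later) and uses the `'30min'` frequency alias. The sample data is synthetic and the name `means` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 120 one-minute rows (the real data is 100 ms resolution).
rng = np.random.RandomState(0)
idx = pd.date_range("2014-04-30 14:00:00", periods=120, freq="min")
winddata = pd.DataFrame({"x": rng.randn(120), "y": rng.randn(120) + 2}, index=idx)

# Group directly on the DatetimeIndex in half-hour bins and take per-group means.
means = winddata.groupby(pd.Grouper(freq="30min"))[["x", "y"]].mean()

# Compute direction and speed once, vectorized over the per-group means.
results = means.copy()
results["wd"] = np.arctan2(results["x"], results["y"])
results["ws"] = np.sqrt(results["x"] ** 2 + results["y"] ** 2)
print(results)
```

Because no Python function is called per group, this avoids `apply` altogether, and the result is indexed by the half-hour timestamps with one row per group.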
