python - Aggregating with groupby-apply on dataframe index (DatetimeIndex)




I'm trying to cut down meteorological data using pandas 0.13.1. I have a big DataFrame of floats. Thanks to this answer I have grouped the data into half-hour intervals efficiently. I am using groupby+apply instead of resample because I need to examine multiple columns.

    >>> winddata
                                sonic_ux  sonic_uy  sonic_uz
    timestamp
    2014-04-30 14:13:12.300000  0.322444  2.530129  0.347921
    2014-04-30 14:13:12.400000  0.357793  2.571811  0.360840
    2014-04-30 14:13:12.500000  0.469529  2.400510  0.193011
    2014-04-30 14:13:12.600000  0.298787  2.212599  0.404752
    2014-04-30 14:13:12.700000  0.259310  2.054919  0.066324
    2014-04-30 14:13:12.800000  0.342952  1.962965  0.070500
    2014-04-30 14:13:12.900000  0.434589  2.210533 -0.010147
                                     ...       ...       ...

    [4361447 rows x 3 columns]
    >>> winddata.dtypes
    sonic_ux    float64
    sonic_uy    float64
    sonic_uz    float64
    dtype: object
    >>> hhdata = winddata.groupby(TimeGrouper('30T')); hhdata
    <pandas.core.groupby.DataFrameGroupBy object at 0xb440790c>

I want to use math.atan2 on the 'ux'/'uy' columns, but I'm having a problem applying the function. The tracebacks complain about the attribute ndim:

    >>> hhdata.apply(lambda g: atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean()))
    Traceback (most recent call last):
      <<snip>>
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-i686.egg/pandas/tools/merge.py", line 989, in __init__
        if not 0 <= axis <= sample.ndim:
    AttributeError: 'float' object has no attribute 'ndim'
    >>>
    >>> hhdata.apply(lambda g: 42)
    Traceback (most recent call last):
      <<snip>>
      File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-i686.egg/pandas/tools/merge.py", line 989, in __init__
        if not 0 <= axis <= sample.ndim:
    AttributeError: 'int' object has no attribute 'ndim'

I can loop through the groupby object just fine. I can also wrap the result in a Series or DataFrame, but wrapping the values requires adding an index, which then gets tuple-ed with the original index. Following the advice of this answer to remove the duplicate index didn't work as expected. Since I can reproduce the problem and the solution from that question, I wonder if it's behaving differently here because I'm grouping on a DatetimeIndex index.

    >>> for name, g in hhdata:
    ...     print name, atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean()), ' wd'
    ...
    2014-04-30 14:00:00 0.13861912975 wd
    2014-04-30 14:30:00 0.511709085506 wd
    2014-04-30 15:00:00 -1.5088990774 wd
    2014-04-30 15:30:00 0.13200013186 wd
    <<snip>>
    >>> def winddir(g):
    ...     return pd.Series(atan2( np.mean(g['sonic_ux']), np.mean(g['sonic_uy']) ), name='wd')
    ...
    >>> hhdata.apply(winddir)
    2014-04-30 14:00:00  0    0.138619
    2014-04-30 14:30:00  0    0.511709
    2014-04-30 15:00:00  0   -1.508899
    2014-04-30 15:30:00  0    0.132000
    ...
    2014-05-05 14:00:00  0   -2.551593
    2014-05-05 14:30:00  0   -2.523250
    2014-05-05 15:00:00  0   -2.698828
    Name: wd, Length: 243, dtype: float64
    >>> hhdata.apply(winddir).index[0]
    (Timestamp('2014-04-30 14:00:00', tz=None), 0)
    >>> def winddir(g):
    ...     return pd.DataFrame({'wd':atan2(g['sonic_ux'].mean(), g['sonic_uy'].mean())}, index=[g.name])
    ...
    >>> hhdata.apply(winddir)
                                                   wd
    2014-04-30 14:00:00 2014-04-30 14:00:00  0.138619
    2014-04-30 14:30:00 2014-04-30 14:30:00  0.511709
    2014-04-30 15:00:00 2014-04-30 15:00:00 -1.508899
    2014-04-30 15:30:00 2014-04-30 15:30:00  0.132000
    ...
    [243 rows x 1 columns]
    >>> hhdata.apply(winddir).index[0]
    (Timestamp('2014-04-30 14:00:00', tz=None), Timestamp('2014-04-30 14:00:00', tz=None))
    >>>
    >>> tsfast.groupby(TimeGrouper('30T')).apply(lambda g:
    ...     Series({'wd': atan2(g.sonic_ux.mean(), g.sonic_uy.mean()),
    ...             'ws': np.sqrt(g.sonic_ux.mean()**2 + g.sonic_uy.mean()**2)}))
    2014-04-30 14:00:00  wd    0.138619
                         ws    1.304311
    2014-04-30 14:30:00  wd    0.511709
                         ws    0.143762
    2014-04-30 15:00:00  wd   -1.508899
                         ws    0.856643
    ...
    2014-05-05 14:30:00  wd   -2.523250
                         ws    3.317810
    2014-05-05 15:00:00  wd   -2.698828
                         ws    3.279520
    Length: 486, dtype: float64

Edited: Notice the extra column when a Series or DataFrame is returned? And that following the formula of the linked answer results in a hierarchical index?

My original question was: what kind of value should the applied function return so that the groupby-apply operation results in a 1-column DataFrame or Series whose length equals the number of groups, with the group names (e.g. Timestamps) used as index values?
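For reference, here is a minimal sketch of the shape being asked for, written against a recent pandas release rather than 0.13.1 (so `pd.Grouper` stands in for `TimeGrouper`). The synthetic `wind` frame and its two hours of random samples are assumptions made only for illustration. In recent versions, returning a plain scalar from the applied function should give a Series indexed by the group timestamps, and returning a Series with fixed labels should give a DataFrame with one column per label.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical stand-in for the wind data: 10 Hz samples covering two hours.
rng = np.random.default_rng(0)
n = 72_000
index = pd.date_range("2014-04-30 14:00:00", periods=n, freq="100ms", name="timestamp")
wind = pd.DataFrame(
    {"sonic_ux": rng.normal(size=n), "sonic_uy": rng.normal(size=n) + 2.0},
    index=index,
)

grouped = wind.groupby(pd.Grouper(freq="30min"))

# Returning a scalar: one value per group, indexed by the group timestamp.
wd = grouped.apply(lambda g: math.atan2(g["sonic_ux"].mean(), g["sonic_uy"].mean()))

# Returning a Series with fixed labels: the labels become columns.
both = grouped.apply(
    lambda g: pd.Series(
        {
            "wd": math.atan2(g["sonic_ux"].mean(), g["sonic_uy"].mean()),
            "ws": math.hypot(g["sonic_ux"].mean(), g["sonic_uy"].mean()),
        }
    )
)

print(wd)
print(both)
```

Neither result carries a second index level, which is the behavior the question was looking for.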

After feedback and further investigation, I am now asking: why does grouping on an index behave differently from grouping on a column? Observe that changing the DatetimeIndex to a column of string values, to accomplish the equivalent grouping of TimeGrouper('30T'), results in the behavior I was expecting:

    >>> winddata.index.name = 'wasindex'
    >>> data2 = winddata.reset_index()
    >>> def to_hh(x):  # <-- big hammer
    ...     ts = x.isoformat()
    ...     return ts[:14] + ('30:00' if int(ts[14:16]) >= 30 else '00:00')
    ...
    >>> data2['ts'] = data2['wasindex'].apply(lambda x: to_hh(x))
    >>> wd = data2.groupby('ts').apply(lambda df: Series({'wd': np.arctan2(df.x.mean(), df.y.mean())}))
    >>> type(wd)
    pandas.core.frame.DataFrame
    >>> wd.columns
    Index([u'wd'], dtype=object)
    >>> wd.index
    Index([u'2014-04-30T14:00:00', u'2014-04-30T14:30:00', <<snip>> dtype=object)
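As an aside, the string-slicing "big hammer" can probably be replaced in newer pandas by flooring the timestamps to the half hour with `Series.dt.floor`, which is not available in 0.13.1. The sketch below assumes a synthetic frame with columns `x` and `y` and reuses the `wasindex` and `ts` names from the session above; it is an illustration, not the original poster's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 10 Hz samples covering two hours.
rng = np.random.default_rng(1)
n = 72_000
index = pd.date_range("2014-04-30 14:00:00", periods=n, freq="100ms", name="wasindex")
winddata = pd.DataFrame(
    {"x": rng.normal(size=n), "y": rng.normal(size=n) + 2.0}, index=index
)

# Move the DatetimeIndex into a column, then floor each timestamp to its half hour.
data2 = winddata.reset_index()
data2["ts"] = data2["wasindex"].dt.floor("30min")

# Group on the new column and compute the direction from the per-group means.
means = data2.groupby("ts")[["x", "y"]].mean()
wd = np.arctan2(means["x"], means["y"]).rename("wd")

print(wd)
```

The `ts` column stays a datetime column here, so the resulting index holds Timestamps instead of ISO strings.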

    In [31]: pd.set_option('max_rows',10)

    In [32]: winddata = DataFrame({ 'x' : np.random.randn(n), 'y' : np.random.randn(n)+2, 'z' : np.random.randn(n) },pd.date_range('20140430 14:13:12',periods=n,freq='100ms'))

    In [33]: winddata
    Out[33]:
                                       x         y         z
    2014-04-30 14:13:12        -0.065350  0.567525  2.212534
    2014-04-30 14:13:12.100000 -0.436498  2.591799  2.424359
    2014-04-30 14:13:12.200000 -1.059038  3.120631 -0.645579
    2014-04-30 14:13:12.300000  1.973474  0.630424  0.966405
    2014-04-30 14:13:12.400000  0.575082  1.941845 -0.674695
    ...                              ...       ...       ...
    2014-05-05 15:22:16.200000  0.601962  0.027834 -0.101967
    2014-05-05 15:22:16.300000  0.741777  1.764745  0.991516
    2014-05-05 15:22:16.400000 -0.494253  1.765930  2.493000
    2014-05-05 15:22:16.500000 -2.643749  0.671604  0.275096
    2014-05-05 15:22:16.600000  0.676698  0.958903  0.946942

    [4361447 rows x 3 columns]

    In [34]: winddata.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 4361447 entries, 2014-04-30 14:13:12 to 2014-05-05 15:22:16.600000
    Freq: 100L
    Data columns (total 3 columns):
    x    float64
    y    float64
    z    float64
    dtypes: float64(3)

In pandas < 0.14.0, use pd.TimeGrouper('30T') in place of pd.Grouper(freq='30T').

    In [35]: g = winddata.groupby(pd.Grouper(freq='30T'))

    In [36]: results = DataFrame({'x' : g['x'].mean(), 'y' : g['y'].mean() })

    In [37]: results['wd'] = np.arctan2(results['x'],results['y'])

    In [38]: results['ws'] = np.sqrt(results['x']**2+results['y']**2)

    In [39]: results
    Out[39]:
                                x         y        wd        ws
    2014-04-30 14:00:00  0.005060  1.986778  0.002547  1.986784
    2014-04-30 14:30:00  0.004922  2.015551  0.002442  2.015557
    2014-04-30 15:00:00 -0.004209  1.988889 -0.002116  1.988893
    2014-04-30 15:30:00  0.008410  2.003453  0.004198  2.003470
    2014-04-30 16:00:00  0.004027  1.997369  0.002016  1.997373
    ...                       ...       ...       ...       ...
    2014-05-05 13:00:00  0.006901  1.991252  0.003466  1.991264
    2014-05-05 13:30:00  0.005458  2.008731  0.002717  2.008739
    2014-05-05 14:00:00 -0.000805  2.000045 -0.000402  2.000045
    2014-05-05 14:30:00 -0.004556  1.997437 -0.002281  1.997443
    2014-05-05 15:00:00  0.003444  2.000182  0.001722  2.000185

    [243 rows x 4 columns]
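The session above avoids apply entirely: it aggregates each half-hour group to its mean first, then runs vectorized NumPy functions on the resulting columns. A compact, self-contained version of the same idea for a recent pandas might look like the sketch below. The small two-hour random dataset is an assumption for illustration, `'30min'` is used because newer releases deprecate the `'30T'` alias, and `np.hypot` is swapped in for the explicit square root.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 10 Hz samples covering two hours.
rng = np.random.default_rng(2)
n = 72_000
index = pd.date_range("2014-04-30 14:00:00", periods=n, freq="100ms")
winddata = pd.DataFrame(
    {
        "x": rng.normal(size=n),
        "y": rng.normal(size=n) + 2.0,
        "z": rng.normal(size=n),
    },
    index=index,
)

# Step 1: reduce every half-hour group to its column means.
results = winddata.groupby(pd.Grouper(freq="30min"))[["x", "y"]].mean()

# Step 2: vectorized math on the aggregated columns, no per-group Python calls.
results["wd"] = np.arctan2(results["x"], results["y"])
results["ws"] = np.hypot(results["x"], results["y"])

print(results)
```

Because the trigonometry runs once on a few hundred aggregated rows instead of once per group inside apply, this form also tends to be faster on large frames.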

python pandas
