hadoop - Optimizing pig script
I am trying to generate aggregated output, but everything ends up going to a single reducer (the FILTER and COUNT inside the nested FOREACH are causing the problem). How can I optimize the following script?
Expected output: group, 10, 2, 12, 34, ...
info = LOAD '/input/useragents' USING PigStorage('\t')
       AS (ua:chararray, col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);

grp1 = GROUP info BY ua PARALLEL 50;

fr1 = FOREACH grp1 {
    fltrcol1 = FILTER info BY col1 == 'other';
    fltrcol2 = FILTER info BY col2 == 'other';
    fltrcol3 = FILTER info BY col3 == 'other';
    fltrcol4 = FILTER info BY col4 == 'other';
    fltrcol5 = FILTER info BY col5 == 'other';
    cnt_fltrcol1 = COUNT(fltrcol1);
    cnt_fltrcol2 = COUNT(fltrcol2);
    cnt_fltrcol3 = COUNT(fltrcol3);
    cnt_fltrcol4 = COUNT(fltrcol4);
    cnt_fltrcol5 = COUNT(fltrcol5);
    GENERATE group, cnt_fltrcol1, cnt_fltrcol2, cnt_fltrcol3, cnt_fltrcol4, cnt_fltrcol5;
};
You can push the filter logic before the grouping by adding fltrcol{1,2,3,4,5} columns as integer flags (1 when the column equals 'other', 0 otherwise) and then summing them up after the GROUP. Off the top of my head, the script would look something like this:
info = LOAD '/input/useragents' USING PigStorage('\t')
       AS (ua:chararray, col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);

-- turn each filter condition into a 0/1 flag before grouping
-- (FILTER is a reserved word in Pig, so the intermediate relation is named flagged)
flagged = FOREACH info GENERATE
    ua,
    ((col1 == 'other') ? 1 : 0) AS fltrcol1,
    ((col2 == 'other') ? 1 : 0) AS fltrcol2,
    ((col3 == 'other') ? 1 : 0) AS fltrcol3,
    ((col4 == 'other') ? 1 : 0) AS fltrcol4,
    ((col5 == 'other') ? 1 : 0) AS fltrcol5;

grp1 = GROUP flagged BY ua PARALLEL 50;

fr1 = FOREACH grp1 {
    cnt_fltrcol1 = SUM(flagged.fltrcol1);
    cnt_fltrcol2 = SUM(flagged.fltrcol2);
    cnt_fltrcol3 = SUM(flagged.fltrcol3);
    cnt_fltrcol4 = SUM(flagged.fltrcol4);
    cnt_fltrcol5 = SUM(flagged.fltrcol5);
    GENERATE group, cnt_fltrcol1, cnt_fltrcol2, cnt_fltrcol3, cnt_fltrcol4, cnt_fltrcol5;
};
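For what it's worth, since the counts are now plain SUMs over the grouped bag, the nested block is not strictly needed; a sketch of the same aggregation written flat is below. The output path is just a placeholder for illustration, and the relation and field names follow the sketch above.

-- same aggregation without a nested block; names follow the sketch above
grp1 = GROUP flagged BY ua PARALLEL 50;
fr1  = FOREACH grp1 GENERATE
           group,
           SUM(flagged.fltrcol1) AS cnt_fltrcol1,
           SUM(flagged.fltrcol2) AS cnt_fltrcol2,
           SUM(flagged.fltrcol3) AS cnt_fltrcol3,
           SUM(flagged.fltrcol4) AS cnt_fltrcol4,
           SUM(flagged.fltrcol5) AS cnt_fltrcol5;
-- '/output/useragent_counts' is a placeholder output path
STORE fr1 INTO '/output/useragent_counts' USING PigStorage('\t');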
Tags: hadoop, apache-pig