hadoop - Optimizing pig script
I am trying to generate aggregated output. The issue is that everything goes to a single reducer (the nested FILTER and COUNT are creating the problem). How can I optimize the following script?
Expected output: group, 10, 2, 12, 34, ...
info = LOAD '/input/useragents' USING PigStorage('\t') AS (ua:chararray, col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
grp1 = GROUP info BY ua PARALLEL 50;
fr1 = FOREACH grp1 {
    fltrcol1 = FILTER info BY col1 == 'other';
    fltrcol2 = FILTER info BY col2 == 'other';
    fltrcol3 = FILTER info BY col3 == 'other';
    fltrcol4 = FILTER info BY col4 == 'other';
    fltrcol5 = FILTER info BY col5 == 'other';
    cnt_fltrcol1 = COUNT(fltrcol1);
    cnt_fltrcol2 = COUNT(fltrcol2);
    cnt_fltrcol3 = COUNT(fltrcol3);
    cnt_fltrcol4 = COUNT(fltrcol4);
    cnt_fltrcol5 = COUNT(fltrcol5);
    GENERATE group, cnt_fltrcol1, cnt_fltrcol2, cnt_fltrcol3, cnt_fltrcol4, cnt_fltrcol5;
};
You can move the filter logic to before the grouping: add fltrcol{1,2,3,4,5} as integer flag columns, then sum them up in the grouped FOREACH. Off the top of my head, here is the script:
info = LOAD '/input/useragents' USING PigStorage('\t') AS (ua:chararray, col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
-- FILTER is a reserved word in Pig, so the flag relation gets a different alias
flagged = FOREACH info GENERATE ua,
    ((col1 == 'other') ? 1 : 0) AS fltrcol1,
    ((col2 == 'other') ? 1 : 0) AS fltrcol2,
    ((col3 == 'other') ? 1 : 0) AS fltrcol3,
    ((col4 == 'other') ? 1 : 0) AS fltrcol4,
    ((col5 == 'other') ? 1 : 0) AS fltrcol5;
grp1 = GROUP flagged BY ua PARALLEL 50;
fr1 = FOREACH grp1 {
    cnt_fltrcol1 = SUM(flagged.fltrcol1);
    cnt_fltrcol2 = SUM(flagged.fltrcol2);
    cnt_fltrcol3 = SUM(flagged.fltrcol3);
    cnt_fltrcol4 = SUM(flagged.fltrcol4);
    cnt_fltrcol5 = SUM(flagged.fltrcol5);
    GENERATE group, cnt_fltrcol1, cnt_fltrcol2, cnt_fltrcol3, cnt_fltrcol4, cnt_fltrcol5;
};
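If it helps, here is a slightly more compact sketch of the same idea (assuming the flagged and grp1 aliases from the script above, and a placeholder output path): collapsing the nested block into a plain FOREACH keeps the plan to projections and algebraic SUMs, which may also make it easier for Pig to apply the combiner so less data is shuffled to the 50 reducers.

-- sketch only: assumes the flagged/grp1 relations defined above
fr1 = FOREACH grp1 GENERATE group,
    SUM(flagged.fltrcol1) AS cnt_fltrcol1,
    SUM(flagged.fltrcol2) AS cnt_fltrcol2,
    SUM(flagged.fltrcol3) AS cnt_fltrcol3,
    SUM(flagged.fltrcol4) AS cnt_fltrcol4,
    SUM(flagged.fltrcol5) AS cnt_fltrcol5;
-- '/output/ua_counts' is a placeholder path
STORE fr1 INTO '/output/ua_counts' USING PigStorage('\t');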
 