scala - Inner map function in Spark
I have 2 RDDs:
RDD1[String, Double]
Sample data:
("a", 1.0) ("b", 2.0) ("c", 3.0) ("d", 4.0)
This corresponds to key-value pairs.
RDD2[String, (String, String)]
Sample data:
("a", ("b", "c")) ("b", ("a", "b")) ("c", ("a", "d")) ("d", ("a", "b"))
RDD1 contains the values required by RDD2, so I want to be able to access the values of RDD1 from within RDD2, such that:
("a", ("b", "c")) maps to ("a", (2.0, 3.0))
where 2.0 and 3.0 are the values for "b" and "c" in RDD1.
How can I accomplish this in Scala/Spark? One possible solution is to convert RDD1 into a HashMap and "get" the values within the map operation of RDD2:
rdd2.map(m => rdd1HashMap.get(m._2._1))
Is there an alternative method to accomplish this?
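For concreteness, a sketch of that HashMap approach (assuming rdd1 is small enough to collect to the driver, and that every referenced key exists in it):

    // hypothetical completion of the snippet above: collect rdd1 into a local map,
    // then look up both halves of each rdd2 value inside the map operation
    val rdd1HashMap = rdd1.collectAsMap()
    val result = rdd2.map { case (k, (x, y)) =>
      (k, (rdd1HashMap(x), rdd1HashMap(y)))
    }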
If rdd1 is small enough to fit in a hash map, you should use a broadcast variable (wild guess: anything in the low 10's of millions of entries should be fine). If not, you have 2 options.
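A minimal sketch of the broadcast route, assuming sc is the SparkContext and rdd1 fits in driver memory:

    // collect rdd1 locally and broadcast it once,
    // instead of shipping the map with every task closure
    val lookupMap = sc.broadcast(rdd1.collectAsMap())
    val resolved = rdd2.map { case (k, (x, y)) =>
      (k, (lookupMap.value(x), lookupMap.value(y)))  // assumes both keys exist in rdd1
    }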
Use PairRDDFunctions.lookup, which may be extremely inefficient/illegal (although it worked fine locally):

    rdd1.cache()
    rdd2.map(m => rdd1.lookup(m._2._1))
The second alternative is more complex: you have to do 2 joins (Spark still doesn't have support for joining more than 2 datasets at a time).
    val joinedDataset = rdd2.map { case (k, (x, y)) => (x, (k, y)) }.join(rdd1)   // resolve the first reference
      .map { case (_, ((k, y), xVal)) => (y, (xVal, k)) }.join(rdd1)              // resolve the second reference
      .map { case (_, ((xVal, k), yVal)) => (k, (xVal, yVal)) }
That should give you the dataset you wanted. I realize the intermediate RDDs are extremely messy, so you may want to use case classes, or do the 2 joins separately and then join the RDDs, to make it clearer (if less efficient). Also, I noticed that for some reason Scala can't perform type inference on plain tuple lambdas here (hence the case pattern syntax), so I think you should try 1 of the other 2 options before resorting to this.
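To illustrate the case-class suggestion, here is one way the same pipeline might look with a small helper class instead of nested tuple accessors (the names are made up for illustration):

    // hypothetical helper carrying the original key and the first resolved value
    case class Partial(origKey: String, firstVal: Double)

    val resolved = rdd2
      .map { case (k, (x, y)) => (x, (k, y)) }
      .join(rdd1)                                                // -> (x, ((k, y), xVal))
      .map { case (_, ((k, y), xVal)) => (y, Partial(k, xVal)) }
      .join(rdd1)                                                // -> (y, (Partial, yVal))
      .map { case (_, (p, yVal)) => (p.origKey, (p.firstVal, yVal)) }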
scala apache-spark