scala - Inner map function in Spark

I have 2 RDDs:

rdd1: RDD[(String, Double)]

Sample info:

("a", 1.0) ("b", 2.0) ("c", 3.0) ("d", 4.0)

This corresponds to key/value pairs.

rdd2: RDD[(String, (String, String))]

Sample info:

("a", ("b", "c")) ("b", ("a", "b")) ("c", ("a", "d")) ("d", ("a", "b"))

rdd1 contains the values required by rdd2, so I want to be able to access the values of rdd2 in rdd1 like this:

("a", ("b", "c")) maps to ("a", (2.0, 3.0))

where 2.0 and 3.0 are the corresponding values in rdd1.

How can I accomplish this in Scala/Spark? One possible solution is to convert rdd1 to a HashMap and "get" the values within the map operation of rdd2:

rdd2.map(m => rdd1HashMap.get(m._2._1))
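As a rough sketch, looking up both inner keys this way might look like the following (assuming rdd1 is small enough to collect to the driver; the name rdd1Map and the use of collectAsMap are my own choices, not from the question):

// collect rdd1 into a local map on the driver (only viable for a small rdd1)
val rdd1Map = rdd1.collectAsMap()
// look up both inner keys of each rdd2 record
val result = rdd2.map { case (k, (a, b)) => (k, (rdd1Map(a), rdd1Map(b))) }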

Is there an alternative method to accomplish this?

If rdd1 is small enough to fit in a hash map, you should put it in one and use a broadcast variable (as a wild guess, anything in the low 10's of millions of entries should be fine). If not, you have 2 options.
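A minimal sketch of the broadcast approach (assuming a SparkContext named sc; the variable names are illustrative):

// build the map on the driver and ship it once to every executor
val rdd1Broadcast = sc.broadcast(rdd1.collectAsMap())
// read the broadcast value inside the map over rdd2
val result = rdd2.map { case (k, (a, b)) =>
  (k, (rdd1Broadcast.value(a), rdd1Broadcast.value(b)))
}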

Use PairRDDFunctions.lookup, which may be extremely inefficient/illegal, since it invokes an RDD operation from inside another RDD's transformation (although it worked fine locally):

rdd1.cache()
rdd2.map(m => rdd1.lookup(m._2._1))

The second alternative is more complex: you have to do 2 joins (Spark still doesn't have support for joining more than 2 datasets at a time):

val joinedDataSet = rdd2
  .map { case (k, (a, b)) => (a, (k, b)) }                             // key by the first inner value
  .join(rdd1).map { case (a, ((k, b), aVal)) => (b, (k, aVal)) }       // attach rdd1's value for "a", re-key by the second inner value
  .join(rdd1).map { case (b, ((k, aVal), bVal)) => (k, (aVal, bVal)) } // attach rdd1's value for "b", restore the original key

That should give you the dataset you wanted. I realize the RDD manipulation is extremely messy, so you may want to use case classes, or do the 2 joins separately and then join the resulting RDDs to make it clearer (if less efficient). I also noticed that Scala can't perform type inference on plain (k, v) => lambdas over the tuples here (hence the case pattern matching above), so I think you should try one of the other 2 options before resorting to this.
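As a rough illustration of doing the 2 joins separately (the intermediate names firstVals and secondVals are mine; this trades extra shuffles for readability):

// look up rdd1's value for each inner key, re-keyed by the original rdd2 key
val firstVals  = rdd2.map { case (k, (a, _)) => (a, k) }.join(rdd1).map { case (_, (k, aVal)) => (k, aVal) }
val secondVals = rdd2.map { case (k, (_, b)) => (b, k) }.join(rdd1).map { case (_, (k, bVal)) => (k, bVal) }
// join the two partial results back together on the original key
val joinedDataSet = firstVals.join(secondVals)  // RDD[(String, (Double, Double))]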

scala apache-spark
