regex - Detect rows in a data frame that are highly similar but not necessarily exact duplicates -




I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the information from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).

Here is a simple working example.

df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))

In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (it is too dissimilar). Thanks for any suggestions.

You can use agrep, but first you need to concatenate the columns so that the fuzzy search covers all columns, not just the first one.

xx <- do.call(paste0, df)
df[agrep(xx[1], xx, max = 0.6 * nchar(xx[1])), ]
     name    id score
1  andrew 12334    90
2  andrem 12344    90
4 pamdrew 98974    95

Note that with 0.7 you get all rows.
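Because agrep matches the pattern against any substring of each candidate and takes an absolute edit budget, the threshold can be hard to tune. A different technique, not the one used in this answer, is to compute pairwise edit distances with adist from base R and turn them into a similarity fraction, which maps more directly onto the "75% of characters" idea in the question. This is only a sketch: normalising by the longer of the two strings is an assumption, and the names keys, len and sim are illustrative.

```r
df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))

# One string per row, built from all columns
keys <- do.call(paste0, df)

# Pairwise generalized Levenshtein distances between the row strings
d <- adist(keys)

# Assumed normalisation: divide by the length of the longer string in each pair
len <- nchar(keys)
sim <- 1 - d / outer(len, len, pmax)

# Rows whose similarity to row 1 is at least 75% (row 1 itself is included)
similar_to_first <- which(sim[1, ] >= 0.75)
df[similar_to_first, ]
```

On the example data this should keep rows 1 and 2 and leave out rows 3 and 4, since row 4 differs from row 1 in roughly half of its characters.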

Once the rows are matched, you should extract them from the data.frame and repeat the same process for the other rows (here row 3, with the rest of the data)...
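The extract-and-repeat step could be sketched roughly as below. group_similar_rows is a hypothetical helper, not part of any package, and the frac argument simply reuses the 0.6 multiplier from the call above. Each row not yet assigned becomes the seed of a new group and collects whatever agrep matches among the rows still unassigned.

```r
# Hypothetical helper: labels each row with a group id by repeating the
# agrep search on the rows that have not been matched yet.
group_similar_rows <- function(df, frac = 0.6) {
  keys <- do.call(paste0, df)              # one string per row
  group <- rep(NA_integer_, nrow(df))
  next_id <- 1L
  for (i in seq_len(nrow(df))) {
    if (!is.na(group[i])) next             # row already belongs to a group
    candidates <- which(is.na(group))      # only search the remaining rows
    hits <- candidates[agrep(keys[i], keys[candidates],
                             max.distance = frac * nchar(keys[i]))]
    group[union(i, hits)] <- next_id       # the seed row always joins its own group
    next_id <- next_id + 1L
  }
  cbind(df, group = group)
}

df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))
grouped <- group_similar_rows(df, frac = 0.6)
grouped
```

The result depends on row order, because a row is assigned to the first seed that matches it. Lowering frac makes the grouping stricter.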

regex r duplicates agrep
