regex - Detect rows in a data frame that are highly similar but not necessarily exact duplicates

I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data in each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).

Here is a simple working example.

df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))

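In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (it is too dissimilar). Thanks for any suggestions.

You can use agrep, but first you need to concatenate all the columns so that the fuzzy search is done across all columns and not just the first one.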

xx <- do.call(paste0, df)
df[agrep(xx[1], xx, max = 0.6 * nchar(xx[1])), ]

     name    id score
1  andrew 12334    90
2  andrem 12344    90
4 pamdrew 98974    95

Note that with 0.7 instead of 0.6 you get all the rows.
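agrep looks for an approximate match of the pattern anywhere inside each string, so its threshold is an edit budget rather than a percentage of the whole row. If the goal is the "75% of characters" rule from the question, one possible alternative (not part of the answer above) is to compare whole strings with adist from base R and normalise the distance. The sketch below assumes that similarity is defined as 1 minus the Levenshtein distance divided by the length of the longer string; the names sim, threshold and pairs are illustrative only.

```r
# Example data from the question.
df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))

# Concatenate every column of each row into a single string.
xx <- do.call(paste0, df)

# Full-string Levenshtein distance between every pair of rows.
d <- adist(xx)

# Assumed similarity measure: 1 - distance / length of the longer string.
len <- nchar(xx)
sim <- 1 - d / outer(len, len, pmax)

# Hypothetical threshold: rows that are at least 75% similar count as a match.
threshold <- 0.75

# Keep each pair once (upper triangle only, which also drops self-matches).
pairs <- which(sim >= threshold & upper.tri(sim), arr.ind = TRUE)
pairs
```

With this definition rows 1 and 2 differ by two substitutions out of 13 characters, so they pass the 75% threshold, while row 4 does not. Because adist compares every pair, this approach is only practical for data frames of moderate size.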

Once the rows are matched, you should extract them from the data.frame and repeat the same process for the other rows (row 3 here, with the rest of the data)...
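The repeat-and-extract step can be sketched as a loop. This is only one way to do it, assuming a greedy strategy in which the first unassigned row is used as the pattern and every remaining row that agrep matches is placed in its group; the variables remaining and groups are made up for the illustration.

```r
# Example data from the question.
df <- data.frame(name = c("andrew", "andrem", "adam", "pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))

# Concatenate every column of each row into a single string.
xx <- do.call(paste0, df)

remaining <- seq_along(xx)  # row indices not yet assigned to a group
groups <- list()            # each element holds the row indices of one group

while (length(remaining) > 0) {
  seed <- remaining[1]
  # Fuzzy-match the seed row against the rows that are still unassigned,
  # using the same 0.6 * nchar() edit budget as in the answer.
  hits <- remaining[agrep(xx[seed], xx[remaining],
                          max.distance = 0.6 * nchar(xx[seed]))]
  # Make sure the seed itself is always part of its own group.
  hits <- union(seed, hits)
  groups[[length(groups) + 1]] <- hits
  # Extract the matched rows and continue with the rest of the data.
  remaining <- setdiff(remaining, hits)
}

# Rows of the original data frame, one data frame per group.
lapply(groups, function(idx) df[idx, ])
```

The grouping is order dependent: a row goes to the first seed that matches it, even if a later row would have been a closer match.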

regex r duplicates agrep
