regex - Detect rows in a data frame that are highly similar but not necessarily exact duplicates
I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the info from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).
Here is a simple working example.
df<-data.frame(name = c("andrew", "andrem", "adam", "pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (it is too dissimilar). Thanks for any suggestions.
You can use agrep.
First you need to concatenate all the columns, so that the fuzzy search covers every column and not just the first one.
xx <- do.call(paste0, df)
df[agrep(xx[1], xx, max = 0.6 * nchar(xx[1])), ]
     name    id score
1  andrew 12334    90
2  andrem 12344    90
4 pamdrew 98974    95
Note that with 0.7 you get all rows.
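If an explicit, adjustable similarity percentage is wanted (the 75% from the question), one possible sketch is to swap agrep for adist, which returns the Levenshtein edit distance between strings, and scale that distance by the length of the longer string. This is not the approach used in the answer above; the names sim, threshold and matches_row1 are made up for illustration, and the 0.75 cutoff is simply the question's example.

```r
# Sketch only: adist() (Levenshtein distance) is swapped in for agrep() here.
df <- data.frame(
  name  = c("andrew", "andrem", "adam", "pamdrew"),
  id    = c(12334, 12344, 34345, 98974),
  score = c(90, 90, 83, 95)
)

# One string per row, built the same way as in the answer.
xx <- do.call(paste0, df)

# Pairwise edit distances, scaled to a similarity between 0 and 1
# by dividing by the length of the longer string in each pair.
d   <- adist(xx)
len <- outer(nchar(xx), nchar(xx), pmax)
sim <- 1 - d / len

# Assumed cutoff, taken from the 75% example in the question.
threshold <- 0.75

# Rows whose similarity to row 1 meets the cutoff (row 1 matches itself).
matches_row1 <- unname(which(sim[1, ] >= threshold))
df[matches_row1, ]
```

With this definition, row 2 is two edits away from row 1 and passes the cutoff, while row 4 is seven edits away and does not. Raising or lowering threshold changes how strict the match is.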
Once the rows are matched, you should extract them from the data.frame and repeat the same process for the other rows (row 3 here, with the rest of the data)...
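The "match, extract, repeat" process described above could be sketched as a greedy loop. This is only one way to do it, under stated assumptions: group_similar_rows is a hypothetical helper, it uses adist-based similarity (edit distance divided by the length of the longer string) rather than agrep, and the 0.75 default is the question's example cutoff.

```r
# Hypothetical helper: greedily assign a group id to each row.
# Rows already placed in a group are skipped, which mirrors
# "extract the matched rows and repeat on the rest".
group_similar_rows <- function(df, threshold = 0.75) {
  xx  <- do.call(paste0, df)                    # one string per row
  len <- outer(nchar(xx), nchar(xx), pmax)      # longer length of each pair
  sim <- 1 - adist(xx) / len                    # similarity in [0, 1]

  group   <- rep(NA_integer_, nrow(df))
  next_id <- 1L
  for (i in seq_len(nrow(df))) {
    if (!is.na(group[i])) next                  # already matched earlier
    # Unassigned rows similar enough to row i (always includes i itself).
    hits <- which(is.na(group) & sim[i, ] >= threshold)
    group[hits] <- next_id
    next_id <- next_id + 1L
  }
  group
}

df <- data.frame(
  name  = c("andrew", "andrem", "adam", "pamdrew"),
  id    = c(12334, 12344, 34345, 98974),
  score = c(90, 90, 83, 95)
)
df$group <- group_similar_rows(df)
df
```

On the example data, rows 1 and 2 end up in the same group and rows 3 and 4 each get a group of their own. Because the loop is greedy, the grouping can depend on row order when similarity is not transitive.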
regex r duplicates agrep