r - Extract string elements that possibly appear multiple times, or not at all
I start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example", and "sample" in the illustration below.
    urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")
I first remove ".com" and whatever might follow it, and keep the first part:
    urls <- sapply(strsplit(urls, split = "(?<=.)(?=\\.com)", perl = TRUE), "[", 1)
    urls
    # [1] "http://grand.test"       "https://example"         "http://.big.time.sample"
My next step is to remove the http:// and https:// portions with a chained gsub() call:
    urls <- gsub("^http://", "", gsub("^https://", "", urls))
    urls
    # [1] "grand.test"       "example"          ".big.time.sample"
But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the call below returns NA for the second string, since the "example" string has no period remaining. Or, if I retain only the first part, I don't always get the company name.
    sapply(strsplit(urls, split = "\\."), "[", 2)
    # [1] "test" NA     "big"
    sapply(strsplit(urls, split = "\\."), "[", 1)
    # [1] "grand"   "example" ""
Perhaps an ifelse() call that counts the number of periods remaining, and only uses strsplit if there is more than one period? Also note that it is possible there are two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. But this didn't:
    strsplit(urls, split = "(?=\\.)", perl = TRUE)
Thank you for any suggestions.
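The period counting may not be needed at all. Below is a minimal sketch, assuming the company name is always the last dot-separated piece once ".com" and the scheme have been removed; the names `pieces` and `company` are made up for illustration, and the input is the intermediate vector from the question.

```r
# Intermediate result from the question, after stripping ".com" and the scheme
urls <- c("grand.test", "example", ".big.time.sample")

# Split on literal dots; fixed = TRUE avoids having to escape the period
pieces <- strsplit(urls, split = ".", fixed = TRUE)

# Keep the last piece of each split, however many dots came before it
company <- vapply(pieces, function(p) p[length(p)], character(1))
company
# [1] "test"    "example" "sample"
```

Taking the last element rather than a fixed position is what makes a string with no dots ("example") and a string with several dots (".big.time.sample") behave the same way.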
Here's an approach that may be easier to understand and generalize than some of the others:
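    pat = "(.*?)(\\w+)(\\.com.*)"
    gsub(pat, "\\2", urls)
Applied to the original urls vector (before ".com" is stripped), it works by breaking each string into three capture groups that together match the entire string, and substituting in just capture group (2), the one you want.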
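To see what each capture group picks up, the groups can be pulled out individually. This is only a sketch: it assumes the original, unmodified urls vector from the question, passes `perl = TRUE` (not used in the answer's call) so that the lazy quantifier follows PCRE semantics, and the name `groups` is made up for illustration.

```r
# Original URLs from the question, with ".com" still present
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")
pat <- "(.*?)(\\w+)(\\.com.*)"

# regexec() records the full match plus each capture group;
# regmatches() turns those positions into strings
groups <- regmatches(urls, regexec(pat, urls, perl = TRUE))
groups[[1]]
# [1] "http://grand.test.com/" "http://grand."          "test"
# [4] ".com/"

# Element 3 of each result is capture group (2), the company name
sapply(groups, "[", 3)
# [1] "test"    "example" "sample"

# Same result through substitution, as in the answer
gsub(pat, "\\2", urls, perl = TRUE)
# [1] "test"    "example" "sample"
```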
    pat = "(.*?)(\\w+)(\\.com.*)"
    #        ^    ^       ^
    #        |    |       |
    #       (1)  (2)     (3)
Edit (adding explanation of the ? modifier):
Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It essentially tells the regex engine to match as many characters as it can ... without using up any that could otherwise become part of the following capture group (2).
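Without a trailing ?, repetition quantifiers are greedy by default. In this case a greedy capture group, (.*), since it matches any number of any type of characters, would "eat up" as much of the string as it can, giving back only the bare minimum the other two capture groups need in order to match. Group (2) would then capture just the last letter of the company name, which is not the behavior we want!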
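The difference is easy to check side by side. A small sketch, assuming the original urls vector from the question; the names `lazy` and `greedy` are made up for illustration, and `perl = TRUE` is added so both patterns run under the same PCRE engine.

```r
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")

lazy   <- "(.*?)(\\w+)(\\.com.*)"  # group (1) takes as little as possible
greedy <- "(.*)(\\w+)(\\.com.*)"   # group (1) takes as much as possible

sub(lazy, "\\2", urls, perl = TRUE)
# [1] "test"    "example" "sample"

# The greedy group (1) leaves a single word character for group (2)
sub(greedy, "\\2", urls, perl = TRUE)
# [1] "t" "e" "e"
```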
r substring regex-lookarounds strsplit