r - Extract string elements that possibly appear multiple times, or not at all

I start with a character vector of URLs. My goal is to end up with only the name of the company, meaning a vector containing only "test", "example", and "sample" in the example below.

urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")

I first remove ".com" and whatever might follow it, keeping the first part:

urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=TRUE), "[", 1)
urls
# [1] "http://grand.test" "https://example" "http://.big.time.sample"

My next step is to remove the http:// and https:// portions with a chained gsub() call:

urls <- gsub("^http://", "", gsub("^https://", "", urls))
urls
# [1] "grand.test" "example" ".big.time.sample"

But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the first call below returns NA for the second string, since the "example" string has no period remaining. Or, if I keep only the first part, as in the second call, I lose a company name.

# each call below is applied to the same input, not one after the other
urls <- sapply(strsplit(urls, split = "\\."), "[", 2)
urls
# [1] "test" NA "big"
urls <- sapply(strsplit(urls, split = "\\."), "[", 1)
urls
# [1] "grand" "example" ""

Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit if there is more than one period? Also note that there may be two or more periods before the company name. I don't know how to write lookarounds, which might solve my problem. But this didn't:

strsplit(urls, split="(?=\\.)", perl=TRUE)

Thank you for any suggestions.

Here's an approach that may be easier to understand and generalize than some of the others:

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

It works by breaking each string into three capture groups that together match the entire string, and then substituting in just capture group (2), the one you want.
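As a quick check, here is a minimal sketch that applies the pattern to the original, unmodified vector from the question. The perl = TRUE argument is my addition, to pin down which regex engine is used, and the "http://nowhere.org/" string is a made-up input that illustrates the "not at all" case: when nothing matches, the string comes back unchanged.

```r
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")
pat <- "(.*?)(\\w+)(\\.com.*)"

# sub() is enough here, because the pattern consumes the whole string
companies <- sub(pat, "\\2", urls, perl = TRUE)
companies
# [1] "test" "example" "sample"

# hypothetical input with no ".com": no match, so nothing is replaced
no_match <- sub(pat, "\\2", "http://nowhere.org/", perl = TRUE)
no_match
# [1] "http://nowhere.org/"
```

If unmatched strings should become NA rather than pass through, a separate grepl(pat, urls, perl = TRUE) check would be one way to flag them.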

pat = "(.*?)(\\w+)(\\.com.*)"
#      ^    ^     ^
#      |    |     |
#     (1)  (2)   (3)
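To see what each of the three groups actually captures, one option is base R's regexec() together with regmatches(). This is only an illustrative sketch for a single URL from the question; the variable names x and groups, and the use of perl = TRUE, are my own choices.

```r
pat <- "(.*?)(\\w+)(\\.com.*)"
x <- "http://grand.test.com/"

# regexec() records the full match plus one entry per capture group
groups <- regmatches(x, regexec(pat, x, perl = TRUE))[[1]]

# drop the full match, keeping groups (1), (2) and (3)
groups[-1]
# [1] "http://grand." "test" ".com/"
```

Pasting the three pieces back together reproduces the input, which confirms that the groups cover the entire string.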

Edit (adding an explanation of the ? modifier):

Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It essentially tells the regex engine to match as many characters as it can ... without using up any that could otherwise become part of the following capture group (2).

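Without a trailing ?, repetition quantifiers are greedy by default. In this case, a greedy capture group, (.*), since it matches any number of any type of characters, would "eat up" as much of the string as it could, leaving the other two capture groups only the bare minimum they need to match (a single character for group (2)), which is not the behavior we want!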
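The difference is easy to see by running both versions side by side. The sketch below uses the question's original vector; the names lazy and greedy are mine, and perl = TRUE is again an assumption made to fix the regex engine. With the greedy pattern, the engine backtracks just far enough for the overall match to succeed, so group (2) ends up holding only the last character of the company name.

```r
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")

lazy   <- "(.*?)(\\w+)(\\.com.*)"  # group (1) takes as little as possible
greedy <- "(.*)(\\w+)(\\.com.*)"   # group (1) takes as much as possible

lazy_result <- sub(lazy, "\\2", urls, perl = TRUE)
lazy_result
# [1] "test" "example" "sample"

greedy_result <- sub(greedy, "\\2", urls, perl = TRUE)
greedy_result
# [1] "t" "e" "e"
```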

