R - Extract string elements that possibly appear multiple times, or not at all
I start with a character vector of URLs. My goal is to end up with only the name of the company, meaning a column with only "test", "example", and "sample" in the illustration below.
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")
I want to remove ".com" and whatever might follow it, and keep the first part:
urls <- sapply(strsplit(urls, split = "(?<=.)(?=\\.com)", perl = TRUE), "[", 1)
urls
# [1] "http://grand.test" "https://example" "http://.big.time.sample"
My next step is to remove the http:// and https:// portions with a chained gsub() call:
urls <- gsub("^http://", "", gsub("^https://", "", urls))
urls
# [1] "grand.test" "example" ".big.time.sample"
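For comparison, here is a minimal sketch of the same two steps written with plain sub() calls. It assumes every URL ends in ".com" plus an optional tail and starts with "http://" or "https://", as in the example vector; the names no_tld and stripped are only illustrative.

```r
# Same example vector as in the question.
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")

# Drop ".com" and everything that follows it.
no_tld <- sub("\\.com.*$", "", urls)

# Drop a leading "http://" or "https://"; "s?" makes the "s" optional,
# so one pattern covers both protocols.
stripped <- sub("^https?://", "", no_tld)
stripped
```

A single optional "s" in the pattern replaces the nested gsub() calls, and sub() is enough here because each pattern can match at most once per string.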
But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the first call below returns NA for the second string, since the "example" string has no period remaining. Or, if I retain only the first part, I lose a company name.
# Taking the second piece:
sapply(strsplit(urls, split = "\\."), "[", 2)
# [1] "test" NA "big"

# Taking the first piece instead:
sapply(strsplit(urls, split = "\\."), "[", 1)
# [1] "grand" "example" ""
Perhaps an ifelse() call that counts the number of periods remaining and uses strsplit only if there is more than one period? Also note that it is possible there are two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. This didn't:

strsplit(urls, split = "(?=\\.)", perl = TRUE)

Thank you for any suggestions.
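One way to sidestep counting periods is to always take the last dot-separated piece, however many pieces there are. The sketch below assumes the protocol and ".com" have already been removed, so that the company name is the final piece; stripped, pieces, and company are hypothetical names used only for illustration.

```r
# The intermediate vector from the question, after removing ".com" and the protocol.
stripped <- c("grand.test", "example", ".big.time.sample")

# Split on literal dots; fixed = TRUE avoids having to escape the period.
pieces <- strsplit(stripped, ".", fixed = TRUE)

# Keep the last piece of each split, whatever its position.
company <- vapply(pieces, function(p) p[length(p)], character(1))
company
```

Indexing by length(p) rather than by a fixed position is what makes this work both for strings with no period and for strings with several.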
Here's an approach that may be easier to understand and generalize than some of the others:
pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)
It works by breaking each string into three capture groups that together match the entire string, and substituting in just capture group (2), the one you want.
pat = "(.*?)(\\w+)(\\.com.*)"
#      ^    ^     ^
#      |    |     |
#     (1)  (2)   (3)
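To see what each capture group actually holds, the pattern can be run through regexec() and regmatches(). This is only an inspection sketch: it assumes the original, unmodified urls vector from the question, uses perl = TRUE to pin down the regex flavor, and the column names are made up for readability.

```r
# Original example vector from the question.
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")
pat <- "(.*?)(\\w+)(\\.com.*)"

# regexec() records the full match plus each capture group;
# regmatches() then extracts the corresponding substrings.
m <- regmatches(urls, regexec(pat, urls, perl = TRUE))

# One row per URL: the full match followed by groups (1), (2), and (3).
groups <- do.call(rbind, m)
colnames(groups) <- c("match", "group1", "group2", "group3")
groups

# Group (2) is the company name.
groups[, "group2"]
```

For the first URL, group (1) is "http://grand.", group (2) is "test", and group (3) is ".com/", which is why substituting "\\2" leaves just the company name.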
Edit (adding an explanation of the ? modifier):

Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It tells the regex engine to match as many characters as it can ... without using up any that could otherwise become part of the next capture group (2).

Without a trailing ?, repetition quantifiers are greedy by default. In this case a greedy capture group, (.*), since it matches any number of any type of characters, would "eat up" as many characters of the string as it could, leaving only a single character for capture group (2). That is not the behavior we want!
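The difference is easy to check side by side. In this sketch the only change between the two patterns is the ? after .* in the first group; perl = TRUE is an added assumption to fix the regex flavor, and greedy and lazy are illustrative names. With the greedy version the engine backtracks just far enough for the rest of the pattern to match, so group (2) ends up with a single character.

```r
# Original example vector from the question.
urls <- c("http://grand.test.com/", "https://example.com/", "http://.big.time.sample.com/")

# Greedy first group: (.*) grabs as much as possible, so (\\w+) is left
# with only the one character directly before ".com".
greedy <- sub("(.*)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
greedy
# [1] "t" "e" "e"

# Lazy first group: (.*?) grabs as little as possible, so (\\w+) can
# take the whole word directly before ".com".
lazy <- sub("(.*?)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
lazy
# [1] "test" "example" "sample"
```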
r substring regex-lookarounds strsplit