php - Why is mb_strpos() so considerably slower than strpos()?
I had criticized an answer that suggested preg_match over === when finding substring offsets, in order to avoid a type mismatch.
However, later on the answer's author discovered that preg_match is significantly faster than the multi-byte mb_strpos. Plain strpos is faster than both functions but, of course, cannot deal with multibyte strings.
I understand that mb_strpos needs to do something more than strpos. However, if a regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?
I have a strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than native functions?
mb_strpos($str, "颜色", 0, "gbk"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)
The functions were run 10^6 times. The absolute time(s) is the sum over the 10^6 runs of each function, rather than the average for one run.
The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).
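For readers without access to the linked test, a measurement of this kind can be sketched roughly as follows. This is not the original testing class: total_time is a hypothetical helper, the run count is reduced to keep the sketch quick, and the string is assumed to be UTF-8 rather than GBK. The mb_strpos call is skipped if the mbstring extension is not loaded.

```php
<?php
// Hypothetical helper: total wall-clock seconds for $runs calls of $fn.
function total_time(callable $fn, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$str  = "代码dhgd颜色代码"; // assumption: this file is saved as UTF-8
$runs = 100000;             // assumption: fewer runs than the original test

$timings = [
    'preg_match' => total_time(function () use ($str) { preg_match("/颜色/", $str); }, $runs),
    'strpos'     => total_time(function () use ($str) { strpos($str, "dh"); }, $runs),
];
if (function_exists('mb_strpos')) {
    $timings['mb_strpos'] = total_time(
        function () use ($str) { mb_strpos($str, "颜色", 0, "UTF-8"); },
        $runs
    );
}

// Print the summed time per function and its share of the total.
$sum = array_sum($timings);
foreach ($timings as $name => $seconds) {
    printf("%s: %.6f (%d%%)\n", $name, $seconds, $sum > 0 ? round(100 * $seconds / $sum) : 0);
}
```

The numbers will differ by PHP version and machine, so only the relative ordering is worth comparing.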
Note: according to one of the commenters (and common sense), preg_match does not take multi-byte characters into account when comparing, so it is subject to the same risk of errors as strpos.
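A small check of the note above, assuming the source file is saved as UTF-8: without any multi-byte awareness, the offset that preg_match reports through PREG_OFFSET_CAPTURE is a byte offset, just like the return value of strpos.

```php
<?php
$str = "代码dhgd颜色代码"; // assumption: this file is saved as UTF-8

// PREG_OFFSET_CAPTURE makes each match an array of [text, byte offset].
preg_match("/颜色/", $str, $m, PREG_OFFSET_CAPTURE);
$pregOffset = $m[0][1];

// strpos also reports the position in bytes.
$byteOffset = strpos($str, "颜色");

var_dump($pregOffset === $byteOffset); // bool(true)
```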
To understand why the functions have different runtimes, you need to understand what they actually do, because summing them up as ‘they search for a needle in a haystack’ isn’t enough.
strpos
If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for the needle in the haystack: basically, it uses memchr to find the first byte of the needle in the haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of the needle, starting from the position of the previous match of the first byte.
Knowing this, we can say that strpos only searches for a byte sequence in a byte sequence using a naive search algorithm.
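The idea can be sketched in PHP itself. This is only an illustration of the "find the first byte, then compare the rest" approach, not PHP's actual C code; naive_byte_search is a hypothetical name, and the plain loop stands in for memchr while substr_compare stands in for memcmp.

```php
<?php
// Hypothetical helper: naive byte-wise search in the spirit of memchr + memcmp.
// Returns the byte offset of the first match, or false if there is none.
function naive_byte_search(string $haystack, string $needle, int $offset = 0)
{
    $hayLen    = strlen($haystack);
    $needleLen = strlen($needle);
    if ($needleLen === 0 || $needleLen > $hayLen) {
        return false;
    }

    $first = $needle[0];
    $last  = $hayLen - $needleLen; // last position where the needle still fits

    for ($i = max(0, $offset); $i <= $last; $i++) {
        // "memchr" step: skip ahead until the first byte of the needle is seen.
        if ($haystack[$i] !== $first) {
            continue;
        }
        // "memcmp" step: check whether the whole needle begins here.
        if (substr_compare($haystack, $needle, $i, $needleLen) === 0) {
            return $i;
        }
    }
    return false;
}

$str = "代码dhgd颜色代码";
var_dump(naive_byte_search($str, "dh") === strpos($str, "dh")); // bool(true)
```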
mb_strpos
This function is the multi-byte counterpart to strpos. That makes searching a little more complex, as you can’t just look at the bytes without knowing which character they belong to.
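The difference shows up directly in the returned offsets. A minimal example, assuming the file is saved as UTF-8 (where each of these Chinese characters takes three bytes) and that the mbstring extension is loaded; if it is not, $charOffset stays null.

```php
<?php
$str = "代码dhgd颜色代码"; // assumption: this file is saved as UTF-8

// strpos counts bytes: 2 characters * 3 bytes + 4 ASCII bytes = 10.
$byteOffset = strpos($str, "颜色");

// mb_strpos counts characters: 代, 码, d, h, g, d come first, so 6.
$charOffset = function_exists('mb_strpos')
    ? mb_strpos($str, "颜色", 0, "UTF-8")
    : null;

var_dump($byteOffset, $charOffset);
```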
mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memstr: it’s about 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).
We can skip the part where both the needle and the haystack are converted to UTF-8 if necessary, and come to the major chunk of code.
First we have two setup loops, and then there is the loop that advances the pointer according to the given offset. Here you can see that the code is aware of actual characters and how it skips whole encoded UTF-8 characters: UTF-8 is a variable-width character encoding in which the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.
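How the first byte determines the length of a UTF-8 sequence can be sketched like this. The function names are hypothetical, and the sketch uses bit tests where the library uses a lookup table, so it illustrates the principle rather than reproducing the library code.

```php
<?php
// Hypothetical helper: length in bytes of a UTF-8 sequence, judged by its first byte.
function utf8_char_len(int $firstByte): int
{
    if ($firstByte < 0x80)            return 1; // 0xxxxxxx: ASCII
    if (($firstByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($firstByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($firstByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 1; // assumption: step a single byte over stray or invalid bytes
}

// Hypothetical helper: byte position reached after skipping $chars whole characters.
function utf8_skip_chars(string $s, int $chars): int
{
    $pos = 0;
    $len = strlen($s);
    while ($chars > 0 && $pos < $len) {
        $pos += utf8_char_len(ord($s[$pos]));
        $chars--;
    }
    return min($pos, $len);
}

$str = "代码dhgd颜色代码"; // assumption: this file is saved as UTF-8
var_dump(utf8_skip_chars($str, 6)); // int(10): six characters occupy ten bytes
```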
Finally, there is the loop where the actual search happens. And here we have something interesting, because the test for the needle at a given position in the haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for the needle in the haystack. This is an implementation of the Boyer–Moore string search algorithm.
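To make the reverse comparison and the jump table more concrete, here is a sketch of the Horspool simplification of Boyer–Moore, working on bytes. It is not the library's code: horspool_search and $jump are hypothetical names, and only the bad-character rule is used.

```php
<?php
// Hypothetical helper: Boyer-Moore-Horspool search over bytes.
// Returns the byte offset of the first match, or false if there is none.
function horspool_search(string $haystack, string $needle)
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return false;
    }

    // Jump table: how far the window may shift, keyed by the haystack byte
    // aligned with the last needle position. Default is the full needle length.
    $jump = array_fill(0, 256, $m);
    for ($i = 0; $i < $m - 1; $i++) {
        $jump[ord($needle[$i])] = $m - 1 - $i;
    }

    $pos = 0;
    while ($pos <= $n - $m) {
        // Compare the needle against the window in reverse, last byte first.
        $j = $m - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos; // every byte matched
        }
        // Mismatch: shift according to the byte under the window's last position.
        $pos += $jump[ord($haystack[$pos + $m - 1])];
    }
    return false;
}

$str = "代码dhgd颜色代码";
var_dump(horspool_search($str, "颜色") === strpos($str, "颜色")); // bool(true)
```

Building the table costs extra time per call, which pays off on long haystacks but is pure overhead on a string as short as the one in the test.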
So now we know what mb_strpos does …
preg_match
As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match, conducting a depth-first search of the pattern tree. This is basically a naive search approach.
php regex performance strpos