php - Why is mb_strpos() so considerably slower than strpos()?

I had criticized an answer that suggested using preg_match rather than a strict === comparison when finding substring offsets, in order to avoid a type mismatch.

However, the answer's author later discovered that preg_match is significantly faster than the multi-byte mb_strpos. Plain strpos is faster than both functions but, of course, it cannot deal with multibyte strings.

I understand that mb_strpos needs to do something more than strpos. However, if a regex can be almost as fast as strpos, what is it that mb_strpos does that takes so much time?

I have a strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than native functions?

mb_strpos($str, "颜色", 0, "gbk"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)

The functions were run 10^6 times. The absolute time (in seconds) is the sum over all 10^6 runs of a function, rather than the average for one run.

The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).
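For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing loop. It is not the linked test class: the helper name time_calls is hypothetical, and the sketch assumes the file is saved as UTF-8 (so "UTF-8" is passed instead of "gbk") and that the mbstring extension is loaded. Absolute numbers will differ by PHP version and machine.

```php
<?php
// Hypothetical helper: total wall-clock seconds spent on $runs calls of $fn.
function time_calls(callable $fn, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$str  = "代码dhgd颜色代码";
$runs = 1000000; // 10^6 runs, as in the question

// Assumption: UTF-8 source file, so the encoding argument is "UTF-8".
$totals = [
    'mb_strpos'  => time_calls(function () use ($str) {
        return mb_strpos($str, "颜色", 0, "UTF-8");
    }, $runs),
    'preg_match' => time_calls(function () use ($str) {
        return preg_match("/颜色/", $str);
    }, $runs),
    'strpos'     => time_calls(function () use ($str) {
        return strpos($str, "dh");
    }, $runs),
];

// Print the summed time per function and its share of the total.
$sum = array_sum($totals);
foreach ($totals as $name => $seconds) {
    printf("%s: %.6f (%d%%)\n", $name, $seconds, round(100 * $seconds / $sum));
}
```

The closure call adds the same overhead to every candidate, so the percentages are only a rough guide to the relative cost.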

Note: according to one of the commenters (and common sense), preg_match does not use multi-byte comparison either, so it is subject to the same risk of errors as strpos.
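The difference between byte-oriented and character-oriented searching can be illustrated with a short sketch. It assumes the file is saved as UTF-8 rather than GBK and that mbstring is loaded. Variable names are made up for the example.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$str = "代码dhgd颜色代码";

$byteOffset = strpos($str, "颜色");                // counts bytes
$charOffset = mb_strpos($str, "颜色", 0, "UTF-8"); // counts characters

// PREG_OFFSET_CAPTURE reports where the match starts, in bytes.
preg_match("/颜色/", $str, $m, PREG_OFFSET_CAPTURE);
$pcreOffset = $m[0][1];

var_dump($byteOffset, $charOffset, $pcreOffset);
```

In UTF-8 each of these CJK characters occupies three bytes, so strpos and preg_match report offset 10 while mb_strpos reports 6.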

To understand why the functions have different runtimes, you need to understand what they actually do, because summing them up as ‘they search for a needle in a haystack’ isn’t enough.

strpos

If you look at the implementation of strpos, it uses zend_memnstr internally, which implements a pretty naive algorithm for searching for the needle in the haystack: basically, it uses memchr to find the first byte of the needle in the haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of the needle, starting from the position of the previous match of the first byte.

Knowing this, we can say that strpos only searches for a byte sequence within a byte sequence, using a naive search algorithm.
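The naive approach described above can be sketched in PHP. This is an illustration of the idea only, not PHP's actual C code: the function name naive_strpos is hypothetical, a simple byte loop stands in for memchr, and substr_compare stands in for memcmp. It assumes a non-negative offset.

```php
<?php
// Hypothetical illustration of a naive byte search; returns a byte offset or false.
function naive_strpos(string $haystack, string $needle, int $offset = 0)
{
    $hlen = strlen($haystack);
    $nlen = strlen($needle);
    // Simplification: an empty needle is treated as "not found" here.
    if ($nlen === 0 || $nlen > $hlen) {
        return false;
    }
    $last = $hlen - $nlen; // last byte index at which the needle could start
    for ($i = $offset; $i <= $last; $i++) {
        // Stand-in for memchr: move on until the first byte matches.
        if ($haystack[$i] !== $needle[0]) {
            continue;
        }
        // Stand-in for memcmp: compare the whole needle at this position.
        if (substr_compare($haystack, $needle, $i, $nlen) === 0) {
            return $i;
        }
    }
    return false;
}

var_dump(naive_strpos("代码dhgd颜色代码", "dh"));
```

Nothing in this loop knows about characters, which is why it is cheap and also why its result is a byte offset.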

mb_strpos

This function is the multi-byte counterpart to strpos. That makes searching a little more complex, as you can’t just look at the bytes without knowing which character they belong to.

mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memnstr: it’s about 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memnstr).

We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.

First we have two setup loops, and then there is the loop that advances the pointer according to the given offset. Here you can see that the code is aware of actual characters and how it skips whole encoded UTF-8 characters: UTF-8 is a variable-width character encoding in which the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.
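As a rough illustration of skipping whole characters by their first byte, the following sketch converts a character offset into a byte offset. Both function names are hypothetical, and the length is computed from byte ranges rather than from a lookup table like u8_tbl. How invalid bytes are handled here is an assumption for the example, not libmbfl's behavior.

```php
<?php
// Hypothetical helper: length in bytes of a UTF-8 sequence, given its first byte.
function utf8_seq_len(int $lead): int
{
    if ($lead < 0x80) {
        return 1; // ASCII
    }
    if ($lead >= 0xC2 && $lead <= 0xDF) {
        return 2;
    }
    if ($lead >= 0xE0 && $lead <= 0xEF) {
        return 3;
    }
    if ($lead >= 0xF0 && $lead <= 0xF4) {
        return 4;
    }
    return 1; // continuation or invalid byte: step one byte (assumption)
}

// Hypothetical helper: byte position reached after skipping $chars characters.
function char_offset_to_byte_offset(string $utf8, int $chars): int
{
    $pos = 0;
    $len = strlen($utf8);
    while ($chars > 0 && $pos < $len) {
        $pos += utf8_seq_len(ord($utf8[$pos]));
        $chars--;
    }
    return min($pos, $len); // never point past the end of a truncated string
}

var_dump(char_offset_to_byte_offset("代码dhgd颜色代码", 6));
```

With the test string from the question, skipping six characters (two three-byte characters and four ASCII letters) lands on byte 10.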

Finally, there is the loop where the actual search happens. And here we have something interesting, because the test for the needle at a certain position in the haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for the needle in the haystack. This is actually an implementation of the Boyer–Moore string search algorithm.
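To make the reverse comparison and the jump table more concrete, here is a minimal sketch of the Boyer–Moore–Horspool simplification, which keeps only a bad-character jump table. It is not a transcription of mbfl_strpos: the function name horspool_strpos is hypothetical, and it works on raw bytes and returns a byte offset.

```php
<?php
// Hypothetical illustration: Boyer–Moore–Horspool search over bytes.
function horspool_strpos(string $haystack, string $needle)
{
    $hlen = strlen($haystack);
    $nlen = strlen($needle);
    if ($nlen === 0 || $nlen > $hlen) {
        return false;
    }
    // Jump table: how far the window may shift, keyed by the haystack byte
    // currently aligned with the needle's last position.
    $jump = array_fill(0, 256, $nlen);
    for ($i = 0; $i < $nlen - 1; $i++) {
        $jump[ord($needle[$i])] = $nlen - 1 - $i;
    }
    $pos = 0;
    while ($pos <= $hlen - $nlen) {
        // Compare in reverse, from the needle's last byte to its first.
        $j = $nlen - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos; // every byte matched
        }
        // Mismatch: consult the jump table for the next candidate position.
        $pos += $jump[ord($haystack[$pos + $nlen - 1])];
    }
    return false;
}

var_dump(horspool_strpos("代码dhgd颜色代码", "颜色"));
```

Building the 256-entry table is a fixed setup cost paid on every call, which matters more for short strings like the one in the question than for long haystacks.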

So now we know that mb_strpos …

- converts the strings to UTF-8, if necessary
- is aware of actual characters
- uses the Boyer–Moore search algorithm

preg_match

As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match, conducting a depth-first search of the pattern tree. This is basically a naive search approach.

php regex performance strpos
