Avoid quadratic complexity in log parser (#26568) - spack - A flexible package manager that supports multiple versions, configurations, platforms, and compilers.

diff options

author	Harmen Stoppels <harmenstoppels@gmail.com>	2021-10-12 09:05:11 +0200
committer	GitHub <noreply@github.com>	2021-10-12 00:05:11 -0700
commit	0c0831861c57b747d915731ef3ad214346a9dcbe (patch)
tree	9fc03780ccfdcc1d743b7fb38b4c650aee0fee47 /COPYRIGHT
parent	580a20c2436110f6b10ae3a62242c7b01f24b9a9 (diff)
download	spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.gz spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.bz2 spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.xz spack-0c0831861c57b747d915731ef3ad214346a9dcbe.zip

Avoid quadratic complexity in log parser (#26568)

TL;DR: there are matching groups trying to match 1 or more occurrences of something. We don't use the matching group. Therefore it's sufficient to test for 1 occurrence. This reduce quadratic complexity to linear time. --- When parsing logs of an mpich build, I'm getting a 4 minute (!!) wait with 16 threads for regexes to run: ``` In [1]: %time p.parse("mpich.log") Wall time: 4min 14s ``` That's really unacceptably slow... After some digging, it seems a few regexes tend to have `O(n^2)` scaling where `n` is the string / log line length. I don't think they *necessarily* should scale like that, but it seems that way. The common pattern is this ``` ([^:]+): error ``` which matches `: error` literally, and then one or more non-colons before that. So for a log line like this: ``` abcdefghijklmnopqrstuvwxyz: error etc etc ``` Any of these are potential group matches when using `search` in Python: ``` abcdefghijklmnopqrstuvwxyz bcdefghijklmnopqrstuvwxyz cdefghijklmnopqrstuvwxyz ⋮ yz z ``` but clearly the capture group should return the longest match. My hypothesis is that Python has a very bad implementation of `search` that somehow considers all of these, even though it can be implemented in linear time by scanning for `: error` first, and then greedily expanding the longest possible `[^:]+` match to the left. If Python indeed considers all possible matches, then with `n` matches of length `1 .. n` you see the `O(n^2)` slowness (i verified this by replacing + with {1,k} and doubling `k`, it doubles the execution time indeed). This PR fixes this by removing the `+`, so effectively changing the O(n^2) into a O(n) worst case. The reason we are fine with dropping `+` is that we don't use the capture group anywhere, so, we just ensure `:: error` is not a match but `x: error` is. After going from O(n^2) to O(n), the 15MB mpich build log is parsed in `1.288s`, so about 200x faster. Just to be sure I've also updated `^CMake Error.*:` to `^CMake Error`, so that it does not match with all the possible `:`'s in the line. Another option is to use `.*?` there to make it quit scanning as soon as possible, but what line that starts with `CMake Error` that does not have a colon is really a false positive...

Diffstat (limited to 'COPYRIGHT')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: