diff options
author | Harmen Stoppels <harmenstoppels@gmail.com> | 2021-10-12 09:05:11 +0200 |
---|---|---|
committer | GitHub <noreply@github.com> | 2021-10-12 00:05:11 -0700 |
commit | 0c0831861c57b747d915731ef3ad214346a9dcbe (patch) | |
tree | 9fc03780ccfdcc1d743b7fb38b4c650aee0fee47 /COPYRIGHT | |
parent | 580a20c2436110f6b10ae3a62242c7b01f24b9a9 (diff) | |
download | spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.gz spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.bz2 spack-0c0831861c57b747d915731ef3ad214346a9dcbe.tar.xz spack-0c0831861c57b747d915731ef3ad214346a9dcbe.zip |
Avoid quadratic complexity in log parser (#26568)
TL;DR: there are matching groups trying to match 1 or more occurrences of
something. We don't use the matching group. Therefore it's sufficient to test
for 1 occurrence. This reduce quadratic complexity to linear time.
---
When parsing logs of an mpich build, I'm getting a 4 minute (!!) wait
with 16 threads for regexes to run:
```
In [1]: %time p.parse("mpich.log")
Wall time: 4min 14s
```
That's really unacceptably slow...
After some digging, it seems a few regexes tend to have `O(n^2)` scaling
where `n` is the string / log line length. I don't think they *necessarily*
should scale like that, but it seems that way. The common pattern is this
```
([^:]+): error
```
which matches `: error` literally, and then one or more non-colons before that. So
for a log line like this:
```
abcdefghijklmnopqrstuvwxyz: error etc etc
```
Any of these are potential group matches when using `search` in Python:
```
abcdefghijklmnopqrstuvwxyz
bcdefghijklmnopqrstuvwxyz
cdefghijklmnopqrstuvwxyz
⋮
yz
z
```
but clearly the capture group should return the longest match.
My hypothesis is that Python has a very bad implementation of `search`
that somehow considers all of these, even though it can be implemented
in linear time by scanning for `: error` first, and then greedily expanding
the longest possible `[^:]+` match to the left. If Python indeed considers
all possible matches, then with `n` matches of length `1 .. n` you
see the `O(n^2)` slowness (i verified this by replacing + with {1,k}
and doubling `k`, it doubles the execution time indeed).
This PR fixes this by removing the `+`, so effectively changing the
O(n^2) into a O(n) worst case.
The reason we are fine with dropping `+` is that we don't use the
capture group anywhere, so, we just ensure `:: error` is not a match
but `x: error` is.
After going from O(n^2) to O(n), the 15MB mpich build log is parsed
in `1.288s`, so about 200x faster.
Just to be sure I've also updated `^CMake Error.*:` to `^CMake Error`,
so that it does not match with all the possible `:`'s in the line.
Another option is to use `.*?` there to make it quit scanning as soon as
possible, but what line that starts with `CMake Error` that does not have
a colon is really a false positive...
Diffstat (limited to 'COPYRIGHT')
0 files changed, 0 insertions, 0 deletions