Aho-Corasick with wildcards

A lot of Messenger plugins, addons, and scripts nowadays use some form of memory patching. Some are quick patches like disabling the nudge limit, others like the chat-only name of StuffPlug 3 are much more complicated. But no matter how simple or complicated, they all have the same downsides: the location of the patch changes with almost every new release (even minor ones) of Messenger. Most developers simply keep a small table of locations in different versions, but this forces you to update your software with each and every messenger update. With StuffPlug 3 I opted to write a quick search routine that would search the messenger executable for specific patterns, and would find patch locations using that. This pretty much garantueed StuffPlug to keep working with minor messenger
updates, and sometimes even major updates.

One of the downsides of my implementation in StuffPlug 3 was however that the search routine was pretty simple and as such slow. With each new pattern that needed to be looked for, the time it took to search increased, and at some point this reached a full 3 seconds on my machine (which is pretty high-end compared to my average user). I solved this partially with caching, but for some patterns I needed all hits and caching was not possible. In the end the searching generally took 1 second, or 3 if you just updated messenger. Now even though 1 second (or possibly 5 on slower PCs) doesn’t sound like a lot, it bothered me, and it hasn’t even stopped bothering me. I’ve always wanted to look into other algorithms to speed this up, but generally I had other priorities or didn’t have time at all.

Somehow that changed a week or so ago, when we revisited string matching algorithms in my current course. I just had to try, and I did. As I’m looking for multiple patterns at once, Aho-Corasick should be the ideal approach. One downside: I need wildcards (characters that match anything) in my patterns, and Aho-Corasick wasn’t really made for it. I did see a suggestion somewhere to simply split your search string on the wildcards and search for the constant sub-patterns, but this sounded a bit silly. Especially as there was no motivation about why this was the best approach, I felt I could do better. And today, I thought I had. I had an Aho-Corasick implementation working, with wildcards and everything.

Or so I thought. Yes, the algorithm works. Yes, it can take a wildcard or two. But once you add a few more wildcards, the number of states in the state tree just explodes. With just two memory patterns (38 characters in total, of which 12 wildcards) needed a whopping 7007 nodes, and generating the tree alone took a full 12 seconds. I also tried with all 20 patterns in StuffPlug 3, but that just didn’t seem to finish, ever.

In conclusion: For Aho-Corasick with wildcards, just use the suggested technique of splitting the patterns up. It is possible to adjust the algorithm itself for wildcards, but the number of states needed gets so big that it’s no longer feasible, and will most likely not run on any current computer when you use more than 5 wildcards in total.

I’m not giving up though 🙂 I’m currently working on the Aho-Corasick bit with splitting up, and I have another tree/state-machine based approach in my head that I want to try. At least if I finish this I can truthfully say that I can’t possibly make that part of StuffPlug any faster…