Regexes don't work #8

ComFreek · 2014-04-30T17:00:02Z

I currently face some strange problems.
I use the official 1.7 version from the Chrome Web Store (updated on April 27, 2014).

Go to http://stackoverflow.com
Search for this regex: modified \d+
It does not find any search results. It should find several occurrences of "modified 4 mins ago" or the like.

Also try:

Go to http://stackoverflow.com
There is a modified time within each entry of the question list. Example:
Search for asked 39 in this case.
The extension doesn't find any search results although it should!

Am I missing something?

The text was updated successfully, but these errors were encountered:

gsingh93 · 2014-04-30T17:46:03Z

Does the modified \d+ work for you, or are you saying it should find those matches and doesn't? Because both queries you gave don't work for me.

I just checked out version 1.6 and it looks like the same issue is there for me. Is this the first time you've run these queries or was this working for you before?

This is a major issue, but I'm hoping that at least it's not a regression.

ComFreek · 2014-04-30T18:40:26Z

It was the first time I tried a query which doesn't work.
I debug now.

PS: Both queries don't work for me although they should. I updated my comment above to clarify this.

ComFreek · 2014-05-01T12:30:25Z

@gsingh93 I've found the "error". StackOverflow splits the string modified x mins ago into two elements:

<a class="started-link" href="...">modified <span title="..." class="relativetime">1m ago</span></a>`

I don't think there is a method for solving this problem, which is correctly guessing such combinations all the time, let alone taking CSS positioning into account.

gsingh93 · 2014-05-01T15:57:30Z

Thanks for looking into this. Spans and anchor tags especially are used to separate a lot of text, so we're going to run into this for a lot of regexes. For example, on this page search for "with GitHub". You won't find anything because "GitHub" is wrapped in an anchor tag. Maybe we should group all adjacent inline elements together when searching for text, and if not all inline elements, at least spans and anchor tags.

ComFreek · 2014-05-01T17:13:17Z

Exactly, I've thought about the same thing. The problem, however, is to do the highlighting afterwards. You have to know which character (in the grouped text) belongs to which HTML element on the site. Otherwise characters which are actually not included in the search result get highlighted.

Example:

<div>
  Blub
  <span>Com</span>
  <span>Freek</span>
  Blub
</div>

Grouped version:

<div>BlubComFreekBlub</div>

Now how can the extension determine the part (or rather the element in the source HTML) to highlight? It can't, we therefore have to choose the parent <div> and a result of this action, Blub gets also highlighted two times.

gsingh93 · 2014-05-01T17:26:42Z

I think while it wouldn't be easy, it's still possible.

When we merge the elements to get the combined text, we keep a list of the original elements that were merged and a mapping of ranges (starting and ending positions in the combined text) to these elements. If we find a match in the combined text, we simply need to find which element the match starts in, and since we know where the match starts in the combined text, we can use our mapping of ranges to elements to know which element the match starts in. Now since we know the matched text, we know the match is continuous, and we know all of the merged elements in order, we can just iterate through the elements highlighting as much of the match as the element contains until we've highlighted the entire match.

For example, if we have <div>ab<span>c</span>de</div> and we search for bcd, we first combine the elements and find the text abcde. When combining, we keep a list of all the elements in order, so something like [text, span, text]. We also keep a mapping of [(0,1) -> 1, (2,2) -> 2, (3,4) -> 3] which maps ranges to elements. Now since we can easily find the first letter of the match starts at position 1 (zero-indexed), we can use the mapping to know the match starts at the first element. Thus, we can start highlighting from the beginning of the match in the first element to either the end of the element or the end of the match. We hit the end of the element first so we repeat this for this for the second element. For the third element, we hit the end of the match first, so we stop highlighting there.

It's been a while since I wrote the highlighting code, so while this seems doable to me, I might be missing something. Let me know if this makes sense.

ComFreek · 2014-05-02T17:38:21Z

That sounds good!

I've made a little proof of concept script:

HTML:

<div id="myDiv"><span>Com</span><span>pu</span><span>ters</span><span>999 are good!</span></div>

TypeScript:

function combine(root: HTMLElement, combinedData?: string) {
    var walker: TreeWalker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);

    var combinedData = combinedData || "";
    var ranges = [];

    while (walker.nextNode()) {
            var value = walker.currentNode.nodeValue;
            var range = {
                start: combinedData.length,
                end: combinedData.length + value.length,
                element: walker.currentNode
            };
            combinedData += value;
            ranges.push(range);
    }
    return {combined: combinedData, ranges: ranges};
}

var elem = document.querySelector("div#myDiv");
console.log(combine(elem));

(I found an interesting online editor which also supports TypeScript: http://fiddlesalad.com/typescript/ just copy & paste my code!)

Question:

Do you want to apply this range-based concatenation algorithm only to <span> elements and their "neighbors"? My function above can handle the whole DOM tree if you pass document.body as the root.

TODOs:

Create a better data structure (see here for examples)
Write a function which retrieves the element for a specific position
Refactor the content script code to embed the new search engine (what about generalizing the idea of a search and highlight engine, so we only use interfaces => higher decoupling)

gsingh93 · 2014-05-03T05:34:58Z

This looks good.

My only concern about combining the whole DOM tree is that two completely unrelated pieces of text (possibly on opposite sides of the screen) could then be matched. This technically could happen with spans too, but if we code towards the general case (which is that people generally use spans/inline elements correctly), then I think we should be fine. We could also just test both versions on different sites and see whether we get more false positives or false negatives with each version.

ComFreek · 2014-05-05T17:12:29Z

👍

One could also take the elements' positions into account ... but it only gets more complicated over time!

I think prototyping the span code (for now) and getting a first functional search engine would be the best

jayeye · 2014-07-27T18:44:35Z

In Chrome 35, it does not even find single-letter regexps (or if does, it does not highlight them). What's the point?

gsingh93 · 2014-07-27T18:46:36Z

@jayeye If you want to report a bug, post both the page and the regex query you're using. Otherwise, there's no way we can help you.

gsingh93 mentioned this issue Jun 3, 2015

Only works on some pages #22

Closed

ComFreek mentioned this issue Mar 22, 2016

Doesn't match when text format changed #30

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regexes don't work #8

Regexes don't work #8

ComFreek commented Apr 30, 2014

gsingh93 commented Apr 30, 2014

ComFreek commented Apr 30, 2014

ComFreek commented May 1, 2014

gsingh93 commented May 1, 2014

ComFreek commented May 1, 2014

gsingh93 commented May 1, 2014

ComFreek commented May 2, 2014

gsingh93 commented May 3, 2014

ComFreek commented May 5, 2014

jayeye commented Jul 27, 2014

gsingh93 commented Jul 27, 2014

Regexes don't work #8

Regexes don't work #8

Comments

ComFreek commented Apr 30, 2014

gsingh93 commented Apr 30, 2014

ComFreek commented Apr 30, 2014

ComFreek commented May 1, 2014

gsingh93 commented May 1, 2014

ComFreek commented May 1, 2014

gsingh93 commented May 1, 2014

ComFreek commented May 2, 2014

gsingh93 commented May 3, 2014

ComFreek commented May 5, 2014

jayeye commented Jul 27, 2014

gsingh93 commented Jul 27, 2014