Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexes don't work #8

Open
ComFreek opened this issue Apr 30, 2014 · 11 comments
Open

Regexes don't work #8

ComFreek opened this issue Apr 30, 2014 · 11 comments

Comments

@ComFreek
Copy link
Contributor

I currently face some strange problems.
I use the official 1.7 version from the Chrome Web Store (updated on April 27, 2014).

  1. Go to http://stackoverflow.com
  2. Search for this regex: modified \d+
  3. It does not find any search results. It should find several occurrences of "modified 4 mins ago" or the like.

Also try:

  1. Go to http://stackoverflow.com
  2. There is a modified time within each entry of the question list. Example:
    regex
  3. Search for asked 39 in this case.
  4. The extension doesn't find any search results although it should!

Am I missing something?

@gsingh93
Copy link
Owner

Does the modified \d+ work for you, or are you saying it should find those matches and doesn't? Because both queries you gave don't work for me.

I just checked out version 1.6 and it looks like the same issue is there for me. Is this the first time you've run these queries or was this working for you before?

This is a major issue, but I'm hoping that at least it's not a regression.

@ComFreek
Copy link
Contributor Author

It was the first time I tried a query which doesn't work.
I debug now.

PS: Both queries don't work for me although they should. I updated my comment above to clarify this.

@ComFreek
Copy link
Contributor Author

ComFreek commented May 1, 2014

@gsingh93 I've found the "error". StackOverflow splits the string modified x mins ago into two elements:

<a class="started-link" href="...">modified <span title="..." class="relativetime">1m ago</span></a>`

I don't think there is a method for solving this problem, which is correctly guessing such combinations all the time, let alone taking CSS positioning into account.

@gsingh93
Copy link
Owner

gsingh93 commented May 1, 2014

Thanks for looking into this. Spans and anchor tags especially are used to separate a lot of text, so we're going to run into this for a lot of regexes. For example, on this page search for "with GitHub". You won't find anything because "GitHub" is wrapped in an anchor tag. Maybe we should group all adjacent inline elements together when searching for text, and if not all inline elements, at least spans and anchor tags.

@ComFreek
Copy link
Contributor Author

ComFreek commented May 1, 2014

Exactly, I've thought about the same thing. The problem, however, is to do the highlighting afterwards. You have to know which character (in the grouped text) belongs to which HTML element on the site. Otherwise characters which are actually not included in the search result get highlighted.

Example:

<div>
  Blub
  <span>Com</span>
  <span>Freek</span>
  Blub
</div>

Grouped version:

<div>BlubComFreekBlub</div>

Now how can the extension determine the part (or rather the element in the source HTML) to highlight? It can't, we therefore have to choose the parent <div> and a result of this action, Blub gets also highlighted two times.

@gsingh93
Copy link
Owner

gsingh93 commented May 1, 2014

I think while it wouldn't be easy, it's still possible.

When we merge the elements to get the combined text, we keep a list of the original elements that were merged and a mapping of ranges (starting and ending positions in the combined text) to these elements. If we find a match in the combined text, we simply need to find which element the match starts in, and since we know where the match starts in the combined text, we can use our mapping of ranges to elements to know which element the match starts in. Now since we know the matched text, we know the match is continuous, and we know all of the merged elements in order, we can just iterate through the elements highlighting as much of the match as the element contains until we've highlighted the entire match.

For example, if we have <div>ab<span>c</span>de</div> and we search for bcd, we first combine the elements and find the text abcde. When combining, we keep a list of all the elements in order, so something like [text, span, text]. We also keep a mapping of [(0,1) -> 1, (2,2) -> 2, (3,4) -> 3] which maps ranges to elements. Now since we can easily find the first letter of the match starts at position 1 (zero-indexed), we can use the mapping to know the match starts at the first element. Thus, we can start highlighting from the beginning of the match in the first element to either the end of the element or the end of the match. We hit the end of the element first so we repeat this for this for the second element. For the third element, we hit the end of the match first, so we stop highlighting there.

It's been a while since I wrote the highlighting code, so while this seems doable to me, I might be missing something. Let me know if this makes sense.

@ComFreek
Copy link
Contributor Author

ComFreek commented May 2, 2014

That sounds good!

I've made a little proof of concept script:

HTML:

<div id="myDiv"><span>Com</span><span>pu</span><span>ters</span><span>999 are good!</span></div>

TypeScript:

function combine(root: HTMLElement, combinedData?: string) {
    var walker: TreeWalker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);

    var combinedData = combinedData || "";
    var ranges = [];

    while (walker.nextNode()) {
            var value = walker.currentNode.nodeValue;
            var range = {
                start: combinedData.length,
                end: combinedData.length + value.length,
                element: walker.currentNode
            };
            combinedData += value;
            ranges.push(range);
    }
    return {combined: combinedData, ranges: ranges};
}

var elem = document.querySelector("div#myDiv");
console.log(combine(elem));

(I found an interesting online editor which also supports TypeScript: http://fiddlesalad.com/typescript/ just copy & paste my code!)

Question:

  • Do you want to apply this range-based concatenation algorithm only to <span> elements and their "neighbors"? My function above can handle the whole DOM tree if you pass document.body as the root.

TODOs:

  • Create a better data structure (see here for examples)
  • Write a function which retrieves the element for a specific position
  • Refactor the content script code to embed the new search engine (what about generalizing the idea of a search and highlight engine, so we only use interfaces => higher decoupling)

@gsingh93
Copy link
Owner

gsingh93 commented May 3, 2014

This looks good.

My only concern about combining the whole DOM tree is that two completely unrelated pieces of text (possibly on opposite sides of the screen) could then be matched. This technically could happen with spans too, but if we code towards the general case (which is that people generally use spans/inline elements correctly), then I think we should be fine. We could also just test both versions on different sites and see whether we get more false positives or false negatives with each version.

@ComFreek
Copy link
Contributor Author

ComFreek commented May 5, 2014

👍

One could also take the elements' positions into account ... but it only gets more complicated over time!

I think prototyping the span code (for now) and getting a first functional search engine would be the best

@jayeye
Copy link

jayeye commented Jul 27, 2014

In Chrome 35, it does not even find single-letter regexps (or if does, it does not highlight them). What's the point?

@gsingh93
Copy link
Owner

@jayeye If you want to report a bug, post both the page and the regex query you're using. Otherwise, there's no way we can help you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants