[core] CPD is always case sensitive #4396

Open
adangel opened this issue Feb 16, 2023 · 1 comment · May be fixed by #4943
Labels
a:bug PMD crashes or fails to analyse a file.
Milestone

Comments

@adangel
Member

adangel commented Feb 16, 2023

Affects PMD Version: 6.x

Description:

Some languages, like PL/SQL or the new T-SQL (#4390), are case-insensitive. Tokenizing handles this correctly, i.e. the lexers are agnostic to casing. JavaCC has a grammar option for this, and ANTLR has had one since 4.10 as well.

However, when we convert the original tokens into CPD TokenEntries, we don't seem to use the token kind; instead we use the original token text, which keeps the original casing. It's therefore very easy to evade duplicate detection for these languages by just changing the casing:

echo 'select a, b, c, d, e, f from table where x = 1 and y = 2;' > file1.plsql
cp file1.plsql file2.plsql
echo 'sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;' > file3.plsql

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file2.plsql

results correctly in:

Found a 1 line (23 tokens) duplication in the following files: 
Starting at line 1 of /home/andreas/temp/plsql/file1.plsql
Starting at line 1 of /home/andreas/temp/plsql/file2.plsql

select a, b, c, d, e, f from table where x = 1 and y = 2;

since file1.plsql and file2.plsql are identical.

However, comparing file1.plsql and file3.plsql, which differ only in casing, shows no duplications:

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file3.plsql

I think this problem affects both JavaCC- and ANTLR-based languages.
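
To illustrate the effect described above, here is a minimal sketch that compares the token texts of file1.plsql and file3.plsql with and without case folding. It is not PMD's tokenizer: the class `CaseFoldingSketch`, the method `tokenImages`, and the regex-based splitting are all hypothetical stand-ins for a real lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaseFoldingSketch {
    // Hypothetical stand-in for a lexer: a run of word characters,
    // or a single non-space punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\s\\w]");

    // Returns the token texts of the source, optionally lower-cased.
    public static List<String> tokenImages(String source, boolean ignoreCase) {
        List<String> images = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String image = m.group();
            images.add(ignoreCase ? image.toLowerCase(Locale.ROOT) : image);
        }
        return images;
    }

    public static void main(String[] args) {
        String file1 = "select a, b, c, d, e, f from table where x = 1 and y = 2;";
        String file3 = "sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;";

        // Comparing the original token text: the sequences differ.
        System.out.println(tokenImages(file1, false).equals(tokenImages(file3, false))); // false
        // Comparing case-folded token text: the sequences are identical.
        System.out.println(tokenImages(file1, true).equals(tokenImages(file3, true)));   // true
    }
}
```

Note that this sketch folds every token. A real lexer would also produce string literals, and whether those should be folded is a separate decision.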

@adangel adangel added the a:bug PMD crashes or fails to analyse a file. label Feb 16, 2023
@oowekyala
Member

The Apex tokenizer has an option to be case-insensitive:

tokenText = tokenText.toLowerCase(Locale.ROOT);

#4397 changes this option into a language property: https://github.com/pmd/pmd/pull/4397/files#diff-9320afd0816587cbe6b47f1b793f39e1987484e35efedf59c3e63453877e12fdR46
That property could be used for more languages than Apex.
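
As a rough sketch of what a per-language setting could look like, the example below looks up a case-sensitivity flag by language and folds the token text only when the language is case-insensitive. The class `CaseSensitivityProperty`, the `CASE_SENSITIVE` map, and the `normalize` method are hypothetical and do not reflect PMD's actual property API or the change in #4397.

```java
import java.util.Locale;
import java.util.Map;

public class CaseSensitivityProperty {
    // Hypothetical per-language settings, keyed by language id.
    // Languages not listed are assumed to be case-sensitive.
    private static final Map<String, Boolean> CASE_SENSITIVE = Map.of(
            "apex", false,
            "plsql", false,
            "java", true);

    // Returns the text to compare for duplication: unchanged for
    // case-sensitive languages, lower-cased otherwise.
    public static String normalize(String languageId, String tokenText) {
        boolean caseSensitive = CASE_SENSITIVE.getOrDefault(languageId, true);
        return caseSensitive ? tokenText : tokenText.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("plsql", "sEleCt")); // select
        System.out.println(normalize("java", "String"));  // String
    }
}
```

Using `Locale.ROOT`, as in the Apex snippet, keeps the folding independent of the default locale of the machine running the analysis.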

@jsotuyod jsotuyod added needs:pmd7-revalidation The issue hasn't yet been retested vs PMD 7 and may be stale and removed needs:pmd7-revalidation The issue hasn't yet been retested vs PMD 7 and may be stale labels Apr 2, 2024
oowekyala added a commit to oowekyala/pmd that referenced this issue Apr 8, 2024
@oowekyala oowekyala linked a pull request Apr 8, 2024 that will close this issue
4 tasks
@adangel adangel added this to the 7.1.0 milestone Apr 18, 2024
@adangel adangel modified the milestones: 7.1.0, 7.2.0 Apr 25, 2024
3 participants