[core] CPD is always case sensitive #4396

adangel · 2023-02-16T17:44:27Z

Affects PMD Version: 6.x

Description:

Some languages, such as PL/SQL or the new T-SQL (#4390), are case-insensitive. Tokenizing handles this correctly, i.e. the lexers are agnostic to casing. JavaCC has a grammar option for this, and ANTLR has had one since 4.10.

However, when we convert the original tokens into CPD TokenEntries, we don't seem to use the token kind. Instead we use the original token text, which retains the original casing. It's therefore very easy to evade duplicate detection for these languages just by changing the casing:

echo 'select a, b, c, d, e, f from table where x = 1 and y = 2;' > file1.plsql
cp file1.plsql file2.plsql
echo 'sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;' > file3.plsql

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file2.plsql

results correctly in:

Found a 1 line (23 tokens) duplication in the following files: 
Starting at line 1 of /home/andreas/temp/plsql/file1.plsql
Starting at line 1 of /home/andreas/temp/plsql/file2.plsql

select a, b, c, d, e, f from table where x = 1 and y = 2;

since file1.plsql and file2.plsql are identical.

However, comparing file1.plsql and file3.plsql, which differ only in casing, shows no duplications:

run.sh cpd --minimum-tokens 20 --language plsql --dir file1.plsql file3.plsql

I think this problem affects both JavaCC- and ANTLR-based languages.
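
To make the effect concrete, here is a minimal sketch in Java. It is not PMD's code: `RawTokenCompare`, its regex-based `tokenize` helper and `longestMatchingRun` are hypothetical stand-ins for the real lexers and the CPD matching logic, and the comparison only looks at aligned token positions. It only illustrates that matching on the raw token text treats `select` and `sEleCt` as different tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawTokenCompare {
    // Hypothetical stand-in for a real lexer: a token is either a run of
    // word characters or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\s\\w]");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group()); // keeps the original casing of the text
        }
        return tokens;
    }

    // Length of the longest run of positions at which both lists hold
    // exactly the same token text (case-sensitive comparison).
    static int longestMatchingRun(List<String> a, List<String> b) {
        int best = 0;
        int current = 0;
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            current = a.get(i).equals(b.get(i)) ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> file1 = tokenize("select a, b, c, d, e, f from table where x = 1 and y = 2;");
        List<String> file3 = tokenize("sEleCt a, b, c, d, e, f frOm table where x = 1 and y = 2;");
        System.out.println("file1 vs file1: " + longestMatchingRun(file1, file1));
        System.out.println("file1 vs file3: " + longestMatchingRun(file1, file3));
    }
}
```

With this toy tokenizer the statement splits into 23 tokens, the same count CPD reports above. Comparing file1 with file3, the two differently cased keywords break the sequence into runs of 11 and 10 matching tokens, both below the `--minimum-tokens 20` threshold, which is consistent with no duplication being reported.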


oowekyala · 2023-02-16T21:52:17Z

The Apex tokenizer has an option to be case insensitive:

pmd/pmd-apex/src/main/java/net/sourceforge/pmd/cpd/ApexTokenizer.java

Line 56 in af6d502

tokenText = tokenText.toLowerCase(Locale.ROOT);

#4397 changes this option into a language property: https://github.com/pmd/pmd/pull/4397/files#diff-9320afd0816587cbe6b47f1b793f39e1987484e35efedf59c3e63453877e12fdR46
That could be used in more languages than Apex.
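
As a rough sketch of how such an option might be applied to other languages, the class below lowercases token text only when case sensitivity is switched off. `TokenImageNormalizer` and its `caseSensitive` flag are hypothetical names for illustration; they are not the property introduced in #4397 or any PMD API. The only part taken from the thread is the `toLowerCase(Locale.ROOT)` call.

```java
import java.util.Locale;

public class TokenImageNormalizer {
    // Hypothetical flag standing in for a per-language setting.
    private final boolean caseSensitive;

    public TokenImageNormalizer(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Returns the text that would be used for duplicate matching.
    public String normalize(String tokenText) {
        if (caseSensitive) {
            return tokenText; // keep the original casing
        }
        // Locale.ROOT keeps the result independent of the JVM's default locale.
        return tokenText.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        TokenImageNormalizer insensitive = new TokenImageNormalizer(false);
        String a = insensitive.normalize("select");
        String b = insensitive.normalize("sEleCt");
        System.out.println(a.equals(b));
    }
}
```

Passing `Locale.ROOT` matters because `toLowerCase` without a locale depends on the default locale. Under a Turkish locale, for example, an uppercase `I` lowercases to a dotless `ı`, so keywords containing that letter would no longer match. A real implementation would probably also need to decide whether string literals should be left untouched.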

adangel added the a:bug label (PMD crashes or fails to analyse a file) Feb 16, 2023

adangel mentioned this issue Feb 16, 2023

Add support for T-SQL using Antlr4 lexer #4390

Merged

4 tasks

jsotuyod added and removed the needs:pmd7-revalidation label (The issue hasn't yet been retested vs PMD 7 and may be stale) Apr 2, 2024

oowekyala added a commit to oowekyala/pmd that referenced this issue Apr 8, 2024

Fix pmd#4396 - Fix PLSQL CPD being case-sensitive

44f29c3

oowekyala linked a pull request Apr 8, 2024 that will close this issue

[plsql,tsql] Fix CPD being case sensitive in PLSQL and TSQL #4943

Open

4 tasks

adangel added this to the 7.1.0 milestone Apr 18, 2024

oowekyala mentioned this issue Apr 22, 2024

[core] Update Antlr to 4.10+ #4972

Open

adangel modified the milestones: 7.1.0, 7.2.0 Apr 25, 2024
