Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

6070 Update Apache Commons Lang3 #7801

Merged
merged 10 commits into from
May 21, 2021
Merged

Conversation

poikilotherm
Copy link
Contributor

@poikilotherm poikilotherm commented Apr 14, 2021

What this PR does / why we need it:
I want to use modern Apache Commons Lang3 functionality (FailableStream) for #5989. (see usage in code)
To keep the scope of the PR for that issue smallish, I created this PR, as you folks prefer small steps.

The dependency update has been a long standing thing. See the issue.

Which issue(s) this PR closes:

Closes #6070

Special notes for your reviewer:

🐛 THERE IS A HIDDEN BUG.
Some tests for the CSV ingest fail.

Switching http://github.com/IQSS/dataverse/blob/eca8b8fd403002327439f81f54c6ed9edde391c7/src/main/java/edu/harvard/iq/dataverse/ingest/tabulardata/impl/plugins/csv/CSVFileReader.java#L211-L211 like this makes all tests pass:

-&& StringUtils.isNumeric(varString.substring(1))));
+&& org.apache.commons.lang.StringUtils.isNumeric(varString.substring(1))));

Switching that function from Lang3 back to Lang2 should NOT make the tests pass. Playing around with the CSV files and the expected tests results makes the tests flaky with v2 and pass with v3. It looks like there is a hidden bug in the CSV parser not revealed by the tests. (The only change in isNumeric() from v2 -> v3 is that empty strings are also not counted as numeric. The test cases have no example covered by this change.)

As I am not very familiar with the ingest code, I would appreciate some help.

Suggestions on how to test this:
Usual unit tests and integration tests. UI uses some escaping routines, but they have just been moved in upstream and there is no functional changes expected.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Nope.

Is there a release notes update needed for this change?:
I don't think so.

Additional documentation:
None.

…6070

Also switch from bulk imports of commons.lang3.* to single class style.
Fixing for other parts like java.util, too.
IQSS#6070

Apache Commons Lang3 3.12.0 moved `StringEscapeUtils` to Apache Commons
Text. Also, escapeHtml() and escapeXml() from Apache Commons Lang (v2)
have been renamed to escapeHtml4() and escapeXml10() with v3.0
@djbrooke
Copy link
Contributor

Thanks @poikilotherm for flagging the failing test. We've been dealing with some production issues on dataverse.harvard.edu this week, but we'll take a look at this as soon as we can.

@qqmyers
Copy link
Member

qqmyers commented Apr 16, 2021

FWIW: In general, I think these changes should be fine - just updating to a later version. (We actually had a few lang3 calls already (it must be a dependency of something already in the pom)).

For the failing code - it just looks like it is trying to check that the field is either null, empty, or a number potentially starting with a + or -. The code does expect that the string is at least two chars long and should probably be fixed by splitting the cases up a bit further - a one char string has to be numeric and if it is two or more chars, the existing check would work (and not give an empty value to isNumeric()).

Others may know more, but it looks straight-forward enough that I think you could just make the change and let people review the result.

@poikilotherm
Copy link
Contributor Author

poikilotherm commented Apr 19, 2021

FWIW: In general, I think these changes should be fine - just updating to a later version. (We actually had a few lang3 calls already (it must be a dependency of something already in the pom)).

Aye, the transitive dependency has been used. See also my extensive review from last year.

For the failing code - it just looks like it is trying to check that the field is either null, empty, or a number potentially starting with a + or -. The code does expect that the string is at least two chars long and should probably be fixed by splitting the cases up a bit further - a one char string has to be numeric and if it is two or more chars, the existing check would work (and not give an empty value to isNumeric()).

Jim I know that the code is expected to do that. Yet it seems to be more complicated, which is why I consider there is a bug hidding either in the test code or within the app code.

grafik

Test testRead is expecting "null" to become "0", yet it becomes "0.0" for this line of CSV

The other tests also are failing because of detecting as continuous instead of discrete.

Removing all but L1 (header) plus L7 from IngestCSV.csv is producing even weirder test results for testRead():
grafik

Using this simple reproducer with just a single CSV line to be parsed, but now using org.apache.commons.lang.StringUtils.isNumeric() (Apache Commons Lang v2) instead, is producing the same failing testRead() with the same error as above.

Which is leading me to: who want's to go for a 🐛 hunt?

Thanks @djbrooke for sending in more troops 🚁 👍

@landreev landreev self-assigned this May 11, 2021
@landreev
Copy link
Contributor

I believe @qqmyers has been right all along - it was simply the single digit numbers (like the "2" and "0", in the lines 3 and 4 of the CSV file, respectively) that were breaking the integer test. The code was checking the first character - which is also the last character of the string, in this case - against [+-0-9]; and then trying to do StringUtils.isNumerc() on the remainder of the string - which as of lang3 would return false on an empty string... So then, in the second/final pass, that character string "null" in line 7 would be formatted as "0.0", on the assumption that it was a floating point zero, not an integer kind... (or "continuous" vs. "discrete", for the purposes of the other tests).
So yes, what Jim suggested above could be a possible fix - a separate case for a single-char string ([0-9]), the current test for all other cases...; but it may be easier to just replace these lines:

   || (firstNumCharSet.contains(varString.charAt(0))
       && StringUtils.isNumeric(varString.substring(1))

with

   || varString.matches("^[+-]?[0-9]+$")

So I'm going to check that in. (this change makes the test pass; I'll give it another thought, whether there are advantages to sticking with StringUtils.isNumeric()...)

@landreev
Copy link
Contributor

I don't really get why Jenkins failed to build the branch again, after I checked in the fix for the CSV ingest test. (The branch is building fine for me locally).
Anyone?

@donsizemore
Copy link
Contributor

@landreev Travis failed as well:

[ERROR] COMPILATION ERROR : 

[INFO] -------------------------------------------------------------

[ERROR] /home/travis/build/IQSS/dataverse/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java:[178,51] cannot find symbol

  symbol:   method escapeXml(java.lang.String)

  location: class org.apache.commons.text.StringEscapeUtils

@landreev
Copy link
Contributor

@poikilotherm Could you please sync the branch up with develop?
The change I have committed fixes the CSV ingest tests that were failing. The bad news is, there appears to be another conflict between your branch, and something that has been added to develop since the PR was made. Hence the most recent Jenkins failure (Jenkins automatically syncs up its checkout with develop before attempting to build).
So let's update the branch, and then investigate further.

@poikilotherm
Copy link
Contributor Author

poikilotherm commented May 17, 2021

Hi. Have been on vacation, thus offline.
I updated the branch as requested and fixed the failing escapeXML() call. All local unit tests now running smoothly.
Thanks for fixing the issue @landreev @qqmyers

@poikilotherm
Copy link
Contributor Author

poikilotherm commented May 17, 2021

One remaining thing pops up in my mind: as formerly there had been v2 and v3 usage in parallel, should we enforce a checkstyle rule check for "illegal" v2 usages, so we don't introduce transitive dependency use?

We could let @reviewdog bark in PR reviews based on these rules. https://github.com/marketplace/actions/run-java-checkstyle

@coveralls
Copy link

coveralls commented May 17, 2021

Coverage Status

Coverage decreased (-0.001%) to 19.324% when pulling 4a13f15 on poikilotherm:6070-commons-lang3 into b95ba34 on IQSS:develop.

@landreev
Copy link
Contributor

Great, thank you. I may check in a better permanent fix for the CSV ingest (trying to decide on that now), and will move the PR after that.

@landreev
Copy link
Contributor

Also, yes, it would be a good idea to add a check for transitive dependencies like this, as part of the overall automated review framework.

@poikilotherm
Copy link
Contributor Author

poikilotherm commented May 18, 2021

I created a test branch and a test PR to show the usage of a barking reviewdog in action...
poikilotherm#441

The check action is reporting the failed state. It does not leave comments as it's running in a fork, only showing the check report inline.

grafik
grafik

It's fairly simple to use and only barks on files of the PR's changeset. Shall I include this in this PR? (Or is this a part of #7876 / #3950 ?)

@kcondon
Copy link
Contributor

kcondon commented May 21, 2021

Hi @poikilotherm I'm testing this today but due to recent dataverse version change to 5.5, cannot build/deploy. Would you mind updating this branch from develop? Thanks.

@poikilotherm
Copy link
Contributor Author

Done @kcondon

@kcondon kcondon assigned kcondon and unassigned landreev May 21, 2021
@kcondon kcondon merged commit a875db7 into IQSS:develop May 21, 2021
@djbrooke djbrooke added this to the 5.6 milestone Jun 21, 2021
@poikilotherm poikilotherm deleted the 6070-commons-lang3 branch December 15, 2021 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dependency update: Apache Common Lang v2 -> v3
7 participants