Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support Dirac GOS #72

Merged
merged 27 commits into from
Jul 27, 2024
Merged

support Dirac GOS #72

merged 27 commits into from
Jul 27, 2024

Conversation

zezhong-zhang
Copy link
Contributor

@zezhong-zhang zezhong-zhang commented Jul 16, 2024

Description of the change

  • Support Dirac GOS that includes the relativistic effects as described in the arxiv. The dataset is open-source on Zenodo under CC-BY-4.0 license. The format follows the gosh for the ease of the user, as discussed in Generalised oscillator strengths in GOSH format hyperspy#3082 (comment).
  • Currently, I simply duplicate the exspy/misc/eels/gosh_gos.py to exspy/misc/eels/dirac_gos.py and replace the links. I think it might be better to reduce the redundancy and combine into one. But I hope to hear your preference.
  • Some minor changes on the docstring and documentation, to be improved latter upon the feedback of this PR.
  • The changelog is not yet done

Progress of the PR

  • Change implemented (can be split into several points),
  • docstring updated (if appropriate),
  • update user guide (if appropriate),
  • added tests,
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:exspy build of this PR (link in github checks)
  • ready for review.

Minimal example of the bug fix or the new feature

import hyperspy.api as hs

s = hs.load("coreloss_spectrum.msa", signal_type="EELS")
ll = hs.load("lowloss_spectrum.msa", signal_type="EELS")

s.add_elements(("Mn", "O"))
s.set_microscope_parameters(
    beam_energy=300, convergence_angle=24.6, collection_angle=13.6
)
m = s.create_model(low_loss=ll, GOS="dirac") #specify the GOS when create the model 
m.enable_fine_structure()
m.multifit(kind="smart")
m.plot()

Copy link

codecov bot commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.55%. Comparing base (e06bf68) to head (c46b5d3).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #72      +/-   ##
==========================================
+ Coverage   88.47%   88.55%   +0.08%     
==========================================
  Files          67       67              
  Lines        7287     7341      +54     
  Branches     1179     1187       +8     
==========================================
+ Hits         6447     6501      +54     
  Misses        571      571              
  Partials      269      269              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ericpre
Copy link
Member

ericpre commented Jul 16, 2024

Indeed, as you said, the duplication should be removed.

Two approaches can be used:

  1. Create two subclass of GoshGOS:
  2. The other approach is to define a registry handling the urls and hashes.

@gguzzina, @francisco-dlp, any preferences?

On a side note, saving a model GoshGOS with a custom path may not be working currently... because the url and hash are not saved. If going for the registry approach, this may need to be fixed.

@gguzzina
Copy link
Contributor

The registry approach seems more sensible to me. If implemented in a sensible way it allows two things:

  1. to ship a list of known GOS (also allowed by the subclass method)
  2. to use other GOS data without need to edit the sources

The new GOS data could include even revisions of the two datasets above, which might be at different URLs or have different hashes.

@zezhong-zhang
Copy link
Contributor Author

The first approach can be easily done, but have the disadvantage of update the url and hash in the code everytime when update the databases. For the registry approach, do we have an mini example of how it works?

@ericpre
Copy link
Member

ericpre commented Jul 18, 2024

It is already possible to specify a file path for a gosh file: https://hyperspy.org/exspy/user_guide/eels.html#generalised-oscillator-strengths.

In the interest of simplicity and because I don't think that there is a use case for user changing the registry dynamically (the cross section are expected to be stable at some point and change rarely and user can use other gosh file easily), I would suggest to use a registry that is simply a python dictionary hardcoded in exspy, that would need to updated one in a while.

@zezhong-zhang
Copy link
Contributor Author

zezhong-zhang commented Jul 19, 2024

To check if I understand it correctly, so we use the GoshGOS and add a dictionary of registry, instead of adding subclasses. The GoshGOS will then need to have a new init parameter (call it "source"?) to specify if use DFT or Dirac, default to DFT.

The use of the GOS will be

m = s.create_model(low_loss=ll, GOS="dirac")
m = s.create_model(low_loss=ll, GOS="dft") # currently called gosh
m = s.create_model(low_loss=ll) # this will use dft-gosh by default

Currently, the pooch download the database to a cache directory by default, might be better to put in the database directory within the package.

@ericpre
Copy link
Member

ericpre commented Jul 19, 2024

To check if I understand it correctly, so we use the GoshGOS and add a dictionary of registry, instead of adding subclasses. The GoshGOS will then need to have a new init parameter (call it "source"?) to specify if use DFT or Dirac, default to DFT.

The use of the GOS will be

m = s.create_model(low_loss=ll, GOS="Dirac")
m = s.create_model(low_loss=ll, GOS="DFT") # currently called gosh
m = s.create_model(low_loss=ll) # this will use dft-gosh by default

This looks good to me. I agree that "gosh" to "DFT", considering that "gosh" refer to the file format and this argument is about specifying the type of GOS. I will defer to you and @gguzzina to define the most suitable name for these two! 😃

Currently, the pooch download the database to a cache directory by default, might be better to put in the database directory within the package.

It is good practise to keep the distribution package as small as reasonable and therefore it is better to download and cache the database. This is only be download only once.
Off the top of my head, pypi.org only allows packages ~50 MB and looking at https://zenodo.org/records/12752410, the file is 500 MB. By the way, is it expected that the file is so large? In comparison, the gosh file from https://zenodo.org/records/7645765 is ~42 MB.
If you are thinking of be able to use these GOS easily on air gap system, then it would be better to discuss this in a separate issue.

@zezhong-zhang
Copy link
Contributor Author

What I mean is that the package indeed does not distribute those files and still do the download, but to define a directory that holds the database for downloading, instead of putting in the cache directory by default which user hard to find and likely to purge.

The Dirac GOS database has all the elements and all the edges (over 2000 entries). In each entry, we have relatively dense sampling (256 in momentum and 128 in energy). Due to spin-orbit coupling, the L2 and L3 for instance are two different entries, instead of single array in the Schrodinger approach. We also used double precision which doubles the size. If preferred, we can give the single precision version but will still be ~200 MB.

Currently, the pooch download the database to a cache directory by default, might be better to put in the database directory within the package.

It is good practise to keep the distribution package as small as reasonable and therefore it is better to download and cache the database. This is only be download only once. Off the top of my head, pypi.org only allows packages ~50 MB and looking at https://zenodo.org/records/12752410, the file is 500 MB. By the way, is it expected that the file is so large? In comparison, the gosh file from https://zenodo.org/records/7645765 is ~42 MB. If you are thinking of be able to use these GOS easily on air gap system, then it would be better to discuss this in a separate issue.

@ericpre
Copy link
Member

ericpre commented Jul 22, 2024

What I mean is that the package indeed does not distribute those files and still do the download, but to define a directory that holds the database for downloading, instead of putting in the cache directory by default which user hard to find and likely to purge.

Writing to site-packages is bad practise for various reasons (user doesn't necessarily have the necessarily privilege, files doesn't get removed when uninstalling, suboptimal for python environment, etc.).
It is possible to use a different cache folder that would differ from the default cache location, but would is the use case? Users are not expected to read the files themselves?

The Dirac GOS database has all the elements and all the edges (over 2000 entries). In each entry, we have relatively dense sampling (256 in momentum and 128 in energy). Due to spin-orbit coupling, the L2 and L3 for instance are two different entries, instead of single array in the Schrodinger approach. We also used double precision which doubles the size. If preferred, we can give the single precision version but will still be ~200 MB.

Could the file size be reduced by using compression?

@zezhong-zhang
Copy link
Contributor Author

  1. Alright, I will not insist on the database path for downloading.
  2. I uploaded a compact database with single precision, down-sampling and compression to get the dataset down to ~66 MB, should be relatively easy for user to download. The link in the code is updated accordingly.
  3. The methods can be called in following way:
from exspy.misc.eels.gosh_gos import GoshGOS
GoshGOS('Ti_L2', source='dirac')

from exspy.models.eelsmodel import EELSModel
EELSModel(signal_1d, GOS='dirac')

from exspy.signals.eels import EELSSpectrum

s = EELSSpectrum(data)
s.create_model(GOS='dirac')
  1. There are breaking changes in the naming: the default is no longer GOS=gosh but GOS=dft in the above classes. As 'dft' and 'dirac' will use both the GoshGOS, the _name will be dft_gosh and dirac_gosh. @gguzzina, could you have a look of the changes to check if the names are appropriate?

Copy link
Member

@ericpre ericpre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks in very good shape.

CHANGES.rst Outdated Show resolved Hide resolved
upcoming_changes/72.new.rst Outdated Show resolved Hide resolved
@@ -146,7 +147,7 @@ class EELSCLEdge(Component):
_fine_structure_coeff_free = True
_fine_structure_spline_active = True

def __init__(self, element_subshell, GOS="gosh", gos_file_path=None):
def __init__(self, element_subshell, GOS="dft", gos_file_path=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A side note as this is outside of the scope of this PR: I find it confusing to have both GOS and gos_file_path. I am wondering if these should be merged together to make GOS accept one of the supported string ("dft", "dirac", etc.) or a path (which could a str or pathlib.Path`) in which case, it will try to read the file assuming this is gosh file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this makes sense to me.

exspy/components/eels_cl_edge.py Show resolved Hide resolved
zezhong-zhang and others added 2 commits July 24, 2024 10:55
Co-authored-by: Eric Prestat <[email protected]>
zezhong-zhang and others added 3 commits July 24, 2024 11:02
redirect "gosh" to "dft" to avoid the name breaking

Co-authored-by: Eric Prestat <[email protected]>
Copy link
Member

@ericpre ericpre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good to me - 2 minor comments.

@gguzzina, does it sound good to you to rename "gosh" to "dft" to make the distinction because the two GOS as mentioned above?

exspy/tests/models/test_eelsmodel.py Outdated Show resolved Hide resolved
exspy/misc/eels/gosh_gos.py Outdated Show resolved Hide resolved
@gguzzina
Copy link
Contributor

@ericpre that's fine by me.

In my personal use I've started to use author names.
The traditional Hartree-Slater would be Rez, our DFT would be Segger, and the Dirac would be Zhang.

But until there's an invasion of competing DFT and Dirac datasets I don't think we have a problem.

@zezhong-zhang
Copy link
Contributor Author

@gguzzina Indeed the author names would be more specific, but the users probably would not have a instinct guess (can add in the doc string which corresponds the author name with the method). Let's keep this option open when the next new GOS added to the list.

@ericpre ericpre merged commit 3b7425e into hyperspy:main Jul 27, 2024
17 checks passed
@ericpre ericpre added this to the v0.3 milestone Aug 14, 2024
@ericpre ericpre mentioned this pull request Aug 15, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants