Faster way of performing AUC Evaluations on larger datasets. #126

Open
FaizalJnu opened this issue Aug 25, 2024 · 0 comments

Description:

While working with the BEELINE dataset as part of GSoC, I had difficulty running the evaluation pipeline to generate AUC scores. The file in question is computeDGAUC.py in the BLEval folder. I've therefore implemented an optimized version of the computeScores function that significantly improves performance, especially for large genetic networks. Here's a comparison of the old and new implementations:

Previous Implementation:

  • Used nested loops and DataFrame operations for edge lookups
  • Initialized dictionaries with all possible edges before filling them
  • Relied on DataFrame filtering for each edge check
  • Separate logic for directed and undirected cases

# Directed case: an edge counts only in the Gene1 -> Gene2 orientation.
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[(trueEdgesDF['Gene1'] == key.split('|')[0]) &
                           (trueEdgesDF['Gene2'] == key.split('|')[1])]) > 0:
        TrueEdgeDict[key] = 1

# Undirected case: an edge counts in either orientation.
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[((trueEdgesDF['Gene1'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene2'] == key.split('|')[1])) |
                           ((trueEdgesDF['Gene2'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene1'] == key.split('|')[1]))]) > 0:
        TrueEdgeDict[key] = 1

New Implementation:

  • Converts DataFrames to sets and dictionaries for faster lookups
  • Creates dictionaries on-the-fly while iterating through possible edges
  • Uses set membership and dictionary lookups instead of DataFrame filtering
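  • Unifies logic for directed and undirected cases

# Build the set of true edges once; membership tests are then O(1) on average.
true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
# edge_generator is assumed to yield candidate (Gene1, Gene2) tuples; directed is a bool.
for edge in edge_generator:
    key = '|'.join(edge)
    # For undirected networks, the reversed pair also counts as a true edge.
    TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))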
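
For anyone who wants to try the idea outside the pipeline, here is a minimal self-contained sketch. The function name `build_edge_dicts`, the `EdgeWeight` column, the use of `itertools` to enumerate candidate edges from the genes in the true network, and the choice to keep the largest absolute weight are assumptions for illustration, not the exact code in computeDGAUC.py.

```python
from itertools import combinations, permutations

import pandas as pd


def build_edge_dicts(trueEdgesDF, predEdgeDF, directed=True):
    """Label every candidate edge using set/dict lookups (illustrative sketch)."""
    genes = sorted(set(trueEdgesDF['Gene1']) | set(trueEdgesDF['Gene2']))
    # Ordered pairs for directed networks, unordered pairs otherwise.
    edge_generator = permutations(genes, 2) if directed else combinations(genes, 2)

    # One pass over each DataFrame to build hash-based lookups.
    true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
    pred_weights = {}
    for g1, g2, w in zip(predEdgeDF['Gene1'], predEdgeDF['Gene2'], predEdgeDF['EdgeWeight']):
        # Assumption: keep the largest absolute weight if an edge is listed twice.
        pred_weights[(g1, g2)] = max(abs(w), pred_weights.get((g1, g2), 0.0))

    TrueEdgeDict, PredEdgeDict = {}, {}
    for edge in edge_generator:
        key = '|'.join(edge)
        rev = edge[::-1]
        TrueEdgeDict[key] = int(edge in true_edges or (not directed and rev in true_edges))
        weight = pred_weights.get(edge, 0.0)
        if not directed:
            # Undirected: take the stronger of the two orientations.
            weight = max(weight, pred_weights.get(rev, 0.0))
        PredEdgeDict[key] = weight
    return TrueEdgeDict, PredEdgeDict
```

Both dictionaries share the same keys in the same order, so their values can be passed as labels and scores to whichever precision-recall or ROC routine the evaluation uses.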

Key Improvements:

  • Performance: Each edge check is now an average O(1) set or dictionary lookup instead of a filter over every row of trueEdgesDF, so the new version is significantly faster, especially for large datasets.
  • Scalability: The old cost grows with the number of possible edges multiplied by the number of rows in trueEdgesDF, while the new cost grows with their sum, so the gains become more pronounced as the input grows. This makes it better suited for large-scale genetic network analyses.
  • Code Readability: The new version is more concise with less repeated code, improving maintainability.
  • Memory Usage: The lookup set and dictionaries might use slightly more memory upfront, but this trade-off yields substantial runtime benefits.

Why It's Better:

  • Faster execution times, especially crucial for large genetic networks
  • More efficient handling of edge lookups and checks
  • Better scalability for growing datasets
  • Improved code structure for easier maintenance and future enhancements

These optimizations maintain the same functionality while providing substantial performance enhancements, making our genetic network analysis more efficient and capable of handling larger datasets.
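
A rough way to check both claims (identical labels, shorter runtime) on synthetic data is sketched below. The helper names `label_with_dataframe`, `label_with_set`, and `compare` are hypothetical, the old-style function is a simplified stand-in for the directed loop shown earlier rather than the original code, and the timings will vary by machine.

```python
import random
import time
from itertools import permutations

import pandas as pd


def label_with_dataframe(trueEdgesDF, possible_edges):
    """Old style: filter the whole DataFrame once per candidate edge."""
    labels = {}
    for g1, g2 in possible_edges:
        hits = trueEdgesDF.loc[(trueEdgesDF['Gene1'] == g1) & (trueEdgesDF['Gene2'] == g2)]
        labels[g1 + '|' + g2] = int(len(hits) > 0)
    return labels


def label_with_set(trueEdgesDF, possible_edges):
    """New style: build a set once, then test membership per candidate edge."""
    true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
    return {'|'.join(edge): int(edge in true_edges) for edge in possible_edges}


def compare(n_genes=40, n_true=200, seed=0):
    """Time both approaches on a random directed network and check they agree."""
    rng = random.Random(seed)
    genes = [f'G{i}' for i in range(n_genes)]
    possible_edges = list(permutations(genes, 2))
    sampled = rng.sample(possible_edges, n_true)
    trueEdgesDF = pd.DataFrame(sampled, columns=['Gene1', 'Gene2'])

    start = time.perf_counter()
    old = label_with_dataframe(trueEdgesDF, possible_edges)
    t_old = time.perf_counter() - start

    start = time.perf_counter()
    new = label_with_set(trueEdgesDF, possible_edges)
    t_new = time.perf_counter() - start

    return old == new, t_old, t_new


if __name__ == '__main__':
    same, t_old, t_new = compare()
    print(f'identical labels: {same}; DataFrame filtering {t_old:.3f}s, set lookup {t_new:.3f}s')
```

Raising `n_genes` makes the gap wider, since the number of candidate edges grows quadratically with the number of genes.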
