From a06de35111d19c6623680625651d844e2a49e616 Mon Sep 17 00:00:00 2001 From: Jeremy Nelson Date: Mon, 8 Apr 2024 11:38:12 -0700 Subject: [PATCH] 2023 and 1Q 2024 meetings --- _toc.yml | 11 +++++ meetings/2023-02-14.md | 35 +++++++++++++++ meetings/2023-03-13.md | 27 ++++++++++++ meetings/2023-04-11.md | 28 ++++++++++++ meetings/2023-05-09.md | 24 +++++++++++ meetings/2023-07-11.md | 23 ++++++++++ meetings/2023-09-12.md | 31 +++++++++++++ meetings/2023-10-10.md | 43 ++++++++++++++++++ meetings/2023-11-13.md | 26 +++++++++++ meetings/2024-01-09.md | 34 +++++++++++++++ meetings/2024-02-13.md | 31 +++++++++++++ meetings/2024-03-12.md | 98 ++++++++++++++++++++++++++++++++++++++++++ 12 files changed, 411 insertions(+) create mode 100644 meetings/2023-02-14.md create mode 100644 meetings/2023-03-13.md create mode 100644 meetings/2023-04-11.md create mode 100644 meetings/2023-05-09.md create mode 100644 meetings/2023-07-11.md create mode 100644 meetings/2023-09-12.md create mode 100644 meetings/2023-10-10.md create mode 100644 meetings/2023-11-13.md create mode 100644 meetings/2024-01-09.md create mode 100644 meetings/2024-02-13.md create mode 100644 meetings/2024-03-12.md diff --git a/_toc.yml b/_toc.yml index 48b3c33..28d9ea7 100644 --- a/_toc.yml +++ b/_toc.yml @@ -7,6 +7,17 @@ parts: - caption: Meetings chapters: - file: meetings/about + - file: meetings/2024-03-12 + - file: meetings/2024-02-13 + - file: meetings/2024-01-09 + - file: meetings/2023-11-13 + - file: meetings/2023-10-10 + - file: meetings/2023-09-12 + - file: meetings/2023-07-11 + - file: meetings/2023-05-09 + - file: meetings/2023-04-11 + - file: meetings/2023-03-13 + - file: meetings/2023-02-14 - file: meetings/2023-01-10 - file: meetings/2022-12-13 - file: meetings/2022-10-11 diff --git a/meetings/2023-02-14.md b/meetings/2023-02-14.md new file mode 100644 index 0000000..3408aec --- /dev/null +++ b/meetings/2023-02-14.md @@ -0,0 +1,35 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# February 
14, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* David Lowe (Texas A&M) +* Erik Love (Colorado) +* Jeremy Nelson (Stanford) +* William (Texas A&M) +* James Creel (Texas A&M) + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda Topics +1. Updates, announcements, intros +2. Continue review of ML Use Cases using FOLIO +3. Catalog a FOLIO JSON Instance for the book Pride and Prejudice by Jane Austen using ChatGPT + 1. Is the resulting record a valid FOLIO JSON record? + 2. Source looks like it is from Project Gutenberg + 3. Unique Identifiers + 4. Need to train on existing valid Instance/Inventory records + 5. Preceding/Succeeding titles automatically generated + 6. +4. Using AI/ML to assist curation: when using these tools to catalog records, a serious cataloger would like to know the sources that the model uses to create the JSON records. 
If it can’t find a source, especially with new materials not cataloged elsewhere, the LLM could parse the source text, accelerate the cataloger workflows, still need professional cataloger's judgment and diff --git a/meetings/2023-03-13.md b/meetings/2023-03-13.md new file mode 100644 index 0000000..6a7dd18 --- /dev/null +++ b/meetings/2023-03-13.md @@ -0,0 +1,27 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# March 13, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* Erik Radio (Colorado) + +**Regrets** +* David Lowe (Texas A&M) + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda Topics +1. Updates, announcements, intros diff --git a/meetings/2023-04-11.md b/meetings/2023-04-11.md new file mode 100644 index 0000000..fb8db18 --- /dev/null +++ b/meetings/2023-04-11.md @@ -0,0 +1,28 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# April 11, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* Sonja Thiel (Baden State Museum, Karlsruhe) +* David Lowe (Texas A&M) +* Mary Aycock (Texas State University) +* Tim Thompson (Yale University) + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda Topics +1. 
Updates, announcements, intros +2. ChatGPT interactive session based on [https://thecaglereport.com/2023/03/16/nine-chatgpt-tricks-for-knowledge-graph-workers/](https://thecaglereport.com/2023/03/16/nine-chatgpt-tricks-for-knowledge-graph-workers/) + 1. Investigated using ChatGPT to generate a BIBFRAME Work Resource based on a simple prompt for a book + 2. Used ChatGPT to +3. https://arxiv.org/abs/2303.10130 diff --git a/meetings/2023-05-09.md b/meetings/2023-05-09.md new file mode 100644 index 0000000..273192b --- /dev/null +++ b/meetings/2023-05-09.md @@ -0,0 +1,24 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# May 9, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* David Lowe (Texas A&M) + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Demo of [https://github.com/sul-dlss-labs/folio-llm](https://github.com/sul-dlss-labs/folio-llm), static site available at [https://sul-dlss-labs.github.io/folio-llm/](https://sul-dlss-labs.github.io/folio-llm/) based on the [https://react-lm.github.io/](https://react-lm.github.io/) pattern and the following blog post [https://til.simonwillison.net/llms/python-react-pattern](https://til.simonwillison.net/llms/python-react-pattern). 
+* Further investigation of similar technologies like LangChain ([https://python.langchain.com](https://python.langchain.com)) and AutoGPT ([https://github.com/Significant-Gravitas/Auto-GPT](https://github.com/Significant-Gravitas/Auto-GPT)) diff --git a/meetings/2023-07-11.md b/meetings/2023-07-11.md new file mode 100644 index 0000000..e4ff964 --- /dev/null +++ b/meetings/2023-07-11.md @@ -0,0 +1,23 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# July 11, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* Erik Radio (University of Colorado) + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Interest in submitting a proposal to the 2023 [Fantastic Futures Conference](https://ff2023.archive.org/pages/call_for_proposals/.)? Topics? 
diff --git a/meetings/2023-09-12.md b/meetings/2023-09-12.md new file mode 100644 index 0000000..fba4020 --- /dev/null +++ b/meetings/2023-09-12.md @@ -0,0 +1,31 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# September 12, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* Tim Thompson (Yale) +* Erik Radio (Colorado) + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Vector Databases for FOLIO + * TerminusDB - RDF underneath, JSON public interface, GraphQL and custom query language + * Sidecar Vector database + * Blog post describing [https://terminusdb.com/blog/vector-database-and-vector-embeddings/](https://terminusdb.com/blog/vector-database-and-vector-embeddings/) + * Using [pgvector](https://github.com/pgvector/pgvector) and [vecs](https://github.com/supabase/vecs) to ingest FOLIO records +* Increase participation in Metadata WG? 
+ * FOLIO Slack channels + * Environmental Scan on existing work in AI in Libraries and Museums diff --git a/meetings/2023-10-10.md b/meetings/2023-10-10.md new file mode 100644 index 0000000..9d40a85 --- /dev/null +++ b/meetings/2023-10-10.md @@ -0,0 +1,43 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# October 10, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** + + + +* Name (institution) +* Erik Radio (University of Colorado) +* Jeremy Nelson (Stanford) +* Tim Thompson (Yale) + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Using client-side server-side with LLM + * [https://til.simonwillison.net/llms/python-react-pattern](https://til.simonwillison.net/llms/python-react-pattern) +* Fantastic Futures 2023 Talk Proposal + * Title: Chatting with your Catalog: Exploring the use of LLMs with FOLIO LSP \ +Abstract: In the ai4lam Metadata Working Group, we have been exploring the integration of ChatGPT and other Large Language Models (LLMs) with the FOLIO Library Services Platform over the past year. This presentation aims to showcase three practical applications of LLMs within FOLIO, catering to different audiences and tasks. + + Firstly, we demonstrate how ChatGPT can assist in creating AI and machine learning use cases that are specifically tailored to FOLIO. + + + Secondly, we explore the transformation of MARC21 and BIBFRAME RDF formats into FOLIO-specific bibliographic JSON Inventory records using ChatGPT. 
Attendees will witness how this process is accessible through both the web interface and the OpenAI API, enabling efficient and accurate record conversions. + + + Lastly, we present a client-side web application that enhances user interaction with the FOLIO platform. By leveraging the ReAct Pattern with customized prompts, users can engage with ChatGPT in a chat-forward manner, actively modeling common workflows within FOLIO. + + + Through these demonstrations, attendees will gain insights into the potential of integrating LLMs with FOLIO. Learn about the adaptability and utility of ChatGPT and LLMs as innovative tools for improving FOLIO's functionalities and user experience. + diff --git a/meetings/2023-11-13.md b/meetings/2023-11-13.md new file mode 100644 index 0000000..ce40ab1 --- /dev/null +++ b/meetings/2023-11-13.md @@ -0,0 +1,26 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# November 13, 2023 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Name (institution) +* Jeremy Nelson (Stanford) +* Wayne Schneider (IndexData) +* Tim Thompson (Yale) +* Jenn Colt (Cornell) + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Preview of Fantastic Futures Presentation diff --git a/meetings/2024-01-09.md b/meetings/2024-01-09.md new file mode 100644 index 0000000..599da05 --- /dev/null +++ b/meetings/2024-01-09.md @@ -0,0 +1,34 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# January 9, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Aaron Krebeck - WRLC +* Jeremy Nelson - Stanford +* Erik Radio - Colorado 
+* Joy Panigabutra-Roberts - Tennessee + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Planning for 2024 + * Abigail Potter from LOC Labs presentation in February + * Aaron Krebeck in March + * Joy OCLC Metadata Managers update in April +* Catalog Chat Plans https://github.com/AI4LAM/catalog-chat + * Extend to use for metadata reporting +* Alma and AI + * ChatGPT generating XSLT, MODS to METS, to Dublin Core + * Format agnostic +* Later in the year, use geographic identification for potential resources +* Data clean-up using Python scripts, research identifies diff --git a/meetings/2024-02-13.md b/meetings/2024-02-13.md new file mode 100644 index 0000000..ccbaa5e --- /dev/null +++ b/meetings/2024-02-13.md @@ -0,0 +1,31 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# February 13, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + + +**Attending** +* Jeremy Nelson (Stanford) +* Barbara Cormack (CDL) +* Dana Jemison (CDL) dana.jemison@ucop.edu +* Ian Bogus (Recap) ibogus@princeton.edu +* Tim Thompson (Yale) +* Erik Radio (Colorado) +* Andrew Elliot (CRL) +* Aaron Krebeck (WRLC) krebeck@wrlc.org + +## Helpful Links +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Change May meeting to May 7th for Abigail Potter presentation on AI at LOC +* 
Outlook Meeting requests? +* Bibliographic Experiment with Google Gemini diff --git a/meetings/2024-03-12.md b/meetings/2024-03-12.md new file mode 100644 index 0000000..e456602 --- /dev/null +++ b/meetings/2024-03-12.md @@ -0,0 +1,98 @@ +title: ai4lam Metadata/Discovery WG Monthly Meeting + +# March 12, 2024 + +8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris + +**Attending** +* Jeremy Nelson, Stanford +* Aaron Krebeck, Washington Research Library Consortium +* Erik Radio, Colorado +* Tim Thompson, Yale +* Sara Amato + +## Helpful Links + +* [Metadata WG Zotero Group Library](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library) + + +## Project Documents and Data + +* [WG charter](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing) +* [WG Google Drive folder](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing) + + +## Agenda +* Announcements +* Aaron Krebeck Presentation + * [https://docs.google.com/presentation/d/1e4jemOguxXVYLvPESyWnIw8M7BzgPv2SlqFcXDgtSd8/edit?usp=sharing](https://docs.google.com/presentation/d/1e4jemOguxXVYLvPESyWnIw8M7BzgPv2SlqFcXDgtSd8/edit?usp=sharing) + * Practical application using ChatGPT + * Asked ChatGPT to write a script for Bob's Burgers; impressed with the output and its formatting. What other applications to give ChatGPT a voice? Had a few projects in the library that might benefit. + * Comes from shared print and shared print serials? Serials holdings statements have large variability; parts and dates are not consistent. A Metadata committee came up with some metadata standards for writing serials statements: guidelines for how to write new and ongoing serials statements in the consortium. + * Not a retroactive project, 900k serials, not enough staff time to go back manually. 
+ * Metadata committee came up with good documentation for serials statements + * This documentation is a good starting point for AI + * Proof of Concept + * Slowly refine a query with ChatGPT + * Item description templates very helpful + * Great opportunity to learn to use AI + * ChatGPT free tier allows for 100 lines of input, which is plenty for testing query development + * Enough to test 25 gnarly descriptions + * Prompt had guides for volumes, numbers, + * Got to the point of parsing correctly 19/20, 20/20 + * How to use API calls and application + * OpenAI released custom GPT like data analytics, + * Upload a CSV, tell it what to do, and download results + * Only need to buy a pro license + * Didn’t need to use APIs or develop custom app + * Results with CSV that had barcode, description, results included new column “New Description” + * Colleague says that training GPT is like teaching + * Bookmarked with iterative training, in 900k rows CSV, when 100k, 200k create bookmark + * 9 different universities, 50 different tech services managers over the years. Each one was wrong in different ways + * Q: Talk more about the process? + * Started with Alma analytics, first pulled 900k, ChatGPT doesn’t struggle with Pro ChatGPT 4.0 engine, have included 1.5 million. Keep it pretty simple, identifier like barcode and field to enforce standardized. + * Once the CSV file is ready, upload through ChatGPT web interface + * Prompts started off doing all 900k, better success by breaking down by University 200-300k at any one time. + * Allows for tailoring the prompt to the “House” style of serials statements + * Use the English word for volume, for example “jahr” in German, always translated into English. Need a special rule to process. + * Q: Last time ran this process? 
+ * Already did 900k volumes in shared storage facility + * Partners asked to do volumes on Campus, will be over 1 million + * Ran into a problem, network configuration in Alma, material in shared storage facility is also in the home zone, duplicated in consortium and institutions. Item description must be exact. Need to do it twice for consortium and individual + * Isn’t perfect, out of the specification and the metadata format. Definitely gets confused when you have 4 digit numbers that aren’t dates. In the rare instance with like page number, ChatGPT treats like dates + * Also not great, (garbage in, garbage out) data with cataloging errors doesn’t fix, special characters. Generally speaking, average is above 80%. + * What to do with the 15% worse or doesn’t fix? + * Ask ChatGPT to check over its work + * Looks at the original description and all of the number elements, ensures that all of the number elements appear in the new description + * If information is missing, don’t use the result from the checking process and flag it. + * Next steps? + * From normalized item descriptions, to 866s to coded fields + * National/international policies and templates for standardized 866s + * Or not! + * What other periodical problems can this solve? + * East Consortium, list of publishers to exclude, not uniform standard for names, could ChatGPT be used in similar + * Use other models Gemini or Claude? + * Gemini good for small datasets in Google Sheets. Haven’t done any large data-sets, no upload CSV option. + * Once we had a CSV with revised CSV, have a Python script to upload to Alma, + * Q: Publicly shared? Link to public thread or documentation. + * If migrating to Alma [https://github.com/WRLC/alma_notes_import](https://github.com/WRLC/alma_notes_import) GitHub repository, Alma doesn’t have more batch + * Q: Curious about parsing into structured data instead of formatted strings? 
+ * Some success, taking these item descriptions, break apart, and add as enumeration and chronology fields in MARC, done in Alma environment + * On a task force, 583 field subfield a committee, push on Ex Libris if you have retention information, put into the 583 a. + * Standards themselves do not provide much information about holdings information + * Q: What is the ultimate purpose? Common Dataset? + * With success, how can these tools be applied to other problems, what are the things that are safe for LLM to do or human resources to ourselves. + * Q: How long does it take to process 900k? + * A minute or two + * Tim: Working on places in subject headings in hierarchy, using API to do one-off requests is “place a part of place b”, will go back and try again with CSV a familiar juggling process. + * Asking ChatGPT in shared print, keep two copies of every title. Ideally, both copies are in a shared facility. Didn’t want any partners to have unequal representation, prioritize shared storage and equal representation. 1.8 million monographs, how many are in shared storage but not “official” retained copy + * Rows fine but extra cells are problematic. + * Only give it what it needs to make a decision + * Q: Publish work in Code4Lib? + * Even a GitHub repo with the prompts would be very helpful + * Q: Similar problem parsing semi-structured data, is there a way to turn off code interpreters and just pure statistical? + * Haven’t done it on this project + * Q: Use of these models applied to other areas in the library + * Harvard high density storage model, storage by size, accessions is challenging, remove a small paperback, replace with large volume. End up with “swiss cheese”, some policies. + * More efficient rules using ChatGPT for heat maps for high density storage. Row with most duplication, could be targeted. + * Storage Optimization.
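
The 2024-03-12 notes describe a checking pass in which every number element of the original item description must also appear in the new description, with anything missing being flagged. The following is only a minimal sketch of that idea in Python, not the presenter's actual process (which was run through prompts in the ChatGPT web interface). The function names and the sample descriptions are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter


def number_elements(description: str) -> Counter:
    """Count every run of digits found in an item description."""
    return Counter(re.findall(r"\d+", description))


def missing_numbers(original: str, revised: str) -> list:
    """Return number elements present in the original but absent from the revision."""
    missing = number_elements(original) - number_elements(revised)
    return sorted(missing.elements())


def flag_for_review(original: str, revised: str) -> bool:
    """True when the revised description dropped at least one number element."""
    return bool(missing_numbers(original, revised))


# Hypothetical descriptions, not taken from the project data.
print(missing_numbers("v.12 no.3 1987", "Volume 12, Number 3 (1987)"))
print(missing_numbers("Jahr 5 1990-1991", "Volume 5 (1990)"))
```

A check like this only confirms that no digits were lost. It cannot tell whether a four-digit number was wrongly treated as a date, which the notes identify as a separate failure mode.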