Optimize record iterator #3

Open · wants to merge 3 commits into base: main
43 changes: 33 additions & 10 deletions Notes.md
@@ -1,10 +1,10 @@
# Task in Refactoring

* datamodel/typesystem.ts should be types.ts and have all exported types
* _DONE_ datamodel/typesystem.ts should be types.ts and have all exported types
and only have types, currently it has some functions and other types are
defined elsewhere
defined elsewhere.

* External modules that are imported - need a way to handle these
* _DONE_ External modules that are imported - need a way to handle these
* getDataDB from 'sync' - returns the dataDB for a project_id
* shouldDisplayRecord from 'users' - used to filter records, might
be best to move this inside this library since it will be useful
@@ -19,26 +19,49 @@
(unless we also want users in conductor).

* in typesystem.ts we have isAttachment which checks to see if its argument
is an instance of Blob. Blob isn't there in node by default so we will need
is an instance of Blob. It's only used in the equality comparator in that file.
Blob isn't there in node by default so we will need
a better test that will work on both browser and node. Initially just having
that fn return false to get the tests to run

* the exported interface 'option' could do with a better name!
that fn return false to get the tests to run

* data_storage/validation.ts is about form validation and relies on getting the
uiSpec for a notebook via modules not included here. Remove it from here for
now but consider pulling all of the metadata manipulation code into here at some
later time

* to get a list of records we have getRecordsWithRegex (data_storage/index.ts)
* to get a list of records we have getRecordsWithRegex (data_storage/index.ts)
and getAllRecordsWithRegex (data_storage/queries.ts). getRecords calls getAllRecords
and then calls filterRecordMetadata to optionally remove deleted records and those
that shouldn't be visible to the user (shouldDisplayRecord). Both are
that shouldn't be visible to the user (shouldDisplayRecord). Both are
used in FAIMS but surely only getRecordsWithRegex should be used because we
always want to remove hidden records...actually the use of getAll is in my
code to navigate via QR code so is not correct. getAll should not be
exported here and that QR code (in gui/components/notebook/add_record_by_type.tsx)
should be modified.


* getMetadataForAllRecords is called in FAIMS3 to get the list of records to
display in the main data grid. It is slow with a large number of records even
if only one page will be displayed. Let's see what it does...

getMetadataForAllRecords: Promise<RecordMetadata[]>
* listRecordMetadata - to get a full list of record metadata
* if no list of record ids passed, call getAllRecords to get one
getAllRecords is a simple query to the DB for record_format_version: 1
* get the revisions for the list of record ids (getRevisions)
* get the HRID for each record (getHRID)
* filterRecordMetadata - to remove deleted and inaccessible records
* deleted is a property on the frev record but we first retrieve
all records and then filter on this rather than doing it in a query
* shouldDisplayRecord is called for each record, in FAIMS3 it returns
true unless the user does not have permissions on the cluster
* does a db query for the user info **for every record**
* parses the JWT to get permissions **for every record** (actually more than once)

getRecordsWithRegex: Promise<RecordMetadata[]>
* getAllRecordsWithRegex
* query the db for all records with avp_format_version: 1 (all avp records?)
* which is 'find an avp that matches this regexp somewhere'
* filtered by the regex
* generate a list of record ids from this query, deduplicate it
* pass to listRecordMetadata to get the actual records
* filterRecordMetadata
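
To make the cost profile concrete, here is a minimal sketch of that flow in TypeScript. It is pseudocode against hypothetical stand-ins (declared, not implemented) for getAllRecords, listRecordMetadata and shouldDisplayRecord as the notes describe them, not the library's real signatures.

```ts
// Hedged sketch of the getMetadataForAllRecords flow described above; the
// declared helpers are hypothetical stand-ins for the library functions the
// notes mention, not their real signatures.
type RecordMetadataSketch = {record_id: string; deleted: boolean};

declare function getAllRecordIds(project_id: string): Promise<string[]>;
declare function listRecordMetadata(
  project_id: string,
  record_ids: string[]
): Promise<RecordMetadataSketch[]>;
declare function shouldDisplayRecord(
  project_id: string,
  meta: RecordMetadataSketch
): Promise<boolean>;

async function getMetadataForAllRecordsSketch(
  project_id: string
): Promise<RecordMetadataSketch[]> {
  // one query for every record_format_version: 1 document in the notebook
  const ids = await getAllRecordIds(project_id);
  // revision and HRID lookups for every record, even if only one page is shown
  const metadata = await listRecordMetadata(project_id, ids);
  const visible: RecordMetadataSketch[] = [];
  for (const meta of metadata) {
    // deleted records are filtered here, after retrieval, not in the query
    if (meta.deleted) continue;
    // per-record permission check: a user-info DB query and a JWT parse
    // for every single record, which is the cost the notes call out
    if (await shouldDisplayRecord(project_id, meta)) visible.push(meta);
  }
  return visible;
}
```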
20 changes: 10 additions & 10 deletions package-lock.json

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion package.json
@@ -45,6 +45,6 @@
"pouchdb-adapter-memory": "^8.0.1",
"pouchdb-find": "^8.0.1",
"ts-jest": "^29.0.5",
"typescript": "^4.9.5"
"typescript": "^5.3.2"
}
}
1 change: 1 addition & 0 deletions src/data_storage/attachments.ts
@@ -109,6 +109,7 @@ export function attachment_to_file(
): File {
const content_type = attachment.content_type;
const data = (attachment as PouchDB.Core.FullAttachment).data;
// console.log('a2f', name, content_type, data);
return new File([data], name, {type: content_type});
}

10 changes: 9 additions & 1 deletion src/data_storage/databases.ts
@@ -104,6 +104,14 @@ export const addDesignDocsForNotebook = async (
}
}`,
},
avpAttachment: {
map: `function (doc) {
if (doc.avp_format_version === 1 && doc.faims_attachments)
doc.faims_attachments.forEach((att) => {
emit(doc._id, {_id: att.attachment_id});
});
}`,
},
revision: {
map: 'function (doc) {\n if (doc.revision_format_version === 1) \n emit(doc._id, 1);\n}',
},
@@ -126,7 +134,7 @@
documents[i]['_rev'] = existing['_rev'];
// are they the same?
if (isEqualObjects(existing, documents[i])) {
return;
continue;
}
}
await dataDB.put(documents[i]);
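
The new avpAttachment view emits {_id: att.attachment_id} as its value, so a query with include_docs: true resolves the linked attachment document as row.doc (CouchDB/PouchDB linked-documents behaviour); this is how getSomeRecords below pulls attachments in bulk. A minimal usage sketch, assuming dataDB is the notebook database returned by getDataDB:

```ts
// Sketch only: bulk-fetch attachment documents for a set of avp ids through
// the new avpAttachment view. `dataDB` and `avpKeys` are assumed to come
// from the caller, as in getSomeRecords.
const fetchAttachmentDocs = async (
  dataDB: PouchDB.Database,
  avpKeys: string[]
) => {
  const result = await dataDB.query('index/avpAttachment', {
    keys: avpKeys,      // one row per attachment on each matching avp
    include_docs: true, // resolve the emitted {_id: attachment_id} link
    attachments: true,  // inline the attachment content in the document
  });
  // row.id is the owning avp, row.doc is the linked attachment document
  return result.rows.map((row: any) => row.doc);
};
```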
96 changes: 81 additions & 15 deletions src/data_storage/index.ts
@@ -27,7 +27,7 @@

import {v4 as uuidv4} from 'uuid';

import {getDataDB} from '../index';
import {file_attachments_to_data, getDataDB} from '../index';
import {DEFAULT_RELATION_LINK_VOCABULARY} from '../datamodel/core';
import {
FAIMSTypeName,
@@ -53,7 +53,7 @@ import {
getHRID,
listRecordMetadata,
} from './internals';
import {getAllRecordsOfType, getAllRecordsWithRegex} from './queries';
import {getAllRecordsOfType, getAllRecordsWithRegex, queryAsMap} from './queries';
import {logError} from '../logging';

export function generateFAIMSDataID(): RecordID {
@@ -507,6 +507,7 @@ const hydrateRecord = async (
project_id,
record.revision
);

const result = {
project_id: project_id,
record_id: record.record_id,
@@ -543,20 +544,84 @@ export async function getSomeRecords(
options.startkey = bookmark;
}
try {
const res = await dataDB.query('index/recordRevisions', options);
let record_list = res.rows.map((doc: any) => {
return {
record_id: doc.id,
revision_id: doc.value._id,
created: doc.value.created,
conflict: doc.value.conflict,
type: doc.value.type,
revision: doc.doc,
// get a list of revisions
const revisions = await dataDB.query('index/recordRevisions', options);

// collect the avp keys in these revisions
let avpKeys: string[] = [];
revisions.rows.map((row: any) => {
avpKeys = avpKeys.concat(Object.values(row.doc.avps));
});

// now grab the attachments for these avps
const attOptions = {
include_docs: true,
attachments: true,
keys: avpKeys,
};

const attachments = await queryAsMap(dataDB, 'index/avpAttachment', attOptions);
const avps = await queryAsMap(dataDB, 'index/avp', {include_docs: true, keys: avpKeys});


let record_list = revisions.rows.map((row: any) => {
const doc = row.doc;
const data: {[key: string]: any} = {};
const annotations: {[key: string]: any} = {};
const types: {[key: string]: string} = {};

Object.keys(doc.avps).map((key: string) => {
const avp = avps.get(doc.avps[key]);
if (avp.data != undefined) {
data[key] = avp.data;
} else {
// we might have an attachment
if (avp.faims_attachments) {
const attachmentList: any[] = [];
avp.faims_attachments.forEach((a: any) => {
const attachment = attachments.get(a.attachment_id);
if (attachment) {
attachmentList.push(attachment);
}
});
data[key] = file_attachments_to_data(avp, attachmentList).data;
} else {
data[key] = '';
}
}
// get annotations
annotations[key] = avp.annotations;
// and types
types[key] = avp.type;
});

// get the HRID by finding a property starting with HRID_STRING
const hridKey = Object.keys(data).find((key: string) => {
return key.startsWith('HRID_STRING');
});

const result = {
project_id: project_id,
record_id: doc.record_id,
revision_id: doc._id,
created_by: doc.created_by,
updated: new Date(doc.created),
updated_by: doc.created_by,
deleted: doc.deleted ? true : false,
hrid: hridKey ? data[hridKey] : doc.record_id,
relationship: doc.relationship,
data: data,
annotations: annotations,
types: types,
created: new Date(row.value.created),
conflicts: row.value.conflict,
type: doc.type,
};
return result;
});
if (filter_deleted) {
record_list = record_list.filter((record: any) => {
return !record.revision.deleted;
return !record.deleted;
});
}
// don't return the first record if we have a bookmark
@@ -580,12 +645,14 @@ export const notebookRecordIterator = async (
) => {
const batchSize = 100;
const getNextBatch = async (bookmark: string | null) => {
const mark = performance.now();
const records = await getSomeRecords(
project_id,
batchSize,
bookmark,
filter_deleted
);
console.log('getNextBatch', performance.now() - mark);
// select just those in this view
return records.filter((record: any) => {
return record.type === viewID;
@@ -613,9 +680,8 @@
index = 1;
}
}
if (record) {
const data = await hydrateRecord(project_id, record);
return {record: data, done: false};
if (record) {
return {record: record, done: false};
} else {
return {record: null, done: true};
}
2 changes: 2 additions & 0 deletions src/data_storage/internals.ts
@@ -522,6 +522,7 @@ async function loadAttributeValuePair(
project_id: ProjectID,
avp: AttributeValuePair
): Promise<AttributeValuePair> {
const mark = performance.now();
const attach_refs = avp.faims_attachments;
if (attach_refs === null || attach_refs === undefined) {
// No attachments
@@ -547,6 +548,7 @@
attach_docs.push(doc);
}
});
console.log('loadAVP', performance.now() - mark);
return loader(avp, attach_docs);
}

28 changes: 26 additions & 2 deletions src/data_storage/queries.ts
@@ -73,6 +73,30 @@ export async function getAllRecordsWithRegex(
return avp.record_id;
});
// Remove duplicates, no order is implied
const deduped_record_ids = Array.from(new Set<RecordID>(record_ids));
return await listRecordMetadata(project_id, deduped_record_ids);
const unique_record_ids = Array.from(new Set<RecordID>(record_ids));
return await listRecordMetadata(project_id, unique_record_ids);
}

/**
* Turn a query result (array of documents) into a Map keyed by document id
* so that we can access results by id rather than iterating over the array
* to find documents.
*
* @param db a PouchDB database
* @param index an index on the database
* @param options options to pass to db.query
* @returns a Map with document ids as keys and documents as values
*/
export const queryAsMap = async (
db: PouchDB.Database,
index: string,
options: PouchDB.Query.Options<{}, {}>
) => {
const queryResult = await db.query(index, options);

const aMap = new Map<string, any>();
queryResult.rows.forEach((row: any) => {
aMap.set(row.doc._id, row.doc);
});
return aMap;
};
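
queryAsMap is what getSomeRecords uses to look up avp and attachment documents by id in constant time. A small usage sketch (a hypothetical wrapper, not part of this PR), mirroring the call made in getSomeRecords:

```ts
// Sketch: resolve a batch of avp documents keyed by their ids.
// queryAsMap reads row.doc, so include_docs must be true.
import {queryAsMap} from './queries';

export const loadAvpsById = async (
  dataDB: PouchDB.Database,
  avpKeys: string[]
): Promise<Map<string, any>> =>
  queryAsMap(dataDB, 'index/avp', {include_docs: true, keys: avpKeys});
```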