NCBI SOFTWARE DEVELOPMENT TOOLKIT
National Center for Biotechnology Information
Bldg 38A, NIH
8600 Rockville Pike
Bethesda, MD 20894
The NCBI Software Development Toolkit was developed for the production and
distribution of GenBank, Entrez, BLAST, and related services by NCBI. We make
it freely available to the public without restriction to facilitate the
use of NCBI by the scientific community. However, please understand that
while we feel we have done a high quality job, this is not commercial software.
The documentation lags considerably behind the software and we must make any
changes required by our data production needs. Nonetheless, many people have
found it a useful and stable basis for a number of tools and applications.
The toolkit is available by anonymous ftp from ftp.ncbi.nih.gov
cd toolbox
cd ncbi_tools
bin
get ncbi.tar.Z (compressed UNIX tar file)
quit
In this same directory are also ncbiz.exe (DOS self extracting archive) and
ncbi.hqx (Mac self extracting archive). All three files contain the same
source code and will make the toolkit for all platforms.
Please feel free to email questions/suggestions to:
If you would like hardcopy of the current documentation, send your mailing
address with your request to the email address above.
If you are considering a serious development project using this toolkit, please
contact us. We are happy to discuss compatible strategies and inform you of
our longer term plans. There is no limitation on the use of this code, or on
contacting us about its use, for commercial, academic, or government groups.
===========================================================================
Version 6.1
the date of release may be obtained from the file ncbi/VERSION
===========================================================================
Summary
The procedure for building the toolkit on UNIX has changed slightly.
It is no longer necessary to download a binary NCBI product for your
platform to obtain the platform-specific ncbi.mk file.
To build the NCBI toolkit you need to look for platform-dependent instructions:
For UNIX (including Linux and Mac OS X):
look at the file make/readme.unx
For alternative Mac instructions (using CodeWarrior):
look at the file make/readme.mac
For Microsoft Windows95/98/NT:
look at the file make/readme.dos
Information that may be useful when building the NCBI toolkit is
in the file doc/FAQ.txt
This release includes source code for the new (2.0.9) version of BLAST.
Look at the file doc/README.bls for more detailed documentation on
stand-alone BLAST.
The file doc/README.pbl has the information about PowerBLAST.
The description of Integrating Matrix Profiles And Local Alignments
(IMPALA) is located in the file doc/README.imp
The file doc/sequin.htm describes Sequin and its configuration.
If you have problems configuring Entrez with a firewall, look at the
file doc/firewall.txt
This file has a section called CONFIGURATION OR SETTINGS FILES,
which explains in detail how our configuration system works. The ncbi
config file (.ncbirc on UNIX, ncbi.ini on PC/Windows, and ncbi.cnf on
Macintosh) is needed in order to find data files, such as
gc.val (the genetic code table), provided in the toolkit or with programs
like Sequin. (The asnload files containing dynamic versions of the ASN.1
parse tables are no longer needed, since all platforms can now have large
static data.)
It has recently become possible to eliminate the need for the ncbi config
file by calling UseLocalAsnloadDataAndErrMsg () at the beginning of your
program. This looks for the data directory in the same directory as the
running program. If it doesn't find it, it looks up one level, in case you
are compiling programs in the build directory of the toolkit. If it finds
the data directory in either of these places, it transiently sets the
location, so code that loads these files is given the correct path.
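The search order described above can be sketched in a few lines of C. This is
only an illustration of "look beside the program, then one level up"; it is
not the source of UseLocalAsnloadDataAndErrMsg (). The helper name
find_dir_near_program, its parameters, and the use of POSIX stat() are all
assumptions made for the example.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path names an existing directory, 0 otherwise. */
static int dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/*
 * Hypothetical helper (not a toolkit function): look for dir_name first in
 * prog_dir, then one level above it. On success the path is written to out
 * and 1 is returned; otherwise out is set to "" and 0 is returned.
 */
int find_dir_near_program(const char *prog_dir, const char *dir_name,
                          char *out, size_t out_size)
{
    static const char *const prefixes[] = { "", "../" };
    size_t i;

    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        int n = snprintf(out, out_size, "%s/%s%s",
                         prog_dir, prefixes[i], dir_name);
        if (n < 0 || (size_t)n >= out_size)
            continue;               /* path did not fit; skip this candidate */
        if (dir_exists(out))
            return 1;
    }
    if (out_size > 0)
        out[0] = '\0';
    return 0;
}
```

A program would call this with the directory of the running executable and
the name "data", and fall back to the built-in copies of the data files when
it returns 0.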
An even more recent change is that copies of several of our data files (gc,
seqcode, and featdef) are now built into the source code, so if the data
directory is not found, programs that require only these can still run.
One final improvement is that access to our network services is now much
simpler than before, so if you are not behind a firewall and have a domain
name server (DNS) available you can connect to our network without needing
any configuration information in the ncbi config file. Operation behind a
firewall, or with a proxy, requires very little in the ncbi config file, and
this is easily created by asking Sequin to configure for network access.
=============================================================================
Notes from Previous Releases
=============================================================================
=============================================================================
Version 6.0
the date of release may be obtained from the file ncbi/VERSION
=============================================================================
This release includes source code for the new (2.0) version of BLAST.
Also included are a small number of incremental changes in the ASN.1
specification.
BLAST 2.0 - BLAST 2.0 can produce gapped alignments and is capable of
position-specific-iterated BLASTp (PSI-BLAST). Compared to the 1.4 release of
BLAST, there are also significant performance enhancements as well as extensive
changes to the text report and the format of the databases. BLAST 2.0
uses threads for multi-processing, using the NCBI threads library.
Three BLAST programs may be compiled in the demo directory. They are:
formatdb: formats FASTA files as BLAST databases for BLAST 2.0.
blastall: performs all five flavors of BLAST comparison.
    blastn and blastp offer fully gapped alignments.
    blastx and tblastn have 'in-frame' gapped alignments and use sum
    statistics to link alignments from different frames.
    tblastx provides only ungapped alignments.
blastpgp: performs gapped blastp searches and can be used to perform
    iterative searches in PSI-BLAST mode.
Additional information may be obtained from the README in the BLAST
directory of the FTP site and from the NCBI BLAST pages.
ASN.1 Spec Changes for 1997
biblio.asn
Cit-pat - some fields made optional to allow patent applications to be legal
Cit-pat.number OPTIONAL
Cit-pat.date-issue OPTIONAL
-- Patent number and date-issue were made optional in 1997 to
-- support patent applications being issued from the USPTO
-- Semantically a Cit-pat must have either a patent number or
-- an application number (or both) to be valid
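Because both fields are now optional in the specification, the rule in the
comment above has to be checked by software. A minimal sketch of such a check
follows. ToyCitPat is a hypothetical, simplified stand-in for a loaded
Cit-pat, not the toolkit's own structure, and the function name is likewise
invented for the example.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a loaded Cit-pat; NULL means absent. */
typedef struct {
    const char *number;       /* patent number */
    const char *app_number;   /* application number */
} ToyCitPat;

/* Returns 1 if s points to a non-empty string. */
static int has_text(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Returns 1 if the record carries a patent number, an application number,
 * or both; returns 0 otherwise. */
int toy_cit_pat_is_valid(const ToyCitPat *cp)
{
    return cp != NULL && (has_text(cp->number) || has_text(cp->app_number));
}
```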
medline.asn
added ML-field to support other MEDLINE line types
Medline-entry ::= SEQUENCE {
uid INTEGER OPTIONAL , -- MEDLINE UID, sometimes not yet available if from PubMed
em Date , -- Entry Month
... (not shown)
pmid PubMedId OPTIONAL , -- MEDLINE records may include the PubMedId
pub-type SET OF VisibleString OPTIONAL, -- may show publication types (review, etc)
mlfield SET OF Medline-field OPTIONAL } -- additional Medline field types
Medline-field ::= SEQUENCE {
type INTEGER { -- Keyed type
other (0) , -- look in line code
comment (1) , -- comment line
erratum (2) } , -- retracted, corrected, etc
str VisibleString , -- the text
ids SEQUENCE OF DocRef OPTIONAL } -- pointers relevant to this text
DocRef ::= SEQUENCE { -- reference to a document
type INTEGER {
medline (1) ,
pubmed (2) ,
ncbigi (3) } ,
uid INTEGER }
seq.asn
MolInfo.tech - added names for HTG classes already implemented
Annotdesc.region - added seqloc. If present, all annots in this SeqAnnot
are within this region. Optimization on big seqs.
seqfeat.asn
added OrgMod.specimen-voucher - new organism qualifier
added OrgMod.old-name - used internally at NCBI
added BioSource.is-focus - for distinguishing biological focus of
multiple source features.
added Seq-feat.pseudo so any feature can be flagged explicitly as
belonging to a pseudogene
added Seq-feat.except-text for an explanation of the exception when
Seq-feat.except is TRUE. Currently this text is in Seq-feat.comment
in backbone records and GBQuals in some other genbank records.
=============================================================================
Notes from Previous Releases
=============================================================================
Version 5.0
Summary
This release includes a small number of incremental changes in the ASN.1
specification. Most significant is the addition of the PubMedID, a
bibliographic citation identifier similar to a MEDLINE UID. PubMed is a new
citation database being developed at NCBI which is a superset of MEDLINE. It
will be an avenue by which publishers can deposit electronic versions of their
citations and abstracts to allow timely linking to network Entrez from
the publishers' on-line services. PubMed will route these citations to MEDLINE
and they will appear in MEDLINE (and Entrez) after the usual MEDLINE indexing.
However, for some period of time, such articles will have only a PubMedID.
We would like to switch Entrez over to supporting PubMedIDs as early as
possible. WE STRONGLY ENCOURAGE DEVELOPERS TO RECOMPILE AND RELINK WITH THIS
VERSION OF THE TOOLKIT AS SOON AS POSSIBLE. The changes in this specification
should not cause problems with existing software, so a simple compile and
link should be enough to make you compatible. Details of ASN.1 specification
changes are listed below.
There has been considerable development of the toolkit in other aspects as
well, many of which are embodied in sequin, the new NCBI direct submission
tool, which is included in the toolkit as well. In the interest of getting the
PubMed changes into the specification and developers' hands promptly, we have
not included much on that aspect of this toolkit at this time.
Changes in the 1996 NCBI ASN.1 (version 5.0) specification
Once again, there are very few changes to the NCBI ASN.1 specification this
year. The biggest change is the addition of the PubMed ID to support the new
NCBI PubMed database. There are also small additions to the medline and
organism specifications, detailed below. As usual, these changes are also
backward compatible with old data. However, you should recompile and relink
your applications as soon as possible, since the old applications will not be
compatible with the new datatypes.
1) PubMed - NCBI is building a new citation database that is a superset of
MEDLINE and which will be linked to online journals from publishers. The
bibliographic components of the specification have had support for PubMed IDs
added. These include biblio.asn (objbibli.[ch]), pub.asn (objpub.[ch]),
medline.asn (objmedli.[ch]).
2) pub-type - MEDLINE includes strings indicating the type of a publication.
The medline definition has had the attribute pub-type added to support these
strings.
From the 1996 MeSH, here's the list.
Abstract
Bibliography
Classical Article
Clinical Conference
Clinical Trial
Clinical Trial, Phase I
Clinical Trial, Phase II
Clinical Trial, Phase III
Clinical Trial, Phase IV
Comment
Consensus Development Conference
Consensus Development Conference, NIH
Controlled Clinical Trial
Corrected and Republished Article
Current Biog-Obit
Dictionary
Directory
Duplicate Publication
Editorial
Festschrift
Guideline
Historical Article
Historical Biography
Interview
Journal Article
Legal Brief
Letter
Meeting Report
Meta-Analysis
Monograph
Multicenter Study
News
Newspaper Article
Overall
Periodical Index
Practice Guideline
Published Erratum
Randomized Controlled Trial
Retracted Publication
Retraction of Publication
Review
Review Literature
Review of Reported Cases
Review, Academic
Review, Multicase
Review, Tutorial
Scientific Integrity Review
Technical Report
Twin Study
3) virion - the attribute virion has been added to BioSource.genome. It just
complements proviral which was already there. This will map to a /virion
qualifier in the new GenBank feature table definition.
4) division - OrgName.div now (optionally) can contain the GenBank division code
(eg. PRI).
5) signal-peptide, transit-peptide - were added to Prot-ref, to support
annotation of protein features on the protein sequence in a way that could be
mapped to a GenBank feature table.
That's all. Relevant sections of the asn.1 specification are shown below.
================================================================================
biblio.asn
PubMedId ::= INTEGER -- Id from the PubMed database at NCBI
and..
Cit-gen ::= SEQUENCE { -- NOT from ANSI, this is a catchall
cit VisibleString OPTIONAL , -- anything, not parsable
authors Auth-list OPTIONAL ,
muid INTEGER OPTIONAL , -- medline uid
journal Title OPTIONAL ,
volume VisibleString OPTIONAL ,
issue VisibleString OPTIONAL ,
pages VisibleString OPTIONAL ,
date Date OPTIONAL ,
serial-number INTEGER OPTIONAL , -- for GenBank style references
title VisibleString OPTIONAL , -- eg. cit="unpublished",title="title"
pmid PubMedId OPTIONAL } -- PubMed Id
pub.asn
Pub ::= CHOICE {
gen Cit-gen , -- general or generic unparsed
sub Cit-sub , -- submission
medline Medline-entry ,
muid INTEGER , -- medline uid
article Cit-art ,
journal Cit-jour ,
book Cit-book ,
proc Cit-proc , -- proceedings of a meeting
patent Cit-pat ,
pat-id Id-pat , -- identify a patent
man Cit-let , -- manuscript, thesis, or letter
equiv Pub-equiv, -- to cite a variety of ways
pmid PubMedId } -- PubMedId
medline.asn
-- a MEDLINE or PubMed entry
Medline-entry ::= SEQUENCE {
uid INTEGER OPTIONAL , -- MEDLINE UID, sometimes not yet available if from PubMed
em Date , -- Entry Month
cit Cit-art , -- article citation
abstract VisibleString OPTIONAL ,
mesh SET OF Medline-mesh OPTIONAL ,
substance SET OF Medline-rn OPTIONAL ,
xref SET OF Medline-si OPTIONAL ,
idnum SET OF VisibleString OPTIONAL , -- ID Number (grants, contracts)
gene SET OF VisibleString OPTIONAL ,
pmid PubMedId OPTIONAL , -- MEDLINE records may include the PubMedId
pub-type SET OF VisibleString OPTIONAL } -- may show publication types (review, etc)
seqfeat.asn
OrgName ::= SEQUENCE {
name CHOICE {
binomial BinomialOrgName , -- genus/species type name
virus VisibleString , -- virus names are different
hybrid MultiOrgName , -- hybrid between organisms
namedhybrid BinomialOrgName , -- some hybrids have genus x species name
partial PartialOrgName } OPTIONAL , -- when genus not known
attrib VisibleString OPTIONAL , -- attribution of name
mod SEQUENCE OF OrgMod OPTIONAL ,
lineage VisibleString OPTIONAL , -- lineage with semicolon separators
gcode INTEGER OPTIONAL , -- genetic code (see CdRegion)
mgcode INTEGER OPTIONAL , -- mitochondrial genetic code
div VisibleString OPTIONAL } -- GenBank division code
BioSource ::= SEQUENCE {
genome INTEGER { -- biological context
unknown (0) ,
genomic (1) ,
chloroplast (2) ,
chromoplast (3) ,
kinetoplast (4) ,
mitochondrion (5) ,
plastid (6) ,
macronuclear (7) ,
extrachrom (8) ,
plasmid (9) ,
transposon (10) ,
insertion-seq (11) ,
cyanelle (12) ,
proviral (13) ,
virion (14) } DEFAULT unknown ,
origin INTEGER {
unknown (0) ,
natural (1) , -- normal biological entity
natmut (2) , -- naturally occurring mutant
mut (3) , -- artificially mutagenized
artificial (4) , -- artificially engineered
synthetic (5) , -- purely synthetic
other (255) } DEFAULT unknown ,
org Org-ref ,
subtype SEQUENCE OF SubSource OPTIONAL }
Prot-ref ::= SEQUENCE {
name SET OF VisibleString OPTIONAL , -- protein name
desc VisibleString OPTIONAL , -- description (instead of name)
ec SET OF VisibleString OPTIONAL , -- E.C. number(s)
activity SET OF VisibleString OPTIONAL , -- activities
db SET OF Dbtag OPTIONAL , -- ids in other dbases
processed ENUMERATED { -- processing status
not-set (0) ,
preprotein (1) ,
mature (2) ,
signal-peptide (3) ,
transit-peptide (4) } DEFAULT not-set }
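As an illustration of how an application might present the BioSource.genome
values listed above, the following sketch maps the integer to its name. The
names and numbers are copied from the listing; the function toy_genome_name
is hypothetical and is not part of the toolkit. Returning NULL for values
outside the table shows one way older code can cope with values added in a
later specification.

```c
#include <stddef.h>

/* Names indexed by the BioSource.genome INTEGER value (0 through 14). */
const char *toy_genome_name(int value)
{
    static const char *const names[] = {
        "unknown", "genomic", "chloroplast", "chromoplast", "kinetoplast",
        "mitochondrion", "plastid", "macronuclear", "extrachrom", "plasmid",
        "transposon", "insertion-seq", "cyanelle", "proviral", "virion"
    };

    /* Values this table does not know about are reported as NULL. */
    if (value < 0 || (size_t)value >= sizeof names / sizeof names[0])
        return NULL;
    return names[value];
}
```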
=============================================================================
Notes from Previous Releases
=============================================================================
New Functions in Version 4.0
There are a host of new functions in this release, but as usual we have not
managed to make time to document them all. Large parts of Sequin are present
which will be announced and described more fully in the fall. However,
specific tools of immediate interest are:
blast2 - this is the long awaited BLAST client/server which permits structured
interaction with BLAST over the internet. We have provided a basic client
that produces the traditional blast output. In addition, the function call
interface can be used in more elaborate clients. For more information
contact Tom Madden, [email protected]
WARNING!!! blast2 is the client we plan to support on the longer term.
The blast1 client we included for those of you who wanted a head start
will NOT be supported in future. Please shift any blast1 clients to the
(very similar) blast2 interface as soon as possible.
sim, sim2 - protein and DNA sequence alignments in linear space. This is
the function call interface to these valuable tools. Applications have
been written which are available by ftp as are published papers. For more
information contact Jinghui Zhang, [email protected]
Changes in ASN.1 spec 4.0 from 3.0
Affil - biblio.asn
added the field "postal-code" for Zip code finally.
Contact-info - submit.asn
added the field "contact" which is type "Author". The contact info has
evolved into a fully structured form, so I just took Author which has
structured names and structured address (Affil). We will eventually
phase out all the less structured ones in Contact-info.
OrgName - seqfeat.asn
added "lineage", "gcode", "mgcode" for the lineage, genetic code, and
mitochondrial genetic code. This is part of Org-ref, and consolidates
all the organism info (except original SOURCE line) out of the
GenBank block... and enables us to deliver it nicely from Taxon.
Seq-descr - seq.asn
removed the Seq-descr "neighbors" and replaced it with "dbxref", since
neighbors has never been used. This is used to add cross-references to
the whole entry.
Pubdesc - seq.asn
has an added slot, "reftype" which is an integer and is used to
indicate the GenBank usage of a reference.
0 - seq - applies to the sequence. This is the default and the way it is
used now.
1 - sites - applies to (unspecified) features. Equivalent to a GenBank
SITES feature. We could switch to this from using the
Imp-feat we do now.
2 - feats - applies to specific features. The idea here is to provide a
place for the full citation, so features need only reference
it. If no features reference it, it should be removed. This
would work for checking content when only a part of a sequence
is copied or pasted. A "sites" ref could not have this check
since we do not know which features it goes to.
Seq-feat - seqfeat.asn
added a slot called "dbxref" to Seq-feat. This is a SET OF Dbtag. It will
be for adding the new db_xref qualifiers to features. We already have some
of these in the xref slots of Gene-ref, Prot-ref, Org-ref. It means we have
to check two places in these cases. I do not want to retire the slots
since these were meant to be used in other contexts besides features.. and
Org-ref already is.
added a slot called "anticodon" to the tRNA extension of the RNA feature.
This is a Seq-loc that points to the location of the anticodon in a tRNA.
We have been populating this data in a User-object, and will have to do
a retro to convert it.
EXPORTED Genetic-code
Seq-align - seqalign.asn
added "bounds" to Seq-align so you can record the regions over which
an alignment was computed.. not always included in the resulting alignment
itself.
added two new types:
A) Packed-seg -- a denser representation from Colombe and Jinghui
B) disc - discontinuous alignments as a SEQUENCE OF Seq-align
Seq-annot - seq.asn
added a field to Seq-annot, Align-def, to discriminate types of
alignment sets. This has the advantage of minimal changes as well as
separating sets of alignments from conceptually single alignments. I am
not sure it is necessary to distinguish "alt" from "blocks" though. Also
it means you can attach more info, with other Seq-annot fields and/or by
expanding the Align-def. I put in "ids" in Align-def specifically to put
the one Seq-id that is the "master" for type "ref". I made it a SET OF
so we could use it for other collections where we might want to list
more than one.
added "ids" and "locs" as allowed types within Seq-annot. This would
enable us to pass lists like this around between tools with all the
additional descriptive information in Annotdesc. I know this will be
useful.
added "general" to Annot-id for tracking 3rd party annotations.
Introduction
This distribution is release 5.0 of the NCBI core library for building
portable software, and AsnLib, a collection of routines for handling ASN.1
data and developing ASN.1 software applications. AsnLib and the asntool
application are built using the CoreLib routines. In the \doc directory is an
MS Word file which details the information given below. It is also available
as hardcopy. See the README in \doc.
The lowest layer of code is the CoreLib. These are multi-
platform functions for memory allocation (including byte stores), string
manipulation, file input and output, error and general messages, and
time and date notification. These functions have been written only
where we found that the existing ANSI functions were not sufficiently
multi-platform or well-behaved among all of the platforms that we
support. For each platform (a combination of processor, operating
system, compiler, and windowing system), we supply a specific ncbilcl.h
file, which contains typedefs and defines for multi-platform symbols,
and includes a number of standard header files. (For example,
ncbilcl.msw is used for the Microsoft C compiler under Microsoft Windows
on the PC.) Use of these symbols, and of the functions in the CoreLib,
allow us to write multi-platform source code for a variety of disparate
platforms.
The next layer of code is the AsnLib stream reader. This is
used in conjunction with a header file and a parse table loader file,
both of which are produced by processing the formal ASN.1 specification
with the AsnTool application. The symbolic defines in the
header file are pointers into the parse table, in which the ASN.1
specification is represented. To read at the stream reader level, a
program alternates between calls to AsnReadId and AsnReadVal. AsnReadId
returns a pointer into the parse table, which can be compared against
the defines in the AsnTool-generated header. For example, in the
specification for MEDLINE records, the Medline-entry section has an item
called "uid", for the unique ID of the record. This is symbolized in
the header file as MEDLINE_ENTRY_uid. When AsnReadId returns this
symbol, the program calls AsnReadVal to obtain the uid for that record.
AsnKillValue is also needed to free any memory allocated by AsnReadVal,
which occurs when the value is a string and not an integer. The entire
set of records on the Entrez CD-ROM can be read as a single stream with
the AsnLib functions.
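The alternation between reading an identifier and reading its value can be
sketched with a toy stream. Everything in this example (the ToyId symbols,
the ToyStream type, and the toy_read_id, toy_read_val and toy_kill_value
functions) is a hypothetical stand-in written for illustration. It mimics the
calling pattern described above and does not reproduce the signatures of
AsnReadId, AsnReadVal or AsnKillValue.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for parse-table symbols such as MEDLINE_ENTRY_uid. */
typedef enum { TOY_END = 0, TOY_UID, TOY_ABSTRACT } ToyId;

/* One id/value pair in the toy stream; text is NULL for integer values. */
typedef struct { ToyId id; long intvalue; const char *text; } ToyItem;
typedef struct { const ToyItem *items; size_t pos; } ToyStream;
typedef struct { long intvalue; char *ptrvalue; } ToyVal;

/* Reports which item comes next without consuming it. */
ToyId toy_read_id(ToyStream *s)
{
    return s->items[s->pos].id;
}

/* Consumes the next item; string values are copied into fresh memory. */
int toy_read_val(ToyStream *s, ToyVal *v)
{
    const ToyItem *it = &s->items[s->pos++];
    v->intvalue = it->intvalue;
    v->ptrvalue = NULL;
    if (it->text != NULL) {
        v->ptrvalue = malloc(strlen(it->text) + 1);
        if (v->ptrvalue == NULL)
            return 0;
        strcpy(v->ptrvalue, it->text);
    }
    return 1;
}

/* Frees memory that toy_read_val allocated for a string value. */
void toy_kill_value(ToyVal *v)
{
    free(v->ptrvalue);
    v->ptrvalue = NULL;
}

/* Walks the stream, keeps every uid, and frees every value after use.
 * Returns the number of uids stored in uids[]. */
size_t collect_uids(ToyStream *s, long *uids, size_t max)
{
    size_t n = 0;
    ToyId id;
    ToyVal v;

    while ((id = toy_read_id(s)) != TOY_END) {
        if (!toy_read_val(s, &v))
            break;
        if (id == TOY_UID && n < max)
            uids[n++] = v.intvalue;
        toy_kill_value(&v);
    }
    return n;
}
```

The point of the pattern is that the program never builds the whole record in
memory: it inspects each identifier, keeps the values it cares about, and
releases the rest immediately.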
The ASN.1 records may be accessed at a higher level through the object
loaders, which utilize the stream processing functions to
load C memory structures with the contents of the ASN.1 objects. For
each ASN.1 object we specify, we also define an equivalent C memory
structure. The object loader level of code contains functions to read
and write each ASN.1 object. These are hierarchical, as are the ASN.1
specifications. Calling the top level loader, SeqEntryAsnRead, will
load an entire SeqEntry from an open AsnIo channel, and will return a
pointer to the loaded memory structure. The read function for an AsnIo
channel can be swapped to refer to a normal disk file, a network socket,
or to compressed data, which it automatically decompresses. The object
loader code can interconvert between the highly-branched memory object
and a linear ASN.1 message with complete fidelity. The object loaders
have additional functions, including the ability to explore the
structure and notify the program when particular data elements are
encountered. The entire contents of the Entrez CD-ROM can also be
streamed through the object loaders. However, most calls to the object
loaders for simply reading a particular record are done via the data
access functions (see below).
The data access functions allow a program to call the object loaders on
a sequence or MEDLINE record given the uid of the record.
This will get the data into memory regardless of whether the data are
compressed on the Entrez CD-ROM or are obtained through a service over
the Internet. This means that a detailed understanding of the files and
formats on the Entrez disc is not needed by application programmers. The
function to load a sequence record, SeqEntryGet, needs the uid to
retrieve and a complexity code parameter. A sequence record is in the
form of a NucProt set. This contains a nucleotide (which may itself be
composed of segments) and all of the proteins it is known to encode.
The set of segments is called a SegSet, and the individual sequences are
called BioSeqs. We have taken the liberty of producing this integrated
view, but the complexity code parameter allows the record to be easily
loaded in a simpler, more traditional form, if desired. The accession
number term list is built to supply the proper uids to support this
facility. This access library is compatible with Entrez release 1.0 or
later only.
The sequence utilities and application programmer interface layer
allows exploration of the loaded memory structures and
generation of standard literature or sequence reports from those
objects. For example, a BioSeq can be converted to FASTA or GenBank
flat file formats and saved to a file, and a MEDLINE record can be saved
in MEDLARS format, which is suitable for entry into personal
bibliographic database programs. A sequence port can be opened that
gives a simple, linear view of a segmented sequence, converting
alphabets, merging exon segments, and dealing with information on both
strands of the DNA. This layer also includes some functions to explore
the NucProt set. The explore functions visit each individual BioSeq in
the set, calling a callback function for each sequence node so that a
program can examine feature tables and other information that are
associated with the NucProt or SegSets or with the individual sequences.
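The callback style used by the explore functions can be illustrated with a
small self-contained walk over a nested set. The ToyEntry structure, the
toy_explore function and the tally_sequence callback are hypothetical names
invented for this sketch; they are far simpler than the toolkit's own
structures and are not its API.

```c
#include <stddef.h>

/* Hypothetical toy node: a set (label == NULL) or a single sequence. */
typedef struct ToyEntry {
    const char *label;              /* sequence name, or NULL for a set */
    long length;                    /* sequence length; unused for sets */
    struct ToyEntry **children;     /* members of a set */
    size_t n_children;
} ToyEntry;

typedef void (*ToyVisitFn)(const ToyEntry *seq, void *userdata);

/* Depth-first walk that calls visit once for every sequence node. */
void toy_explore(const ToyEntry *entry, void *userdata, ToyVisitFn visit)
{
    size_t i;

    if (entry == NULL)
        return;
    if (entry->label != NULL) {
        visit(entry, userdata);
        return;
    }
    for (i = 0; i < entry->n_children; i++)
        toy_explore(entry->children[i], userdata, visit);
}

/* Example callback state: how many sequences were seen and their total length. */
typedef struct { size_t count; long total_length; } ToyTotals;

void tally_sequence(const ToyEntry *seq, void *userdata)
{
    ToyTotals *t = (ToyTotals *)userdata;
    t->count++;
    t->total_length += seq->length;
}
```

The userdata pointer lets the caller carry its own state through the walk, so
the same explore routine can serve many different callbacks.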
Vibrant is a multi-platform user interface development library that runs
on the Macintosh, Microsoft Windows on the PC, or X11 and OSF/Motif on
UNIX and VAX computers [separate documentation]. It is used to build
the graphical interface for the Entrez application (whose source code is
in the browser directory). The philosophy behind Vibrant is that
everything in the published user interface guidelines (the generic
behavior of windows, menus, buttons, etc.), as well as positioning and
sizing of graphical control objects, is taken care of automatically.
The program provides callback functions that are notified when the user
has manipulated an object. Vibrant and Entrez code are not supported,
but are provided on an as-is basis.
The advantage of using AsnLib and the object loaders, as they are
implemented, is that application program developers merely need to
recompile their programs with the new (AsnTool-generated) header files
and load the new parse tables (included with the Entrez software) in
order to be able to read the new data. This process is straightforward,
and will not break existing program code. The application is free to
ignore new fields if it does not choose to take advantage of the new
kinds of information.
When developing new ASN.1 specifications, as of June 1994 it is possible to
automatically generate the object loaders and header files for those
specifications, using the AsnCode utility. For some complex ASN.1
specifications, however, AsnCode may fail to generate the correct source code.
The documentation is currently being brought up to date. The programs
in the demo directory are designed to teach the proper use of many of
the functions discussed above. Many of these programs are not yet
documented. The simplest is testcore.c, which tests various functions
in the CoreLib. The most complex is getfeat.c, which takes an accession
number or locus name, determines the unique seq ID, retrieves the entry
from the Entrez CD-ROM using the data access library, locates all coding
region features using the explore functions, and prints the DNA
sequences of all exons using sequence port functions. If you cannot
extract and print the doc.tar.Z file, please send an email message with
your land mailing address and phone number to [email protected],
and we will mail a copy to you.
The contents of the ncbi directory (the highest level, containing the
NCBI Software Development Kit source code in several subdirectories) are
shown below. The readme file contains instructions on copying the
appropriate make files to the build directory, where the build takes
place. The makeall file copies headers to the include directory and
builds four libraries (ncbi, ncbiobj, ncbicdr and vibrant), copying them
to the lib directory. The makedemo file builds the demo programs and the
Entrez application:
api        Application Programmer Interface, Sequence Utilities
asn        ASN.1 specifications for publications and sequences
asnlib     Source code for AsnLib and asntool
asnload    AsnLib headers and dynamic parse tables (Mac and PC)
asnstat    AsnLib headers that use static memory (UNIX and VMS)
bin        Asntool executable copied here
biostruc   Source code for Molecular Modelling DataBase functions
browser    Source code for Entrez application
build      Empty directory for building tools and libraries
cdromlib   Access routines for data on the Entrez CD-ROM
cn3d       Source code for Vibrant-based 3D structure viewer
config     Configuration files for NCBI software:
    mac
    unix
    vms
    win
corelib    Source code for NCBI Core Software Library
data       Data files used for sequence conversion
demo       AsnLib and sequence utility demonstration programs
desktop    Source code for Vibrant-based viewers and editors
doc        Documentation in Microsoft Word file
include    Include files required by applications are copied here
lib        Libraries copied here
link       Contains several subdirectories with build accessory files:
    macmet     Macintosh Metrowerks/CodeWarrior
    macmpw     Macintosh MPW C
    mswin      Microsoft C and Borland C for Windows
make       Make files for various systems
network    Network version of data access
    apple
    blast2
    encrypt
    entrez
    netmanag
    nsclilib
object     Functions for reading and writing complex objects
sequin     Source code for Sequin application
tools      Source code for alignment and other contributed utilities
readme     File that contains important building instructions
vibrant    Source code for Vibrant portable interface package
The platforms that are supported (as indicated by the suffix on the
relevant ncbilcl.h file) are shown below. Those marked with an asterisk
(*) are available as-is:
370*   IBM 370
acc    SUN acc compiler
alf    DEC Alpha under OSF/1
aov    DEC Alpha under AXP/OpenVMS
aux*   Macintosh A/UX
bor    Borland for DOS
bwn    Borland for Microsoft Windows
ccr    CenterLine CodeCenter
cpp    SUN C++
cra*   Cray
cvx*   Convex
gcc    Gnu gcc (under SunOS, not Solaris)
hp*    Hewlett Packard
lna*   Linux on DEC Alpha
lnx    Linux (RedHat Linux release 5.2 with kernel 2.0.36)
met    Macintosh Metrowerks compiler
mpw    Macintosh Programmer's Workshop
msc    Microsoft C for DOS
msw    Microsoft for Windows
nxt*   NeXT
r6k*   IBM RS 6000
scr    CodeCenter under Sun Solaris
sgi    Silicon Graphics
sin    Sun Solaris on Intel processors
sol    Sun Solaris (for cc and gcc)
thc    THINK C on Macintosh
ult    DEC ULTRIX
vms    DEC VAX/VMS
Questions or comments can be directed to [email protected].
ANSI C:
This software requires an ANSI C compiler. This is a problem only on Sun
machines, where the bundled C compiler, cc, is non-ANSI. There you can use
either the Sun unbundled compiler, acc, or the Gnu compiler, gcc (which is
free); both work fine. If you have written applications on the Sun with
non-ANSI functions, the ANSI compilers will complain. See the notes below
if this is a problem.
Installation
To build the NCBI toolkit you need to look for platform-dependent instructions:
For UNIX:
look at the file make/readme.unx
For Mac:
look at the file make/readme.mac
For Microsoft Windows95/98/NT:
look at the file make/readme.dos
The file doc/FAQ.txt contains some information that may be useful when
building the NCBI toolkit.
ALL -
change to the directory above the ncbi subdirectory
Unix
tested on Sun Sparc (Solaris 2.6, SunOS 4.1.3),
Silicon Graphics IRIX 5.* and 6.*, DEC Alpha with OSF/1 V5.1,
Linux (Red Hat Linux release 6.2 with kernel 2.2.16) on Intel,
Sun Solaris for Intel (Solaris 2.7).
Run the script ncbi/make/makedis.csh, saving its output in a
separate file:
for sh or bash:
ncbi/make/makedis.csh 2>&1 | tee out.makedis.csh
for csh or tcsh:
ncbi/make/makedis.csh |& tee out.makedis.csh
If that script gives you an error like this:
Your platform is not supported.
To port ncbi toolkit to your platform consult
the files platform/*.ncbi.mk
then you should check the script ncbi/make/makedis.csh and
add the proper platform-dependent ncbi.mk file to the ncbi/platform
directory.
Other UNIX: AIX, ULTRIX, NeXT, Sun acc
Follow the models above. Read the headers in makeall.unx and
makedemo.unx for details.
For all UNIX, edit .ncbirc as described in section "CONFIGURATION OR
SETTINGS FILES".
Optionally, edit .login to add "setenv NCBI [path to .ncbirc file]"
(csh setenv takes the name and the value separated by a space, not "=").
MS-DOS
look at the file make/readme.dos
Mac
tested on CodeWarrior IDE 2.1, MacOS 8.0
All - copy config:mac:ncbi.cnf to your System Folder, or to the
System Folder:Preferences subfolder
edit the "ASNLOAD" line in "ncbi.cnf" to point to the
ncbi:asnload directory in this release
edit the "DATA" line to point to the ncbi:data directory
CodeWarrior - raise Preferred Size of Script Editor from 700 to 3000,
and raise Preferred Size of CodeWarrior IDE 2.1 by
2000 (e.g., from 8206 to 10206), using Get Info from
the Finder.
to compile for MC680x0 platform (default is PowerPC),
change property MASTER from "PPC" to "68K".
run copyhdrs.met
run makeall.met
run makenet.met
run makedemo.met
Think C - no longer supported
MPW C - no longer supported
Changes to VMS make file naming conventions:
The old .dcl suffix (last character is a lower case L) was changed
to .dc1 (last character is the numeral 1) to allow for different make files
for DecWindows 1.1 and DecWindows 1.2. Several new .dc2 files were
contributed by David Mathog of CalTech. A synopsis of his additional
instructions:
VAX C   DecWindows 1.1   Use .dc1 files.
DEC C   DecWindows 1.1   Use .dc1 files, but change cc to
                         cc/standard=vaxc
VAX C   DecWindows 1.2   This combination has not been tested.
DEC C   DecWindows 1.2   Use .dc2 files.
VMS (without Vibrant) on VAX
$set def [ncbi.build]
$copy [-.make]*.dc1 *.com
$@makeall
check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
To make demos:
$@makedemo
VMS (with Vibrant) on VAX
$set def [ncbi.build]
$copy [-.make]*.dc1 *.com
$@viball
check ncbi.cfg as described in section "CONFIGURATION OR SETTINGS FILES".
edit LOGIN.COM to "define NCBI [path to ncbi.cfg file]"
To make demos:
$@vibdemo
Testing
VMS only: look in rundemo.dc1 in [make] to see how to give command
line arguments. Not all demo programs are shown. Run at least testcore.
All else:
The build directory should contain a program called testcore. Type
"testcore -" and it should show you some default arguments. Type "testcore"
and it will run through a variety of functions in CoreLib, prompting you for
responses along the way. It should run without a crash or error report. If
you made Vibrant versions, all demos will have startup dialog boxes. If not,
they take command line arguments.
If testcore runs, read the documentation for CoreLib and for AsnLib.
The AsnLib documentation contains instructions for running asntool itself
and for running a few of the demo programs. There are now a large number of
demo programs (including Entrez itself, if you made the Vibrant versions).
CONFIGURATION OR SETTINGS FILES:
One of the fundamental problems in writing portable software concerns
configuration issues. Each individual user's computer will have its own
particular hardware and software environment, and each machine will have
its disk file hierarchy set up in a unique manner. A program that needs
accessory information, such as help files, parse tables, or format
converters, must be given a means of finding the data regardless of where
the user has placed the files. The difficulty is compounded by the different
conventions for naming files and specifying paths on each class of machine.
For example, the name of a CD-ROM on the Macintosh is fixed, determined by
information on the CD itself, whereas on the PC it is addressed by a drive
letter, which can be assigned by the user, but which cannot be reconciled
with the name the Macintosh sees.
An associated problem is that many programs will want to allow the user
to make persistent changes to parameters. These parameters typically involve
numbers or font specifications, but may also include paths to data files.
Some platforms supply such configuration information in preferences files,
others in environment variables. Manipulating these settings is platform
dependent, as is the format in which the preference is specified.
The NCBI Software Toolkit core library addresses these problems by
providing configuration or settings files. These are modeled after the .INI
files used by Microsoft Windows. Settings files are plain ASCII text files
that may be edited by the user or modified by the program. They are divided
into sections, each of which is headed by the section name enclosed in square
brackets. Below each section heading is a series of key=value strings, somewhat
analogous to the environment variables that are used on many platforms.
The ncbi configuration file supplies general purpose configuration
information on paths for commonly used data files. The typical file set up for
the Entrez application running on the PC under Microsoft Windows is shown below:
[NCBI]
ROOT=D:
ASNLOAD=C:\ENTREZ\ASNLOAD\
DATA=C:\ENTREZ\DATA
The only section is entitled NCBI. The ROOT entry refers to the path to
the Entrez CD-ROM. In this example, the user has configured the machine to
use drive letter D. (On the Macintosh, the name of the disc is SEQDATA, which
cannot be changed by the user.) The ASNLOAD entry specifies the path to the ASN.1
parse tables. These files are required by the AsnLib functions, and all
higher-level procedures that call them, including the Object Loader, Sequence
Utility, and Data Access functions. Files pointed to by the DATA entry contain
information necessary to convert biomolecule sequence data into different
alphabets (e.g., unpacking the 2-bit nucleotide code stored on the Entrez CD
into standard IUPAC letters).
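The section and key=value layout described above can be read with a few lines
of C. The following is a minimal sketch that works on an in-memory copy of the
file; sketch_get_setting is a hypothetical helper written for illustration and
is not the core library's own interface. It assumes exact (case-sensitive)
matching of section and key names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up key in [section] of an in-memory settings file.
   Copies the value into buf and returns 1 if found, otherwise returns 0.
   Names are compared exactly (case-sensitive) to keep the sketch short. */
static int sketch_get_setting(const char *text, const char *section,
                              const char *key, char *buf, size_t buflen)
{
    size_t section_len = strlen(section);
    size_t key_len = strlen(key);
    int in_section = 0;
    const char *line = text;

    assert(buf != NULL && buflen > 0);

    while (*line != '\0') {
        size_t line_len = strcspn(line, "\n");   /* length up to end of line */
        const char *next = line + line_len;
        if (*next == '\n') {
            next++;
        }

        /* Ignore a trailing carriage return and trailing blanks. */
        while (line_len > 0 && (line[line_len - 1] == '\r' ||
                                line[line_len - 1] == ' ' ||
                                line[line_len - 1] == '\t')) {
            line_len--;
        }
        /* Ignore leading blanks. */
        while (line_len > 0 && (*line == ' ' || *line == '\t')) {
            line++;
            line_len--;
        }

        if (line_len >= 2 && line[0] == '[' && line[line_len - 1] == ']') {
            /* Section heading: remember whether it is the one we want. */
            in_section = (line_len - 2 == section_len &&
                          strncmp(line + 1, section, section_len) == 0);
        } else if (in_section && line_len > key_len &&
                   strncmp(line, key, key_len) == 0 &&
                   line[key_len] == '=') {
            /* key=value line inside the wanted section. */
            size_t value_len = line_len - key_len - 1;
            if (value_len >= buflen) {
                value_len = buflen - 1;          /* truncate to fit buf */
            }
            memcpy(buf, line + key_len + 1, value_len);
            buf[value_len] = '\0';
            return 1;
        }
        line = next;
    }
    return 0;
}
```

With the sample file shown above held in a string, a call such as
sketch_get_setting(text, "NCBI", "ROOT", buf, sizeof buf) would copy "D:"
into buf.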
Although the contents of a configuration file are similar regardless of
platform, the name of the file and its location are platform dependent. If the
base name of the configuration file is xxx, then the actual file name on each
platform is as shown below:
Macintosh xxx.cnf
Microsoft Windows xxx.INI
MS-DOS (without Windows) xxx.CFG
UNIX .xxxrc
VMS xxx.cfg
Samples of such files are in the subdirectories of the config directory. The
UNIX sample is named without the leading '.' so that it is visible in
directory listings.
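The naming table above can be expressed as a small C function. This is a
sketch for illustration: the SketchPlatform enum and the
sketch_settings_filename function are hypothetical names, and only the
prefixes and suffixes are taken from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical platform tags, one per row of the naming table. */
typedef enum {
    SKETCH_MACINTOSH,
    SKETCH_MS_WINDOWS,
    SKETCH_MS_DOS,
    SKETCH_UNIX,
    SKETCH_VMS
} SketchPlatform;

/* Write the settings file name for the given base name into buf.
   Returns 1 on success, 0 if the platform is unknown or the name
   does not fit in buflen bytes. */
static int sketch_settings_filename(SketchPlatform platform, const char *base,
                                    char *buf, size_t buflen)
{
    const char *prefix = "";
    const char *suffix;
    int written;

    assert(base != NULL && buf != NULL);
    switch (platform) {
    case SKETCH_MACINTOSH:  suffix = ".cnf"; break;
    case SKETCH_MS_WINDOWS: suffix = ".INI"; break;
    case SKETCH_MS_DOS:     suffix = ".CFG"; break;
    case SKETCH_UNIX:       prefix = "."; suffix = "rc"; break;
    case SKETCH_VMS:        suffix = ".cfg"; break;
    default:                return 0;
    }
    written = snprintf(buf, buflen, "%s%s%s", prefix, base, suffix);
    return written >= 0 && (size_t)written < buflen;
}
```

For the base name ncbi, the UNIX case yields .ncbirc and the Microsoft Windows
case yields ncbi.INI.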
The location in which these files must reside is also platform dependent,
and the functions that manipulate the contents may look in several places to
find these files.
On the Macintosh, the function first looks in the System Folder, then in the
Preferences folder within the System Folder. (See the Mac OS X addendum
below.) Under Microsoft Windows, the file must be in the Windows
directory, along with all of the other .INI files. Under DOS without Windows,
the function first looks in the current working directory, then in the directory
whose path is specified in the NCBI environment variable. Under UNIX and VMS,
the current working directory is first checked, then the user's home directory,
and finally the directory specified by the NCBI environment variable. (Under
UNIX, when it uses the environment variable, it will check for configuration
files first without and then with the initial dot.) On the multi-user
platforms (UNIX and VMS), the use of the NCBI environment variable allows a
common settings file to be used as the default by multiple users. If such a
settings file is changed under program control, it is copied over into the
user's home directory, and the new copy is modified. The order of searching
for settings files ensures that this new copy is used in all subsequent
operations.
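The UNIX search order described above can be sketched as a function that lists
the candidate paths in the order they would be tried. This is an illustration
under stated assumptions, not the library's code: sketch_unix_candidates is a
hypothetical name, and the sketch assumes that the dotted name (.xxxrc) is the
one used in the current working directory and the home directory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_CANDIDATES 4
#define SKETCH_PATH_LEN 256

/* Fill paths with candidate settings file locations, in search order.
   home_dir and ncbi_dir may be NULL when unknown or unset.
   paths must have room for SKETCH_MAX_CANDIDATES entries.
   Returns the number of candidates written. */
static int sketch_unix_candidates(const char *base, const char *home_dir,
                                  const char *ncbi_dir,
                                  char paths[][SKETCH_PATH_LEN])
{
    int count = 0;

    assert(base != NULL && paths != NULL);

    /* 1. Current working directory. */
    snprintf(paths[count++], SKETCH_PATH_LEN, "./.%src", base);

    /* 2. The user's home directory. */
    if (home_dir != NULL) {
        snprintf(paths[count++], SKETCH_PATH_LEN, "%s/.%src", home_dir, base);
    }

    /* 3. Directory named by the NCBI environment variable:
          first without, then with, the initial dot. */
    if (ncbi_dir != NULL) {
        snprintf(paths[count++], SKETCH_PATH_LEN, "%s/%src", ncbi_dir, base);
        snprintf(paths[count++], SKETCH_PATH_LEN, "%s/.%src", ncbi_dir, base);
    }
    return count;
}
```

In a real program the two directory arguments would typically come from
getenv("HOME") and getenv("NCBI"), and the caller would try to open each
candidate in turn, stopping at the first that exists. Because the home
directory precedes the shared NCBI directory, a per-user copy takes priority
over the common default.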
On Mac OS X, it first looks for xxx.cnf in username/Library/Preferences,
then in package/Contents/Resources, where username is the user's home directory
and package is the application package. If it does not find the configuration
file, it then switches to UNIX style, looking for .xxxrc in the home directory
and then in the current directory. This way Mac OS X applications retain the
traditional Mac behavior but can also use UNIX-style configuration files.
contents of ASNLOAD are in ncbi/asnload
contents of DATA are in ncbi/data