Something like apropos wrt go packages. It searches through symbols or comments or package names, as the case may be.

hanishkvc/go-apropos

#############
Go Apropos
#############
HanishKVC, 2022


Overview
#########

General
========
go doc requires one to already know which standard package to look in,
to get info about the same. However if one doesn't know which package
may contain what they want, they will have to grep the src directory
or do a web search.

This provides something like the apropos command wrt man pages, but
here it searches for matching symbols/comments/package names from among
the packages in the go source directory.

NOTE: Symbol refers to const or var or type or func name.

I wanted to look at Go lang a bit, ran into this need, and thus wrote this.
However I haven't really looked at Go or read through the Go documentation,
so this code could be far away from the conventions and concepts in
Go land. This is based on some quick random scanning of docs and go src,
followed by compilation errors and potentially flawed logical guess work.
At the same time, it should do the job and be useful when exploring Go lang.

AutoCache Mode 
===============

To speed up normal use of the program, by default it works in auto cache
mode, where it uses a previously created cache containing meta data wrt
symbols / pkg names / comments, when searching for these.

If the cache file is missing or appears out of sync with the go source
directory, then the cache file will be freshly created.

However one can force the program into a non cache mode, if required, by
passing the following args

	goapropos --autocache=false searchtoken

	goapropos --autocache=false --usecache=false searchtoken

in which case it will parse through the system / specified go source dir.

NOTE: Currently this logic uses the modtime of the go source directory
as the version check. So if a go source package install / update or a
go lang suite install / update occurs, this modtime should almost always
change and in turn the autocache logic should trigger a cache update.
However if someone changes the contents of some subdirectories only within
the go src dir, then the src dir's modtime may not change. In order to
update the cache in such a situation, one will have to manually either
touch the go src dir's modtime or run

	goapropos --autocache=false --createcache

NOTE: In autocache mode createcache and usecache flags will be manipulated
by the autocache logic, overriding any user commandline setup wrt same.
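
The modtime based check described above, along with its subdirectory blind
spot, can be sketched roughly as below. This is only an illustration: the
helper names cacheIsStale and demo are hypothetical and are not taken from
the goapropos source, and how the program actually records the modtime is
not shown in this README.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// cacheIsStale reports whether the modtime recorded at cache creation differs
// from the current modtime of srcDir. Hypothetical helper, not goapropos code.
func cacheIsStale(srcDir string, recorded time.Time) (bool, error) {
	info, err := os.Stat(srcDir)
	if err != nil {
		return true, err
	}
	return !info.ModTime().Equal(recorded), nil
}

// demo returns the staleness seen after editing a file in a subdirectory and
// after explicitly touching the top level directory.
func demo() (afterSubEdit, afterTouch bool, err error) {
	src, err := os.MkdirTemp("", "gosrc")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(src)
	sub := filepath.Join(src, "pkg")
	if err = os.Mkdir(sub, 0o755); err != nil {
		return false, false, err
	}
	info, err := os.Stat(src)
	if err != nil {
		return false, false, err
	}
	recorded := info.ModTime() // what the cache would remember

	// A change inside a subdirectory updates only that subdirectory's modtime.
	if err = os.WriteFile(filepath.Join(sub, "a.go"), []byte("package pkg\n"), 0o644); err != nil {
		return false, false, err
	}
	if afterSubEdit, err = cacheIsStale(src, recorded); err != nil {
		return false, false, err
	}

	// Equivalent of running touch on the go src dir.
	later := recorded.Add(time.Hour)
	if err = os.Chtimes(src, later, later); err != nil {
		return false, false, err
	}
	afterTouch, err = cacheIsStale(src, recorded)
	return afterSubEdit, afterTouch, err
}

func main() {
	a, b, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stale after subdirectory edit:", a)
	fmt.Println("stale after touching src dir:", b)
}
```

On typical filesystems the subdirectory edit goes unnoticed while the
explicit touch is detected, which is why the manual steps above are needed.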

License
=========

GPL, BSD

Thank you all for all the fish :)


Usage
######

Normal Use
============

NOTE: Remember to pass the named arguments/flags before any unnamed args to
the program.

One can specify the token/substring to match / search for wrt symbols in
packages by using either the cmdline arg --find or by just specifying it
on the cmdline, ie.

	goapropos --find search_token

	OR

	goapropos search_token

By default it tries to find matching exported symbols only, from the available
packages. However if one wants the logic to use both internal and exported
symbols of the packages when trying to find a match, one needs to specify
the cmdline argument --allsymbols. This needs to be used when creating
a cache of the symbols and/or when doing a non-cache based search, ie

	goapropos --autocache=false --createcache --allsymbols

	goapropos --autocache=false --allsymbols search_token

By default it tries to search through all the packages in the identified
go source directory. However if required one can filter the packages that
will be searched by using the --findpkg argument. The token given through
the findpkg argument will be used to filter the package names (potentially
including the import path prefix) for a match.

	goapropos --findpkg packagename_token search_token

If one wants to get a list of package names, which match a given token, one
can run goapropos with only the findpkg argument and no find argument.

	goapropos --findpkg packagename_search_token

NOTE: Using --findpkg also prints the filenames corresponding to the matching
packages.

To get all the exported symbols of all the packages, use

	goapropos --find ""

To get all the exported symbols of a specific package, use

	goapropos --findpkg "packagename$" --find ""

If one wants to find symbols based on their comments if any, then they can
use --findcmt to specify a match token wrt comment. Any symbols which contain
comments that match the specified token will be shown to the user.

NOTE: If the comment being searched for is found at a generic file level,
rather than wrt a specific symbol within it, then the package name of the
file is shown.

	goapropos --findcmt cmt_search_token
	ex: goapropos --findcmt "device"

The tokens specified are used to match package names or the symbols or their
comments, as the case may be, by using a case insensitive search by default.
If one wants to use case sensitive matching, pass --casesensitive.

By default the search token provided (be it wrt package names or symbols or
comments) is treated as a regular expression. However if required one can
change from regular expression to a simple if-string-contains-substring logic
by specifying --matchmode contains

	goapropos search_token_regexp
	ex: goapropos fmt

	goapropos --matchmode contains search_token_substring
	ex: goapropos --matchmode contains fmt

	goapropos --matchmode regexp search_token_regexp
	ex: goapropos --matchmode regexp "fm.*t"
	ex: goapropos --matchmode regexp "fm+t"

NOTE: By default, ie when case sensitive search is disabled, the search tokens
as well as the pkg names/symbols/comments are converted to upper case before
matching. Any implications of this wrt regexps need to be kept in mind.
Enabling case sensitive searching disables this automatic upper case
conversion and leaves the search token, as well as the pkg names, symbols and
comments, as they are.
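
One concrete implication of the upper case conversion, assuming the regexp
pattern itself is upper cased before it is compiled (the README does not
spell out the exact order), is that escape classes change meaning: \d
(a digit) becomes \D (a non digit), and similarly for \w, \s and \b. A
minimal sketch, with matchFolded being a hypothetical helper and not
goapropos code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchFolded upper cases both the pattern and the text before matching,
// mirroring the default case insensitive behaviour described above.
// Hypothetical helper used only for illustration.
func matchFolded(pattern, text string) (bool, error) {
	re, err := regexp.Compile(strings.ToUpper(pattern))
	if err != nil {
		return false, err
	}
	return re.MatchString(strings.ToUpper(text)), nil
}

func main() {
	for _, pattern := range []string{`fm+t`, `sha\d+`, `sha[0-9]+`} {
		for _, text := range []string{"fmt", "Sha256"} {
			ok, err := matchFolded(pattern, text)
			if err != nil {
				fmt.Println("bad pattern:", err)
				continue
			}
			fmt.Printf("%-10s vs %-7s -> %v\n", pattern, text, ok)
		}
	}
}
```

Writing character classes explicitly, such as [0-9] instead of \d, sidesteps
the issue under this assumption; so does passing --casesensitive.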


Go Source directory
======================

When looking for the go source directory, by default it uses the src directory
under GOROOT (or else under /usr/share or /usr/local/share or /usr/lib, a
directory which matches the pattern "go-*" or "golang*") as the go source
directory, both for the packages to search and as the source for the meta data
saved into the cache. If required one can explicitly set the go source
directory by using the cmdline arg --basepath

	goapropos --basepath <base_path> search_token

NOTE: The logic doesn't follow symbolic links under the go source directory.

NOTE: If the go source directory auto identified by goapropos is wrong, and
one is required to set a new basepath, then basepath argument needs to be
used always, so that autocaching remains in sync.
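
The lookup order described in this section could look roughly like the
sketch below. It is a guess at the shape of the logic, not the program's
code: findGoSrc and isDir are hypothetical names, GOROOT is assumed to be
read from the environment, and the src directory is assumed to sit directly
under a matching "go-*" or "golang*" directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isDir reports whether p exists and is a directory.
func isDir(p string) bool {
	info, err := os.Stat(p)
	return err == nil && info.IsDir()
}

// findGoSrc tries GOROOT/src first and then falls back to globbing the given
// base directories for go-* or golang* installs. Hypothetical helper.
func findGoSrc(goroot string, bases []string) (string, bool) {
	if goroot != "" {
		if p := filepath.Join(goroot, "src"); isDir(p) {
			return p, true
		}
	}
	for _, base := range bases {
		for _, pattern := range []string{"go-*", "golang*"} {
			// Glob only fails on a malformed pattern, which these are not.
			matches, _ := filepath.Glob(filepath.Join(base, pattern))
			for _, m := range matches {
				if p := filepath.Join(m, "src"); isDir(p) {
					return p, true
				}
			}
		}
	}
	return "", false
}

func main() {
	bases := []string{"/usr/share", "/usr/local/share", "/usr/lib"}
	if p, ok := findGoSrc(os.Getenv("GOROOT"), bases); ok {
		fmt.Println("go source dir:", p)
	} else {
		fmt.Println("go source dir not found, pass --basepath")
	}
}
```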

Skipping files
================

One can skip files matching certain predefined substrings in their name or
path by using --skipfiles. One can specify multiple matching tokens to filter
out source files from different-paths/... by using skipfiles multiple times.

	--skipfiles "substring" (skip files containing substring in their path)

	goapropos --skipfiles "/src/cmd/" --skipfiles "/src/internal/" findme
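
A repeatable flag like skipfiles, and the earlier note about passing named
arguments before unnamed ones, can both be illustrated with Go's standard
flag package. This sketch assumes the program uses that package, which this
README does not state; stringList, parseArgs and shouldSkip are hypothetical
names.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringList collects every occurrence of a repeated flag.
type stringList []string

func (s *stringList) String() string { return strings.Join(*s, ",") }

func (s *stringList) Set(v string) error {
	*s = append(*s, v)
	return nil
}

// parseArgs is a hypothetical stand-in for the program's argument handling.
// It returns the collected skipfiles tokens and the remaining unnamed args.
func parseArgs(args []string) ([]string, []string, error) {
	fs := flag.NewFlagSet("goapropos", flag.ContinueOnError)
	var skip stringList
	fs.Var(&skip, "skipfiles", "skip files whose path contains this substring (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, nil, err
	}
	return skip, fs.Args(), nil
}

// shouldSkip reports whether path contains any of the skip substrings.
func shouldSkip(path string, skip []string) bool {
	for _, s := range skip {
		if strings.Contains(path, s) {
			return true
		}
	}
	return false
}

func main() {
	skip, rest, err := parseArgs([]string{
		"--skipfiles", "/src/cmd/", "--skipfiles", "/src/internal/", "findme",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("skip tokens:", skip, "unnamed args:", rest)
	fmt.Println(shouldSkip("/usr/lib/go/src/cmd/go/main.go", skip))

	// The flag package stops parsing at the first unnamed argument, so a
	// flag placed after it is treated as just another unnamed argument.
	skip, rest, _ = parseArgs([]string{"findme", "--skipfiles", "/src/cmd/"})
	fmt.Println("skip tokens:", skip, "unnamed args:", rest)
}
```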

If autocache true
--------------------

As the program may autoupdate the cache any time, if one needs to skip certain
go source files always, then one is required to pass the skipfiles argument
always.

However if one wants the cache to contain these files, but at the same time
wants to temporarily skip certain paths / files and search, then one will have
to request the program to avoid using the cache and then search, ie

	goapropos --autocache=false --skipfiles "path/to/skip" search_token

NOTE: Disabling autocache above is critical, because otherwise, if for any
reason the autocache logic decides to update the cache when one has passed the
skipfiles argument, the cache will no longer contain data wrt these skipped
files.

NOTE: In some cases explicitly disabling autocache mode and in turn creating
a new cache explicitly with the unwanted files skipped may allow one to avoid
the need to pass skipfiles each time. However this is not permanent and
will get overridden once the program autoupdates the cache file.

	goapropos --autocache=false --skipfiles "path/to/skip" --createcache


Cache management and use
===========================

To avoid having to parse the go source files each time the program is called,
it supports the creation and in turn use of a cache with the required data.

In autocache mode, which is the default, this cache is managed automatically.
However if for some reason one wants to manually control caching, then one
will have to instruct the program to stop autocache management and in turn
pass additional cache related arguments.

Use the flag --createcache to create / update the cache of data wrt package
symbols, paths and comments.

	goapropos --autocache=false --createcache

Use the flag --usecache to use a previously created cache rather than freshly
parsing through the go source files.

	goapropos --autocache=false --usecache search_token

NOTE: One needs to first create the cache before trying to use it.

Whenever the go language package is updated, one will have to recreate the
cache file, to match the same.
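
As a rough illustration of what creating and later using such a cache
involves, the sketch below serialises a package to symbols map as JSON along
with a program tag and a format version, and refuses to use data whose tag
or version does not match. The changelog mentions JSON loading and a cache
version file, but the layout here (one combined structure, the field names,
the version number) is purely an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cacheFormatVersion is a hypothetical version number for this sketch.
const cacheFormatVersion = 2

// cacheFile is an assumed layout, not the real goapropos cache format.
type cacheFile struct {
	Program string              `json:"program"`
	Version int                 `json:"version"`
	Symbols map[string][]string `json:"symbols"` // package name -> symbol names
}

// encodeCache is the createcache side: turn the in-memory map into bytes.
func encodeCache(symbols map[string][]string) ([]byte, error) {
	return json.Marshal(cacheFile{
		Program: "goapropos",
		Version: cacheFormatVersion,
		Symbols: symbols,
	})
}

// decodeCache is the usecache side. It returns false when the data is
// unreadable or written by a different format version, which signals that
// the cache should be recreated from the go source files.
func decodeCache(data []byte) (map[string][]string, bool) {
	var c cacheFile
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, false
	}
	if c.Program != "goapropos" || c.Version != cacheFormatVersion {
		return nil, false
	}
	return c.Symbols, true
}

func main() {
	data, err := encodeCache(map[string][]string{"fmt": {"Println", "Sprintf"}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	symbols, ok := decodeCache(data)
	fmt.Println(ok, symbols["fmt"])
}
```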


TODO
######

DONE: An optional simple regular expression based token matching option has
been added.

DONE: Allow searching through package / identifier comments, if possible. Have
added logic to extract comments at a basic level. TODO: Comments at the block
level wrt const or var containing multiple definitions needs to be accounted.

Maybe simplify by using ParseDir on dirs, with no need to look at the src files
individually and separately. However this may not skip the test go source files
in them, parsing of which can be avoided by walking through the files and
calling ParseFile on them, like the current flow. Think of this later.

DONE: Maybe add support to cache the package identifiers/paths/comments map.
This will in turn require a cmdline argument to force rebuilding of this cache,
when required.

DONE: Compile regexp match tokens, later.

Maybe later add support for searching multiple different comment tokens and
or symbol tokens. Currently one can search for either a single symbol or
single comment or one symbolORcomment match tokens together.

DONE: Build separate maps wrt each go routine, and then merge them at the end.
This should allow the go routines to run without blocking when trying to
update the map / db, unlike earlier, when they blocked as they needed to
synchronise when trying to update a single map / db. [DONE-NOTE: didn't gain
much, if any, performance, because the availability of multiple go routines
working on multiple files in parallel seems to keep this contention from
becoming the hot path.]

DONE: Maintain Packages with path info wrt base dir for the package.


Note
######

AST and Parsing
=================

From an initial quick glance at the golang source, found go/ast and its Inspect
function. In turn, to feed Inspect, found parser.ParseFile to parse go source
files.

However on using them found that no package ast node or comment related nodes
(Comment/CommentGroup) were getting found at any level, going by the callback
function of Inspect. Then there was also that mode argument to ParseFile
which I had not yet looked at.

From another quick glance at the source files in go/ast, go/doc and go/parser,
as also looking at go doc parser, I can see a parser.ParseDir, which seems to
return package nodes (as a given source directory could have multiple pkgs).
Also found bits about the Mode type and in turn ParseComments.

By using go/ast and in turn the nodes that it visits while inspecting
the go source files, realised that the ast.Ident node is triggered for both
own as well as others' symbols found in a go source file (because the go
source file refers to symbols from the packages it uses), while GenDecl/
ValueSpec/TypeSpec/FuncDecl nodes are triggered only for own symbols, which
the go source file being inspected defines.
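
The observations above can be reproduced with a small standalone program.
The sketch below parses an in-memory sample with parser.ParseFile and the
parser.ParseComments mode, then uses ast.Inspect to collect only the symbols
the file itself declares. The sample source and the declaredSymbols helper
are made up for illustration, and the type tags only loosely follow the
Const/Var/Type/Func tagging mentioned in the changelog.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// sample is a made up source file used only for this illustration.
const sample = `// Package demo is a sample.
package demo

import "strings"

// Limit is the maximum.
const Limit = 10

var hidden = strings.ToUpper("x")

// Shape describes a shape.
type Shape struct{}

// Area returns the area.
func (s Shape) Area() int { return Limit }
`

// declaredSymbols returns the names defined by src mapped to a type tag.
// Identifiers that merely refer to other packages (such as strings.ToUpper)
// are not collected, because plain ast.Ident nodes are never looked at.
func declaredSymbols(src string) (map[string]string, error) {
	fset := token.NewFileSet()
	// Without parser.ParseComments the comments are dropped by the parser.
	f, err := parser.ParseFile(fset, "demo.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	syms := map[string]string{}
	ast.Inspect(f, func(n ast.Node) bool {
		switch d := n.(type) {
		case *ast.FuncDecl:
			syms[d.Name.Name] = "Func"
			return false // no need to descend into function bodies
		case *ast.GenDecl:
			for _, spec := range d.Specs {
				switch s := spec.(type) {
				case *ast.TypeSpec:
					syms[s.Name.Name] = "Type"
				case *ast.ValueSpec:
					tag := "Var"
					if d.Tok == token.CONST {
						tag = "Const"
					}
					for _, name := range s.Names {
						syms[name.Name] = tag
					}
				}
			}
			return false
		}
		return true
	})
	return syms, nil
}

func main() {
	syms, err := declaredSymbols(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	names := make([]string, 0, len(syms))
	for name := range syms {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%-6s %s\n", syms[name], name)
	}
}
```

Filtering to exported symbols only could be layered on top with
ast.IsExported.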

String Matches
===============

For simple string matching based goapropos/find searches, the strings.Contains
version was found to be twice as fast as the uncompiled-regexp version. However
using a compiled re and in turn reusing it, which is possible in this program's
flow, makes the re version as fast as the strings.Contains version.

The matcher interface added to support both strings.Contains and re.MatchString
in a seamless manner adds a tiny pico bit of overhead compared to direct use
of strings.Contains.
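
A matcher interface of this kind might look like the sketch below, where the
regexp is compiled once and then reused for every candidate string. The
names Matcher, containsMatcher, regexpMatcher and newMatcher are assumptions
made for the example, and the real interface in goapropos may be shaped
differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matcher hides which matching strategy is in use.
type Matcher interface {
	Matches(s string) bool
}

// containsMatcher implements the simple substring mode.
type containsMatcher struct{ token string }

func (m containsMatcher) Matches(s string) bool { return strings.Contains(s, m.token) }

// regexpMatcher holds a regexp compiled once and reused for every call.
type regexpMatcher struct{ re *regexp.Regexp }

func (m regexpMatcher) Matches(s string) bool { return m.re.MatchString(s) }

// newMatcher builds a Matcher for the given match mode. Hypothetical helper.
func newMatcher(mode, token string) (Matcher, error) {
	switch mode {
	case "contains":
		return containsMatcher{token: token}, nil
	case "regexp":
		re, err := regexp.Compile(token)
		if err != nil {
			return nil, err
		}
		return regexpMatcher{re: re}, nil
	default:
		return nil, fmt.Errorf("unknown matchmode %q", mode)
	}
}

func main() {
	candidates := []string{"fmt", "Fprintf", "format"}
	for _, mode := range []string{"contains", "regexp"} {
		m, err := newMatcher(mode, "fm")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		for _, c := range candidates {
			fmt.Printf("%-8s %-8s %v\n", mode, c, m.Matches(c))
		}
	}
}
```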

Duplicate Symbol Names
========================

Potentially the same symbol name could represent different things within a package.
For example a type and a method (associated with the same or a different
type) could have the same symbol name.

For now the type and comments wrt such a duplicated symbol name are collated together
into the same entry in the database. In turn a type tag is not duplicated within
the type field.

So when printing such a symbol, it may have more than one type tag associated
with it, and the comment wrt it will be from across all the duplicates.

May handle this situation differently in future.
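
The collation described in this section amounts to something like the
following sketch. The symbolEntry type, the addSymbol helper and the way
comments are joined with a space are all assumptions for illustration, not
the actual database layout.

```go
package main

import "fmt"

// symbolEntry is an assumed shape for one database entry.
type symbolEntry struct {
	TypeTags []string // each tag appears at most once
	Comment  string   // comments from all duplicates, joined together
}

// addSymbol merges a newly seen declaration into the entry for name.
func addSymbol(db map[string]*symbolEntry, name, typeTag, comment string) {
	e, ok := db[name]
	if !ok {
		e = &symbolEntry{}
		db[name] = e
	}
	seen := false
	for _, t := range e.TypeTags {
		if t == typeTag {
			seen = true
			break
		}
	}
	if !seen {
		e.TypeTags = append(e.TypeTags, typeTag)
	}
	if comment != "" {
		if e.Comment != "" {
			e.Comment += " "
		}
		e.Comment += comment
	}
}

func main() {
	db := map[string]*symbolEntry{}
	// A type and two methods sharing the name Reader (made up example).
	addSymbol(db, "Reader", "Type", "Reader reads things.")
	addSymbol(db, "Reader", "Func", "Reader returns the reader of a.")
	addSymbol(db, "Reader", "Func", "Reader returns the reader of b.")
	fmt.Println(db["Reader"].TypeTags)
	fmt.Println(db["Reader"].Comment)
}
```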


Changelog
############

Rather major changes

20220624
=========

fullcomments flag and logic to print comments wrt the symbols in an indented
manner.

Add trimmed comments wrt symbols into the db and in turn the cache. Also update
the cache version, so that any existing old caches will be overwritten with
a newer cache containing these trimmed comments.

20220622+
==========

Add basepath wrt package names, to help differentiate between different
packages having the same package name. Try to keep the logic os agnostic, be
it wrt path seps, as well as how the pkg's basepath is stored and shown (to
always include '/').

Print part of comments wrt symbols.

Handle situation of duplicate symbol names in a package in a crude way for now.

Add Program tag name and a cache file format version to the cache version file.
The version number ensures that if there is a change to logic, which changes
the cache file format, then the program can automatically recreate the cache
file.

Print pkgname on same line as pkg file paths and pkg symbols, along with spacing
around to hopefully make it cleaner and easier to read.

Make sorted and non sorted result prints match and be the cleaner version.


20220620+
==========

Cleanup compare a bit, built around structs embedding MatcherConfig and
MatcherType specific members.

Less dependency on globals and more on passed around args wrt Matcher
and db.

The hunt for go src files starts with GOROOT

Switch to regexp as the default matchmode


20220617+
==========

Cleanup db related types and print and find

Add simple test and benchmark helper functions.

Update printing of find related results to be simple, clean and tagged
wrt what it represents.

Add sample Test and Benchmark functions.

Add a Matcher interface with implementations for strings-contains and
re-compiled.

Compile the re in regexp match mode, this makes the re version match
the simple contains version speed.

Make the re matcher work with a pointer to the underlying regexp.Regexp,
rather than with a plain Regexp value, when using matcher methods.
This squeezes out a few msecs. More importantly, it helped realise that
matching a type to the interfaces it supports during assignment doesn't do
the equivalent of the auto-adjust between a type and its pointer that happens
when calling methods.
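
The method set behaviour noted in the entry above can be seen in a few
lines. In this sketch (reMatcher and viaInterface are hypothetical names)
the method has a pointer receiver, so calling it on an addressable value
works, but only the pointer can be assigned to the interface.

```go
package main

import (
	"fmt"
	"regexp"
)

type Matcher interface {
	Matches(s string) bool
}

type reMatcher struct{ re *regexp.Regexp }

// Pointer receiver: only *reMatcher has Matches in its method set.
func (m *reMatcher) Matches(s string) bool { return m.re.MatchString(s) }

// viaInterface calls Matches both directly and through the interface.
func viaInterface(pattern, s string) bool {
	m := reMatcher{re: regexp.MustCompile(pattern)}

	// Direct call on an addressable value: the compiler rewrites this
	// to (&m).Matches(s) automatically.
	direct := m.Matches(s)

	// No such adjustment happens on assignment to an interface.
	// The following line would fail to compile, because reMatcher
	// (the value type) does not implement Matcher:
	//
	//	var bad Matcher = m
	//
	var viaPtr Matcher = &m
	return direct && viaPtr.Matches(s)
}

func main() {
	fmt.Println(viaInterface("fm+t", "fmt"))
}
```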

20220616
==========

Auto manage cache by default.

Recover from insane (minimal check currently) / unavailable db.

20220615+
===========

Sorted find related results at end or as they are found.

Add type info (ie Const/Var or Type or Func) wrt each Symbol.

Differentiate within ValueSpec (ie Const or Var)

20220614+
==========

Use go routines to see how things go. Here the walking of the dir is made
parallel to the handling of the files. Logically this shouldn't, and doesn't,
change performance much. Rather the overhead of go routines,
if any, makes the overall logic a bit slower compared to the NoGoRoutines
version.

Wrt each symbol store just the comments related string, thus simplifying
the flow. This should simplify the json loading and in turn seems to speed up
usecache based search by around 20-25%.

Add support for Multiple Go Routines wrt file handling, so that even when
there is a contention like trying to update the shared global db or an io
delay like reading a file, there is some other go routine to make use of
the available cpu/processing resources.

    This version is about 25% faster than the No Go routines version.

Use independent maps wrt each go routine, so that there is no need for any
contention to a shared global database when they are running. Then at the
end build a merged global database.

    The multiple go routines seem to be hiding any contention wrt the shared
    global db, as had been hoped. This version didn't change performance much.
    This also indicates that there is enough io bandwidth to spare on the
    test machine and potentially in general on other machines also, to allow
    the parallel go routines to munch on additional files.

Make both raw source file parsing and cache based paths handle find++ queries
in equivalent ways.
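
The independent maps approach can be sketched as below: each worker go
routine fills its own map without any locking, and the maps are merged once
all workers are done. The buildDB function, its parse callback and the
map layout are hypothetical stand-ins for the real file handling.

```go
package main

import (
	"fmt"
	"sync"
)

// buildDB runs parse on every file using the given number of workers.
// Each worker writes only to its own local map, so no mutex is needed
// while the workers run. The local maps are merged at the end.
func buildDB(files []string, workers int, parse func(path string) []string) map[string][]string {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	locals := make([]map[string][]string, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		locals[w] = make(map[string][]string)
		wg.Add(1)
		go func(local map[string][]string) {
			defer wg.Done()
			for path := range jobs {
				local[path] = parse(path)
			}
		}(locals[w])
	}
	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()

	// Single threaded merge, after all workers have finished.
	merged := make(map[string][]string, len(files))
	for _, local := range locals {
		for path, syms := range local {
			merged[path] = syms
		}
	}
	return merged
}

func main() {
	files := []string{"a.go", "b.go", "c.go", "d.go"}
	db := buildDB(files, 2, func(path string) []string {
		return []string{path + ":Sym"} // placeholder for real parsing
	})
	fmt.Println(len(db), db["c.go"])
}
```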


20220612
==========

Cache and use the Maps/Database created wrt Pkg symbols, paths and comments.
User needs to pass named arguments to enable the creation as well as use of
cache.


20220610
==========

Avoid populating the apropos's pkg symbols database with symbols from other
pkgs used by a given go source file. IE avoid using ast.Ident node.

A bit more flexible find go source directory logic.

Link comment at the block level wrt consts or vars or types to the members
of the block, so that a search for any part of such a comment will list all
the members of that block.

If a comment level search matches any of the generic level comments in any
of the files belonging to that package, instead of to the comment specific
to a symbol, then the package name will be shown.


20220608
==========

An almost basic level of go apropos logic has been implemented.
