Generate Python Data Structures for TX EBCDIC Files defined by Cobol data grammar #36
base: master
Conversation
can be used to create a python type for parsing EBCDIC files according to the structure specified.
…s concerned with data division for producing python struct formats).
- Keep the generated Cobol85Parser, Lexer and Listener as is; do not edit them.
- Add Symbol and SymbolTable __str__ methods for printing.
- Generate sizes by interpreting Pic, Cardinality and Usage.
- Generate formats from sizes.
- Eliminate fields that have been redefined.
- Process Occurs so that formats can be repeated.
- Add bintools.py to store some handy binary tools/patterns.
…or each symbol that has one
- Clean out dataclass; not needed.
- Add occurs-adjusted size, and remove the struct format for fields that are part of an occurs to avoid double counting.
- Make sure the occurs-adjusted size is added to the size of the level 01 record.
- Reorder the occurs processing and the final size count.
- Reformat the ola cobol program so that cobc (the GnuCOBOL compiler) can compile it.
- Set up proper setting and return of the symbol table; add click options to main.
- Add unit tests before that work is complete, to test for regression and to highlight the problem with not removing all enclosed redefineds; add helper functions to Symbol.py to get length, etc.
- Process enclosed fields for redefineds so they are skipped in the symbol table; add to unit tests.
@mlbelobraydi hello, hoping you can review this and give your thoughts on it.
I like the work you have here in the repo.
I just thought it would be nice to be able to generate parsing code straight from the RRC copy books.
@@ -0,0 +1,5655 @@
/*
initial code downloaded from repo.
consider removing this and instead using docker to download the supported version of the .g4 file and generate the Cobol 85 parser.
@@ -0,0 +1,302 @@
'''
The purpose of this program is to build a data structure that we can use to generate
this is the main part of this PR.
it generates one or more symbol tables in SymbolTable.table (a list).
Each element of that list is itself a list of symbols, which is used to generate the python struct format for parsing a related EBCDIC file.
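To make the idea concrete, here is a minimal, hypothetical sketch of a symbol table producing a struct format. The attribute names (`name`, `length`) and the raw-byte-run formats are assumptions for illustration, not the PR's actual implementation:

```python
# Hypothetical sketch -- Symbol/SymbolTable names follow the PR's description,
# but the attributes and format policy here are assumptions.
from dataclasses import dataclass, field

@dataclass
class Symbol:
    name: str
    length: int  # byte length derived from the field's PIC clause

@dataclass
class SymbolTable:
    table: list = field(default_factory=list)  # list of lists of Symbols

    def struct_format(self, record_index: int) -> str:
        # Big-endian, no padding; each field becomes a raw byte run.
        return ">" + "".join(f"{s.length}s" for s in self.table[record_index])

symbols = [Symbol("WELL-ID", 8), Symbol("OPERATOR-NAME", 30)]
st = SymbolTable(table=[symbols])
print(st.struct_format(0))  # -> ">8s30s"
```

Each record layout in `table` then maps directly to one `struct.unpack` call over a fixed-length slice of the EBCDIC file.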
src/Locations.py
Outdated
class Locations(object):
    '''
    Find locations of the datasets from the download area because the download links are hashes and cannot be predicted
needs more work: need to tie in the BeautifulSoup code that parses the top-level URIs, because the hashes in the directories and filenames cannot be predicted.
import pandas as pd

computationals = set(['COMP-1', 'COMP-2', 'COMP-3', 'COMP-4', 'COMP-5',
main code to support Symbol and SymbolTable
@@ -0,0 +1,44 @@
# Impetus
documentation for users and contributors to the repo
@@ -0,0 +1,445 @@
IDENTIFICATION DIVISION.
oil ledger file (TX RRC) copybook copy/pasted into a complete Cobol program.
@@ -0,0 +1,38 @@
import simplejson
rewrite this to use the cobol data parser to derive the python struct, then read the .ebc file, parse it, and output an oil ledger master.
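The suggested rewrite could look roughly like this sketch. The record format, field names, and `cp037` codec are assumptions for illustration; in the real code the format string would come from the generated symbol table, not be hard-coded:

```python
# Sketch only -- RECORD_FMT and the field names are invented; the PR derives
# the format from the copybook via the symbol table.
import io
import json
import struct

RECORD_FMT = ">8s30s"                     # assumed layout for the demo
RECORD_LEN = struct.calcsize(RECORD_FMT)  # fixed record length in bytes

def records(stream):
    """Yield one dict per fixed-length record read from a binary stream."""
    while True:
        buf = stream.read(RECORD_LEN)
        if len(buf) < RECORD_LEN:
            break
        well_id, operator = struct.unpack(RECORD_FMT, buf)
        yield {"well_id": well_id.decode("cp037").strip(),
               "operator": operator.decode("cp037").strip()}

# tiny demo with EBCDIC-encoded sample bytes standing in for a .ebc file
sample = io.BytesIO("12345678".encode("cp037") +
                    "ACME OIL".ljust(30).encode("cp037"))
parsed = list(records(sample))
print(json.dumps(parsed))
```

The same generator would run unchanged over an `open(path, "rb")` stream for a real `.ebc` file.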
@@ -0,0 +1,26 @@
def in_situ_ebcdic_to_ascii(s):
tools for processing EBCDIC files and inspecting binary contents.
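For reference, the core of an EBCDIC-to-ASCII helper can be as small as a codec decode; this is a hedged sketch, and the PR's actual `in_situ_ebcdic_to_ascii` may differ (cp037 is the common US EBCDIC code page shipped with Python):

```python
# Sketch, not the PR's implementation: decode EBCDIC bytes via the
# standard cp037 codec rather than a hand-built translation table.
def ebcdic_to_ascii(raw: bytes) -> str:
    return raw.decode("cp037")

print(ebcdic_to_ascii(b"\xc8\xc5\xd3\xd3\xd6"))  # -> "HELLO"
```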
@@ -0,0 +1,3 @@
[flake8]
ignore = W504, E501, E303, E301, E231, E101, W291, W391, E202, E252, E225, E201, E222, E261, E275, E302, F401, W191, W293
the ANTLR4 .py files have lots of PEP8 issues.
ignoring them for now until I get the docker setup to generate the Lexer, Parser and Listener .py files on the fly.
@@ -0,0 +1,10 @@
from OilMaster import OilMaster

def test_OilMaster():
has to be rewritten
src/Locations.py
Outdated
dataset_start_html = dataset_download_html.decode("utf-8")
self.url = starting_url
page = requests.get(self.url)
soup = BeautifulSoup(page.content, "html.parser")
this is going to be replaced by the file src/rrc_scrape.py:
98ef966#diff-c363c527d148d304224201f4fca2365f786209b56d18348a29db7cc9babec445
from selenium import webdriver
from selenium.webdriver.common.by import By

# this program gets the URLs for the dataset section
turn this into a function that returns a dataframe.
the dataframe will have the filetype (e.g. Ascii, EBCDIC, ArcFile), the uri (the URL to the list of files), and the dataset_description (the description from the main RRC dataset page).
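The suggested refactor might look like the sketch below. The filetype heuristic, column contents, and page locators are assumptions; the real RRC page structure would drive the actual selectors:

```python
# Sketch of the suggested function-returning-a-dataframe refactor.
# The classify() heuristic and locators are invented for illustration.
import pandas as pd

def classify(href: str) -> str:
    """Assumed heuristic: infer the filetype column from the link target."""
    h = href.lower()
    if h.endswith(".ebc"):
        return "EBCDIC"
    if h.endswith(".txt"):
        return "Ascii"
    return "ArcFile"

def links_to_frame(hrefs, description):
    """Build the dataframe described in the review comment."""
    return pd.DataFrame({
        "filetype": [classify(h) for h in hrefs],
        "uri": hrefs,
        "dataset_description": [description] * len(hrefs),
    })

def scrape_dataset_page(url, description):
    """Collect hrefs with headless Chrome, then build the frame."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        hrefs = [a.get_attribute("href")
                 for a in driver.find_elements(By.TAG_NAME, "a")]
    finally:
        driver.quit()
    return links_to_frame(hrefs, description)
```

Splitting the frame-building out of the scraping keeps the dataframe logic unit-testable without a browser.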
Remove Locator.py; test symbol table redefine and occurs handling. Add selenium code with a headless Chrome browser to get the RRC dataset files. Add symbol/symbol table unit tests.
Fix a bug in the listener whereby PIC 9(08)V99 was not picking up the '9's following the V. Add the parsing of the EBCDIC file to the main program.
The parser cannot handle redefines of redefines yet, so change the inputs to work around that (need to make progress in other areas). Update the unit tests to make corrections for the "5" record type.
@mlbelobraydi
impetus
I was excited to see your repo because I want to parse Oil Ledgers historically and run them through a model trainer that would catch errors. My first step in my project is to get as much raw data as possible, so I went to the TX RRC. I saw the EBCDIC files and the copybook PDFs, so I thought I would have to start from scratch. Then I saw this repo!
improvement
I thought of this issue #37
which I believe would make it easier for us to start using the .ebc files. My idea was to avoid hand-coding the manual python parsing as was being done earlier. That is up for debate.
approach
generate a struct format for parsing the binary buffered (RawIO) stream read from a .ebc file
choice 1 - build my own parser
parse the copybooks myself with some sort of regular-expression parser in python; but there are so many lexical and parsing rules of Cobol that I don't know about.
choice 2 - ANTLR4 lexer/parser/listener
ANTLR4 has a Cobol85.g4 (grammar) file, and I was able to generate the Lexer, Parser and a Listener with one command. I have written a subclass of the Listener that visits all the nodes under the Data Division to build a symbol table, so I can count sizes and make adjustments for redefines/occurs. So for a FIELD-MAST record, the struct is like this:
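(The actual FIELD-MAST struct is not reproduced in this excerpt. As a purely hypothetical illustration of the PIC-to-struct mapping described, with field names and widths invented for the example:)

```python
# Hypothetical illustration only -- NOT the real FIELD-MAST layout.
# DISPLAY usage: PIC X(n) / 9(n) occupy n bytes each; the implied
# decimal point V in PIC 9(a)V9(b) takes no storage.
import struct

def pic_display_len(digits_before, digits_after=0):
    return digits_before + digits_after

fmt = ">" + "".join([
    "8s",                            # FIELD-NUMBER  PIC 9(08)
    "32s",                           # FIELD-NAME    PIC X(32)
    "%ds" % pic_display_len(8, 2),   # SOME-AMOUNT   PIC 9(08)V99 -> 10 bytes
])
print(fmt, struct.calcsize(fmt))  # -> ">8s32s10s" 50
```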
I updated the README.md to point to a new readme about the cobol parsing under the src directory.
I would love to get some feedback from you.
Thanks,
Mark
903-360-8815
(I am in Lindale, Texas near Tyler)