Generate Python Data Structures for TX EBCDIC Files defined by Cobol data grammar #36

Open
wants to merge 23 commits into master
Conversation

@markfyoung0711 commented May 22, 2024

@mlbelobraydi

impetus

I was excited to see your repo because I want to parse Oil Ledgers historically and run them through a model trainer that would catch errors. The first step in my project is to get as much raw data as possible, so I went to the TX RRC. I saw the EBCDIC files and the copybook PDFs and thought I would have to start from scratch. Then I saw this repo!

improvement

I thought of issue #37, which I believe would make it easier for us to start using the .ebc files. My idea was to avoid hand-coding the Python parsing as was being done earlier. That is up for debate.

approach

Generate a struct format for parsing a binary buffered (RawIO) stream read from a .ebc file.

choice 1 - build my own parser

Parse the copybooks myself with some sort of regular-expression parser in Python, but there are too many COBOL lexical and parsing rules I don't know about.
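To make the trade-off concrete, here is a minimal sketch (not code from this PR) of what a regex-based approach might look like: it computes the byte size of simple DISPLAY-usage PIC clauses, and it already shows why this road gets long fast, since it handles none of COBOL's editing characters, COMP usages, or level structure.

```python
import re

# Hypothetical sketch: byte size of a simple DISPLAY-usage PIC clause,
# e.g. "X(32)", "9(08)V99", "S9(5)". It covers only a fraction of COBOL's
# picture syntax (no COMP-3, no editing characters), which is exactly why
# a full grammar like ANTLR's Cobol85.g4 is attractive.
_PIC_ITEM = re.compile(r'([AX9SV])(?:\((\d+)\))?')

def pic_display_size(pic: str) -> int:
    size = 0
    for ch, count in _PIC_ITEM.findall(pic.upper()):
        if ch in ('S', 'V'):  # sign and implied decimal use no storage in DISPLAY usage
            continue
        size += int(count) if count else 1
    return size

print(pic_display_size('X(32)'))     # 32
print(pic_display_size('9(08)V99'))  # 10: eight digits plus the two after the implied V
```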

choice 2 - ANTLR4 lexer/parser/listener

ANTLR4 has a Cobol85.g4 grammar file, and I was able to generate the Lexer, Parser, and a Listener with one command. I have written a subclass of the Listener that visits all the nodes under the DATA DIVISION to build a symbol table, so I can count sizes and make adjustments for REDEFINES/OCCURS. For a FIELD-MAST record, the struct looks like this:

Symbol table size: 1200
Symbol table struct format: ['@1s3s8s6s5s2s1s32s18s2s2s2s2s5s3s1s1s1s1s1s1s12s2s2s2s2s2s2s6s1s5s1s4s1s5s3s1s3s4s4s4s7s21s1s1s66s66s5s1s66s59s8s7s15s13s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s']


So this program then can parse a field mast record:


[nav] In [10]: from struct import unpack

[nav] In [11]: f = open('olf001l.ebc', "rb")

[nav] In [12]: rec = f.read(1200)

[nav] In [13]: format_field_mast = '@1s3s8s6s5s2s1s32s18s2s2s2s2s5s3s1s1s1s1s1s1s12s2s2s2s2s2s2s6s1s5s1s4s1s5s3s1s3s4s4s4s7s21s1s1s66s66s5s1s66s59s8s7s15s13s2s2s2s2s4s4s1s1s3s3s
          ...: 2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s
          ...: 4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8
          ...: s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s'

[nav] In [14]: record = unpack(format_field_mast, rec)

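The tuple that `unpack` returns still holds raw EBCDIC bytes. Assuming the RRC files use a standard IBM code page (cp037 is a common guess; cp500 is another candidate), Python's built-in codecs can translate each field to readable text:

```python
# Stand-in for unpack() output: two raw EBCDIC byte fields.
raw_fields = (b'\xc8\x85\x93\x93\x96', b'\xf1\xf2\xf3')

# Decode each field under the assumed cp037 code page and strip the
# space padding fixed-width COBOL fields usually carry.
decoded = tuple(f.decode('cp037').rstrip() for f in raw_fields)
print(decoded)  # ('Hello', '123')
```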

I updated the README.md to point to a new readme about the cobol parsing under the src directory.

I would love to get some feedback from you.

Thanks,
Mark
903-360-8815

(I am in Lindale, Texas near Tyler)

can be used to create a python type for parsing EBCDIC files according to the structure specified.
…s concerned with data division for producing python struct formats).

Keep generated Cobol85Parser, Lexer and Listener as is.  Do not edit them.

Add Symbol and SymbolTable __str__ methods for printing.
Generate sizes by interpreting Pic and Cardinality and Usage
Generate formats from sizes
Eliminate fields that have been redefined
Process Occurs so that formats can be repeated

Add bintools.py to store some handy binary tools/patterns.
clean out dataclass, not needed
add occurs-adjusted size and remove struct format for fields that are part of occurs, to avoid double counting
make sure occurs adjusted size is added to the size of the level 01 record
reorder the occurs processing and final size count
reformat the ola cobol program so that cobc (GNUCobol compiler) can compile it
set up proper setting and return of symbol table
add click options to main
add unit tests before that work is complete to test for regression and to highlight the problem with not removing all enclosed redefineds
add helper functions to Symbol.py to get length, etc.
process enclosed fields for redefineds so they are skipped in symbol table
add to unit tests
Author @markfyoung0711 left a comment

@mlbelobraydi hello, hoping you can review this and give your thoughts on it.

I like the work you have here in the repo. I just thought it would be nice to be able to generate parsing code straight from the RRC copybooks.

@@ -0,0 +1,5655 @@
/*
Initial code downloaded from the grammars repo. Consider removing this and instead using Docker to download the supported version of the .g4 file and generate the COBOL 85 parser on the fly.

@@ -0,0 +1,302 @@
'''

The purpose of this program is to build a data structure that we can use to generate

This is the main part of this PR. It generates one or more symbol tables (SymbolTable.table, a list). Each element of that list is a list of symbols used to generate the Python struct format for parsing a related EBCDIC file.
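A guess at the final step of that pipeline (the function name is hypothetical, not from the PR): once the symbol table has been reduced to an ordered list of field byte-sizes, the struct format is just those sizes joined with `s` under native (`@`) alignment.

```python
def to_struct_format(sizes):
    # e.g. [1, 3, 8] -> '@1s3s8s', matching the format strings shown
    # in the PR description.
    return '@' + ''.join(f'{n}s' for n in sizes)

print(to_struct_format([1, 3, 8, 6, 5]))  # '@1s3s8s6s5s'
```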

src/Locations.py Outdated
class Locations(object):
'''
Find locations of the datasets from the download area because the download links are hashes and cannot be predicted


Needs more work. I need to tie in the BeautifulSoup code that parses the top-level URIs, because the hashes in the directories and filenames are unpredictable.


import pandas as pd

computationals = set(['COMP-1', 'COMP-2', 'COMP-3', 'COMP-4', 'COMP-5',

Main code to support Symbol and SymbolTable.

@@ -0,0 +1,44 @@
# Impetus

Documentation for users and contributors to the repo.

@@ -0,0 +1,445 @@
IDENTIFICATION DIVISION.

Oil ledger file (TX RRC) copybook copy/pasted into a complete COBOL program.

@@ -0,0 +1,38 @@
import simplejson

Rewrite this to use the COBOL data parser to derive the Python struct, then read the .ebc file, parse it, and output an oil ledger master.

@@ -0,0 +1,26 @@
def in_situ_ebcdic_to_ascii(s):

Tools for processing EBCDIC files and inspecting binary contents.
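The PR does not show the body of `in_situ_ebcdic_to_ascii`; one plausible implementation (assuming the cp037 code page) builds a 256-byte translation table once and applies it with `bytes.translate`, which can work in place on a `bytearray`:

```python
# Map every EBCDIC byte value to its Latin-1/ASCII counterpart by decoding
# the full 0..255 range under cp037. This is an assumption about the code
# page, not the PR's actual implementation.
EBCDIC_TO_ASCII = bytes(range(256)).decode('cp037').encode('latin-1', 'replace')

def in_situ_ebcdic_to_ascii(buf: bytearray) -> bytearray:
    buf[:] = buf.translate(EBCDIC_TO_ASCII)  # rewrite the buffer in place
    return buf

buf = bytearray(b'\xd6\x89\x93')  # EBCDIC for 'Oil'
print(in_situ_ebcdic_to_ascii(buf).decode('ascii'))  # Oil
```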

@@ -0,0 +1,3 @@
[flake8]
ignore = W504, E501, E303, E301, E231, E101, W291, W391, E202, E252, E225, E201, E222, E261, E275, E302, F401, W191, W293

The ANTLR4-generated .py files have lots of PEP 8 issues. I am ignoring them for now, until I get the Docker setup to generate the Lexer, Parser, and Listener .py files on the fly.

@@ -0,0 +1,10 @@
from OilMaster import OilMaster

def test_OilMaster():

Has to be rewritten.

@markfyoung0711 markfyoung0711 changed the title init_and_redefines_fixes Generate Python Data Structures for TX EBCDIC Files defined by Cobol data grammar May 22, 2024
src/Locations.py Outdated
dataset_start_html = dataset_download_html.decode("utf-8")
self.url = starting_url
page = requests.get(self.url)
soup = BeautifulSoup(page.content, "html.parser")

This is going to be replaced by the file src/rrc_scrape.py:

98ef966#diff-c363c527d148d304224201f4fca2365f786209b56d18348a29db7cc9babec445

from selenium import webdriver
from selenium.webdriver.common.by import By

# this program gets the URLs for the dataset section

Turn this into a function that returns a dataframe. The dataframe will have the filetype (e.g. Ascii, EBCDIC, ArcFile), the uri (the URL to the list of files), and the dataset_description (the description from the main RRC dataset page).
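A sketch of that suggested refactor (function name and rows are hypothetical; the Selenium scraping that would produce the rows is stubbed out):

```python
import pandas as pd

def datasets_to_dataframe(rows):
    """rows: iterable of (filetype, uri, dataset_description) tuples,
    as would be scraped from the RRC dataset page."""
    return pd.DataFrame(rows, columns=['filetype', 'uri', 'dataset_description'])

rows = [
    ('EBCDIC', 'https://example.invalid/ebcdic-files', 'Oil Ledger (hypothetical row)'),
    ('Ascii',  'https://example.invalid/ascii-files',  'Oil Ledger (hypothetical row)'),
]
df = datasets_to_dataframe(rows)
print(df.shape)  # (2, 3)
```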

Removal of Locator.py and test
Symbol table redefine and occurs handling.
Add selenium code w/headless Chrome browser to get RRC dataset files
Add symbol/symbol table unit tests.
fix bug in listener whereby PIC 9(08)V99 was not picking up the '9's following the V
add to the main program the parsing of the EBCDIC file
the parser cannot handle redefines of redefines yet, so change the inputs to be able to deal with that
- need to make progress in other areas
update unit tests to make corrections for the "5" record type