Generate Python Data Structures for TX EBCDIC Files defined by Cobol data grammar #36

Open
wants to merge 23 commits into master
Conversation

@markfyoung0711 commented May 22, 2024

@mlbelobraydi

impetus

I was excited to see your repo because I want to parse Oil Ledgers historically and run them through a model trainer that would catch errors. The first step in my project is to get as much raw data as possible, so I went to the TX RRC. I saw the EBCDIC files and the copybook PDFs and thought I would have to start from scratch. Then I saw this repo!

improvement

I thought of issue #37, which I believe would make it easier for us to start using the .ebc files. My idea was to avoid hand-coding the Python parsing as was being done earlier. That is up for debate.

approach

Generate a struct format for parsing a binary buffered (RawIO) stream read from a .ebc file.

choice 1 - build my own parser

Parse the copybooks myself with some sort of regular-expression parser in Python, but there are too many COBOL lexical and parsing rules I don't know about.
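To make the trade-off concrete, here is a minimal sketch (not code from this PR) of what a regex-based approach might look like: it computes the byte size of simple DISPLAY-usage PIC clauses, and it already shows why this road gets long fast, since it handles none of COBOL's editing characters, COMP usages, or level structure.

```python
import re

# Hypothetical sketch: byte size of a simple DISPLAY-usage PIC clause,
# e.g. "X(32)", "9(08)V99", "S9(5)". It covers only a fraction of COBOL's
# picture syntax (no COMP-3, no editing characters), which is exactly why
# a full grammar like ANTLR's Cobol85.g4 is attractive.
_PIC_ITEM = re.compile(r'([AX9SV])(?:\((\d+)\))?')

def pic_display_size(pic: str) -> int:
    size = 0
    for ch, count in _PIC_ITEM.findall(pic.upper()):
        if ch in ('S', 'V'):  # sign and implied decimal use no storage in DISPLAY usage
            continue
        size += int(count) if count else 1
    return size

print(pic_display_size('X(32)'))     # 32
print(pic_display_size('9(08)V99'))  # 10: eight digits plus the two after the implied V
```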

choice 2 - ANTLR4 lexer/parser/listener

ANTLR4 has a Cobol85.g4 grammar file, and I was able to generate the Lexer, Parser, and a Listener with one command. I have written a subclass of the Listener that visits all the nodes under the DATA DIVISION to build a symbol table, so I can count sizes and make adjustments for REDEFINES/OCCURS. For a FIELD-MAST record, the struct looks like this:

Symbol table size: 1200
Symbol table struct format: ['@1s3s8s6s5s2s1s32s18s2s2s2s2s5s3s1s1s1s1s1s1s12s2s2s2s2s2s2s6s1s5s1s4s1s5s3s1s3s4s4s4s7s21s1s1s66s66s5s1s66s59s8s7s15s13s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s']


So this program then can parse a field mast record:


[nav] In [10]: from struct import unpack

[nav] In [11]: f = open('olf001l.ebc', "rb")

[nav] In [12]: rec = f.read(1200)

[nav] In [13]: format_field_mast = '@1s3s8s6s5s2s1s32s18s2s2s2s2s5s3s1s1s1s1s1s1s12s2s2s2s2s2s2s6s1s5s1s4s1s5s3s1s3s4s4s4s7s21s1s1s66s66s5s1s66s59s8s7s15s13s2s2s2s2s4s4s1s1s3s3s
          ...: 2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s
          ...: 4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8
          ...: s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s2s2s2s2s4s4s1s1s3s3s2s1s8s15s'

[nav] In [14]: record = unpack(format_field_mast, rec)

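The tuple that `unpack` returns still holds raw EBCDIC bytes. Assuming the RRC files use a standard IBM code page (cp037 is a common guess; cp500 is another candidate), Python's built-in codecs can translate each field to readable text:

```python
# Stand-in for unpack() output: two raw EBCDIC byte fields.
raw_fields = (b'\xc8\x85\x93\x93\x96', b'\xf1\xf2\xf3')

# Decode each field under the assumed cp037 code page and strip the
# space padding fixed-width COBOL fields usually carry.
decoded = tuple(f.decode('cp037').rstrip() for f in raw_fields)
print(decoded)  # ('Hello', '123')
```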

I updated the README.md to point to a new readme about the cobol parsing under the src directory.

I would love to get some feedback from you.

Thanks,
Mark
903-360-8815

(I am in Lindale, Texas near Tyler)

can be used to create a python type for parsing EBCDIC files according to the structure specified.
…s concerned with data division for producing python struct formats).

Keep generated Cobol85Parser, Lexer and Listener as is.  Do not edit them.

Add Symbol and SymbolTable __str__ methods for printing.
Generate sizes by interpreting Pic and Cardinality and Usage
Generate formats from sizes
Eliminate fields that have been redefined
Process Occurs so that formats can be repeated

Add bintools.py to store some handy binary tools/patterns.
clean out dataclass, not needed
add occurs-adjusted size and remove struct format for fields that are part of occurs, to avoid double counting
make sure occurs adjusted size is added to the size of the level 01 record
reorder the occurs processing and final size count
reformat the ola cobol program so that cobc (GNUCobol compiler) can compile it
set up proper setting and return of symbol table
add click options to main
add unit tests before that work is complete to test for regression and to highlight the problem with not removing all enclosed redefineds
add helper functions to Symbol.py to get length, etc.
process enclosed fields for redefineds so they are skipped in symbol table
add to unit tests
Author @markfyoung0711 left a comment

@mlbelobraydi hello, hoping you can review this and give your thoughts on it.

I like the work you have here in the repo. I just thought it would be nice to be able to generate parsing code straight from the RRC copybooks.

@@ -0,0 +1,5655 @@
/*
Initial code downloaded from the grammars repo. Consider removing this and instead using Docker to download the supported version of the .g4 file and generate the COBOL 85 parser on the fly.

@@ -0,0 +1,302 @@
'''

The purpose of this program is to build a data structure that we can use to generate

This is the main part of this PR. It generates one or more symbol tables (SymbolTable.table, a list). Each element of that list is a list of symbols used to generate the Python struct format for parsing a related EBCDIC file.
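A guess at the final step of that pipeline (the function name is hypothetical, not from the PR): once the symbol table has been reduced to an ordered list of field byte-sizes, the struct format is just those sizes joined with `s` under native (`@`) alignment.

```python
def to_struct_format(sizes):
    # e.g. [1, 3, 8] -> '@1s3s8s', matching the format strings shown
    # in the PR description.
    return '@' + ''.join(f'{n}s' for n in sizes)

print(to_struct_format([1, 3, 8, 6, 5]))  # '@1s3s8s6s5s'
```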

src/Locations.py Outdated
class Locations(object):
'''
Find locations of the datasets from the download area because the download links are hashes and cannot be predicted


Needs more work. I need to tie in the BeautifulSoup code that parses the top-level URIs, because the hashes in the directories and filenames are unpredictable.


import pandas as pd

computationals = set(['COMP-1', 'COMP-2', 'COMP-3', 'COMP-4', 'COMP-5',

Main code to support Symbol and SymbolTable.

@@ -0,0 +1,44 @@
# Impetus

Documentation for users and contributors to the repo.

@@ -0,0 +1,445 @@
IDENTIFICATION DIVISION.

Oil ledger file (TX RRC) copybook copy/pasted into a complete COBOL program.

@@ -0,0 +1,38 @@
import simplejson

Rewrite this to use the COBOL data parser to derive the Python struct, then read the .ebc file, parse it, and output an oil ledger master.

@@ -0,0 +1,26 @@
def in_situ_ebcdic_to_ascii(s):

Tools for processing EBCDIC files and inspecting binary contents.
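The PR does not show the body of `in_situ_ebcdic_to_ascii`; one plausible implementation (assuming the cp037 code page) builds a 256-byte translation table once and applies it with `bytes.translate`, which can work in place on a `bytearray`:

```python
# Map every EBCDIC byte value to its Latin-1/ASCII counterpart by decoding
# the full 0..255 range under cp037. This is an assumption about the code
# page, not the PR's actual implementation.
EBCDIC_TO_ASCII = bytes(range(256)).decode('cp037').encode('latin-1', 'replace')

def in_situ_ebcdic_to_ascii(buf: bytearray) -> bytearray:
    buf[:] = buf.translate(EBCDIC_TO_ASCII)  # rewrite the buffer in place
    return buf

buf = bytearray(b'\xd6\x89\x93')  # EBCDIC for 'Oil'
print(in_situ_ebcdic_to_ascii(buf).decode('ascii'))  # Oil
```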

@@ -0,0 +1,3 @@
[flake8]
ignore = W504, E501, E303, E301, E231, E101, W291, W391, E202, E252, E225, E201, E222, E261, E275, E302, F401, W191, W293

The ANTLR4-generated .py files have lots of PEP 8 issues. I am ignoring them for now, until I get the Docker setup to generate the Lexer, Parser, and Listener .py files on the fly.

@@ -0,0 +1,10 @@
from OilMaster import OilMaster

def test_OilMaster():

Has to be rewritten.

@markfyoung0711 markfyoung0711 changed the title init_and_redefines_fixes Generate Python Data Structures for TX EBCDIC Files defined by Cobol data grammar May 22, 2024
src/Locations.py Outdated
dataset_start_html = dataset_download_html.decode("utf-8")
self.url = starting_url
page = requests.get(self.url)
soup = BeautifulSoup(page.content, "html.parser")

This is going to be replaced by the file src/rrc_scrape.py:

98ef966#diff-c363c527d148d304224201f4fca2365f786209b56d18348a29db7cc9babec445

from selenium import webdriver
from selenium.webdriver.common.by import By

# this program gets the URLs for the dataset section

Turn this into a function that returns a dataframe. The dataframe will have the filetype (e.g. Ascii, EBCDIC, ArcFile), the uri (the URL to the list of files), and the dataset_description (the description from the main RRC dataset page).
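A sketch of that suggested refactor (function name and rows are hypothetical; the Selenium scraping that would produce the rows is stubbed out):

```python
import pandas as pd

def datasets_to_dataframe(rows):
    """rows: iterable of (filetype, uri, dataset_description) tuples,
    as would be scraped from the RRC dataset page."""
    return pd.DataFrame(rows, columns=['filetype', 'uri', 'dataset_description'])

rows = [
    ('EBCDIC', 'https://example.invalid/ebcdic-files', 'Oil Ledger (hypothetical row)'),
    ('Ascii',  'https://example.invalid/ascii-files',  'Oil Ledger (hypothetical row)'),
]
df = datasets_to_dataframe(rows)
print(df.shape)  # (2, 3)
```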

Removal of Locator.py and test
Symbol table redefine and occurs handling.
Add selenium code w/headless Chrome browser to get RRC dataset files
Add symbol/symbol table unit tests.
fix bug in listener whereby PIC 9(08)V99 was not picking up the '9's following the V
add to the main program the parsing of the EBCDIC file
the parser cannot handle redefines of redefines yet, so change the inputs to be able to deal with that
- need to make progress in other areas
update unit tests to make corrections for the "5" record type