
Diploma-Project

This repository contains my Bachelor's CS degree project, as well as its timeline and incremental progress.

Features and architecture

Overall progress

Week 1 (19.02.2018 -> 23.02.2018)

  • ☑️ Define a specific set of use cases (detailed in the Images/Specifications.png file) as a result of the discussion with Flavian, Cosmin R. and Dan T.
  • ☑️ Research the appropriate technologies (Apache Flume, Hadoop, Solr)
  • ☑️ Create and establish the preliminary project architecture
  • ☑️ Research whether Apache Flume supports metadata extraction and different log formats
  • ☑️ Create a prototype Solr project using Docker and SolrJ (see the sketch after this list)
  • ☑️ Request access to the AWS infrastructure
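
A minimal sketch of what such a SolrJ prototype might look like, assuming a Solr instance running in a local Docker container and a core named logs; the core name and fields are illustrative, not the project's actual configuration.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrPrototype {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr instance started via Docker and a core called "logs"
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build();

        // Index one sample document
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("message", "sample log line");
        solr.add(doc);
        solr.commit();

        // Query it back
        System.out.println(solr.query(new SolrQuery("message:sample")).getResults());
        solr.close();
    }
}
```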

Week 2 (26.02.2018 -> 02.03.2018)

  • ☑️ Discuss with Ciprian D. (the coordinating professor) to get approval on the project architecture and features (last week's progress)
  • ☑️ Test Solr with a manual configuration to understand the flow (results presented here)
  • ☑️ Research different log types and design a single, structured data transfer object (see the sketch after this list)
  • ☑️ Research Solr Analyzers
  • ☑️ Get access to AWS and Splunk
  • ☑️ Research Morphlines and the MapReduceIndexerTool
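
A minimal sketch of what such a structured log-event transfer object could look like; the field names (timestamp, level, source, message) are assumptions for illustration, not the project's final model.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical unified log-event model; fields are illustrative only.
public class LogEvent {
    public String timestamp;
    public String level;
    public String source;   // e.g. the service or file the event came from
    public String message;  // the raw (possibly multiline) log text

    public String toJson() throws Exception {
        // Serialize to the JSON form that later gets indexed into Solr
        return new ObjectMapper().writeValueAsString(this);
    }
}
```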

Week 3 (5.03.2018 -> 9.03.2018)

  • ☑️ Discuss with Andrei F. to get access to the AWS infrastructure
  • ☑️ Bootstrap the infrastructure and get access to Cloudera dashboard
  • ☑️ Analyze the current infrastructure
  • ☑️ Install Flume through Cloudera (this needs to be made persistent in the future)

Week 4 (12.03.2018 -> 16.03.2018)

  • ☑️ Analyze all Keystone logs to see their format
  • ☑️ Create a simulator that takes a large log file, splits it into multiple files and archives them (see the sketch after this list)
  • ☑️ Create a parser that decompresses each archive and converts each log event into a structured JSON model
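
A rough sketch of how such a simulator might split and archive a log file, assuming plain gzip chunks; the chunk size and file naming are illustrative only.

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

public class LogSimulator {
    // Split a large log file into gzip-compressed chunks of `linesPerChunk` lines each.
    public static void split(Path input, Path outDir, int linesPerChunk) throws IOException {
        Files.createDirectories(outDir);
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            int count = 0, chunk = 0;
            BufferedWriter writer = newChunkWriter(outDir, chunk);
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
                if (++count % linesPerChunk == 0) {
                    writer.close();
                    writer = newChunkWriter(outDir, ++chunk);
                }
            }
            writer.close();
        }
    }

    private static BufferedWriter newChunkWriter(Path outDir, int chunk) throws IOException {
        OutputStream out = new GZIPOutputStream(
                Files.newOutputStream(outDir.resolve("chunk-" + chunk + ".log.gz")));
        return new BufferedWriter(new OutputStreamWriter(out));
    }
}
```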

Week 5 (19.03.2018 -> 23.03.2018)

  • ☑️ Test the parser along with the simulator
  • ☑️ Modify the parser to support high-speed archive ingestion
  • ☑️ Add support for multiline logs (e.g. stack traces), as sketched after this list
  • ☑️ Add parser functions that support only a subset of log formats (ongoing work)
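
One common way to group multiline events is to treat any line that does not start with a timestamp as a continuation of the previous event; the sketch below follows that convention as an assumption, not necessarily the parser's actual logic.

```java
import java.util.*;
import java.util.regex.Pattern;

public class MultilineGrouper {
    // Assumed convention: a new event starts with an ISO-like date, e.g. "2018-03-21 ...".
    private static final Pattern EVENT_START = Pattern.compile("^\\d{4}-\\d{2}-\\d{2}");

    // Collapse raw lines into logical events; stack-trace lines get appended to the previous event.
    public static List<String> group(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (EVENT_START.matcher(line).find() && current.length() > 0) {
                events.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append('\n');
            current.append(line);
        }
        if (current.length() > 0) events.add(current.toString());
        return events;
    }
}
```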

Week 6 (26.03.2018 -> 30.03.2018)

  • ☑️ Researched the Grok parser as the final parsing solution
  • ☑️ Created a test project to illustrate the functionality of Grok (see the sketch after this list)
  • ☑️ Started analyzing the Flume configuration
  • ☑️ Modified the Flume config file to send data to a specific HDFS directory
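
A minimal example of what the Grok test project might exercise, assuming the java-grok library and one of its built-in patterns; the pattern and sample line are illustrative.

```java
import io.krakens.grok.api.Grok;
import io.krakens.grok.api.GrokCompiler;
import io.krakens.grok.api.Match;
import java.util.Map;

public class GrokDemo {
    public static void main(String[] args) {
        GrokCompiler compiler = GrokCompiler.newInstance();
        compiler.registerDefaultPatterns(); // load the patterns bundled with java-grok

        // Parse an Apache-style access log line into named fields
        Grok grok = compiler.compile("%{COMMONAPACHELOG}");
        Match match = grok.match(
                "127.0.0.1 - - [10/Apr/2018:13:55:36 +0000] \"GET /index.html HTTP/1.1\" 200 2326");
        Map<String, Object> fields = match.capture();
        fields.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```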

Week 7 (02.04.2018 -> 06.04.2018)

  • ☑️ Changed the Flume config to send an entire blob (file) to HDFS, with an established maximum size (see the sketch after this list)
  • ☑️ Tested the LogGenerator along with the LogParser and with Flume
  • ☑️ Modified both projects accordingly so that they pass the functionality test
  • ☑️ Researched the MapReduceIndexerTool job
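
A rough sketch of the kind of Flume agent configuration this implies, assuming a spooling-directory source and an HDFS sink; the agent, channel and path names are made up for illustration and the roll size is a placeholder.

```properties
# Hypothetical Flume agent: picks up finished files and writes them to HDFS as whole blobs
agent.sources = src
agent.channels = ch
agent.sinks = hdfsSink

agent.sources.src.type = spooldir
agent.sources.src.spoolDir = /var/log/ingest
agent.sources.src.channels = ch

agent.channels.ch.type = memory

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = ch
agent.sinks.hdfsSink.hdfs.path = /user/flume/logs
agent.sinks.hdfsSink.hdfs.fileType = DataStream
# Roll only by size (bytes); 0 disables time- and event-count-based rolling
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
agent.sinks.hdfsSink.hdfs.rollInterval = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
```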

Week 8 (09.04.2018 -> 13.04.2018)

  • ☑️ Researched the Morphline concept along with the MapReduceIndexerTool
  • ☑️ Created a Morphline config file that matches the project's needs (see the sketch after this list)
  • ☑️ Created a script that starts the indexing job
  • ☑️ Debugged a strange MapReduceIndexerTool error with the help of Stack Overflow
  • ☑️ Managed to run the indexing job in --dry-run mode (without loading into Solr)
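
A sketch of what such a Morphline file could contain, assuming the JSON-serialized log events are read record by record and pushed to Solr; the collection settings and field paths are illustrative only.

```
# Hypothetical morphline: read JSON log events and load them into Solr
SOLR_LOCATOR : {
  collection : logs
  zkHost : "zk-host:2181/solr"
}

morphlines : [
  {
    id : indexLogs
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { readJson {} }
      # Flatten the assumed JSON fields into Solr fields
      { extractJsonPaths {
          flatten : true
          paths : {
            timestamp : /timestamp
            level : /level
            message : /message
          }
      } }
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```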

Week 9 (16.04.2018 -> 20.04.2018)

  • ☑️ Modified the Morphline config file to load the index into Solr
  • ☑️ Created a script that generates a Solr Config file
  • ☑️ Created a script that creates a Solr Core based on a generated configuration
  • ☑️ Adapted the default Solr schema to match the serialized JSON model of a log event
  • ☑️ Ran the entire flow and checked index correctness on small files containing a few models (a spot-check sketch follows this list)
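
A small spot-check along these lines could confirm that the indexed documents look right, again assuming a core named logs and the illustrative field names used in the earlier sketches.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class IndexSpotCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(5); // just peek at a handful of documents
            for (SolrDocument doc : solr.query(query).getResults()) {
                System.out.println(doc.getFieldValue("timestamp") + " " + doc.getFieldValue("message"));
            }
        }
    }
}
```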

Week 10 (23.04.2018 -> 27.04.2018)

  • ☑️ Implemented the Grok engine in the project parser
  • ☑️ Created a mock-up client project
  • ☑️ Implemented a TarGz decompressor (see the sketch after this list)
  • ☑️ Implemented a Zip decompressor
  • ☑️ Created unit tests (using JUnit) for all decompressors
  • ☑️ Presented the current progress to the coordinating professor
  • ☑️ Discussed triggering the indexing job via SQS with Cosmin R.
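
A possible shape for the TarGz decompressor, assuming Apache Commons Compress is on the classpath; the output layout is illustrative.

```java
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

import java.io.*;
import java.nio.file.*;

public class TarGzDecompressor {
    // Extract every regular file from a .tar.gz archive into the target directory.
    public void decompress(Path archive, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        try (InputStream in = Files.newInputStream(archive);
             TarArchiveInputStream tar = new TarArchiveInputStream(
                     new GzipCompressorInputStream(new BufferedInputStream(in)))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue;
                Path out = targetDir.resolve(entry.getName());
                Files.createDirectories(out.getParent());
                // The tar stream only exposes the current entry's bytes here
                Files.copy(tar, out, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```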

Week 11 (30.04.2018 -> 4.05.2018)

  • ☑️ Created a set of configuration scripts that prepare the newly created infrastructure
  • ☑️ Created the daemon that runs on the HDFS machine and triggers the indexing job (IndexTrigger)
  • ☑️ Started building a desktop app UI using Java Swing

Week 12 (07.05.2018 -> 11.05.2018)

  • ☑️ Replaced the Swing UI with a JavaFX one (because it's more flexible)
  • ☑️ Tried to fix the QE cluster with Dragos C. and Vlad C.
  • ☑️ Built the presentation for Scientific Communication Session 2018
  • ☑️ Built a document with the initial work on this project

Week 13 (14.05.2018 -> 18.05.2018)

  • ☑️ Added Apache Commons CLI support to the client project (see the sketch after this list)
  • ☑️ Added a custom control to the desktop UI for the search fields
  • ☑️ Developed the SolrAPI class on the client
  • ☑️ Created custom classes for each command
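
An illustrative Commons CLI setup for a search-style client; the option names (for example --query and --export) are hypothetical, not the project's actual command set.

```java
import org.apache.commons.cli.*;

public class ClientCli {
    public static void main(String[] args) throws ParseException {
        Options options = new Options();
        options.addOption(Option.builder("q").longOpt("query").hasArg().desc("Solr query string").build());
        options.addOption(Option.builder("e").longOpt("export").hasArg().desc("export results to the given file").build());

        CommandLine cmd = new DefaultParser().parse(options, args);
        if (cmd.hasOption("q")) {
            System.out.println("Would run query: " + cmd.getOptionValue("q"));
        } else {
            new HelpFormatter().printHelp("client", options);
        }
    }
}
```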

Week 14 (21.05.2018 -> 25.05.2018)

  • ☑️ Created the Spring Boot REST API in the HadoopDriver project (the index trigger); a sketch follows this list
  • ☑️ Fixed some package collisions that were leading to a corrupt fat JAR
  • ☑️ Developed a HadoopRestAPI client in the client project
  • ☑️ Tested the API to make sure that the index-now and index-interval commands work as expected
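
A sketch of what the HadoopDriver endpoints might look like as a Spring Boot controller; the paths and request parameters are assumptions based on the index-now and index-interval commands mentioned above.

```java
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/index")
public class IndexController {

    // Hypothetical endpoint: trigger the indexing job immediately
    @PostMapping("/now")
    public String indexNow() {
        // In the real project this would kick off the indexing job on the Hadoop side
        return "indexing job started";
    }

    // Hypothetical endpoint: (re)configure a periodic indexing interval, in minutes
    @PostMapping("/interval")
    public String indexInterval(@RequestParam("minutes") long minutes) {
        return "indexing scheduled every " + minutes + " minutes";
    }
}
```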

Week 15 (28.05.2018 -> 01.06.2018)

  • ☑️ Created a command executor model in the client
  • ☑️ Developed the export command in both the CLI and the GUI
  • ☑️ Implemented merge logic between the time-interval and date-interval commands
  • ☑️ Started to work on the official documentation

Week 16 (04.06.2018 -> 08.06.2018)

  • ☑️ Added S3 download functionality to the parser
  • ☑️ Added SQS receive logic to the parser (see the sketch after this list)
  • ☑️ Implemented the processing logic for each archive using an ExecutorService
  • ☑️ Created a test S3 bucket and tested the developed workflow
  • ☑️ Created a multithreaded DataGenerator project that generates archives and uploads them to S3
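
A condensed sketch of the receive-and-process loop this implies, assuming the AWS SDK for Java v1 and that each SQS message body carries the S3 key of a newly uploaded archive; the bucket name, queue URL and the processing step are placeholders.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ArchiveIngestor {
    private static final String QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/archives"; // placeholder
    private static final String BUCKET = "log-archives"; // placeholder

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        while (true) {
            for (Message msg : sqs.receiveMessage(QUEUE_URL).getMessages()) {
                String key = msg.getBody(); // assumed to be the S3 object key of the archive
                pool.submit(() -> {
                    File local = new File("/tmp", key.replace('/', '_'));
                    s3.getObject(new GetObjectRequest(BUCKET, key), local);
                    // ... decompress and parse the downloaded archive here ...
                });
                sqs.deleteMessage(QUEUE_URL, msg.getReceiptHandle());
            }
        }
    }
}
```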

Week 17 (11.06.2018 -> 15.06.2018)

  • ☑️ Finalized the client side and tested it manually
  • ☑️ Implemented the Job Scheduler on the Hadoop Driver (using a timer controllable by the client); a sketch follows this list
  • ☑️ Developed logic to detect when the indexing job has finished
  • ☑️ Diagnosed a log4j deadlock (Call Appenders) caused by a missing log4j.properties file
  • ☑️ Accelerated the work on the documentation
  • ☑️ Worked on the presentation for the Keystone team
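
One plausible shape for a client-controllable scheduler, sketched with a ScheduledExecutorService; the runIndexingJob call stands in for whatever actually launches the indexing job and is an assumption.

```java
import java.util.concurrent.*;

public class IndexJobScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> current;

    // Called by the REST API when the client sends an index-interval command.
    public synchronized void reschedule(long minutes) {
        if (current != null) {
            current.cancel(false); // let a running job finish, but drop the old schedule
        }
        current = executor.scheduleAtFixedRate(this::runIndexingJob, 0, minutes, TimeUnit.MINUTES);
    }

    // Called for an index-now command.
    public void triggerNow() {
        executor.execute(this::runIndexingJob);
    }

    private void runIndexingJob() {
        // Placeholder: in the real project this starts the indexing job
        // and then watches for it to finish before reporting status.
        System.out.println("indexing job triggered");
    }
}
```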

Week 18 (18.06.2018 -> 22.06.2018)

  • ☑️ Presented the project to the Keystone team
  • ☑️ Worked on the documentation
  • ☑️ Loaded 34.7 GB of data into the system and tested the entire data flow
  • ☑️ Worked on the Faculty formalities regarding the diploma exam

Week 19 (25.06.2018 -> 29.06.2018)

  • ☑️ Finalized and delivered the documentation
  • ☑️ Loaded 120 GB of data and tested the entire workflow
  • ☑️ Created the official presentation for the next week