The system is decentralized: all nodes are worker nodes.
The entire scheduling component is based on Quartz (job store = MySQL). Users set the scheduling cron
for workflows through the UI; when the cron fires, Quartz triggers the schedule and adds the workflow
to an in-memory PriorityQueue, where it waits for execution. Refer to the code in
com.flink.platform.web.quartz.JobFlowRunner.
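
To make that hand-off concrete, here is a minimal sketch of a Quartz job that pushes a workflow into an in-memory PriorityQueue when its cron fires. Apart from the JobFlowRunner package path referenced above, all class, field, and key names are assumptions for illustration, not the actual implementation:

```java
import java.util.concurrent.PriorityBlockingQueue;

import org.quartz.Job;
import org.quartz.JobExecutionContext;

// Sketch only: shows the cron-trigger -> in-memory queue hand-off described above.
public class JobFlowRunnerSketch implements Job {

    // Shared in-memory queue polled by a separate executor; earlier submissions run first.
    private static final PriorityBlockingQueue<WorkflowRun> EXECUTION_QUEUE =
            new PriorityBlockingQueue<>();

    @Override
    public void execute(JobExecutionContext context) {
        // Assumed: the workflow id was stored in the Quartz JobDataMap when the cron was registered.
        long flowId = context.getJobDetail().getJobDataMap().getLong("flowId");

        // Enqueue a run instance; worker threads poll the queue and execute the workflow.
        EXECUTION_QUEUE.offer(new WorkflowRun(flowId, System.currentTimeMillis()));
    }

    /** Minimal run holder ordered by submission time. */
    static class WorkflowRun implements Comparable<WorkflowRun> {
        final long flowId;
        final long submitTimeMillis;

        WorkflowRun(long flowId, long submitTimeMillis) {
            this.flowId = flowId;
            this.submitTimeMillis = submitTimeMillis;
        }

        @Override
        public int compareTo(WorkflowRun other) {
            return Long.compare(this.submitTimeMillis, other.submitTimeMillis);
        }
    }
}
```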
MySQL holds all info about jobs, users, resources, schedules, etc.
HDFS holds resource files uploaded by users; in the future I will also store job logs in HDFS.
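
As a rough illustration of the resource-upload path, the following sketch copies an uploaded file into HDFS with the standard Hadoop FileSystem API. The namenode address and paths are placeholders, not values from this project:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: copy a user-uploaded resource file into HDFS.
public class ResourceUploadSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; in practice this comes from the cluster configuration.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Placeholder paths: local file received from the UI, per-user target directory.
            Path local = new Path("/tmp/uploads/my-udf.jar");
            Path remote = new Path("/flink-platform/resources/user-1/my-udf.jar");
            fs.copyFromLocalFile(local, remote);
        }
    }
}
```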
To keep the system simple, I'm not using components like ZooKeeper to guarantee fault tolerance.
All instances of flink-platform-web communicate with MySQL, so I want to rely on MySQL for system
fault tolerance. Currently, you can restart or add flink-platform-web instances arbitrarily without
affecting the execution of workflows: unfinished workflows are reloaded and executed after the
flink-platform-web node restarts. Refer to the code in
com.flink.platform.web.service.InitJobFlowScheduler.
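
The recovery idea can be sketched roughly as follows: on startup, an instance asks MySQL for workflow runs assigned to its own host that are not yet in a terminal state, then re-submits them to the execution queue. The column names and status values below are assumptions for illustration; only the t_job_flow_run table name is taken from this documentation:

```java
import java.net.InetAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch only: reload unfinished workflow runs assigned to this host and re-queue them.
public class InitJobFlowSchedulerSketch {

    public void reloadUnfinishedWorkflows(String jdbcUrl, String user, String password) throws Exception {
        String host = InetAddress.getLocalHost().getHostAddress();

        // Assumed column names (id, host, status) and status values; only the table
        // name t_job_flow_run appears in this documentation.
        String sql = "SELECT id FROM t_job_flow_run "
                + "WHERE host = ? AND status NOT IN ('SUCCESS', 'FAILED', 'KILLED')";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, host);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Re-submit each unfinished run to the same in-memory queue
                    // that cron-triggered runs go through.
                    resubmit(rs.getLong("id"));
                }
            }
        }
    }

    private void resubmit(long flowRunId) {
        // Placeholder for pushing the run back onto the execution queue.
    }
}
```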
For scaling operations, you currently have to migrate unfinished workflows to other
flink-platform-web instances manually (change the host IP in the t_job_flow_run table and restart
the target instance), as sketched below. I will implement automatic migration based on MySQL in the future.
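
A minimal sketch of that manual migration step (the host column name and status values are assumptions; only the table name t_job_flow_run comes from this documentation):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch only: reassign unfinished workflow runs from a removed host to a live one.
public class MigrateUnfinishedRuns {

    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        String jdbcUrl = "jdbc:mysql://localhost:3306/flink_platform";

        // Assumed column name (host) and status values; only the table name
        // t_job_flow_run comes from this documentation.
        String sql = "UPDATE t_job_flow_run SET host = ? "
                + "WHERE host = ? AND status NOT IN ('SUCCESS', 'FAILED', 'KILLED')";

        try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, "192.168.1.11"); // target instance that takes over the runs
            stmt.setString(2, "192.168.1.10"); // instance being removed
            int updated = stmt.executeUpdate();
            System.out.println("Reassigned " + updated + " unfinished workflow runs");
        }
        // Then restart the target instance so InitJobFlowScheduler reloads the reassigned runs.
    }
}
```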