Flink:执行模式(译)

TODO

执行行为

状态后端/状态

事件时间/水印

When it comes to supporting event time, Flink’s streaming runtime builds on the pessimistic assumption that events may come out-of-order, i.e. an event with timestamp t may come after an event with timestamp t+1. Because of this, the system can never be sure that no more elements with timestamp t < T for a given timestamp T can come in the future. To amortise the impact of this out-of-orderness on the final result while making the system practical, in STREAMING mode, Flink uses a heuristic called Watermarks. A watermark with timestamp T signals that no element with timestamp t < T will follow.

TODO

In BATCH mode, where the input dataset is known in advance, there is no need for such a heuristic as, at the very least, elements can be sorted by timestamp so that they are processed in temporal order. For readers familiar with streaming, in BATCH we can assume “perfect watermarks”.

TODO

Given the above, in BATCH mode, we only need a MAX_WATERMARK at the end of the input associated with each key, or at the end of input if the input stream is not keyed. Based on this scheme, all registered timers will fire at the end of time and user-defined WatermarkAssigners or WatermarkStrategies are ignored.

TODO

处理时间

Processing Time is the wall-clock time on the machine that a record is processed, at the specific instance that the record is being processed. Based on this definition, we see that the results of a computation that is based on processing time are not reproducible. This is because the same record processed twice will have two different timestamps.

处理时间是记录被处理时机器的时钟,。根据该定义,我们可以看出基于处理时间的运算结果是不能复现的。这是因为,同一记录被两次处理,运算结果会有不同的时间戳。

Despite the above, using processing time in STREAMING mode can be useful. The reason has to do with the fact that streaming pipelines often ingest their unbounded input in real time so there is a correlation between event time and processing time. In addition, because of the above, in STREAMING mode 1h in event time can often be almost 1h in processing time, or wall-clock time. So using processing time can be used for early (incomplete) firings that give hints about the expected results.

尽管如此,在STREAMING模式中使用处理时间仍然是有用的。原因是:流式管道(Pipeline)通常是实时地摄入无界输入,所以事件时间与处理时间之间存在一定的关系。而且,在STREAMING模式中,1小时的事件时间通常几乎也是1小时的处理时间。所以,使用处理时间

This correlation does not exist in the batch world where the input dataset is static and known in advance. Given this, in BATCH mode we allow users to request the current processing time and register processing time timers, but, as in the case of Event Time, all the timers are going to fire at the end of the input.

TODO

Conceptually, we can imagine that processing time does not advance during the execution of a job and we fast-forward to the end of time when the whole input is processed.

TODO

参考

  1. https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/datastream_execution_mode.html