Flink:执行模式(译)
TODO
执行行为
状态后端/状态
事件时间/水印
When it comes to supporting event time, Flink’s streaming runtime builds on the pessimistic assumption that events may come out-of-order, i.e. an event with timestamp
t
may come after an event with timestampt+1
. Because of this, the system can never be sure that no more elements with timestampt < T
for a given timestampT
can come in the future. To amortise the impact of this out-of-orderness on the final result while making the system practical, inSTREAMING
mode, Flink uses a heuristic called Watermarks. A watermark with timestampT
signals that no element with timestampt < T
will follow.
TODO
In
BATCH
mode, where the input dataset is known in advance, there is no need for such a heuristic as, at the very least, elements can be sorted by timestamp so that they are processed in temporal order. For readers familiar with streaming, inBATCH
we can assume “perfect watermarks”.
TODO
Given the above, in
BATCH
mode, we only need aMAX_WATERMARK
at the end of the input associated with each key, or at the end of input if the input stream is not keyed. Based on this scheme, all registered timers will fire at the end of time and user-definedWatermarkAssigners
orWatermarkStrategies
are ignored.
TODO
处理时间
Processing Time is the wall-clock time on the machine that a record is processed, at the specific instance that the record is being processed. Based on this definition, we see that the results of a computation that is based on processing time are not reproducible. This is because the same record processed twice will have two different timestamps.
处理时间是记录被处理时机器的时钟,。根据该定义,我们可以看出基于处理时间的运算结果是不能复现的。这是因为,同一记录被两次处理,运算结果会有不同的时间戳。
Despite the above, using processing time in
STREAMING
mode can be useful. The reason has to do with the fact that streaming pipelines often ingest their unbounded input in real time so there is a correlation between event time and processing time. In addition, because of the above, inSTREAMING
mode1h
in event time can often be almost1h
in processing time, or wall-clock time. So using processing time can be used for early (incomplete) firings that give hints about the expected results.
尽管如此,在STREAMING
模式中使用处理时间仍然是有用的。原因是:流式管道(Pipeline)通常是实时地摄入无界输入,所以事件时间与处理时间之间存在一定的关系。而且,在STREAMING
模式中,1小时的事件时间通常几乎也是1小时的处理时间。所以,使用处理时间
This correlation does not exist in the batch world where the input dataset is static and known in advance. Given this, in
BATCH
mode we allow users to request the current processing time and register processing time timers, but, as in the case of Event Time, all the timers are going to fire at the end of the input.
TODO
Conceptually, we can imagine that processing time does not advance during the execution of a job and we fast-forward to the end of time when the whole input is processed.
TODO