After setting up a Flink standalone cluster, the TaskManager kept dying for no apparent reason, which caused jobs to fail. The logs showed:
2018-10-25 03:27:05,197 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2018-10-25 03:27:05,197 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2018-10-25 03:27:05,197 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2018-10-25 03:27:05,200 ERROR akka.actor.ActorSystemImpl - Uncaught error from thread [flink-akka.actor.default-dispatcher-31]: Compressed class space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[flink]
java.lang.OutOfMemoryError: Compressed class space
2018-10-25 03:27:05,200 ERROR akka.actor.ActorSystemImpl - Uncaught error from thread [flink-akka.actor.default-dispatcher-212]: Compressed class space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[flink]
java.lang.OutOfMemoryError: Compressed class space
2018-10-25 03:27:05,200 ERROR akka.actor.ActorSystemImpl - Uncaught error from thread [flink-akka.actor.default-dispatcher-22]: Compressed class space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[flink]
java.lang.OutOfMemoryError: Compressed class space
2018-10-25 03:27:05,225 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-d7a472b1-6c12-4149-aa56-997c0963ceb5
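A fatal error caused the JVM process to be shut down. The cause was java.lang.OutOfMemoryError: Compressed class space.
Flink uses the Akka framework for distributed RPC (see the references for details). My guess is that the following two parameters are responsible:
My job is a streaming job with checkpointing enabled. When it hits an error at runtime it retries continuously, and every retry communicates with the JobManager over Akka, so the bottleneck is reached easily. (The program mainly does data processing and did not catch exceptions raised by malformed records, so it kept throwing exceptions and kept retrying.)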
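The fix: first, moderately increase the default configuration values; second, and more importantly, catch exceptions in the application. The job has now run normally for 20 days; I will add an update if further problems appear.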
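As an illustration of catching exceptions in the application, here is a minimal sketch in plain Java. It is not the original job's code: the class name SafeRecordParser, the methods parse and parseAll, and the choice of Long.parseLong as the processing step are all hypothetical. The idea is that a malformed record is logged and skipped instead of throwing, so it never fails the task and never triggers a restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Flink: wraps a per-record
// processing step so that bad input is skipped rather than thrown.
class SafeRecordParser {

    // Assumed processing step: parse a raw line into a number.
    // Returns Optional.empty() for malformed or null input.
    static Optional<Long> parse(String raw) {
        try {
            return Optional.of(Long.parseLong(raw.trim()));
        } catch (RuntimeException e) {
            // Covers NumberFormatException and the NullPointerException
            // raised when raw is null. Log and move on.
            System.err.println("Skipping malformed record: " + raw);
            return Optional.empty();
        }
    }

    // Processes a batch and keeps only the records that parsed cleanly.
    static List<Long> parseAll(List<String> rawRecords) {
        List<Long> out = new ArrayList<>();
        for (String raw : rawRecords) {
            parse(raw).ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1", " 2 ", "oops", null, "40");
        System.out.println(parseAll(input)); // prints [1, 2, 40]
    }
}
```

In a Flink job the same try/catch would typically live inside the user function that processes each record (for example a flatMap that emits nothing for a bad record), so one bad record no longer fails the whole task.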
References: https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors
https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html