排查问题:提交hadoop作业偶尔失败

用户反馈偶尔有作业提交失败,一般运行10-20分钟就出现失败提醒,作业是在系统A提交的hive sql语句,系统A的日志和hiveserver2的日志输出是一样的,都提示YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user xxx at 10.10.x.x
hiveserver2日志如下:

2023-01-18T05:14:04,073 ERROR [HiveServer2-Background-Pool: Thread-917734] exec.Task: Job Submission failed with exception 'java.io.IOException(org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x)'
java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x
	at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:345)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:254)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:576)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:571)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:571)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:562)
	at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:411)
	at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1232)
	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:255)
	at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : Application application_xxx was killed by user userxxx at 10.x.x.x
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:304)
	at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:299)
	at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:330)
	... 35 more

提交了作业20分钟被kill了,YARN出错了?作业超时了?用户主动kill了?很难说,不清楚,hiveserver2提交作业是和Hadoop YARN交互,先检查下YARN日志再说。
YARN日志如下:

23/01/18 05:13:37 INFO resourcemanager.ClientRMService: Allocated new applicationId: 449050
23/01/18 05:13:38 INFO rmapp.RMAppImpl: Storing application with id application_xxx
23/01/18 05:13:38 INFO rmapp.RMAppImpl: application_xxx State change from NEW to NEW_SAVING on event = START
...
...
23/01/18 05:13:44 INFO rmapp.RMAppImpl: Updating application application_xxx with final state: KILLED
23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from NEW_SAVING to FINAL_SAVING on event = KILL
23/01/18 05:13:44 INFO rmapp.RMAppImpl: application_xxx State change from FINAL_SAVING to FINISHING on event = APP_UPDATE_SAVED

日志没有异常信息,日志打印的原因仍然是收到kill event,和hiveserver2是一致的。经询问系统A本身没有kill功能,那为啥会提示kill,真是用户主动kill的?不敢随意质问用户,会不会超时了,超时会有timeout之类的日志吧,会不会某情况下触发了YARN内部机制然后kill了,先看看YARN源码吧,看看kill event的可能来源。直接搜索关键词“State change from”,状态机变更一般都会在一个地方实现,不会搜出很多处

找到变量定义STATE_CHANGE_MESSAGE,查找引用,找到打印日志的地方

根据日志是收到了kill event,查看kill event的定义

public enum RMAppEventType {
  // Source: ClientRMService
  START,
  RECOVER,
  KILL,

  // Source: Scheduler and RMAppManager
  APP_REJECTED,

查找所有引用kill的地方,发现2处调用(可能不全,未完整分析),一是来源RMAppKillByClientEvent,另外一个AbstractYarnScheduler

RMAppKillByClientEvent来源ClientRMService,根据其类注释可知是给YARN客户端调用

/**
 * The client interface to the Resource Manager. This module handles all the rpc
 * interfaces to the resource manager from the client.
 */
public class ClientRMService extends AbstractService implements



从日志打印来看,和搜索到的日志是匹配的,是用户主动kill了?
看看另外一处kill场景AbstractYarnScheduler:

如果来源是AbstractYarnScheduler,日志会有diagnostic信息,且该方法是kill整个queue的作业,从日志来看不是kill不是来源于此,还是用户或某程序主动概率大。
第二天用户告之,作业经常卡在NEW_SAVING状态,然后写了个shell用于kill NEW_SAVING状态的作业,这都可以。。。这么暴力。。。原因找打了。

展开阅读全文

页面更新:2024-03-23

标签:作业   主动   定义   状态   提示   来源   地方   用户   系统   日志

1 2 3 4 5

上滑加载更多 ↓
推荐阅读:
友情链接:
更多:

本站资料均由网友自行发布提供,仅用于学习交流。如有版权问题,请与我联系,QQ:4156828  

© CopyRight 2020-2024 All Rights Reserved. Powered By 71396.com 闽ICP备11008920号-4
闽公网安备35020302034903号

Top