How Spark Submits Jobs

Posted on 2020-08-24 Edited on 2021-10-12 In spark

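This post describes how Spark submits the code we define to the cluster in the form of jobs. It focuses on job submission; how tasks run once a job has been submitted is covered in detail in "How Spark Runs Tasks" (《Spark运行任务原理》).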

LiveListenerBus

OutputCommitCoordinator

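// The key is the stageId; the value is that stage's StageState
// StageState records a stage's commit state: the authorized committer of each partition and the failed attempts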
private val stageStates = mutable.Map[Int, StageState]()

MapOutputTrackerMaster

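This class tracks where the map outputs of a stage are located. The DAGScheduler uses it to register and unregister map output statuses, and to look up statistics for locality-aware scheduling of reduce tasks. Internally, MapOutputTrackerMaster records the state of each shuffle in a ShuffleStatus.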

// Registered shuffle statuses: the key is the shuffleId, the value is a ShuffleStatus
val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
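
As a rough illustration of this bookkeeping, the sketch below keeps one status record per shuffleId in a concurrent map. `SimpleShuffleStatus` and `SimpleMapOutputTracker` are hypothetical, heavily simplified stand-ins and not Spark's classes; only the idea of a map keyed by shuffleId comes from the text above.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, simplified stand-in for ShuffleStatus: one slot per map partition,
// holding the location of that partition's output once it has been registered.
final class SimpleShuffleStatus(numPartitions: Int) {
  private val locations = Array.fill[Option[String]](numPartitions)(None)

  def addMapOutput(mapId: Int, location: String): Unit = synchronized {
    locations(mapId) = Some(location)
  }

  def removeMapOutput(mapId: Int): Unit = synchronized {
    locations(mapId) = None
  }

  // Partitions whose map output is not (or no longer) available.
  def findMissingPartitions(): Seq[Int] = synchronized {
    locations.indices.filter(i => locations(i).isEmpty)
  }
}

// Hypothetical tracker keyed by shuffleId, mirroring the shape of shuffleStatuses.
final class SimpleMapOutputTracker {
  private val shuffleStatuses = new ConcurrentHashMap[Int, SimpleShuffleStatus]()

  def registerShuffle(shuffleId: Int, numMaps: Int): Unit = {
    // putIfAbsent returns the previous value, or null if the key was free.
    if (shuffleStatuses.putIfAbsent(shuffleId, new SimpleShuffleStatus(numMaps)) != null) {
      throw new IllegalArgumentException(s"Shuffle ID $shuffleId registered twice")
    }
  }

  def containsShuffle(shuffleId: Int): Boolean = shuffleStatuses.containsKey(shuffleId)

  def status(shuffleId: Int): SimpleShuffleStatus = shuffleStatuses.get(shuffleId)
}
```

A scheduler could ask `status(shuffleId).findMissingPartitions()` to decide which map tasks still have to run; this mirrors the idea, not the exact API, of the real tracker.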

ShuffleMapStage

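During DAG execution, a ShuffleMapStage is an intermediate stage that produces data for a shuffle. Its shuffleDep field identifies the shuffle that the stage belongs to; more precisely, the ShuffleMapStage writes the map-side intermediate data of the shuffle that shuffleDep represents. A shuffle dependency can therefore be seen as the reason a stage is created: one stage is created upstream of each shuffle dependency, and the map tasks that run at the end of that stage produce the intermediate data for the shuffle.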

DAGScheduler

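The DAGScheduler is the high-level, stage-oriented scheduling layer. For each job it computes a DAG of stages, keeps track of which RDDs and stage outputs have been materialized, and looks for the best schedule to run the job. It then submits the stages to the underlying TaskScheduler as TaskSets.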

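In addition, it resubmits upstream stages when a failure is caused by lost shuffle output files, and it cleans up the data structures associated with a job once the job has finished.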

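/**
 * Locations where RDD partitions are cached. In this map:
 * the key is the RDD ID; the value is a sequence indexed by partition number, and each
 * element is the set of locations where that partition is cached.
 * The map is cleared by clearCacheLocs(), for example each time a job is submitted.
 */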
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]

// Stages waiting to run because their upstream stages have not finished yet
private[scheduler] val waitingStages = new HashSet[Stage]

// Stages that are currently running
private[scheduler] val runningStages = new HashSet[Stage]


private[scheduler] val activeJobs = new HashSet[ActiveJob]

DAGSchedulerEventProcessLoop

It extends EventLoop[DAGSchedulerEvent] and serves as the event loop that receives events of type DAGSchedulerEvent.

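Internally, it uses a blocking queue as its message queue to receive the messages posted by callers. A dedicated thread processes all events and, depending on the message type, hands each received message to the DAGScheduler for further handling.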
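
The pattern can be sketched in a few lines. This is a minimal illustration, assuming a made-up event hierarchy (`SchedulerEvent`, `JobSubmittedEvent`, `StopEvent`) and a made-up class name `SimpleEventLoop`; Spark's EventLoop has more machinery (error handling, start/stop hooks) that is omitted here.

```scala
import java.util.concurrent.LinkedBlockingDeque

// Hypothetical event types; Spark's DAGSchedulerEvent hierarchy is much larger.
sealed trait SchedulerEvent
final case class JobSubmittedEvent(jobId: Int) extends SchedulerEvent
case object StopEvent extends SchedulerEvent

// A minimal event loop: callers post into a blocking queue, and a single
// dedicated thread takes events off the queue and dispatches them by type.
final class SimpleEventLoop(onJobSubmitted: Int => Unit) {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()

  private val thread = new Thread("simple-event-loop") {
    override def run(): Unit = {
      var running = true
      while (running) {
        queue.take() match { // blocks until an event is available
          case JobSubmittedEvent(jobId) => onJobSubmitted(jobId)
          case StopEvent => running = false
        }
      }
    }
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()

  // Called from any thread; never blocks on the handler itself.
  def post(event: SchedulerEvent): Unit = queue.put(event)

  // Events posted before stop() are still processed, in order.
  def stop(): Unit = {
    post(StopEvent)
    thread.join()
  }
}
```

Because only one thread ever runs the handlers, the handler code can mutate scheduler state without extra locking, which is the main appeal of this design.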

Submitting a Job

def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }

  val jobId = nextJobId.getAndIncrement() // Allocate a job ID
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }

  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted message to eventProcessLoop
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}

def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}

private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // Build the DAG of stages
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    ... ...
  }
  // Job submitted, clear internal data.
  barrierJobIdToNumTasksCheckFailures.remove(jobId)

  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))

  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
}
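
The split between the non-blocking submitJob and the blocking runJob can be sketched without Spark. In this simplified model, `SimpleJobWaiter` and `SimpleScheduler` are hypothetical names, and a plain worker thread stands in for posting JobSubmitted to the event loop; it only illustrates the "return a waiter, then block on its future" shape of the listings above.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration

// Hypothetical stand-in for JobWaiter: completes once every task has reported.
final class SimpleJobWaiter[U](val jobId: Int, totalTasks: Int, resultHandler: (Int, U) => Unit) {
  private val finishedTasks = new AtomicInteger(0)
  private val promise = Promise[Unit]()
  if (totalTasks == 0) promise.trySuccess(()) // a job with zero tasks is done immediately

  def completionFuture: Future[Unit] = promise.future

  def taskSucceeded(index: Int, result: U): Unit = {
    synchronized { resultHandler(index, result) }
    if (finishedTasks.incrementAndGet() == totalTasks) promise.trySuccess(())
  }

  def jobFailed(e: Exception): Unit = { promise.tryFailure(e); () }
}

object SimpleScheduler {
  private val nextJobId = new AtomicInteger(0)

  // Non-blocking: validate, allocate a job ID, hand the work off, return the waiter.
  def submitJob[U](numPartitions: Int, partitions: Seq[Int], func: Int => U,
      resultHandler: (Int, U) => Unit): SimpleJobWaiter[U] = {
    partitions.find(p => p >= numPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(s"Attempting to access a non-existent partition: $p")
    }
    val waiter = new SimpleJobWaiter[U](nextJobId.getAndIncrement(), partitions.size, resultHandler)
    // Stand-in for the event loop: run the "tasks" on another thread.
    val worker = new Thread(new Runnable {
      override def run(): Unit =
        try partitions.zipWithIndex.foreach { case (p, i) => waiter.taskSucceeded(i, func(p)) }
        catch { case e: Exception => waiter.jobFailed(e) }
    })
    worker.setDaemon(true)
    worker.start()
    waiter
  }

  // Blocking: submit, then wait for the completion future (rethrows a job failure).
  def runJob[U](numPartitions: Int, partitions: Seq[Int], func: Int => U,
      resultHandler: (Int, U) => Unit): Unit = {
    val waiter = submitJob(numPartitions, partitions, func, resultHandler)
    Await.result(waiter.completionFuture, Duration.Inf)
  }
}
```

As in the real code, results are delivered through `resultHandler` by output index rather than returned, so the caller decides how to collect them.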

Building the Stage Graph

/**
 * Create the ResultStage needed to compute the RDD
 */
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd)
  checkBarrierStageWithRDDChainPattern(rdd, partitions.toSet.size)
  // First, compute the list of upstream stages of the current rdd
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  // Create the ResultStage based on the current rdd
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  stageIdToStage(id) = stage
  updateJobIdStageIdMaps(jobId, stage)
  stage
}

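/**
 * Get or create the list of upstream stages of the given RDD; the given rdd itself does not
 * belong to any of the returned parent stages.
 * The stages are built from the shuffle dependencies nearest to the rdd in the dependency graph.
 * Stages form a tree, and by the time this method returns, the stage tree under each stage
 * in the list has been fully built as well.
 */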
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  getShuffleDependencies(rdd).map { shuffleDep =>
    // Create the corresponding ShuffleMapStage for each shuffle dependency
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}

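/**
 * Traverses the rdd's dependency tree with an explicit stack and a loop (instead of recursion)
 * and returns the shuffle dependencies nearest to the given RDD.
 */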
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new ArrayStack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep // A ShuffleDependency is added to the result; this branch is not traversed any deeper
        case dependency =>
          waitingForVisit.push(dependency.rdd) // Otherwise, keep traversing one level deeper
      }
    }
  }
  parents
}

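/**
 * Find all shuffle dependencies upstream of the given rdd that have not been registered yet
 * and return them as a stack, with the nearest shuffle dependency at the bottom and the
 * farthest one at the top.
 */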
private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): ArrayStack[ShuffleDependency[_, _, _]] = {
  val ancestors = new ArrayStack[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new ArrayStack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      getShuffleDependencies(toVisit).foreach { shuffleDep =>
        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
          ancestors.push(shuffleDep)
          waitingForVisit.push(shuffleDep.rdd) // Keep traversing the ancestors depth-first
        } // Otherwise, the dependency and its ancestors have already been registered.
      }
    }
  }
  ancestors
}

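For example, the figure shows five RDDs, with arrows denoting dependencies. The nearest shuffle dependency of C is its dependency on B; the nearest shuffle dependency of D is the dependency on C; and the nearest shuffle dependencies of E are the dependencies on A and on C.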
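
The traversal can be tried on a toy lineage. The figure is not reproduced here, so the graph below is an assumption chosen only to be consistent with the three results stated above: B shuffles from A, C shuffles from B, D shuffles from C, and E depends narrowly on D and via a shuffle on A. `Node`, `Narrow`, `Shuffle` and `nearestShuffleDeps` are hypothetical names for this sketch.

```scala
import scala.collection.mutable

object Lineage {
  // Hypothetical miniature of an RDD lineage: every dependency is narrow or a shuffle.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Returns the shuffle dependencies nearest to `node`: narrow dependencies are
  // followed, while a shuffle dependency is recorded and ends that branch.
  def nearestShuffleDeps(node: Node): Set[Shuffle] = {
    val found = mutable.Set[Shuffle]()
    val visited = mutable.Set[Node]()
    var waiting = List(node) // an explicit stack instead of recursion
    while (waiting.nonEmpty) {
      val current = waiting.head
      waiting = waiting.tail
      if (visited.add(current)) {
        current.deps.foreach {
          case s: Shuffle => found += s
          case Narrow(parent) => waiting = parent :: waiting
        }
      }
    }
    found.toSet
  }

  // Assumed graph (see the lead-in); not taken from the original figure.
  val a = Node("A")
  val b = Node("B", List(Shuffle(a)))
  val c = Node("C", List(Shuffle(b)))
  val d = Node("D", List(Shuffle(c)))
  val e = Node("E", List(Narrow(d), Shuffle(a)))
}
```

Under this assumed graph, E reaches the shuffle on C only because the traversal walks through its narrow dependency on D, whereas C's own shuffle on B is never visited from E.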

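/**
 * Get or create the ShuffleMapStage that corresponds to the given shuffle dependency.
 * If a stage for the shuffle ID already exists in shuffleIdToMapStage, it is returned directly;
 * otherwise, this method creates the ShuffleMapStage and also creates any missing
 * ancestor ShuffleMapStages.
 */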
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    case Some(stage) =>
      stage

    case None =>
      // First, create all upstream stages
      // Note: getMissingAncestorShuffleDependencies() returns a stack, so the farthest ancestor stage is created first
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Then, create the stage for the current shuffle dependency
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}

/**
 * Create the ShuffleMapStage for the given shuffle dependency
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd // The RDD of the shuffle dependency becomes the rdd of the ShuffleMapStage, i.e. the rdd the map tasks run on
  checkBarrierStageWithDynamicAllocation(rdd)
  checkBarrierStageWithNumSlots(rdd)
  checkBarrierStageWithRDDChainPattern(rdd, rdd.getNumPartitions)
  val numTasks = rdd.partitions.length // A stage's parallelism equals the number of partitions of the last rdd in the stage
  val parents = getOrCreateParentStages(rdd, jobId) // Recursively build the parent stages that this stage depends on
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)

  if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
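
The effect of "get or create, parents first" can be seen in a small model. The sketch below is an assumption-laden simplification: it uses plain recursion instead of the explicit ancestor stack, and `Node`, `ShuffleDep`, `MapStage` and `Builder` are hypothetical names. It shows two properties of the listings above: a stage is created at most once per shuffleId, and parent stages receive smaller stage IDs than their children.

```scala
import scala.collection.mutable

object StageSketch {
  // Hypothetical miniature lineage: a shuffle dependency has an id and a map-side node.
  final case class Node(name: String, narrow: List[Node] = Nil, shuffles: List[ShuffleDep] = Nil)
  final case class ShuffleDep(shuffleId: Int, mapSide: Node)
  final case class MapStage(id: Int, shuffleId: Int, parents: List[MapStage])

  final class Builder {
    private var nextStageId = 0
    private val shuffleIdToMapStage = mutable.HashMap[Int, MapStage]()

    // Shuffle dependencies reachable from `node` through narrow dependencies only.
    private def nearestShuffleDeps(node: Node): List[ShuffleDep] =
      (node.shuffles ++ node.narrow.flatMap(nearestShuffleDeps)).distinct

    // One stage per shuffle dependency, created at most once and cached by shuffleId.
    def getOrCreateMapStage(dep: ShuffleDep): MapStage =
      shuffleIdToMapStage.get(dep.shuffleId) match {
        case Some(stage) => stage
        case None =>
          // Parents are resolved before the ID is allocated, so they get smaller IDs.
          val parents = nearestShuffleDeps(dep.mapSide).map(getOrCreateMapStage)
          val stage = MapStage(nextStageId, dep.shuffleId, parents)
          nextStageId += 1
          shuffleIdToMapStage(dep.shuffleId) = stage
          stage
      }
  }
}
```

Recursion is fine for a toy graph; on very deep lineages it could overflow the call stack, which is the reason the comments in the listings give for maintaining a stack by hand.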

Submitting Stages

/**
 * Submit a stage; before the stage itself is submitted, its missing ancestor stages are submitted recursively
 */
private def submitStage(stage: Stage) {
  // Find the earliest-created active job that needs this stage
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Find the unavailable stages among the nearest ancestors of the current stage
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) { // All upstream stages are available, so submit the current stage
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else { // Otherwise, recursively submit the unavailable nearest ancestor stages
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage // Mark the current stage as waiting
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

/**
 * Find the nearest upstream ancestor stages of the current stage and return those that are not available
 */
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new ArrayStack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      // Check whether any partition of this rdd is not cached
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) { // An unavailable stage is added to the result
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] => // Keep traversing through narrow dependencies
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
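
The resulting submission order is easiest to see on a chain of stages. The following sketch is a simplified model, assuming a hypothetical `Stage` with a mutable `available` flag and a hypothetical `stageFinished` callback that plays the role of stage completion; it is not the DAGScheduler itself.

```scala
import scala.collection.mutable

object SubmitSketch {
  // Hypothetical stage: `available` means its output has already been computed.
  final class Stage(val id: Int, val parents: List[Stage], var available: Boolean = false) {
    override def toString: String = s"Stage($id)"
  }

  final class Scheduler {
    val waitingStages = mutable.LinkedHashSet[Stage]()
    val runningStages = mutable.LinkedHashSet[Stage]()
    val submissionOrder = mutable.ArrayBuffer[Int]()

    private def missingParents(stage: Stage): List[Stage] =
      stage.parents.filterNot(_.available).sortBy(_.id)

    def submitStage(stage: Stage): Unit = {
      if (!waitingStages(stage) && !runningStages(stage)) {
        val missing = missingParents(stage)
        if (missing.isEmpty) {
          runningStages += stage       // stand-in for submitMissingTasks
          submissionOrder += stage.id
        } else {
          missing.foreach(submitStage) // missing parents go first
          waitingStages += stage       // revisited once a parent finishes
        }
      }
    }

    // Stand-in for stage completion: mark it available and retry waiting children.
    def stageFinished(stage: Stage): Unit = {
      stage.available = true
      runningStages -= stage
      val ready = waitingStages.filter(_.parents.contains(stage)).toList
      waitingStages --= ready
      ready.sortBy(_.id).foreach(submitStage)
    }
  }
}
```

Submitting the last stage of a three-stage chain starts only the first stage; the other two sit in `waitingStages` and are submitted one by one as their parents finish.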

Submitting Tasks

Let us walk through the submitMissingTasks method.

/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")

  // 1. Compute the IDs of the partitions of this stage that still need to be computed
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  runningStages += stage
  
  // 2. Notify the OutputCommitCoordinator that the stage is starting
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  
  // 3. Compute the TaskLocations for each partition
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  // 4. Create a StageInfo for this new stage attempt
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)

  // If there are tasks to execute, record the submission time of the stage. Otherwise,
  // post the event without the submission time, which indicates that this stage was
  // skipped.
  if (partitionsToCompute.nonEmpty) {
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  }
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // 5. Serialize the stage information as the task binary and broadcast it
  var taskBinary: Broadcast[Array[Byte]] = null
  var partitions: Array[Partition] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    var taskBinaryBytes: Array[Byte] = null
    // taskBinaryBytes and partitions are both affected by the checkpoint status. We need
    // this synchronization in case another concurrent job is checkpointing this RDD, so we get a
    // consistent view of both variables.
    RDDCheckpointData.synchronized {
      taskBinaryBytes = stage match {
        case stage: ShuffleMapStage =>
          JavaUtils.bufferToArray(
            closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
        case stage: ResultStage =>
          JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
      }

      partitions = stage.rdd.partitions
    }

    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage

      // Abort execution
      return
    case e: Throwable =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage

      // Abort execution
      return
  }

  // 6. Create one Task object per partition
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = partitions(id)
          stage.pendingPartitions += id
          // For a ShuffleMapStage, create a ShuffleMapTask for each partition
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = partitions(p)
          val locs = taskIdToLocations(id)
          // For a ResultStage, create a ResultTask for each partition
          new ResultTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
            stage.rdd.isBarrier())
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  // 7. Submit the tasks created above to the TaskScheduler
  if (tasks.size > 0) {
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)

    stage match {
      case stage: ShuffleMapStage =>
        logDebug(s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})")
        markMapStageJobsAsFinished(stage)
      case stage : ResultStage =>
        logDebug(s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})")
    }
    submitWaitingChildStages(stage)
  }
}
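
Stripped of serialization, broadcasting and listener events, the core of the method is "one task per partition that is still missing, or nothing at all". The sketch below illustrates only that part; `SimpleTask`, `SimpleTaskSet` and `buildTaskSet` are hypothetical names, and the set of finished partitions is passed in directly instead of being derived from map output state.

```scala
object TaskCreationSketch {
  final case class SimpleTask(stageId: Int, stageAttempt: Int, partitionId: Int,
      preferredLocs: Seq[String])
  final case class SimpleTaskSet(tasks: Seq[SimpleTask], stageId: Int, stageAttempt: Int)

  // Builds one task per partition that still has to be computed.
  // Returns None when nothing is missing, which corresponds to the branch where
  // the stage is marked as finished instead of a TaskSet being submitted.
  def buildTaskSet(
      stageId: Int,
      stageAttempt: Int,
      numPartitions: Int,
      finished: Set[Int],
      preferredLocs: Int => Seq[String]): Option[SimpleTaskSet] = {
    val partitionsToCompute = (0 until numPartitions).filterNot(finished)
    if (partitionsToCompute.isEmpty) {
      None
    } else {
      val tasks = partitionsToCompute.map { p =>
        SimpleTask(stageId, stageAttempt, p, preferredLocs(p))
      }
      Some(SimpleTaskSet(tasks, stageId, stageAttempt))
    }
  }
}
```

On a retry of a stage whose partitions 0 and 2 already succeeded, only the remaining partitions would get tasks, which is why a resubmitted stage can be much cheaper than its first attempt.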

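/**
 * Find the preferred locations for a specific partition of an rdd.
 * Three strategies are tried in order:
 * 1. If the partition is already cached, use the cache locations;
 * 2. otherwise, use the placement preferences defined by the RDD itself;
 * 3. otherwise, follow the narrow dependencies and use the first upstream RDD partition
 *    within the current stage that has placement preferences.
 */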
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration.  SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }

  // If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
  // that has any placement preferences. Ideally we would choose based on transfer sizes,
  // but this will do for now.
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }

    case _ =>
  }

  Nil
}
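
The three-step fallback can be modelled on a tiny lineage. This is a sketch under simplifying assumptions: `MiniRdd` is a hypothetical type, hosts are plain strings, and every narrow dependency is one-to-one, so a partition index maps to the same index in the parent.

```scala
import scala.collection.mutable

object LocalitySketch {
  // Hypothetical miniature RDD with one-to-one narrow parents only.
  final case class MiniRdd(
      name: String,
      cachedAt: Map[Int, Seq[String]] = Map.empty,  // partition -> cache locations
      preferred: Map[Int, Seq[String]] = Map.empty, // partition -> RDD-level preferences
      narrowParents: List[MiniRdd] = Nil)

  def preferredLocs(
      rdd: MiniRdd,
      partition: Int,
      visited: mutable.Set[(String, Int)] = mutable.Set.empty[(String, Int)]): Seq[String] = {
    if (!visited.add((rdd.name, partition))) {
      Nil // already explored; avoids walking the same path repeatedly
    } else {
      val cached = rdd.cachedAt.getOrElse(partition, Nil)
      val prefs = rdd.preferred.getOrElse(partition, Nil)
      if (cached.nonEmpty) cached      // 1. cache locations win
      else if (prefs.nonEmpty) prefs   // 2. then the RDD's own preferences
      else {
        // 3. then the first narrow parent that yields any preference
        rdd.narrowParents.iterator
          .map(parent => preferredLocs(parent, partition, visited))
          .find(_.nonEmpty)
          .getOrElse(Nil)
      }
    }
  }
}
```

The shared `visited` set is what keeps the walk linear on diamond-shaped lineages, where the same parent partition is reachable through several paths.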

TaskSchedulerImpl

This class is task-oriented: it schedules tasks on different types of clusters.

// Outer key: stageId; inner key: stageAttemptId; inner value: TaskSetManager
private val taskSetsByStageIdAndAttempt = new HashMap[Int, HashMap[Int, TaskSetManager]]


private var schedulableBuilder: SchedulableBuilder = null
val rootPool: Pool = new Pool("", schedulingMode, 0, 0)

Submitting Tasks

override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])

    // Mark all the existing TaskSetManagers of this stage as zombie, as we are adding a new one.
    // This is necessary to handle a corner case. Let's say a stage has 10 partitions and has 2
    // TaskSetManagers: TSM1(zombie) and TSM2(active). TSM1 has a running task for partition 10
    // and it completes. TSM2 finishes tasks for partition 1-9, and thinks he is still active
    // because partition 10 is not completed yet. However, DAGScheduler gets task completion
    // events for all the 10 partitions and thinks the stage is finished. If it's a shuffle stage
    // and somehow it has missing map outputs, then DAGScheduler will resubmit it and create a
    // TSM3 for it. As a stage can't have more than one active task set managers, we must mark
    // TSM2 as zombie (it actually is).
    stageTaskSets.foreach { case (_, ts) =>
      ts.isZombie = true
    }
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)

    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}

// Create a TaskSetManager for each TaskSet
private[scheduler] def createTaskSetManager(
    taskSet: TaskSet,
    maxTaskFailures: Int): TaskSetManager = {
  new TaskSetManager(this, taskSet, maxTaskFailures, blacklistTrackerOpt)
}
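
The zombie rule in submitTasks (at most one active manager per stage) can be isolated in a small registry. In this sketch, `Manager` and `Registry` are hypothetical stand-ins that keep only the fields needed to show the rule; scheduling pools and the starvation timer are left out.

```scala
import scala.collection.mutable

object ZombieSketch {
  // Hypothetical stand-in for TaskSetManager: only the fields needed here.
  final class Manager(val stageId: Int, val stageAttemptId: Int) {
    var isZombie: Boolean = false
  }

  final class Registry {
    // stageId -> (stageAttemptId -> manager)
    private val byStageAndAttempt = mutable.HashMap[Int, mutable.HashMap[Int, Manager]]()

    def submit(stageId: Int, stageAttemptId: Int): Manager = synchronized {
      val manager = new Manager(stageId, stageAttemptId)
      val attempts = byStageAndAttempt.getOrElseUpdate(stageId, mutable.HashMap[Int, Manager]())
      // Every earlier attempt of the same stage becomes a zombie.
      attempts.values.foreach(m => m.isZombie = true)
      attempts(stageAttemptId) = manager
      manager
    }

    def activeManagers(stageId: Int): List[Manager] = synchronized {
      byStageAndAttempt.get(stageId)
        .map(_.values.filterNot(_.isZombie).toList)
        .getOrElse(Nil)
    }
  }
}
```

A zombie manager is not removed from the map: tasks it already launched may still finish and report back, it simply stops launching new ones.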

Assigning Tasks

/**
 * Assign tasks to the offered executor resources
 */
def resourceOffers(offers: IndexedSeq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // 1. Bookkeeping for the executor resources
  var newExecAvail = false
  for (o <- offers) {
    if (!hostToExecutors.contains(o.host)) {
      hostToExecutors(o.host) = new HashSet[String]()
    }
    if (!executorIdToRunningTaskIds.contains(o.executorId)) {
      hostToExecutors(o.host) += o.executorId
      executorAdded(o.executorId, o.host)
      executorIdToHost(o.executorId) = o.host
      executorIdToRunningTaskIds(o.executorId) = HashSet[Long]()
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // 2. Filter and shuffle the offers
  // Remove nodes and executors whose blacklist entries have expired
  blacklistTrackerOpt.foreach(_.applyBlacklistTimeout())

  val filteredOffers = blacklistTrackerOpt.map { blacklistTracker =>
    offers.filter { offer =>
      !blacklistTracker.isNodeBlacklisted(offer.host) &&
        !blacklistTracker.isExecutorBlacklisted(offer.executorId)
    }
  }.getOrElse(offers)
  // Shuffle the offers
  val shuffledOffers = shuffleOffers(filteredOffers)
  
  // 3. Build this round's task lists from the executor resources
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores / CPUS_PER_TASK))
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  val availableSlots = shuffledOffers.map(o => o.cores / CPUS_PER_TASK).sum
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) { // A new executor is available, so recompute the locality levels of the task set
      taskSet.executorAdded()
    }
  }

  // 4. Following the scheduling order of the TaskSets, assign their tasks to executor resources
  for (taskSet <- sortedTaskSets) {
    // Skip the barrier taskSet if the available slots are less than the number of pending tasks.
    if (taskSet.isBarrier && availableSlots < taskSet.numTasks) {
      // Skip the launch process.
      // TODO SPARK-24819 If the job requires more slots than available (both busy and free
      // slots), fail the job on submit.
      logInfo(s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
        s"because the barrier taskSet requires ${taskSet.numTasks} slots, while the total " +
        s"number of available slots is $availableSlots.")
    } else {
      var launchedAnyTask = false
      // Record all the executor IDs assigned barrier tasks on.
      val addressesWithDescs = ArrayBuffer[(String, TaskDescription)]()
      // 4.1 Assign the tasks of the taskSet to executors, going from the most local level to the least local
      // Locality levels in that order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
      for (currentMaxLocality <- taskSet.myLocalityLevels) {
        var launchedTaskAtCurrentMaxLocality = false
        do {
          launchedTaskAtCurrentMaxLocality = resourceOfferSingleTaskSet(taskSet,
            currentMaxLocality, shuffledOffers, availableCpus, tasks, addressesWithDescs)
          launchedAnyTask |= launchedTaskAtCurrentMaxLocality
        } while (launchedTaskAtCurrentMaxLocality)
      }

      if (!launchedAnyTask) {
        taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors).foreach { taskIndex =>
            // If the taskSet is unschedulable we try to find an existing idle blacklisted
            // executor. If we cannot find one, we abort immediately. Else we kill the idle
            // executor and kick off an abortTimer which if it doesn't schedule a task within the
            // timeout will abort the taskSet if we were unable to schedule any task from the
            // taskSet.
            // Note 1: We keep track of schedulability on a per taskSet basis rather than on a per
            // task basis.
            // Note 2: The taskSet can still be aborted when there are more than one idle
            // blacklisted executors and dynamic allocation is on. This can happen when a killed
            // idle executor isn't replaced in time by ExecutorAllocationManager as it relies on
            // pending tasks and doesn't kill executors on idle timeouts, resulting in the abort
            // timer to expire and abort the taskSet.
            executorIdToRunningTaskIds.find(x => !isExecutorBusy(x._1)) match {
              case Some ((executorId, _)) =>
                if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) {
                  blacklistTrackerOpt.foreach(blt => blt.killBlacklistedIdleExecutor(executorId))

                  val timeout = conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000
                  unschedulableTaskSetToExpiryTime(taskSet) = clock.getTimeMillis() + timeout
                  logInfo(s"Waiting for $timeout ms for completely "
                    + s"blacklisted task to be schedulable again before aborting $taskSet.")
                  abortTimer.schedule(
                    createUnschedulableTaskSetAbortTimer(taskSet, taskIndex), timeout)
                }
              case None => // Abort Immediately
                logInfo("Cannot schedule any task because of complete blacklisting. No idle" +
                  s" executors can be found to kill. Aborting $taskSet." )
                taskSet.abortSinceCompletelyBlacklisted(taskIndex)
            }
        }
      } else {
        // We want to defer killing any taskSets as long as we have a non blacklisted executor
        // which can be used to schedule a task from any active taskSets. This ensures that the
        // job can make progress.
        // Note: It is theoretically possible that a taskSet never gets scheduled on a
        // non-blacklisted executor and the abort timer doesn't kick in because of a constant
        // submission of new TaskSets. See the PR for more details.
        if (unschedulableTaskSetToExpiryTime.nonEmpty) {
          logInfo("Clearing the expiry times for all unschedulable taskSets as a task was " +
            "recently scheduled.")
          unschedulableTaskSetToExpiryTime.clear()
        }
      }

      if (launchedAnyTask && taskSet.isBarrier) {
        // Check whether the barrier tasks are partially launched.
        // TODO SPARK-24818 handle the assert failure case (that can happen when some locality
        // requirements are not fulfilled, and we should revert the launched tasks).
        require(addressesWithDescs.size == taskSet.numTasks,
          s"Skip current round of resource offers for barrier stage ${taskSet.stageId} " +
            s"because only ${addressesWithDescs.size} out of a total number of " +
            s"${taskSet.numTasks} tasks got resource offers. The resource offers may have " +
            "been blacklisted or cannot fulfill task locality requirements.")

        // materialize the barrier coordinator.
        maybeInitBarrierCoordinator()

        // Update the taskInfos into all the barrier task properties.
        val addressesStr = addressesWithDescs
          // Addresses ordered by partitionId
          .sortBy(_._2.partitionId)
          .map(_._1)
          .mkString(",")
        addressesWithDescs.foreach(_._2.properties.setProperty("addresses", addressesStr))

        logInfo(s"Successfully scheduled all the ${addressesWithDescs.size} tasks for barrier " +
          s"stage ${taskSet.stageId}.")
      }
    }
  }

  // TODO SPARK-24823 Cancel a job that contains barrier stage(s) if the barrier tasks don't get
  // launched within a configured time.
  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}

/**
 * Assign tasks of a TaskSetManager to the executor resources
 */
private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: IndexedSeq[ArrayBuffer[TaskDescription]],
    addressesWithDescs: ArrayBuffer[(String, TaskDescription)]) : Boolean = {
  var launchedTask = false

  // Tasks are assigned to the offers in a round-robin fashion
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) {
      try {
        // taskSet.resourceOffer() assigns at most one task to an executor
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          tasks(i) += task
          val tid = task.taskId
          taskIdToTaskSetManager.put(tid, taskSet)
          taskIdToExecutorId(tid) = execId
          executorIdToRunningTaskIds(execId).add(tid)
          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          // Only update hosts for a barrier task.
          if (taskSet.isBarrier) {
            // The executor address is expected to be non empty.
            addressesWithDescs += (shuffledOffers(i).address.get -> task)
          }
          launchedTask = true
        }
      } catch {
        case e: TaskNotSerializableException =>
          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
          // Do not offer resources for this task, but don't throw an error to allow other
          // task sets to be submitted.
          return launchedTask
      }
    }
  }
  return launchedTask
}
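
The round-robin behaviour comes from combining the two methods: one pass gives each executor at most one task, and passes repeat for as long as a pass launched something. The sketch below models just that loop for a single task set; `Offer` and `assign` are hypothetical names, and locality levels, blacklisting, barrier handling and the random shuffling of offers are all left out.

```scala
import scala.collection.mutable.ArrayBuffer

object OfferSketch {
  final case class Offer(executorId: String, cores: Int)

  // Assigns pending task indices (0 until pendingTasks) to the offers.
  // Each pass over the offers gives every executor with enough free cores at most
  // one task; passes repeat until a pass launches nothing.
  def assign(offers: IndexedSeq[Offer], pendingTasks: Int, cpusPerTask: Int): IndexedSeq[List[Int]] = {
    val availableCpus = offers.map(_.cores).toArray
    val assigned = offers.map(_ => ArrayBuffer[Int]())
    var nextTask = 0
    var launched = true
    while (launched) {
      launched = false
      for (i <- offers.indices) {
        if (nextTask < pendingTasks && availableCpus(i) >= cpusPerTask) {
          assigned(i) += nextTask
          availableCpus(i) -= cpusPerTask
          nextTask += 1
          launched = true
        }
      }
    }
    assigned.map(_.toList)
  }
}
```

Spreading tasks one per executor per pass, rather than filling the first executor completely, tends to balance load across the cluster when there are fewer tasks than free cores.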

Tracking Task Status

def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
  var failedExecutor: Option[String] = None
  var reason: Option[ExecutorLossReason] = None
  synchronized {
    try {
      Option(taskIdToTaskSetManager.get(tid)) match {
        case Some(taskSet) =>
          if (state == TaskState.LOST) {
            // TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode,
            // where each executor corresponds to a single task, so mark the executor as failed.
            val execId = taskIdToExecutorId.getOrElse(tid, throw new IllegalStateException(
              "taskIdToTaskSetManager.contains(tid) <=> taskIdToExecutorId.contains(tid)"))
            if (executorIdToRunningTaskIds.contains(execId)) {
              reason = Some(
                SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
              removeExecutor(execId, reason.get)
              failedExecutor = Some(execId)
            }
          }
          if (TaskState.isFinished(state)) {
            cleanupTaskState(tid)
            taskSet.removeRunningTask(tid)
            if (state == TaskState.FINISHED) {
              taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
            } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
              taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
            }
          }
        case None =>
          logError(
            ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
              "likely the result of receiving duplicate task finished status updates) or its " +
              "executor has been marked as failed.")
              .format(state, tid))
      }
    } catch {
      case e: Exception => logError("Exception in statusUpdate", e)
    }
  }
  // Update the DAGScheduler without holding a lock on this, since that can deadlock
  if (failedExecutor.isDefined) {
    assert(reason.isDefined)
    dagScheduler.executorLost(failedExecutor.get, reason.get)
    backend.reviveOffers()
  }
}
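
The bookkeeping in statusUpdate can be pictured with a much smaller model. The sketch below is an assumption-laden simplification: `Tracker` and the `State` objects are hypothetical, there is no executor handling, and results are only recorded in two buffers instead of being passed to a result getter. It shows the two branches that matter: a terminal state removes the task from the running set, and an update for an unknown task id is ignored.

```scala
import scala.collection.mutable

object StatusUpdateSketch {
  sealed trait State
  case object Running extends State
  case object Finished extends State
  case object Failed extends State
  case object Killed extends State

  // States after which a task no longer counts as running.
  private val terminal: Set[State] = Set(Finished, Failed, Killed)

  final class Tracker {
    private val running = mutable.Set.empty[Long]
    val succeeded = mutable.ArrayBuffer.empty[Long]
    val failed = mutable.ArrayBuffer.empty[Long]

    def launch(tid: Long): Unit = synchronized { running += tid }

    /** Returns false when the task id is unknown, for example on a duplicate update. */
    def statusUpdate(tid: Long, state: State): Boolean = synchronized {
      if (!running.contains(tid)) {
        false
      } else {
        if (terminal.contains(state)) {
          running -= tid
          if (state == Finished) {
            succeeded += tid
          } else {
            failed += tid
          }
        }
        true
      }
    }

    def runningCount: Int = synchronized { running.size }
  }
}
```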

TaskSetManager

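It extends the Schedulable trait. This class tracks every task in a TaskSet, retries tasks that fail, and handles locality-aware scheduling for the TaskSet.

Internally, it keeps the indices of pending tasks in several collections, keyed by executor, host, and rack, so that a task can be looked up by locality level.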

// All pending tasks; this buffer is used as a stack
private val allPendingTasks = new ArrayBuffer[Int]

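/**
 * Adds the task with the given index to allPendingTasks.
 * For a task with preferred locations, a matching entry is also added to
 * pendingTasksForExecutor, pendingTasksForHost and pendingTasksForRack;
 * for a task without preferred locations, an entry is added to pendingTasksWithNoPrefs.
 */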
private[spark] def addPendingTask(index: Int) {
  for (loc <- tasks(index).preferredLocations) {
    loc match {
      case e: ExecutorCacheTaskLocation =>
        pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer) += index
      case e: HDFSCacheTaskLocation =>
        val exe = sched.getExecutorsAliveOnHost(loc.host)
        exe match {
          case Some(set) =>
            for (e <- set) {
              pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer) += index
            }
            logInfo(s"Pending task $index has a cached location at ${e.host} " +
              ", where there are executors " + set.mkString(","))
          case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
              ", but there are no executors alive there.")
        }
      case _ =>
    }
    pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer) += index
    for (rack <- sched.getRackForHost(loc.host)) {
      pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer) += index
    }
  }

  if (tasks(index).preferredLocations == Nil) {
    pendingTasksWithNoPrefs += index
  }

  allPendingTasks += index  // No point scanning this whole list to find the old task there
}
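
To make the shape of these indexes concrete, here is a minimal sketch under simplifying assumptions: `Loc` and `PendingTasks` are hypothetical, a location is just a host with an optional executor id, the HDFS cache case is dropped, and the rack lookup is passed in as a function.

```scala
import scala.collection.mutable.{ArrayBuffer, HashMap}

object PendingTaskIndexSketch {
  // Hypothetical simplified location: a host, optionally pinned to one executor.
  final case class Loc(host: String, executorId: Option[String] = None)

  final class PendingTasks(
      preferred: IndexedSeq[Seq[Loc]],
      rackOf: String => Option[String]) {
    val forExecutor = new HashMap[String, ArrayBuffer[Int]]
    val forHost = new HashMap[String, ArrayBuffer[Int]]
    val forRack = new HashMap[String, ArrayBuffer[Int]]
    val noPrefs = new ArrayBuffer[Int]
    val all = new ArrayBuffer[Int]

    /** Registers the task index under every locality level it qualifies for. */
    def add(index: Int): Unit = {
      for (loc <- preferred(index)) {
        loc.executorId.foreach { e =>
          forExecutor.getOrElseUpdate(e, new ArrayBuffer[Int]) += index
        }
        forHost.getOrElseUpdate(loc.host, new ArrayBuffer[Int]) += index
        rackOf(loc.host).foreach { r =>
          forRack.getOrElseUpdate(r, new ArrayBuffer[Int]) += index
        }
      }
      if (preferred(index).isEmpty) {
        noPrefs += index
      }
      // Every task lands here, so it can still be found at locality level ANY.
      all += index
    }
  }
}
```

The same task index appears in several collections at once. That is why a lookup has to check whether a task it finds has already been launched from another collection.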

/**
 * Responds to an offer from a single executor by assigning it at most one task
 */
def resourceOffer(
    execId: String,
    host: String,
    maxLocality: TaskLocality.TaskLocality)
  : Option[TaskDescription] =
{
  val offerBlacklisted = taskSetBlacklistHelperOpt.exists { blacklist =>
    blacklist.isNodeBlacklistedForTaskSet(host) ||
      blacklist.isExecutorBlacklistedForTaskSet(execId)
  }
  if (!isZombie && !offerBlacklisted) {
    val curTime = clock.getTimeMillis()

    var allowedLocality = maxLocality

    if (maxLocality != TaskLocality.NO_PREF) {
      allowedLocality = getAllowedLocalityLevel(curTime)
      if (allowedLocality > maxLocality) {
        // We're not allowed to search for farther-away tasks
        allowedLocality = maxLocality
      }
    }

    dequeueTask(execId, host, allowedLocality).map { case ((index, taskLocality, speculative)) =>
      // Found a task; do some bookkeeping and return a task description
      val task = tasks(index)
      val taskId = sched.newTaskId()
      // Do various bookkeeping
      copiesRunning(index) += 1
      val attemptNum = taskAttempts(index).size
      val info = new TaskInfo(taskId, index, attemptNum, curTime,
        execId, host, taskLocality, speculative)
      taskInfos(taskId) = info
      taskAttempts(index) = info :: taskAttempts(index)
      // Update our locality level for delay scheduling
      // NO_PREF will not affect the variables related to delay scheduling
      if (maxLocality != TaskLocality.NO_PREF) {
        currentLocalityIndex = getLocalityIndex(taskLocality)
        lastLaunchTime = curTime
      }
      // Serialize and return the task
      val serializedTask: ByteBuffer = try {
        ser.serialize(task)
      } catch {
        // If the task cannot be serialized, then there's no point to re-attempt the task,
        // as it will always fail. So just abort the whole task-set.
        case NonFatal(e) =>
          val msg = s"Failed to serialize task $taskId, not attempting to retry it."
          logError(msg, e)
          abort(s"$msg Exception during serialization: $e")
          throw new TaskNotSerializableException(e)
      }
      if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
        !emittedTaskSizeWarning) {
        emittedTaskSizeWarning = true
        logWarning(s"Stage ${task.stageId} contains a task of very large size " +
          s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " +
          s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
      }
      addRunningTask(taskId)

      // We used to log the time it takes to serialize the task, but task size is already
      // a good proxy to task serialization time.
      // val timeTaken = clock.getTime() - startTime
      val taskName = s"task ${info.id} in stage ${taskSet.id}"
      logInfo(s"Starting $taskName (TID $taskId, $host, executor ${info.executorId}, " +
        s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit()} bytes)")

      sched.dagScheduler.taskStarted(task, info)
      new TaskDescription(
        taskId,
        attemptNum,
        execId,
        taskName,
        index,
        task.partitionId,
        addedFiles,
        addedJars,
        task.localProperties,
        serializedTask)
    }
  } else {
    None
  }
}
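
The locality handling at the top of resourceOffer is easy to miss, so here is a minimal sketch of just that step. `Locality` mirrors the ordering of TaskLocality (from most to least local), and `delayLevel` stands in for the value returned by getAllowedLocalityLevel. Both names are assumptions made for illustration, and the sketch does not model how delay scheduling computes that level.

```scala
object LocalityClampSketch {
  object Locality extends Enumeration {
    // Declared from most local to least local, so ordering comparisons are meaningful.
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

    /** A level is allowed when it is at least as local as the constraint. */
    def isAllowed(constraint: Value, condition: Value): Boolean = condition <= constraint
  }

  /**
   * NO_PREF offers bypass delay scheduling; otherwise the level reached by
   * delay scheduling is capped at the level the caller offered.
   */
  def allowedLocality(maxLocality: Locality.Value, delayLevel: Locality.Value): Locality.Value =
    if (maxLocality == Locality.NO_PREF) maxLocality
    else if (delayLevel > maxLocality) maxLocality
    else delayLevel
}
```

In this sketch an ANY offer with a delay level of NODE_LOCAL yields NODE_LOCAL, while a NODE_LOCAL offer with a delay level of ANY is capped at NODE_LOCAL.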

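/**
 * Dequeues a pending task for the given executor and host, searching only the
 * locality levels that maxLocality allows, from most local to least local.
 * Returns the task index within the TaskSet, the locality level it was found at,
 * and whether it is a speculative copy.
 */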
private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value, Boolean)] =
{
  for (index <- dequeueTaskFromList(execId, host, getPendingTasksForExecutor(execId))) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
    for (index <- dequeueTaskFromList(execId, host, getPendingTasksForHost(host))) {
      return Some((index, TaskLocality.NODE_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
    // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
    for (index <- dequeueTaskFromList(execId, host, pendingTasksWithNoPrefs)) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
    for {
      rack <- sched.getRackForHost(host)
      index <- dequeueTaskFromList(execId, host, getPendingTasksForRack(rack))
    } {
      return Some((index, TaskLocality.RACK_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {
    for (index <- dequeueTaskFromList(execId, host, allPendingTasks)) {
      return Some((index, TaskLocality.ANY, false))
    }
  }

  // find a speculative task if all others tasks have been scheduled
  dequeueSpeculativeTask(execId, host, maxLocality).map {
    case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
}
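
The lookup order above can be sketched without any Spark types. In this simplified model (all names hypothetical), each locality level is a labelled buffer used as a stack, and `launched` reports whether a task index has already been handed out. The real dequeueTaskFromList also consults the blacklist and the count of running copies, which are omitted here.

```scala
import scala.collection.mutable.ArrayBuffer

object DequeueOrderSketch {
  /** Pops entries from the end of `list` until it finds an index that is not launched yet. */
  def dequeueFromList(list: ArrayBuffer[Int], launched: Int => Boolean): Option[Int] = {
    var i = list.size
    while (i > 0) {
      i -= 1
      val index = list.remove(i)
      if (!launched(index)) {
        return Some(index)
      }
    }
    None
  }

  /** Tries each (label, list) pair in order and returns the first hit with its label. */
  def dequeue(
      levels: Seq[(String, ArrayBuffer[Int])],
      launched: Int => Boolean): Option[(Int, String)] =
    levels.iterator
      .map { case (label, list) => dequeueFromList(list, launched).map(index => (index, label)) }
      .collectFirst { case Some(hit) => hit }
}
```

Because the iterator is lazy, less local lists are only touched when the more local ones yield nothing. Stale entries for tasks that were already launched from another list are discarded as they are encountered.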

CoarseGrainedSchedulerBackend

DriverEndpoint

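/**
 * Builds a WorkerOffer for every live executor, passes the offers to
 * TaskSchedulerImpl, and launches the tasks that come back.
 */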
private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = withLock {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map {
      case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
          Some(executorData.executorAddress.hostPort))
    }.toIndexedSeq
    // Hand the offers to TaskSchedulerImpl for task assignment
    scheduler.resourceOffers(workOffers)
  }
  if (!taskDescs.isEmpty) {
    launchTasks(taskDescs) // launch the tasks
  }
}

// Launch tasks returned by a set of resource offers
private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    val serializedTask = TaskDescription.encode(task)
    if (serializedTask.limit() >= maxRpcMessageSize) {
      Option(scheduler.taskIdToTaskSetManager.get(task.taskId)).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.rpc.message.maxSize (%d bytes). Consider increasing " +
            "spark.rpc.message.maxSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit(), maxRpcMessageSize)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      val executorData = executorDataMap(task.executorId)
      executorData.freeCores -= scheduler.CPUS_PER_TASK

      logDebug(s"Launching task ${task.taskId} on executor id: ${task.executorId} hostname: " +
        s"${executorData.executorHost}.")

      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}
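
The size guard in launchTasks can be isolated as a pure function. The sketch below is illustrative only: `checkSize` is a hypothetical helper, and it returns the error message instead of aborting a TaskSetManager. It makes the boundary explicit: a payload whose size equals the limit is already rejected, because the comparison is `>=`.

```scala
import java.nio.ByteBuffer

object TaskSizeCheckSketch {
  /** Left(message) when the payload must not be sent, Right(payload) otherwise. */
  def checkSize(
      taskId: Long,
      index: Int,
      serialized: ByteBuffer,
      maxRpcMessageSize: Int): Either[String, ByteBuffer] =
    if (serialized.limit() >= maxRpcMessageSize) {
      Left("Serialized task %s:%d was %d bytes, which exceeds max allowed: %d bytes."
        .format(taskId, index, serialized.limit(), maxRpcMessageSize))
    } else {
      Right(serialized)
    }
}
```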

CoarseGrainedExecutorBackend

override def receive: PartialFunction[Any, Unit] = {

  case LaunchTask(data) =>
    if (executor == null) {
      exitExecutor(1, "Received LaunchTask command but executor was null")
    } else {
      val taskDesc = TaskDescription.decode(data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskDesc)
    }

}

TODO

NOTEs

This article is based on Spark 2.4.3.