Capacity Scheduler in YARN
In the post YARN in Hadoop we have already seen that the scheduler component of the ResourceManager is responsible for allocating resources to running jobs. The scheduler component is pluggable in Hadoop, and the two commonly used options are the capacity scheduler and the fair scheduler. This post talks about the capacity scheduler in YARN, its benefits, and how the capacity scheduler can be configured in a Hadoop cluster.
Capacity scheduler
Capacity scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users can share one large cluster.
Every organization having its own private cluster leads to poor resource utilization. An organization may provision enough resources in the cluster to meet its peak demand, but that peak may not occur very often, so the resources sit underused the rest of the time.
Thus sharing a cluster among organizations is a more cost-effective idea. However, organizations are concerned about sharing a cluster because they worry that they may not get enough resources at the time of peak utilization. The CapacityScheduler in YARN mitigates that concern by giving each organization capacity guarantees.
Capacity scheduler in YARN functionality
Capacity scheduler in Hadoop works on the concept of queues. Each organization gets its own dedicated queue with a percentage of the total cluster capacity for its own use. For example, if there are two organizations sharing the cluster, one organization may be given 60% of the cluster capacity whereas the other organization is given 40%.
On top of that, to provide further control and predictability on sharing of resources, the CapacityScheduler supports hierarchical queues. An organization can further divide its allocated cluster capacity into separate sub-queues for separate sets of users within the organization.
Capacity scheduler is also flexible and allows free resources to be allocated to any queue beyond its capacity. This provides elasticity for the organizations in a cost-effective manner. When the queue to which those resources actually belong has increased demand, the resources are allocated back to it as they are released by the other queues.
Capacity scheduler in YARN configuration
To configure the ResourceManager to use the CapacityScheduler, set the following property in conf/yarn-site.xml:
Property: yarn.resourcemanager.scheduler.class
Value: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
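Written out as it would appear inside the <configuration> element of yarn-site.xml, that setting looks roughly like this. Only the property name and class come from the text above; the surrounding file layout is the usual Hadoop site-file convention.

```xml
<!-- yarn-site.xml: select the CapacityScheduler as the ResourceManager's scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
```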
To set up queues in the CapacityScheduler, you need to make changes in the etc/hadoop/capacity-scheduler.xml configuration file.
The CapacityScheduler has a predefined queue called root. All queues in the system are children of the root queue.
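For orientation, here is a sketch of the starting point. It assumes the stock capacity-scheduler.xml that ships with Apache Hadoop, which normally defines a single child of root named default holding all of the capacity. Check your own distribution's file, as vendors may ship something different.

```xml
<!-- Assumed stock layout: root has one child queue called "default" -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default</value>
</property>
<!-- The only queue at this level, so it takes 100% of root's capacity -->
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>100</value>
</property>
```

The properties described next replace this single-queue layout with your own hierarchy.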
Setting up further queues - Configure property yarn.scheduler.capacity.root.queues with a comma-separated list of child queues.
Setting up sub-queues within a queue - Configure property yarn.scheduler.capacity.<queue-path>.queues
Here queue-path is the full path of the queue’s hierarchy, starting at root, with . (dot) as the delimiter.
Capacity of the queue - Configure property yarn.scheduler.capacity.<queue-path>.capacity
Queue capacity is provided in percentage (%). The sum of capacities for all queues, at each level, must be equal to 100. Applications in the queue may consume more resources than the queue’s capacity if there are free resources, providing elasticity.
Capacity scheduler queue configuration example
Suppose there are two child queues under root, XYZ and ABC, and XYZ is further divided into two sub-queues, technology and marketing. XYZ is given 60% of the cluster capacity and ABC is given 40%.
================================================
<!-- Child queues of root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>XYZ,ABC</value>
</property>
<!-- Sub-queues of XYZ -->
<property>
  <name>yarn.scheduler.capacity.root.XYZ.queues</name>
  <value>technology,marketing</value>
</property>
<!-- XYZ is guaranteed 60% of the cluster -->
<property>
  <name>yarn.scheduler.capacity.root.XYZ.capacity</name>
  <value>60</value>
</property>
<!-- ABC is guaranteed the remaining 40% -->
<property>
  <name>yarn.scheduler.capacity.root.ABC.capacity</name>
  <value>40</value>
</property>
=============================================================================
If you want to limit the elasticity for applications in a queue, set its maximum capacity. For example, restricting XYZ's elasticity to 80% ensures that it doesn't use more than 80% of the total cluster capacity even if free resources are available. In other words, ABC always has at least 20% of the cluster available to it immediately.
===============================================
<property>
  <name>yarn.scheduler.capacity.root.XYZ.maximum-capacity</name>
  <value>80</value>
</property>
===========================================================================
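For the two sub-queues of XYZ, suppose you want to allocate 70% of XYZ's capacity to technology and 30% to marketing. Sub-queue capacities are relative to the parent queue, so these work out to 42% and 18% of the total cluster (70% and 30% of XYZ's 60%).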
=================================================
<property>
  <name>yarn.scheduler.capacity.root.XYZ.technology.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.XYZ.marketing.capacity</name>
  <value>30</value>
</property>
==================================================================================
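Putting the fragments from this example together, a complete capacity-scheduler.xml might look like the sketch below. The <configuration> wrapper is the standard Hadoop site-file layout; every property and value is taken from the example in this post, with nothing else tuned.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Level 1: root is split between XYZ (60%) and ABC (40%); sums to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>XYZ,ABC</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.ABC.capacity</name>
    <value>40</value>
  </property>
  <!-- Elasticity cap: XYZ may grow to at most 80% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.maximum-capacity</name>
    <value>80</value>
  </property>
  <!-- Level 2: XYZ is split into technology (70%) and marketing (30%); sums to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.queues</name>
    <value>technology,marketing</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.technology.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.marketing.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

After editing the file on a running cluster, queue changes are normally picked up by running yarn rmadmin -refreshQueues rather than restarting the ResourceManager. See the reference below for the restrictions that apply.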
Reference: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
That's all for this topic, Capacity Scheduler in YARN. If you have any doubts or suggestions, please drop a comment. Thanks!