How to Resolve Top Performance Issue in Impala?

Published on
June 9, 2022

What is Impala?

Impala is an MPP (Massive Parallel Processing) SQL query engine for processing huge volumes of data that is stored in a Hadoop cluster.

It is an open-source software which is written in C++ and Java. It provides high performance and low latency compared to other SQL engines for Hadoop.

In other words, Impala is the highest performing SQL engine (giving RDBMS-like experience) which provides the fastest way to access data that is stored in the Hadoop Distributed File System.

We have a major issue where many queries end up in the “Waiting to be closed” state in the `queries` of the Impala Daemon UI.

Generally, Impala does not close the query before the client fetches the full data and sometimes clients(Hue) leave the query open.

It does not close the query unless we close the tab so we have added the session timeout, which means when the session is not used for some time it will close all the queries and the session is closed.

Most of the times many users set this setting lower than the default. So people do not face the session time out error and can wait for queries to fetch the data.

So queries in Impala end up with the ‘waiting to closed’ state. When a query is in waiting to be closed state it is completed but still holds the resources, which causes the issue to run the new query if multiple queries go in the same state and user face impala performance issue. We can clear those queries manually by going to impala daemon UI like http://impala_daemon_hostname:25000/queries.

In the Hadoop cluster, as we have multiple nodes and Impala daemon it is time-consuming to connect to each node and then close the queries.

The below python script uses the CM API to close the queries in ‘waiting to be closed’ state. It will connect to each node and close the queries.

import urllib, json

datanodes =[“http://impala_daemon_hostname_1:25000/”,

“http://impala_daemon_hostname_2:25000/”,

“http://impala_daemon_hostname_3:25000/”,

“http://impala_daemon_hostname_4:25000/”,

“http://impala_daemon_hostname_n:25000/”]

for i, datanode in enumerate(datanodes):

print(“Checking {}: {}”.format(i, datanode)) try:

response = urllib.urlopen(datanode + “queries?json”)

data = json.loads(response.read())

#print data

if data[“num_waiting_queries”] > 0:

print(data[“num_waiting_queries”])

for in_flight_query in data[“in_flight_queries”]:

if in_flight_query[“waiting”] is True and in_flight_query[‘state’] == “FINISHED” or in_flight_query[‘state’] == “EXCEPTION”: cancel_url = datanode + “cancel_query?query_id={}”.format(in_flight_query[‘query_id’]) print(cancel_url)

response = urllib.urlopen(cancel_url)

except IOError:

print(“Skipping {}: {}”.format(i, datanode))

except Exception as e:

print(e)

Conclusion: As a result, we know about Impala & performance issues in Impala & how to solve it.


Subscribe to newsletter

Subscribe to receive the latest blog posts to your inbox every week.

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Share this post
Job title, Company name

How to Resolve Top Performance Issue in Impala?

What is Impala?

Impala is an MPP (Massive Parallel Processing) SQL query engine for processing huge volumes of data that is stored in a Hadoop cluster.

It is an open-source software which is written in C++ and Java. It provides high performance and low latency compared to other SQL engines for Hadoop.

In other words, Impala is the highest performing SQL engine (giving RDBMS-like experience) which provides the fastest way to access data that is stored in the Hadoop Distributed File System.

We have a major issue where many queries end up in the “Waiting to be closed” state in the `queries` of the Impala Daemon UI.

Generally, Impala does not close the query before the client fetches the full data and sometimes clients(Hue) leave the query open.

It does not close the query unless we close the tab so we have added the session timeout, which means when the session is not used for some time it will close all the queries and the session is closed.

Most of the times many users set this setting lower than the default. So people do not face the session time out error and can wait for queries to fetch the data.

So queries in Impala end up with the ‘waiting to closed’ state. When a query is in waiting to be closed state it is completed but still holds the resources, which causes the issue to run the new query if multiple queries go in the same state and user face impala performance issue. We can clear those queries manually by going to impala daemon UI like http://impala_daemon_hostname:25000/queries.

In the Hadoop cluster, as we have multiple nodes and Impala daemon it is time-consuming to connect to each node and then close the queries.

The below python script uses the CM API to close the queries in ‘waiting to be closed’ state. It will connect to each node and close the queries.

import urllib, json

datanodes =[“http://impala_daemon_hostname_1:25000/”,

“http://impala_daemon_hostname_2:25000/”,

“http://impala_daemon_hostname_3:25000/”,

“http://impala_daemon_hostname_4:25000/”,

“http://impala_daemon_hostname_n:25000/”]

for i, datanode in enumerate(datanodes):

print(“Checking {}: {}”.format(i, datanode)) try:

response = urllib.urlopen(datanode + “queries?json”)

data = json.loads(response.read())

#print data

if data[“num_waiting_queries”] > 0:

print(data[“num_waiting_queries”])

for in_flight_query in data[“in_flight_queries”]:

if in_flight_query[“waiting”] is True and in_flight_query[‘state’] == “FINISHED” or in_flight_query[‘state’] == “EXCEPTION”: cancel_url = datanode + “cancel_query?query_id={}”.format(in_flight_query[‘query_id’]) print(cancel_url)

response = urllib.urlopen(cancel_url)

except IOError:

print(“Skipping {}: {}”.format(i, datanode))

except Exception as e:

print(e)

Conclusion: As a result, we know about Impala & performance issues in Impala & how to solve it.


Subscribe to newsletter

Subscribe to receive the latest blog posts to your inbox every week.

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Share this post
Job title, Company name

Experience Azure to build and deploy the apps your way

Contact us to start your journey today

Enhance your business with Amazon cloud services

Contact us to know more about our AWS offerings

Accelerate your digital transformation with our IT solutions

Connect with our experts now

Optimize your business with the insights from Big Data

Contact us for any queries or services

Leverage IT to amplify your health care services

Contact us for more details

Get cutting-edge IT solutions for your manufacturing business

Connect with our expert now

Make the most out of your data with our solutions

Connect with our experts to know more

Get started with Database administration

Not sure what you need? Reach out to us now.

Ready to elevate your business to the next level?

Get in touch to know how

Select the right cloud service for you

Get advise from our dedicated Cloud Consultants

Optimize your business with the insights from Big Data

Contact us for any queries or services

Start your DevOps journey with us

Contact us to know all about DevOps

Subscribe to newsletter

Subscribe to receive the latest blog posts to your inbox every week.

By subscribing you agree to with our Privacy Policy.
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Privacy PolicyTerms of ServiceSitemap
© 2022 Cloudaeon All Rights Reserved.