My NameNode is vulnerable to too many clients

An elephant keeper told me he was concerned about his NameNode because some users might abuse it. To keep the NameNode from suffering under too many overly hasty clients, we can work through the following action list. If you have done all of this, your NameNode should be more reliable; if it is not, you should at least have more context for root cause analysis. The list also addresses common potential performance problems. However, if your cluster is small or idle, don't bother.


What I talk about when I talk about NameNode JMX

I asked an elephant keeper who had a problem with too many under-replicated blocks to check the NameNode status via JMX. He found the JMX output too verbose, and he was not sure which JMX metrics were the most important. I remember the old naive days when I had just started working on HDFS and troubleshooting NameNode issues; I also wondered which JMX metrics were the major or general ones. I don't keep my list secret, so here it is.
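
As a minimal sketch of cutting the verbosity down: the NameNode web server exposes its JMX beans as JSON under /jmx, and a qry parameter restricts the response to a single bean. The Python below assumes a Hadoop 3 NameNode on the default HTTP port 9870 (Hadoop 2 used 50070). It picks UnderReplicatedBlocks from the FSNamesystem bean only because that was the problem at hand. The function names and the chosen metrics are illustrative assumptions, not the list from the post.

```python
import json
import urllib.parse
import urllib.request

# Bean that carries file-system level counters such as UnderReplicatedBlocks.
FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def pick_metrics(jmx_payload, wanted):
    """Return {metric: value} for the wanted names found in a /jmx JSON payload."""
    picked = {}
    for bean in jmx_payload.get("beans", []):
        for name in wanted:
            if name in bean:
                picked[name] = bean[name]
    return picked


def fetch_bean(base_url, bean_name, timeout=10):
    """Fetch one bean from the NameNode /jmx servlet (needs network access)."""
    query = urllib.parse.urlencode({"qry": bean_name})
    url = base_url.rstrip("/") + "/jmx?" + query
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With a hypothetical host name, usage would look like pick_metrics(fetch_bean("http://nn1.example.com:9870", FSNAMESYSTEM_BEAN), ["UnderReplicatedBlocks", "MissingBlocks"]), which returns just those two counters instead of the full dump.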


My NameNodes are failing over too frequently

An elephant keeper tells me his HDFS NameNodes are failing over too frequently. He’s concerned about this because it’s a sign of something wrong in the High Availability (HA) cluster.

If you don’t want failover to happen that fast, I tell him, you can simply increase the ha.health-monitor.rpc-timeout.ms config key to whatever you want. Joking aside, raising that timeout can help mitigate the problem, but it is a temporary fix rather than a cure for the root cause. There are several common causes of NameNode stalls and failovers. The first task is to find evidence of them in the NameNode (NN) logs.
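
One kind of evidence that is easy to search for is a long JVM pause. As a rough sketch, and assuming the stall in question is a pause reported by Hadoop's JvmPauseMonitor (only one of several possible causes), the Python below scans NameNode log lines for its "Detected pause in JVM or host machine" message and keeps the pauses at or above a threshold. The function name and the 10-second default threshold are illustrative assumptions.

```python
import re

# JvmPauseMonitor writes lines such as:
# "... Detected pause in JVM or host machine (eg GC): pause of approximately 12345ms"
PAUSE_PATTERN = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)


def find_long_pauses(lines, threshold_ms=10_000):
    """Return (line_number, pause_ms) for each reported pause >= threshold_ms."""
    hits = []
    for line_number, line in enumerate(lines, start=1):
        match = PAUSE_PATTERN.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            hits.append((line_number, int(match.group(1))))
    return hits
```

To run it on a real log, pass an open file object for whatever path your installation uses. The timestamps on the matching lines can then be compared against the times at which the failovers happened.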


My HDFS balancer is slow

An elephant keeper tells me his HDFS balancer is slow and he can’t sleep well at night. He asks me if I can help speed it up.

OK, by design the HDFS balancer runs slowly in the background, balancing the whole cluster periodically. It’s fine for it to be slow, I tell him, so that it does not affect normal cluster activities. Your users submit jobs, copy data in and out, and operate the cluster for fun, without knowing that a balancer is running in the meantime. So go to sleep and sleep well. Don’t worry about a slow balancer.


My Standby NameNode hangs from time to time

One elephant keeper asked me whether he should be concerned that his standby NameNode hangs occasionally, for anywhere from 10 to 30 seconds. Sometimes he found it unresponsive to block reports, failover requests, or other operations; fortunately, the standby NN was able to recover later. There may be other short hangs that he was not aware of.


Distcp to Amazon S3 reports FileNotFoundException

An elephant keeper told me that he was trying to copy data from his HDFS to S3 and saw quite a few FileNotFoundException errors. However, when he checked the failing files immediately in the Amazon S3 web console, he was able to see them in the S3 bucket. I then kindly asked him one question: Did you use the -p option in your DistCp command line? He said yes: he did not want to lose the file metadata, so he thought it was good practice to preserve file attributes when copying files.
