Flume Interview Questions

How can you tune Flume for better performance?

Tiered data collection, the choice of channel, batch size, channel capacity, and channel transaction capacity are some of the most important factors in tuning Flume for better performance.
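As a rough illustration of where these settings live, here is a minimal sketch of an agent configuration. The agent name `a1`, the component names `r1`, `c1`, and `k1`, the port, the HDFS path, and every numeric value are assumptions chosen for illustration, not recommended settings.

```properties
# Hypothetical agent "a1"; names, paths and numbers are placeholders.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source, commonly used to receive events from an upstream tier of agents
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Channel capacity: maximum number of events held in the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Channel transaction capacity: maximum events per put/take transaction
a1.channels.c1.transactionCapacity = 1000

# Sink batch size: number of events taken from the channel per flush
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.batchSize = 1000
```

When adjusting these values, keep the sink batch size no larger than the channel's transactionCapacity, and transactionCapacity no larger than capacity.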

I have a website and I want to capture logs of the web server. Which channel should I use — memory or file channel?

Whether Flume can restore its state after the agent goes down depends on the channel type. Consider the following two scenarios.

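File channel: Events are persisted to disk. Let’s say the agent went down while its source was reading from a database. When the agent is restarted with the same checkpoint and data directories, it can resume processing the events from where it had left off. Starting the agent on another machine only helps if those directories are available there.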

Memory channel: Events are held only in memory, so the channel state does not persist. The memory channel is faster, with the caveat that if the agent goes down, it cannot resume from where it had left off and the events in the channel are lost.

For web server logs, you should use the memory channel. Web logs are not as important as transaction data, so it is acceptable to lose some of them if the agent goes down. A file channel, on the other hand, writes every event to the local disk, and because logs stream in continuously, it keeps consuming disk space and adds disk I/O. So, it is better to use a memory channel.
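The two options can be sketched as a configuration fragment like the one below. The agent name `a1`, the component names, the log path, the directories, and the capacity values are all hypothetical; only one channel type would be active at a time.

```properties
# Hypothetical fragment for agent "a1" tailing a web server access log.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Option 1: memory channel (fast; events in the channel are lost if the agent dies)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Option 2: file channel (durable; events survive an agent restart)
# a1.channels.c1.type = file
# a1.channels.c1.checkpointDir = /var/flume/checkpoint
# a1.channels.c1.dataDirs = /var/flume/data
```

Switching between the two is a matter of changing the channel definition; the source and sink wiring stays the same.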

Explain the replicating and multiplexing selectors in Flume.

Based on the value of an event header, an event can be written to a single channel or to multiple channels. If you do not explicitly define a channel selector, the replicating selector is used by default.

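The replicating selector sends a copy of every event to all the channels configured for the source, while the multiplexing selector routes each event to the channel or channels mapped to the value of a chosen header, so different events go to different channels.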
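A minimal sketch of both selectors in an agent configuration might look like this. The agent name `a1`, the channels `c1` to `c3`, the header name `logType`, and its values `access` and `error` are assumptions made up for the example.

```properties
# Hypothetical source r1 attached to three channels
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3

# Replicating (the default): every event is copied to c1, c2 and c3
a1.sources.r1.selector.type = replicating

# Multiplexing alternative: route each event by its "logType" header
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = logType
# a1.sources.r1.selector.mapping.access = c1
# a1.sources.r1.selector.mapping.error = c2
# Events whose header matches no mapping go to the default channel
# a1.sources.r1.selector.default = c3
```

With the multiplexing variant, an event whose `logType` header is `error` would land only in `c2`, whereas the replicating variant would place it in all three channels.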