Grafana Article Series (XI): How Labels in Loki Can Make Log Queries Faster and Easier
This article was last updated on: July 24, 2024 am
👉️URL: https://grafana.com/blog/2020/04/21/how-labels-in-loki-can-make-log-queries-faster-and-easier/
📝Description:
Everything you need to know about how labels really work in Loki. It may not be what you think.
For most of our first year on the Loki project, questions and feedback seemed to come from people familiar with Prometheus. After all, Loki is like Prometheus, but for logs!
But recently, we've seen more and more people trying out Loki who have no Prometheus experience, and many come from systems that take a different approach to handling logs. This raises a lot of questions about a very important concept in Loki, one that even Prometheus experts want to know more about: labels!
This post covers a lot of ground, both for people who are new to Loki and for those who want a refresher. We will work through the following topics in order.
What is a Label?
Labels are key-value pairs that can be defined as anything! We like to call them metadata that describes a log stream. If you're familiar with Prometheus, you are used to seeing labels like job and instance, which I will use in the coming examples.
The scrape configurations we provide with Loki define these labels, too. If you are using Prometheus, having consistent labels between Loki and Prometheus is one of Loki's superpowers: it makes it easy to correlate your application metrics with your log data.
How Loki uses labels
Labels in Loki perform a very important task: they define a stream. More specifically, the combination of every label key and value defines the stream. If just one label value changes, a new stream is created.
If you're familiar with Prometheus, the term used there is series. However, Prometheus has an additional dimension: the metric name. Loki simplifies this: there are no metric names, just labels, and we decided to use the term streams instead of series.
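To make the idea concrete, here is a minimal Python sketch (not Loki's implementation) that models a stream as the complete set of label key/value pairs. The helper name `stream_key` and the sample entries are assumptions made purely for illustration.

```python
from collections import defaultdict

def stream_key(labels: dict) -> frozenset:
    """Identify a stream by its complete set of label key/value pairs."""
    return frozenset(labels.items())

# Hypothetical log entries: (labels, line).
entries = [
    ({"job": "syslog", "instance": "host1"}, "line a"),
    ({"job": "syslog", "instance": "host1"}, "line b"),
    ({"job": "syslog", "instance": "host2"}, "line c"),  # one value differs
]

# Group lines by stream; a changed label value lands in a new group.
streams = defaultdict(list)
for labels, line in entries:
    streams[stream_key(labels)].append(line)

print(len(streams))  # 2
```

The order of the labels does not matter in this model, only the set of pairs; changing a single value is enough to start a second stream.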
Let's take an example: a Promtail scrape configuration (the listing is not preserved in this copy) that tails one file and assigns it a single label, job=syslog. You can query it like this:
{job="syslog"}
This creates one stream in Loki.
Now let's expand the example a bit (the expanded configuration listing is not preserved in this copy). We are now tailing two files. Each file gets just one label with one value, so Loki will now be storing two streams.
We can query these streams in several ways:
{job="apache"} <- show logs where the job label is apache
{job="syslog"} <- show logs where the job label is syslog
{job=~"apache|syslog"} <- show logs where the job label is apache **or** syslog
In that last example, we used a regex label matcher to select the log streams whose job label has either of two values. Now consider how an extra label could also be used (the configuration listing, which adds an env label to both jobs, is not preserved in this copy).
Now, instead of a regex, we could do this:
{env="dev"} <- returns all logs with env=dev; in this case that includes both log streams
Hopefully, you're starting to see the power of labels now. By using a single label, you can query many streams. By combining several different labels, you can create very flexible log queries.
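The selectors above can be mimicked in a few lines of Python. This is only a sketch of the matching semantics, assuming (as with Prometheus-style matchers) that a regex matcher has to match the whole label value. The function `select` and the stream list are illustrative and not part of Loki.

```python
import re

# The two streams from the example, each with a job and an env label.
streams = [
    {"job": "apache", "env": "dev"},
    {"job": "syslog", "env": "dev"},
]

def select(streams, equals=None, regex=None):
    """Return streams whose labels satisfy all equality and regex matchers."""
    equals = equals or {}
    regex = regex or {}
    selected = []
    for labels in streams:
        ok_eq = all(labels.get(k) == v for k, v in equals.items())
        # fullmatch: the pattern has to cover the entire label value.
        ok_re = all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items())
        if ok_eq and ok_re:
            selected.append(labels)
    return selected

by_regex = select(streams, regex={"job": "apache|syslog"})  # {job=~"apache|syslog"}
by_env = select(streams, equals={"env": "dev"})             # {env="dev"}
```

Both calls return the same two streams, which is the point of the shared env label: one equality matcher replaces the regex.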
Labels are the index to Loki's log data. They are used to find the compressed log content, which is stored separately as chunks. Every unique combination of labels and values defines a stream, and the logs for a stream are batched up, compressed, and stored as chunks.
To make Loki efficient and cost-effective, we must use labels responsibly. This is explored in more detail in the next section.
Cardinality
The previous examples use statically defined labels with a single value; however, there are ways to define labels dynamically. Let's take a look using an Apache log line and a large regex you could use to parse such a line.
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
[The Promtail pipeline configuration containing this regex is not preserved in this copy.]
This regex matches every component of the log line and extracts the value of each component into a capture group. Inside the pipeline code, this data is placed in a temporary data structure that allows it to be used for several purposes while that log line is processed (at which point the temporary data is discarded). More details can be found via the link in the original post.
From that regex, we will use two of the capture groups to dynamically set two labels based on content from the log line itself:
action (e.g., action="GET", action="POST")
status_code (e.g., status_code="200", status_code="400")
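As a rough stand-in for that step, the Python sketch below uses a simplified pattern written for this illustration (an assumption, not the expression from the original post) and keeps only the two capture groups that would become labels. The names `APACHE_LINE` and `dynamic_labels` are hypothetical.

```python
import re

# Simplified pattern for an Apache access log line; illustrative only.
APACHE_LINE = re.compile(
    r'^(?P<ip>\S+) (?P<identd>\S+) (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<useragent>[^"]*)"$'
)

def dynamic_labels(line: str) -> dict:
    """Extract only the two capture groups that become labels."""
    match = APACHE_LINE.match(line)
    if match is None:
        return {}  # line does not look like an access log entry
    return {"action": match["action"], "status_code": match["status_code"]}

line = ('11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] '
        '"GET /1986.js HTTP/1.1" 200 932 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) '
        'Gecko/20091221 Firefox/3.5.7 GTB6"')
print(dynamic_labels(line))  # {'action': 'GET', 'status_code': '200'}
```

All other groups (ip, user, path, and so on) are parsed but deliberately not turned into labels.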
Now let’s look at a few example lines:
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
In Loki, the following streams would be created:
{job="apache",env="dev",action="GET",status_code="200"} 11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="POST",status_code="200"} 11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="GET",status_code="400"} 11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
{job="apache",env="dev",action="POST",status_code="400"} 11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
These four log lines become four separate streams and start filling four separate chunks.
Any additional log lines that match these label/value combinations are added to the existing stream. If another unique combination of labels comes in (e.g., status_code="500"), another new stream is created.
Now imagine you set a label for the IP. Not only does every request from a user become a unique stream; every request from the same user with a different action or status code gets its own stream as well.
Doing some quick math: if there are four common actions (GET, PUT, POST, DELETE) and four common status codes (although there could be more than four!), that is 16 streams and 16 separate chunks. Now multiply that by every user if we use a label for ip. You can quickly have thousands or tens of thousands of streams.
This is high cardinality. It can kill Loki.
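The arithmetic is easy to check with a short Python sketch. The four status codes and the 1,000 client addresses below are made-up sample values; only the counts matter.

```python
actions = ["GET", "PUT", "POST", "DELETE"]
status_codes = ["200", "400", "404", "500"]  # four example codes (assumed)

def stream_count(*label_value_sets) -> int:
    """Worst case: one stream per combination of label values."""
    total = 1
    for values in label_value_sets:
        total *= len(values)
    return total

print(stream_count(actions, status_codes))  # 16

# Adding an ip label multiplies the count by the number of distinct clients.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]  # hypothetical clients
print(stream_count(actions, status_codes, ips))  # 16000
```

This is an upper bound, since not every client produces every combination, but it shows how quickly one unbounded label dominates the total.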
When we talk about cardinality, we are referring to the combination of labels and values and the number of streams they create. High cardinality comes from using labels with a large range of possible values, such as ip, or from combining many labels, even if each has a small and finite set of values, such as status_code and action.
High cardinality causes Loki to build a huge index (read: 💲💲💲💲) and to flush thousands of tiny chunks to the object store (read: slow). Loki currently performs very poorly in this configuration, and it will be the least cost-effective and least fun to run and use.
Best Loki performance using parallelization
Now you might ask: if using lots of labels, or labels with lots of values, is bad, how am I supposed to query my logs? If none of the data is indexed, won't queries be really slow?
When we see people using Loki who are accustomed to other index-heavy solutions, they seem to feel obligated to define a lot of labels in order to query their logs effectively. After all, many other logging solutions are all about the index, and this is the common way of thinking.
When using Loki, you may need to forget what you know and look at how the problem can be solved with parallelization instead. Loki's superpower is breaking up queries into small pieces and dispatching them in parallel so that you can query huge amounts of log data in a short time.
This kind of brute-force approach might not sound ideal, but let me explain why it is.
Large indexes are complex and expensive. Typically, the full-text index of your log data is the same size or larger than the log data itself. In order to query your log data, you need to load this index, and for performance, it should probably be in memory. This is hard to scale, and as you ingest more logs, your index gets bigger very quickly.
Now let's talk about Loki, where the index is typically an order of magnitude smaller than your ingested log volume. So if you do a good job of keeping your streams and stream churn to a minimum, the index grows very slowly compared to the ingested logs.
Loki effectively keeps your static costs as low as possible (index size, memory requirements, and static log storage) and makes query performance something you can control at runtime with horizontal scaling.
To see how this works, let's go back to our example of querying access log data for a specific IP address. We don't want to use a label to store the IP. Instead, we use a filter expression to query for it: a stream selector followed by a line filter such as |= "text" (the query listing itself is not preserved in this copy).
Behind the scenes, Loki breaks that query up into smaller pieces (shards), opens each chunk for the streams matched by the labels, and starts looking for this IP address.
The size of those shards and the amount of parallelization are configurable and based on the resources you provision. If you want, you can configure the shard interval down to 5m, deploy 20 queriers, and process gigabytes of logs in seconds. Or you can go crazy and provision 200 queriers and process terabytes of logs!
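The shape of that approach can be sketched in Python. This is a toy illustration, not how Loki is implemented: it shards by line count rather than by time range, and it uses threads in one process instead of separate queriers. The function names and sample log lines are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(lines, shard_size):
    """Cut the matched log lines into fixed-size pieces of work."""
    return [lines[i:i + shard_size] for i in range(0, len(lines), shard_size)]

def filter_shard(shard, needle):
    """Equivalent in spirit to a `|= "needle"` line filter on one shard."""
    return [line for line in shard if needle in line]

def parallel_filter(lines, needle, shard_size=1000, workers=4):
    shards = split_into_shards(lines, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves shard order, so results come back in log order.
        parts = pool.map(filter_shard, shards, [needle] * len(shards))
        return [line for part in parts for line in part]

# Hypothetical access-log lines belonging to one stream.
logs = [f'11.11.11.{i % 20} - frank "GET /1986.js HTTP/1.1" 200'
        for i in range(10_000)]
hits = parallel_filter(logs, "11.11.11.13 ")
print(len(hits))  # 500
```

No index on the IP is needed: each worker scans only its own shard, and adding workers or shrinking shards is a runtime decision rather than a storage decision.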
This trade-off of a smaller index and parallel brute-force querying versus a larger, faster full-text index is what allows Loki to save on costs compared with other systems. The cost and complexity of operating a large index is high, and it is typically fixed: you pay for it 24 hours a day, whether you are querying it or not.
The benefit of this design is that you decide how much query power you want to have, and you can change it on demand. Query performance becomes a function of how much money you want to spend on it. Meanwhile, the data is heavily compressed and stored in low-cost object stores such as S3 and GCS. This keeps fixed operating costs to a minimum while still allowing for incredibly fast queries.
Best practices
Here are the current best practices for labels that will give you the best experience with Loki.
1. Static labels are recommended
Things like hosts, applications, and environments are good labels. They are fixed for a given system/application and have defined values. Using static labels makes it easier for you to query your logs logically (e.g., show me all logs for a given application and a specific environment, or show me all logs for all applications on a particular host).
2. Use dynamic labels sparingly
Too many label value combinations lead to too many streams. The penalties for that in Loki are a large index and small chunks in the store, which in turn reduce performance.
To avoid these problems, don't add a label for something until you know you need it. Use filter expressions (|= "text", |~ "regex", …) and brute-force those logs. It works, and it's fast.
From early on, we have set a level label dynamically using Promtail pipelines. This seemed intuitive to us, as we often wanted to show only logs with level="error"; however, we are now re-evaluating this, as writing the query {app="loki"} |= "level=error" is proving to be just as fast for many of our applications as {app="loki",level="error"}.
This may seem surprising, but if an application has medium to low volume, that label causes one application's logs to be split into up to five streams, which means five times as many chunks are stored. Loading chunks has an overhead associated with it. Imagine the query were {app="loki",level!="debug"}. That would have to load many more chunks than {app="loki"} != "level=debug".
Above, we mentioned not adding labels until you need them, so when would you need labels? A little further down is a section on chunk_target_size. If you set this to 1MB (which is reasonable), Loki will try to cut chunks at 1MB compressed size, which is about 5MB of uncompressed logs (possibly as much as 10MB, depending on compression). If your logs have enough volume to write 5MB in less time than max_chunk_age, or many chunks in that time frame, you might want to consider splitting them into separate streams with a dynamic label.
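A back-of-the-envelope version of that check can be written in Python. The sketch assumes a steady write rate, an even split of volume across streams, and a 5x compression ratio; the function names and the sample rates are hypothetical.

```python
def seconds_to_fill_chunk(raw_bytes_per_second, chunk_target_size,
                          compression_ratio, streams=1):
    """Time for ONE stream to fill a chunk when volume is split evenly."""
    raw_bytes_needed = chunk_target_size * compression_ratio
    per_stream_rate = raw_bytes_per_second / streams
    return raw_bytes_needed / per_stream_rate

def fills_before_max_age(raw_bytes_per_second, chunk_target_size,
                         compression_ratio, max_chunk_age_seconds, streams=1):
    """True if each stream fills a chunk before the age limit flushes it."""
    return seconds_to_fill_chunk(
        raw_bytes_per_second, chunk_target_size, compression_ratio, streams
    ) <= max_chunk_age_seconds

MB = 1_000_000
HOUR = 3600

# 1MB target, assumed 5x compression, app writing 100kB of raw logs per second.
print(seconds_to_fill_chunk(100_000, 1 * MB, 5))          # 50.0
# Split into five streams, each stream still fills a chunk well within 1h.
print(fills_before_max_age(100_000, 1 * MB, 5, HOUR, 5))  # True
# A quiet app at 500 bytes per second does not, so splitting it hurts.
print(fills_before_max_age(500, 1 * MB, 5, HOUR, 5))      # False
```

If the answer is False for the number of streams a dynamic label would create, a filter expression is probably the better choice.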
What you want to avoid is splitting a log file into streams whose chunks get flushed because the stream is idle or hits its maximum age before the chunk is full. Starting with Loki 1.4.0, there is a metric that can help you understand why chunks are being flushed: sum by (reason) (rate(loki_ingester_chunks_flushed_total{cluster="dev"}[1m])).
It is not critical that every chunk be full when flushed, but it will improve many aspects of operation. As such, our current guidance is to avoid dynamic labels as much as possible and favor filter expressions instead. For example, don't add a dynamic level label; use |= "level=debug" instead.
3. Label values must always be bounded
If you are setting labels dynamically, never use a label that can have unbounded or infinite values. This will always cause big problems for Loki.
Try to keep values bounded to as small a set as possible. We don't have perfect guidance on what Loki can handle, but for a dynamic label, think single digits or maybe tens of values. This is less critical for static labels. For example, if you have 1,000 hosts in your environment, a host label with 1,000 values is going to be just fine.
4. Be aware of dynamic labels applied by clients
Loki has several client options: Promtail (which also supports systemd journal ingestion and TCP-based syslog ingestion), FluentD, Fluent Bit, a Docker plugin, and more!
Each of these comes with ways to configure which labels are applied to create log streams. Be aware of which dynamic labels might be applied. Use the Loki series API to get an idea of what your log streams look like and to see whether there are ways to reduce streams and cardinality. Details of the series API can be found in the Loki API documentation, or you can use logcli to query Loki for series information.
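Once you have a list of your streams' label sets, a few lines of Python are enough to spot the labels that drive cardinality. The sketch below works on hypothetical sample data rather than calling Loki, and the threshold of 10 values is an assumption taken from the "single digits or tens" guidance above.

```python
from collections import defaultdict

def values_per_label(series):
    """Count distinct values seen for each label across a list of label sets."""
    seen = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}

def suspicious_labels(series, max_values=10):
    """Labels whose distinct-value count exceeds a chosen threshold."""
    return sorted(name for name, count in values_per_label(series).items()
                  if count > max_values)

# Hypothetical label sets, one dict per stream.
series = [
    {"job": "apache", "env": "dev", "ip": f"11.11.11.{i}"} for i in range(50)
] + [{"job": "syslog", "env": "dev"}]

print(values_per_label(series))   # {'job': 2, 'env': 1, 'ip': 50}
print(suspicious_labels(series))  # ['ip']
```

Any label that shows up in the second list is a candidate for removal in favor of a filter expression.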
5. Configure caching
Loki can cache data at multiple levels, which can greatly improve performance. Details of this will be covered in future articles.
6. Logs must be in increasing time order per stream (newer versions accept out-of-order logs by default)
📝Notes:
One issue many people have with Loki is their client receiving errors for out-of-order log entries. This happens because of a hard and fast rule within Loki:
- For any single log stream, logs must always be sent in increasing time order. If a log is received with a timestamp older than the most recent log received for that stream, that log is dropped.
There are a few things to dissect from that statement. First, this restriction is per stream. Let's look at an example:
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog!
If Loki receives these two lines for the same stream, everything is fine. But what about this case?
{job="syslog"} 00:00:00 i'm a syslog!
{job="syslog"} 00:00:02 i'm a syslog!
{job="syslog"} 00:00:01 i'm a syslog! <- rejected, out of order!
Well, that is unfortunate. But what can we do about it? What if this happened because the sources of these logs were different systems? We can solve this with an additional label that is unique per system:
{job="syslog", instance="host1"} 00:00:00 i'm a syslog!
{job="syslog", instance="host1"} 00:00:02 i'm a syslog!
{job="syslog", instance="host2"} 00:00:01 i'm a syslog! <- accepted, this is a new stream!
{job="syslog", instance="host1"} 00:00:03 i'm a syslog! <- accepted, still in order for stream 1
{job="syslog", instance="host2"} 00:00:02 i'm a syslog! <- accepted, still in order for stream 2
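The two scenarios above can be replayed with a toy Python model of the strict per-stream rule. It is a sketch under simplifying assumptions (timestamps are zero-padded HH:MM:SS strings, and an entry with an equal timestamp is accepted); the class name `OrderedStreams` is hypothetical.

```python
class OrderedStreams:
    """Toy model: reject an entry that is older than its stream's latest."""

    def __init__(self):
        self.latest = {}  # stream key -> most recent accepted timestamp

    def push(self, labels: dict, timestamp: str) -> bool:
        key = frozenset(labels.items())
        last = self.latest.get(key)
        # Zero-padded HH:MM:SS strings compare correctly as text.
        if last is not None and timestamp < last:
            return False  # out of order for this stream: rejected
        self.latest[key] = timestamp
        return True

single = OrderedStreams()
results_single = [single.push({"job": "syslog"}, ts)
                  for ts in ("00:00:00", "00:00:02", "00:00:01")]
print(results_single)  # [True, True, False]

per_host = OrderedStreams()
results_per_host = [
    per_host.push({"job": "syslog", "instance": host}, ts)
    for host, ts in (("host1", "00:00:00"), ("host1", "00:00:02"),
                     ("host2", "00:00:01"), ("host1", "00:00:03"),
                     ("host2", "00:00:02"))
]
print(results_per_host)  # [True, True, True, True, True]
```

Because the latest timestamp is tracked per label set, adding the instance label gives each host its own ordering.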
But what if the application itself generated logs that were out of order? Well, I'm afraid that is a problem. If you are extracting the timestamp from the log line with something like the Promtail pipeline stage, you could instead not do that and let Promtail assign a timestamp to the log lines. Or you can hopefully fix it in the application itself.
But I want Loki to fix this! Why can't you buffer streams and reorder them for me? To be honest, because this would add a lot of memory overhead and complexity to Loki, and as has been a common thread in this post, we want Loki to be simple and cost-effective. Ideally, we would like to improve our clients to do some basic buffering and sorting, as this seems like a better place to solve the problem.
It is also worth noting that the batching nature of the Loki push API can lead to some out-of-order errors that are really false positives. (Perhaps a batch partially succeeded and was already present; or anything that previously succeeded would return an out-of-order entry; or anything new would be accepted.)
7. Use chunk_target_size
This was added earlier in 2020 when we released Loki v1.3.0, and we have been experimenting with it for several months. We now have chunk_target_size: 1536000 in all our environments. This instructs Loki to try to fill all chunks to a target compressed size of 1.5MB. These larger chunks are more efficient for Loki to process.
A couple of other configuration variables affect how full a chunk can get. Loki defaults to a max_chunk_age of 1h and a chunk_idle_period of 30m to limit the amount of memory used as well as the exposure to lost logs if the process crashes.
Depending on the compression used (we have been using snappy, which compresses less but is faster), you need 5-10x as much raw log data, or roughly 7.5-15MB, to fill a 1.5MB chunk. Remember, a chunk is per stream: the more streams you break your log files into, the more chunks sit in memory, and the more likely they are to be flushed by one of the timeouts mentioned above before they are filled.
Lots of small, unfilled chunks are currently kryptonite for Loki. We are always working to improve this and may consider a compactor to improve it in some situations. In general, though, the guidance should stay about the same: do your best to fill chunks.
If you have an application that logs fast enough to fill these chunks quickly (in much less than max_chunk_age), then it becomes more reasonable to use dynamic labels to break it up into separate streams.
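How these three settings interact can be sketched with a simplified Python model. It assumes a stream that writes at a steady rate for a while and then goes quiet, a 5x compression ratio, and the defaults quoted above. The function `flush_reason` and the reason names it returns are illustrative, not Loki's internal logic.

```python
def flush_reason(raw_bytes_per_second, active_seconds,
                 chunk_target_size=1_536_000, compression_ratio=5,
                 max_chunk_age=3600, chunk_idle_period=1800):
    """Simplified guess at why a stream's first chunk gets flushed.

    raw_bytes_per_second must be positive; active_seconds is how long the
    stream keeps writing before it goes quiet.
    """
    raw_bytes_needed = chunk_target_size * compression_ratio
    seconds_to_fill = raw_bytes_needed / raw_bytes_per_second
    if seconds_to_fill <= min(active_seconds, max_chunk_age):
        return "full"     # reached the target size while still writing
    if active_seconds + chunk_idle_period <= max_chunk_age:
        return "idle"     # went quiet and sat idle before the age limit
    return "max_age"      # too slow to fill within the age limit

print(flush_reason(100_000, 600))  # full
print(flush_reason(100, 60))       # idle
print(flush_reason(100, 7200))     # max_age
```

Every extra stream divides the write rate feeding each chunk, which pushes a stream from the first outcome toward the other two.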
Summary
Let me conclude by beating this dead horse one more time!
For performance, think parallelization, not labels and the index.
Be strict with labels. Static labels are generally fine, but dynamic labels should be used sparingly. (If your log stream is writing 5-10MB per minute, consider how a dynamic label could split it into two or three streams, which can improve query performance. If your volume is lower, stick with filter expressions.)
The index does not have to be Loki's path to performance! Prioritize parallelization and LogQL query filtering first.
Remember: Loki requires a different way of thinking compared to other log storage solutions. We are optimizing Loki for fewer streams and a smaller index, which helps fill larger chunks and makes them easier to query through parallelization.
We are actively improving Loki and looking at how to do this. Be sure to stay tuned as the Loki story unfolds; we are all figuring out how to get the most out of this really effective tool!