Thursday, June 28, 2012

Migrating to Graylog2 0.9.6 issues

We had a problem recently. We had been logging to Graylog2 using the gelf4j appender in a lot of applications. Our MongoDB was configured with a capped collection because we can quickly run our server out of space.

After we upgraded to Graylog2 0.9.6 we noticed a few things.
1. Our server quickly ran out of space
2. The dates were showing up in the year 44461.

Version 0.9.6 is much more responsive, due to the data being stored directly in ElasticSearch, however the capped collections no longer solve the problem of data size. It took some searching before finding that there is a 'settings' tab with a 'message retention time' tab that can be changed from 60 days worth of data to a smaller number.

This doesn't fully solve our data size problem. If a flood of errors occur it can still push the size beyond what the server can handle. Plus this deletion value uses the created_at date that is now passing the wrong data. All data is listed as being in the future and not getting cleaned up with setting.

It took a bit of tinkering, but I was able to finally come up with an ElasticSearch query that would allow me to delete all data dated in the future. This is not the best case scenario, but I can live with manually deleting data until the applications are updated.


To figure out what value to use for the date I used the unix date command to determine the value for 1/1/2013.



Fixing the problem with the dates was much easier.  A new version of gelf4j was available with the fix already implemented.   Simply replacing the previous version with the new version solved the date problem.




Dates with Graylog2 is an issue I saw mentioned several times.


Users complained that their servers are all over the world and the dates are showing local times.   So they want the ability to sort dates differently.


To me this could be solved by always passing through the UTC time instead of a local time.  Then all data would be in order by the time it was sent according to the sending server.


Some users want the graylog server to show dates in the order it receives them instead of by the date order that is stamped on them.


I think that this could be a useful feature, but I don't want it to be the default.  I want the time that the message occurred to be the sorting field.  Some servers could batch up messages and hold them for minutes before sending them.  I want to be able to find problems based on the time that they occur, not whenever Graylog happened to finally get it's copy of the data.    


I believe that if Graylog did add it's own timestamp for when a message was retrieved then deletions could be done against the internal date.  If that were the case Graylog could still hold messages for 2 days from when it received them, regardless of the date-time that the sending server happened to say it was.

Finally, I believe that Graylog2 0.9.6 is a good improvement to the previous version, the web browser is much more responsive, but the inability to cap the index size in ElasticSearch is a big issue that needs to be addressed as soon as possible.

No comments: