Opened 7 years ago

Closed 7 years ago

#1982 closed defect (fixed)

osgeo6 disk full: 100% /var

Reported by: neteler
Owned by: sac@…
Priority: blocker
Milestone:
Component: SysAdmin
Keywords: osgeo6
Cc: strk

Description

Disk full!

Aug 14 02:39:01 osgeoroot@osgeo6:/var/log# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/osgeo6-var   19G   19G   20K 100% /var

It seems that our monitoring services are not being looked at...? Nor do they generate email warnings to the SAC list.
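A minimal sketch of the kind of check that would have flagged this earlier, assuming POSIX `df -P` output is piped in. The function name `check_df` and the 90% default threshold are illustrative assumptions, not an existing SAC setup.

```shell
#!/bin/sh
# Print a warning line for every filesystem at or above a usage threshold.
# Reads `df -P` output on stdin, so it can be tried on saved output as well.
# Mount points containing spaces are not handled by this sketch.
check_df() {
    threshold="${1:-90}"   # percent; 90 is an assumed default
    awk -v limit="$threshold" '
        NR == 1 { next }                  # skip the header line
        {
            use = $5
            sub(/%$/, "", use)            # "100%" -> "100"
            if (use + 0 >= limit + 0)
                printf "WARNING: %s is %s%% full (%s available)\n", $6, use, $4
        }'
}
```

It could be run as `df -P | check_df 90`, for example from cron on a host other than the one being watched.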

Change History (10)

comment:1 by neteler, 7 years ago

As an emergency measure, I have deleted the two largest log files:

root@osgeo6:/var/log# du -hs * | grep G
4.8G	mail.info.1
4.8G	mail.log.1

root@osgeo6:/var/log# rm -f mail.info.1 mail.log.1

comment:2 by neteler, 7 years ago

Cc: strk added

Since osgeo6 is also the mailman list server, there is a notable backlog of currently deferred emails:

root@osgeo6:/var/log# mailq | grep mailman-bounces@lists.osgeo.org | wc -l
143069

This is bloating the log files, because our posting frequency is too high and emails are being temporarily rejected:

ls -lart | grep 'mail.log\|mail.info'
...
-rw-r-----  1 root     adm      959388006 Aug 14 03:45 mail.log
-rw-r-----  1 root     adm      959408486 Aug 14 03:45 mail.info

tail -f /var/log/mail.info
...
Aug 14 03:47:41 osgeo6 postfix/smtp[11659]: 6CD906099BF8: host smx00.udag.de[62.146.106.132] refused to talk to me: 421 Too many concurrent SMTP connections from this IP address; please try again later.
...

These huge log files are eventually filling up the disk to 100%.

comment:3 by wildintellect, 7 years ago

So clearly the log files contribute, but I'm not sure that's the root cause. I think mailman started throwing errors, which cascaded into filling its own logs, which then caused more errors. Does anyone know if there's a way to limit how big a mailman log can get before it is rotated, rather than waiting for logrotate? Can someone try to extract the head of the logs to figure out what caused the initial issue?

I'll check on the munin email setup, I'm not sure what it's set to do when it hits a limit.

in reply to:  3 comment:4 by neteler, 7 years ago

Replying to wildintellect:

So clearly the log files contribute, but I'm not sure that's the root cause. I think mailman started throwing errors, which cascaded into filling its own logs, which then caused more errors. Does anyone know if there's a way to limit how big a mailman log can get before it is rotated, rather than waiting for logrotate?

AFAIK it is postfix, not mailman, that is writing those two log files.

Can someone try to extract the head of the logs to figure out what caused the initial issue?

Most messages are like this:

Aug 14 14:01:15 osgeo6 postfix/smtp[9601]: 3FFB36332EB7: host aspmx.l.google.com[74.125.28.26] said: 450-4.2.1 The user you are trying to contact is receiving mail at a rate that 450-4.2.1 prevents additional messages from being delivered. Please resend your 450-4.2.1 message at a later time. If the user is able to receive mail at that 450-4.2.1 time, your message will be delivered. For more information, please 450-4.2.1 visit 450 4.2.1  https://support.google.com/mail/?p=ReceivingRate d11si5262351pln.414 - gsmtp (in reply to RCPT TO command)

Apparently we are hammering some other servers too much. At times it is a kind of endless loop with those servers (i.e. for some recipients).
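To see which remote servers are pushing back the most, the log could be summarized per host. This is a rough sketch that assumes log lines shaped like the two quoted in this ticket (`postfix/smtp[pid]: QUEUEID: host NAME[IP] ...`); the function name `deferrals_by_host` is made up for illustration.

```shell
#!/bin/sh
# Count postfix/smtp "host NAME[IP] ..." lines per remote host, busiest first.
# Reads mail log lines on stdin.
deferrals_by_host() {
    awk '
        / postfix\/smtp\[[0-9]+\]: / {
            for (i = 1; i < NF; i++) {
                if ($i == "host") {
                    host = $(i + 1)
                    sub(/\[.*$/, "", host)   # strip the "[ip]" suffix
                    count[host]++
                    break
                }
            }
        }
        END { for (h in count) print count[h], h }
    ' | sort -rn
}
```

Example use: `deferrals_by_host < /var/log/mail.info | head`.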

For now, I have added some postfix throttling:

/etc/postfix/main.cf

# throttle, see https://trac.osgeo.org/osgeo/ticket/1982
smtp_destination_concurrency_limit = 2
smtp_destination_rate_delay = 1s
smtp_extra_recipient_limit = 10

Maybe that will help to calm down the google and dtag mail servers. We can comment out these lines once the mail queue is closer to empty. Currently:

mailq | grep mailman-bounces@lists.osgeo.org | wc -l
111064

which is over 30,000 fewer than 11 hours ago.
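For a rough idea of how long the rest will take, the two counts quoted in this ticket (143069 and 111064, about 11 hours apart) can be extrapolated. The sketch below assumes the queue keeps draining at a constant rate, which may well not hold; `drain_estimate` is a hypothetical helper name.

```shell
#!/bin/sh
# Estimate how long the queue needs to drain at the observed rate.
# Arguments: earlier count, later count, whole hours between the two counts.
drain_estimate() {
    before=$1; after=$2; hours=$3
    [ "$hours" -gt 0 ] || return 2
    per_hour=$(( (before - after) / hours ))   # integer division
    if [ "$per_hour" -le 0 ]; then
        echo "queue is not shrinking"
        return 1
    fi
    remaining=$(( after / per_hour ))          # hours, rounded down
    echo "drained $((before - after)) messages, about $per_hour per hour, roughly $remaining hours to go"
}

# With the numbers from this ticket:
drain_estimate 143069 111064 11
# prints: drained 32005 messages, about 2909 per hour, roughly 38 hours to go
```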

I'll check on the munin email setup, I'm not sure what it's set to do when it hits a limit.

ok thx

comment:5 by neteler, 7 years ago

No mailman emails are arriving anymore; we should reboot "osgeo6".

comment:6 by martin, 7 years ago

To me it looks like more than 95% of the Postfix message queue consists of bounces. I'm now going to remove these from the queue, and I hope I don't catch any relevant email.
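One possible way to do such a cleanup, not necessarily the one used here, follows the pattern from the postsuper(1) manual page: pick queue IDs out of `mailq` output by envelope sender and hand them to `postsuper -d -`. The helper name `queue_ids_from` and the default sender `MAILER-DAEMON` are assumptions for illustration.

```shell
#!/bin/sh
# Print the queue IDs of all queued messages from a given envelope sender.
# Reads `mailq` output on stdin; the default sender is an assumption.
queue_ids_from() {
    sender="${1:-MAILER-DAEMON}"
    tail -n +2 |                # drop the column header
        grep -v '^ *(' |        # drop the "(reason ...)" lines
        awk -v want="$sender" 'BEGIN { RS = "" } $7 == want { print $1 }' |
        tr -d '*!'              # strip the active/hold markers
}
```

Reviewing the list first (`mailq | queue_ids_from | wc -l`) before running `mailq | queue_ids_from | postsuper -d -` reduces the risk of deleting relevant mail.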

comment:7 by wildintellect, 7 years ago

Looking at the munin graph, it started on Aug 7th. Logs from that date would be the best thing to check to figure out how to avoid a repeat in the future. http://webextra.osgeo.osuosl.org/munin/osgeo.org/osgeo6.osgeo.org/postfix_mailqueue.html

I updated munin to email the sysadmin@osgeo (Alex, Martin and Sandro get this). I would add SAC, but I'm not sure how to put multiple emails into munin notify, or how to whitelist munin so it can send to the SAC list. Does anyone else know those parts?

comment:8 by strk, 7 years ago

1) Have the mail sent to the list
2) Access the list moderation and whitelist the sender

in reply to:  7 comment:9 by neteler, 7 years ago

Replying to wildintellect:

I updated munin to email the sysadmin@osgeo (Alex, Martin and Sandro get this)

Keep in mind that such alarms won't arrive when osgeo6's disk is full, as in this ticket. I guess the addresses should be added directly in munin.

comment:10 by neteler, 7 years ago

Resolution: fixed
Status: new → closed