Please Look before you Leap a second (take down your servers edition)

by Michael S. Kaplan, published on 2012/07/03 06:41 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/07/03/10326365.aspx


It was way back in 2008 that I blogged "Please Look before you Leap a second".

The proposal to abolish leap seconds didn't really succeed, by the way.

But that isn't what today's blog is about.

It's something that happened on June 30th.

It was described in "Leap second bug cripples Linux servers at airlines, Reddit, LinkedIn" (Not a good time to be Australian).

Ugh.

From the article:

A spokesperson for Amadeus confirmed to The Reg today that the outage had been caused by a bug in the kernel of the open-source Linux operating system, and the flaw was triggered by the leap-second change on Saturday night. He said the problem has been sidestepped using a workaround within an hour, but Amadeus is investigating how to avoid and detect similar bugs in advance.

Servers run by Mozilla, StumbleUpon, Yelp, FourSquare, Reddit and LinkedIn were also reported to have been hit by the same bug. Mozilla said its implementation of the Java-based Hadoop data processing framework and ElasticSearch weren’t working properly on Saturday evening.

Leap seconds aren't a completely new concept, so I am having trouble getting my head around a bug that could cause so many different components and servers and sites to have trouble.

Perhaps somebody familiar with the issue can enlighten me on this point.

None of my Vista, Server 2008, Windows 7, Server 2008 R2, or Windows 8 machines had problems during that period, for what it's worth....

Is this a new service in the Linux kernel? Or a bug in an existing service?

I'm really not familiar enough with Linux to know how one goes about making kernel changes -- does someone have the bug assigned to them to fix this?

Who owns the issue?


metathinker on 3 Jul 2012 7:28 AM:

It was a bug in the kernel's real-time clock code to update the system time for a leap second adjustment - the code forgot to call the function to signal an RTC change. Because the kernel's (high-resolution) timers are partially dependent on the system time, this caused all of them to expire one second early. Then, they were reset by or on behalf of their user-space clients, so they expired again, and again, and again, for quite a while - causing a sudden CPU load spike on the affected machines.

See this page for much much more: lwn.net/.../5c4c9ae88c52d92b
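
As a rough illustration of the mechanism described in the comment above, here is a toy Python model. It is not kernel code, and every name in it (ToyClock, count_wakeups, simulate) is hypothetical. It assumes, following the comment, that the leap second moves the wall clock back one second while the timer subsystem keeps a stale view that is one second ahead unless it is notified. A client that keeps re-arming a 500 ms timeout then sees it fire on every tick instead of once.

```python
class ToyClock:
    """Toy model: a wall clock plus the timer subsystem's cached view of it.

    All times are integer milliseconds. Hypothetical names, not kernel APIs.
    """

    def __init__(self, now_ms=100_000):
        self.wall_ms = now_ms        # what clients read as "now"
        self.timer_view_ms = now_ms  # what the timer code compares deadlines to

    def advance(self, dt_ms):
        self.wall_ms += dt_ms
        self.timer_view_ms += dt_ms

    def insert_leap_second(self, notify_timers):
        # The wall clock repeats one second.
        self.wall_ms -= 1000
        # Assumed bug: without the notification, the timer view stays 1 s ahead.
        if notify_timers:
            self.timer_view_ms = self.wall_ms


def count_wakeups(clock, timeout_ms, duration_ms, tick_ms=1):
    """Client loop: re-arm a relative timeout every time it fires."""
    wakeups = 0
    deadline = clock.wall_ms + timeout_ms
    for _ in range(0, duration_ms, tick_ms):
        if clock.timer_view_ms >= deadline:
            wakeups += 1
            deadline = clock.wall_ms + timeout_ms  # re-armed from wall time
        clock.advance(tick_ms)
    return wakeups


def simulate(notify_timers):
    clock = ToyClock()
    clock.insert_leap_second(notify_timers)
    return count_wakeups(clock, timeout_ms=500, duration_ms=1000)


if __name__ == "__main__":
    print("with notification:   ", simulate(True))   # 1 wakeup in one second
    print("without notification:", simulate(False))  # 1000 wakeups in one second
```

In this sketch any timeout shorter than the one-second skew fires immediately after being re-armed, which is the kind of busy loop that would show up as a sudden CPU load spike rather than as a crash.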

Mihai on 4 Jul 2012 12:51 AM:

"Leap seconds aren't a completely new concept"

And the leap year concept is even older. But still...

www.wired.com/.../azure-leap-year-bug

Daniel Cheng on 4 Jul 2012 9:36 AM:

It changed the way time adjustments are handled for some high-resolution timers.

Yes, they have tested the leap second thing -- that's why they fixed a deadlock issue long ago. The new issue is just high CPU usage, which is harder to notice.

