Friday, February 18, 2011

Data Munging #1: Runs in binary sequence

This is the first in an ongoing series of posts about my attempts to make various data look the way I want it to. Eventually I may put them into a cookbook of some kind... who knows. I'll try to keep a fairly conversational approach, walking roughly through my thoughts as they come, rather than presenting a "handed down from the gods" perfectly clean solution straight off the bat. I'll leave out some of the dumber missteps from my analysis, so everything should be moving towards the final code.

The last few nights, my internet has been really wonky. It'll be up for a while, then randomly go down. I noticed this when I had trouble watching Netflix: whenever the connection dropped, the stream would play as far as it had managed to download ahead, then stop.

In an effort to figure out what was going on (as of last night, it wasn't a problem, despite my taking no actual efforts to fix it), I ran ping on Google for about an hour to at least see if there was a pattern to the up and down times:

$ ping google.com > ping.txt
^C
$ tail -n 20 ping.txt
64 bytes from 74.125.224.83: icmp_seq=4726 ttl=55 time=15.728 ms
64 bytes from 74.125.224.83: icmp_seq=4727 ttl=55 time=15.823 ms
64 bytes from 74.125.224.83: icmp_seq=4728 ttl=55 time=18.201 ms
64 bytes from 74.125.224.83: icmp_seq=4729 ttl=55 time=19.277 ms
64 bytes from 74.125.224.83: icmp_seq=4730 ttl=55 time=14.826 ms
64 bytes from 74.125.224.83: icmp_seq=4731 ttl=55 time=15.469 ms
64 bytes from 74.125.224.83: icmp_seq=4732 ttl=55 time=18.754 ms
64 bytes from 74.125.224.83: icmp_seq=4733 ttl=55 time=21.095 ms
64 bytes from 74.125.224.83: icmp_seq=4734 ttl=55 time=17.561 ms
Request timeout for icmp_seq 4735
Request timeout for icmp_seq 4736
Request timeout for icmp_seq 4737
Request timeout for icmp_seq 4738
Request timeout for icmp_seq 4739
Request timeout for icmp_seq 4740
Request timeout for icmp_seq 4741

--- google.com ping statistics ---
4743 packets transmitted, 3665 packets received, 22.7% packet loss
round-trip min/avg/max/stddev = 13.875/24.125/1083.971/30.219 ms


Then I fired up Pylab and turned my logfile into a pair of bool lists, one entry per ping:
>>> up = ['timeout' not in l for l in open('ping.txt') if 'icmp' in l]
>>> down = ['timeout' in l for l in open('ping.txt') if 'icmp' in l]
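The 'icmp' filter is what keeps the blank line and the summary lines at the end of the log out of the lists. A quick check on a handful of lines copied from the ping output above (the sample list is only for illustration, not the full log):

```python
# A few lines copied from the ping output above, including the summary.
sample = [
    "64 bytes from 74.125.224.83: icmp_seq=4733 ttl=55 time=21.095 ms",
    "64 bytes from 74.125.224.83: icmp_seq=4734 ttl=55 time=17.561 ms",
    "Request timeout for icmp_seq 4735",
    "Request timeout for icmp_seq 4736",
    "",
    "--- google.com ping statistics ---",
    "4743 packets transmitted, 3665 packets received, 22.7% packet loss",
]

# Only per-packet lines mention 'icmp'; a reply is up, a timeout is down.
up = ['timeout' not in l for l in sample if 'icmp' in l]
print(up)  # [True, True, False, False]
```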


Since each one of these holds the exact same information, though, I'm only going to use the list of "am I up" bools. My first attempt almost worked well:
>>> ups = [idx for idx, d in enumerate(diff(up)) if d == 1]
>>> downs = [idx for idx, d in enumerate(diff(up)) if d == -1]
>>> ups[:10]
[4, 13, 49, 57, 94, 102, 139, 147, 184, 193]
>>> downs[:10]
[]


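Wait, what? Oh, I get it: diff([False, True]) is [True], as we expect, but diff([True, False]) is also [True], since mathematically it comes out to -1, which is not zero, and therefore True. So every d is a bool: d == 1 matches both kinds of switch (which is why ups holds all of them), and d == -1 never matches. Okay, no big deal, I need to cast them to ints first: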

>>> ups = [idx for idx, d in enumerate(diff(list(map(int,up)))) if d == 1]
>>> downs = [idx for idx, d in enumerate(diff(list(map(int,up)))) if d == -1]
>>> ups[:10]
[13, 57, 102, 147, 193, 238, 284, 327, 372, 403]
>>> downs[:10]
[4, 49, 94, 139, 184, 229, 274, 320, 364, 395]
>>> len(ups)
133
>>> len(downs)
134
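The bool-versus-int behaviour is easy to confirm on a two-element array. This is a small check using NumPy directly (Pylab's diff is NumPy's diff), assuming a reasonably recent NumPy; the variable names are just for illustration:

```python
import numpy as np

went_up = np.array([False, True])
went_down = np.array([True, False])

# On bool arrays, both kinds of transition simply come out as True.
bool_up = np.diff(went_up)
bool_down = np.diff(went_down)
print(bool_up.tolist(), bool_down.tolist())  # [True] [True]

# After casting to int, the sign tells the two transitions apart.
int_up = np.diff(went_up.astype(int))
int_down = np.diff(went_down.astype(int))
print(int_up.tolist(), int_down.tolist())  # [1] [-1]
```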


The log starts up and ends down, so there's one more down than up. That means I can line the two lists up, drop the unmatched end, and subtract to get the intervals:

>>> downtimes = array(ups) - array(downs[:-1])
>>> uptimes = array(downs[1:]) - array(ups)
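To see why the slices line up, here is the same subtraction on just the first few indices from the session above. It assumes, as in my log, that the first switch is a drop, so each down is followed by its matching up:

```python
import numpy as np

# First few transition indices from the session above.
downs = np.array([4, 49, 94, 139])  # last reply before each outage
ups = np.array([13, 57, 102])       # last timeout before each recovery

# Each outage runs from a down to the up that follows it.
downtimes = ups - downs[:-1]
# Each good stretch runs from an up to the next down.
uptimes = downs[1:] - ups

print(downtimes.tolist())  # [9, 8, 8]
print(uptimes.tolist())    # [36, 37, 37]
```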


From here, I can plot a histogram of up- and down-time durations:

>>> hist(downtimes, range(0,max(downtimes)+1), label='Down')
>>> hist(uptimes, range(0,max(uptimes)+1), label='Up')
>>> legend()
>>> xlabel('Duration (s)')




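I could call it a day here (especially 'cause I don't know what to make of the fact that there are peaks every 7 seconds in the uptimes), but I'm not happy with how I got the up and down durations... there must be a cleaner way to do it. Then it comes to me: the switches strictly alternate between going down and coming back up, so the gaps between consecutive switches alternate between downtimes and uptimes: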

>>> switches = array([idx for idx, d in enumerate(diff(list(map(int,up)))) if d != 0])
>>> uptimes = diff(switches)[1::2]
>>> downtimes = diff(switches)[::2]
>>> uptimes[:10]
array([36, 37, 37, 37, 36, 36, 36, 37, 23, 23])
>>> downtimes[:10]
array([ 9, 8, 8, 8, 9, 9, 10, 7, 8, 8])


Hooray! Now everything is pretty much one pass, right up until the moment I need it.
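One caveat: the [::2] and [1::2] slices only come out right because my log happens to start while the connection is up. As a rough sketch of a variant that doesn't care how the log starts, the run lengths can be paired with the value of each run. The run_lengths helper is a name I made up for illustration, and unlike the version above it also keeps the first and last runs, which are cut short by the start and end of the log:

```python
import numpy as np

def run_lengths(flags):
    """Return (values, lengths) for each run of equal values in flags."""
    flags = np.asarray(flags, dtype=bool)
    if flags.size == 0:
        return flags, np.array([], dtype=int)
    # A new run starts at index 0 and right after every change point.
    starts = np.concatenate(([0], np.flatnonzero(flags[1:] != flags[:-1]) + 1))
    # Each run ends where the next one starts; the last ends with the data.
    lengths = np.diff(np.concatenate((starts, [flags.size])))
    return flags[starts], lengths

# Toy sequence: 3 up, 2 down, 1 up, 4 down.
values, lengths = run_lengths([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
uptimes = lengths[values]
downtimes = lengths[~values]
print(uptimes.tolist())    # [3, 1]
print(downtimes.tolist())  # [2, 4]
```

Dropping the first and last entries of lengths (and values) before splitting would give roughly what the switches version produces, since that one only measures runs bounded by a switch on both sides.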



Final Code:

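from pylab import array, diff, hist, legend, xlabel

# True for each reply, False for each timeout; non-ping lines are skipped
up = ['timeout' not in l for l in open('ping.txt') if 'icmp' in l]

# Indices where the connection switches state (diff is nonzero)
switches = array([idx for idx, d in enumerate(diff(list(map(int,up)))) if d != 0])

# The log starts up, so gaps between switches alternate down, up, down, ...
uptimes = diff(switches)[1::2]
downtimes = diff(switches)[::2]

hist(downtimes, range(0,max(downtimes)+1), label='Down')
hist(uptimes, range(0,max(uptimes)+1), label='Up')
legend()
xlabel('Duration (s)')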

Total time: ~1/2 hour.
