Episode 73 | How To Deal With Service Outages

Show Notes


[00:00] Mike: This is Startups for the Rest of Us, episode 73.


[00:12] Mike: Welcome to Startups for the Rest of Us, the podcast that helps developers, designers and entrepreneurs be awesome at launching software products, whether you’ve built your first product or are just thinking about it. I’m Mike.

[00:18] Rob: And I’m Rob.

[00:21] Mike: We’re here to share our experiences to help you avoid the same mistakes we’ve made. What’s going on this week?

[00:25] Rob: I’m just hanging out man, I’m getting over the fact that I’ve had more contractors flake on me in the past two weeks than I have in the past six months. It’s been such a downer couple of weeks and it’s just, I think it’s just one of those things that happens every once in a while. Just people have like, well sometimes there is a valid reason and sometimes there’s not, but I think I joked with you over email that outsourcing never, ever works, which obviously, you know, we don’t believe at all.

[00:49] And actually a couple of them like disappeared or have stopped emailing or have emailed and said like, oh I’m just, I’m booked and I have to back out. Now luckily a couple of them did get back in touch and were like, oh no, I am going to go through with it, and we talked again and figured it out, so I’m feeling better about it. But it was a dark time for the last couple of weeks.

[01:05] Mike: Yeah it’s one of those things where the one thing that I don’t necessarily like is the fact that, you know, because it is a contractor, typically you are relying on email or non-real-time communications to get in touch with them, and you basically just send your message off into oblivion and you hope that they got it, and if they did get it then hopefully they, you know, respond to it in some reasonable amount of time.

[01:26] And you just don’t have any visibility there so you don’t know if they read the message or not or whether they got it or, you know, you are basically just sitting there waiting. And I think that’s probably the worst part because you are trying to move other things along and you don’t want to have to pay attention to it but you still kind of have to.

[01:41] Rob: Right, and then you are like well this task is maybe, you know, a 10 hour task or maybe it’s a 40 hour task and you have to sit there and weigh like should I just suck it up and do this, you know kind of pull a long night and kill a whole day and do it so I am not waiting, because it could be a week or two until I hear back or until this person has time to do it. So that’s been the dilemma the whole time, it’s like alright, now three or four weeks have passed, that task still isn’t done, is it time for me to just, you know, suck it up and make it happen. So this is the downside I guess of outsourcing and luckily it doesn’t happen to me very much so it’s just funny that it all kind of happened at one time.

[02:15] Mike: Yeah I’ve got the same problem right now, I’ve got a contractor who is contracting some software for me and I haven’t heard anything in probably four or five days so I emailed a day or two ago and I haven’t heard anything yet so…

[02:27] Rob: I have actually really found Boomerang is helpful for this because then I am able to email and say if they don’t reply ping me in two days so that at least I can get back to them again because otherwise I forget about it and then a week later I remember and I am like pissed off, you know?

[02:39] Mike: Right.

[02:38] Rob: Oh my God that thing still isn’t done I forgot about that.

[02:41] Mike: No I have that set up it’s just, you know.

[02:43] Rob: Yeah I know.

[02:44] Mike: It’s just in the back of my mind, I know that that thing is going to come back around in the next couple of days and I’ve got to say okay well I haven’t heard from you in another four days so what’s going on?

[02:54] Rob: Hey you had brought up business taxes filing by March 15th.

[02:57] Mike: Oh.

[02:58] Rob: And I was like oh yeah, not for me, I don’t need that and then my accountant emails and he is like hey I just filed an extension for you and I am like what is that? Isn’t that hilarious? I was like for what? And he is like it’s all of your Numa Group taxes, so apropos, you had hit it on the head last time. Then you spent just a couple of hours on your taxes and you were all done?

[03:16] Mike: No I sent it off to my CPA and it took, I don’t know I would say probably a week and a half of going back and forth with him. Just there were a few transactions here and there that really threw my numbers out of whack. So things were off like $15,000 in one direction and off $17,000 in the other direction and they were still close but they were just off by a little bit. And you know accountants don’t like numbers when they don’t add up. But you know I got through it, it wasn’t that bad, I mean my taxes have already been filed.

[03:44] Rob: I had a couple of other things. I was listening back to I think it was last week’s episode and I had heard since then about some other low cost per click ad networks. We had mentioned ad networks last week for driving traffic to landing pages. There is a couple, one is called advertise.com, one is pulse360.com and the third is adbright.com and so we will link them up in the show notes. But they are like lower quality traffic but you can get some crazy cheap like nickel clicks from AdBright and the others. So definitely something for folks to check out and the nice part is they are broader than just techies, you know they don’t focus on designers and developers, I mean you can actually get traffic targeting other things.

[04:21] Mike: That’s really cool, have to check those out. So the only other thing I have been working on aside from taxes is talking to a lot of the MicroConf sponsors, I mean we are down, getting close to the wire. We’ve got what, six weeks left, seven weeks left? And just trying to finalize things with a lot of the sponsors and for anyone who is interested we’ve got a lot of good sponsors this year again, we’ve got Balsamiq, Bidsketch, Usabilitest, Kiss Metrics, Software Promotions, WP Engine which I think you are an investor of.

[04:47] Rob: Right.

[04:48] Mike: And then Mail Chimp, Red Gate Software, User Voice, Kit Apps and Constant Contact. So pretty good so far and you know we will probably have a few more that will sign on over the next couple of weeks and we may need to draw a line in the sand and just say you know we are not really accepting any more at this point, but so far things are going pretty good.

[05:05] Rob: Nice. I listened to TechZing, the most recent episode and they were talking about our review competition and so I had realized that I had never reviewed the TechZing podcast so I logged into iTunes and gave them a five star. You know you don’t actually need to do a full comment review where you actually write stuff, you can just click on the stars. It’s pretty easy to go to iTunes, search for startups and we are currently ahead in the competition. I think the last thing for me is I’d mentioned last week that I was disappointed with the way that HitTail had been going.

[05:36] Basically that I had re-launched it about two months ago with a new design and then I had some goals for where I wanted revenue and then coming all that to be, and I had not hit anywhere near them. And I realized a couple of things, one, I was actually talking to my wife about it and said you know, that I have friends, some are in mastermind groups and others are just colleagues and I am talking to them about their progress and I am feeling like you know I am not able to move as quickly as them. And she asked me like what else do they do? And I was like what do you mean? And she is like well do they just run an app like HitTail, and I said yeah. And she said but you do other stuff right?

[06:09] You are running a conference and you have your blog and you have your book, we have the academy, we have the podcast, I have an email newsletter, like we just started talking through it all. And I realized like wow, it’s stupid that I hadn’t thought about it but my week is so divided among these things that I really haven’t been able to focus anywhere near the amount of just pure unadulterated hours that I should be if I want to hit these goals that I have set.

[06:31] So in actuality, my disappointment is probably unjustified because I basically set an almost unachievable goal for myself, I have realized over the past couple of days. And I was pretty down about it you know over the past couple of weeks. And then I have been coming out of it over the past two or three days realizing like what I need to do is adjust my goals or I need to get rid of everything else, you know not do that other stuff and I don’t want to do that, I enjoy the variety. So I have taken a more realistic view of revenue goals that I have for HitTail and I am actually feeling mentally just way better about it.

[07:01] Mike: Oh cool.


[07:06] Mike: Today’s episode is going to be about how to deal with service outages. And given our previous discussions about how things have gone with the academy and how we have encountered some of the downtime that we have, I mean, obviously through no fault of our own, we really just can’t control when a server decides to die or a RAID controller goes. But what we can control is how we deal with the situation afterwards and how we can let people know how things are going.

[07:30] And this also came about because somebody had emailed in a comment to questions at startupsfortherestofus.com and said I would like to hear you guys talk about the Azure outage and related business issues. Seeing as this would affect AuditShark, I am thinking it would be good to discuss the downsides of relying on specific cloud resources like this or Amazon with their latest big outage. Talking points that I think would be good to hear about are there ways to mitigate this and have redundant systems, but a micropreneur also has to weigh the cost of time and effort related to these.

[07:59] Is it really worth it for most micropreneurs to do what it will take to protect themselves from this? Even if you are not using a service like this, significant outages can happen. What are some good ways to make things right with your customers? The importance of keeping customers in the loop during outages and letting them know not just that you are aware but what you are doing to fix things is something that is constantly overlooked by companies, even though letting folks know these things can help them to empathize with you and not be all pissed off about the site being down. Thanks for the great show, okay.

[08:26] So between the academy going down and then getting this email, it really made me stop and think, are we doing the right things? And I think that we definitely have some room for improvement. And so what I did was I sat down and I thought about the things you should be doing and I realized that we are not doing nearly enough I don’t think. So we will talk about it and then afterwards I think that we are going to actually go and implement some of these because I think we have just neglected it. We haven’t really looked at that stuff you know, we haven’t really had to deal with downtime either.

[08:51] Rob: Right, yeah that’s the thing. I mean to give folks, you know, a lot of people probably don’t know what we are talking about with the academy going down. Basically the academy is hosted on DreamHost and it has been since it launched three years ago. And typically DreamHost has maybe one or two outages a year, like a typical webhost you know. There might be a six hour outage or sometimes there were 15 hour outages or something, which sucks at the time, but for the price and for what they allow you to have it’s totally worth it since, you know, I don’t run any mission critical systems on that.

[09:20] Now in the past maybe 40 days I think the academy has had four or five outages and DreamHost specifically is having some issues with the server, it’s the hardware that my VPS runs on, and so that’s been the headache of it. Now there hasn’t, knock on wood, there hasn’t been an outage now in about two weeks and they were coming like every five days right? And it was always something different, it was like the RAID controller or one time it was the hard drive and…

[09:45] Mike: It was the router at one point I think too.

[09:46] Rob: Router yeah.

[09:47] Mike: Yeah.

[09:47] Rob: And they would move it and then they would restore from backup and it would take, you know, one of them was like a 48 hour outage which is just inexcusable right, from any standpoint, even for a non-mission-critical thing like the academy, it just... that sucks. So with that said we had several bouts of downtime, we are hoping we are through but we are in the process of moving to a new webhost that’s going to be faster and hopefully have better uptime. But that is part of the impetus for this episode.

[10:11] Mike: So this episode is going to concentrate largely on things that are outside of your control because planned outages and planned upgrades or downtime are a lot easier to manage because people generally realize that upgrades need to take place and although they are sometimes a little bit inconvenient, it’s also the cost of doing business with a service provider that, you know, gives you these services online.

[10:30] So typically those types of outages, you know about them in advance, you can kind of plan around them and if they let you know, oh we are going to be down between the hours of 4:00 AM and 5:00 AM on Sunday morning you typically don’t care. But what’s not easy to deal with is these extended outages where the customer doesn’t know that it’s coming and it’s going to impact their business. Now with the academy we are not providing business critical functions but at the same time there is a lot of online services out there that do.

[10:56] Take for example 37 Signals, I mean they’ve got I would imagine a pretty large server farm to run all the stuff that they do. And if Base Camp goes down for a day I can’t even begin to imagine the number of emails that they would get in support just because Base Camp is down, I mean how many million users do they have on that?

[11:16] Rob: Seriously. I mean that same thing Joel Spolsky said before they did the FogBugz on demand, the hosted version of FogBugz, he said the reason they hadn’t done that is because he never felt like they had the infrastructure knowledge and the money to do it the way he wanted to do it. He wanted to have absolute redundant, hot backup data centers on opposite ends of the country because he didn’t want FogBugz going down to cause basically millions of dollars in lost productivity because people use this as a workflow tool and, you know, when you learn to rely on it basically developers would be sitting on their hands.

[11:46] So when they eventually rolled it out that’s what they did, it was like super redundant. So the same thing with 37 Signals, I imagine they have a pretty extensive infrastructure going on there.

[11:54] Mike: One of the first things to look at when you have a service that you are offering as a hosted service to your customers is to set up monitoring systems to independently let you know when downtime actually occurs. So you don’t really want to be relying on your hosting provider or your co-location facility or anything like that because I’ve had, for example, I’ve relied on that stuff in the past before for a server that I had co-located some place, and they had all this on-site monitoring and everything else.

[12:21] And my server just stopped responding one day and I didn’t know why and I was trying to figure it out and I called them up and they said oh yeah, the power is out at the co-location facility. I’m like well don’t you have diesel backup generators, and they said well yeah but we turned them all off because the cooling system went out. So basically the cooling system went out, they turned everything off, including all of the servers, so that the place didn’t go up in smoke. And relying on those internal controls is not going to save you when that sort of stuff happens so you really have to have a third party that will let you know about those things.

[12:52] And a couple of tools that I have looked at before: I have used Pingdom, I actually have an account with them; I have a Verelo account; and I have looked at uptimemonitor.net. But basically these tools all do the same kind of thing, they allow you to monitor your sites and you know each system has some different mechanisms to it, but the basic idea is you want to know when things are a problem before your customers find out that it’s a problem.

[13:10] Rob: Yeah this is kind of a big deal. I have actually started doing this with all of my sites and I didn’t used to. But what’s cool is that you can actually, when you get an email especially like you said Verelo and Pingdom do either minute or sub-minute comparisons and I think you can tell it to wait a couple of minutes before emailing or they automatically do that because you don’t just want there to be some tiny glitch and you know to suddenly get emailed your site is down.

[13:32] But if it’s down for a couple of minutes you get an instant email, you always know before your webhost does. Like I don’t think there has ever been a time where I emailed my webhost and said my stuff is down and they were like yeah we already know, it’s always like oh we had no idea. And then if they get on it fast and fix it, your customers quite possibly, quite often will never find out and that’s the ideal scenario right? It’s basically fixing the problem before they ever hear about it.
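A minimal sketch of the "wait a couple of minutes before emailing" idea Rob describes, assuming you run it from a machine that is not on the same host as your site. This is not how Pingdom or Verelo work internally; the names `site_is_up`, `OutageDetector` and `failures_before_alert` are made up for illustration.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers without an HTTP or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP 4xx/5xx, DNS failures, refused connections and timeouts
        # are all treated as "down" in this sketch.
        return False


class OutageDetector:
    """Signal an alert only after several failed checks in a row."""

    def __init__(self, failures_before_alert=3):
        # Hypothetical default: with one check per minute, three failures
        # means the site has been unreachable for a couple of minutes.
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, is_up):
        """Record one check result; return True when an alert should go out."""
        if is_up:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Alert exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.failures_before_alert
```

In use, something like a once-a-minute cron job would call `detector.record(site_is_up("https://example.com"))` and send the email when it returns True. A single glitch resets on the next good check, so it never triggers an alert.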

[13:55] Mike: So the second part of building a process for dealing with these types of issues is determining if redundant systems are warranted or necessary. And I think that this depends a lot on your business model, the price point, the expectations from your users, you know. If you have 10 million users you really need redundancy and there is really no question about it at that point, if you have 10 chances are probably not. So there is that range in there. I think you have to make the judgment call based on the revenue that you are making and how much that downtime can actually cost you.

[14:23] So one of the things that you can do, and there are a few different ways to look at it: one is, you know, how much money are you going to have to refund the customers, what sort of backlash would you expect to get, and a big one is the number of support emails that you get, because if you get a hundred support emails and you have to spend two or three hours dealing with them, that’s basically three hours of support cost. I have seen entrepreneurs who’ve told me, hey last quarter I had $50,000 worth of support cost because of this one customer.

[14:48] They email in so often or they provide a product that they in turn resell and you know they get emails so those things could basically get forwarded on to you know this entrepreneur. And those things cost real money, I mean you may think oh it’s just an email, but they still cost time. Somebody has to respond to them and you have to deal with those on an ongoing basis. So a lot of the times it’s just making judgment calls about what works and makes sense for your business. And one of the things that comes to mind is Netflix.

[15:16] And one of the first things that Netflix did when they built out their cloud infrastructure was they built this thing called the Chaos Monkey and you can read about it online, just search for Chaos Monkey and Netflix and you will find it. And essentially they built this little application that would run in the cloud and its sole job was to run around all the other services that were run by Netflix and shut them down or you know basically cause problems.

[15:40] And the developers’ job was to code the application in such a way that it could tolerate failures wherever they may occur. So if the log in was failing for example, go to some sort of a caching system. If you know the streaming was failing, go to a different set of servers for streaming, or if there is an entire data center crash, be able to go to another data center. Most people don’t realize this but Netflix uses Azure, Windows Azure and Amazon to host Netflix so that if one of those services goes down they can actually flip over to the other and most customers don’t even know.
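The "if one thing fails, go to the next" pattern Mike describes can be sketched in a few lines. This is only an illustration of the fallback idea, not Netflix's actual code; `first_working`, `primary_stream` and `backup_stream` are hypothetical names, and the primary is hard-wired to fail the way a Chaos Monkey style test would force it to.

```python
def first_working(providers, request):
    """Try each (name, handler) pair in order and return the first success."""
    errors = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    # Every provider failed, so surface all the reasons at once.
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_stream(request):
    """Hypothetical primary service, simulated as being shut down."""
    raise ConnectionError("primary data center unreachable")


def backup_stream(request):
    """Hypothetical backup service in a second data center."""
    return f"streaming {request} from backup"


served_by, result = first_working(
    [("primary", primary_stream), ("backup", backup_stream)],
    "movie-42",
)
print(served_by, "->", result)
```

The point of killing services on purpose is to prove that the second entry in that list really works before a real outage forces you to find out.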

[16:16] Rob: And that’s right there where you said, you know, there is that balance between having 10 customers and 10 million, and they are at the 10 million, they are in that order of magnitude so it’s totally…

[16:22] Mike: Yeah they are close to 20 million.

[16:25] Rob: 20 million, yeah, so it’s totally worth it for them, right, to have all that money and time invested into that infrastructure and to be able to switch over.
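The judgment call discussed here, weighing refunds and support time against what redundancy costs, comes down to simple arithmetic. A rough sketch follows; only the hundred support emails taking about three hours comes from the conversation, and the refund amount, hourly rate, outage count and redundancy price are invented for illustration.

```python
def outage_cost(refunds, support_emails, minutes_per_email, hourly_rate):
    """Estimate what one outage costs: refunds plus support time."""
    support_hours = support_emails * minutes_per_email / 60
    return refunds + support_hours * hourly_rate


def redundancy_pays_off(outages_per_year, cost_per_outage, redundancy_cost_per_year):
    """True when expected yearly outage cost exceeds the cost of redundancy."""
    return outages_per_year * cost_per_outage > redundancy_cost_per_year


# 100 emails at 1.8 minutes each is the three hours mentioned above.
# The $200 in refunds and the $50/hour rate are assumed numbers.
per_outage = outage_cost(refunds=200, support_emails=100,
                         minutes_per_email=1.8, hourly_rate=50)

# Assumed: two outages a year versus $3,000 a year for a redundant setup.
print(per_outage, redundancy_pays_off(2, per_outage, 3000))
```

With numbers like these the redundant setup does not pay for itself, which matches the point that a small app with a handful of customers probably does not need it. It leaves out backlash and lost goodwill, which are harder to put a number on.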

[16:32] Mike: And maybe just host it on Amazon.

[16:35] Rob: That makes sense.

[16:36] Mike: But the idea was that they did have it split between multiple data centers so that if one failed it could go over to the other. And maybe that was the issue; like they put too much traffic over onto the other Amazon data center, I am not real sure of that.

[16:47] Rob:    Right, you know I think a second part to this, the way you started this point was talking about determine if redundant systems are warranted or necessary, and I think a big part of that is how many customers you have, your budget, but I also think a part of it is what are your customers doing on your website? If you have a job board, then being down for six hours, 12 hours, 24 hours it does suck and you know you probably want to refund some money to your customers but that’s not the end of the world, like people can come back in a few hours and look for a job or post a job.

[17:14] But when you have something like a workflow tool and a big enterprise is using it or even a small enterprise, you know something like Base Camp that’s actually holding people up from doing their work, then you are going to have a much bigger problem, you are going to have people who are more pissed off. And I think Netflix is a unique case where it’s like an enterprise is not using it but consumers do tend to kind of wreak havoc on you, you know, especially if you have 20 million of them. If you are down you are going to have to deal with that support cost you talked about earlier that I bet is a very real support cost when they have outages. You know the last thing I wanted to add to this is kind of an aside, have you ever listened to the podcast This Developer’s Life?

[17:50] Mike: I listened to it a couple of times.

[17:52] Rob: It’s really cool, it’s done by Rob Conery and Scott Hanselman and it’s modeled after This American Life which is, you know, an NPR podcast. And so This Developer’s Life basically, they choose a theme and then they do a couple of vignettes and they interview typically software developers. And there is one episode, it’s one of the best of the series, the best one I’ve heard or one of the best, it’s called ‘Pressure’ and it’s about when Stack Overflow went down and their database got corrupted. And so they interviewed Jeff Atwood and all the people dealing with that, it is really good, if you haven’t listened to it I highly recommend it.

[18:25] Mike: I think that’s one of the few episodes that I did catch of it.

[18:28] Rob: They talk about all this stuff about when you are in the moment and they talk about the Chaos Monkey and all that so if you want to hear more about this I recommend folks check it out.

[18:35] Mike: So the next thing that you should do is provide a third party channel for people to submit issues and make sure they know how to use it. And the key piece of this one is not just providing the third party channel but making sure they know how to use it and publicizing it to them. So for example if you have a website, typically the third party channel is going to be email. Relying on a ‘contact us’ page on your website is not going to cut it, you really need to make sure that people know to email you at support at whatever your domain is, because otherwise that thing goes down and, you know, you are kind of hosed, there is not much you can do about it at that point because they have no way to get to the contact us page because it’s not even responding because your site is down.

[19:11] And with that kind of goes separating out your email host from your webhost because if they are both hosted on the same server or on the same set of servers you could easily take the whole infrastructure down in one shot. So making sure that you separate those up, divide them into you know the different levels of responsibility. You know there are some hosts that I know of who use a third party hosted status site. So for example DreamHost has a—what was it, dreamhoststatus.com?

[19:38] Rob: Yeah.

[19:39] Mike: They just have this other site and people know to go there if DreamHost is having problems for whatever reason. It’s pretty rare when I see the DreamHost website itself go down, but I have seen it go down on occasion.

[19:50] Rob: I think one note here, for a micropreneur who, you know, doesn’t have a lot of infrastructure and doesn’t want to set up a bunch of third party sites: I make sure the support at my domain name always resolves. I have a support at email address for all of my apps and so to me it’s at least a semi-intuitive way for people to contact me, even if the website is down they can always reach us.

[20:11] Mike: So the next item on the list is to commit to fixing ongoing issues, not just patching things up with duct tape. And you need to let users know what happened and how you intend to address these issues in the future and prevent them from ever happening again. And one of the things that came to mind was very recently I got an email from Windows Azure, and basically it just summarized what had previously happened. There was a disruption on February 29th, they basically just say we want to let you know, here is our follow up on this particular disruption, you know, here is our blog post that shares all of our findings, root cause analysis and here is where they are posted, and then they lay it out.

[20:49] These were all the services that were impacted, and you know they apologize for it and then they basically say that they recognize that you know this could have some serious impact on their users. And they say they are going to apply credit whether you were affected by this outage or not, which is kind of nice, but you know I don’t necessarily think that if somebody wasn’t affected by it you need to refund their money for it.

[21:12] But at the same time if you are not tracking who was affected or you don’t have a way to track who is affected, it might be worth the goodwill of just saying hey we are going to refund your money for the time that it was down or in excess of the time that it was down in order to essentially buy goodwill with your user base.

[21:29] Rob: Yeah I think this is one thing that we should probably do.

[21:31] Mike: Yes.

[21:32] Rob: We should probably send out an email to all the members of the academy and let them know. You post about it in the forums and conversations, but really we should just be proactive and email and say this is what’s happened, this is what we are doing, because we are moving to a new host.

[21:43] Mike: Right.

[21:44] Rob:    We are moving to a better, faster and more reliable host. We should let people know that we are doing that so that they at least know that we are aware of it and we are not just sitting on our hands on this one.

[21:51] Mike: This is definitely what I had more in mind when I was going through this list, because we do have monitoring set up. In terms of redundant systems we don’t really have redundancy but it’s not mission critical if it’s down for a few hours, it’s really not that big a deal. The support burden doesn’t get too high, I mean we do get some emails, especially I see some emails to me personally on occasion saying oh did you know that the site is down and I can at least say yes. As we said we’ve already got the support at micropreneur.com up and running.

[22:17] People tend to know that they can send an email there and you know get support for that or let us know that something is down. But I think that committing to fixing the ongoing issues is really a big thing and letting people know what we are doing about it is a big one and I just don’t think we’ve done that well enough. I have posted in the forums, I have talked to people about it individually through email but I don’t feel like it’s been publicized very well. And the other thing that kind of ties into this is that you should have a way to contact your users when you do have an outage because right now we don’t. If our site is down and it’s going to be down and we know it, we don’t have a way to let them know.

[22:49] Rob: You know we do, we have an email list that’s separate from DreamHost.

[22:52] Mike: Is it?

[22:53] Rob:    That we could use yeah, it notifies people about new modules so they could have unsubscribed from it but I mean how many copies can we really have. So that would do a pretty good job, I bet we would hit most people.

[23:03] Mike: Right. I knew that that list was there but I wasn’t sure how accurate it would be.

[23:07] Rob: Well it’s really accurate, how about that?

[23:09] Mike: But I mean just doing something like exporting the list like once a day or something like that would be probably fine and I think that that would catch the vast majority of people and we would be able to let them know hey we are having an outage. At least let us look into those things because right now we just—I don’t feel like we have a good process or procedure in place for dealing with that stuff.
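The once-a-day export Mike mentions could be as small as the sketch below. It assumes you can already get the member addresses out of your system as a list of strings; the function names and the `contacts-YYYY-MM-DD.csv` file name are invented for illustration, and the important part is that the file ends up somewhere other than the server that might go down.

```python
import csv
import io
from datetime import date
from pathlib import Path


def contacts_csv(emails):
    """Return the de-duplicated, lower-cased addresses as CSV text."""
    unique = sorted({e.strip().lower() for e in emails if e.strip()})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["email"])
    for address in unique:
        writer.writerow([address])
    return buffer.getvalue()


def save_snapshot(emails, directory, today=None):
    """Write today's snapshot to <directory>/contacts-YYYY-MM-DD.csv."""
    today = today or date.today()
    path = Path(directory) / f"contacts-{today.isoformat()}.csv"
    path.write_text(contacts_csv(emails), encoding="utf-8")
    return path
```

Run daily from a scheduled job, `save_snapshot(member_emails, "/some/offsite/folder")` keeps a dated copy you can load into a separate mailing tool when the main site is unreachable.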

[23:25] Rob: Yeah.

[23:26] Mike: So to follow up on those, if your hosting provider is unable or unwilling to fix some things, you need to go somewhere else and honestly that’s kind of the point we are at with the academy: we have decided to actually move the entire system from one host to another and we are moving to WP Engine. And we have done some preliminary tests on it and it’s radically faster than what the old system is. The tests that I did, don’t get me wrong, are around strict HTML, but the one HTML page that I loaded on WP Engine loaded in I think 900 milliseconds and on DreamHost it was 2.8 seconds.

[23:59] Rob: Right so it’s like three times faster.

[24:01] Mike: Three times faster which is crazy and a lot of that 900 milliseconds was actually spent querying the DNS server on DreamHost.

[24:08] Rob: So if it caches it will be even faster is that right?

[24:11] Mike: Right, yeah.
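A rough way to reproduce this kind of comparison yourself, reading the DreamHost figure as 2.8 seconds, which is what "three times faster" implies. The sketch times a full page fetch, so the first call also includes the DNS lookup Mike mentions; `time_fetch` and `speedup` are illustrative names, and any URL you pass in is your own.

```python
import time
import urllib.request


def time_fetch(url, timeout=30):
    """Return the seconds taken to request a URL and read the whole body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def speedup(old_seconds, new_seconds):
    """How many times faster the new host is than the old one."""
    return old_seconds / new_seconds


# The two load times quoted in the episode: 2.8 s versus 900 ms.
print(round(speedup(2.8, 0.9), 1))
```

One measurement is noisy, so timing each host several times and comparing the medians gives a fairer picture than a single page load.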

[24:12] Rob:    Cool. Yeah so this is a tough one, the problem with moving hosting providers is it always takes either your time or your money, you know, to pay someone to do it. And so you do have to get pushed to that point of feeling like it’s all I can take and I cannot stand it no more you know. That’s definitely where we’ve gone, hopefully yeah I don’t know of like an absolute metric to say.

[24:28] If someone goes down three times in one month then you should leave because frankly having been with several webhosts, I have probably six or seven different webhosting accounts and been with many of them for many years, every once in a while something bad will happen at a host and you know they will go down a couple of times in a month and I always think oh men I am totally going to leave and then they won’t go down for a year. And so it’s worth staying right? It’s not worth jumping off the bus at the first sign of an outage.

[24:52] It’s really a judgment call and it’s also a call of how well they respond. Do they follow these things, do they keep you updated? Do they apologize, do they credit you something? Are they a good host in general and do you believe them that they actually are working to fix it?

[25:07] Mike: Yeah and what you said was dead on. I think it’s how do they respond to it and how do they react and like I said I feel like we haven’t done a very good job, you know I have apologized to a few people and said you know look we are really sorry and you know posted in the forums and told everybody kind of what’s going on. But I think that you know regardless of where you are at there is probably always room for improvement. So maybe it’s just me being hypercritical but I think we can do better.

[25:30] Us moving to a new host is a step in that direction I think. So the last thing is that if you are going to go somewhere else keep in mind that self-hosting is an option but you are still tied to a co-location provider and you know a lot of times managed hosting is a better option but it’s also more expensive. I mean the fact is that if you are going to rely on someone for your hosting, good service costs money. But at the same time you don’t need to overpay for a good service either. I mean there is a lot of hosting providers out there that provide good service at a reasonable price point.

[26:00] And one company that I remember, it was probably 12 years ago and I forget what the name of the company was, it was down in Dallas I believe. But they essentially had a setup where the co-location facility was on the border of two towns or two counties so they had power connections from each of them. And pretty much everything was redundant, all of their power, everything else, you know, their internet connections came in from both places and they claimed the highest reliability that you could possibly get.

[26:29] Now to host there was like $2000 a month and that was for one server and I think it was a shared server at the time but it was just crazy expensive. I don’t think it was worth it but I think that they wanted to portray that they were providing internet access so they didn’t want to provide the illusion, the perception that their site was going down because oh well if they are a service provider and they can’t keep their site up then what does it say about them.

[26:54] Rob: Right and I think the thing to keep in mind if you are looking for a web host is that every web host is going to have horror stories. You search for Rackspace outages or Rackspace complaints and since they are so big, every time they have an outage they affect, you know, a thousand people even though that’s like a bazillionth of a percent of the number of people they have hosted. So you can do the same with DreamHost and SoftLayer and Media Temple and you hear some people have terrible experiences with them and then other people swear by them.

[27:17] And I am still, you know, at this point, having been with DreamHost for seven years, yes I am miffed about their outages but realistically they actually have been a damn good host for the price that I pay. And so while we are going to get the academy off and I am going to get a couple of other, what I consider more mission critical sites off, I am going to leave several sites on there. I probably have 20 to 25 websites on there still and I still will recommend it to some people and I will let them know always that they are a lower cost provider and that they will have more outages than Rackspace but Rackspace is like four or five times the cost. And so you have to weigh that out as you are looking.

[27:50] I think one other thing that is critically important and I think that it’s something that we do have especially, I know I have on pretty much all my sites but especially we have with the academy, is that having a good relationship with your customers is huge when you have an outage because they give you some leeway, because they trust that you are not screwing them. And if you have just some business that’s purely transactional where they come and pay you money, they don’t know who you are, you are anonymous, then when you have an outage they are going to be pissed because you are a faceless corporation so to speak.

[28:19] Whereas with the academy like everyone knows Mike and I and they know that we are, A, not doing it on purpose, that B we are going to make it right, that C we are doing the best we can, you know all these things. And so having that good relationship with your customers really is that foundation of allowing them to trust you and allowing them to give you the benefit of the doubt when you do in fact have outages. And so Mike I think that’s something you know we have with the academy, we should definitely send out an email like you said to notify them but it has been nice, we have been, you know, given the benefit of the doubt thus far.

[28:46] I actually had one question for you about this because the original question was about Azure going down and, you know, it made me think of like huh, how would you handle that with AuditShark? Like do you have redundancies built in at this point? And I guess the answer would probably be no, you are still you know at an MVP stage. But have you thought about that long term, once you have got you know some bigger clients using it, that you were going to build some type of crazy redundancy?

[29:08] Mike: I don’t know about crazy redundancy, I mean right now there is a lot of things that—it’s more like everything is distributed.

[29:15] Rob: It’s a bunch of database inserts right? And it’s distributed to what, to multiple servers?

[29:19] Mike: Well what I have is, yeah I have multiple servers in Azure. So what happens is I have two different servers in there right now, so if I grow the business and I need to scale out beyond two I can scale to three, four, five or whatever and because of the caching service that I am using for the session data, basically it just scales all the customers across those servers and I don’t have to worry about that because in building for Azure you have a tendency to construct your application in such a way that it is scalable so that it can scale out instead of just scaling up.

[29:49] If I need to scale up I can, I just throw in another like, add an extra large instance for example; right now I am on two extra small instances. But I could scale those up as needed; I think there is like four or five different levels. And if I get to the point where I am on two extra large instances and it’s still not cutting it because, you know, they are being overworked, I can basically add in a third one and a fourth one and a fifth one. And to the end customer I don’t need to make any changes because all of them resolve to the same URL, so all the customer installations go back to that same URL.

[30:19] What I would have a problem with is if the entire data center goes down, that’s where I would have an issue. A lot of the things that happen at the customer sites they are going to happen on a scheduled basis. So those things are essentially, they are caching their jobs and tasks already so they are going to run them out on whatever schedule they are supposed to run. And then right now I am basically just piping the data directly back but the data is in the future going to be written to disk if for any reason it can’t be sent up.

[30:49] Right now it just kind of dumps it and redoes the work which you know is not the greatest solution in the world but it does work for the time being.
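[Editor’s sketch: the write-to-disk fallback Mike describes for the future could look roughly like the Python below. This is not AuditShark’s code, which is not shown in the episode. The class name `ResultSpool`, the `send` callable and the one-JSON-file-per-result layout are all assumptions made for illustration.]

```python
import json
import os
import tempfile
import uuid
from typing import Callable


class ResultSpool:
    """Send results upstream; keep them on disk when the send fails."""

    def __init__(self, spool_dir: str, send: Callable[[dict], None]) -> None:
        self.spool_dir = spool_dir
        self.send = send  # hypothetical uploader; assumed to raise OSError on failure
        os.makedirs(spool_dir, exist_ok=True)

    def submit(self, result: dict) -> bool:
        """Try to upload; on failure write the result to disk. Returns True if sent."""
        try:
            self.send(result)
            return True
        except OSError:
            self._write(result)
            return False

    def _write(self, result: dict) -> None:
        # Write to a temp file, then rename, so a crash never leaves a half-written .json file.
        final_path = os.path.join(self.spool_dir, f"{uuid.uuid4().hex}.json")
        fd, tmp_path = tempfile.mkstemp(dir=self.spool_dir, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(result, handle)
        os.replace(tmp_path, final_path)

    def flush(self) -> int:
        """Retry spooled results (in no particular order); stop at the first failure."""
        sent = 0
        for name in sorted(os.listdir(self.spool_dir)):
            if not name.endswith(".json"):
                continue
            path = os.path.join(self.spool_dir, name)
            with open(path, encoding="utf-8") as handle:
                result = json.load(handle)
            try:
                self.send(result)
            except OSError:
                break  # still down; leave the remaining files for the next attempt
            os.remove(path)
            sent += 1
        return sent
```

[A scheduled job on the customer machine would call `submit` for each result and `flush` at the start of each run, so nothing has to be recomputed after an outage.]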

[30:55] Rob: Yeah that’s nice, as long as it doesn’t yeah, it’ll choke and somehow do bad stuff to their network or whatever, I don’t—yeah it’s not like stuff is so critical that it has to get done right that minute.

[31:05] Mike: I will be sending out emails and stuff like that on an ongoing basis for reporting and everything, but you know it has made me wonder a little bit you know, what do I do if for example I am trying to run a report or update some of the indexes in the database to do lookups into the various Azure tables because some of them are time sensitive. So what I have started doing is I have started restructuring my data a little bit so that if I need to identify data and what to do about the indexes or how to rebuild those pointers within the different Azure tables, because Azure tables are not like a SQL database, Azure tables are more like a NoSQL database.

[31:42] So I have pointers all over the place that are just, you know, little snippets from, and maybe three or four different tables that kind of slice and dice the data in different ways to make it easy to read. Well what I have started doing is I have started taking those data and saying okay well for my primary key for this particular table I am going to use the date. And then you know if I need to catch up on things I basically just go back and look at a row that says when was the last time that this calculated and using that calculate what the Azure table name is and then go from there.
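[Editor’s sketch: one way to read the catch-up idea above is "derive the table name from a date, then walk forward from the last calculated date". The Python below shows that under stated assumptions. The `results` prefix, the one-table-per-day granularity and both function names are hypothetical; the episode does not give Mike’s actual naming scheme. The name avoids dashes because Azure table names are alphanumeric only.]

```python
from datetime import date, timedelta
from typing import List


def table_name_for(day: date, prefix: str = "results") -> str:
    """Build a per-day table name such as results20120301 (naming scheme is assumed)."""
    return f"{prefix}{day:%Y%m%d}"


def tables_to_catch_up(last_calculated: date, today: date) -> List[str]:
    """List the tables for every day after the last calculation, up to and including today."""
    names = []
    day = last_calculated + timedelta(days=1)
    while day <= today:
        names.append(table_name_for(day))
        day += timedelta(days=1)
    return names
```

[After an outage, the stored "last calculated" date is the only state needed to work out which tables still have to be processed.]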

[32:13] You know it’s stuff that I had to at least think about. But again going back to your question about the redundancy, the systems themselves I don’t have to worry about. If the entire North Dakota datacenter goes out, I will have a problem.

[32:25] Rob: Right and you would in any case. So let’s just say there are more important clients than you right now using their North Dakota datacenter.

[32:32] Mike: Right, I have thought a little bit about it whether I would do an app one, an app two and kind of cluster things in different datacenters or whatever but you know realistically Azure only has two in the United States.

[32:43] Rob: Realistically like tracking apps like what you have and what HitTail is where you just get a ton of inserts all the time, those are different than just having a web app, a transactional web app where someone is building like an invoicing SaaS app or time tracking or something. The load is much less heavy and if it does go down you can let people know about it. Whereas like if HitTail were to go down and I was using the old synchronous JavaScript code, people who had tracking code on their site, it would actually slow their website down. You know it doesn’t do that anymore, it’s all asynchronous.

[33:16] Since I bought it I rewrote it as asynchronous so now if someone’s webpage loads, even if our server is down their page loads as quickly. And I think that’s all you have to think about you know on your end is that if your servers go down, does your client software that’s installed out in the customers’ area, you know, does that fail gracefully and not impact your client’s kind of workflow and network and all that stuff.
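[Editor’s sketch: HitTail’s tracking code is JavaScript and is not shown in the episode. The Python below only illustrates the same "fail gracefully" principle for installed client software: do the network call off the main path, give it a timeout, and swallow connection errors. The function names `post_hit` and `report_in_background` and the two-second timeout are assumptions.]

```python
import threading
import urllib.request
from typing import Callable


def post_hit(url: str, payload: bytes, timeout: float = 2.0) -> None:
    """Send one report; raises an OSError subclass if the server is down or slow."""
    request = urllib.request.Request(url, data=payload, method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass  # the response body is not needed


def report_in_background(send: Callable[[], None]) -> threading.Thread:
    """Run `send` on a daemon thread so the caller never waits on, or crashes from, an outage."""

    def worker() -> None:
        try:
            send()
        except OSError:
            pass  # server unreachable: drop the report rather than hurt the customer

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

[A caller would write something like `report_in_background(lambda: post_hit(url, data))` and carry on immediately, whether or not the server answers.]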

[33:38] Mike: Yeah and that’s one of those things where you kind of need to rely on some sort of a caching service. And one of our academy members Jason Shore, he owned VistaDB for a while and ended up going over to Microsoft to work in their SQL Server group. He sent a link out about a Microsoft Research talk that he gave and I looked at it the other day, the talk is all about doing things on the Windows Phone and different types of database access that you can use and how to access things over the network and different caching mechanisms that you can use.

[34:10] And as I am sitting there watching this talk that he is giving, it just struck me that there are so many of these things that apply to like a distributed model where you have your software out there and there is different components where you may need to rely on external data. And he talked a lot about caching and how the fact is that you know just like a 32K access can create an extra three seconds of load time on a Windows Phone. So then again some of these things are related directly to the Windows Phone.

[34:38] But there were other things he talked about like latency and going over like a 4G connection where the latency is, I think he said 3G actually, it’s a 250 millisecond latency. But if you are doing say 10 data loads I mean that’s—you are doing nothing but waiting for 2500 milliseconds for data to come down and on a phone that’s just a killer. But if you could cache some of that information instead on the machine or you do it as a single query out to the data source that you pull it down as one chunk and then you slice it up locally, or if you are just caching everything and then you go out and do like asynchronous data loads those are the sort of things that are really going to help your application survive when things on the internet break.
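[Editor’s sketch: the arithmetic above (10 loads at 250 milliseconds each is 2500 milliseconds of pure waiting) and the "pull it down as one chunk, then cache it" alternative can be shown in a few lines of Python. The `BatchingCache` class and its `fetch_many` callable are hypothetical; they stand in for whatever bulk endpoint a real data source offers, and the sketch assumes that endpoint returns a value for every key requested.]

```python
from typing import Callable, Dict, List

LATENCY_MS = 250  # per-round-trip figure quoted in the episode


def sequential_wait_ms(requests: int, latency_ms: int = LATENCY_MS) -> int:
    """Latency cost when each data load is its own round trip."""
    return requests * latency_ms


class BatchingCache:
    """Fetch many keys in one round trip and serve repeats from memory."""

    def __init__(self, fetch_many: Callable[[List[str]], Dict[str, str]]) -> None:
        self.fetch_many = fetch_many  # hypothetical bulk endpoint
        self.cache: Dict[str, str] = {}
        self.round_trips = 0

    def get_many(self, keys: List[str]) -> Dict[str, str]:
        # Only go to the network for keys that are not cached yet.
        missing = [key for key in keys if key not in self.cache]
        if missing:
            self.cache.update(self.fetch_many(missing))
            self.round_trips += 1
        return {key: self.cache[key] for key in keys}
```

[With ten keys, the batched version pays the latency once instead of ten times, and a second request for the same keys pays nothing at all.]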

[35:21] [Music]

[35:23] Rob: I think that about wraps us up today, if you have a question or comment you can call it in to us at 888-801-9690 or you can email it to us in MP3 or text format to questions@startupsfortherestofus.com. Our theme music is an excerpt from “We’re Outta Control” by MoOt used under Creative Commons. A full transcript of this podcast is always available at our website startupsfortherestofus.com. Thanks for listening, see you next time.


3 Responses to “Episode 73 | How To Deal With Service Outages”

  1. If your servers go down, how do you still get your support emails? How do you get your email to run through a different server than your host?

    • To do this you need to make changes to your DNS records to route mail to a different web host (Google it for a more specific answer). I don’t tend to do this unless my web host has a lot of downtime, but it is the best way to separate your concerns.

  2. I did find some tutorials, thanks.

    What do you think about Google Apps for email management? It would keep your email off your servers like discussed, right?