Irretrievable loss of an EC2 instance, its EBS volumes, and all their snapshots
Having read, back in the day, "Cloudmouse deleted all virtual servers" and comments in the style of "it's his own fault, he should have trusted a proven cloud," I decided to tell my own horror story involving the highly respected Amazon cloud (AWS). I briefly talked about it on a podcast, but I think the important details and impressions of the whole nightmare got lost there.
I have been using AWS for work and personal projects for a relatively long time (more than three years) and consider myself a fairly advanced user. From AWS's rich catalog I have had occasion to work with, or at least try out, a large part of that abundance, but a handful of basic services are without doubt the ones used most often: EC2 (instances/VMs) inside a VPC (private network), the EBS volumes attached to them, ELB (load balancer), plus Route 53 (DNS). Out of these five you can assemble various virtual machine configurations joined into a network, and if you add S3 for data storage, you get something like a small gentleman's set of the most popular AWS services.
The reliability of these systems varies, and they come with different SLAs, mostly very impressive ones. From the point of view of practical use, I had not encountered any critical problems with AWS. That does not mean everything is absolutely smooth, but in a properly organized system, where a sensible user does not put all his eggs in one basket and at least distributes services across AZs (availability zones), all the rare outages and problems could be ridden out without much loss and with minimal headache.
Of what I have run into in real-world use, the most frequent situations were the planned reboot ("your Amazon EC2 instances are scheduled to be rebooted for required host maintenance") and the planned migration to new hardware ("EC2 has detected degradation of the underlying hardware"). In both cases nothing crippling happened, and after a reboot the instance was available with all its data on EBS intact. A couple of times there was weirdness with an Elastic IP address that suddenly became detached from its instance, and once I completely lost routing to a new network interface on one of the virtual machines. All of these cases fell into the category of "yeah, it happens, but rarely and it doesn't hurt," and they caused me no particular fear or anger. If something went wrong, I could always contact support and get help, or at least a clear explanation.
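Incidentally, these maintenance notices can also be watched for programmatically, not just in the email inbox. Below is a minimal sketch of mine (not anything from the original article) that uses boto3's describe_instance_status to list scheduled events such as reboots; the region is an example value and credentials are assumed to come from the default chain.

```python
# Hypothetical helper: list scheduled maintenance events for all instances
# in one region. Region and credentials are illustrative assumptions.
import boto3

def list_scheduled_events(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    for status in resp.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            print(status["InstanceId"],
                  event["Code"],            # e.g. "system-reboot"
                  event.get("NotBefore"),   # earliest scheduled start
                  event.get("Description"))

if __name__ == "__main__":
    list_scheduled_events()
```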
And then it happened. On January 26, one of the instances that is brought up automatically at around 5 in the evening refused to start. Attempts to launch it from the AWS console showed it as "initializing," and after a few seconds it fell back to a hopeless "stopped." No logs were created, because it apparently never got as far as booting the OS. At first glance there was no explanation of what had happened at all. On closer inspection I noticed a suspicious message next to the volume list: "Internal error" with an error code. Going to the page where all my EBS volumes can be seen, I found both volumes from the deceased instance in a red state labeled simply "Error."
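The same picture is visible outside the console through the EC2 API. Here is a minimal sketch of mine (not the author's tooling) that reports the state of every volume attached to a given instance; the instance ID and region are placeholders.

```python
# Hypothetical check: report the state of every EBS volume attached to an
# instance. The instance ID below is a placeholder, not from the article.
import boto3

def check_attached_volumes(instance_id, region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    for vol in resp["Volumes"]:
        # State is normally "in-use" or "available"; "error" is the bad case.
        print(vol["VolumeId"], vol["State"])

if __name__ == "__main__":
    check_attached_volumes("i-0123456789abcdef0")
```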
It was weird and unpleasant, but not fatal. After all, being properly paranoid, I take snapshots of every volume of every instance every day and keep them for anywhere from a week to six months, and restoring a volume from a snapshot is a trivial task in AWS. But this is where the truly strange and very scary part turned up: all the snapshots of these two volumes had also turned into "error" and were unusable. When I say "all," I mean the entire history, all seven days' worth kept for these volumes. I don't know about you, but I found it hard to believe my eyes. I had never seen anything like it and had never heard of anything like it. It was so unreal that it did not even trigger panic: I was sure this was a glitch in their console, that it simply could not be that the EBS volumes and their snapshots had been lost at the same time. After all, the theory that everything is stored side by side, or even on the same disk array that suddenly died, completely contradicts their own description of how this works.
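For context, the daily-snapshot routine described above is easy to automate. The sketch below is my own illustration of the idea, not the author's actual script: it tags each snapshot with a hypothetical "retention-days" tag and prunes the expired ones (today Amazon Data Lifecycle Manager can do the same job as a managed policy).

```python
# Minimal sketch of a daily snapshot job with simple retention. The tag
# name is my own convention, not anything from the article.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

def snapshot_volume(volume_id, retention_days=7):
    snap = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"daily backup of {volume_id}",
    )
    ec2.create_tags(
        Resources=[snap["SnapshotId"]],
        Tags=[{"Key": "retention-days", "Value": str(retention_days)}],
    )
    return snap["SnapshotId"]

def prune_snapshots(volume_id):
    resp = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )
    now = datetime.now(timezone.utc)
    for snap in resp["Snapshots"]:
        tags = {t["Key"]: t["Value"] for t in snap.get("Tags", [])}
        days = int(tags.get("retention-days", "7"))
        if snap["StartTime"] < now - timedelta(days=days):
            ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
```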
When I called support (a separate paid service), I got through to a technician in India. Before that, all my dealings with support had been with local and very competent specialists who usually impressed me with their insight. This one was also smart, but very slow. After I explained what had gone wrong, he disappeared for 15 minutes, absorbed in his investigation. From time to time he came back to report that he and a team of experts were looking into the problem. There were several such deep dives, and I spent about an hour on the phone with him. The end result was disheartening: it was all gone. Of course, I demanded an explanation and a full analysis of the causes of the incident, but all he could say was "please accept our apologies, but nothing can be done, your data is lost." To my question "why, I did everything correctly, I had snapshots, how could this happen?" he offered to refund the money for storing the lost snapshots, which was of course laughter through tears. Still, I kept demanding explanations, and he reluctantly admitted that the problem was related to a failure in the new instance types (C4) and that it had already been fixed. How exactly it was related and what had been repaired he could not explain, but he promised to send an email with a full report and answers to everything.
From the report they sent the next day:
During a recent integrity check we discovered unrecoverable corruption in 1
of your volumes. We have changed the state of the volume to “error.” In
addition, snapshots made from these volume are also unrecoverable, so we have
changed their status from “Completed” to “Error.” You will no longer be charged
for these volumes and snapshots and may delete them at your convenience.
Please note that you will no longer be able to launch instances using AMIs
referencing these snapshots. Instructions for removing AMIs is available here
(http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/deregister-ami.html).
Although EBS volumes are designed for reliability, including being backed by
multiple physical drives, we are still exposed to durability risks when
multiple component failures occur. We publish our durability expectations on
the EBS detail page here (http://aws.amazon.com/ebs/details).
We apologize for the inconvenience this may have caused you. If you have any
further questions or comments regarding this matter, please contact us at:
aws.amazon.com/support.
This does not even smell like an answer. Apart from the factual errors ("in 1 of your volumes" when there were two) and the boilerplate wording, there is nothing useful in it. I could not put up with an answer like that, so I involved our PM from AWS (Amazon attaches one to its more or less significant customers), telling him about the disaster. I called late in the evening and left a message. I did not choose my words particularly carefully, did not hide the degree of my shock, and made it very clear what I thought about all of it. Thirty minutes later he called back and offered to arrange a conversation the next morning between me and everyone at AWS who had the answers.
By the way, for all the severity of the failure, it had no ill effects on our production systems. First, this node was not the only one of its kind, and second, it was completely disposable and almost completely immutable. Recovering it, or rather building a new one, took about ten minutes: Ansible rolls out everything needed to run and manage containers, and then the necessary containers are delivered and configured. Since this instance is part of a data-processing system, at the end of the day there was no unique data on it, and everything it needs is pulled from external sources. However, if something like this happens to an instance that does hold really important and unique data, and an independent backup has not yet been set up, it can be a very big problem.
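To make the "disposable node" idea concrete, here is a rough sketch of what such a rebuild could look like. The AMI ID, subnet, registry, and user-data bootstrap are all placeholders of my own; the article's actual flow used Ansible to install Docker and roll out the containers rather than raw user data.

```python
# Rough illustration of rebuilding a disposable worker node from scratch.
# All identifiers below are placeholders, not details from the article.
import boto3

USER_DATA = """#!/bin/bash
# minimal bootstrap (distribution-dependent): install docker, start worker
yum install -y docker
systemctl enable --now docker
docker run -d --restart=always my-registry.example.com/worker:latest
"""

def rebuild_worker(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",      # placeholder AMI
        InstanceType="c4.large",
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-0123456789abcdef0",  # placeholder VPC subnet
        UserData=USER_DATA,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "worker"}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]
```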
A formidable team showed up for the conference call from Amazon. In addition to the PM and the solution architect attached to us, there were several specialists from the group responsible for EBS, a couple of engineers from support (as I understood it, the one I had spoken with and his manager), and a man who introduced himself only by first name. The conversation lasted about an hour, and I kept steering it in one direction. I wanted answers to three questions: what happened, what are they doing to make sure it does not happen again, and what can I do to protect my own system from this kind of thing in the future. And I was trying to understand whether Amazon shared my sense of dread at such an event.
Yes, it was certainly made clear that this was something out of the ordinary. Concern and a very serious attitude toward what had happened sounded in every word, and they said in plain text that this was indeed a critical issue and that it should not have happened. From their review of the incident I understood the following: the disaster happened not on the 26th, but a week earlier, during the initial creation of the instance. Over the course of that week the volumes were gradually being corrupted, or at least something surfaced in them, and upon discovering it they decided to make them unavailable. All my attempts to find out what exactly was broken got nowhere; all they could say was that integrity had been destroyed at the logical level and that such a problem was impossible to repair. From this follows the clear connection with the lost snapshots: they had been taken from the problem volumes and were therefore also marked as failed.
So it turns out that for a whole week I had been using a virtual machine with volumes that had had something wrong with them from the very beginning. And for a week their systems detected no problems, just as I, for that matter, noticed nothing strange. Naturally, I asked the obvious questions: why did their detection system not notice this for an entire week, and what exactly did they eventually notice? And in addition: if I had lived on such a faulty service for a whole week, why such harshness and haste now? Why not warn in advance and give me time to act before the data and virtual machines were taken away?
The answer to the question about detection was, in essence, that this was the root of all the problems, and not an issue with the C4 instances, as the first-line support engineer had told me. I must say this statement sounded a little uncertain; maybe something there really was tied to C4, but they did not admit it. They all warmly assured me that the new C4 instances are quite reliable and that I can safely use them. Looking ahead, I will say that over the past half year I have used them many times, very actively, for many CPU-heavy workloads, and these instance types have caused no further strange problems.
But I got no answer at all to the question "what was the rush all about?" In my opinion, they were forbidden to discuss it with the customer, for reasons I can only guess at. Indulging in a bit of conspiracy theory, one could assume some kind of leak of someone else's data onto my volumes, in which case the whole dramatic story could somehow be justified.
In answer to the question "what are they doing to make sure this does not happen again," they assured me they were doing everything possible, but gave no details, citing the fact that I had refused to sign a confidentiality agreement, which made a more detailed answer impossible. This, incidentally, was not the first time they had offered me an NDA, but I refused because of the absurdity of the NDA form they offered, under which (if followed to the letter, literally) I could not even write a critical review on Amazon, the one that sells books and whatnot. As for the last part ("what can I do to protect my system"), they talked a lot, willingly, and mostly in platitudes. In effect, there is nothing I can do other than always be prepared for the worst, which I could have understood without this team of specialists.
To sum up the part where there is something instructive to say, let me remind you once again: everything in the world breaks sooner or later, and even the best cloud providers can fail. And we have to be ready for this every day. In my case I was saved by a partial combination of preparedness and luck, but I drew a few lessons from this incident, thinking through "how would I survive if some other part of the system broke, and in several places at once?" Following the accident I took a number of rather paranoid measures: extra replication of certain data to other, non-Amazon clouds; a review of all unique nodes so that each of them is at least duplicated; and a strong shift of the whole cloud infrastructure toward being disposable, i.e. anything can be killed and rebuilt without compromising overall functionality.
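As one concrete illustration of the "replicate certain data outside Amazon" point, a periodic job can simply pull critical S3 objects onto storage that does not live in AWS. A minimal sketch, with a hypothetical bucket name and destination path:

```python
# Minimal sketch of copying critical S3 objects to non-AWS storage.
# Bucket name and destination path are hypothetical examples.
import os
import boto3

def mirror_bucket(bucket="my-critical-data", dest="/mnt/offsite-backup"):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):   # skip folder placeholder keys
                continue
            target = os.path.join(dest, obj["Key"])
            os.makedirs(os.path.dirname(target) or dest, exist_ok=True)
            s3.download_file(bucket, obj["Key"], target)

if __name__ == "__main__":
    mirror_bucket()
```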
This incident could have been the most serious blow to my initiative to move three physical data centers into the cloud, had the whole company not already been driven to the breaking point by supporting its own hardware. In the half year I had worked at this organization, we survived a fire (a real one, with smoke, flames, and the loss of a rack), attacks from other machines in adjacent racks, numerous hardware failures, wild drops in network speed for external reasons, the sysadmin's week-long drinking binge, and other wonderful adventures. So even such a serious incident failed to shake our, or rather my, determination to move fully to the cloud, which we completed in April of this year.