A glitch in VMware’s most recent update had its customers up in arms this week. The problems were caused by a piece of code left over from the beta version of the software, which engineers failed to remove and which left VMware users unable to power on virtual machines running on the hypervisor software.
The bug, referred to as a “time bomb”, is code that developers insert in beta software to push users to upgrade to an application’s final version. The code is a commonly used tool for developers; however, it must be removed from anything into which it has been inserted prior to final release.
The virtualisation software maker released an “express patch” on Wednesday to fix the glitch. However, VMware customers have been left a little disgruntled, and the incident has made the company look a bit amateurish.
The way people look at VMware after the cock-up is “definitely not very good,” according to Gary Chen, a Yankee Group analyst.
“This is the most publicised issue they’ve had in their history, and it’s really the sort of embarrassing bug that never should have made it past QA (quality assurance),” he said.
In a letter posted on the company’s blog, Paul Maritz, VMware’s recently appointed chief executive officer, said: “Last night, we became aware of a code issue with the recently released update to ESX 3.5 and ESXi 3.5 (Update 2).”
According to Maritz, when the clock in a server running the updated ESX 3.5 or ESXi 3.5 software registered 12:00 a.m. on August 12, 2008, the code caused the product license to expire. As a result, powered-off virtual machines could not be turned on; those that had been suspended could not be awakened from that mode; and machines could not be migrated using VMotion.
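For readers wondering how a date can stop a virtual machine from starting, here is a minimal Python sketch of the general time-bomb pattern. It is not VMware’s code: the names `BETA_EXPIRY`, `license_valid` and `power_on`, and the `is_beta_build` flag, are all hypothetical, and only the expiry date comes from Maritz’s letter.

```python
from datetime import datetime

# Hypothetical expiry baked into a beta build. The date mirrors the one
# reported in the article; everything else here is an illustration.
BETA_EXPIRY = datetime(2008, 8, 12, 0, 0)


def license_valid(now, is_beta_build):
    """Return False once a beta build has reached its expiry time."""
    if is_beta_build and now >= BETA_EXPIRY:
        return False
    return True


def power_on(vm_name, now, is_beta_build):
    """Refuse to start a VM when the license check fails."""
    if not license_valid(now, is_beta_build):
        raise PermissionError(f"cannot power on {vm_name}: license expired")
    return f"{vm_name} powered on"
```

In a sketch like this, a final release would ship with `is_beta_build` set to false, or with the check removed altogether. If the beta behaviour is left switched on, everything works until the clock reaches the expiry time, which is why a fault of this kind can pass testing and only show up in production.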
The problem has also occurred with a recent patch to ESX 3.5 or ESXi 3.5 Update 2. The company has begun a review of its QA processes, Maritz said. (Which means someone’s getting the sack.)
To VMware’s credit, it took less than 24 hours to come up with a patch that seems to have corrected the problem, said Chen.
“From what I’ve heard, the patch fixes the problem. You do have to give kudos to VMware for addressing the issue so quickly,” he noted.
Some users have turned to VMware’s Communities discussion pages to vent. “As a VMware Enterprise Partner and VMware Authorized Consultant, I can tell you this IS a big deal for VMware to release a product that has such grave consequences for even a relatively small portion of the total VMware user population,” wrote one user.
“A small percentage does not diminish the severity of the problem for affected users and the utmost urgency is expected from a company that caters to enterprise customers who don’t have ‘downtime’ in their corporate dictionary anymore.
“Bugs happen,” the poster continued. “However, I believe this could have been prevented by not rushing an update to market which was intended to be free and compete with [Microsoft’s] Hyper-V. This will no doubt teach VMware a lesson and unfortunately will cast doubt about the reliability of VMware in the enterprise. It’s a shame a clearly superior product is going to get bad publicity from this oversight. Let’s give them credit and hope they learn from their mistakes.”
Chen pointed out that most customers were glad of the quick response time from VMware: “The issue was fixed quickly, and there was lots of communication as to the status, cause and future changes to prevent another incident,” he said.
“However, some faith has been lost, as most customers I’ve talked to are disappointed that a bug like this made it past QA. Many admins have been pushing virtualisation to their executives, and this doesn’t help their case,” Chen added.
“Virtualisation is still in the emerging stages, and enterprise reliability is a huge issue that can only be proven over time,” said Chen. “Vendors have been pushing the idea that it is enterprise-ready, and an incident like this hurts not only VMware but the entire virtualisation movement. Virtualisation is inevitable and will certainly continue to proceed, but people will slow down and think more about how to protect themselves against things like this.”
“More and more people are using it, and a major incident, whether a bug or a security hack, could freeze your entire infrastructure. I think people will begin to re-evaluate their options and contingency plans for an incident like this, including perhaps diversifying their infrastructure and adopting multiple hypervisors,” Chen concluded.