Tag Archives: Servers

ARMy of Servers and Project Moonshot

In my first blog post about building out servers with ARM processors, I had mentioned that one could build a high-density scale-out server infrastructure by fitting 20 blades into a 3U chassis:

How would 20 ARM-based server nodes be able to fit into a 3U chassis? It has been done before, even with nodes that pull more than 15-20W each. Both Sun Microsystems and Compaq had a 3U blade server chassis that supported 20 blades (with one UltraSPARC IIe/IIi or one Pentium M processor, 1-2GB of RAM and one 2.5″ hard drive bay) and an Ethernet switch or pass-through module. One can use the same blade setup, use smaller and more efficient power supplies and cooling (as the need for power and cooling will be a lot less than 20W per blade), update the switch to support Gigabit downstream ports and 10Gbps Ethernet uplink ports, and reduce the chassis depth. I would even bet that one could find a way to fit 20 nodes in 2U of space without sacrificing any functionality or availability.

Well, HP has taken that idea and ratcheted it up to a very impressive scale. Project Moonshot crams 72 server nodes into a 2U half-width tray housing eighteen Calxeda EnergyCards and four external 10Gb/s XAUI ports. Four of those trays can slot into a 4U SL6500 chassis, for a total of 288 nodes (72 nodes per 1U). Each EnergyCard contains four EnergyCore processors with up to 4GB of memory per processor and 4 SATA ports, all while drawing 25 watts. In turn, each EnergyCore processor comes in two- and four-core Cortex-A9 configurations and has a high-throughput fabric switch built in. The fabric switch provides multiplexed access to five 10Gb/s XAUI ports and six 1Gb/s SGMII ports, all of which is wrapped around by three 10GbE MAC ports. Each EnergyCore also provides five SATA ports, several PCIe controllers and an SD/eMMC controller (say, for booting an operating system).
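
The density figures above are easy to sanity-check. Here is a minimal sketch in Python that uses only the numbers quoted in this post; the variable names are my own, and I am assuming the 25 watt figure applies to a whole EnergyCard (it leaves out fans, disks and power supply losses):

```python
# Figures quoted in the post; names are illustrative only.
CARDS_PER_TRAY = 18        # Calxeda EnergyCards per half-width 2U tray
PROCESSORS_PER_CARD = 4    # EnergyCore processors (server nodes) per card
TRAYS_PER_CHASSIS = 4      # trays per SL6500 chassis
CHASSIS_HEIGHT_U = 4       # the SL6500 is a 4U chassis
WATTS_PER_CARD = 25        # assumption: the 25 W figure is per EnergyCard

nodes_per_tray = CARDS_PER_TRAY * PROCESSORS_PER_CARD
nodes_per_chassis = nodes_per_tray * TRAYS_PER_CHASSIS
nodes_per_u = nodes_per_chassis / CHASSIS_HEIGHT_U
watts_per_node = WATTS_PER_CARD / PROCESSORS_PER_CARD

# The older blade idea from the first post: 20 blades in a 3U chassis.
blades_per_u = 20 / 3
density_gain = nodes_per_u / blades_per_u

print(f"nodes per tray:    {nodes_per_tray}")
print(f"nodes per chassis: {nodes_per_chassis}")
print(f"nodes per U:       {nodes_per_u:.0f}")
print(f"watts per node:    {watts_per_node:.2f}")
print(f"density vs. 20 blades in 3U: {density_gain:.1f}x")
```

Under those assumptions each node gets a budget of roughly 6 watts, which is the number to keep in mind when comparing against the 15-20W blades mentioned earlier.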

While such an impressive setup may not immediately fit into common enterprise workloads, don’t be surprised to see these things popping up at places where companies need an enormous amount of light/moderate duty workloads that can be scaled out across thousands of threads, and where power and cooling costs are at a premium. Both Ubuntu and Fedora can be used, though Windows Server 8 might be an option if Microsoft deems it to be worth the time and money.

Facebook and Open Compute Project

I just want to get this out in the open: I am not a Facebook (or Twitter) user, nor have I been thrilled with the fast-and-loose nature of Facebook’s handling of public or private user information.

In an interesting twist to Facebook’s “openness”, Facebook has released details and documents of their Open Compute Project, which is the basis for their recent datacenter located in Prineville, Oregon. The specifications have been released by Facebook under the Open Web Foundation Agreement, while the design and implementation files are released under the Creative Commons Attribution 3.0 license. As a hardware nut who has always been interested in datacenter design and architecture, I found that the released server, rack and power component specifications made my day.

To me, it is very nice to see what was done to make the datacenter as lean and mean as possible, most notably the power supply used by each server node. The 450W power supply has two inputs, a nominal 277VAC input as primary and a 48VDC input as a backup. For typical servers, you would have two separate AC or DC power supplies that should pull power from two different sources (such as independent UPS or circuits). There are inherent inefficiencies with this setup, as you have to deal with additional losses due to having two sets of conversion components, increased cooling, and an external interposer.

Right now, the two server boards that have had their designs and specifications released are both dual socket boards. The first board is a high-memory capacity board that can take two AMD Opteron 6100-series processors and has 24 memory slots evenly distributed between the two sockets. The second board uses Intel Xeon 5500/5600-series processors and has the common 18 memory slots (9 per socket, 3 slots per channel) setup found in most 1U and 2U servers. Both boards are custom designs and only have the required components included, such as SATA, USB and Gigabit Ethernet. Efficiency, cost savings and simplicity are the reasons for the stark nature of these boards, and they are the priorities for large scale-out compute systems.

Another unique feature of both boards is how power is provided; rather than using the more common ATX-style connectors (which are available for testing purposes), the designs call for an edge-mounted connector that is more likely found in Cisco Catalyst 6500-series modules or various blade servers. Again, simplicity and efficiency are critical.

As intriguing as the hardware is, it may not be as practical in more common small or medium business environments. Nonetheless, some of the design elements are already found in blade servers or industrial applications. It is the datacenter design elements that will have more of an impact in the near future, as IT continues its march towards cost and energy efficiency.

Edit: BTW, how cool are LED lights powered over Ethernet?

ARM servers do not have to face software issues

In a recent report posted on PC World, one of Dell’s VPs states that porting software from x86 to ARM can be one of the biggest hurdles for ARM servers. I will have to agree with his statement if, and only if, some, most or all of the software that needs to be ported is provided in closed-source, pre-compiled form.

The problem does not really exist in the world of Open Source software; and, even if popular operating systems, distributions and software packages are not currently available pre-compiled for ARM, it would not take long for the community and its sponsors to turn around and provide tested binary packages. For those who want to squeeze out as much efficiency as possible and already roll their own compiled packages to servers, it isn’t too difficult to do the same for ARM servers.

For me, the biggest hurdle I see for Open Source software and ARM servers is the availability of developer and production hardware, along with the cost to acquire the needed hardware for testing. In terms of operating systems and distributions, the three main BSD projects (FreeBSD, OpenBSD and NetBSD) have one or more ARM ports available with the Ports and Packages repositories providing many of the required software packages used in scale-out hosting and compute environments.

On the Linux side of the world, Debian has had a small variety of ARM ports and Ubuntu has dabbled in the ARM space for a little while now. Fedora, CentOS and Gentoo are also viable foundations for building up Linux for ARM servers. There have been several Open Source and commercial Linux distributions that target ARM devices for real-time and embedded applications and can also be used to build up lean and mean distributions for the two server workloads that would excel on ARM. I do not know where Red Hat or SUSE stand in terms of ARM preparedness, but SUSE (by way of Novell and the partnership with Microsoft) may not want to stir the waters.

If the applications were written on top of various Open Source software foundations, such as Apache Tomcat, the various Python, PHP, Perl or Ruby frameworks, or other forms of interpreted languages and bytecode available from the Open Source community… porting and testing should not be difficult. Granted, the level of difficulty depends on the complexity and size of the application.

This leaves the two other major stalwarts in the operating systems market: Oracle and Microsoft. While Oracle Linux is Oracle’s own spin of Red Hat Enterprise Linux (or CentOS) and could benefit from either organization’s development for the ARM platform, the bigger concern would be the Solaris clan. ARM servers would compete against Oracle’s UltraSPARC T family of processors, not in terms of cores per server, but rather watts per core/thread.

On Microsoft’s side of the world, they have announced a form of Windows that can run on ARM netbooks, notebooks, tablets and desktops and would be part of the Windows 8 family. Unfortunately, there is a lack of further details on whether that will trickle over to the next release of Windows Server and how serious Microsoft is about ARM development in general (outside of their current struggles with Windows Phone 7 on ARM). A half-arsed effort from Microsoft will hurt the prospects of ARM and could be seen as yet another form of “embrace, extend, extinguish”.

At this point, I don’t see the other major player in the ARM market taking the ARM server movement seriously… Apple. The ARM processor has been the backbone of Apple’s transformation from being a computer and software company to being a player in consumer electronics. Apple’s first venture into the world of ARM processors, the Apple Newton, didn’t end up being a commercial success (at least compared to their later products with ARM processors). It wasn’t until the release of the Apple iPod that the combination of Apple and ARM made any kind of splash. This continued until Apple really broke ground with the iPhone, and later, the iPad.

Even with Apple’s success with the ARM architecture and processors, Apple’s recent exit from the server hardware market and integration of key server features into the base release of OS X Lion may not bode well for the hopes for a new Apple Mac server with ARM processors. I could be wrong on this, but all signs seem to point elsewhere.

So, at the end of the day, do ARM servers actually have software issues that need to be addressed? Yes and no. If your infrastructure is already based on Open Source software as a foundation, the source code and compilers are there and waiting to be used, particularly if you already roll your own code. If you are stuck on a closed-source and commercial platform with no access to foundation source code, well… that’s kind of what you get for depending on such platforms and foundations. Sorry for the bluntness, but that’s the core issue of non-Open Source software.

Sixteen-Core Intel Atom for Server Workloads?

Recently, Microsoft made some news by asking for a special, multi-core (sixteen cores to be exact) version of Intel’s efficient Atom processors to be used in servers. After thinking about it for a couple of days, the idea made a lot of sense. I know, it’s rare that I agree with Microsoft :)

Why so many cores? If you consider the kinds of workloads that application servers must deal with, the server must deal with a large number of connections and requests and tends to idle while waiting for data to be crunched by other application or database servers. The actual data crunching that the application servers, particularly web app and content servers, need to do before sending back the results is not all that difficult. The large number of cores would facilitate the large number of threads required to handle many thousands of requests per minute.

Why use the somewhat lackluster Atom processor? The Atom processor may be a bit anemic for desktop or laptop duties, where you have numerous workloads going on at once, including rendering graphics, playing music or videos, web browsing and photo editing. On a web server, the in-order execution of the Atom processor does not have as much impact on an individual request level. Another benefit of using an Atom processor core over a Xeon core is power consumption. A desktop or server-optimized dual-core Atom processor has a TDP of less than 15W, versus a dual or quad-core Xeon’s TDP of 60-80W (even more when you look at the X models).

By taking advantage of the low power requirements of each Atom core, some of the latest fabrication processes and the proliferation of serial interconnects (PCI Express, SATA/SAS, 10GbE), building sixteen Atom cores plus memory controllers and I/O controllers on to one processor package is not too difficult to do. In fact, I put together a basic diagram of what such a processor might look like:

The processor package would include five or seven dies, four of which would each contain four 64-bit capable Atom cores with HyperThreading and an intermediate memory and I/O crossbar. The other three dies could be combined into one, with a central component providing buffered memory interfaces (or SMI in Intel terminology), IPMI for management, and high-speed links to two I/O hubs. Each I/O hub would provide external I/O interfaces, such as PCI Express, 6Gbps SAS/SATA and four 2.5Gbps 8b/10b links. The four 2.5Gbps 8b/10b links can be joined together to provide one 10Gb Ethernet port or four 1Gb Ethernet ports. The only other components a server manufacturer would need to add are a SoC for remote management (see: ILO, DRAC and ILOM) and possibly a USB controller to provide local media or serial console access by way of a converter.
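
For the link arithmetic, here is a rough sketch of how 8b/10b lanes add up. It assumes the standard signalling rates (3.125 Gbaud per XAUI lane, 1.25 Gbaud for SGMII) and reads the "2.5Gbps" above as the usable data rate per lane; the function name is made up for illustration:

```python
def payload_gbps(line_rate_gbaud: float, lanes: int = 1) -> float:
    """Usable data rate of 8b/10b-encoded serial lanes.

    8b/10b sends 10 line bits for every 8 data bits, so only
    8/10 of the raw line rate carries payload.
    """
    return line_rate_gbaud * lanes * 8 / 10


# XAUI: four lanes at 3.125 Gbaud each make up one 10Gb Ethernet port.
xaui = payload_gbps(3.125, lanes=4)

# SGMII: a single 1.25 Gbaud lane carries one 1Gb Ethernet port.
sgmii = payload_gbps(1.25)

# If 2.5 Gbps were the raw line rate instead, four lanes would fall short.
if_raw_line_rate = payload_gbps(2.5, lanes=4)

print(f"XAUI, 4 lanes:        {xaui} Gb/s")
print(f"SGMII, 1 lane:        {sgmii} Gb/s")
print(f"4 x 2.5 Gbaud lanes:  {if_raw_line_rate} Gb/s")
```

In other words, the four links only add up to a full 10Gb Ethernet port if each lane delivers 2.5Gbps of data after encoding, which is how XAUI is specified.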

To some, this discussion may trigger a sense of deja vu. This has in fact been discussed and done before, except with UltraSPARC processor cores rather than Atom processor cores. The product would be called the UltraSPARC T series processors. The first generation was the UltraSPARC T1, which had eight cores sharing an I/O crossbar, memory controller and floating point unit. Each in-order processing core had the facilities to handle four threads concurrently, for a total of 32 threads. Kind of a coincidence that a sixteen-core Atom processor would also be able to handle 32 threads with the help of HyperThreading.
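
The thread-count coincidence is simple multiplication. A quick sketch, assuming two hardware threads per core for HyperThreading; the sixteen-core Atom is of course hypothetical, and the class name is my own:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int
    threads_per_core: int

    @property
    def hardware_threads(self) -> int:
        # Total concurrent hardware threads across the whole package.
        return self.cores * self.threads_per_core


chips = [
    Chip("UltraSPARC T1", cores=8, threads_per_core=4),
    # Hypothetical part: 16 Atom cores, HyperThreading = 2 threads per core.
    Chip("16-core Atom", cores=16, threads_per_core=2),
]

for chip in chips:
    print(f"{chip.name}: {chip.cores} cores x "
          f"{chip.threads_per_core} threads = {chip.hardware_threads}")
```

Both designs land on 32 hardware threads, just with a different split between core count and threads per core.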

The UltraSPARC T1 debuted to mixed reviews: it performed beautifully in naturally multi-threaded environments but suffered under heavy, single-threaded application workloads. The Atom processor ran into some of the same criticism, which was exacerbated by the fact that the first Atom processors only had one core and HyperThreading only partially helped when an additional thread was introduced to the workload.

Sun later improved on the design with the UltraSPARC T2, which integrated not only a PCI Express controller, but also a dual-port 10Gb Ethernet controller, and used fully buffered memory (a bit less efficient than DDR3 via SMI buffers, but it helped reduce pin counts). The number of concurrent threads per core was lifted from four to eight, and the shared floating point unit was replaced with one unit per core (which is then shared across the eight threads per core). A second version of the UltraSPARC T2 would later come out to support multiple sockets, at the expense of the 10Gb Ethernet controller, which migrated from being on-package to being located on the system board.

With the re-designed processor, the UltraSPARC T2 continued to beat up other processors in thread-heavy workloads and even conquered several key Oracle benchmarks. The processor still had a slight weakness to single-threaded applications, but that was mostly hidden by an increase in clock speed. The processor was improved once more, now in the form of the UltraSPARC T3.

In short, the idea of building a processor out of many in-order cores, each able to handle two or more concurrent threads, is not a new one, nor is it one doomed to fail. In fact, such a processor might be a significant boon for those looking to consolidate and/or virtualize web front-end or web application servers.

Intel, please heed Microsoft’s call and build this processor. If not Intel, will you do it, AMD?

P.S.: I know this is a departure from my recent advocacy of building ARM processors explicitly for server workloads, but the two are not mutually exclusive. In fact, many of the ARM processor designs are based on an in-order execution design and require very little power to run. Having both an Atom-based design (or a Bobcat-based design if AMD were to join in) and an ARM-based design would ignite much needed innovation and competition in the server market. Also, an Atom-based design would allow Microsoft Windows-based applications to be deployed.

ARMy recruiting: NVIDIA’s Project Denver

ARM Ltd must be jumping for joy over the recent developments. Yesterday at CES, not only did Microsoft announce that Windows 8 will be coming to ARMs (pun intended), NVIDIA announced a continued partnership with ARM and “Project Denver”.

NVIDIA’s goal for the joint project is to be able to design new ARM-based processors for use in desktops, servers and hyper scale-out clustered supercomputers. While NVIDIA has had several setbacks with their Tegra and Tegra 2 processors (including D-Link’s switch from the latter to an Intel Atom processor in their Boxee Box device), the partnership will open up new doors for future generations of the Tegra processor. This would include access to the latest Cortex cores, which will not only provide additional performance and features, but can also keep power consumption at bay.

Found kissing in a tree: Windows 8 and ARM

One of the more intriguing announcements at this year’s CES is one from Microsoft: Windows 8 will add ARM to its supported architecture list. This is a very specific answer to the recent surge of ARM based devices and the lack of a proper answer to Google’s Android and Apple’s iOS. Sure, Microsoft has Windows Phone 7, but that is more of a simultaneous step forward and step backwards from the already ARM-friendly Windows Mobile.

Microsoft will continue to push developers to use .NET even more now, as an application developed and compiled for the .NET Framework would be able to run on 32-bit and 64-bit x86 processors, Itanium processors (although, Microsoft has recently stepped away from the architecture) and soon, ARM processors. Silverlight was one way Microsoft was able to extend the .NET Framework’s reach into ARM devices.

What would make me even more interested is if Microsoft makes an ARM release of the next major version of Windows Server that will be based on Windows 8. As I’ve written previously, I am excited about the prospects of low-power, hyper-scale out ready ARM-based servers. One of the limitations of going with an ARM-based server is the inability to deploy Windows Server and, therefore, applications developed against recent versions of the .NET Framework. Well, this might change in a couple of years, iff Microsoft is getting really serious about the ARM architecture.

If all this pans out, we can probably see Dell (or HP) releasing a next, next generation smart client with a dual (or quad) core ARM processor, 2-4GB of RAM and a moderate-sized SSD used to cache a read-only instance of Windows 8 that is streamed down by ARM-based Windows Servers. That would make for an efficient deployment for retail or financial environments. One can dream, yeah?