Recently, Microsoft made some news by asking for a special, multi-core (sixteen cores, to be exact) version of Intel’s efficient Atom processors to be used in servers. After I thought about it for a couple of days, the idea made a lot of sense. I know, it’s rare that I agree with Microsoft :)
Why so many cores? Consider the kinds of workloads that application servers must deal with: a server handles a large number of connections and requests, and tends to idle while waiting for data to be crunched by other application or database servers. The actual data crunching that application servers, particularly web app and content servers, need to do before sending back the results is not all that difficult. A large number of cores would support the large number of threads required to handle many thousands of requests per minute.
Why use the somewhat lackluster Atom processor? The Atom processor may be a bit anemic for desktop or laptop duties, where you have numerous workloads going on at once, including rendering graphics, playing music or videos, web browsing and photo editing. On a web server, the in-order execution of the Atom processor matters less at the level of an individual request. Another benefit of using an Atom processor core over a Xeon core is power consumption. A desktop or server-optimized dual-core Atom processor has a TDP of less than 15W, versus a dual or quad-core Xeon’s TDP of 60-80W (even more when you look at the X models).
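To put the TDP figures above on a per-core footing, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted in this post; the function name is mine, and the sixteen-core figure is a naive linear scaling that assumes every pair of cores costs as much as a whole dual-core Atom package, so treat it as a loose upper bound rather than a prediction.

```python
# Rough per-core power comparison using the TDP figures quoted in the post.
# TDP is a thermal design figure, not measured draw, so these are ceilings.

def watts_per_core(package_tdp_w: float, cores: int) -> float:
    """Divide a package TDP evenly across its cores."""
    return package_tdp_w / cores

# Dual-core Atom at (just under) 15 W.
atom_per_core = watts_per_core(15.0, 2)

# Xeon range from the post: best case is 60 W over four cores,
# worst case is 80 W over two cores.
xeon_best_per_core = watts_per_core(60.0, 4)
xeon_worst_per_core = watts_per_core(80.0, 2)

# Assumption: naive linear scaling to sixteen cores. A real design would
# share memory and I/O controllers, so the actual figure should be lower.
sixteen_core_ceiling = 16 * atom_per_core

print(f"Atom: {atom_per_core:.1f} W/core")
print(f"Xeon: {xeon_best_per_core:.1f} to {xeon_worst_per_core:.1f} W/core")
print(f"Sixteen Atom cores, naive ceiling: {sixteen_core_ceiling:.0f} W")
```

Even with the most favorable Xeon case, the Atom core comes in at about half the per-core budget, which is the whole appeal.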
Given the low power requirements of each Atom core, the latest fabrication processes and the proliferation of serial interconnects (PCI Express, SATA/SAS, 10GbE), building sixteen Atom cores plus memory controllers and I/O controllers onto one processor package should not be too difficult. In fact, I put together a basic diagram of what such a processor might look like:

The processor package would include five or seven dies. Four of them would each contain four 64-bit capable Atom cores with HyperThreading and an intermediate memory and I/O crossbar. The remaining three dies, which could also be combined into one, would consist of a central component providing buffered memory interfaces (or SMI in Intel terminology), IPMI for management, and high-speed links to two I/O hubs. Each I/O hub would provide external I/O interfaces, such as PCI Express, 6Gbps SAS/SATA and four 2.5Gbps 8b/10b links. The four 2.5Gbps 8b/10b links can be joined together to provide one 10Gb Ethernet port or four 1Gb Ethernet ports. The only other components a server manufacturer would need to add are a SoC for remote management (see: iLO, DRAC and ILOM) and possibly a USB controller to provide local media or serial console access by way of a converter.
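A quick sanity check on the link arithmetic, as a sketch. With 8b/10b encoding, every 8 bits of data take 10 bits on the wire, so it matters whether "2.5Gbps" is read as the line rate or as the payload rate, and the paragraph above does not say which. For reference, XAUI carries 10Gb Ethernet over four lanes at 3.125 Gbaud each, and 1000BASE-X runs a single lane at 1.25 Gbaud. The function name below is hypothetical.

```python
# Payload bandwidth of 8b/10b-encoded serial lanes.
# 8b/10b puts 10 bits on the wire for every 8 bits of data.

def payload_gbps(line_rate_gbaud: float, lanes: int,
                 data_bits: int = 8, code_bits: int = 10) -> float:
    """Usable data rate across all lanes after encoding overhead."""
    return line_rate_gbaud * lanes * data_bits / code_bits

# Reading "2.5Gbps" as the line rate: four lanes carry 8 Gbps of payload.
four_lanes_at_2_5 = payload_gbps(2.5, 4)

# XAUI-style lanes at 3.125 Gbaud: four lanes carry a full 10 Gbps.
four_lanes_xaui = payload_gbps(3.125, 4)

# One 1000BASE-X style lane at 1.25 Gbaud carries 1 Gbps.
one_lane_gige = payload_gbps(1.25, 1)

print(four_lanes_at_2_5, four_lanes_xaui, one_lane_gige)
```

So the four links add up to a full 10Gb Ethernet port if 2.5Gbps is the payload rate per link (3.125 Gbaud on the wire). Either reading leaves plenty of headroom for four 1Gb Ethernet ports.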
To some, this discussion may trigger a sense of deja vu. This has in fact been discussed and done before, except with UltraSPARC processor cores rather than Atom processor cores. The product line was called the UltraSPARC T series. The first generation was the UltraSPARC T1, which had eight cores sharing an I/O crossbar, memory controller and floating point unit. Each in-order processing core had the facilities to handle four threads concurrently, for a total of 32 threads. It is kind of a coincidence that a sixteen-core Atom processor would also be able to handle 32 threads with the help of HyperThreading.
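The thread counts line up as follows. This is just the arithmetic from the paragraph above, with the sixteen-core Atom being the hypothetical part and assumed to run two threads per core via HyperThreading.

```python
# Hardware thread counts: cores multiplied by threads per core.

def hardware_threads(cores: int, threads_per_core: int) -> int:
    return cores * threads_per_core

# UltraSPARC T1: eight cores, four threads each.
t1_threads = hardware_threads(8, 4)

# Hypothetical sixteen-core Atom with two threads per core (HyperThreading).
atom16_threads = hardware_threads(16, 2)

print(f"UltraSPARC T1: {t1_threads} threads")
print(f"Sixteen-core Atom: {atom16_threads} threads")
```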
The UltraSPARC T1 debuted to mixed reviews: it performed beautifully in naturally multi-threaded environments but suffered under heavy, single-threaded application workloads. The Atom processor ran into some of the same criticism, which was exacerbated by the fact that the first Atom processors had only one core, and HyperThreading helped only partially when an additional thread was introduced to the workload.
Sun later improved on the design with the UltraSPARC T2, which happened to integrate not only a PCI Express controller but also a dual-port 10Gb Ethernet controller, and which used fully buffered memory (a bit less efficient than DDR3 via SMI buffers, but it helped reduce pin counts). The limit of four concurrent threads per core was raised to eight, and the shared floating point unit was replaced with one unit per core (which is then shared across that core's eight threads). A second version of the UltraSPARC T2 later came out to support multiple sockets, at the expense of the 10Gb Ethernet controller, which moved from the processor package to the system board.
With the redesign, the UltraSPARC T2 continued to beat up other processors in thread-heavy workloads and even conquered several key Oracle benchmarks. The processor still had a slight weakness with single-threaded applications, but that was mostly hidden by an increase in clock speed. The processor was improved once more, now in the form of the UltraSPARC T3.
In short, the idea of building a processor out of many in-order cores, each able to handle two or more concurrent threads, is not new, nor is it doomed to fail. In fact, such a processor might be a significant boon for those looking to consolidate and/or virtualize web front-end or web application servers.
Intel, please heed Microsoft’s call and build this processor. If not Intel, then AMD, will you do it?
P.S.: I know this is a departure from my recent advocacy of building ARM processors explicitly for server workloads, but the two are not mutually exclusive. In fact, many ARM processor designs are based on in-order execution and require very little power to run. Having both an Atom-based design (or a Bobcat-based design if AMD were to join in) and an ARM-based design would ignite much-needed innovation and competition in the server market. Also, an Atom-based design would allow Microsoft Windows-based servers to be deployed.