Starting, timeline

Before May 21: familiarize myself with the Monkey codebase

Weeks 1-2: draft the API design, request comments, iterate

  • design the various API calls
  • assemble them as an easy-to-read text file (header + comments)
  • post this RFC to the mailing list for public feedback
  • address all shortcomings and comments

Weeks 3-4: change the build system to build a shared library as well, then implement the API

  • create stubs for all the functions, so that the build system can be tweaked
  • add --enable-shared-library and --disable-shared-library options to the configure script
  • edit the Makefiles according to those options: the library won’t include the binary main function, and the binary won’t need the library functions.
  • with the build system in place, start implementing the functions in priority order (start, stop, restart, configure, callbacks, then the rest)

Weeks 5-6: Create various examples, testing the viability of the API

  • first example: simple “hello world”
  • second example: directory listing
  • third example: something interactive, maybe a quiz

Weeks 7-8: Documentation, any tweaks left over

  • create man pages for the library functions
  • any cleanups that remain
  • run the code through all static analyzers and checkers I have available; fix all warnings

Right now, I’ve gotten familiar with the Monkey codebase. I started with fixing compiler warnings and proceeded to benchmarks and profiling, and many cleanups are already integrated in master.

With the official program starting today, it’s API design time.

May 21, 2012 | cand | No Comments
FILED UNDER: Meta

nginx update

I asked on #nginx (freenode) why nginx scaled badly (5% improvement) when going from 1 to 5 workers.

The consensus was that nginx is designed so that every worker calls accept(). I think this is a bad design decision, but I was told that on a modern Linux host it is no longer an issue.

Nginx has an option to disable the mutex controlling accept(): accept_mutex off, set in the events block of the main config file. Given how important this one option is, I’m extremely disappointed that:

  • It’s not the default on suitable systems
  • It’s not even documented in the shipped config files

This one option is why nginx scales terribly in its default configuration. When compiled on a suitably new Linux system, nginx should ideally turn the mutex off by default. And in all cases, the option should be documented in the shipped config files, even if commented out.

Tested with 5 workers and accept_mutex off, nginx got about 28000 transactions per second (+- 0.2%).


This puts nginx up to the expected numbers.

May 8, 2012 | cand | No Comments
FILED UNDER: Benchmark, Meta

State before – monkey vs nginx

Curious after the even results on the Raspberry Pi, I thought it was time to test nginx on the 6-core computer too (Phenom II X6 2.8GHz).

The website was the static one shipped with monkey, containing html and one picture. Siege was run with the usual settings, benchmark mode, 10 concurrent threads and 10 seconds of test time.

siege -c10 -t 10S -b localhost:1234

Versions:

  • nginx 1.2.0

  • Monkey ea83cb6860f9555011a4959294bd7d765b1fe533 plus some small fixes

Monkey was tested only in the default config (5 workers). Nginx was tested both in its default config (1 worker) and with 5 workers, to gain even ground with Monkey.

All numbers are averaged over three runs. Without further ado:

Woah.

Given the equal results on the Pi, I wasn’t expecting this. Further, it was surprising how little nginx gained from moving to 5 workers (+1 management thread, all on a 6-core cpu).

Here’s the raw data for one of the three runs:

Config                   Date & Time          Trans   Elap Time  Data Trans  Resp Time  Trans Rate  Throughput  Concurrent  OKAY    Failed
nginx default, 1 worker  2012-05-04 18:02:16  122599  9.75       159         0.00       12574.26    16.31       9.34        122599  0
nginx 5 workers          2012-05-04 18:04:15  136634  9.99       177         0.00       13677.08    17.72       9.34        136635  0
monkey default           2012-05-04 18:47:07  254261  9.01       330         0.00       28219.87    36.63       9.33        254261  0
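
To make the gap in the siege log above easier to read, here is a small sketch that recomputes the transaction rate (transactions divided by elapsed seconds) for each row and compares it to the single-worker nginx run. The numbers are copied from this one run only, not from the three-run average, and the function and variable names are made up for illustration.

```python
# Rows from the siege log: (label, transactions, elapsed seconds, logged rate)
runs = [
    ("nginx default, 1 worker", 122599, 9.75, 12574.26),
    ("nginx 5 workers", 136634, 9.99, 13677.08),
    ("monkey default", 254261, 9.01, 28219.87),
]


def trans_rate(transactions, elapsed_s):
    """Transactions per second, the way siege reports 'Trans Rate'."""
    return transactions / elapsed_s


# Use the first row (nginx with 1 worker) as the baseline.
baseline = trans_rate(runs[0][1], runs[0][2])

for label, trans, elapsed, logged in runs:
    rate = trans_rate(trans, elapsed)
    print(f"{label:24s} {rate:10.2f} trans/s "
          f"(logged {logged}), {rate / baseline:.2f}x baseline")
```

In this run the recomputed rates match the logged ones to two decimals, nginx with 5 workers lands at roughly 1.09x the baseline, and Monkey at roughly 2.24x.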

May 4, 2012 | cand | No Comments
FILED UNDER: Benchmark

Quality time with a Raspberry Pi

Thanks to flaushy I got access to a Raspberry Pi. Model B, 256mb+ethernet, running Debian 6, gcc 4.4.5, soft-float.

I ran some benchmarks and got a feel of the system. Only server/command-line use was considered.

All benches were run in the default config: 192mb of RAM for the system and 64mb reserved for the GPU.


General feel

As usual for these ARM boards, SD cards suck, and the readers suck. The Pi is no exception: you’re going to have 70-80% of cpu time spent in io (waiting for the card) if you dare to touch it.

For this reason all the benches were done in tmpfs (ram).

The Pi’s slow cpu showed up in particular in the heavy cpu time spent in sys, 5-30% for most tasks. This means two things: the cpu is slow, and the device will greatly benefit from an optimized kernel.

Whether the custom Debian spin from the Foundation’s site has a well-optimized kernel I can’t say. The image was rather big at 3.8mb, but the only module it had was fuse.


Stats

The CPU is an ARMv6 Broadcom one, running at 700MHz. The specific core is arm1176jzf-s.

Supported features from cpuinfo:

swp half thumb fastmult vfp edsp java tls

It has 128kb of L2 cache, but as far as I know there is no way to detect whether it is enabled. There were no benchmark differences after adding “enable_l2cache=1” to /boot/config.txt and rebooting.

Comparing the core to my Phenom II X6, depending on the benchmark, it is about 1/20th core-vs-core: 1/4th of the frequency (700MHz vs 2.8GHz) times 1/5th the IPC (instructions per cycle). I don’t have any devices of the same class around for comparison.
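
The 1/20th figure is just the two ratios multiplied together. A tiny sketch of that arithmetic, using the frequencies from the post and its rough IPC estimate (the variable and function names are only for illustration):

```python
from fractions import Fraction

freq_ratio = Fraction(700, 2800)  # 700MHz vs 2.8GHz
ipc_ratio = Fraction(1, 5)        # rough IPC estimate from the post


def per_core_ratio(freq, ipc):
    """Per-core throughput ratio, modelled as frequency ratio times IPC ratio."""
    return freq * ipc


ratio = per_core_ratio(freq_ratio, ipc_ratio)
print(f"Pi core vs Phenom core: {ratio} (about {float(ratio):.0%})")
```

This is only a back-of-the-envelope model; as noted, the real ratio depends on the benchmark.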

I could not find the memory spec for the integrated chip, but I measured that it could be written to at about 137 MB/s (4kb blocksize; at 1MB blocksize it was 75 MB/s). Allowing for tmpfs overhead etc., this would put it at about PC100 SDRAM.
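
The post gives the numbers but not the tool used to get them. As a rough illustration of this kind of measurement, here is a minimal sketch that writes zeros to a file at a given block size and reports MB/s. The function name, the 64 MB total, and the use of the system temp directory are assumptions made for the example; to mimic the setup above you would point it at a tmpfs mount instead.

```python
import os
import tempfile
import time


def write_throughput(directory, block_size, total_bytes):
    """Write at least total_bytes of zeros in block_size chunks.

    Returns the write speed in MB/s (1 MB = 10**6 bytes).
    """
    block = bytes(block_size)  # a block of zero bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        # buffering=0 so every write() goes straight to the OS
        with os.fdopen(fd, "wb", buffering=0) as f:
            written = 0
            while written < total_bytes:
                written += f.write(block)
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return written / elapsed / 1e6


if __name__ == "__main__":
    target = tempfile.gettempdir()  # assumption: replace with a tmpfs mount
    for block_size in (4 * 1024, 1024 * 1024):
        rate = write_throughput(target, block_size, 64 * 1024 * 1024)
        print(f"{block_size:>8} byte blocks: {rate:.0f} MB/s")
```

Numbers from a script like this include interpreter and file system overhead, so they are a lower bound on what the memory itself can do.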


Ethernet

One use-case I’m quite interested in is a small-scale server: NAS, www, ftp. I recall reading that the Ethernet on the Pi is connected via USB internally. This is a rather bad sign: USB as a protocol has terrible cpu overhead.

To measure the ethernet, and the ethernet only, another machine in the LAN was set up to serve zeros up as fast as possible.

nc -l -p 7777 < /dev/zero

On the upside, the Pi reached 11.7 MB/s. At 94% of the theoretical max, the speed is good and comparable to many common ethernet cards.
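
As a quick sanity check of that 94% figure, here is a small sketch. It assumes the Pi's port is 100 Mbit/s Fast Ethernet and that MB means 10^6 bytes; neither is spelled out above, and the names are just for illustration.

```python
LINK_MBIT_PER_S = 100      # assumption: 100 Mbit/s Fast Ethernet link
MEASURED_MB_PER_S = 11.7   # measured on the Pi, from the post


def link_utilization(measured_mb_per_s, link_mbit_per_s):
    """Fraction of the raw line rate that was actually used."""
    theoretical_mb_per_s = link_mbit_per_s / 8  # 8 bits per byte
    return measured_mb_per_s / theoretical_mb_per_s


utilization = link_utilization(MEASURED_MB_PER_S, LINK_MBIT_PER_S)
print(f"{utilization:.1%} of line rate")
```

Under those assumptions the theoretical maximum is 12.5 MB/s, and 11.7 MB/s works out to 93.6% of it, which rounds to the 94% quoted.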

The downside was that the cpu use was huge: 50% sys!

This was only using the ethernet fully. Any protocol and server overhead would come on top of that, and since the ethernet is shared with the usb, expect bad performance if your storage is on usb too. Using a more advanced file system, such as ext4, jfs or xfs, would only exacerbate this.

In short: don’t use it for a server/nas.


Freedom

The box needs blobs from Broadcom to boot, use the GPU, and for some other smaller functions.

The only supported RAM splits are 128/128, 192/64, 224/32. You can’t give all the RAM to the CPU.


Shipped GCC

Testing the optimization of the shipped GCC (4.4.5), I measured how long it took to compress a 1gb file of zeros with gzip.

time dd if=/dev/zero bs=1M count=1024 | gzip -9 > /dev/null

The times with “-march=armv5 -O2”, “-march=armv6 -O3”, and “-mcpu=arm1176jzf-s -O3”, in order of increasingly targeted optimization, were identical within a second or two, well within variance: 2min 50s.

The same on the Phenom took 8.8s, using an older binary (not well optimized), for a rough yardstick comparison.
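
Putting the two gzip times side by side gives a rough slowdown factor. A minimal sketch of that arithmetic, using only the two times quoted above (names are illustrative):

```python
PI_SECONDS = 2 * 60 + 50   # 2min 50s on the Pi
PHENOM_SECONDS = 8.8       # same command on the Phenom II X6


def slowdown(slow_s, fast_s):
    """How many times longer the slow machine took."""
    return slow_s / fast_s


ratio = slowdown(PI_SECONDS, PHENOM_SECONDS)
print(f"Pi is about {ratio:.1f}x slower")
```

That comes to about 19.3x, which lines up with the 1/20th core-vs-core estimate in the Stats section.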

I can conclude that a GCC that old doesn’t benefit from targeted optimizations. I read that Linaro’s GCC 4.7 has more work for ARMv6; it would be interesting to try that later. However, compiling gcc on the Pi is something I understandably don’t want to do.


Comparison against another ARM board

Running

openssl speed rsa4096

gave the Pi a score of 1.2 signs/sec.

Comparing it to soft-float results from Phoronix:

Pandaboard ES (dual Cortex-A9), 2.7
Atom N270, 2.8

Phoronix also measured hard-float:

Pandaboard ES, 4.6

Given these results, the Pi did surprisingly well against the soft-float Pandaboard.
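
To put the openssl scores on one scale, here is a small sketch that expresses each result as a multiple of the Pi's score. The numbers are the ones listed above; the dictionary keys and the function name are just for illustration.

```python
# RSA 4096 signs per second, as listed in the post
SIGNS_PER_SEC = {
    "Raspberry Pi (soft-float)": 1.2,
    "Pandaboard ES (soft-float)": 2.7,
    "Atom N270": 2.8,
    "Pandaboard ES (hard-float)": 4.6,
}


def relative_to(baseline, scores):
    """Return every score divided by the baseline entry's score."""
    base = scores[baseline]
    return {name: score / base for name, score in scores.items()}


relative = relative_to("Raspberry Pi (soft-float)", SIGNS_PER_SEC)
for name, rel in relative.items():
    print(f"{name:28s} {rel:.2f}x the Pi")
```

By this measure the soft-float Pandaboard ES is 2.25x the Pi and the hard-float one about 3.83x.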


Web server tests

Finally, I compared three web servers. All were benchmarked in their default configurations sans the port number and document directory. All served the same static page:

<html><body>Hello world</body></html>

All were under siege over localhost, with ten clients hammering away:

siege -c10 -t 10S -b localhost:1234

The tested versions were:

  • Busybox httpd, git 576b1d3c417ddea79481063401837ec0bdb91658
  • Monkey, git 484f819cf5a65d8f26add14243c8ffcee6293cc1
  • nginx 1.2.0

All tested servers successfully completed the test. Results:

Of these, busybox is a forking server (a new process for each connection), while nginx and monkey use an event-driven model.


Conclusion

I found the Pi to be fairly unsuitable for the uses I’m interested in. It may run fine as a media player or a teach-yourself-coding box, but it has some serious limitations for low-powered server use.

While it’s cheap on its own, perf/$ is not that good. Performance per watt is decent, given a maximum TDP of 3.5W.

For myself, I look forward to the Rhombus Tech A10 board. Until next time ;)

April 29, 2012 | cand | No Comments
FILED UNDER: Sidetracked

GSOC 2012

I’ve been accepted to Summer of Code 2012.

I’ll be working on making Monkey run as a shared library.

April 23, 2012 | cand | 4 Comments
FILED UNDER: Meta

Mesa 8.0 released!

Today the first release of Mesa with MLAA is out. It will be shipped in the next Fedora and Ubuntu 12.04, among other distros.

Also, my fixes mentioned in the previous post are in both master and 8.0.

February 10, 2012 | cand | 2 Comments
FILED UNDER: Tickbox

Status update

With recent events, there’s something to post about (!).

First, a bug was found in the depth buffer handling. Patches for that haven’t yet been applied in Mesa master.

Second, the Evergreen loop bug is fixed.

So in current master, the MLAA color filter works on:
- softpipe
- llvmpipe
- r600g (r700 and Evergreen confirmed myself, r600 should work, no idea about * Islands)
- Nouveau on Fermi (maybe more?)

In master + those patches, the depth filter works on the same set as above.

For r300g, neither filter produces correct results, so I suspect a bug in r300g. Given four other drivers working fine it shouldn’t be in common code.

January 27, 2012 | cand | 2 Comments
FILED UNDER: Progress

Merged in master!

http://cgit.freedesktop.org/mesa/mesa/commit/?id=6571c0774af1f5ebd0fab40bf4769702d3c9ded5

The post-processing queue is now in Mesa master.

August 25, 2011 | cand | 3 Comments
FILED UNDER: Meta, Progress, Tickbox

r600g success!

With the loop bug worked around (https://bugs.freedesktop.org/show_bug.cgi?id=40034), MLAA now runs on my netbook with comparable quality to llvmpipe.

Since this was the last item left to do, next I will focus on integration.

August 16, 2011 | cand | No Comments
FILED UNDER: Progress, Tickbox

Cel-shading

Look how cute he looks!

August 14, 2011 | cand | No Comments
FILED UNDER: Meta, Sidetracked