13 Replies Latest reply on Jun 5, 2006 9:56 AM by acoliver

Brief statement of JBMS design goals

acoliver May 26, 2006 12:09 PM

JBoss Mail Server is a central component in the JBoss Collaboration Server suite. The mail server has some general principles to which its design adheres:

1. open protocol support
2. reutilization of appserver features (such as JAAS login modules)
3. extensibility
4. database/store independence
5. scale

In particular I'd like to address the last of these. In its early stages JBMS loaded every mail into memory at least once (actually more than once). Now we are very careful not to do that!
Loading mails into memory keeps pressure off the database but limits scale in terms of the achievable rate of mixed-size mails (read: some will be 40k, some will be 8mb). Moreover, due to the way that Java manages memory in a -server configuration, it is not possible to see when you've "bitten off more than you can chew" in terms of memory consumption. Thus we choose maximum scale and stability over low-end performance. That is, a higher number of really small messages could likely be processed "in memory", provided you are willing to risk losing messages in the event of a failure or to risk running out of heap space (thus losing said messages when the VM dies). Also, as memory fills up, Java loses performance by executing more frequent major and full GC runs.

Thus the principal performance gains are to be found in a faster storage medium (RAID 1+0, write-behind cache with backup). Reliability is storage dependent and trades off against performance (database replication, backup, etc). A poorly performing database will significantly degrade performance.

That being said, it is possible to deoptimize this structure on purpose or by accident. The Store interface is designed to be pluggable. There is no reason that you cannot plug in a more memory-oriented version. You can also (at any point) get access to the raw stream of a mail or stored mail. If you pass this to JavaMail then JavaMail will completely load the mail into memory. For some systems, doing this in the maillisteners will provide acceptable performance and stability (provided that volume is relatively low and the allowed number of incoming connections is low).
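To make the difference between streaming a mail and loading it concrete, here is a minimal sketch. It is not JBMS code: the class `MailCopy` and both method names are hypothetical, and the 8 KB buffer size is an arbitrary choice for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; this is not the JBMS Store API.
final class MailCopy {
    // Streams the message through one fixed-size buffer. Heap use stays at
    // bufferSize bytes no matter how large the message is.
    static long copyStreaming(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Loads the whole message onto the heap. An 8mb mail costs at least 8mb
    // of heap per concurrent message, which is the pattern described above
    // as a deoptimization.
    static byte[] loadFully(InputStream in) throws IOException {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        copyStreaming(in, all, 8192);
        return all.toByteArray();
    }
}
```

With `copyStreaming`, the memory cost per connection is a constant; with `loadFully`, it grows with message size and with the number of simultaneous connections.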

We will be offering plugins which deoptimize this structure as options for the first release (JASEN for instance pretty much requires this) and will seek to replace them by either working with the authors or moving to alternatives. The default structure will always adhere to this goal and base level functionality will never require anything that does not adhere to this goal. I realize this is not an incontrovertible design decision, but a lot of work, benchmarking, crashing and mistakes went into arriving at it. Please discuss this rationally and offer proof rather than conjecture.

Thanks,

-Andy

1. Scalability

lhoriman May 26, 2006 1:35 PM (in response to acoliver)

I'll assume that this discussion is at least in part aimed at me. I don't have a lot of time so I'll be as concise as I can.

Generally speaking, large scale scalability of most enterprise applications comes not from optimizing the behavior of any one JVM but asking:

How does it scale in a cluster?

Performance tuning an application to run twice or three times as fast on a single machine is great, but will not get you the kind of throughput you could conceivably accomplish by sharing the load across 10, 20, or 100 machines.

You've created an architecture in which all data goes in and out of the database. This may approach the theoretical optimum scale for an application that does nothing more than read data off an SMTP stream and store it in users' mailboxes. Such an application would be essentially constrained by the write throughput of your database anyway, so clustering the application server wouldn't buy you much.

However, most mail-based applications are not like this. Even JBCS is not such an application.

Mail doesn't get written, it gets read. As it stands now in JBCS, probably not a lot. However, as people start to write listeners and plugins, you're going to find that the number of times a message gets processed expands dramatically, and each one of those passes will stream the message out of the DB.

By the time you get to writing a mailing list manager with enough features that people will want to use it, you're going to find:

1) Each message written to the db gets read and emailed out to possibly tens of thousands of recipients.
2) More people will be reading and searching the archives than will probably be on the actual list itself.
3) You have to do an awful lot of processing on email messages when they arrive and when you send them out.

Just pulling a number out of thin air, I would guess that a typical message passing through SubEtha goes through a dozen different processing phases. For each outbound message! And it could be a lot more than that, depending on what user-written plugins are installed on the system.

Leaning on the database won't scale you very far. Keeping as much as possible in-memory may ultimately reduce the throughput of a single machine as it churns through big chunks of data, but I don't really care if one machine slows down from a big message...

...because I have a dozen other machines in the cluster that can pick up the slack.

So when you bring up "maximum scale and stability", I'm surprised that you bring up garbage collector tuning. The first question I ask when discussing scalability is "how many machines can I run this on?"

Jeff
2. Re: Brief statement of JBMS design goals

dmlloyd May 26, 2006 6:06 PM (in response to acoliver)

"lhoriman" wrote:

Generally speaking, large scale scalability of most enterprise applications comes not from optimizing the behavior of any one JVM but asking:

How does it scale in a cluster?

So you're basically saying that it's ok if it's horribly slow, as long as you can spend ten times as much on hardware rather than simply making the implementation more efficient?

In any case, if ever there was an application that can't be clustered it's email. Each connection maintains its own state. Clustering implies that state is shared between nodes, which is definitely not going to be true here.

As for your point(s) on mailing lists and message storage: keeping all messages in memory during an SMTP transfer is just foolish. It makes it trivially simple to DoS the mail server. In fact the mail server could run out of memory under normal load in a configuration like this.

Far more sensible would be to keep a fixed number of fixed-size buffers available in a pool. When relaying a message, grab and fill a buffer and try to connect to the relay or message store immediately. If the message size exceeds the buffer size, then spool it to disk and return the buffer to the pool, using disk to back the buffer beyond that point. Otherwise relay the message right from the buffer... this optimizes for the common case of small messages (which are 99% of what's in my mailbox anyway) without setting yourself up for disaster in the case of large messages. Also, since you're not creating and freeing giant byte buffers, you're not abusing the collector.
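The pooled-buffer scheme described in the paragraph above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from JBMS or any MTA: `SpoolingReceiver`, `Received`, and the temp-file naming are hypothetical, and it relies on `InputStream.transferTo` (Java 9 or later).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a fixed number of fixed-size buffers, with disk as overflow.
final class SpoolingReceiver {
    /** One received message: either held in a pooled buffer or spooled to a file. */
    static final class Received {
        final byte[] buffer;  // non-null when the message fit in one pooled buffer
        final int length;     // number of valid bytes in buffer
        final Path spoolFile; // non-null when the message was spooled to disk

        Received(byte[] buffer, int length, Path spoolFile) {
            this.buffer = buffer;
            this.length = length;
            this.spoolFile = spoolFile;
        }
    }

    private final BlockingQueue<byte[]> pool;

    SpoolingReceiver(int bufferCount, int bufferSize) {
        pool = new ArrayBlockingQueue<>(bufferCount);
        for (int i = 0; i < bufferCount; i++) {
            pool.add(new byte[bufferSize]);
        }
    }

    Received receive(InputStream in) throws IOException, InterruptedException {
        byte[] buf = pool.take(); // blocks when every buffer is busy, so memory stays bounded
        boolean returnInFinally = true;
        try {
            int filled = 0;
            int n;
            while (filled < buf.length && (n = in.read(buf, filled, buf.length - filled)) != -1) {
                filled += n;
            }
            // If the buffer is full, peek one byte to learn whether more data follows.
            int next = (filled == buf.length) ? in.read() : -1;
            if (next == -1) {
                returnInFinally = false; // caller keeps the buffer until release()
                return new Received(buf, filled, null);
            }
            Path file = Files.createTempFile("spool-", ".eml");
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write(buf, 0, filled);
                pool.add(buf);           // buffer is free again; the rest goes to disk
                returnInFinally = false;
                out.write(next);
                in.transferTo(out);
            }
            return new Received(null, 0, file);
        } finally {
            if (returnInFinally) {
                pool.add(buf);
            }
        }
    }

    // Hands an in-memory message's buffer back once it has been relayed.
    void release(Received r) {
        if (r.buffer != null) {
            pool.add(r.buffer);
        }
    }

    int availableBuffers() {
        return pool.size();
    }
}
```

Small messages never touch the disk, large ones never hold more than one pooled buffer, and no per-message byte arrays are allocated for the message body.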

I think you'll find that most highly scalable email platforms in use by the big ISPs are heavily optimized as far as streaming data and minimizing copying and frequent memory allocation/freeing. To pretend that these things don't matter, given the huge amount of literature out there on these topics, is probably unwise.
3. Re: Brief statement of JBMS design goals

lhoriman May 26, 2006 7:49 PM (in response to acoliver)

"dmlloyd" wrote:

So you're basically saying that it's ok if it's horribly slow, as long as you can spend ten times as much on hardware rather than simply making the implementation more efficient?

No. I'm saying that if you make architectural choices that eliminate the advantage of clustering, you've destroyed your ability to scale.

Somebody asked in this forum if you could scale to 6,000 messages per second. You can hand-write your mailserver in assembly language and you still won't be able to meet that kind of throughput on a single box.

Nobody is arguing that code should be inefficient. However, if you want real scalability, you need to cluster gracefully. Tuning your garbage collection is a tactical move, not a strategic one.

"dmlloyd" wrote:

In any case, if ever there was an application that can't be clustered it's email. Each connection maintains its own state. Clustering implies that state is shared between nodes, which is definitely not going to be true here.

Horseshit. Round-robin DNS can distribute mail delivery code across servers. You are still constrained by write speed to your database, but you distribute the mail processing overhead. Furthermore, you can easily distribute the read processing.

Caching will get you some advantage in IMAP and POP and WebMail, but you'll find enormous gains if you try to implement a mailing list manager, because you'll find yourself reading the same data over and over and over in separate transactions.

"dmlloyd" wrote:

As for your point(s) on mailing lists and message storage: keeping all messages in memory during an SMTP transfer is just foolish. It makes it trivially simple to DoS the mail server. In fact the mail server could run out of memory under normal load in a configuration like this.

Far more sensible would be to keep a fixed number of fixed-size buffers available in a pool. When relaying a message, grab and fill a buffer and try to connect to the relay or message store immediately. If the message size exceeds the buffer size, then spool it to disk and return the buffer to the pool, using disk to back the buffer beyond that point. Otherwise relay the message right from the buffer... this optimizes for the common case of small messages (which are 99% of what's in my mailbox anyway) without setting yourself up for disaster in the case of large messages. Also, since you're not creating and freeing giant byte buffers, you're not abusing the collector.

JBMS isn't any harder to DoS. Since you're streaming everything straight into the DB, watch me open up 1000 connections to your database and see how Oracle likes that.

SubEtha limits the input stream to 10MB. It could be 100MB. Just depends on your use case. Yeah, we require the whole message in memory at once just because we use JavaMail. We can live with that.

Incidentally, our SMTP server does have the ability to do deferred streaming to disk, and it will do that in some cases. You might want to use our code. If you avoid shoveling everything back and forth from the database every time you read data, you might actually get your application to cluster efficiently.

"dmlloyd" wrote:

I think you'll find that most highly scalable email platforms in use by the big ISPs are heavily optimized as far as streaming data and minimizing copying and frequent memory allocation/freeing. To pretend that these things don't matter, given the huge amount of literature out there on these topics, is probably unwise.

I think you'll find that exactly zero of the most highly scalable email platforms in use by the big ISPs run on a single unclustered machine. I'll take my ability to scale into a cluster over your ability to avoid big mallocs.

http://www-128.ibm.com/developerworks/library/j-jtp01274.html

Jeff
4. Re: Brief statement of JBMS design goals

dmlloyd May 26, 2006 10:07 PM (in response to acoliver)

"lhoriman" wrote:

Horseshit. Round-robin DNS can distribute mail delivery code across servers.

Load balancing is not clustering. The point is that you don't have to bridge state between the servers. I could set up ten MTA boxes and load balance among them, but that's not a cluster. If one box goes down, everything connected to it goes down too.

If your point is that a database may not suffice for large installations, I'm with you there, although there are things you can do to mitigate that (Oracle has a cluster product, and so does MySQL, I'm sure lots of others do too).

However it seems like the message store is the only single point of contention in the system. I imagine that if one database isn't enough, and database clustering is not enough, you could use several databases.

In a large installation though, I'd bet that the MTAs will see far more CPU activity than the message store will... once you figure in the fact that the MTA handles inbound and outbound mail, and that we've reached a fine point in the evolution of the internet where spam accounts for at least 40% of email sent to the average ISP.
5. Re: Brief statement of JBMS design goals

lhoriman May 26, 2006 10:38 PM (in response to acoliver)

This time hopefully I'll get the quote tags right :-)

"dmlloyd" wrote:
"lhoriman" wrote:

Horseshit. Round-robin DNS can distribute mail delivery code across servers.

Load balancing is not clustering. The point is that you don't have to bridge state between the servers. I could set up ten MTA boxes and load balance among them, but that's not a cluster. If one box goes down, everything connected to it goes down too.

Arguing about what is and isn't clustering is kinda silly. There is obviously bridged state because the nodes share a common view of the database. The clustered, replicated (or invalidated) cache is what gets it to scale beyond the database limitations. It's a cluster.

An SMTP box going down is no big deal. It may halt any transactions in progress, but clients will queue and redeliver. There's very little point in making that part any more robust. Mail will get to its destination, it's just a question of when.

"dmlloyd" wrote:

If your point is that a database may not suffice for large installations, I'm with you there, although there are things you can do to mitigate that (Oracle has a cluster product, and so does MySQL, I'm sure lots of others do too).

I have a lot of experience with Oracle RAC from the days I worked on The Sims Online. It's not a panacea for scalability. In fact, unless you write your application specifically anticipating the behavioral characteristics of RAC, each additional node will cause your application to de-scale pretty dramatically.

It's been my experience that for most applications, clustering the database (either through fancy software or replication) is hard and usually ineffective. Clustering the application (through app-level in-memory caches, i.e. Hibernate) offers the ability to scale much higher - and if you still need to, at the end of the day you can still fall back to database federation/replication schemes as an additional step.

"dmlloyd" wrote:

However it seems like the message store is the only single point of contention in the system. I imagine that if one database isn't enough, and database clustering is not enough, you could use several databases.

In a large installation though, I'd bet that the MTAs will see far more CPU activity than the message store will... once you figure in the fact that the MTA handles inbound and outbound mail, and that we've reached a fine point in the evolution of the internet where spam accounts for at least 40% of email sent to the average ISP.

Try writing a mailing list manager. 1 message in easily translates to 10,000 messages out, plus a considerable amount of processing and munging of each message. Spam is the least of your problems :-)

Jeff Schnitzer
http://subetha.tigris.org
6. Re: Brief statement of JBMS design goals

jason.greene May 27, 2006 12:51 AM (in response to acoliver)

"lhoriman" wrote:

JBMS isn't any harder to DoS. Since you're streaming everything straight into the DB, watch me open up 1000 connections to your database and see how Oracle likes that.

SubEtha limits the input stream to 10MB. It could be 100MB. Just depends on your use case. Yeah, we require the whole message in memory at once just because we use JavaMail. We can live with that.

Reading the whole message into memory is destructive. Why? Because it affects the QOS of ALL OTHER TRAFFIC. The scenario of many copies of a large message causes GC problems and memory exhaustion, which results in large pauses, which results, in the best case, in an unresponsive MTA. This creates a horrible experience for the end user (at least a 4 hour delay in mail delivery).

There is a reason why mail was designed to have the headers at the top of the message.

You also assume that the data store is slow. There are many strategies for making this just as efficient as the MTA. Partitioning your accounts into multiple data sources is one of them, and scales quite nicely. Also a message may not even need to be stored, it may be relayed.
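As a rough sketch of what partitioning accounts across multiple data sources could look like: the class `MailboxPartitioner` and the data source names used with it are hypothetical placeholders, not JBMS configuration, and the plain hash-modulo scheme is just the simplest option (it reshuffles most mailboxes when a data source is added). `List.copyOf` needs Java 10 or later.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper; data source names passed in are placeholders.
final class MailboxPartitioner {
    private final List<String> dataSources;

    MailboxPartitioner(List<String> dataSources) {
        if (dataSources.isEmpty()) {
            throw new IllegalArgumentException("at least one data source is required");
        }
        this.dataSources = List.copyOf(dataSources);
    }

    // Maps a mailbox to one data source. The same mailbox always lands on the
    // same source, so each database holds only its share of the accounts.
    String dataSourceFor(String mailbox) {
        // Assumption: mailbox names are treated as case-insensitive.
        int hash = mailbox.toLowerCase(Locale.ROOT).hashCode();
        return dataSources.get(Math.floorMod(hash, dataSources.size()));
    }
}
```

Because every node can compute the mapping locally, no shared state is needed to decide where a message goes.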

"lhoriman" wrote:

Incidentally, our SMTP server does have the ability to do deferred streaming to disk, and it will do that in some cases. You might want to use our code. If you avoid shoveling everything back and forth from the database every time you read data, you might actually get your application to cluster efficiently.

An MTA practically never reads from the data source, whatever that source may be.

"lhoriman" wrote:

I think you'll find that exactly zero of the most highly scalable email platforms in use by the big ISPs run on a single unclustered machine. I'll take my ability to scale into a cluster over your ability to avoid big mallocs.

http://www-128.ibm.com/developerworks/library/j-jtp01274.html

Jeff

The changes you talk about offer no greater ability to scale horizontally than anything else that is being done or discussed.

-Jason
7. Re: Brief statement of JBMS design goals

jason.greene May 27, 2006 1:07 AM (in response to acoliver)

"lhoriman" wrote:

Arguing about what is and isn't clustering is kinda silly. There is obviously bridged state because the nodes share a common view of the database. The clustered, replicated (or invalidated) cache is what gets it to scale beyond the database limitations. It's a cluster.

A replicated cache is not very useful, especially as the number of email boxes increases. Heavy reads of common data are a good candidate for caching. This is very unlike mail, which is way more write heavy. The only information that really benefits from caching is authentication information.

Each node should have an independent stateless view of the traffic it processes. This has maximum scalability.

"lhoriman" wrote:

Try writing a mailing list manager. 1 message in easily translates to 10,000 messages out, plus a considerable amount of processing and munging of each message. Spam is the least of your problems :-)

Jeff Schnitzer
http://subetha.tigris.org

Are you kidding? 40 percent of all email is spam. Mailing list traffic doesn't even come close to that.

-Jason
8. Re: Brief statement of JBMS design goals

acoliver May 30, 2006 2:10 PM (in response to acoliver)

"lhoriman" wrote:

I'll assume that this discussion is at least in part aimed at me.

Actually it was just a statement of goals because others had similar confusion. Mike kept having to come in and explain this at the bottom of several threads -- wanted a top level thread.

Thanks,

-Andy
9. Re: Brief statement of JBMS design goals

lhoriman Jun 1, 2006 5:23 AM (in response to acoliver)

"jason.greene@jboss.com" wrote:

A replicated cache is not very useful, especially as the number of email boxes increases. Heavy reads of common data are a good candidate for caching. This is very unlike mail, which is way more write heavy. The only information that really benefits from caching is authentication information.

A replicated cache doesn't buy you much for an MTA that simply writes SMTP to a mailbox and reads it once to a POP client, I'll agree. The read/write ratio is likely 1-to-1.

A replicated or clustered-invalidated cache is critical for scaling a mailing list server. At the very least you're going to read the message once for every delivery. If you want to do anything remotely interesting with the archives it's going to be read for each page view as well. The read/write ratio is easily 1000-to-1.
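A small read-through cache shows why the read/write ratio matters. This is a single-node sketch only: it does not replicate or invalidate across a cluster, which is the harder part of what is being argued for here. `MessageCache` and the `store` function standing in for a database read are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical single-node LRU cache in front of a message store.
final class MessageCache {
    private final Function<String, byte[]> store; // stands in for a database read
    private final Map<String, byte[]> lru;
    private int storeReads;

    MessageCache(Function<String, byte[]> store, final int maxEntries) {
        this.store = store;
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the message body, reading from the store only on a cache miss.
    synchronized byte[] get(String messageId) {
        byte[] body = lru.get(messageId);
        if (body == null) {
            body = store.apply(messageId);
            storeReads++;
            lru.put(messageId, body);
        }
        return body;
    }

    synchronized int storeReads() {
        return storeReads;
    }
}
```

Delivering one list message to 1,000 subscribers through `get` costs one store read instead of 1,000, at the price of holding the body on the heap while it is cached.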

If you want to offer a webmail system with a simple HTML client (i.e. no smart caching in a Flash client), a cache will help.

"jason.greene@jboss.com" wrote:

Each node should have an independent stateless view of the traffic it processes. This has maximum scalability.

Only if you have a trivial application. Try implementing mail-archive.com with your architecture. You won't get far.

"jason.greene@jboss.com" wrote:

"lhoriman" wrote:

Try writing a mailing list manager. 1 message in easily translates to 10,000 messages out, plus a considerable amount of processing and munging of each message. Spam is the least of your problems :-)

Are you kidding? 40 percent of all email is spam. Mailing list traffic doesn't even come close to that.

A mailing list server easily sends out 100 times the number of messages it receives. In terms of processing load, incoming spam is negligible. Think about it.

But SPAM is an interesting thing to bring up. If 40% of email that your MTA receives is SPAM, then streaming messages straight into the database has a 40% chance of being a really dumb idea. At the very least, your processing chain should stream the message to a DeferredFileOutputStream, give spam filters a crack at it, and then decide whether or not to transfer it into the database. You'll at least scale to a cluster of two machines.
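The spool, filter, then store ordering can be sketched with the JDK alone. The names `FilterBeforeStore`, `MessageStore`, and `accept` are hypothetical, and the spam check is just a caller-supplied predicate. Unlike `DeferredFileOutputStream` from Apache Commons IO, which keeps data in memory up to a threshold before switching to a file, this simplified version always spools to a temp file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

// Hypothetical pipeline: nothing reaches the database until the filter has passed it.
final class FilterBeforeStore {
    /** Stands in for whatever writes an accepted message to the database. */
    interface MessageStore {
        void store(InputStream message) throws IOException;
    }

    // Returns true if the message was stored, false if the filter rejected it.
    static boolean accept(InputStream smtpData, Predicate<Path> isSpam, MessageStore store)
            throws IOException {
        Path spool = Files.createTempFile("incoming-", ".eml");
        try {
            // Spool the incoming stream to local disk first.
            Files.copy(smtpData, spool, StandardCopyOption.REPLACE_EXISTING);
            if (isSpam.test(spool)) {
                return false; // rejected mail never costs a database write
            }
            try (InputStream in = Files.newInputStream(spool)) {
                store.store(in);
            }
            return true;
        } finally {
            Files.deleteIfExists(spool);
        }
    }
}
```

The trade-off is an extra local disk write for every accepted message in exchange for skipping the database entirely for rejected ones.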

Jeff
10. Re: Brief statement of JBMS design goals

lhoriman Jun 1, 2006 5:36 AM (in response to acoliver)

By the way, I've spent some time profiling SubEtha with various size messages.

The garbage collector is irrelevant.

The biggest bottleneck *by far* is CPU time encoding and decoding the message data stream. Garbage collection doesn't even register on the list.

Jeff
11. Re: Brief statement of JBMS design goals

acoliver Jun 1, 2006 4:27 PM (in response to acoliver)

"lhoriman" wrote:

The biggest bottleneck *by far* is CPU time encoding and decoding the message data stream.

Jeff, that is related to the thread problems with JavaMail and JAF that I mentioned.

This thread has moved WAY off topic and seems more appropriate for the subetha mail lists than JBMS mail lists. From a community standpoint you're also kind of not matching our general style of discussion. We're REALLY REALLY REALLY do-ocratic. You'll notice that I'm by far the most chatty of the people who write a lot of code here. Mostly people "show" their points. James said "Flex is the schizzle" ... I said "that's nice dear" ... he sent me a working demo.... I gave him committer access... he made something functional....Aron and I started contributing.... Other optimizations took place due to problems that we had in production.... That's more our general style. Less this kind of theoretical discussion (not that all theoretical discussion is useless but this one is fairly off topic anyhow). So if you have something code-wise to contribute, hey let's see it and maybe there is some code-centric discussion to be had. If you want to discuss subetha scale issues then maybe the subetha lists are a better place.

-Andy
12. Re: Brief statement of JBMS design goals

lhoriman Jun 3, 2006 11:35 PM (in response to acoliver)

"acoliver@jboss.org" wrote:

The biggest bottleneck *by far* is CPU time encoding and decoding the message data stream.

Jeff, that is related to the thread problems with JavaMail and JAF that I mentioned.

No, it's not even remotely related. Synchronization is also a non-issue.

But I agree that this conversation is pretty pointless. Continue writing your software, I'll continue writing mine, and we can talk about how we'll run in the same JVM if or when somebody actually wants to run that configuration.

Jeff
13. Re: Brief statement of JBMS design goals

acoliver Jun 5, 2006 9:56 AM (in response to acoliver)

"lhoriman" wrote:

But I agree that this conversation is pretty pointless.

Dude...you hijacked my thread with your own stuff as if it were all about you and subetha.

"lhoriman" wrote:

Continue writing your software

Thanks for permission.

"lhoriman" wrote:

we can talk about how we'll run in the same JVM if or when somebody actually wants to run that configuration.

If you believe that memory consumption, GC and synchronization don't matter as strongly as you've stated then I think there is little chance of that happening.
