This is a complex topic, so I cannot give you a definite answer, but I can list some factors that influence which type of machine to pick.
BTW: you *can* add Apache instances later on; you just have to make them known to the JBoss instances (which can be done via JMX) and to the front-end load balancer (DNS or hardware).
Having said that, the first question is whether the httpds are going to handle static traffic as well, i.e. serve web pages, images, etc.: anything that is *not* forwarded to the back-end JBoss cluster. If so, you need to compute the aggregate static traffic across all users.
On top of that, you'll need to have an idea of how much data will be going to the JBoss cluster, and how much is received from it.
Are the connections from the clients to the httpds encrypted? If SSL is used, Apache httpd needs more CPU cycles to decrypt each request and encrypt each response.
I suggest you run a load test in which you simulate the approximate load that will be on the prod system, and measure the load on the httpds and the JBoss cluster.
Again, you *can* add and remove httpds dynamically; what you need to do here is write some scripts which tell the instances in the JBoss cluster about the change.
By the way, you might want to take a look at my talk at JUDCon, which deals with the different ways of running a JBoss cluster in the cloud...
Everything will be passed through to JBoss; the Apache servers will do nothing more than load balancing. There won't be any encryption.
The total number of bytes per second would probably (at peak) be around, say, 30MB/sec ± 10MB/sec inbound and an equal amount outbound.
And yes, I'd love to test all of this out to get a visceral sense of the needs, but unfortunately I've been given an impressively tight schedule to work on. By the time we have a working testbed system that we could run benchmarks on, it will practically be time to move to deployment. Having some idea of the likely range of the budget for hardware (on Amazon) will help us to at least know how much extra help we can hire to meet the deadline.
How do you front the httpds? A hardware load balancer? DNS round robin?
If the peak (*not* average!) load is 40MB/sec, then I'd reserve only 1 instance. I'd enable CloudWatch to get stats on the httpd, and I'd also gather stats on the back-end cluster.
Should you find out that one httpd is not enough, just add another one (to cover peaks). If this is permanent, reserve the 2nd instance, too.
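As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes the httpds are pure proxies (no static content, no caching), so every byte crosses the httpd tier's network interfaces twice: once on the client side and once on the JBoss side. The function name is made up for illustration, and 1 MB/sec is taken as 8 Mbit/sec.

```python
def proxy_nic_load(client_in_mb_s, client_out_mb_s):
    """Estimate NIC throughput (in Mbit/s) on a pure proxy tier.

    Assumption: nothing is served or cached locally, so every request byte
    received from a client is sent on to JBoss, and every response byte
    received from JBoss is sent on to the client.
    """
    # Bytes arriving at the proxy: client requests plus JBoss responses.
    received_mb_s = client_in_mb_s + client_out_mb_s
    # Bytes leaving the proxy: requests forwarded to JBoss plus responses to clients.
    sent_mb_s = client_in_mb_s + client_out_mb_s
    to_mbit = 8  # 1 MB/s = 8 Mbit/s
    return received_mb_s * to_mbit, sent_mb_s * to_mbit


# Low, typical and peak estimates: 30 MB/s +/- 10 MB/s in each direction.
for mb_s in (20, 30, 40):
    rx, tx = proxy_nic_load(mb_s, mb_s)
    print(f"{mb_s} MB/s each way -> {rx} Mbit/s received, {tx} Mbit/s sent")
```

These totals are for the whole httpd tier; with several httpds behind the load balancer, each one sees roughly its share. Comparing the estimate with the network stats CloudWatch reports for the instance should show fairly quickly whether a second httpd is needed.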
Re: cluster size: what do you use the cluster for? If, for example, you use session replication, the heap size of each JBoss instance has to be at least N * D (where N is the cluster size and D is the average amount of data an instance holds). If you use (Infinispan) distribution, each node needs less memory and you can add more nodes to the cluster.
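To make the difference concrete, here is a small Python sketch of the arithmetic. The 200 MB per instance is a made-up figure, the function names are hypothetical, and the distribution case assumes each entry is kept on 2 nodes (a configurable number of owners); real heaps also need room for the application itself.

```python
def replicated_data_per_node(cluster_size, avg_data_mb):
    """Full replication: every node holds a copy of every node's data (N * D)."""
    return cluster_size * avg_data_mb


def distributed_data_per_node(avg_data_mb, num_owners=2):
    """Distribution: each entry lives on num_owners nodes only.

    Total data is N * D, stored num_owners times and spread over N nodes,
    so the per-node share is num_owners * D regardless of cluster size.
    """
    return num_owners * avg_data_mb


AVG_DATA_MB = 200  # hypothetical D, for illustration only

for n in (2, 4, 8, 16):
    repl = replicated_data_per_node(n, AVG_DATA_MB)
    dist = distributed_data_per_node(AVG_DATA_MB)
    print(f"N={n}: replication needs {repl} MB per node, distribution {dist} MB")
```

With replication the per-node requirement grows linearly with the cluster size, whereas with distribution it stays constant, which is why distribution lets you keep adding nodes.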
If you use mod_cluster and partition your cluster into several mod_cluster domains, you could use replication as well.
Performance is of course also affected by whether you use sync or async replication; the latter is orders of magnitude faster, but in edge cases you can incur data loss (e.g. a session is gone and a user needs to log in again). You could also stream sessions to persistent storage in the background as a second line of defense.
Re: machine types: I think a Large instance (that's the smallest type for 64-bit architectures) will be good enough.
Again, these are just recommendations, without knowing your application in detail...