Experience with Grapevine: The Growth of a Distributed System
Schroeder, Birrell, Needham (1984)
What kind of paper is this?
Experience
- Ethernet/gateways/Altos: 30 usec/proc call; 128 KB mem; 5 MB disk
- 17 Grapevine servers.
- 4400 individuals
- 1500 groups.
- 8500 messages; 35,000 receptions.
Effects of scale
- Want to scale by addition of equally powerful servers.
- Should be able to calculate how many servers you need.
- Server load should not scale with overall size of the system.
- Currently handles target size (30 servers; 10,000 users).
- Per-registry state is about 15 KB; fits comfortably.
- Increased growth might imply 3 level naming hierarchy.
- Certain distribution lists grow in proportion to the number of users;
this makes message delivery slow; on the brink of overload now.
- Could subdivide groups into registry specific groups so that no
server had to handle more than one registry; bulletin boards offer
another alternative.
- Information overload: number of messages generated grows with
the user population.
- Grapevine uses direct connections; as the internet grows, this
becomes unwieldy; unreliable links are problematic.
Configuration Decisions
- Since messages are shared on a server, want servers to map to
communication units.
- Currently: avg: 4.7 recipients per message, max: 300, most are 1.
- Knowing when to add servers is easy:
- If link to rest of net is unreliable add server.
- If system load gets too heavy, add server.
- Haven't tried it, but consider splitting registry and message
servers across two machines.
- Assigning secondary inboxes is hard.
- Don't want to overload secondary on failure.
- Still want to get sharing of messages.
- Want to avoid problems with unreliable links.
- Locating registries:
- Registry should be easily accessible to message server
with registry's inbox.
- Distribution lists should be close to the message servers
that accept messages for them.
- Registry data should be robust against links going down.
- Enough replicas to avoid data loss in case of
catastrophic failure.
- Avoid overloading any particular server.
Transparency of Distribution and Replication
- Most of the time, people see single system image.
- Slow propagation occasionally causes problems.
- Create user in one registration server; access from
another.
- Add person to DL, immediately send to DL.
- Detecting unused distribution lists is nearly impossible.
- Remailing when an inbox gets deleted/moved is a problem; a
delete-on-empty flag might have been better.
- Messages sometimes get duplicated in inboxes (repeated names on
expanded DLs).
- Things aren't always done in the best place (e.g. expansion of
distribution lists).
- Diagnostic support is poor (e.g. "why can't I read mail?"); need
to investigate how much complexity it is worth to provide better monitoring.
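
The duplicate-message problem noted above (repeated names on expanded DLs) can be illustrated with a small sketch. This is not Grapevine's code: the `groups` table, the names in it, and both function names are hypothetical, and the set-based expansion is just one way to avoid duplicates.

```python
# Hypothetical DL table: group name -> members (individuals or other groups).
groups = {
    "all^.pa": ["systems^.pa", "mail^.pa", "birrell.pa"],
    "systems^.pa": ["birrell.pa", "schroeder.pa"],
    "mail^.pa": ["schroeder.pa", "needham.pa"],
}

def naive_expand(name, groups):
    """Expand a DL into a flat list; names on several sub-lists repeat."""
    if name not in groups:
        return [name]                      # an individual, not a group
    recipients = []
    for member in groups[name]:
        recipients.extend(naive_expand(member, groups))
    return recipients

def expand(name, groups, seen_groups=None):
    """Expand a DL into a set, so each individual appears exactly once."""
    if seen_groups is None:
        seen_groups = set()
    if name not in groups:
        return {name}
    if name in seen_groups:
        return set()                       # group already expanded (or a cycle)
    seen_groups.add(name)
    recipients = set()
    for member in groups[name]:
        recipients |= expand(member, groups, seen_groups)
    return recipients

duplicated = naive_expand("all^.pa", groups)   # 5 deliveries, 2 of them repeats
unique = expand("all^.pa", groups)             # 3 distinct recipients
```

With the naive expansion, birrell.pa and schroeder.pa would each receive the message twice; the set-based version delivers one copy per person.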
Adjusting to Load
- Merges on updates turned out to be suboptimal.
- Most frequent change was adding/deleting to a DL.
- Much better to send just the update than the entire list and have the
server do a merge.
- Modified update protocol so that DL changes were broadcast as
updates while others were still merges.
- Grapevine became single source of authentication.
- Use caching (12 hour time out) to improve performance.
- Authentication denial became slow when groups became too
nested.
- Added syntactic identifier for groups so that the authenticator could
distinguish groups and expand only those.
- Also keep flattened groups for authentication purposes.
- Did not predict email usage well.
- Many users read from dumb terminals and want:
- Random access to messages (designed for sequential).
- Use inbox as permanent storage (designed as buffer).
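
The update-versus-merge point at the start of this section can be sketched as follows. The representation is an assumption made for illustration (each DL entry carries a timestamp and a present/deleted flag, and the newer timestamp wins); the notes do not describe Grapevine's actual data format, and the function names are hypothetical.

```python
def merge(local, incoming):
    """Whole-list merge: examine every incoming entry, keep the newer one.
    Returns the number of entries that had to be sent and compared."""
    compared = 0
    for member, (stamp, present) in incoming.items():
        compared += 1
        if member not in local or local[member][0] < stamp:
            local[member] = (stamp, present)
    return compared

def apply_update(local, member, stamp, present):
    """Single-change update: only the changed entry is sent and examined."""
    if member not in local or local[member][0] < stamp:
        local[member] = (stamp, present)
    return 1

# A 1000-member list replicated on two servers, all entries at timestamp 1.
origin = {f"user{i}.pa": (1, True) for i in range(1000)}
replica_a = dict(origin)
replica_b = dict(origin)

# One member is added at the origin at timestamp 2.
origin["new.pa"] = (2, True)

merge_cost = merge(replica_a, origin)                      # ships the whole list
update_cost = apply_update(replica_b, "new.pa", 2, True)   # ships one entry
```

Both replicas end in the same state, but the merge path handles 1001 entries for a one-member change while the update path handles one.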
Operation of a Dispersed System
- Two types of people need control access: operators and experts.
- Experts are always remote.
- Operators should not have to understand Grapevine internals.
- Have remote fsck capabilities.
- Special expert interface.
- System logs in regular text.
- Record all activities of a server.
- Dead letter facility useful to track mail problems.
Reliability
- Resource constraints are a source of unreliability.
- Where possible, Grapevine ensures that resources don't become
scarce (e.g. don't accept new connections when disk free space is
below 5%).
- Replication for reliability has been a big win.
- However, replication means presence of additional resources.
- Have to realize that average usage differs from peak usage; plan
accordingly.
- Willingness to reserve resources is not always available.
- Archiving was added as an afterthought -- they really needed
bigger disks.
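
The free-space rule under resource constraints above (refuse new connections when less than 5% of the disk is free) could look roughly like this. It is a minimal sketch, not Grapevine's implementation: the function names and the default path are hypothetical, and only the 5% threshold comes from the notes.

```python
import shutil

MIN_FREE_FRACTION = 0.05   # threshold from the notes: 5% of disk space

def should_accept(free_bytes, total_bytes, min_free=MIN_FREE_FRACTION):
    """Accept a new connection only if enough of the disk is still free."""
    if total_bytes <= 0:
        return False                       # unknown capacity: be conservative
    return free_bytes / total_bytes >= min_free

def accept_new_connection(path="."):
    """Apply the rule to the filesystem that holds `path` (hypothetical default)."""
    usage = shutil.disk_usage(path)        # named tuple: total, used, free
    return should_accept(usage.free, usage.total)
```

Refusing work before the disk fills keeps the server able to finish what it has already accepted, which is the point of reserving resources ahead of time.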
Some Favorite Quotes
- At some point the number of messages arriving for a user would
start to overwhelm both the user and the system. We do not know if
this phenomenon has a natural sociological limit. In the world of
paper, the problem is controlled by the distribution of information
through periodicals that have editors to filter the input. An
analogous filtering mechanism will be required in the world of
electronic message systems before they can become universal.
- The Grapevine inboxes that are intended as buffers become
semipermanent repositories for these users, loading the system in
unintended ways.