Tech Note 0034
Load Balancing Strategies
Recommendations for server load balancing and fail-over
ExpeDat 1.17 and SyncDat 1.4 introduced Server Groups, which allow transactions to be distributed among multiple servers for balanced loads and automatic fail-over. This improves performance and reliability by pooling infrastructure resources and by automatically routing around problems.
For example, typical server hardware or cloud instances are limited to, at most, one gigabit per second of throughput. By pooling multiple server instances, greater total speeds can be achieved and more client operations can be accommodated. If any servers in the group become unavailable, loads are automatically shifted to the other servers. Servers can be grouped into a single active-active pool, or prioritized for active-passive fail-over.
Server groups are specified per-transaction, allowing each operation to customize its use of server resources. Load balancing and fail-over logic is distributed among the clients so that there is no central point of failure. Each client, and each transaction, can operate independently and servers can be members of many group combinations.
Servers may be grouped as singletons, peers, and fallbacks.
A singleton is one server host, specified by DNS name or IP address, plus a port number.
A peer group is a comma-separated list of one or more singletons, all of which are queried at the same time. The client chooses the best available server from the list. Peer groups allow transactions from many clients to be spread evenly among multiple servers. If one of the servers in a peer group becomes unreachable, loads automatically shift to the others.
A fallback group is a semicolon-separated list of one or more peer groups, which are queried in the order given. Peer groups later in the list are used only if no servers are reachable in an earlier group. This can be used to specify servers that should only be used in case of disaster, such as those with limited resources or disaster recovery (DR) restricted licenses.
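The grouping rules above can be sketched as a small parser. This is a minimal illustration, not ExpeDat's own implementation: the function name `parse_server_group`, the `host:port` notation for a singleton, the `default_port` placeholder, and the example hostnames are all assumptions made here for demonstration.

```python
from typing import List, Tuple

Singleton = Tuple[str, int]   # (host, port)
PeerGroup = List[Singleton]   # servers queried at the same time


def parse_server_group(spec: str, default_port: int = 8080) -> List[PeerGroup]:
    """Split a group string into an ordered list of peer groups.

    Semicolons separate fallback tiers; commas separate peers within a tier.
    The "host:port" form and default_port value are assumptions of this
    sketch, and IPv6 literals are not handled.
    """
    tiers: List[PeerGroup] = []
    for tier_text in spec.split(";"):
        peers: PeerGroup = []
        for entry in tier_text.split(","):
            entry = entry.strip()
            if not entry:
                continue  # tolerate stray separators
            host, sep, port = entry.rpartition(":")
            if sep and port.isdigit():
                peers.append((host, int(port)))
            else:
                peers.append((entry, default_port))
        if peers:
            tiers.append(peers)
    return tiers


# Two peers in the first tier, one fallback server in the second tier.
groups = parse_server_group(
    "ny1.example.com:8080,ny2.example.com:8080;lv1.example.com:8080"
)
```

Here `groups[0]` holds the first peer group and `groups[1]` holds the fallback, which mirrors the order in which a client would consider them.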
Server groups are specified per-transaction and are managed entirely by individual clients. A server does not need any special configuration to participate in a group. However, with respect to any given transaction, all servers in a group must be functionally equivalent. For example, to download a file with a server group target, that file must be accessible at the same path using the same credentials on all servers in the group. All servers in a group must share the following properties:
Because the client may fail over to a different server in the middle of a session, all servers in a group should be as identical as possible. Ideally, they should have the same bandwidth, storage, and load capacity.
Load Balancing & Active-Active Failover
At the start of each transaction to a server group, the client evaluates the availability of the first peer group. If one or more servers are reachable, the client chooses the one with the greatest availability, as measured by each server's load and capacity. If no servers are reachable, the client tries the fallback groups in order.
Under normal operations, all servers in the first peer group will carry approximately the same load. If one server becomes unreachable, clients will automatically choose from among the remaining servers. No transactions will be directed to a fallback group unless all servers in the first peer group are unreachable.
If all reachable servers in a peer group are at maximum capacity, clients will auto-retry until capacity becomes available. A fallback group is never used to extend capacity; it is used only when all servers in the first peer group are unreachable.
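The selection rules in this section can be summarized in a short sketch. It is illustrative only and does not reflect how ExpeDat clients actually query servers: the `choose_server` function, the `probe` callback, and the idea of expressing availability as a single "free capacity" number are assumptions introduced here.

```python
from typing import Callable, List, Optional, Tuple

Server = Tuple[str, int]
# Hypothetical probe: returns None when a server is unreachable, otherwise
# its free capacity as a number (0 means reachable but full).
Probe = Callable[[Server], Optional[float]]


def choose_server(
    tiers: List[List[Server]], probe: Probe
) -> Tuple[str, Optional[Server]]:
    """Return ("ok", server), ("retry", None), or ("unreachable", None)."""
    for peers in tiers:
        reachable = []
        for server in peers:
            free = probe(server)
            if free is not None:
                reachable.append((free, server))
        if not reachable:
            continue  # whole tier unreachable: try the next fallback tier
        free, server = max(reachable, key=lambda pair: pair[0])
        if free > 0:
            return ("ok", server)
        # Reachable but full: retry later rather than spill into a fallback.
        return ("retry", None)
    return ("unreachable", None)


tiers = [[("ny1", 8080), ("ny2", 8080)], [("lv1", 8080)]]
status = {("ny1", 8080): 0.2, ("ny2", 8080): 0.7, ("lv1", 8080): 1.0}
print(choose_server(tiers, status.get))  # ('ok', ('ny2', 8080))
```

Note that the fallback tier is reached only through the `continue` branch, that is, when no server in the earlier tier responds at all. A full but reachable tier yields "retry" instead.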
Disaster Recovery & Active-Passive Failover
Some servers may be undesirable for normal use because they have limited resources, are licensed only for disaster recovery operation, or are reserved for special functions. Including such servers in a fallback group ensures that they will only be used when all of the first peer group servers are unreachable.
Examples
A single instance of the CloudDat gateway for Amazon S3 provides 2 gigabits per second of throughput. By operating four such instances and including them in a peer group, the total S3 bandwidth becomes 8 gigabits per second.
A data center in New York may have two ExpeDat servers, with content mirrored at a disaster recovery location in Las Vegas. With two ExpeDat Disaster Recovery licenses deployed in Las Vegas, the New York servers would be specified as the first peer group, and the Las Vegas DR servers as the fallback group.
Two workgroups may maintain separate High Performance Computing clusters in Virginia and Singapore. Scheduling requirements dictate that workers in the US should use the Virginia servers and workers in Asia should use the Singapore servers, but sharing resources is permitted for disaster recovery. Workers in the US would use the Virginia servers as their first peer group and the Singapore servers as their fallback. Workers in Asia would use the Singapore servers as their first peer group, and the Virginia servers as their fallback group.
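The Virginia and Singapore arrangement can be sketched by building one group string per region from the same two server lists, with the order of the tiers reversed. The hostnames, port numbers, `host:port` notation, and the `build_group` helper are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def build_group(*tiers: Sequence[str]) -> str:
    """Join peers with commas and fallback tiers with semicolons."""
    return ";".join(",".join(peers) for peers in tiers)


# Hypothetical server names for the two clusters.
VIRGINIA = ["va1.example.com:8080", "va2.example.com:8080"]
SINGAPORE = ["sg1.example.com:8080", "sg2.example.com:8080"]

# US workers prefer Virginia and fall back to Singapore; Asia is the reverse.
us_target = build_group(VIRGINIA, SINGAPORE)
asia_target = build_group(SINGAPORE, VIRGINIA)

print(us_target)
print(asia_target)
```

Because groups are defined by each client per transaction, both targets can coexist without any change to the servers themselves.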
Tech Note History
Feb 20, 2019: Updated CloudDat Example