For developers, writing bug-free software that doesn’t crash is getting harder in an increasingly competitive world where software needs to ship before it becomes obsolete. This challenge is especially apparent with online cloud services, which are often driven by aggressive shipping deadlines. Cloud services are distributed programs comprising multiple back-end systems that continuously exchange asynchronous signals while responding to incoming web requests. They are complex by nature, hard to get right, and require protection from failures that could jeopardize client data or halt key services.

Such a programming environment is full of non-determinism, that is, scenarios outside developers’ control. For example, there’s non-determinism in the scheduling of concurrent operations, the order in which messages are received, random system failures, and the random firing of timers, whether for retry logic or for timeouts on other services that have become unresponsive. Non-deterministic systems exist in all software domains, not just cloud services, and best practices for building and testing these systems fall short. Techniques such as failure injection and stress testing can be either too complicated to set up or too time-consuming, with no guarantee that the bugs they find can be reproduced.

Consider a cloud service that, let’s say, implements the Raft consensus protocol among a group of machines in an effort to provide a highly reliable, fault-tolerant cluster to clients. Such a system will have hundreds of messages flying back and forth between the machines. You do stress testing and don’t find any bugs, but can you really be confident that you’re ready to ship?