Platform API Design Checklist
- Functional Requirements
- What is the business problem this API will solve?
- Technical Requirements
- How many requests are expected per second for this API?
- What should be the TPS limit of this API?
- What is the maximum throughput of this API’s downstream dependencies? (The TPS limit of this API must be lower than that.)
- What should be the availability SLA of this API?
- What should be the latency SLA of this API?
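The downstream-throughput constraint above comes down to simple arithmetic. The following is a minimal sketch, not a prescribed method: the dependency names, the throughput figures, the per-request fan-out, and the 20% headroom factor are all hypothetical values chosen for illustration.

```python
def max_safe_tps_limit(downstream_max_tps: dict[str, float],
                       calls_per_request: dict[str, float],
                       headroom: float = 0.8) -> float:
    """Return the highest API TPS limit that keeps every dependency
    below its own maximum throughput.

    headroom is an assumed safety factor (0.8 = use at most 80% of
    each dependency's capacity); it is not taken from the checklist.
    """
    if not downstream_max_tps:
        raise ValueError("at least one downstream dependency is required")
    limits = []
    for name, max_tps in downstream_max_tps.items():
        # How many downstream calls one API request causes (default: 1).
        fan_out = calls_per_request.get(name, 1.0)
        if fan_out <= 0:
            raise ValueError(f"fan-out for {name} must be positive")
        limits.append(max_tps * headroom / fan_out)
    # The tightest dependency decides the API-level limit.
    return min(limits)


# Hypothetical dependencies and capacities.
downstream = {"orders-db": 1000.0, "pricing-service": 300.0}
fan_out = {"orders-db": 2.0, "pricing-service": 1.0}
limit = max_safe_tps_limit(downstream, fan_out)
print(f"API TPS limit should not exceed {limit:.0f}")
```

In this example the pricing service is the tighter constraint, so it sets the API-level limit.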
- Use Cases
- What are the use cases of this API?
- Systems Design
- What is the end-to-end flow between the components?
- Should this API have a sync flow or an async flow? For example, large-volume requests may take too long to process synchronously, in which case the API should only start an async process.
- Database Design
- Is the DB design normalized or denormalized?
- If denormalized, can this design keep the denormalized data consistent under all use cases?
- If denormalized, is it really needed? A normalized DB design is more maintainable.
- Data Model
- What are the key data models of this API?
- Is the response size small enough for the consumer, or could large amounts of data cause network/CPU/memory bottlenecks on the consumer?
- Reliability
- Are there any edge cases in which this solution will not work?
- Is there any possibility of data inconsistency?
- If the requests are made concurrently, will the solution still work?
- Or does it require an additional locking design, such as optimistic or pessimistic locking?
- Can there be any thread-safety problems?
- If requests are received out of order, can this design still handle them as if they had arrived in the correct order?
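As an illustration of the optimistic locking option mentioned above, here is a minimal sketch using a version number per record. The `InMemoryStore` class is a hypothetical stand-in for a database that supports an atomic conditional write; its internal lock only emulates that atomicity and is not part of the technique itself.

```python
import threading
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when the record changed since it was read."""


@dataclass(frozen=True)
class Record:
    value: int
    version: int


class InMemoryStore:
    """Hypothetical store; the lock emulates a DB's atomic conditional write."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}
        self._lock = threading.Lock()

    def read(self, key: str) -> Record:
        with self._lock:
            return self._records.get(key, Record(value=0, version=0))

    def write_if_version(self, key: str, new_value: int, expected_version: int) -> None:
        with self._lock:
            current = self._records.get(key, Record(value=0, version=0))
            if current.version != expected_version:
                raise VersionConflict(key)
            self._records[key] = Record(new_value, expected_version + 1)


def increment(store: InMemoryStore, key: str, max_attempts: int = 100) -> None:
    """Read-modify-write without holding a lock across the whole operation."""
    for _ in range(max_attempts):
        record = store.read(key)
        try:
            store.write_if_version(key, record.value + 1, record.version)
            return
        except VersionConflict:
            continue  # someone else wrote first: re-read and try again
    raise RuntimeError("too many conflicting writers")
```

A pessimistic design would instead hold a lock from read to write; the optimistic variant trades that for occasional retries when writers collide.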
- Reliability - Retries
- Is there a retry requirement if the API fails to process a request or event?
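If a retry requirement exists, one common way to meet it is a bounded retry loop with exponential backoff and jitter. The sketch below is one possible shape, not a required design: `TransientError`, `call_with_retries`, the attempt count, and the delays are hypothetical choices for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay_s: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on TransientError up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff with full jitter: wait a random time
            # between 0 and base_delay_s * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Only errors marked as transient are retried; any other exception propagates immediately.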
- Reliability - Rollbacks
- Is there a rollback requirement if the API fails to process a request or event?
- Availability
- How does the solution meet the availability SLA in the technical requirements?
- Are there any scenarios that would require downtime?
- During deployments, would there be any downtime or misbehavior?
- Can schema/model changes cause downtime during the deployment?
- Can this design cause downtime on an external service?
- Scalability
- What is the maximum throughput this API can achieve with this design?
- If this is a redesign, how does it compare with the existing design?
- What bottlenecks limit the potential throughput?
- Does this solution require any capacity planning?
- Can this solution handle the projected load for the upcoming years?
- What is the scaling model (serverless / horizontal scaling / vertical scaling)?
- In the systems design flow, what are the limits of the other components in terms of:
- What is the allowed execution time / latency budget for each component? (e.g. APIGW has a 30-second hard limit)
- What is the payload size limit of each component? (e.g. 6 MB limit on Lambda sync invocation, 256 KB limit on Lambda async invocation)
- What is the maximum throughput of each component in the flow?
- Performance
- How does the solution meet the latency SLA in the technical requirements?
- Maintainability & Future Extensibility
- How easy is it to handle new use cases, or to adapt to changes in existing use cases?
- Testability
- How will this solution be unit tested?
- How will this solution be tested end to end on an ad hoc basis?
- How will this solution be integration tested (automated)?
- Cost
- What are the projections on infrastructure cost?
- If this is a redesign, how does this design’s infrastructure cost compare with the previous one’s?
- Monitoring
- Performance metrics
- How will the latency of the API be monitored?
- Reliability metrics
- How will the error count and error rate of the API be monitored? (Internal server errors)
- Usage metrics
- How will the quota utilization of this API be monitored?
- Utilization of downstream services, i.e. the percentage of the TPS limit that is in use
- Custom metrics
- What should be the KPIs (key performance indicators) for this API, also considering the business impact?
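The reliability and usage metrics above are simple ratios. A minimal sketch of how they might be computed from raw counters follows; the function names and the sample numbers are hypothetical, and a real setup would usually take these values from the monitoring system rather than compute them by hand.

```python
def error_rate_percent(error_count: int, request_count: int) -> float:
    """Share of requests that ended in an internal server error."""
    if request_count == 0:
        return 0.0  # no traffic: report zero rather than divide by zero
    return 100.0 * error_count / request_count


def utilization_percent(observed_tps: float, tps_limit: float) -> float:
    """Used TPS as a percentage of the TPS limit (own quota or downstream)."""
    if tps_limit <= 0:
        raise ValueError("tps_limit must be positive")
    return 100.0 * observed_tps / tps_limit


# Hypothetical sample: 5 errors in 1000 requests, 450 TPS against a 500 TPS limit.
print(error_rate_percent(5, 1000))
print(utilization_percent(450, 500))
```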
- Alarms
- Are alarms planned to be created for each monitored key metric?
- How will the team be notified by the alarms and resolve them?
- Email notifications
- Ticketing system integration (e.g. JIRA REST API)
- Security
- What is the payload size limit for this API?
- How will the TPS limit be applied for this API?
- API level (basic, minimum requirement)?
- Per-client (advanced)?
- Is the data encrypted in transit and at rest?
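For the per-client TPS limit option above, a token bucket per client is one common approach. The sketch below assumes a single process and is not thread-safe; `TokenBucket`, `PerClientLimiter`, and the rate and burst values are hypothetical names and numbers chosen for illustration. An API-level limit is the same idea with a single shared bucket.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows rate_per_s requests per second on average, with bursts up to burst."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._tokens = burst          # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self._tokens = min(self._burst, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client ID so clients cannot starve each other."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate, self._burst, self._clock = rate_per_s, burst, clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

A request for which `allow` returns `False` would typically be rejected with a throttling response instead of being processed.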
- Development Effort
- If the long-term outcomes of multiple design options are equivalent, how do they compare in the effort needed to implement them?
- Backwards compatibility / Impact (if this is a change on an existing API)
- How does this design make sure that none of the existing clients will break?
- Change management
- What would be the procedure to roll back this change?
- DevOps
- How will the integration tests be included in the CI/CD pipelines?
Please also check Platform Implementation Checklist.