Keeping a consistent architectural vision


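As a result, the separation into microservices allows teams to work independently and in parallel. In a monolith, by contrast, most of the people working on it keep track of what is happening, even to the point of being distracted by work outside a particular developer's area of focus. They know when a new version is released and see new code being added to the same code base they are working on. That's not the case in a microservices architecture. Here, teams focus on their own services and aren't distracted by other features. This brings clarity and productivity.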

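However, there is still a need for a global vision of the system. There needs to be a long-term view of how the architecture of the system should change over time so that it can be adjusted. In a monolithic system, this vision is implicit. Microservices need a better understanding of these changes to work effectively, so a leading architect who can unify this global vision is very important.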


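In this book, we define the architect as a role that deals with the structure of APIs and services as a whole. Their main objective is to coordinate teams on technical issues, rather than to work on code directly.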


In small companies, Chief Technical Officers may fulfill the architect's role, though they will also be busy handling elements that are related to managerial processes and costs.

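The main responsibility of a leading architect is to ensure that the division into microservices continues to make sense as the system evolves, and that the APIs the services use to communicate with each other remain consistent. They should also try to facilitate the creation of standards across teams and share knowledge across the organization.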



Dividing the workload and Conway's Law



Remember that microservices are useful when the development team is big. For small teams, a monolith is easier to develop and maintain. It's only when many developers work on the same system that dividing the work and accepting the overheads of a microservice architecture makes sense.



The size of a team is highly variable, but 7±2 people is commonly used as a rule of thumb for the ideal team size.

Bigger groups tend to generate smaller groups on their own, but this means there will be too many to manage and some may not have a clear focus. It's difficult to know what the rest of the team is doing.

Smaller teams tend to create overhead in terms of management and inter-team communication. They'll develop faster with more members.






Describing Conway's Law

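Conway's Law is a software adage (https://www.nagarro.com/en/blog/post/76/microservices-revisiting-conway-s-law). Put simply, in any organization that produces software, the software will replicate the communication structure of the organization. For example, in a very simplified way, imagine an organization that is divided into two departments: purchases and sales. This will generate two software blocks: one focused on buying and another focused on selling. They will communicate when required.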

In this section, we will talk about software units. This is a generic term that describes any software that's treated as a single cohesive element. It can be a module, a package, or a microservice.

In the microservice architecture, these software units are mainly microservices, but in some cases, there can be other types. We will see examples of this in the Dividing the software into different kinds of software units section.


  • Inter-team APIs are more expensive than intra-team APIs, both in terms of operating them as well as developing them since their communication is more complicated. It makes sense to make them generic and flexible so that they can be reused.
  • If the communication structures replicate the human organization, it makes sense to be explicit. Inter-team APIs should be more visible, public, and documented than intra-team APIs.
  • When designing systems, dividing them across the lines of the layered team structure is the path of least resistance. Engineering them any other way would require organizational changes.
  • On the other hand, changing the structure of an organization is a difficult and painful process. Anyone that has gone through a reorg knows this. The change will be reflected in the software, so plan accordingly.
  • Having two teams working on the same software unit will create problems because each one will try to pull it toward its own goals. The owner of a software unit should be a single team. This shows everyone who is responsible and who has the final say on any change, and helps the owning team focus on its vision and reduce technical debt.
  • Different physical locations impose communications restrictions, such as time differences, which will produce barriers when we're developing software across them. It is common to split teams by location, which creates the need to structure the communication (and therefore the APIs) between these teams.
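The single-owner rule in the list above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical mapping from software units to the teams that claim them; the function name and the team names are illustrative and are not part of the example system.

```python
from typing import Dict, List


def find_ownership_problems(owners: Dict[str, List[str]]) -> Dict[str, str]:
    """Describe the problem for every unit that does not have exactly one owner."""
    problems = {}
    for unit, teams in owners.items():
        unique_teams = sorted(set(teams))
        if not unique_teams:
            problems[unit] = "no owner"
        elif len(unique_teams) > 1:
            problems[unit] = "shared by " + ", ".join(unique_teams)
    return problems


# Hypothetical ownership data, for illustration only.
OWNERS = {
    "users_backend": ["team-users"],
    "thoughts_backend": ["team-thoughts", "team-frontend"],
    "token_validation": [],
}

PROBLEMS = find_ownership_problems(OWNERS)
```

A check like this could run whenever the ownership data changes, so that shared or orphaned units are noticed early.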

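Note that the DevOps movement is related to Conway's Law. The traditional way of dividing the work was to separate the software being developed from how it was run. As Conway's Law describes, this created a gap between the two teams, which generated problems related to the lack of understanding between them.

The reaction to this problem was to create teams that could develop and operate their own software, as well as deploy it. This is called DevOps. It shifted operational problems to the development team, with the aim of creating a feedback loop to incentivize, understand, and fix them.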


Remembering this may help us design the system so that the communication flow makes sense for the organization and existing software.

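One of the key components of the DevOps movement was advancing the technologies used to build systems so that production environments are simpler to operate and the deployment process is simpler as a result. This allows us to structure teams in new ways so that multiple teams can take control of releases.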


Dividing the software into different kinds of software units


The main characteristic of a microservice is that it is independent in terms of development and deployment, so full parallelization can be achieved. Other divisions may reduce this and introduce dependencies.

Ensure that you justify these changes.

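In the example system presented in this book, we introduced a module that verifies that a request has been signed by a user. The Users Backend generates a signed header, and the Thoughts Backend and the Frontend verify it independently through the token_validation.py module.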
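The contents of token_validation.py are not shown in this section. To illustrate the kind of logic a shared validation module contains, the following is a minimal stand-in that signs and verifies a header with an HMAC from the Python standard library. The shared secret, the header layout, and the function names are all assumptions made for this sketch; it is not the module used by the example system.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared secret; a real service would load it from configuration.
SECRET = b"example-secret"


def _signature(username: str, secret: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature for a username."""
    return hmac.new(secret, username.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_username(username: str, secret: bytes = SECRET) -> str:
    """Build a header value of the form '<username>.<hex signature>'."""
    return f"{username}.{_signature(username, secret)}"


def validate_header(header: str, secret: bytes = SECRET) -> Optional[str]:
    """Return the username if the signature matches, otherwise None."""
    username, _, signature = header.rpartition(".")
    if not username:
        return None
    expected = _signature(username, secret)
    # Constant-time comparison of the two signatures.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

With a shared secret, every service that can validate a header can also sign one. A scheme based on a private and public key pair avoids this, at the cost of a third-party dependency.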


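The best way to avoid duplication and keep the code in sync at all times is to create a Python package that can be installed on the dependent microservices. These packages can then be treated like any other external dependency in the requirements.txt file.

To package a library in Python, we can use several tools, ranging from those described in the official Python Packaging User Guide (https://packaging.python.org/) to newer tools such as Poetry (https://poetry.eustace.io), which is easier to use for new projects.

We can upload the package to PyPI if we want it to be publicly available. Alternatively, we can upload it to a private repository using a tool such as Gemfury, or host our own repository if required. This draws an explicit line between the package and its maintainers on one side and the teams that use it as a dependency on the other.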


Designing working structures






Structuring teams around technologies


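A good example is mobile applications, since they are restricted in terms of the languages they can use (Java for Android, and Objective-C or Swift for iOS). An application with both a website and a mobile app will likely need a specific team to handle the mobile app's code.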



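The following diagram shows an example of the types of teams we will encounter. They are grouped by technology and by how they communicate. The database team communicates with the team that creates the web service backend, which in turn communicates with the web and mobile teams: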



Structuring teams around domains




  • Sales: Handles the external website and marketing.
  • Inventory: Purchases the merchandise so that it can be sold, and also handles stock.
  • Shipping: Delivers the product to the customers. The tracking information is displayed on the website.

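In this case, each area has its own database to store the relevant data, as well as its own services. The areas communicate with each other through defined APIs, and the most frequent changes happen within a domain. This allows for quick releases and development within a domain.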





Structuring teams around customers

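In some organizations, the main objective is to create custom work for customers. Perhaps customers need to integrate with the product in a custom B2B way. In this scenario, being able to develop and run custom code is critical.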






Structuring teams around a mix





Balancing new features and maintenance



Regular maintenance

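This kind of maintenance comes in the form of tasks that are inherent to the nature of software services. Because we run services that depend on other components, such as the underlying operating system or the Python interpreter, we need to keep those components up to date and upgrade them to new versions.

When using containers and Kubernetes, there are two systems acting as the operating system that we need to take into account. One is the OS inside the containers; here, we used Alpine. The other is the OS that runs the Kubernetes nodes, which AWS EKS handles automatically, although the Kubernetes version itself still needs to be upgraded.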


  • New versions fix security problems.
  • General performance improvements.
  • New features can be added that enable new functionality.

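These tasks can be mitigated if we plan for them. For example, using an operating system version labeled Long-Term Support (LTS) reduces problems when it comes to updating the system.

An LTS version of an OS is one that receives support and critical updates over a long cycle. For example, a regular Ubuntu version is released every 6 months and receives updates (including critical security updates) for 9 months. LTS versions are released every 2 years and receive support for 5 years.

When running services, it is recommended to use LTS versions to minimize the maintenance work required.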


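Updating dependencies may require us to adapt our code, depending on whether parts of it have been deprecated or removed. In some cases, this can be costly. At the time of writing, the most famous migration has been the Python community's upgrade from Python 2 to Python 3, a task that took several years.

Most upgrades, though, are pretty routine and require very little work. Try to create an upgrade plan that keeps up at a sensible pace and produces solid guidelines; for example, a rule such as "when a new operating system LTS version is released, all systems should be migrated within the following 3 months". This produces predictability and gives everyone a clear objective that can be followed up on and enforced.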
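A rule such as the 3-month migration window is easy to turn into a date check. The following is a minimal sketch using only the standard library; the function names and the release date used in the test values are hypothetical.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamping the day to the end of the month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def migration_deadline(lts_release: date, grace_months: int = 3) -> date:
    """Last day on which a system may still run the previous LTS version."""
    return add_months(lts_release, grace_months)


def is_overdue(lts_release: date, today: date, grace_months: int = 3) -> bool:
    """Return True once the migration deadline has passed."""
    return today > migration_deadline(lts_release, grace_months)
```

For instance, for a hypothetical LTS release on 23 April, `migration_deadline` returns 23 July of the same year, and `is_overdue` becomes true from 24 July onward.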

Continuous integration tools can help in this process. For example, GitHub automatically detects dependencies in files such as requirements.txt and notifies us when a vulnerability is detected. It's even possible to automatically generate pull requests when updating modules. Check out the documentation for more information: https://help.github.com/en/github/managing-security-vulnerabilities/configuring-automated-security-fixes.


  • To clean up or archive old data. These operations can normally be automated, saving a lot of time and reducing problems.
  • To fix operations that are dependent on business processes, such as generating monthly reports, and so on. These should be automated when possible, or tools should be designed so that users can produce them automatically instead of relying on technical staff doing bespoke operations.
  • To fix permanent problems that are produced by bugs or other errors. Bugs sometimes leave the system in a bad state; for example, there may be a corrupted entry in the database. While the bug is being fixed, we may need to resolve the situation by unblocking a process or a user.
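Cleaning up or archiving old data, the first item in the list above, is often a matter of applying a retention cutoff. The following is a minimal sketch that splits records into those to keep and those that have expired. The record layout (an identifier plus a creation time), the 7-day retention period, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Record = Tuple[str, datetime]  # (identifier, creation time)


def split_by_age(
    records: Iterable[Record], now: datetime, retention_days: int = 7
) -> Tuple[List[Record], List[Record]]:
    """Split records into (keep, expired) based on a retention period in days."""
    cutoff = now - timedelta(days=retention_days)
    keep: List[Record] = []
    expired: List[Record] = []
    for record in records:
        # Records created strictly before the cutoff are considered expired.
        if record[1] < cutoff:
            expired.append(record)
        else:
            keep.append(record)
    return keep, expired


# Hypothetical data, for illustration only.
NOW = datetime(2020, 1, 15, 12, 0)
RECORDS = [
    ("a", datetime(2020, 1, 1)),
    ("b", datetime(2020, 1, 14)),
    ("c", datetime(2020, 1, 8, 12, 0)),
]
KEEP, EXPIRED = split_by_age(RECORDS, NOW)
```

The expired records can then be archived or deleted by a scheduled job, without technical staff having to perform the operation by hand.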



Understanding technical debt


As a metaphor, technical debt has been around since the early 90s, though the concept had been described before then.



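Besides being unavoidable, technical debt can also be a deliberate choice. Development is constrained by time, so an imperfect solution that gets to market quickly may be preferable to missing a deadline.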




An architectural migration such as this is a big effort and will require time to deliver. New microservices that are reproducing the features that already exist in the monolith may conflict with new features being introduced.

This creates a moving target effect that can be very disruptive. Ensure that you identify these conflicting points and try to minimize them in your migration plan. For example, it may be possible to delay some new features until the new microservice is ready.


Continuously addressing technical debt


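Technical debt is normally detected from inside the development team, since its members are closest to the code. Teams should think about where operations could run more smoothly and reserve time to make those improvements.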

A great source of information that allows us to detect technical debt is metrics, such as the ones we set up in Chapter 10, Monitoring Logs and Metrics.



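Many of the techniques we have talked about in this book help us improve the system in a continuous fashion, from the continuous integration techniques described in Chapter 4, Creating a Pipeline and Workflow, to the code reviews and approvals described in Chapter 8, Using GitOps Principles.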



Avoiding technical debt



  • Lack of a strategic, high-level plan to give direction: This produces inconsistent results because each time the same problem is found, it will be resolved in a different way. We talked about how coordination across teams needs to address standards across the organization and ensure they are followed. Having someone act as a software architect who works to create consistent guidelines across the board should greatly improve this case.
  • Not having the proper knowledge to choose the right option: This is quite common. Sometimes, the people that need to make a decision don't have all the relevant pieces of information due to miscommunication or simply lack of experience. This problem is typical of structures that lack experience in the problems at hand. Ensuring that you have a highly trained team and are creating a culture where more experienced members help and mentor junior members will reduce these cases. Documentation that keeps track of previous decisions and simplifies how to use other microservices will help us coordinate teams so that they have all the relevant parts of the puzzle. This helps them avoid making mistakes due to incorrect assumptions. Another important element is to ensure that teams have proper training in the tools they're using so that they are fully aware of their capacities. This should be the case for external tools, such as being skilled in Python or SQL, and in any internal tool that requires training materials, documentation, and appointed points of contact.
  • Not spending enough time investigating different options or planning: This problem is created by pressure and by the need to make quick progress. This can be ingrained in the organization culture, and slowing down decision-making could be a challenging task when the organization grows since smaller organizations tend to require a faster process. Documenting the decision process and requiring it to be peer-reviewed or approved can help slow this down and ensure that work is thorough. It's important to find a balance in terms of what decisions require more scrutiny and which ones don't. For example, everything that fits neatly inside a microservice can be reviewed inside the team, but features that require multiple microservices and teams should be reviewed and approved externally. In this scenario, finding the proper balance between gathering information and making a decision is important. Document the decisions and the inputs so that you understand the process that got them there and refine your process.


Designing a broader release process





Planning in the weekly release meeting



  • Planned releases for the next 7 days and rough times for when; for example, we plan to release a new version of the Users Backend on Wednesday.
  • You should provide a heads-up for any important new feature, especially if other teams can use it. For example, if the new version improves authentication, make sure that you redirect your teams to the new API so that they can get these improvements as well.
  • State any blockers. For example, we can't release this version until the Thoughts Backend releases their version with feature A.
  • Raise any flags if there's critical maintenance or any changes that could affect the releases. For example, on Thursday morning, we need to do database maintenance, so please don't release anything until 12 o'clock. We will send an email when the work is done.
  • Review the release problems that happened the week prior. We'll talk about this in more detail later.
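A flag such as the database maintenance example in the list above can also be recorded as data, so that a release script can check it instead of relying on people remembering the meeting. The following is a minimal sketch; the function name, the freeze start time of 08:00, and the dates are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a release freeze


def release_allowed(when: datetime, freezes: List[Window]) -> bool:
    """Return True if `when` is outside every freeze window.

    The start of a window is inclusive and the end is exclusive.
    """
    return not any(start <= when < end for start, end in freezes)


# Hypothetical freeze: database maintenance on a Thursday morning until 12:00.
FREEZES = [(datetime(2020, 1, 9, 8, 0), datetime(2020, 1, 9, 12, 0))]
```

Here, a release attempted at 10:30 on that Thursday is rejected, while one at 12:00 or later is allowed.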

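This is similar to the usual stand-up meetings found in many Agile practices, such as SCRUM, but focused on releases. To be able to do this, we need to specify ahead of time when each release will occur.

Given the asynchronous nature of microservice releases, and as continuous integration practices are implemented and speed up the process, there will be a lot of routine releases that aren't planned that far in advance. This is fine and means that the release process is being refined.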


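Over time, as continuous integration practices become more established and releases become quicker, the weekly release meeting will slowly become less important, to the point where it may no longer be needed, at least not as regularly. This is part of reflecting on the practice of continuous improvement, which is also achieved by identifying release problems.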

Reflecting on release problems







Capturing problems is not, and should never be, about assigning blame. It's the organization's responsibility to detect and correct problems.

If blame is assigned, not only will the environment become less attractive to work in, but teams will also hide problems so that they don't get blamed.

Unaddressed problems tend to be multiplicative, so reliability will suffer greatly.




Running post-mortem meetings




  • What problem was detected? Include how it was detected if this isn't evident; for example, the website was down and was returning 500 errors. This was noticed as an increase in errors.
  • When did it start and finish? A timeline of the incident; for example, Thursday from 3PM to 5PM.
  • Who was involved in remediating the incident? Either detecting the problem or fixing it. This helps us collect information about what happened.
  • Why did it fail? Go to the root cause and the chain of events leading to that; for example, the website was down because the application couldn't connect to the database. The database was unresponsive because the hard drive was full. The hard drive was full because the local logs filled up the disk.
  • How was it fixed? Steps were taken to solve the incident; for example, logs older than a week were deleted.
  • What actions should follow up from this incident? Actions that should remediate or fix the different issues. Ideally, they should include who will perform the action; for example, no local logs should be stored and they should be sent to the centralized log. The amount of hard disk space should be monitored, and an alert should be raised if less than 80% of the space is available.
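The disk space follow-up action in the last item can be sketched with the standard library. The following is a minimal example in which the alert threshold is a parameter, so the 80% figure from the example is only one possible value; the function names are hypothetical, and how the alert is delivered is left out.

```python
import shutil


def free_fraction(total: int, free: int) -> float:
    """Return the fraction of the disk that is free, between 0.0 and 1.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    return free / total


def needs_alert(total: int, free: int, min_free_fraction: float) -> bool:
    """Return True if the free fraction is below the configured minimum."""
    return free_fraction(total, free) < min_free_fraction


def check_path(path: str, min_free_fraction: float) -> bool:
    """Check the filesystem that contains `path` against the threshold."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, and free bytes
    return needs_alert(usage.total, usage.free, min_free_fraction)
```

For example, `check_path("/", 0.8)` reports whether less than 80% of the root filesystem is free. The right threshold depends on how quickly the disk fills up and how long it takes to react to an alert.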


As we discussed in the Reflecting on release problems section, be sure to encourage open and candid discussion when you're dealing with service interruption incidents.

Post-mortem meetings are not there to blame anyone, but to improve the service and reduce risks when you're working as a team.









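Finally, we tackled some of the problems that releases can cause, including the need for adequate coordination between teams, especially in the early stages of working with GitOps, and for retrospective analysis when a release fails or a service goes down.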


