[Tech World] Regular expressions in Microsoft Word... is there no end?

vlambda
2020-05-08

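This series of articles is selected from multifarious, the blog of Paul Filkin, a senior engineer and technical expert at SDL. The translation was organized by 上海艺果 (Shanghai Yiguo) using SDL Trados GroupShare and is shared with fellow translators for study and exchange. Copyright belongs to the original author; please credit the source when reposting.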

Translation: 徐广志

Review: 申明

Unfortunately, being asked to translate a Microsoft Word file that contains HTML code doesn’t look as though it will go away any time soon for some translators. But it’s not the end of the world, and the fix usually lies in how you prepare the Word file before you translate it.

This article is just a short-ish one I decided to write after seeing this come up again on ProZ last week, and because it’s another place where all those lovely regular expressions we’re learning about can come in handy. Yes, Microsoft Word also supports regular expressions, although in its own flavour. You can read more by searching for “regular expressions in Microsoft Word”, which turns up plenty of help on the subject. In Word they are called wildcards, but they share many principles with regular expressions, as we’ll see in this very simple example.

I have a Word file that looks like this, and you can see I have added what’s often referred to as embedded HTML, copied in as text:

If I open this in Studio I get this, which is not too easy to work with. That’s hardly surprising, as this is a terrible way to handle content like this… actually, if anyone can tell me why people do it I’d be interested to learn!

So the solution for Studio users is to do one of two things:

1. Copy the HTML into a decent text editor, save it as an HTML file, and then use Studio to handle that file separately, or

2. Use a little regex magic to format all the tags as hidden text so they can’t be seen in Studio

For this article I’m going to use the latter: find the tags and apply Word’s hidden font property to them with search and replace. This is sometimes the easier approach for files with embedded content like this, because the HTML may be scattered all over the place, so it is one operation rather than many. To do this I’ll use the following expression to find the tags:

\<*\>

So, very similar to the .NET flavour of regex that Studio uses, but with a slightly different meaning. Word uses the angle brackets to mark the start and end of a word so that you can find whole words only… sort of like word boundary markers in .NET. I actually want to find the literal angle brackets, so I have to escape them, and this is what the backslash does. The star symbol is not quite the same as in .NET: in Word it stands on its own for any string of characters, roughly what .* expresses in .NET, where the star is a quantifier applied to the preceding item. So in my Word find and replace dialogue I set it up like this:
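For readers more used to conventional regex, here is a minimal Python sketch of a rough analogue of the wildcard above. The sample string and the name TAG_PATTERN are made up for illustration, and the assumption that Word's star stops at the nearest closing bracket is expressed here with a negated character class rather than taken from Word's documentation.

```python
import re

# Hypothetical sample text standing in for the Word document's content.
sample = '<p>Click <a href="help.html">here</a> for help.</p>'

# Assumed rough analogue of the Word wildcard \<*\> :
# a literal "<", then any characters up to the nearest ">".
TAG_PATTERN = re.compile(r"<[^>]*>")

tags = TAG_PATTERN.findall(sample)
print(tags)
```

In Python the angle brackets need no escaping because they carry no special meaning there, which is the opposite of the situation in Word.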

1. I enter my regular expression

2. I check the “Use Wildcards” checkbox

3. I click on “Format”, then “Font”, and in there check “Hidden”
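The effect of these three steps can be modelled outside Word. The sketch below is not how Word or Studio work internally; it is only a plain-Python illustration, with hypothetical helper names split_hidden and visible_text, of how the text divides into hidden tag segments and visible text once every match carries the hidden property.

```python
import re

# Assumed analogue of the Word wildcard \<*\> (see the pattern above).
TAG_PATTERN = re.compile(r"<[^>]*>")

def split_hidden(text):
    """Return (segment, is_hidden) pairs; every matched tag is marked hidden."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > pos:
            # Plain text between two tags stays visible.
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments

def visible_text(text):
    """Join only the segments that are not hidden."""
    return "".join(seg for seg, hidden in split_hidden(text) if not hidden)

sample = "<p>Hello <b>world</b>!</p>"
print(split_hidden(sample))
print(visible_text(sample))
```

The visible part is what remains on screen after the replacement, while the hidden segments stay in the file untouched.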

You can see that just beneath the search pattern, and beneath the empty replace box, Word tells me what settings I used for each. Now all I do is click on “Replace All”. Immediately all my tags disappear and the Word file looks like this:

But don’t worry… if I click the display formatting button (shown here on the right) it all comes back again. The tags will now have dotted lines under them, but this just tells you that they have the hidden font property, so I can simply set the option in Studio not to extract hidden text for translation. You can find this option under the “Common” node in the file type settings for Microsoft Word:

Now when I open the file for translation I see this:

Much easier to handle: all the HTML code is hidden, and I can safely work on the file.

In reality this is an exercise in seeing yet another application for regular expressions in other software tools… this time Microsoft Office… because I truly hope you don’t see any files like this at all. But if you do, as I occasionally do, then perhaps this article will help you navigate the content of the file safely without destroying the tags.

Once you are done, select the text in the target file, right click and select “Font”, then uncheck “Hidden” to bring the tags back. Simple!
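As an optional sanity check after unhiding, the tags in the target text can be compared with those in the source. This is a minimal sketch under assumptions: the source and target are available as plain strings, tags_preserved is a hypothetical helper, and the comparison is deliberately strict, so a tag whose attribute text was legitimately translated would be reported as a difference.

```python
import re

# Assumed analogue of the Word wildcard \<*\> used to find the tags.
TAG_PATTERN = re.compile(r"<[^>]*>")

def tags_preserved(source, target):
    """True if both texts contain the same tags in the same order."""
    return TAG_PATTERN.findall(source) == TAG_PATTERN.findall(target)

# Hypothetical source and target strings for illustration.
source = "<p>Click <b>here</b></p>"
good_target = "<p>Cliquez <b>ici</b></p>"
bad_target = "<p>Cliquez <b>ici</p>"  # closing </b> was lost

print(tags_preserved(source, good_target))
print(tags_preserved(source, bad_target))
```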
