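4 Robustness
The messenger example in A Larger Example has several problems. For example, if a node where a user is logged on goes down without doing a logoff, the user remains in the server's User_List, but the client disappears. This makes it impossible for the user to log on again, because the server thinks the user is already logged on.
Or what happens if the server goes down in the middle of sending a message, leaving the sending client hanging forever in the await_result function?
4.1 Timeouts
Before improving the messenger program, let us look at some general principles, using the ping pong program as an example. Recall that when "ping" finishes, it tells "pong" that it has done so by sending the atom finished as a message to "pong", so that "pong" can also finish. Another way to let "pong" finish is to make "pong" exit if it does not receive a message from "ping" within a certain time. This can be done by adding a **timeout** to pong, as shown in the following example: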
    -module(tut19).

    -export([start_ping/1, start_pong/0, ping/2, pong/0]).

    ping(0, Pong_Node) ->
        io:format("ping finished~n", []);

    ping(N, Pong_Node) ->
        {pong, Pong_Node} ! {ping, self()},
        receive
            pong ->
                io:format("Ping received pong~n", [])
        end,
        ping(N - 1, Pong_Node).

    pong() ->
        receive
            {ping, Ping_PID} ->
                io:format("Pong received ping~n", []),
                Ping_PID ! pong,
                pong()
        after 5000 ->
                io:format("Pong timed out~n", [])
        end.

    start_pong() ->
        register(pong, spawn(tut19, pong, [])).

    start_ping(Pong_Node) ->
        spawn(tut19, ping, [3, Pong_Node]).
After this is compiled and the file tut19.beam is copied to the necessary directories, the following is seen on (pong@kosken):
(pong@kosken)1> tut19:start_pong().
true
Pong received ping
Pong received ping
Pong received ping
Pong timed out
And the following is seen on (ping@gollum):
(ping@gollum)1> tut19:start_ping(pong@kosken).
<0.36.0>
Ping received pong
Ping received pong
Ping received pong
ping finished
The timeout is set in:
    pong() ->
        receive
            {ping, Ping_PID} ->
                io:format("Pong received ping~n", []),
                Ping_PID ! pong,
                pong()
        after 5000 ->
                io:format("Pong timed out~n", [])
        end.
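The timeout (after 5000) is started when receive is entered. If {ping,Ping_PID} is received, the timeout is canceled. If {ping,Ping_PID} is not received, the actions following the timeout are done after 5000 milliseconds. after must be last in the receive, that is, it must come after all other message reception specifications in the receive. It is also possible to call a function that returns an integer for the timeout: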
after pong_timeout() ->
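In general, there are better ways than using timeouts to supervise parts of a distributed Erlang system. Timeouts are usually appropriate for supervising external events, for example, when a message is expected from some external system within a specified time. For example, a timeout can be used to log a user out of the messenger system if they have not accessed it for 10 minutes.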
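The inactivity logout mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the module name idle_sketch, the functions start/2, session/2 and idle_timeout/0, and the message shapes {activity, What}, {seen, What} and {logged_off, idle} are all hypothetical and are not part of the messenger program.

```erlang
-module(idle_sketch).
-export([start/2, session/2, idle_timeout/0]).

%% Hypothetical helper: the timeout is computed by a function,
%% as in "after pong_timeout() ->". Ten minutes in milliseconds.
idle_timeout() ->
    10 * 60 * 1000.

%% Start a session process that reports back to Owner.
start(Owner, Timeout) ->
    spawn(idle_sketch, session, [Owner, Timeout]).

%% Every received message leads to a new call of session/2, so the
%% receive is entered again and the timer starts from zero.
session(Owner, Timeout) ->
    receive
        {activity, What} ->
            Owner ! {seen, What},
            session(Owner, Timeout)
    after Timeout ->
            %% No activity for Timeout milliseconds: give up.
            Owner ! {logged_off, idle}
    end.
```

A real session would be started with idle_sketch:start(self(), idle_sketch:idle_timeout()); a much shorter timeout is convenient when trying it out in the shell.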
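4.2 Error Handling
Before going into the details of supervision and error handling in an Erlang system, let us see how Erlang processes terminate, or in Erlang terminology, **exit**.
A process that executes exit(normal), or simply runs out of things to do, has a **normal** exit.
A process that encounters a runtime error (for example, division by zero, a bad match, or trying to call a function that does not exist) exits with an error, that is, has an **abnormal** exit. A process that executes exit(Reason), where Reason is any Erlang term except the atom normal, also has an abnormal exit.
An Erlang process can set up links to other Erlang processes. If a process calls link(Other_Pid), it sets up a bidirectional link between itself and the process called Other_Pid. When a process terminates, it sends something called a **signal** to all the processes it has links to.
The signal carries information about the pid it was sent from and the exit reason.
The default behaviour of a process that receives a normal exit signal is to ignore the signal.
In the two other cases above (that is, abnormal exit), the default behaviour is to:
- Bypass all messages to the receiving process.
- Kill the receiving process.
- Propagate the same error signal to the links of the killed process.
In this way, all processes in a transaction can be connected together using links. If one of the processes exits abnormally, all the processes in the transaction are killed. As it is often desirable to create a process and link to it at the same time, there is a special BIF, spawn_link, that does the same as spawn but also creates a link to the spawned process.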
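The propagation of an abnormal exit over a link can be observed with a small sketch like the one below. The module link_sketch and its functions are hypothetical names chosen for illustration. The sketch also uses spawn_monitor/3, a standard BIF that this text does not cover, purely as a way for run/0 to see why the parent process died.

```erlang
-module(link_sketch).
-export([run/0, parent/0, child/0]).

%% The child waits for the message crash, then exits abnormally.
child() ->
    receive
        crash ->
            exit(boom)
    end.

%% The parent creates the child and the link in one step, asks the
%% child to crash, and then waits for a message that never arrives.
%% It does not trap exits, so the child's exit signal kills it.
parent() ->
    Child = spawn_link(link_sketch, child, []),
    Child ! crash,
    receive
        never_sent ->
            ok
    end.

%% Start the parent under a monitor and return its exit reason.
run() ->
    {Pid, Ref} = spawn_monitor(link_sketch, parent, []),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            Reason
    after 1000 ->
            timeout
    end.
```

Calling link_sketch:run() is expected to return boom: the parent exits with the same reason that the child passed to exit/1.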
Now look at an example of the ping pong program that uses links to terminate "pong":
    -module(tut20).

    -export([start/1, ping/2, pong/0]).

    ping(N, Pong_Pid) ->
        link(Pong_Pid),
        ping1(N, Pong_Pid).

    ping1(0, _) ->
        exit(ping);

    ping1(N, Pong_Pid) ->
        Pong_Pid ! {ping, self()},
        receive
            pong ->
                io:format("Ping received pong~n", [])
        end,
        ping1(N - 1, Pong_Pid).

    pong() ->
        receive
            {ping, Ping_PID} ->
                io:format("Pong received ping~n", []),
                Ping_PID ! pong,
                pong()
        end.

    start(Ping_Node) ->
        PongPID = spawn(tut20, pong, []),
        spawn(Ping_Node, tut20, ping, [3, PongPID]).
(s1@bill)3> tut20:start(s2@kosken).
Pong received ping
<3820.41.0>
Ping received pong
Pong received ping
Ping received pong
Pong received ping
Ping received pong
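This is a slight modification of the ping pong program in which both processes are spawned from the same start/1 function, and the "ping" process can be spawned on a separate node. Notice the use of the link BIF. "Ping" calls exit(ping) when it finishes, which causes an exit signal to be sent to "pong", and "pong" terminates as well.
The default behaviour of a process can be modified so that it is not killed when it receives abnormal exit signals. Instead, all exit signals are turned into normal messages of the form {'EXIT',FromPID,Reason} and added to the end of the receiving process's message queue. This behaviour is set by: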
process_flag(trap_exit, true)
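There are several other process flags, see erlang(3). Changing the default behaviour of a process in this way is usually not done in standard user programs, but is left to the supervisory programs in OTP. However, the ping pong program has been modified here to illustrate exit trapping.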
    -module(tut21).

    -export([start/1, ping/2, pong/0]).

    ping(N, Pong_Pid) ->
        link(Pong_Pid),
        ping1(N, Pong_Pid).

    ping1(0, _) ->
        exit(ping);

    ping1(N, Pong_Pid) ->
        Pong_Pid ! {ping, self()},
        receive
            pong ->
                io:format("Ping received pong~n", [])
        end,
        ping1(N - 1, Pong_Pid).

    pong() ->
        process_flag(trap_exit, true),
        pong1().

    pong1() ->
        receive
            {ping, Ping_PID} ->
                io:format("Pong received ping~n", []),
                Ping_PID ! pong,
                pong1();
            {'EXIT', From, Reason} ->
                io:format("pong exiting, got ~p~n", [{'EXIT', From, Reason}])
        end.

    start(Ping_Node) ->
        PongPID = spawn(tut21, pong, []),
        spawn(Ping_Node, tut21, ping, [3, PongPID]).
(s1@bill)1> tut21:start(s2@gollum).
<3820.39.0>
Pong received ping
Ping received pong
Pong received ping
Ping received pong
Pong received ping
Ping received pong
pong exiting, got {'EXIT',<3820.39.0>,ping}
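As a complement to tut21, the following sketch suggests how a process that traps exits sees both normal and abnormal exits of linked processes as ordinary messages. The module trap_sketch and the helper wait_exit/1 are hypothetical names, and the 1000 ms timeout is an arbitrary safety net for the illustration.

```erlang
-module(trap_sketch).
-export([run/0]).

%% Trap exits in the calling process, link to two short-lived
%% processes and collect the reason carried by each exit signal.
run() ->
    Old = process_flag(trap_exit, true),   % returns the previous value
    P1 = spawn_link(fun() -> ok end),      % has nothing to do: normal exit
    R1 = wait_exit(P1),
    P2 = spawn_link(fun() -> exit(oops) end),  % abnormal exit
    R2 = wait_exit(P2),
    process_flag(trap_exit, Old),          % restore the previous setting
    {R1, R2}.

%% Wait for the {'EXIT', Pid, Reason} message from one linked process.
wait_exit(Pid) ->
    receive
        {'EXIT', Pid, Reason} ->
            Reason
    after 1000 ->
            timeout
    end.
```

trap_sketch:run() is expected to return {normal,oops}. Without trap_exit, the first signal would be ignored and the second would kill the calling process.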
4.3 The Larger Example with Robustness Added
Let us return to the messenger program and add changes to make it more robust:
    %%% Message passing utility.
    %%% User interface:
    %%% logon(Name)
    %%%     One user at a time can log on from each Erlang node in the
    %%%     system messenger and choose a suitable Name. If the Name
    %%%     is already logged on at another node or if someone else is
    %%%     already logged on at the same node, logon will be rejected
    %%%     with a suitable error message.
    %%% logoff()
    %%%     Logs off anybody at that node
    %%% message(ToName, Message)
    %%%     sends Message to ToName. Error messages if the user of this
    %%%     function is not logged on or if ToName is not logged on at
    %%%     any node.
    %%%
    %%% One node in the network of Erlang nodes runs a server which maintains
    %%% data about the logged on users. The server is registered as "messenger"
    %%% Each node where there is a user logged on runs a client process registered
    %%% as "mess_client"
    %%%
    %%% Protocol between the client processes and the server
    %%% ----------------------------------------------------
    %%%
    %%% To server: {ClientPid, logon, UserName}
    %%% Reply {messenger, stop, user_exists_at_other_node} stops the client
    %%% Reply {messenger, logged_on} logon was successful
    %%%
    %%% When the client terminates for some reason
    %%% To server: {'EXIT', ClientPid, Reason}
    %%%
    %%% To server: {ClientPid, message_to, ToName, Message} send a message
    %%% Reply: {messenger, stop, you_are_not_logged_on} stops the client
    %%% Reply: {messenger, receiver_not_found} no user with this name logged on
    %%% Reply: {messenger, sent} Message has been sent (but no guarantee)
    %%%
    %%% To client: {message_from, Name, Message},
    %%%
    %%% Protocol between the "commands" and the client
    %%% ----------------------------------------------
    %%%
    %%% Started: messenger:client(Server_Node, Name)
    %%% To client: logoff
    %%% To client: {message_to, ToName, Message}
    %%%
    %%% Configuration: change the server_node() function to return the
    %%% name of the node where the messenger server runs

    -module(messenger).
    -export([start_server/0, server/0,
             logon/1, logoff/0, message/2, client/2]).

    %%% Change the function below to return the name of the node where the
    %%% messenger server runs
    server_node() ->
        messenger@super.

    %%% This is the server process for the "messenger"
    %%% the user list has the format [{ClientPid1, Name1},{ClientPid2, Name2},...]
    server() ->
        process_flag(trap_exit, true),
        server([]).

    server(User_List) ->
        receive
            {From, logon, Name} ->
                New_User_List = server_logon(From, Name, User_List),
                server(New_User_List);
            {'EXIT', From, _} ->
                New_User_List = server_logoff(From, User_List),
                server(New_User_List);
            {From, message_to, To, Message} ->
                server_transfer(From, To, Message, User_List),
                io:format("list is now: ~p~n", [User_List]),
                server(User_List)
        end.

    %%% Start the server
    start_server() ->
        register(messenger, spawn(messenger, server, [])).

    %%% Server adds a new user to the user list
    server_logon(From, Name, User_List) ->
        %% check if logged on anywhere else
        case lists:keymember(Name, 2, User_List) of
            true ->
                From ! {messenger, stop, user_exists_at_other_node},  %reject logon
                User_List;
            false ->
                From ! {messenger, logged_on},
                link(From),
                [{From, Name} | User_List]        %add user to the list
        end.

    %%% Server deletes a user from the user list
    server_logoff(From, User_List) ->
        lists:keydelete(From, 1, User_List).

    %%% Server transfers a message between user
    server_transfer(From, To, Message, User_List) ->
        %% check that the user is logged on and who he is
        case lists:keysearch(From, 1, User_List) of
            false ->
                From ! {messenger, stop, you_are_not_logged_on};
            {value, {_, Name}} ->
                server_transfer(From, Name, To, Message, User_List)
        end.

    %%% If the user exists, send the message
    server_transfer(From, Name, To, Message, User_List) ->
        %% Find the receiver and send the message
        case lists:keysearch(To, 2, User_List) of
            false ->
                From ! {messenger, receiver_not_found};
            {value, {ToPid, To}} ->
                ToPid ! {message_from, Name, Message},
                From ! {messenger, sent}
        end.

    %%% User Commands
    logon(Name) ->
        case whereis(mess_client) of
            undefined ->
                register(mess_client,
                         spawn(messenger, client, [server_node(), Name]));
            _ -> already_logged_on
        end.

    logoff() ->
        mess_client ! logoff.

    message(ToName, Message) ->
        case whereis(mess_client) of % Test if the client is running
            undefined ->
                not_logged_on;
            _ -> mess_client ! {message_to, ToName, Message},
                 ok
        end.

    %%% The client process which runs on each user node
    client(Server_Node, Name) ->
        {messenger, Server_Node} ! {self(), logon, Name},
        await_result(),
        client(Server_Node).

    client(Server_Node) ->
        receive
            logoff ->
                exit(normal);
            {message_to, ToName, Message} ->
                {messenger, Server_Node} ! {self(), message_to, ToName, Message},
                await_result();
            {message_from, FromName, Message} ->
                io:format("Message from ~p: ~p~n", [FromName, Message])
        end,
        client(Server_Node).

    %%% wait for a response from the server
    await_result() ->
        receive
            {messenger, stop, Why} -> % Stop the client
                io:format("~p~n", [Why]),
                exit(normal);
            {messenger, What} ->  % Normal response
                io:format("~p~n", [What])
        after 5000 ->
                io:format("No response from server~n", []),
                exit(timeout)
        end.
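The following changes are added:
The messenger server traps exits. If it receives an exit signal, {'EXIT',From,Reason}, this means that a client process has terminated or is unreachable for one of the following reasons:
- The user has logged off (the separate "logoff" message to the server has been removed; the client now simply exits).
- The network connection to the client is broken.
- The node on which the client process resides has gone down.
- The client process has done some illegal operation.
If an exit signal is received as above, the tuple {From,Name} is deleted from the server's User_List using the server_logoff function. If the node on which the server runs goes down, an exit signal (automatically generated by the system) is sent to all of the client processes: {'EXIT',MessengerPID,noconnection}, causing all the client processes to terminate.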
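The cleanup described above can be condensed into a small stand-alone sketch. It is not the messenger itself: the module registry_sketch, the functions start/0, init/0 and loop/1, and the {From, list} request with its {registry, Names} reply are hypothetical additions that make the effect of an exit signal on the user list visible.

```erlang
-module(registry_sketch).
-export([start/0, init/0, loop/1]).

%% Start an unregistered server process (the messenger registers its own).
start() ->
    spawn(registry_sketch, init, []).

%% Trap exits once, before entering the receive loop.
init() ->
    process_flag(trap_exit, true),
    loop([]).

%% Users has the format [{ClientPid, Name}, ...], as in the messenger.
loop(Users) ->
    receive
        {From, logon, Name} ->
            link(From),                    % the server links to the client
            From ! {registry, logged_on},
            loop([{From, Name} | Users]);
        {'EXIT', From, _Reason} ->
            %% Any termination of a linked client lands here.
            loop(lists:keydelete(From, 1, Users));
        {From, list} ->
            %% Hypothetical request: reply with the logged-on names.
            From ! {registry, [Name || {_, Name} <- Users]},
            loop(Users)
    end.
```

Once a logged-on client process dies for any reason, a later {self(), list} request should no longer include its name.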
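Also, a timeout of five seconds has been introduced in the await_result function. That is, if the server does not reply within five seconds (5000 ms), the client terminates. This is only needed in the logon sequence, before the client and the server are linked.
An interesting case is if the client terminates before the server links to it. This is taken care of because linking to a non-existent process causes an exit signal, {'EXIT',From,noproc}, to be automatically generated. This has the same effect as if the process had terminated immediately after the link operation.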
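The noproc case can be tried locally with a sketch along the following lines. It assumes that the caller traps exits, as the messenger server does; the module name noproc_sketch is hypothetical, and spawn_monitor/1 is used only to be sure that the target process has already terminated before link/1 is called.

```erlang
-module(noproc_sketch).
-export([run/0]).

%% Link to a process that is already dead while trapping exits,
%% and return the reason carried by the resulting 'EXIT' message.
run() ->
    Old = process_flag(trap_exit, true),
    {Dead, Ref} = spawn_monitor(fun() -> ok end),
    %% Wait until the process has really terminated.
    receive
        {'DOWN', Ref, process, Dead, _} ->
            ok
    end,
    true = link(Dead),
    Result = receive
                 {'EXIT', Dead, Reason} ->
                     Reason
             after 1000 ->
                     timeout
             end,
    process_flag(trap_exit, Old),          % restore the previous setting
    Result.
```

noproc_sketch:run() is expected to return noproc, which is the reason the server would see in its {'EXIT', From, _} clause if a client disappeared before the link was set up.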