How it would ideally work.
We have several clusters. Each cluster includes 4 servers.
Cluster-1: hub-1, hub-2, app1, app2.
In the Domino Directory there are two dedicated scheduled-agent configuration documents, which for Cluster-1 contain:
Fields of doc-1:
Name: MAIN_AGENTS;
Server: hub-1.
Fields of doc-2:
Name: OTHER_AGENTS;
Server: hub-2.
In every scheduled agent, you select not a server but a specific launch configuration, i.e. MAIN_AGENTS or OTHER_AGENTS.
Server hub-1 is down.
We are convinced that it will not come back up quickly.
In the MAIN_AGENTS document we change the server from hub-1 to app2 (on the load balancer, app1 has priority for users).
Benefits:
1. Easy to manage - just one change.
2. There is no need to change and re-sign design elements (agents).
Agents can record their successful runs in special logs on the administration server. But I am in favor of manual switching.
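The indirection described above can be sketched roughly as follows. This is only an illustration of the idea, not an existing Domino API: the dictionaries stand in for the two proposed configuration documents, and the agent names `nightly-archive` and `mail-report` are made up for the example.

```python
# Hypothetical model of the proposed configuration documents (name -> server).
# The document names and servers mirror the Cluster-1 example above.
agent_configs = {
    "MAIN_AGENTS": "hub-1",
    "OTHER_AGENTS": "hub-2",
}

# Each scheduled agent references a configuration name, not a server.
# The agent names are invented for illustration.
agents = {
    "nightly-archive": "MAIN_AGENTS",
    "mail-report": "OTHER_AGENTS",
}


def server_for(agent_name: str) -> str:
    """Resolve the server an agent should run on via its configuration."""
    return agent_configs[agents[agent_name]]


def should_run_here(agent_name: str, this_server: str) -> bool:
    """True if the agent's configuration currently points at this server."""
    return server_for(agent_name).lower() == this_server.lower()


# hub-1 is down: the admin edits one document.
# No agent design element is changed or re-signed.
agent_configs["MAIN_AGENTS"] = "app2"
```

After the single change, every agent that references MAIN_AGENTS resolves to app2, while agents that reference OTHER_AGENTS are unaffected.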
I thought of this as an enhancement to existing mechanisms: if you mark an agent as "run on <cluster name>" in the agent design, the AMgr could look up the cldbdir entry for this NSF to find out the failover rules.
The cldbdir could provide an easy to use admin interface to manage this feature similar to enabling/disabling cluster replication. Of course there are issues to address - like timing on server startup between cldbdir replication and amgr initialization for run-on-cluster agents.
Still far from simple - but it would let Domino Clustering shine even brighter
That's not what I meant. When ping (or ARP) DOES work, but NRPC does NOT -> Domino server down. If ping (ARP) does NOT work -> network issue, state of server unclear. Yes, this would prevent failover in cases where the server has a hardware fault or power failure. But these issues almost never happen. It would work well in case of a Domino crash, and that is by far the most frequent cause of server unavailability.
If ping (or name resolution) doesn't work, it doesn't mean the other server is down. It also doesn't mean that the other server is not running agents or has not already executed these agents. So it's rather tricky to solve this request.
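The rule in this comment can be written down as a small decision function. This is a sketch only: how the two probes are actually performed (ICMP or ARP for `ping_ok`, an NRPC connection attempt for `nrpc_ok`) is left to the caller, and the names `PeerState`, `classify_peer` and `should_fail_over` are made up for illustration.

```python
from enum import Enum


class PeerState(Enum):
    UP = "up"                    # NRPC answers: Domino is running
    DOMINO_DOWN = "domino down"  # host answers, Domino does not
    UNKNOWN = "unknown"          # host unreachable: may be a network issue


def classify_peer(ping_ok: bool, nrpc_ok: bool) -> PeerState:
    """Classify the cluster mate from the two probe results."""
    if nrpc_ok:
        return PeerState.UP
    if ping_ok:
        return PeerState.DOMINO_DOWN
    return PeerState.UNKNOWN


def should_fail_over(ping_ok: bool, nrpc_ok: bool) -> bool:
    """Take over agents only when the host is reachable but Domino is not."""
    return classify_peer(ping_ok, nrpc_ok) is PeerState.DOMINO_DOWN
```

With this rule, a hardware fault or power failure ends up in the UNKNOWN state and does not trigger failover, which is the trade-off accepted in the comment.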
@Thomas Hampel
The server could detect whether the other server is still reachable via ICMP. This of course does not give 100% certainty, but it would be sufficient from my point of view. And: network outages are very rare, Domino crashes are not (unfortunately).
This failover should of course be configurable per agent!
To prevent a split-brain condition you could:
#1 Define a master server to run the agent if there is no connection to the other hosts.
#2 Use at least 3 nodes (servers or dedicated arbiter software), so that a quorum is possible. (See: https://docs.gluster.org/en/v3/Administrator%20Guide/arbiter-volumes-and-quorum/#client-quorum )
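A minimal sketch of the two options, assuming each server knows the cluster size and how many peers it can currently reach. The function names and the `is_master` flag are hypothetical and only illustrate the decision, not an existing Domino setting.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus the peers it can reach form a strict majority."""
    return 2 * (reachable_peers + 1) > cluster_size


def may_run_agent(reachable_peers: int, cluster_size: int,
                  is_master: bool = False) -> bool:
    """Decide whether this node may run the agent after losing contact."""
    if cluster_size >= 3:
        # Option #2: only the side of the partition holding a majority runs.
        return has_quorum(reachable_peers, cluster_size)
    # Option #1: with two nodes no majority is possible once the peer is
    # unreachable, so a designated master breaks the tie.
    return is_master
```

Note that in a 4-server cluster such as Cluster-1, a 2/2 split leaves neither side with a majority, so no side would run the agent until connectivity is restored or an admin intervenes.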
@Thomas, good point. Maybe we should not make it automatic, but rather give the admin an option to fail over all agents to the cluster mate, using one command.
So:
With this, you should be able to tackle the potential network issue, where 2 servers run the same agent, modify the same documents and create replication conflicts after the network is restored.
It still is manual, but only 1 command for all (failover activated) agents.
I'm not sure about the second server field in the agent properties, but I think that might be handy in cases where you have clusters of 3 or more servers and you want 1 particular server to be the failover for a particular agent, but potentially another server for another agent...
And this doesn't necessarily need to be limited to clusters. Let's say you have servers in different regions, with some applications replicating on schedule; it might be handy to use this feature also in case one of these servers is down for a longer period...
Thibaud
In case of a network hiccup, would both agents run? Or how would the servers detect that it's not a server outage but a connectivity issue?
This would solve many problems I encounter regularly. It needs some thought about making it optional or default, maybe by introducing the possibility to run on <ClusterName> instead of run on <ServerName>?
This is a nice suggestion and feature that will make the cluster servers a real cluster even for agents.
Yes, this definitely needs to be addressed.