How does Cobalt handle SQL database replication and fault tolerance across multiple sites?
Andrew Gerber
Hello,
This is a broad question, but I'm wondering how Cobalt handles its SQL databases across multiple sites. Specifically, I'm interested in replication and fault tolerance. For example, the Foldering module has a database it writes to. Presumably there are separate physical databases for site A and site B in Cobalt. How are writes synced between these two databases? How are failure situations handled with Cobalt databases? If we have an outage of a "core" database on one of the sites, how is this handled?
I'm really just looking for an overall picture of "best practices" regarding SQL databases on Cobalt.
Thanks!
Tags: sql, refinitiv-internal
Accepted answers
Lee Bosacker
For the sake of brevity I'll leave out a lot of details, but here's a high-level answer to your questions.
**Replication**
All of the customer-facing Cobalt modules are set up to write to two separate database RAC clusters housed in different sites. Under normal operating conditions, user traffic is routed to application servers across the two data centers, and each server's database connection pools connect to the database cluster in its own site. All write activity is asynchronously replicated to the other site using either Oracle Streams or GoldenGate. Both of these tools read the online transaction log and use queuing mechanisms to send the transactions to the database on the other site. The target site has a replication apply process that effectively replays the insert or update against the target database. The typical latency for these transactions is around 1-3 seconds, which causes very few problems.
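To make the capture/queue/apply flow concrete, here is a minimal, purely illustrative Python sketch of the pattern described above. It is not how Streams or GoldenGate are implemented internally; every name in it is hypothetical, and a dict stands in for the target database.

```python
import queue
import threading
import time

# Illustrative log-based replication: capture -> queue -> apply.
# Each "transaction" is a dict standing in for a redo-log record.
replication_queue = queue.Queue()

def capture(txn):
    """Source site: enqueue a committed transaction for the remote site."""
    txn["captured_at"] = time.time()
    replication_queue.put(txn)

def apply_worker(target_db):
    """Target site: replay inserts/updates against the target database."""
    while True:
        txn = replication_queue.get()
        if txn is None:          # shutdown sentinel
            break
        # Replay: the write lands on the remote copy keyed by primary key.
        target_db[txn["pk"]] = txn["row"]
        replication_queue.task_done()

site_b = {}
worker = threading.Thread(target=apply_worker, args=(site_b,), daemon=True)
worker.start()

capture({"pk": 42, "row": {"name": "folder-1", "CHANGED": time.time()}})
replication_queue.join()         # wait until the apply process catches up
replication_queue.put(None)      # stop the worker
print(site_b)
```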
Unfortunately, we do have occasional replication conflicts when the same data is inserted or updated on both sites. To address these conflicts, the replication tools have conflict detection and resolution procedures set up that handle almost all of these errors. We use a standard conflict resolution rule that can be summarized as "the last one wins": all replicated tables must have a timestamp column that is updated with each transaction, and when conflicts arise the conflict handlers compare the timestamps, keep the newest record, and discard the older one.
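A rough sketch of the "last one wins" rule, assuming each replicated row carries a CHANGED timestamp as described above. The column name comes from the post; the function, the tie-breaking choice, and the row layout are illustrative assumptions, not the actual conflict handlers.

```python
from datetime import datetime

def resolve_conflict(local_row: dict, incoming_row: dict) -> dict:
    """Keep whichever version has the newer CHANGED timestamp.

    Ties favor the incoming (replicated) row here, which is an
    arbitrary illustrative choice.
    """
    if incoming_row["CHANGED"] >= local_row["CHANGED"]:
        return incoming_row   # replicated write is newer: apply it
    return local_row          # local write is newer: discard the replica

local = {"id": 7, "name": "old", "CHANGED": datetime(2023, 1, 1, 12, 0, 0)}
incoming = {"id": 7, "name": "new", "CHANGED": datetime(2023, 1, 1, 12, 0, 2)}
assert resolve_conflict(local, incoming)["name"] == "new"
```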
Even with these procedures in place, there remain a few exceptions, typically due to outages or other failures. To monitor the replication processes we use a tool called Veridata that runs daily comparisons between data on the two sites. Veridata will detect all discrepancies between the two databases, and someone from one of the DevDBA teams will investigate the error, bring the data back into sync, and follow up with the appropriate teams to fix the underlying cause.
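The idea behind that daily comparison can be sketched as hashing each row on both sites and diffing by primary key. This is not Veridata itself, just an assumed simplification of what a cross-site comparison does; the dicts stand in for the same table on each site.

```python
import hashlib

def row_digest(row: dict) -> str:
    """Stable hash of a row's contents, keyed by sorted column name."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def compare_sites(site_a: dict, site_b: dict):
    """Yield rows that are missing or differ between the two sites.

    site_a/site_b map primary key -> row dict.
    """
    for pk in site_a.keys() | site_b.keys():
        a, b = site_a.get(pk), site_b.get(pk)
        if a is None or b is None:
            yield pk, "missing on one site"
        elif row_digest(a) != row_digest(b):
            yield pk, "contents differ"

a = {1: {"name": "x", "CHANGED": "2023-01-01"}}
b = {1: {"name": "y", "CHANGED": "2023-01-02"}, 2: {"name": "z"}}
for pk, problem in compare_sites(a, b):
    print(pk, problem)
```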
Here are a few "best practices" to keep in mind to minimize replication conflicts (a sketch of a compliant update follows the list).
- All DML activity must set the appropriate timestamp column, typically named CHANGED, to the current date/time.
- When possible, user activity on common data should be directed to a single site.
- All tables must include a CHANGED timestamp column and a primary key.
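As a sketch of the first rule, here is what a compliant update might look like from Python, assuming the python-oracledb driver and a hypothetical FOLDER table; the CHANGED column name comes from the list above, while the table, columns, and connection details are assumptions for illustration.

```python
# Illustrative only: assumes the python-oracledb driver and a
# hypothetical FOLDER table with a primary key (ID) and a CHANGED column.
import oracledb

def rename_folder(conn, folder_id: int, new_name: str) -> None:
    """Update a row and stamp CHANGED so conflict resolution can work."""
    with conn.cursor() as cur:
        cur.execute(
            """UPDATE folder
                  SET name = :name,
                      changed = SYSTIMESTAMP   -- required by the replication rules
                WHERE id = :id""",
            {"name": new_name, "id": folder_id},
        )
    conn.commit()

# conn = oracledb.connect(user="app", password="...", dsn="dbhost/svc")
# rename_folder(conn, 42, "Quarterly Reports")
```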
**Fault Tolerance**
There is a lot of fault tolerance built into the overall architecture, but from a database perspective we use Oracle Real Application Clusters (RAC) technology to reduce outages on each database. The RAC databases, running on separate sites, use the replication tools mentioned above, set up in an active-active configuration, to keep the databases in sync across sites.
Each cluster comprises two hosts that each run an Oracle instance and write to a single database residing on a NetApp NAS filer. Oracle's RAC technology maintains synchronous updates to the database. Each host has database listener processes that manage connection requests from the connection pools on the application servers. The cluster services support configuration of database services that connect to a preferred node in the cluster while providing no-loss failover to the other node if a database instance fails.
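From a client's point of view, "preferred node with failover" behaves roughly like the sketch below. In the real setup this lives in the listener and database service configuration, not in application code; the host names and the use of python-oracledb here are assumptions for illustration.

```python
# Illustrative client-side view of "preferred node with failover":
# try the preferred instance first, fall back to the surviving one.
import oracledb

NODE_DSNS = [
    "rac-node1.example/cobalt_svc",   # hypothetical preferred instance
    "rac-node2.example/cobalt_svc",   # hypothetical failover instance
]

def connect_with_failover(user: str, password: str):
    """Return a connection to the first reachable RAC node."""
    last_error = None
    for dsn in NODE_DSNS:
        try:
            return oracledb.connect(user=user, password=password, dsn=dsn)
        except oracledb.DatabaseError as exc:
            last_error = exc          # node down; try the next one
    raise last_error

# conn = connect_with_failover("app", "...")
```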
All comments
Andrew Gerber
Thanks Lee! This was very helpful!
Nidhi.Sen
+1. Great answer!