Oracle Coherence is the industry's leading in-memory distributed data grid solution, enabling companies to scale mission-critical applications. Coherence provides distributed (partitioned) data management and caching services on top of a reliable, highly scalable peer-to-peer clustering protocol.
The peer-to-peer clustering protocol enables Coherence to scale without any centralized point of failure. For this same reason Coherence does not include a centralized monitoring and reporting console. Instead, hundreds of operating metrics per node are provided through a JMX interface. Validating cluster configuration, health, resource usage, activity and performance without a dedicated Coherence monitoring application is often quite challenging, especially for large or complex clusters.
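To illustrate what reading per-node metrics over JMX involves, here is a minimal sketch using the standard `javax.management` API. It registers two stand-in MBeans in the local platform MBean server and sums one attribute across them. The `DemoGrid` domain, the `NodeStatsMBean` interface, and the attribute values are assumptions made for this example; in a real cluster the pattern would instead target the MBeans that Coherence itself registers, over a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class NodeMetricsSketch {

    /** Hypothetical management interface standing in for a per-node MBean. */
    public interface NodeStatsMBean {
        int getMemoryAvailableMB();
        int getMemoryMaxMB();
    }

    /** Trivial in-process implementation used only for this demonstration. */
    public static class NodeStats implements NodeStatsMBean {
        private final int availableMb;
        private final int maxMb;

        public NodeStats(int availableMb, int maxMb) {
            this.availableMb = availableMb;
            this.maxMb = maxMb;
        }

        @Override public int getMemoryAvailableMB() { return availableMb; }
        @Override public int getMemoryMaxMB() { return maxMb; }
    }

    /** Reads one numeric attribute from every MBean matching the pattern and sums it. */
    public static long sumAttribute(MBeanServerConnection conn, String pattern, String attribute)
            throws Exception {
        long total = 0;
        Set<ObjectName> names = conn.queryNames(new ObjectName(pattern), null);
        for (ObjectName name : names) {
            total += ((Number) conn.getAttribute(name, attribute)).longValue();
        }
        return total;
    }

    /** Registers a stand-in node MBean unless one with that name already exists. */
    private static void register(MBeanServer server, String name, NodeStatsMBean bean)
            throws Exception {
        ObjectName objectName = new ObjectName(name);
        if (!server.isRegistered(objectName)) {
            server.registerMBean(new StandardMBean(bean, NodeStatsMBean.class), objectName);
        }
    }

    /** Registers two made-up nodes and returns their combined available memory in MB. */
    public static long demoTotalAvailableMb() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            register(server, "DemoGrid:type=Node,nodeId=1", new NodeStats(300, 1024));
            register(server, "DemoGrid:type=Node,nodeId=2", new NodeStats(800, 1024));
            return sumAttribute(server, "DemoGrid:type=Node,*", "MemoryAvailableMB");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demoTotalAvailableMb()); // prints 1100
    }
}
```

Even this small example needs one attribute read per node per metric, which hints at why collecting hundreds of metrics from every node by hand quickly becomes impractical.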
First released in 2007, RTView Oracle Coherence Monitor (OCM) provides a comprehensive monitoring, performance tuning and visibility solution for Oracle Coherence. Unlike some monitoring tools, RTView OCM is typically used by Coherence architects, developers and operations staff alike. The monitor is lightweight, scalable, and easy to install and configure. Data is collected, analyzed and presented in real-time without the latency associated with database-centric monitoring systems. Typically metrics are updated at 10- to 30-second intervals, so the monitor will capture the critical events associated with system latency and performance degradation.
Click on a link below to learn more about the OCM’s functionality:
- Cluster Physical and Logical Configuration
- Cluster Activity and Workload, Current and Historical
- Cluster Health, Resource Usage and Important KPIs
- Hotspot and Bottleneck Analysis Through Innovative Visual Analytics
- Alerting and Most Important Cluster KPIs
- Efficient and Innovative Cluster-Wide JMX MBean Data Collection
- Lightweight Footprint Suitable for Use with Production, Test and Development Clusters
Is your cluster configured as intended? RTView OCM presents a complete picture of your Coherence cluster configuration including both physical and logical resources. Important configuration information includes physical resources like hosts, nodes (JVMs), memory, CPU, threads, etc. The OCM also reports on Coherence-specific logical resources. These include the cluster service and its current membership, member roles, the distributed and replicated cache services, the caches and cache configurations associated with those services, and the number of objects in each cache. Other logical resources include the invocation service, the proxy service, and the client connections for each proxy node.
Cluster Overview display showing high-level view of cluster configuration, activity and health
Cluster activity and workload are monitored for all services and service types. The information is presented in real-time and persisted to the OCM Historian, so both short- and long-term trends may be easily analyzed from the same interface. In Coherence, performance data is usually reported at the service level (messages, requests and pending requests, task backlog, threads and abandoned threads, etc.). Workload (puts and gets, evictions and expiry, database stores and store failures, etc.) is reported at the cache level.
Single Service Summary display shows health, activity and performance for the storage nodes for this distributed cache service
Current Size Chart shows the total number of objects in all distributed caches and the memory consumed for the primary objects in bar chart format. The “Table” button displays the current information in table format for easy export.
Is the cluster stable with good communication health among all members? Does the cluster have enough capacity (memory, network, CPU, threads, high units) overall, and is any individual member or individual cache capacity-constrained or in an invalid state? RTView OCM offers advanced cluster-wide analytics to answer these questions. The current state of all available Coherence resources is monitored, including the HA status of each service, the amount of data in each cache, the performance metrics reported by services, and failures in writing to the database.
This heatmap shows the six hosts in the cluster. Each small square is a node (JVM) in the cluster. The box size represents the heap size of the node. The green color shows CPU usage by the node.
Hotspots, bottlenecks and latency in Coherence are depicted in innovative “Heat Map History” displays that show holistic cluster behavior over time as rows of light and dark squares. Each row is a trend chart representing something about a physical or logical resource over time. By stacking these together, you can see patterns in the cluster over time. Users can mouse over the squares to see the underlying metrics. Heat map history displays allow users to “see” internal Coherence “load balancing” of cluster activity and resource usage. You can also easily see queue formation and persistence across nodes. Users can quickly identify unexpected behavior or poorly distributed activity or resource usage, and see where and when it occurred.
A dark vertical line indicates a cluster-wide event at a point in time. A dark spot indicates a hotspot in a specific resource at a point in time; for example, a data object may be “hot,” with requests queued against it at certain times. A dark horizontal line indicates a hotspot that persists over time in a physical or logical resource. These displays are very helpful in troubleshooting and finding the root cause of issues in production, as well as in interpreting load and performance test results and analyzing the effects of code changes.
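The way these patterns are read can be expressed as a small sketch. It treats the heat map as a grid of booleans, with one row per resource and one column per time slot, and flags columns where most resources are hot (a vertical line) and rows that are hot most of the time (a horizontal line). The grid representation, the method names, and the `minFraction` threshold are assumptions for illustration only; they do not describe how the OCM renders or analyzes its displays.

```java
import java.util.ArrayList;
import java.util.List;

public class HeatMapPatternSketch {

    /** Time slots (columns) in which at least minFraction of resources are hot. */
    public static List<Integer> clusterWideEvents(boolean[][] hot, double minFraction) {
        List<Integer> columns = new ArrayList<>();
        if (hot.length == 0) {
            return columns;
        }
        // Assumes a rectangular grid: every row has the same number of time slots.
        for (int t = 0; t < hot[0].length; t++) {
            int count = 0;
            for (boolean[] row : hot) {
                if (row[t]) {
                    count++;
                }
            }
            if ((double) count / hot.length >= minFraction) {
                columns.add(t);
            }
        }
        return columns;
    }

    /** Resources (rows) that are hot in at least minFraction of time slots. */
    public static List<Integer> persistentHotspots(boolean[][] hot, double minFraction) {
        List<Integer> rows = new ArrayList<>();
        for (int r = 0; r < hot.length; r++) {
            if (hot[r].length == 0) {
                continue;
            }
            int count = 0;
            for (boolean cell : hot[r]) {
                if (cell) {
                    count++;
                }
            }
            if ((double) count / hot[r].length >= minFraction) {
                rows.add(r);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up data: resource 1 is hot throughout; every resource is hot at time slot 2.
        boolean[][] hot = {
            {false, false, true, false},
            {true,  true,  true, true},
            {false, false, true, false},
        };
        System.out.println(clusterWideEvents(hot, 0.9));  // prints [2]
        System.out.println(persistentHotspots(hot, 0.9)); // prints [1]
    }
}
```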
This display shows memory usage and garbage collection over time for all the cluster JVMs. Note that the JVMs have been aggregated by role. Storage nodes usually have very different memory profiles from process (non-storage) nodes.
Coherence is frequently a component in critical business applications where operations staff need proactive monitoring so that fault conditions can be addressed before users are affected. RTView OCM includes 18 pre-defined alerts to identify the most important cluster failure and performance degradation conditions. Most of these alerts involve cluster-wide analytics.
Many problems in individual nodes are transient and do not require intervention, whereas problems in the cluster as a whole should be addressed immediately. For example, low available memory or unusual network packet loss in a single node may be an expected, transient condition related to normal garbage collection. However, a cluster-wide low available memory condition or unusual cluster-wide packet loss would be an abnormal condition calling for immediate alerting and investigation. Alert notifications via email or to another alerting system are easy to configure.
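The distinction between a single-node condition and a cluster-wide one can be sketched as follows. The method name, the 100 MB threshold, and the rule "alert when at least half the nodes are low" are assumptions chosen for illustration; they are not the OCM's actual alert definitions.

```java
import java.util.Arrays;
import java.util.List;

public class ClusterAlertSketch {

    /**
     * Returns true when at least minFraction of nodes report available memory
     * below lowMbThreshold, i.e. when the condition is cluster-wide rather than
     * confined to one or two nodes.
     */
    public static boolean clusterLowMemory(List<Integer> availableMbPerNode,
                                           int lowMbThreshold,
                                           double minFraction) {
        if (availableMbPerNode.isEmpty()) {
            return false;
        }
        long lowNodes = availableMbPerNode.stream()
                .filter(mb -> mb < lowMbThreshold)
                .count();
        return (double) lowNodes / availableMbPerNode.size() >= minFraction;
    }

    public static void main(String[] args) {
        // One node low out of four: likely transient, no cluster-wide alert.
        System.out.println(clusterLowMemory(Arrays.asList(50, 900, 850, 870), 100, 0.5)); // false
        // Three nodes low out of four: cluster-wide condition.
        System.out.println(clusterLowMemory(Arrays.asList(50, 60, 40, 870), 100, 0.5));   // true
    }
}
```

Evaluating the fraction of affected nodes, rather than each node in isolation, is one simple way to avoid alerting on the transient single-node dips described above.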
Just click on an alert to change the “Current Alert Settings”. Alerting may be globally enabled or disabled.
It is surprisingly difficult to collect consistent, accurate monitoring data from Coherence every 10 to 30 seconds via JMX polling with low impact on the cluster. SL has developed and implemented a number of proprietary techniques based on years of experience working with some of the largest Coherence clusters in operation today. The main problems to be solved are collecting all MBean data at a single point in time, without time skew (for accuracy) and without impact on the cluster, and improving the performance of the JMX MBean server so that tens of thousands of MBeans can be returned quickly enough to allow sub-minute polling intervals.
The JMX Ds Stats display tracks performance of all cluster JMX requests. The total execution time must be less than the polling interval in order to return accurate data.
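The constraint in this caption amounts to a simple budget check, sketched below. It assumes the requests in one polling cycle run one after another, so that their execution times add up; the class name, method names, and timings are made up for the example.

```java
public class PollingBudgetSketch {

    /** Sums the execution times of the JMX requests made in one polling cycle. */
    public static long totalExecutionMillis(long[] requestMillis) {
        long total = 0;
        for (long ms : requestMillis) {
            total += ms;
        }
        return total;
    }

    /** True when the whole cycle completes before the next poll is due. */
    public static boolean fitsInterval(long[] requestMillis, long pollingIntervalMillis) {
        return totalExecutionMillis(requestMillis) < pollingIntervalMillis;
    }

    public static void main(String[] args) {
        long[] cycle = {1200, 3400, 2100}; // made-up timings, 6700 ms in total
        System.out.println(fitsInterval(cycle, 10_000)); // true: fits a 10-second interval
        System.out.println(fitsInterval(cycle, 5_000));  // false: a 5-second interval is too short
    }
}
```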
RTView OCM carefully monitors the monitor and provides detailed reporting on every JMX request it makes. The information can be used to tune the monitor’s polling interval while ensuring the accuracy of the returned data. A “JMX Tables” custom MBean is provided with the OCM to reduce the cluster overhead of JMX data collection. The network overhead reduction is estimated at 66%. RTView OCM also provides three different JMX cluster connection types, including an alternate “SuperSize” management framework.
Lightweight Footprint Suitable for Use with Production, Test and Development Clusters
Like Coherence itself, the OCM is written in Java, operates in memory, and scales through partitioning. It is lightweight, flexible, and easy to install and configure. To view monitoring data collected and aggregated in the OCM, RTView provides multiple display and reporting clients that can be used out of the box. A Java desktop client provides the best interaction, while a browser client provides remote access from multiple devices, including mobile phones, via HTML server deployment.
Additionally, all of the monitoring data collected by the OCM, both raw and processed (aggregated), is available to users via a REST web service, XML HTTP Request, or a simple Java API. Both current and historical data can be extracted from the OCM and presented in a user's own reporting tools or written to .csv files or Excel for further analysis.
Data collected by the OCM Historian may be written to any JDBC-enabled database, such as Oracle, and accessed by other reporting tools as needed.
Caches/Nodes/Alerts display as viewed on iPhone. Page down to see active alerts if any. Click on any portion of the display to see more information.