Monitoring process failures
When a process fails or restarts, the Availability Management Framework (AMF) module writes a log entry that describes why and how the process died. Process failure logs fall into 3 main categories.
Each category has its own log-details format, which determines how to parse information from the entry.
Each log entry uses this format:
Log entry format
<DATE> <TIME> <hostname> <processid> Blue Cedar: [LOGCLASS], SubCls:<XYZ>, EID: X, Type: <type>, Sev:<severity>, <log details>
Each log entry can be parsed based on 3 items: the LOGCLASS, the SubCls, and the log-details. The layout of the log-details field is specific to each category. To parse it correctly, refer to the section for that category.
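As an illustration, the generic format can be matched with a regular expression. The Python sketch below is one possible approach and not part of the product. The names ENTRY_RE and parse_entry are hypothetical, and everything before "Blue Cedar:" is captured as a single loose prefix on the assumption that its exact layout may vary.

```python
import re

# Pattern for the generic entry layout. The prefix (date, time, host, and
# so on) is kept as one group because its exact layout is an assumption.
ENTRY_RE = re.compile(
    r"^(?P<prefix>.*?)\s*Blue Cedar: \[(?P<logclass>[^\]]+)\], "
    r"SubCls:\s*(?P<subcls>[^,]+), "
    r"EID:\s*(?P<eid>[^,]+), "
    r"Type:\s*(?P<type>[^,]+), "
    r"Sev:\s*(?P<severity>[^,]+), "
    r"(?P<details>.*)$",
    re.DOTALL,
)


def parse_entry(line):
    """Return the header fields and raw log-details as a dict, or None."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}
```

The returned "subcls" value (010, 013, or 999) can then be used to decide which category-specific parsing to apply to the "details" value.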
A process has died or was killed
When a process dies, AMF generates a log entry with these fields.
- LOGCLASS: AMFAGENT
- SubCls: 010
- Log-details
The log-details portion of the generic log entry format has this layout:
Log details format
Hard Error#012Component: <component>#012Reporting Process: <process>#012Attributes: <attributes>#012Description: <description>#012Details: <details>
The table below describes each field, and an example follows it.
Field | Description |
---|---|
Component | A string containing some internal information about the "component" (or process) that has died. Check "safComp" for the name of the process. Example: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway |
Reporting Process | The internal process which is reporting the failure. This can usually be ignored. |
Attributes | The attributes of why the process died (usually due to a Linux signal). |
Description | Why the process died. This may also indicate the Linux signal number which caused the process to die. |
Details | A string containing the processName of the process which has died and information about the process IDs. |
Example:
Nov 4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID: 0, Type: Fault, Sev:Critical, Hard Error#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway#012Reporting Process: elemProgramMgr#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated#012Description: Linux application died due to signal 15 (Terminated)#012Details: processName:aaa spid:10071 pid:3013
To make the above example easier to read, replace each "#012" with a newline (\n). #012 is the octal code for the newline character:
Nov 4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID: 0, Type: Fault, Sev:Critical, Hard Error
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
Reporting Process: elemProgramMgr
Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated
Description: Linux application died due to signal 15 (Terminated)
Details: processName:aaa spid:10071 pid:3013
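The replacement and field lookup described above can be sketched in Python. This is a minimal illustration that assumes the log-details string has already been separated from the rest of the entry. The function name parse_hard_error and the extra "signal" and "process_name" keys are hypothetical conveniences, not part of the log format.

```python
import re


def parse_hard_error(details):
    """Split a SubCls 010 log-details string into its labelled fields."""
    # "#012" stands for the newline that separates the fields.
    lines = details.replace("#012", "\n").split("\n")
    fields = {"summary": lines[0].strip()}
    for line in lines[1:]:
        # Split on the first colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    # Extras are filled in only when the expected text is present.
    signal = re.search(r"signal (\d+)", fields.get("Description", ""))
    fields["signal"] = int(signal.group(1)) if signal else None
    name = re.search(r"processName:(\S+)", fields.get("Details", ""))
    fields["process_name"] = name.group(1) if name else None
    return fields
```

For the example entry above, this sketch yields a "Reporting Process" value of elemProgramMgr, a signal of 15, and a process_name of aaa.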
AMF terminates a process
Each process responds to keepalive events sent by the AMF module. If a process becomes unresponsive, AMF restarts it. Under the default keepalive policy, a keepalive is sent every 60 seconds and a process must reply within 5 minutes. This policy cannot be modified. When AMF restarts the process, it generates a log entry with these fields.
- LOGCLASS: AMFAGENT
- SubCls: 013
- Log-details
The log-details portion of the generic log entry format has this layout:
Log details format
Healthcheck timeout for key <key>#012Component: <component>
Field | Description |
---|---|
Key | Internal key ID. Usually MAG_Watchdog. |
Component | A string containing some internal information about the "component" (or process) that failed the health check. Check "safComp" for the name of the process. Example: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway |
Example:
Nov 4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID: 0, Type: Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
To make the above example easier to read, replace each "#012" with a newline (\n):
Nov 4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID: 0, Type: Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
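A health check timeout entry can be reduced to its key and process name with a short Python sketch like the one below. It assumes the log-details string has already been isolated, and it accepts either "#012" or a real newline as the separator. The names HEALTHCHECK_RE, parse_component, and parse_healthcheck_timeout are hypothetical.

```python
import re

# Matches the SubCls 013 log-details layout with either separator form.
HEALTHCHECK_RE = re.compile(
    r"^Healthcheck timeout for key (?P<key>\S+?)(?:#012|\n)"
    r"Component: (?P<component>.+)$"
)


def parse_component(component):
    """Turn 'safComp=aaa,safSu=SU_1,...' into a dict of name/value pairs."""
    pairs = (item.split("=", 1) for item in component.split(",") if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}


def parse_healthcheck_timeout(details):
    """Return the key, process name, and component parts, or None."""
    match = HEALTHCHECK_RE.match(details.strip())
    if match is None:
        return None
    component = parse_component(match.group("component"))
    return {
        "key": match.group("key"),
        "process": component.get("safComp"),  # safComp holds the process name
        "component": component,
    }
```

The parse_component helper applies equally to the Component field of the other two log categories, since all three use the same comma-separated layout.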
A process is restarted too many times within the restart policy
The Connect Gateway has a default restart policy that tracks the number of times a process restarts within a period of time. This policy cannot be modified. Under the default policy, if a process restarts 3 times within 20 seconds, that process may not be started again. When this limit is reached, AMF generates a log entry with these fields.
- LOGCLASS: AMFAGENT
- SubCls: 999
- Log-details
The log-details portion of the generic log entry format has this layout:
Log details format
CORE : Component <component> has restarted <number> times within component probation period of <period> ms, restart SU
Field | Description |
---|---|
Component | A string containing some internal information about the "component" (or process) that was restarted. Check "safComp" for the name of the process. Example: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway |
Number | The default restart policy for the number of times a process can restart is 3. This cannot be modified. |
Period | The default restart policy for the restart period is 20 seconds (20000 ms). This cannot be modified. |
Example:
Nov 4 15:26:29 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:999, EID: 0, Type: Config, Sev:Major, CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway has restarted 3 times within component probation period of 20000 ms, restart SU
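The restart count and probation period can be pulled from this entry with a regular expression. The Python sketch below is an illustration only. The names RESTART_RE and parse_restart_limit are hypothetical, and the pattern assumes the Component string contains no spaces, as in the example above.

```python
import re

# Matches the SubCls 999 log-details layout.
RESTART_RE = re.compile(
    r"Component (?P<component>\S+) has restarted (?P<count>\d+) times "
    r"within component probation period of (?P<period_ms>\d+) ms"
)


def parse_restart_limit(details):
    """Return the component, restart count, and period in seconds, or None."""
    match = RESTART_RE.search(details)
    if match is None:
        return None
    return {
        "component": match.group("component"),
        "count": int(match.group("count")),
        # The log reports milliseconds; convert to seconds for readability.
        "period_seconds": int(match.group("period_ms")) / 1000,
    }
```

For the example entry above, the sketch reports a count of 3 and a period of 20.0 seconds, matching the default restart policy.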