
Monitoring process failures

When a process fails or restarts, the Availability Management Framework (AMF) module writes a log entry describing why and how the process died. Process-failure logs fall into 3 main categories.

Each category has its own log-details format, which you can use to parse information from the log.

Each log entry uses this format:

Log entry format

TEXT
<DATE> <TIME> <hostname> <processid> Blue Cedar: [LOGCLASS], SubCls:<XYZ>, EID:          X, Type:   <type>, Sev:<severity>, <log details>

Each log can be parsed based on 3 items: the LOGCLASS, the SubCls, and the log-details (shown as <log details> in the format above). The format of the log-details field is specific to each category. To parse it correctly, refer to the section for that category.
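The generic format above can be split into its fields with a regular expression. The following is a minimal sketch, not an official parser: the names ENTRY_RE and parse_entry are hypothetical, and the pattern assumes that the padding after "EID:" and "Type:" varies, as it does in the examples on this page.

```python
import re

# Hypothetical pattern for the generic entry format. Whitespace after
# "EID:" and "Type:" is matched loosely because the padding varies.
ENTRY_RE = re.compile(
    r"Blue Cedar: \[(?P<logclass>[^\]]+)\], "
    r"SubCls:(?P<subcls>\w+), "
    r"EID:\s*(?P<eid>[^,\s]+), "
    r"Type:\s*(?P<type>\w+), "
    r"Sev:(?P<severity>\w+), "
    r"(?P<details>.*)",
    re.DOTALL,
)

def parse_entry(line):
    """Return the header fields and raw log-details, or None if no match."""
    match = ENTRY_RE.search(line)
    return match.groupdict() if match else None

sample = (
    "Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], "
    "SubCls:013, EID:     0, Type:   Fault, Sev:Critical, "
    "Healthcheck timeout for key MAG_Watchdog#012Component: "
    "safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
)
entry = parse_entry(sample)
print(entry["logclass"], entry["subcls"], entry["severity"])
```

The details group holds the raw log-details string, which can then be handed to a category-specific parser chosen by the subcls value.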

A process has died or was killed

When a process dies, AMF generates a log entry with these fields.

  • LOGCLASS: AMFAGENT
  • SubCls: 010
  • Log-details

Format of log-details from the generic log entry format at the top:

Log details example

TEXT
Hard Error#012Component: <component>#012Reporting Process: <process>#012Attributes: <attributes>#012Description: <description>#012Details: <details>

More info and examples are located below.

Field descriptions:

  • Component: A string containing some internal information about the "component" (or process) which has died. Check "safComp" for the name of the process. Example:

TEXT
safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

  • Reporting Process: The internal process which is reporting the failure. This can usually be ignored.
  • Attributes: The attributes of why the process died (usually due to a Linux signal).
  • Description: Why the process died. This may also indicate the Linux signal number which caused the process to die.
  • Details: A string containing the processName of the process which has died and information about the process IDs.
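The Component string is a comma-separated list of key=value pairs, so the process name can be read from its "safComp" key. A minimal sketch, where parse_component is a hypothetical helper name and the input is the example string from this page:

```python
def parse_component(component):
    """Split a Component string into a dict of its key=value pairs."""
    return dict(pair.split("=", 1) for pair in component.split(","))

component = (
    "safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,"
    "safApp=Bluecedar_Gateway"
)
parts = parse_component(component)
print(parts["safComp"])  # aaa
```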


Example:

CODE
Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway#012Reporting Process: elemProgramMgr#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated#012Description: Linux application died due to signal 15 (Terminated)#012Details: processName:aaa  spid:10071  pid:3013


To make the example above easier to read, replace each "#012" with a newline (\n):

CODE
Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
Reporting Process: elemProgramMgr
Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated
Description: Linux application died due to signal 15 (Terminated)
Details: processName:aaa  spid:10071  pid:3013
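The log-details string for this category can be split on "#012" and then on the first ": " of each line. The sketch below assumes the field layout shown in the example above. The name parse_hard_error and the "summary" key are hypothetical, chosen only for illustration.

```python
import re

def parse_hard_error(details):
    """Parse a SubCls 010 log-details string into a dict of its fields."""
    lines = details.split("#012")  # "#012" stands for a newline
    fields = {"summary": lines[0]}  # first line, e.g. "Hard Error"
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        fields[name] = value
    return fields

details = (
    "Hard Error#012Component: safComp=aaa,safSu=SU_1,"
    "safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
    "#012Reporting Process: elemProgramMgr"
    "#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated"
    "#012Description: Linux application died due to signal 15 (Terminated)"
    "#012Details: processName:aaa  spid:10071  pid:3013"
)
fields = parse_hard_error(details)

# Pull the signal number out of the description, if one is present.
signal = re.search(r"signal (\d+)", fields["Description"])
# The Details field holds whitespace-separated name:value tokens.
ids = dict(token.split(":", 1) for token in fields["Details"].split())

print(fields["Reporting Process"])          # elemProgramMgr
print(signal.group(1) if signal else None)  # 15
print(ids["processName"], ids["pid"])       # aaa 3013
```

Because the description only "may" include a signal number, the search result is checked before it is used.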

AMF terminates a process

Each process responds to keepalive events sent by the AMF modules. If a process becomes unresponsive, AMF restarts the process. Under the default keepalive policy, a keepalive is sent every 60 seconds and a process must reply within 5 minutes. This policy cannot be modified. When AMF restarts the process, the log entry includes these fields.

  • LOGCLASS: AMFAGENT
  • SubCls: 013
  • Log-details

Format of log-details from the generic log entry format at the top:

Log details example

TEXT
Healthcheck timeout for key <key>#012Component: <component>


Field descriptions:

  • Key: Internal Key ID. Usually MAG_Watchdog
  • Component: A string containing some internal information about the "component" (or process) which has died. Check "safComp" for the name of the process. Example:

TEXT
safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway


Example:

TEXT
Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

To make the example above easier to read, replace each "#012" with a newline (\n):

TEXT
Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
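A healthcheck-timeout entry carries only the key and the component, so a single pattern is enough to extract both. This is a sketch under the assumption that the separator is either the literal "#012" or a real newline. HEALTHCHECK_RE and parse_healthcheck_timeout are hypothetical names.

```python
import re

# Accept either the escaped separator "#012" or an actual newline.
HEALTHCHECK_RE = re.compile(
    r"Healthcheck timeout for key (?P<key>\S+?)(?:#012|\n)"
    r"Component: (?P<component>\S+)"
)

def parse_healthcheck_timeout(details):
    """Return the key, component, and process name, or None if no match."""
    match = HEALTHCHECK_RE.search(details)
    if match is None:
        return None
    result = match.groupdict()
    name = re.search(r"safComp=([^,]+)", result["component"])
    result["process"] = name.group(1) if name else None
    return result

details = (
    "Healthcheck timeout for key MAG_Watchdog#012Component: "
    "safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
)
timeout = parse_healthcheck_timeout(details)
print(timeout["key"], timeout["process"])  # MAG_Watchdog aaa
```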

A process is restarted too many times within the restart policy

The Connect Gateway has a default restart policy that tracks how many times a process restarts within a period of time. Under the default policy, if a process restarts 3 times within 20 seconds, that process may not be started again. This policy cannot be modified. When the limit is reached, the log entry includes these fields.

  • LOGCLASS: AMFAGENT
  • SubCls: 999
  • Log-details

Format of log-details from the generic log entry format at the top:

Log details example

TEXT
CORE : Component <component> has restarted <number> times within component probation period of <period> ms, restart SU

Field descriptions:

  • Component: A string containing some internal information about the "component" (or process) which has died. Check "safComp" for the name of the process. Example:

TEXT
safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

  • Number: The default restart policy for the number of times a process can restart is 3. This cannot be modified.
  • Period: The default restart policy for the restart period is 20 seconds (20000 ms). This cannot be modified.


Example:

TEXT
Nov  4 15:26:29 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:999, EID:          0, Type:  Config, Sev:Major, CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway has restarted 3 times within component probation period of 20000 ms, restart SU
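The restart-policy message embeds the component, the restart count, and the probation period in a fixed sentence, so the three values can be captured with one pattern. A minimal sketch, assuming the wording shown in the example above. RESTART_RE and parse_restart_limit are hypothetical names.

```python
import re

RESTART_RE = re.compile(
    r"Component (?P<component>\S+) has restarted (?P<number>\d+) times "
    r"within component probation period of (?P<period_ms>\d+) ms"
)

def parse_restart_limit(details):
    """Return component, restart count, and period in ms, or None."""
    match = RESTART_RE.search(details)
    if match is None:
        return None
    return {
        "component": match.group("component"),
        "number": int(match.group("number")),
        "period_ms": int(match.group("period_ms")),
    }

details = (
    "CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,"
    "safApp=Bluecedar_Gateway has restarted 3 times within component "
    "probation period of 20000 ms, restart SU"
)
restart = parse_restart_limit(details)
print(restart["number"], restart["period_ms"] / 1000)  # 3 20.0
```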


