When a process fails or restarts, the Availability Management Framework (AMF) module writes a log entry describing why and how the process died. Process-failure logs fall into three main categories.

Each category uses its own format, so the information you can parse from a log entry depends on its category.

Each log entry uses this format:

Log entry format

<DATE> <TIME> <hostname> <processid> Blue Cedar: [LOGCLASS], SubCls:<XYZ>, EID:          X, Type:   <type>, Sev:<severity>, <log details>

Each log entry can be parsed using three items: the LOGCLASS, the SubCls, and the log details. The LOGCLASS and SubCls together identify the category, and the format of the log-details field depends on that category. To parse the log-details field correctly, refer to the section for that category below.
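As a starting point, the generic format can be matched with a regular expression. The following is a minimal Python sketch, assuming the field layout shown above and the spacing seen in the examples in this article. In those examples the token before "Blue Cedar:" is the syslog tag "journal:", so the pattern accepts any token with an optional trailing colon there. `LOG_RE`, `parse_log_entry`, `classify`, and the category labels are illustrative names, not part of the Connect Gateway.

```python
import re
from typing import Optional

# Illustrative pattern for the generic format:
# <DATE> <TIME> <hostname> <processid> Blue Cedar: [LOGCLASS], SubCls:<XYZ>, EID: X, Type: <type>, Sev:<severity>, <log details>
# The amount of whitespace after "EID:" and "Type:" varies, so it is matched loosely.
LOG_RE = re.compile(
    r"^(?P<date>\w{3}\s+\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<hostname>\S+)\s+(?P<process>\S+?):?\s+Blue Cedar:\s+"
    r"\[(?P<logclass>[^\]]+)\],\s*SubCls:(?P<subcls>\d+),\s*"
    r"EID:\s*(?P<eid>\d+),\s*Type:\s*(?P<type>[^,]+),\s*"
    r"Sev:(?P<severity>[^,]+),\s*(?P<details>.*)$"
)

# The three categories described in this article, keyed by (LOGCLASS, SubCls).
CATEGORIES = {
    ("AMFAGENT", "010"): "process died or was killed",
    ("AMFAGENT", "013"): "AMF terminated an unresponsive process",
    ("AMFAGENT", "999"): "restart policy exceeded",
}


def parse_log_entry(line: str) -> Optional[dict]:
    """Split one log line into its generic fields; return None if it does not match."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


def classify(entry: dict) -> str:
    """Map an entry's (LOGCLASS, SubCls) pair to one of the known categories."""
    return CATEGORIES.get((entry["logclass"], entry["subcls"]), "unknown")


if __name__ == "__main__":
    sample = (
        "Nov  4 15:26:29 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:999, "
        "EID:          0, Type:  Config, Sev:Major, CORE : Component safComp=aaa,"
        "safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway has restarted "
        "3 times within component probation period of 20000 ms, restart SU"
    )
    entry = parse_log_entry(sample)
    print(entry["subcls"], entry["severity"], classify(entry))
```

The `details` group keeps the raw, category-specific text (including any "#012" sequences), which can then be handed to a parser for that category.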

A process has died or was killed

When a process dies, AMF generates a log entry with these fields.

  • LOGCLASS: AMFAGENT
  • SubCls: 010
  • Log-details

The <log details> portion of the generic log entry has this format:

Log details format

Hard Error#012Component: <component>#012Reporting Process: <process>#012Attributes: <attributes>#012Description: <description>#012Details: <details>

Each field is described below, followed by a complete example.

Field descriptions:

  • Component: A string containing internal information about the "component" (or process) that died. The "safComp" value is the name of the process. Example:

      safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

  • Reporting Process: The internal process that is reporting the failure. This can usually be ignored.
  • Attributes: The attributes that describe why the process died (usually a Linux signal).
  • Description: Why the process died. This may also give the number of the Linux signal that caused the process to die.
  • Details: A string containing the processName of the process that died, plus information about its process IDs.
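Because the process name is carried in the safComp item of the Component string, a small helper can pull it out. This is a minimal Python sketch, assuming the Component string is always a comma-separated list of key=value items whose values contain no commas, as in the example above; `parse_component` and `process_name` are hypothetical helper names.

```python
from typing import Dict, Optional


def parse_component(component: str) -> Dict[str, str]:
    """Split 'safComp=aaa,safSu=SU_1,...' into {'safComp': 'aaa', 'safSu': 'SU_1', ...}.

    Assumes items are separated by commas and values contain no commas.
    """
    parts: Dict[str, str] = {}
    for item in component.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:  # ignore any item without '='
            parts[key] = value
    return parts


def process_name(component: str) -> Optional[str]:
    """Return the safComp value, which names the process, or None if it is absent."""
    return parse_component(component).get("safComp")
```

The same helper applies to the Component field in all three categories, since they share this layout.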


Example:

Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway#012Reporting Process: elemProgramMgr#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated#012Description: Linux application died due to signal 15 (Terminated)#012Details: processName:aaa  spid:10071  pid:3013

To make the example above easier to read, replace each "#012" with a newline (\n):

Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
Reporting Process: elemProgramMgr
Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated
Description: Linux application died due to signal 15 (Terminated)
Details: processName:aaa  spid:10071  pid:3013
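The same readable view, and the individual fields, can be produced programmatically. The following is a minimal Python sketch, assuming the log details always use the "#012" separator and the "Key: value" layout shown above ("#012" is most likely syslog's octal escape for a newline character). `make_readable` and `parse_hard_error` are hypothetical names, and the signal lookup simply takes the first "signal <N>" found in the Description.

```python
import re
from typing import Dict, Optional


def make_readable(details: str) -> str:
    """Replace each '#012' (an escaped newline) with a real newline."""
    return details.replace("#012", "\n")


def parse_hard_error(details: str) -> Dict[str, object]:
    """Parse the log details of an AMFAGENT SubCls:010 ('Hard Error') entry."""
    lines = details.split("#012")
    result: Dict[str, object] = {"summary": lines[0].strip()}  # e.g. "Hard Error"
    for line in lines[1:]:
        # Split on the first ':' only; the Details value contains more colons.
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()

    # Details looks like "processName:aaa  spid:10071  pid:3013".
    fields: Dict[str, str] = {}
    for token in str(result.get("Details", "")).split():
        k, sep, v = token.partition(":")
        if sep:
            fields[k] = v
    result["details_fields"] = fields

    # Description may name the signal, e.g. "... died due to signal 15 (Terminated)".
    m = re.search(r"signal (\d+)", str(result.get("Description", "")))
    signal_number: Optional[int] = int(m.group(1)) if m else None
    result["signal"] = signal_number
    return result
```

On Linux, Python's standard library can translate the signal number into a name: `signal.Signals(15).name` returns "SIGTERM", which matches the "(Terminated)" text in the example.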

AMF terminates a process

Each process responds to keepalive events sent by the AMF modules. If a process becomes unresponsive, AMF restarts it. Under the default keepalive policy, a keepalive is sent every 60 seconds and a process must reply within 5 minutes. This policy cannot be modified. When AMF restarts a process for this reason, the log entry includes these fields:

  • LOGCLASS: AMFAGENT
  • SubCls: 013
  • Log-details

The <log details> portion of the generic log entry has this format:

Log details format

Healthcheck timeout for key <key>#012Component: <component>

Field descriptions:

  • Key: The internal key ID, usually MAG_Watchdog.
  • Component: A string containing internal information about the "component" (or process) that timed out. The "safComp" value is the name of the process. Example:

      safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway


Example:

Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

To make the example above easier to read, replace each "#012" with a newline (\n):

Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
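A healthcheck-timeout entry has only two fields, so a single regular expression is enough. This minimal Python sketch assumes the log details match the format above exactly, with no spaces inside the key or the component string; `HEALTHCHECK_RE` and `parse_healthcheck_timeout` are illustrative names.

```python
import re
from typing import Dict, Optional

# Illustrative pattern for: Healthcheck timeout for key <key>#012Component: <component>
HEALTHCHECK_RE = re.compile(
    r"^Healthcheck timeout for key (?P<key>[^#\s]+)#012Component: (?P<component>\S+)$"
)


def parse_healthcheck_timeout(details: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the key, component and process name from SubCls:013 details, or None."""
    m = HEALTHCHECK_RE.match(details.strip())
    if not m:
        return None
    result: Dict[str, Optional[str]] = dict(m.groupdict())
    result["process"] = None
    for item in m.group("component").split(","):
        if item.startswith("safComp="):  # safComp names the process
            result["process"] = item[len("safComp="):]
    return result
```

Pass only the log-details portion of the entry (the text after "Sev:Critical, "), since the pattern is anchored at "Healthcheck timeout".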

A process is restarted too many times within the restart policy

The Connect Gateway has a default restart policy that tracks how many times a process restarts within a period of time. This policy cannot be modified. Under the default policy, if a process restarts 3 times within 20 seconds, that process may not be started again. When this happens, the log entry includes these fields:

  • LOGCLASS: AMFAGENT
  • SubCls: 999
  • Log-details

The <log details> portion of the generic log entry has this format:

Log details format

CORE : Component <component> has restarted <number> times within component probation period of <period> ms, restart SU

Field descriptions:

  • Component: A string containing internal information about the "component" (or process) that died. The "safComp" value is the name of the process. Example:

      safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

  • Number: The number of restarts. Under the default restart policy, which cannot be modified, a process can restart 3 times.
  • Period: The restart period, in milliseconds. Under the default restart policy, which cannot be modified, this is 20 seconds (20000 ms).
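The restart policy can be pictured as a sliding window over restart times. The Python sketch below only illustrates the stated numbers (3 restarts within 20000 ms); it is not AMF's actual implementation. `RestartTracker` is a hypothetical class, it assumes restart times arrive in non-decreasing order, and treating a restart exactly 20000 ms after the oldest one as still "within" the period is an assumption.

```python
from collections import deque
from typing import Deque

# Values stated in this article; the real policy cannot be changed on the gateway.
MAX_RESTARTS = 3
PROBATION_MS = 20_000


class RestartTracker:
    """Illustrative model of the restart policy, not AMF's implementation.

    Keeps the restart times (in ms) that fall inside a sliding window and
    reports when max_restarts of them occur within window_ms.
    """

    def __init__(self, max_restarts: int = MAX_RESTARTS, window_ms: int = PROBATION_MS):
        self.max_restarts = max_restarts
        self.window_ms = window_ms
        self.restarts: Deque[int] = deque()

    def record_restart(self, t_ms: int) -> bool:
        """Record a restart at t_ms (non-decreasing); return True once the limit is reached."""
        self.restarts.append(t_ms)
        # Assumption: a restart exactly window_ms earlier still counts as "within".
        while self.restarts and t_ms - self.restarts[0] > self.window_ms:
            self.restarts.popleft()
        return len(self.restarts) >= self.max_restarts


if __name__ == "__main__":
    tracker = RestartTracker()
    limit_hit = False
    for t in (0, 8_000, 15_000):  # three restarts within 15 seconds
        limit_hit = tracker.record_restart(t)
    print("limit reached:", limit_hit)
```

A deque keeps the window update cheap: old restart times are dropped from the left as new ones are appended on the right.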


Example:

Nov  4 15:26:29 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:999, EID:          0, Type:  Config, Sev:Major, CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway has restarted 3 times within component probation period of 20000 ms, restart SU
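The restart-policy message can be parsed the same way. This minimal Python sketch assumes the details text matches the format above and that the component string contains no spaces; `RESTART_RE` and `parse_restart_limit` are hypothetical names. It also converts the period from milliseconds to seconds, so 20000 ms becomes 20.0.

```python
import re
from typing import Optional

# Illustrative pattern for the SubCls:999 log-details format.
RESTART_RE = re.compile(
    r"CORE : Component (?P<component>\S+) has restarted (?P<number>\d+) times "
    r"within component probation period of (?P<period_ms>\d+) ms"
)


def parse_restart_limit(details: str) -> Optional[dict]:
    """Extract the component, restart count and probation period from a SubCls:999 entry."""
    m = RESTART_RE.search(details)  # search() also works on the full log line
    if not m:
        return None
    component = m.group("component")
    name = None
    for item in component.split(","):
        key, _, value = item.partition("=")
        if key == "safComp":  # safComp holds the process name
            name = value
    return {
        "component": component,
        "process": name,
        "restarts": int(m.group("number")),
        "period_s": int(m.group("period_ms")) / 1000,  # 20000 ms -> 20.0 s
    }
```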