Notes - an exchange on a mobile operator's SRE operations and maintenance system

This article was last updated on: February 7, 2024


An approach to adopting SRE that may better fit the realities of domestic state-owned enterprises.

Pain points

  • Traditional siloed IT architecture (closed, isolated, non-standard, hard to operate and maintain)
  • x86 server hardware is not stable enough
  • Open source software is unreliable and hard to control
  • When failures occur, reactive firefighting cannot keep up


These pain points created the need for transformation and upgrading:

  1. Transformation toward intelligent operations (SRE).

SRE operations and maintenance model

Core responsibilities


  1. Business continuity
  2. Application continuity
  3. Platform continuity

Division of duties

  1. Integrated O&M post
    1. 24/7 online or remote duty
    2. Business monitoring
    3. Business O&M operations
    4. Troubleshooting
    5. Emergency response
  2. O&M professional group (evolved from the infrastructure positions: host, storage, network, middleware, and database)
    1. System architecture mapping and optimization
    2. Review of newly built systems
    3. Failure drills
    4. Introduction of new technologies
    5. Empower the integrated O&M post with specialist knowledge and experience, for example by providing database automation scripts and standardizing the database switchover drill process
  3. O&M development
    1. Develop O&M tools and O&M systems for the integrated O&M post
    2. Collect and analyze the automation and monitoring requirements of the O&M professional group
    3. Develop and continuously iterate on the DevOps, automated O&M, intelligent monitoring, container platform, and other systems

Integrated O&M post - the full-stack O&M entry point


  • Comprehensive O&M
  • Tool application
  • Unified entry point
  • Talent development

👨‍💻 Personnel requirements:

Science and engineering background;

Fresh graduates or interns

Typical process

Event tracking and fault handling;

Hand manual processing over to the O&M professional group;

Escalate failures to the duty manager.
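
The typical process above can be read as a simple routing rule. The following is a minimal Python sketch under stated assumptions, not the system described in the exchange: the `Event` fields `has_tooling` and `is_failure`, the `Handler` names, and the `route` function are all hypothetical and exist only to illustrate the flow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Handler(Enum):
    INTEGRATED_POST = "integrated O&M post"
    PROFESSIONAL_GROUP = "O&M professional group"
    DUTY_MANAGER = "duty manager"


@dataclass
class Event:
    summary: str
    has_tooling: bool  # assumption: an existing tool or script covers this event
    is_failure: bool   # assumption: the event has been confirmed as a failure


def route(event: Event) -> List[Handler]:
    """Return the parties involved in one event, in order of involvement."""
    # The integrated O&M post always tracks the event first.
    handlers = [Handler.INTEGRATED_POST]
    # Work that no tool covers is manual and goes to the professional group.
    if not event.has_tooling:
        handlers.append(Handler.PROFESSIONAL_GROUP)
    # Confirmed failures are escalated to the duty manager.
    if event.is_failure:
        handlers.append(Handler.DUTY_MANAGER)
    return handlers
```

For example, `route(Event("routine alert", has_tooling=True, is_failure=False))` involves only the integrated O&M post, while an uncovered, confirmed failure involves all three parties.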

O&M professional group

  • Technology selection - standardization and selection of new technologies, considering:
    • Industry ecosystem
    • Features
    • Development planning
    • Business characteristics
  • Architecture governance - achieve business continuity, high availability, and high reliability
  • Scenario refinement - upgrade, high-availability switchover, migration, release
  • Troubleshooting

Posted on
February 27, 2019