Notes - An exchange on a mobile operator's SRE operations and maintenance (O&M) system

Last updated: February 7, 2024

📝Notes:

An approach to adopting SRE that may better fit the realities of domestic state-owned enterprises.

Pain points

  • Traditional siloed IT architecture (closed, isolated, non-standard, hard to operate and maintain)
  • x86 server hardware is not stable enough
  • Open source software is seen as unreliable and hard to control
  • When failures occur, reactive firefighting cannot keep up

Transformation

These pain points have created a need for transformation and upgrading:

  1. Transformation toward intelligent operations (SRE).

SRE O&M model

Core responsibilities

Guarantee:

  1. Business continuity
  2. Application continuity
  3. Platform continuity

Division of duties

  1. Integrated O&M post
    1. 24/7 online or remote duty
    2. Business monitoring
    3. Business O&M operations
    4. Troubleshooting
    5. Emergency treatment
  2. O&M professional group (evolved from the infrastructure positions: host, storage, network, middleware, and database)
    1. System architecture review and optimization
    2. Review of newly built systems
    3. Failure drills
    4. Introduction of new technologies
    5. Use professional expertise and experience to empower the integrated O&M post, for example by providing database automation scripts and standardizing the database switchover drill process
  3. O&M development
    1. Develop O&M tools and O&M systems for the integrated O&M post
    2. Collect and analyze the automation and monitoring requirements of the O&M professional group
    3. Develop and continuously iterate on DevOps, automated O&M, intelligent monitoring, container platform, and other systems
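
One way to picture the "standardized database switchover drill" mentioned under the O&M professional group is as an ordered checklist that a tool runs step by step. The following is only a minimal Python sketch under that assumption; `run_drill`, the step names, and the stop-on-first-failure rule are all hypothetical and are not taken from the original talk.

```python
from typing import Callable, List, Tuple

# A drill step is a human-readable name plus an action that reports success.
# Both the type and the step names used below are hypothetical.
Step = Tuple[str, Callable[[], bool]]


def run_drill(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run the steps in order and stop at the first failing step.

    Returns the (name, succeeded) pairs for every step that was executed,
    so the integrated O&M post can see exactly where a drill stopped.
    """
    results: List[Tuple[str, bool]] = []
    for name, action in steps:
        ok = action()
        results.append((name, ok))
        if not ok:
            break  # later steps are not attempted after a failure
    return results


# Placeholder actions; a real drill would call the professional group's scripts.
example_steps: List[Step] = [
    ("check replication status", lambda: True),
    ("switch traffic to standby", lambda: False),
    ("verify application health", lambda: True),
]
drill_results = run_drill(example_steps)
```

With the placeholder actions above, the drill stops after the second step, so the third step is never attempted. Keeping the checklist as data is one possible way for the professional group to hand a repeatable procedure to the integrated O&M post.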

Integrated O&M post - full-stack O&M entrance

Essentials

  • Comprehensive O&M
  • Tool application
  • Unified entrance
  • Talent development

👨‍💻 Personnel requirements:

Science or engineering background;

Fresh graduates or interns

Typical process

Track events and handle faults;

Hand off work that requires manual processing to the professional group;

Escalate failures to the duty manager
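
The typical process above can be read as a simple routing rule. The following is a minimal Python sketch of that reading, not the presenter's actual tooling: the `Event` fields, the `Route` names, and the assumption that every event is first tracked at the integrated O&M post are all introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Route(Enum):
    INTEGRATED_POST = "track and handle at the integrated O&M post"
    PROFESSIONAL_GROUP = "hand off to the O&M professional group"
    DUTY_MANAGER = "escalate to the duty manager"


@dataclass
class Event:
    title: str
    is_failure: bool         # assumption: the event was confirmed as a failure
    needs_manual_work: bool  # assumption: no existing tool covers this event


def route_event(event: Event) -> List[Route]:
    """Return the parties involved in handling an event, in order."""
    # Assumption: the integrated O&M post is the unified entrance for all events.
    routes = [Route.INTEGRATED_POST]
    if event.needs_manual_work:
        routes.append(Route.PROFESSIONAL_GROUP)
    if event.is_failure:
        routes.append(Route.DUTY_MANAGER)
    return routes
```

For example, `route_event(Event("database switchover failed", is_failure=True, needs_manual_work=True))` involves all three parties, while a routine alert that the existing tools can handle stays at the integrated O&M post.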

O&M professional group

  • Technology selection - standardization and selection of new technologies, considering:
    • Industry ecology
    • Features
    • Development planning
    • Business characteristics
  • Architecture governance - Achieve business continuity, high availability, and high reliability
  • Scenario refinement - upgrade, high-availability switchover, migration, release
  • Troubleshooting

https://e-whisper.com/posts/28453/
Author: east4ming
Posted on: February 27, 2019