Notes: an exchange on a mobile operator's SRE operations and maintenance (O&M) system
This article was last updated on the morning of July 24, 2024
📝Notes:
An SRE rollout that may better match the actual situation of domestic state-owned enterprises.
Pain points
- Traditional siloed IT architecture (closed, isolated, non-standard, difficult to operate and maintain)
- x86 server hardware is not stable enough
- Open-source software is unreliable and hard to control
- When failures occur, passive firefighting cannot keep up
Transformation
These pain points have created the need for transformation and upgrading:
- Transformation toward intelligent operations (SRE).
SRE operations and maintenance (O&M) model
Core responsibilities
Guarantee:
- Business continuity
- Application continuity
- Platform continuity
Division of duties
- Integrated O&M post
  - 24/7 duty, online or remote
  - Business monitoring
  - Routine business O&M operations
  - Troubleshooting
  - Emergency response
- O&M professional group (evolved from the infrastructure positions: host, storage, network, middleware, database)
  - System architecture review and optimization
  - Review of newly built systems
  - Failure drills
  - Introduction of new technologies
  - Empowering the integrated O&M post with specialist expertise and experience, for example by providing database automation scripts and standardizing the database switchover drill process
- O&M development
  - Develop O&M tools and O&M systems for the integrated O&M post
  - Collect and analyze the automation and monitoring requirements of the O&M professional group
  - Develop and continuously iterate on DevOps, automated O&M, intelligent monitoring, container platform, and other systems
Integrated O&M post: the full-stack O&M entry point
Essentials
- Integrated O&M
- Tool application
- Unified entry point
- Talent development
👨‍💻 Personnel requirements:
Science or engineering background;
Fresh graduates or interns
Typical process
Event tracking and fault handling;
Hand manual processing over to the professional team;
Escalate failures to the duty manager
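The typical process above can be read as a simple routing rule. The following is only a minimal sketch under stated assumptions: the `Event` fields, the `Handler` names, and the conditions for handing over or escalating are all hypothetical, since the notes do not define an event schema or exact escalation criteria.

```python
from dataclasses import dataclass, field
from enum import Enum


class Handler(Enum):
    # Role names follow the notes; the enum itself is illustrative.
    INTEGRATED_POST = "integrated O&M post"
    PROFESSIONAL_GROUP = "O&M professional group"
    DUTY_MANAGER = "duty manager"


@dataclass
class Event:
    # All fields are assumptions made for this sketch.
    title: str
    needs_manual_work: bool = False  # cannot be handled with standard tools
    is_failure: bool = False         # confirmed failure rather than a routine event
    log: list = field(default_factory=list)


def route(event: Event) -> list:
    """Return the handlers involved in an event, in order."""
    # Step 1: the integrated O&M post tracks every event first.
    handlers = [Handler.INTEGRATED_POST]
    event.log.append(f"tracked: {event.title}")

    # Step 2: manual processing is handed over to the professional team.
    if event.needs_manual_work:
        handlers.append(Handler.PROFESSIONAL_GROUP)
        event.log.append("handed over to professional group")

    # Step 3: failures are escalated to the duty manager.
    if event.is_failure:
        handlers.append(Handler.DUTY_MANAGER)
        event.log.append("escalated to duty manager")

    return handlers


if __name__ == "__main__":
    event = Event("database primary unreachable",
                  needs_manual_work=True, is_failure=True)
    for handler in route(event):
        print(handler.value)
```

In this sketch the integrated O&M post stays in the chain for every event, which reflects its role as the unified entry point; a real system would likely add severity levels and time-based escalation, which the notes do not describe.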
O&M professional group
- Technology selection: standardization and selection of new technologies, considering
  - Industry ecosystem
  - Features
  - Development roadmap
  - Business characteristics
- Architecture governance: achieve business continuity, high availability, and high reliability
- Scenario refinement: upgrade, high-availability switchover, migration, release
- Troubleshooting
https://e-whisper.com/posts/28453/